An Overview of Data Warehousing and OLAP Technology
Usman Ahmad Urfi
Mphil CS Lahore Leads University 2nd Semester, Lahore
Semester, Lahore [email protected]
Data warehousing and on-line analytical processing (OLAP) are essential elements of decision support, which has increasingly become a focus of the database industry. Numerous business items and administrations are currently accessible, and the greater part of the central database administration framework merchants now have contributions in these zones. Choice help puts some fairly extraordinary prerequisites on database innovation contrasted with customary on-line exchange preparing applications. This paper gives an outline of information warehousing and OLAP advances, with an accentuation on their new prerequisites. We depict back end devices for extricating, cleaning and stacking information into an information distribution center; multidimensional information models run of the mill of OLAP; front end customer instruments for questioning and information examination; server augmentations for productive inquiry handling; and devices for metadata administration and for dealing with the stockroom. Notwithstanding reviewing the best in class, this paper additionally recognizes some encouraging examination issues, some of which are identified with issues that the database inquire about group has dealt with for quite a long time, yet others are just barely starting to be tended to.
Data warehousing is a collection of decision support technologies, aimed at enabling the knowledge worker (executive, manager, analyst) to make better and faster decisions. The past three years have seen explosive growth, both in the number of products and services offered, and in the adoption of these technologies by industry. According to the META Group, the data warehousing market, including hardware, database software, and tools, is projected to grow from $2 billion in 1995 to $8 billion in 1998. Data warehousing technologies have been successfully deployed in many industries: manufacturing (for order shipment and customer support), retail (for user profiling and inventory management), financial services (for claims analysis, risk analysis, credit card analysis, and fraud detection), transportation (for fleet management), telecommunications (for call analysis and fraud detection), utilities (for power usage analysis), and healthcare (for outcomes analysis). This paper presents a roadmap of data warehousing technologies, focusing on the special requirements that data warehouses place on database management systems (DBMSs).
A data warehouse is a “subject-oriented, integrated, time- varying, non-volatile collection of data that is used primarily in organizational decision making.”1 Typically, the data warehouse is maintained separately from the organization’s operational databases. There are many reasons for doing this. The data warehouse supports on-line analytical processing (OLAP), the functional and performance requirements of which are quite different from those of the on-line transaction processing (OLTP) applications traditionally supported by the operational databases.
OLTP applications typically automate clerical data processing tasks such as order entry and banking transactions that are the bread-and-butter day-to-day operations of an organization. These tasks are structured and repetitive, and consist of short, atomic, isolated transactions. The transactions require detailed, up-to-date data, and read or update a few (tens of) records accessed typically on their primary keys. Operational databases tend to be hundreds of megabytes to gigabytes in size. Consistency and recoverability of the database are critical, and maximizing transaction throughput is the key performance metric. Consequently, the database is designed to reflect the operational semantics of known applications, and, in particular, to minimize concurrency conflicts.
Data warehouses, in contrast, are targeted for decision support. Historical, summarized and consolidated data is more important than detailed, individual records. Since data warehouses contain consolidated data, perhaps from several operational databases, over potentially long periods of time, they tend to be orders of magnitude larger than operational databases; enterprise data warehouses are projected to be hundreds of gigabytes to terabytes in size. The workloads are query intensive with mostly ad hoc, complex queries that can access millions of records and perform a lot of scans, joins, and aggregates. Query throughput and response times are more important than transaction throughput.
To facilitate complex analyses and visualization, the data in a warehouse is typically modeled multidimensionally. For example, in a sales data warehouse, time of sale, sales district, salesperson, and product might be some of the dimensions of interest. Often, these dimensions are hierarchical; time of sale may be organized as a day-month-quarter-year hierarchy, product as a product-category-industry hierarchy. Typical
OLAP operations include rollup (increasing the level of aggregation) and drill-down (decreasing the level of aggregation or increasing detail) along one or more dimension hierarchies, slice_and_dice (selection and projection), and pivot (re-orienting the multidimensional view of data).
Given that operational databases are finely tuned to help known OLTP workloads, endeavoring to execute complex OLAP inquiries against the operational databases would bring about unsatisfactory execution. Besides, choice help requires information that may miss from the operational databases; for example, understanding patterns or making forecasts requires verifiable information, though operational databases store just current information. Choice help ordinarily requires solidifying information from numerous heterogeneous sources: these might incorporate outer sources, for example, securities exchange nourishes, notwithstanding a few operational databases. The distinctive sources may contain information of fluctuating quality, or utilize conflicting portrayals, codes and arrangements, which must be accommodated. At last, supporting the multidimensional information models and activities run of the mill of OLAP requires unique information association, get to strategies, and execution techniques, not by and large gave by business DBMSs focused to OLTP. It is for every one of these reasons that information distribution centers are executed independently from operational databases.
Data warehouses might be implemented on standard or extended relational DBMSs, called Relational OLAP (ROLAP) servers. These servers assume that data is stored in relational databases, and they support extensions to SQL and special access and implementation methods to efficiently
In Section 2, we describe a typical data warehousing architecture, and the process of designing and operating a data warehouse. In Sections 3-7, we review relevant technologies for loading and refreshing data in a data warehouse, warehouse servers, front end tools, and warehouse management tools. In each case, we point out what is different from traditional database technology, and we mention representative products. In this paper, we do not intend to provide comprehensive descriptions of all products in every category. We encourage the interested reader to look at recent issues of trade magazines such as Databased Advisor, Database Programming and Design, Datamation, and DBMS Magazine, and vendors’ Web sites for more details of commercial products, white papers, and case studies. The OLAP Council2 is a good source of information on standardization efforts across the industry, and a paper by Codd, et al.3 defines twelve rules for OLAP products. Finally, a good source of references on data warehousing and OLAP is the Data Warehousing Information Center4.
Research in data warehousing is fairly recent, and has focused primarily on query processing and view maintenance issues. There still are many open research problems. We conclude in Section 8 with a brief mention of these issues.
2. Architecture and End-to-End Process
Figure 1 shows a typical data warehousing architecture.
Monitoring & Admnistration
implement the multidimensional data model and operations. In contrast, multidimensional OLAP (MOLAP) servers are servers that directly store multidimensional data in special data structures (e.g., arrays) and implement the OLAP operations over these special data structures.