Who is the user of OLAP systems. OLAP reports

03.04.2019 Mistakes

OLAP (OnLine Analytical Processing) is not the name of a specific product but of an entire technology for online analytical processing, covering data analysis and reporting. The user is given a multidimensional table that automatically summarizes the data in various sections and makes it possible to quickly control both the calculations and the form of the report.

Although some publications call this kind of analytical processing both "online" and "interactive", the adjective "online" reflects the meaning of OLAP technology most accurately. The development of management decisions is among the areas least amenable to automation. Today, however, it is possible to assist the manager in developing decisions and, most importantly, to significantly speed up the process of developing, selecting, and adopting them.

Decision support systems usually have the means to provide the user with aggregate data for various samples from the initial set in a form convenient for perception and analysis. As a rule, such aggregates form a multidimensional data set, often called a hypercube or metacube, whose axes contain parameters and whose cells contain the aggregate data that depend on them. Such data can also be stored in relational tables, but here we are talking about the logical organization of the data, not about the physical implementation of its storage.

Along each axis, the data can be organized into a hierarchy representing different levels of detail.

The factors that affect the activities of the enterprise (for example, time, products, company branches) are plotted along the dimensions of the multidimensional model. The resulting OLAP cube is then filled with indicators of the enterprise's activity (prices, sales, plan, profits, cash flow, etc.). It should be noted that, unlike a geometric cube, the sides of an OLAP cube do not have to be the same size. The cube can be filled both with real data from operational systems and with values forecast from historical data. The dimensions of a hypercube can be complex and hierarchical, and relationships can be established between them. During the analysis, the user can change the point of view on the data (the so-called operation of changing the logical view), thereby viewing the data in different sections and solving specific problems. Various operations can be performed on cubes, including forecasting and what-if analysis.
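A minimal Python sketch of this idea, with invented axis members and figures: the cube is a dictionary keyed by coordinates, and changing the point of view amounts to summing over the axes that are left out. The names `cells`, `AXES`, and `roll_up` are illustrative assumptions and do not belong to any OLAP product.

```python
from collections import defaultdict

# Hypothetical cube: the axes are (year, product, branch); each cell holds sales.
AXES = ("year", "product", "branch")
cells = {
    (2018, "tea", "north"): 120.0,
    (2018, "coffee", "north"): 200.0,
    (2018, "tea", "south"): 80.0,
    (2019, "tea", "north"): 150.0,
    (2019, "coffee", "south"): 90.0,
}

def roll_up(cells, keep):
    """Sum the measure over every axis that is not listed in `keep`."""
    positions = [AXES.index(axis) for axis in keep]
    totals = defaultdict(float)
    for coords, value in cells.items():
        totals[tuple(coords[i] for i in positions)] += value
    return dict(totals)

by_year = roll_up(cells, ["year"])
by_product_branch = roll_up(cells, ["product", "branch"])
print(by_year)
print(by_product_branch)
```

Note that the axes have different lengths here (two years, two products, two branches would be a coincidence in real data), and that cells with no data are simply absent from the dictionary.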

Thanks to this data model, users can formulate complex queries, generate reports, and obtain subsets of data. Online analytical processing can significantly simplify and speed up the preparation and making of decisions by management personnel. It serves the purpose of turning data into information, and it is fundamentally different from the traditional decision support process, which is most often based on reviewing structured reports.

OLAP technology is a form of intelligent data analysis and rests on 12 principles:

1. Conceptual multidimensional representation. The user-analyst sees the world of the enterprise as multidimensional in nature, so the OLAP model must be multidimensional at its core.

2. Transparency. The architecture of the OLAP system should be open, allowing the user, wherever he is, to communicate using an analytical tool - the client - with the server.

3. Accessibility. The OLAP user-analyst must be able to perform analysis based on a common conceptual schema that contains enterprise-wide data in a relational database as well as data from legacy databases, on common access methods, and on a common analytical model. An OLAP system should access only the data that is actually needed and not take the common "kitchen sink" approach, which brings in unnecessary input.

4. Consistent performance in report development. With an increase in the number of dimensions or the size of the database, the analyst user should not experience a significant decrease in performance.

5. Client-server architecture. Most of the data that today needs to be subjected to online analytical processing is contained on mainframes with access to user workstations via LAN. This means that OLAP products must be able to work in a client-server environment.

6. Generic dimensionality. Every dimension must be equivalent in both its structure and its operational capabilities. Basic data structures, formulas, and reporting formats should not be biased towards any one dimension.

7. Dynamic management of sparse matrices. The physical design of an OLAP tool must be fully adaptable to the specific analytical model in order to manage sparse matrices optimally. Sparsity (measured as the percentage of empty cells among all possible cells) is one of the characteristics of data distribution.

8. Multi-user support. An OLAP tool must allow multiple analyst users to query and update data concurrently while maintaining integrity and security.

9. Unrestricted cross-dimensional operations. Because of their hierarchical nature, various operations can represent dependent relationships in the OLAP model, that is, they are cross-dimensional. Their execution should not require the analyst user to redefine these calculations and operations.

10. Intuitive data manipulation. The analyst user's view of the dimensions defined in the analytical model must contain all the information necessary to perform actions on the OLAP model, i.e. these actions should not require a menu system or other multi-step user interface operations.

11. Flexible reporting. Reporting tools should present synthesized data, or information derived from the data model, in any possible orientation. This means that the rows, columns, or pages of a report must be able to display several dimensions of an OLAP model at the same time, showing any subset of the elements (values) contained in a dimension, in any order.

12. Unlimited dimensions and aggregation levels. A study of the number of dimensions that an analytical model may require showed that the user-analyst can use up to 19 dimensions simultaneously. This leads to a recommendation about the number of dimensions an OLAP system should support. Moreover, no dimension should be limited in the number of aggregation levels that the user-analyst can define.

Specialized OLAP systems currently offered on the market include CalliGraph and Business Intelligence.

For simple data analysis problems, a budget solution can be used: the Microsoft office applications Excel and Access, which contain elementary OLAP facilities that let you create pivot tables and build various reports based on them.

The information systems of a serious enterprise, as a rule, contain applications designed for complex analysis of data, their dynamics, trends, etc. Accordingly, top management becomes the main consumer of the analysis results. Such analysis is ultimately intended to facilitate decision making, and making any management decision requires the relevant information, usually quantitative. This data has to be collected from all information systems of the enterprise, brought to a common format, and only then analyzed. Data warehouses are created for this purpose.

What is a data warehouse?

Usually it is the place where all information of analytical value is collected. The requirements for such warehouses follow the classic definition of OLAP and will be explained below.

Sometimes the warehouse has another purpose: integrating all enterprise data in order to maintain the integrity and relevance of information across all information systems. In that case the warehouse accumulates not only analytical information but almost all information, and can serve it back to other systems in the form of reference directories.

A typical data warehouse usually differs from a typical relational database. First, conventional databases are designed to help users do their daily work, while data warehouses are designed to support decision making. For example, selling a product and issuing an invoice are done using a database designed to process transactions, while analyzing the dynamics of sales over several years, which makes it possible to plan work with suppliers, is done using a data warehouse.

Secondly, regular databases are subject to constant changes in the course of work of users, and the data warehouse is relatively stable: the data in it is usually updated according to a schedule (for example, weekly, daily, or hourly, depending on the needs). Ideally, the replenishment process is simply adding new data over a period of time without changing the old information already in storage.

And, thirdly, conventional databases are most often the source of data that enters the repository. In addition, the storage can be replenished by external sources such as statistical reports.

How is storage built?

ETL is the basic concept. It has three stages:

Extraction - extracting data from external sources in an understandable format;
Transformation - transforming the source data structure into structures that are convenient for building an analytical system;
Loading - loading the transformed data into the warehouse.

Let's add one more stage - data cleansing: the process of screening out irrelevant data or correcting erroneous data using statistical or expert methods, so that reports like "Sales for 20011" are not generated later.
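The stages can be sketched in Python as a chain of small functions. This is only an illustration under stated assumptions: the CSV text, the column names, and the plausible-year range are invented, and the cleansing rule shown (dropping rows with an impossible year such as 20011) is just one simple example of an expert rule.

```python
import csv
import io

# Invented source data; the second row carries a mistyped year.
RAW = """order_date,country,amount
2009-03-14,Germany,120.50
20011-01-02,France,80.00
2010-07-30,germany,99.50
"""

def extract(text):
    """Extraction: read source rows into dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def clean(rows, min_year=1990, max_year=2030):
    """Cleansing: keep only rows whose year lies in a plausible range."""
    kept = []
    for row in rows:
        year_text = row["order_date"].split("-")[0]
        if year_text.isdigit() and min_year <= int(year_text) <= max_year:
            kept.append(row)
    return kept

def transform(rows):
    """Transformation: reshape rows into (year, country, amount) facts."""
    return [
        (int(r["order_date"][:4]), r["country"].title(), float(r["amount"]))
        for r in rows
    ]

def load(facts, warehouse):
    """Loading: append new facts without changing what is already stored."""
    warehouse.extend(facts)
    return warehouse

warehouse = load(transform(clean(extract(RAW))), [])
print(warehouse)
```

A real pipeline would more likely correct or quarantine the bad row than silently drop it; dropping is used here only to keep the sketch short.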

Let's return to the analysis.

What is analysis and why is it needed?

Analysis is the study of data in order to make decisions. Analytical systems are accordingly called decision support systems (DSS).

Here it is worth pointing out the difference between working with a DSS and working with a simple set of regulated and non-regulated reports. Analysis in a DSS is almost always interactive and iterative. That is, the analyst digs into the data, composing and correcting analytical queries, and receives reports whose structure may not be known in advance. We will return to this in more detail below when we discuss the MDX query language.

OLAP

Decision support systems usually have the means to provide the user with aggregate data for various samples from the initial set in a form convenient for perception and analysis (tables, diagrams, etc.). The traditional approach to segmenting source data is to extract from it one or more multidimensional data sets (often called hypercubes or metacubes), whose axes contain attributes and whose cells contain aggregated quantitative data. (Such data can also be stored in relational tables, but here we are talking about the logical organization of the data, not about the physical implementation of its storage.) Along each axis, attributes can be organized as hierarchies representing different levels of detail. Thanks to this data model, users can formulate complex queries, generate reports, and obtain subsets of data.

The technology of complex multidimensional data analysis is called OLAP (On-Line Analytical Processing). OLAP is a key component of traditional data warehousing. The concept of OLAP was described in 1993 by Edgar Codd, a renowned database researcher and the author of the relational data model. In 1995, based on the requirements outlined by Codd, the so-called FASMI test (Fast Analysis of Shared Multidimensional Information) was formulated, which includes the following requirements for multidimensional analysis applications:

providing the user with the results of the analysis in an acceptable time (usually no more than 5 s), even at the cost of a less detailed analysis;
the ability to carry out any logical and statistical analysis characteristic of the given application and to save it in a form accessible to the end user;
multi-user access to data with support for appropriate locking mechanisms and authorized access tools;
multidimensional conceptual representation of data, including full support for hierarchies and multiple hierarchies (this is a key OLAP requirement);
the ability to access any necessary information, regardless of its volume and storage location.

It should be noted that OLAP functionality can be implemented in different ways, from the simplest data analysis tools in office applications to distributed analytical systems based on server products. That is, OLAP is not a technology but an ideology.

Before talking about the various implementations of OLAP, let's take a closer look at what cubes are from a logical point of view.

Multidimensional concepts

To illustrate OLAP principles, we will use the Northwind database included with Microsoft SQL Server. It is a typical database that stores information about the trading operations of a company engaged in wholesale food supplies. Such data includes information about suppliers, customers, a list of supplied goods and their categories, data on orders and ordered goods, and a list of company employees.

Cube

Let's take for example the table Invoices1, which contains the company's orders. The fields in this table will be as follows:

Order date
Country
City
Customer name
Delivery company
Product Name
Quantity of goods
Order price

What aggregate data can we get based on this view? Usually these are answers to questions like:

What is the total cost of orders placed by customers from a particular country?
What is the total cost of orders placed by customers from a certain country and delivered by a certain company?
What is the total value of orders placed by customers from a particular country in a given year and delivered by a particular company?

All this data can be obtained from this table with quite obvious SQL queries with grouping.
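For illustration, here is a sketch of such grouping queries run from Python with the standard `sqlite3` module. The table name `invoices`, its column names, the carrier names, and all sample rows are assumptions made for this example; they are not the actual Northwind data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE invoices (order_date TEXT, country TEXT, shipper TEXT, price REAL)"
)
conn.executemany(
    "INSERT INTO invoices VALUES (?, ?, ?, ?)",
    [
        ("2009-01-15", "Germany", "Carrier A", 100.0),
        ("2009-06-02", "Germany", "Carrier B", 50.0),
        ("2010-03-20", "France", "Carrier A", 70.0),
        ("2010-09-09", "Germany", "Carrier A", 30.0),
    ],
)

# Total cost of orders per country.
by_country = conn.execute(
    "SELECT country, SUM(price) FROM invoices GROUP BY country ORDER BY country"
).fetchall()

# Total cost of orders per country and carrier.
by_country_shipper = conn.execute(
    "SELECT country, shipper, SUM(price) FROM invoices "
    "GROUP BY country, shipper ORDER BY country, shipper"
).fetchall()

# Total cost of orders per country, year, and carrier.
by_country_year_shipper = conn.execute(
    "SELECT country, strftime('%Y', order_date) AS year, shipper, SUM(price) "
    "FROM invoices GROUP BY country, year, shipper ORDER BY country, year, shipper"
).fetchall()

print(by_country)
print(by_country_shipper)
print(by_country_year_shipper)
```

Each extra attribute in the GROUP BY clause corresponds to one more question from the list above.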

The result of such a query will always be a column of numbers and a list of attribute values that describe it (for example, countries) - a one-dimensional data set or, in mathematical terms, a vector.

Imagine that we need the total cost of orders from all countries and their distribution by carrier companies. We will then get a table (matrix) of numbers, where the column headers list the carriers, the row headers list the countries, and the cells contain the order totals. This is a two-dimensional data array. Such a data set is called a pivot table or a crosstab.
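A small Python sketch of building such a crosstab by hand, assuming invented order rows of the form (country, carrier, price). Countries become row headers, carriers become column headers, and combinations with no orders are shown as zero.

```python
from collections import defaultdict

# Invented order rows: (country, carrier, order price).
orders = [
    ("Germany", "Carrier A", 100.0),
    ("Germany", "Carrier B", 50.0),
    ("France", "Carrier A", 70.0),
    ("Germany", "Carrier A", 30.0),
]

# Nested dictionary: table[country][carrier] accumulates the order totals.
table = defaultdict(lambda: defaultdict(float))
for country, carrier, price in orders:
    table[country][carrier] += price

carriers = sorted({carrier for _, carrier, _ in orders})
countries = sorted(table)

# Render the matrix: one row per country, one column per carrier.
matrix = [[table[c].get(s, 0.0) for s in carriers] for c in countries]
print("country", carriers)
for country, row in zip(countries, matrix):
    print(country, row)
```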

If we want to get the same data but broken down by year, one more dimension is added, i.e. the data set becomes three-dimensional (a third-order tensor, loosely speaking, or a three-dimensional "cube").

Obviously, the maximum number of dimensions is the number of all attributes (Date, Country, Customer, etc.) that describe our aggregated data (amount of orders, quantity of goods, etc.).

So we come to the concept of multidimensionality and its embodiment, the multidimensional cube. The underlying table will be called the "fact table". The dimensions, or axes, of the cube are attributes whose coordinates are the individual values of those attributes present in the fact table. For example, if order information was kept in the system from 2003 to 2010, the years axis will consist of 8 corresponding points. If orders come from three countries, the country axis will contain 3 points, and so on, regardless of how many countries are listed in the directory of countries. The points on an axis are called its "members".

The aggregated data itself will in this case be called "measures". To avoid confusion with "dimensions", it is preferable to refer to the latter as "axes". The set of measures forms one more axis, "Measures". It has as many members (points) as there are measures (aggregated columns) in the fact table.
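As a rough illustration, the members of each axis can be derived directly from the fact table, with the measures forming one more axis. The rows and column names below are invented for the example.

```python
# Invented fact table: two attribute columns and two measure columns.
facts = [
    {"year": 2009, "country": "Germany", "price": 100.0, "quantity": 4},
    {"year": 2009, "country": "France", "price": 50.0, "quantity": 1},
    {"year": 2010, "country": "Germany", "price": 30.0, "quantity": 2},
]
DIMENSIONS = ("year", "country")
MEASURES = ("price", "quantity")

# Members of an axis are the distinct attribute values present in the facts,
# not every value that a reference directory might list.
members = {axis: sorted({row[axis] for row in facts}) for axis in DIMENSIONS}

# The measures form one more axis with one member per aggregated column.
members["Measures"] = list(MEASURES)
print(members)
```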

Members of dimensions or axes can be grouped into one or more hierarchies. Let us explain what a hierarchy is with an example: cities from orders can be combined into districts, districts into regions, regions into countries, and countries into continents or other entities. That is, there is a hierarchical structure, continent - country - region - district - city, with 5 levels. For a district, the data is aggregated over all the cities it includes; for a region, over all its districts with all their cities, and so on. Why do we need multiple hierarchies? For example, on the order date axis we might want to group points (i.e. days) in a Year-Month-Day hierarchy or in a Year-Week-Day hierarchy: three levels in both cases. Obviously, Week and Month group days differently. There are also hierarchies in which the number of levels is not fixed and depends on the data, for example, folders on a computer disk.

Data can be aggregated using several standard functions: sum, min, max, average, count.
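The difference between the two date hierarchies can be seen in a short Python sketch. The daily figures are invented, and ISO week numbering is assumed for the Week level (the article does not say which week convention is meant).

```python
from collections import defaultdict
from datetime import date

# Invented daily order totals.
daily = {
    date(2010, 3, 30): 10.0,
    date(2010, 4, 2): 5.0,
    date(2010, 4, 6): 7.0,
}

def month_level(day):
    """Parent of a day in the Year-Month-Day hierarchy."""
    return (day.year, day.month)

def week_level(day):
    """Parent of a day in the Year-Week-Day hierarchy (ISO weeks assumed)."""
    iso = day.isocalendar()
    return (iso[0], iso[1])

def roll_up(daily, level):
    """Aggregate daily values to the given parent level using sum."""
    totals = defaultdict(float)
    for day, value in daily.items():
        totals[level(day)] += value
    return dict(totals)

by_month = roll_up(daily, month_level)
by_week = roll_up(daily, week_level)
print(by_month)
print(by_week)
```

The same three days fall into different groups: 30 March and 2 April belong to different months but to the same week.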

MDX

Let's move on to the query language for multidimensional data.
The SQL language was originally designed not for programmers but for analysts (which is why its syntax resembles natural language). Over time it became more and more complicated, and now few analysts know how to use it well, if at all. It has become a tool for programmers. The MDX query language, rumored to have been developed by our former compatriot Mosha Pasumansky in the depths of Microsoft Corporation, was also originally supposed to be aimed at analysts, but its concepts and syntax (which vaguely resembles SQL, quite needlessly, because this only confuses people) are even more complicated than those of SQL. Nevertheless, its basics are still easy to understand.

We will consider it in detail, first, because it is the only language that has received the status of a standard within the general XMLA protocol standard, and second, because there is an open-source implementation of it in the form of the Mondrian project from the company Pentaho. Other OLAP analysis systems (for example, Oracle OLAP Option) usually use their own SQL syntax extensions, although they also declare support for MDX.

Working with analytical data arrays implies only reading them and does not imply writing. Thus the MDX language has no clauses for changing data; there is only one clause, for selection: select.

In OLAP, you can take slices of multidimensional cubes, i.e. filter the data along one or more axes, or projections, where the cube "collapses" along one or several axes, aggregating the data. For example, our first example with the sum of orders by country is a projection of the cube onto the Country axis. The MDX query for this case would look like this:

Select ...Children on rows from
What is what here?

Select- the keyword is included in the syntax solely for beauty.
is the name of the axis. All proper names in MDX are written in square brackets.
is the name of the hierarchy. In our case, this is the Country-City hierarchy.
is the name of the axis member at the first level of the hierarchy (i.e. country) All is a meta member that combines all members of the axis. There is such a meta-member in every axis. For example, in the years axis there is “All years”, etc.
Children is a member function. Each member has several functions available, such as Parent, Level, and Hierarchy, which return respectively the parent, the level in the hierarchy, and the hierarchy itself to which the member belongs. Children returns the set of child members of this member, i.e. in our case, the countries.
on rows - specifies how to arrange this data in the summary table; in this case, in the row headers. Other possible values here are on columns, on pages, on chapters, etc. Axes can also be specified simply by index, starting from 0.
from is an indication of the cube from which the selection is made.

What if we don't need all countries, but only a couple of specific ones? To do this, you can explicitly indicate in the request those countries that we need, and not select all with the Children function.

Select { ..., ... } on rows from
The curly braces in this case declare a set. A set is a list, an enumeration of members from one axis.

Now let's write a query for our second example - the output in the context of the deliverer:

Select ...Children on rows .Members on columns from
Added here:
- axis;
.Members is an axis function that returns all members on the axis. The same function is available for a hierarchy and for a level. Because this axis has only one hierarchy, the hierarchy name can be omitted, and because the level and the hierarchy also coincide here, all members can be output as a single list.

I think it's already obvious how we could continue this to our third example, with a breakdown by year. But rather than breaking down by year, let's filter instead, i.e. build a slice. To do this, we write the following query:

Select ..Children on rows .Members on columns from where (.)
Where is the filtering?

where- keyword
is one member of the hierarchy . The full name, including all terms, would be: .. , but because the name of this member is unique within the axis, then all intermediate name qualifiers can be omitted.

Why is the date member in parentheses? Parentheses denote a tuple. A tuple is one or more coordinates along different axes. For example, to filter along two axes at once, we list two members from different dimensions in the parentheses, separated by a comma. That is, the tuple defines a "slice" of the cube (or a "filter", if that terminology is closer to you).

The tuple is used for more than just filtering. Tuples can also be in row/column/page headers, etc.

This is necessary, for example, in order to display the result of a three-dimensional query in a two-dimensional table.

Select crossjoin(...Children, ..Children) on rows .Members on columns from where (.)
Crossjoin is a function. It returns a set of tuples (yes, a set can contain tuples!) that results from the Cartesian product of two sets. That is, the result set will contain all possible combinations of Countries and Years. The row headers will thus contain a pair of values: Country-Year.
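The effect of a cross join can be imitated in Python with `itertools.product`. This is only an analogy under invented data, not MDX itself: the facts, the member lists, and the helper `cell` are assumptions made for the sketch.

```python
from itertools import product

# Invented facts: (country, year, carrier, order price).
facts = [
    ("Germany", 2009, "Carrier A", 100.0),
    ("Germany", 2009, "Carrier B", 50.0),
    ("France", 2010, "Carrier A", 70.0),
    ("Germany", 2010, "Carrier A", 30.0),
]
countries = ["France", "Germany"]
years = [2009, 2010]
carriers = ["Carrier A", "Carrier B"]

# Cartesian product of two sets: every (country, year) tuple is a row header.
row_headers = list(product(countries, years))

def cell(country, year, carrier):
    """Sum the measure over the facts that match all three coordinates."""
    return sum(p for c, y, s, p in facts if (c, y, s) == (country, year, carrier))

# Three dimensions laid out in a two-dimensional table:
# rows are (country, year) tuples, columns are carriers.
table = {row: [cell(*row, carrier) for carrier in carriers] for row in row_headers}
for row, values in table.items():
    print(row, values)
```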

The question is, where do we indicate which numerical characteristics should be displayed? In this case, the default measure specified for the cube is used, i.e. Order price. If we want to display another measure, we recall that measures are members of the Measures dimension and act in the same way as with the other axes. That is, filtering the query by one of the measures will display exactly that measure in the cells.

Question: how does filtering in where differ from filtering by listing axis members in on rows? Answer: in practically nothing. It is just that where specifies a slice for those axes that do not take part in forming the headers. That is, the same axis cannot be present both in on rows and in where at the same time.

Computed Members

For more complex queries, you can declare calculated members, both on an attribute axis and on the measures axis. For example, you can declare a new measure that shows the contribution of each country to the total amount of orders:

With member. as ‘.CurrentMember / ..’, FORMAT_STRING=‘0.00%’ select ...Children on rows from where .
The calculation takes place in the context of a cell for which all coordinate attributes are known. The corresponding coordinates (members) can be obtained with the CurrentMember function for each of the cube axes. It must be understood here that the expression .CurrentMember / ..' does not divide one member by another but divides the corresponding aggregated data of the cube slices! That is, the slice for the current territory is divided by the slice for all territories, i.e. by the total value of all orders. FORMAT_STRING sets the output format for the values, i.e. as a percentage.
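The same calculation can be written out in Python to show that it is the aggregated values that are divided, not the members themselves. The per-country totals are invented, and the `.2%` format specifier plays the role that FORMAT_STRING plays in the query.

```python
# Invented aggregated slices: total order amount per country.
totals = {"France": 70.0, "Germany": 180.0}

# The slice for "all territories" is the sum over every country.
all_total = sum(totals.values())

# Calculated measure: share of each country in the total amount of orders.
share = {country: value / all_total for country, value in totals.items()}

# Format the result as a percentage with two decimal places.
formatted = {country: f"{value:.2%}" for country, value in share.items()}
print(formatted)
```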

Another example of a calculated member, but already on the years axis:

With member. as'. - .'
Obviously, the report will show not the arithmetic difference of the two member names but the difference of the corresponding slices, i.e. the difference in the amount of orders between these two years.

Display in ROLAP

OLAP systems are, one way or another, built on top of some data storage and organization system. When that system is an RDBMS, we speak of ROLAP (we will leave MOLAP and HOLAP for self-study). ROLAP is OLAP on a relational database, i.e. on data described in the form of conventional two-dimensional tables. ROLAP systems convert MDX queries to SQL. The main computational problem for the database is fast aggregation. To aggregate faster, the data in the database is usually highly denormalized, i.e. it is not stored very efficiently in terms of disk space and database integrity control. In addition, the database contains auxiliary tables that store partially aggregated data. Therefore, a separate database schema is usually created for OLAP, one that only partially repeats the structure of the original transactional databases in terms of directories.
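A toy Python sketch of both ideas, using the standard `sqlite3` module: a projection request is turned into a GROUP BY query, and an auxiliary table with partially aggregated data answers the same question. The function `projection_sql`, the table names `fact_orders` and `agg_year_country`, and the rows are all hypothetical; real ROLAP engines such as Mondrian generate far more elaborate SQL.

```python
import sqlite3

def projection_sql(table, axes, measure, where=None):
    """Build a GROUP BY query for the requested axes and an optional slice."""
    where = where or {}
    cols = ", ".join(axes)
    sql = f"SELECT {cols}, SUM({measure}) FROM {table}"
    if where:
        sql += " WHERE " + " AND ".join(f"{col} = ?" for col in where)
    sql += f" GROUP BY {cols} ORDER BY {cols}"
    return sql, list(where.values())

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE fact_orders (year INTEGER, country TEXT, shipper TEXT, price REAL)"
)
conn.executemany(
    "INSERT INTO fact_orders VALUES (?, ?, ?, ?)",
    [
        (2009, "Germany", "Carrier A", 100.0),
        (2009, "Germany", "Carrier B", 50.0),
        (2010, "France", "Carrier A", 70.0),
        (2010, "Germany", "Carrier A", 30.0),
    ],
)

# Auxiliary table with partially aggregated data (carrier already summed out).
conn.execute(
    "CREATE TABLE agg_year_country AS "
    "SELECT year, country, SUM(price) AS price FROM fact_orders GROUP BY year, country"
)

# Projection onto the country axis, sliced by the year 2009.
sql, params = projection_sql("fact_orders", ["country"], "price", {"year": 2009})
from_facts = conn.execute(sql, params).fetchall()

agg_sql, agg_params = projection_sql("agg_year_country", ["country"], "price", {"year": 2009})
from_aggregate = conn.execute(agg_sql, agg_params).fetchall()

print(sql)
print(from_facts, from_aggregate)
```

The identifiers are interpolated into the SQL string only because this is a sketch with fixed names; values still go through query parameters.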

Navigation

Many OLAP systems offer tools for interactive navigation through an already formed query (and, accordingly, through the selected data). The so-called "drill" operation is used for this. Russian-language texts render the term in several ways, most literally as "drilling"; "deepening" would be a more adequate translation, but this is a matter of taste, and in some circles "drilling" has stuck.

Drill - refining a report by reducing the degree of data aggregation, combined with filtering along some other axis (or several axes). There are several types of drilling:

drill-down - filtering by one of the initial axes of the report and displaying detailed information for the descendants, within the hierarchy, of the member selected as the filter. For example, if there is a report on the distribution of orders by Countries and Years, clicking on the year 2007 displays a report broken down by the same Countries and the months of 2007.
drill-aside - filtering by one or more selected axes and removing aggregation along one or more other axes. For example, if there is a report on the distribution of orders by Countries and Years, clicking on 2007 displays another report broken down by, for example, Countries and Suppliers, filtered by 2007.
drill-through - removing aggregation on all axes and filtering on all of them at once, which lets you see the original data from the fact table from which a value in the report was obtained. That is, when you click on a cell value, a report is displayed with all the orders that added up to that amount. A kind of instant drilling into the very "bowels" of the cube.
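Drill-down and drill-through can be sketched in Python over a small in-memory fact list. The rows and the function names `report`, `drill_down`, and `drill_through` are invented for this illustration and do not correspond to any particular OLAP tool.

```python
from collections import defaultdict
from datetime import date

# Invented fact rows: (order date, country, order price).
facts = [
    (date(2007, 1, 10), "Germany", 40.0),
    (date(2007, 1, 25), "Germany", 10.0),
    (date(2007, 3, 5), "France", 25.0),
    (date(2008, 2, 1), "Germany", 60.0),
]

def report(rows, key):
    """Aggregate the order price by the coordinates that `key` extracts."""
    totals = defaultdict(float)
    for row in rows:
        totals[key(row)] += row[2]
    return dict(totals)

# Initial report: orders by country and year.
by_country_year = report(facts, lambda r: (r[1], r[0].year))

def drill_down(rows, year):
    """Filter by the chosen year and go one level deeper: country and month."""
    selected = [r for r in rows if r[0].year == year]
    return report(selected, lambda r: (r[1], r[0].month))

def drill_through(rows, country, year):
    """Return the raw fact rows behind one cell of the initial report."""
    return [r for r in rows if r[1] == country and r[0].year == year]

print(by_country_year)
print(drill_down(facts, 2007))
print(drill_through(facts, "Germany", 2007))
```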

That's all. Now, if you decide to devote yourself to Business Intelligence and OLAP, it's time to start reading serious literature.


After the data has been received, cleaned, brought to a single form, and placed in the warehouse, it needs to be analyzed. OLAP technology is used for this.

The twelve defining principles of OLAP were formulated in 1993 by E.F. Codd, the "inventor" of relational databases. OLAP stands for OnLine Analytical Processing, that is, online data analysis. Later, Codd's definition was reworked into the so-called FASMI test (Fast Analysis of Shared Multidimensional Information), which requires an OLAP application to provide fast analysis of shared multidimensional information: high speed; analysis; shared access; multidimensionality; work with information.

High speed. The analysis should be carried out equally quickly on all aspects of the information. In this case, the permissible response time is no more than 5 seconds.

Analysis. It should be possible to perform basic types of numerical and statistical analysis - either predefined by the application developer or arbitrarily defined by the user.

Shared access. Access to data should be multi-user, while access to confidential information should be controlled.

Multidimensionality. The main, most essential characteristic of OLAP.

Working with information. The application must be able to access any necessary information, regardless of its volume and storage location.

Multidimensional representation. OLAP provides organizations with the most convenient and fastest means of accessing, viewing, and analyzing business information. Most importantly, OLAP gives the user a natural, intuitive data model by organizing the data into multidimensional cubes. The axes (dimensions) of the multidimensional coordinate system are the main attributes of the business process being analyzed. For a sales process, for example, these could be product category, region, and customer type. Time is almost always used as one of the dimensions. Inside the cube are data that characterize the process quantitatively, the so-called measures. These can be sales volumes in units or in monetary terms, stock balances, costs, etc. The user analyzing the information can "cut" the cube in different directions, obtain summary data (for example, by year) or, conversely, detailed data (by week), and perform other operations needed for the analysis.

Storage of OLAP data. First of all, it must be said that, since the analyst always works with summary (not detailed) data, OLAP databases almost always store, along with the detailed data, so-called aggregates, that is, pre-calculated summary indicators. Examples of aggregates are total sales for a year or the average inventory balance. Storing precomputed aggregates is the primary way to speed up OLAP queries.

However, building aggregates can lead to a significant increase in the size of the database.
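A rough Python illustration of this growth, assuming a tiny invented fact table with three axes: precomputing the totals for every combination of axes yields eight aggregate tables, and together they hold several times more cells than there are fact rows.

```python
from collections import defaultdict
from itertools import combinations

# Invented facts: (year, country, carrier, order price).
AXES = ("year", "country", "shipper")
facts = [
    (2009, "Germany", "Carrier A", 100.0),
    (2009, "Germany", "Carrier B", 50.0),
    (2010, "France", "Carrier A", 70.0),
    (2010, "Germany", "Carrier A", 30.0),
]

def build_aggregates(facts):
    """Precompute the sum of the measure for every subset of the axes."""
    aggregates = {}
    for size in range(len(AXES) + 1):
        for kept in combinations(range(len(AXES)), size):
            totals = defaultdict(float)
            for row in facts:
                totals[tuple(row[i] for i in kept)] += row[-1]
            aggregates[tuple(AXES[i] for i in kept)] = dict(totals)
    return aggregates

aggregates = build_aggregates(facts)
stored_cells = sum(len(table) for table in aggregates.values())

# A query for the yearly total is now a dictionary lookup, not a scan.
print(aggregates[("year",)][(2009,)])
print(len(facts), "fact rows ->", stored_cells, "stored aggregate cells")
```

With hierarchies on each axis the number of combinations grows faster still, which is why real systems usually precompute only a selected part of the aggregates.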

Another problem with OLAP data storage is the sparseness of multidimensional data. For example, if there were no sales in a certain region in 2000, then there will be no value at the intersection of the corresponding cube dimensions. If the OLAP server stores in this case some missing value, then with significant data sparseness, the number of empty cells (requiring, nevertheless, storage space) may many times exceed the number of filled ones, and as a result, the total volume will unreasonably increase. Solutions offered for this by Microsoft are given below.
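The effect is easy to quantify in a short Python sketch. The axis members and the filled cells are invented; storing only the filled cells in a dictionary stands in for a sparse storage scheme, which is an assumption of this example rather than a description of any particular server.

```python
from math import prod

# Invented axes of a small cube.
axis_members = {
    "year": [1999, 2000],
    "region": ["North", "South", "West"],
    "product": ["tea", "coffee"],
}

# Sparse storage: only cells that actually hold a value are kept.
cells = {
    (1999, "North", "tea"): 10.0,
    (1999, "South", "coffee"): 4.0,
    (2000, "North", "coffee"): 7.0,
}

# Dense storage would reserve one cell per combination of members.
dense_cells = prod(len(members) for members in axis_members.values())
empty_cells = dense_cells - len(cells)
sparsity_percent = 100.0 * empty_cells / dense_cells

print(dense_cells, len(cells), sparsity_percent)
```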

Varieties of OLAP. The following can be used to store OLAP data:

Special multidimensional DBMS (OLAP-servers). In this case, one speaks of MOLAP (Multidimensional OLAP). When executing complex queries that analyze data in various dimensions, multidimensional DBMS provide better performance than relational ones. At the same time, the speed of query execution does not depend on which dimension the "cut" of the multidimensional cube is made on.

Traditional relational DBMS - ROLAP (Relational OLAP). The use of special data structures - "star" and "snowflake" schemas, as well as the storage of computed aggregates, make multidimensional analysis of relational data possible. Relational DBMSs are historically more familiar and heavily invested in, so ROLAP is more common so far.

Combined option - HOLAP (Hybrid OLAP), which combines both types of DBMS. One of the options for combining the two types of DBMS is to store aggregates in a multidimensional DBMS, and detailed data (having the largest volume) in a relational one.

Microsoft offers the following OLAP analysis tools:

Microsoft SQL Server 7.0 includes a full-featured OLAP server - SQL Server OLAP Services. The server, of course, is designed to serve client requests, and this requires some kind of interaction protocol and request language. For example, for client interaction with a server relational DBMS - SQL Server - the ODBC or OLE DB protocols and the language SQL queries. To access the OLAP server, Microsoft developed the OLE DB for OLAP protocol and the query language for multidimensional data - MDX (MultiDimensional eXpression). Just as for simplicity and convenience a layer of ADO (ActiveX Data Objects) objects was developed over OLE DB, ADO MD (MultiDimensional ADO) was built over OLE DB for OLAP.

Data analysis tools in Microsoft Office 2000. Microsoft Excel 2000 contains a new PivotTable engine - OLAP PivotTable, which has replaced the previous version of the PivotTable engine. Along with the previous relational data analysis capabilities, the PivotTable engine now includes OLAP data analysis capabilities, that is, it acts as an OLAP client. Microsoft SQL Server 7.0 can be used as a server, as well as any product that supports the OLE DB for OLAP interface. The Excel PivotTable engine fully supports the features provided by the PivotTable Services (PTS) described above. Thus, the analyzed OLAP data can be located both in local cubes and on the OLAP server.

Microsoft Office 2000 also contains a set of ActiveX components called Office 2000 Web Components, which allow you to analyze OLAP data in a Web browser. These include the following four components:

Spreadsheet - implements limited Excel worksheet functionality.

PivotTable - a "twin" of Excel pivot tables; can work with OLAP Services data.

Chart - allows you to build charts based on both relational and OLAP data.

Data Source - a service component for binding other components to the data source.

When working with OLAP data, Web Components access PivotTable Services.

5.5. TECHNOLOGY OF ANALYSIS "DATA MINING"

The emergence of Data Mining technology is associated with the need to extract knowledge from heterogeneous data accumulated by information systems. A concept emerged that in Russian came to be called "mining" or "extraction" of knowledge. Abroad, the term "Data Mining" became established.

The methods of mathematical statistics that were widely used in the past turned out to be useful mainly for testing pre-formulated hypotheses (verification-driven data mining) and for “rough” exploratory analysis, which forms the basis of online analytical processing (OLAP).

The key advantage of Data Mining over previous methods is the ability to automatically generate hypotheses about the relationship between various parameters or data components. The work of an analyst using a traditional data processing package is actually reduced to checking or refining one or two hypotheses that the analyst has generated. In cases where there are no initial assumptions and the amount of data is significant, existing systems lose their efficiency and waste the analyst's time.

Another important feature of Data Mining systems is the ability to process multidimensional queries and search for multidimensional dependencies. Also unique is the ability of Data Mining systems to automatically detect exceptional situations, i.e. data elements that "fall out" of the general patterns.

There are five standard types of patterns that Data Mining methods can identify:

association

sequence

classification

clustering

forecasting

The search for patterns is carried out by methods that are not limited by a priori assumptions about the structure of the sample and the type of distributions of the values of the analyzed indicators. Examples of tasks for such a search when using Data Mining are shown in Table 1.

Table 1 - Comparison of task formulations when using OLAP and Data Mining methods

To solve analytical problems related to complex calculations, forecasting, and modeling "What if ..." scenarios, the technology of multidimensional data analysis is used - OLAP technology. The concept of OLAP was first described in 1993 by Edgar Codd, a well-known database researcher and author of the relational data model, in the paper "Providing OLAP to User-Analysts: An IT Mandate", where he outlined 12 rules of analytical data processing that developers of OLAP products still follow:

1. Conceptual multidimensional representation of data.

2. Transparency (transparent access to external data for the user, allowing him, wherever he is, to communicate with the server using an analytical tool).

3. Availability and detail of data.

4. Consistent performance in report development (If the number of dimensions or the size of the database increases, the analyst user should not feel any degradation in performance).

5. Client-server architecture (OLAP is available from the desktop).

6. Generic dimensionality.

7. Dynamic control of sparse matrices.

8. Multi-user support. It often happens that multiple analyst users feel the need to work together on the same analytical model or to create different models from the same data. And the OLAP tool must provide sharing (query and append), integrity, and security capabilities.

9. Unrestricted cross-dimensional operations.

10. Intuitive data manipulation.

11. Flexible reporting options.

12. Unlimited dimensions and aggregation levels (an analytical tool must support at least 15 dimensions simultaneously, and preferably 20).

The disadvantages of conventional reports for a manager are obvious: the manager does not have time to select the numbers of interest from the report, especially since there may be too many of them. The complexity of reports for understanding, the inconvenience of working with them led to the need to create a new concept of working with data.

When an analyst needs information, he writes an appropriate SQL query to the database, either independently or with the help of a programmer, and receives the data of interest in the form of a report. Reports can be built on demand, on the occurrence of certain events, or at set times. This raises many problems. First of all, the analyst most often does not have advanced programming skills and cannot write an SQL query independently. In addition, the analyst needs not one report but many, and in real time. Programmers, who can easily write any database query, cannot help all the time, because they have their own work to do. Heavy queries to the database server also complicate the work of those employees who constantly work with the databases.

The concept of OLAP appeared precisely to solve such problems. OLAP (OnLine Analytical Processing) is the operational analytical processing of large amounts of data in real time. The purpose of OLAP systems is to make it easier to solve the problems of analyzing large amounts of data and quickly processing complex database queries.

OLAP is:

not a software product

not a programming language

not technology

OLAP is a collection of concepts, principles, and requirements that make it easy for analysts to access data. It is a tool for multidimensional dynamic analysis of large volumes of data in real time.

The task of an analyst is to find patterns in large data sets. The analyst will not pay attention to a single fact, he needs information about several dozen similar events. Single facts in the database are of interest, for example, to an accountant or an employee of the sales department, in whose competence the transaction is located. One record is not enough for an analyst - for example, he may need all the transactions of a given branch or representative office for a month or a year. At the same time, the analyst discards unnecessary details such as the buyer's TIN, his exact address and phone number, contract index, and the like. At the same time, the data that an analyst needs to work necessarily contain numerical values - this is due to the very essence of his activity.

A multidimensional dataset is often represented as an OLAP cube (see Figure 26). The axes of an OLAP cube contain the parameters, and the cells contain the aggregate data that depends on them.

Fig. 26. OLAP cube

OLAP cubes are essentially meta-reports. The advantages of cubes are obvious: data needs to be requested from the database only once, when the cube is built. Since analysts, as a rule, do not work with information that is supplemented and changed on the fly, the generated cube remains relevant for quite a long time. As a result, not only are interruptions in the operation of the database server eliminated (there are no queries returning thousands or millions of rows), but the speed of data access for the analyst is also dramatically increased.
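The idea of querying the source once and then answering questions from the cube can be sketched as follows. This is a minimal illustration under assumed data: the records, the `ALL` marker and the `build_cube` helper are hypothetical and do not describe any specific OLAP product.

```python
from collections import defaultdict
from itertools import product

ALL = "*"  # marker meaning "aggregated over every member of this dimension"

# Hypothetical source records: (product, branch, year, sales amount).
records = [
    ("Tea", "Moscow", 1999, 100.0),
    ("Tea", "Kazan", 1999, 70.0),
    ("Coffee", "Moscow", 2000, 50.0),
    ("Tea", "Moscow", 2000, 120.0),
]

def build_cube(rows):
    """Scan the source once and pre-compute sums for every combination of
    a concrete member or ALL along each of the three dimensions."""
    cube = defaultdict(float)
    for prod, branch, year, amount in rows:
        for key in product((prod, ALL), (branch, ALL), (year, ALL)):
            cube[key] += amount
    return dict(cube)

cube = build_cube(records)

# After the build, every question is a dictionary lookup, not a database query.
tea_total = cube[("Tea", ALL, ALL)]
moscow_2000 = cube[(ALL, "Moscow", 2000)]
grand_total = cube[(ALL, ALL, ALL)]
print(tea_total, moscow_2000, grand_total, len(cube))
```

In this toy case four source records produce 20 cube cells, which hints at why a precomputed cube can occupy far more space than the data it was built from.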

But there is also a significant drawback: an OLAP cube can take up tens or even hundreds of times more space than the original data.

An OLAP cube does not have to be three-dimensional at all. It can be two-dimensional or multidimensional, depending on the problem being solved. Analysts may need more than 20 dimensions - serious OLAP products are designed for just such a number. Simpler desktop applications support a maximum of 6 dimensions.

Not every element of the cube has to be filled in: if some information is missing, the value in the corresponding cell will simply be undefined. Nor does an OLAP application have to store data in a multidimensional structure - the main thing is that the data looks that way to the user.

The filling of the OLAP cube can be carried out both with real data from operational systems and predicted based on historical data. The dimensions of a hypercube can be complex, hierarchical, and relationships can be established between them. During the analysis, the user can change the point of view on the data (the so-called operation of changing the logical view), thereby viewing the data in different sections and solving specific problems. Various operations can be performed on cubes, including forecasting and conditional scheduling (what-if analysis).

A three-dimensional cube can be easily drawn and imagined. However, it is almost impossible to adequately represent or depict a six- or twenty-dimensional cube. Therefore, before use, ordinary two-dimensional tables are extracted from the multidimensional cube, that is, the cube is "cut" along its dimensions at chosen labels. By cutting OLAP cubes along dimensions, the analyst receives, in effect, the "ordinary two-dimensional reports" of interest (not necessarily reports in the usual sense of the term - we are talking about data structures with the same functions). This operation is called "slicing" the cube. The analyst thus obtains a two-dimensional slice of the cube and works with it. The slices the analyst needs are the reports.
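Extracting a two-dimensional table from a three-dimensional cube can be sketched in a few lines. The cube contents and the `slice_by_year` helper are hypothetical. The sketch fixes one label on the year dimension and keeps the remaining two dimensions as the rows and columns of the resulting table.

```python
# Hypothetical three-dimensional cube: (product, branch, year) -> sales.
cube = {
    ("Tea", "Moscow", 1999): 100.0,
    ("Tea", "Kazan", 1999): 70.0,
    ("Coffee", "Moscow", 1999): 40.0,
    ("Tea", "Moscow", 2000): 120.0,
    ("Coffee", "Kazan", 2000): 55.0,
}

def slice_by_year(cells, year):
    """Fix the year dimension and return a two-dimensional table
    of the form {product: {branch: amount}} for that year."""
    table = {}
    for (prod, branch, cell_year), amount in cells.items():
        if cell_year == year:
            table.setdefault(prod, {})[branch] = amount
    return table

report_1999 = slice_by_year(cube, 1999)
print(report_1999)
```

Cells that are absent from the cube simply do not appear in the slice, so the resulting table may itself be sparse.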

Interacting with the OLAP system, the user can view information flexibly, obtain arbitrary data slices, and perform the analytical operations of detailing (drill-down), convolution (roll-up), end-to-end distribution, and comparison over time (see Fig. 27).

Fig. 27. Obtaining arbitrary slices of data when slicing an OLAP cube.
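The detailing and convolution operations are commonly known as drill-down and roll-up along a dimension hierarchy. The sketch below assumes a simple month-to-quarter hierarchy on the time dimension; the data and the function names are invented for illustration.

```python
from collections import defaultdict

# Hypothetical hierarchy on the time dimension: month -> quarter.
QUARTER_OF = {"Jan": "Q1", "Feb": "Q1", "Mar": "Q1", "Apr": "Q2", "May": "Q2"}

# Detailed cells: (product, month) -> sales.
monthly_sales = {
    ("Tea", "Jan"): 10.0, ("Tea", "Feb"): 12.0, ("Tea", "Apr"): 9.0,
    ("Coffee", "Mar"): 7.0, ("Coffee", "May"): 8.0,
}

def roll_up(cells):
    """Convolution: aggregate months into quarters."""
    totals = defaultdict(float)
    for (prod, month), amount in cells.items():
        totals[(prod, QUARTER_OF[month])] += amount
    return dict(totals)

def drill_down(cells, prod, quarter):
    """Detailing: show the monthly cells behind one quarterly total."""
    return {
        month: amount
        for (p, month), amount in cells.items()
        if p == prod and QUARTER_OF[month] == quarter
    }

quarterly = roll_up(monthly_sales)
tea_q1_detail = drill_down(monthly_sales, "Tea", "Q1")
print(quarterly)
print(tea_q1_detail)
```

The quarterly total for a product always equals the sum of the monthly cells returned by the corresponding drill-down, which is what lets the user move between levels of detail without contradictions.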

Classification of OLAP products

Data operations are performed by an OLAP engine. OLAP products are classified by the way data is stored and by the location of the OLAP engine.

According to the method of data storage, they are divided into three categories: MOLAP, ROLAP and HOLAP.

MOLAP - source and aggregate data are stored in a multidimensional database or in a multidimensional local cube.

ROLAP - source data is stored in a relational database or in flat local tables on a file server. Aggregate data can be placed in service tables in the same database. The transformation of data from the relational database into multidimensional cubes occurs at the request of an OLAP tool.

HOLAP - source data remains in a relational database, and the aggregate data is placed in a multidimensional database. An OLAP cube is built at the request of an OLAP tool based on relational and multidimensional data.

Based on the location of the OLAP engine, there are two main classes of OLAP products: OLAP server and OLAP client.

An OLAP server receives a request, calculates and stores aggregate data on the server, and returns to the client application installed on the client computer only the results of queries against the multidimensional cubes stored on the server. Many modern OLAP servers support all three data storage methods: MOLAP, ROLAP, and HOLAP.

An OLAP client builds the multidimensional cube and performs OLAP calculations not on a separate server, but on the user's client computer itself. OLAP clients are also divided into ROLAP and MOLAP.

It is known that an OLAP server can process larger amounts of data than an OLAP client with equal computer power. This is because the OLAP server stores on its hard drives a multidimensional database containing precomputed cubes. Client programs send requests to the server and receive the cube or its fragments. The performance characteristics of an OLAP server are less sensitive to data growth.

The OLAP client must have the entire cube in RAM at the time of operation. Therefore, the amount of data processed by the OLAP client is directly dependent on the amount of RAM on the user's computer. The OLAP client generates a query to the database, which describes the filtering conditions and the algorithm for preliminary grouping of primary data. The server finds, groups records and returns a compact selection for further OLAP calculations. The size of this sample can be tens and hundreds of times smaller than the volume of primary, non-aggregated records. Consequently, the need for such an OLAP client in computer resources is significantly reduced.
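The preliminary grouping step can be sketched with Python's standard sqlite3 module standing in for the database server. The table layout and the data are assumptions for illustration. The client sends a query with a filter and a GROUP BY, and receives a selection that is far smaller than the set of primary records.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product TEXT, branch TEXT, year INTEGER, amount REAL)")

# 1,000 hypothetical primary records: 2 products x 2 branches x 2 years, repeated 125 times.
primary = [
    (prod, branch, year, 1.0)
    for _ in range(125)
    for prod in ("Tea", "Coffee")
    for branch in ("Moscow", "Kazan")
    for year in (1999, 2000)
]
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", primary)

# Query generated by the hypothetical OLAP client: filtering plus preliminary grouping.
selection = conn.execute(
    "SELECT product, branch, SUM(amount) FROM sales "
    "WHERE year = ? GROUP BY product, branch ORDER BY product, branch",
    (2000,),
).fetchall()

print(len(primary), "primary records ->", len(selection), "rows returned")
```

Only the compact selection has to fit in the client's RAM, where the cube is then built from it.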

An OLAP server places minimal demands on the power of client computers. The requirements of the OLAP client are higher, because it performs calculations in its own RAM. If the capacity of client computers is low, the OLAP client will run slowly or not be able to work at all. Buying one powerful server can be cheaper than upgrading all the computers.

The cost of an OLAP server is quite high, and the implementation and maintenance of an OLAP server requires highly qualified personnel. The cost of an OLAP client is an order of magnitude lower than the cost of an OLAP server.

With the introduction of OLAP, the productivity and efficiency of enterprise management increase significantly. The main person in the data analysis process is the expert - a specialist in the subject area. An expert puts forward hypotheses (assumptions) and, in order to analyze them, either looks through samples in various ways or builds models to test the reliability of the hypotheses.

Analytical tools allow the end user, who does not have special knowledge in the field of IT, to work with large amounts of data. The purpose of analytical business systems: decision support at all levels of enterprise management.

Operational-level analytical systems support management of the enterprise in the "mode of operation", i.e. the implementation of a specific production program. Strategic-level analytical systems help the management of the enterprise develop decisions in the "development mode". Strategic management systems are analytical ISs that support solving the key tasks of the company's strategic management.

Many articles on OLAP can be found on the site: http://www.olap.ru/basic/oolap.asp

Perhaps, to some, the use of OLAP technology (On-line Analytic Processing) in building reports will seem exotic, so for them the use of an OLAP-CUBE is not at all one of the most important requirements for automating budgeting and management accounting.

In fact, it is very convenient to use the multidimensional CUBE when working with management reporting. When developing budget formats, one may encounter the problem of multivariate forms (more on this can be found in Book 8 "Technology for setting budgeting in a company" and in the book "Setting and automating management accounting").

This is because effective management of a company requires more and more detailed management reporting. That is, the system uses more and more analytical slices (in information systems, analytics are defined by a set of directories).

Naturally, this means that managers want to receive reports in all analytical sections of interest to them. And this means that the reports need to somehow be made to “breathe”. In other words, a report with the same meaning should provide information in various analytical sections. Therefore, static reports no longer suit many modern managers. They need the dynamics that a multidimensional CUBE can provide.

Thus, OLAP technology has already become an obligatory element of modern and future information systems. Therefore, when choosing a software product, you need to pay attention to whether it uses OLAP technology.

And you need to be able to distinguish real CUBEs from imitations. Pivot tables in MS Excel are one such imitation. Yes, this tool looks like a CUBE, but in fact it is not, since these are static, not dynamic tables. In addition, they have a much worse implementation of the ability to build reports that use elements from hierarchical directories.

The simplest example, the sales budget, confirms the relevance of using the CUBE when building management reporting. In this example, the following analytical slices are relevant for the company: products, branches, and distribution channels. If these three analytics are important for the company, then the sales budget (or report) can be displayed in several ways.

It should be noted that if you create budget lines based on three analytical slices (as in the example under consideration), this allows you to create quite complex budget models and make detailed reports using CUBE.

For example, a sales budget can be compiled using only one analytic (reference book). An example of a sales budget based on the single "Products" analytic is shown in Figure 1.

Fig. 1. An example of a sales budget built on the basis of one analytic "Products" in an OLAP-CUBE

The same sales budget can be compiled using two analytics (reference books). An example of a sales budget built on the basis of two analytics, "Products" and "Affiliates", is presented in Figure 2.

Fig. 2. An example of a sales budget built on the basis of two analytics "Products" and "Affiliates" in the OLAP-CUBE of the "INTEGRAL" software package

If there is a need to build more detailed reports, then the same sales budget can be compiled using three analytics (reference books). An example of a sales budget built on the basis of three dimensions "Products", "Affiliates" and "Distribution channels" is presented in Figure 3.

Fig. 3. An example of a sales budget built on the basis of three analytics "Products", "Affiliates" and "Distribution channels" in the OLAP-CUBE of the "INTEGRAL" software package

It should be recalled that the CUBE used to generate reports allows you to display data in different sequences. In Figure 3, the sales budget is first "expanded" by product, then by branch, and then by distribution channel.

The same data can be presented in a different sequence. In Figure 4, the same sales budget is "expanded" first by product, then by distribution channel, and then by branch.

Fig. 4. An example of a sales budget built on the basis of three analytics "Products", "Distribution channels" and "Affiliates" in the OLAP-CUBE of the "INTEGRAL" software package

In Figure 5, the same sales budget is "expanded" first by branch, then by product, and then by distribution channel.

Fig. 5. An example of a sales budget built on the basis of three analytics "Branches", "Products" and "Distribution channels" in the OLAP-CUBE of the "INTEGRAL" software package

In fact, these are not all the possible ways to output the sales budget.
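Expanding the same budget lines in different orders can be sketched with a short recursive grouping function. The budget lines, field names and the `expand` helper are hypothetical and are not taken from the INTEGRAL software package.

```python
# Hypothetical budget lines; each carries three analytics and an amount.
budget_lines = [
    {"product": "Tea", "branch": "Moscow", "channel": "Retail", "amount": 100.0},
    {"product": "Tea", "branch": "Kazan", "channel": "Wholesale", "amount": 70.0},
    {"product": "Coffee", "branch": "Moscow", "channel": "Retail", "amount": 50.0},
    {"product": "Tea", "branch": "Moscow", "channel": "Wholesale", "amount": 30.0},
]

def expand(lines, order):
    """Return a nested dict of totals, expanded in the given order of analytics."""
    if not order:
        return sum(line["amount"] for line in lines)
    first, rest = order[0], order[1:]
    groups = {}
    for line in lines:
        groups.setdefault(line[first], []).append(line)
    return {member: expand(group, rest) for member, group in groups.items()}

# The same data, expanded in two different sequences.
by_product_first = expand(budget_lines, ["product", "branch", "channel"])
by_branch_first = expand(budget_lines, ["branch", "product", "channel"])
print(by_product_first)
print(by_branch_first)
```

With three analytics there are six possible expansion orders, and passing a shorter list to `expand` gives the one- and two-analytic views as well.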

In addition, you need to pay attention to the fact that the CUBE allows you to work with hierarchically structured reference books. In the presented examples, the hierarchical directories are "Products" and "Distribution Channels".

From the user's point of view, in this example he receives several management reports (see Fig. 1-5), but in terms of settings in the software product it is one report. With the help of the CUBE, it can simply be viewed in several ways.

Naturally, in practice, a very large number of output options for various management reports is possible if their line items are based on one or more analytics. The set of analytics itself depends on the users' needs for detail. True, one should not forget that, on the one hand, the more analytics there are, the more detailed the reports that can be built. But, on the other hand, it means that the financial model of budgeting will be more complex. In any case, if there is a CUBE, the company will be able to view the necessary reporting in various versions, in accordance with the analytical sections of interest.

It is necessary to mention a few more features of the OLAP-CUBE.

There are several dimensions in a multidimensional hierarchical OLAP-CUBE: row type, date, rows, lookup 1, lookup 2 and lookup 3 (see Fig. 6). Naturally, the report displays as many directory buttons as there are directories in the budget line with the maximum number of directories. If no budget line contains a directory, the report will not contain any directory buttons.

Initially, the OLAP-CUBE is built on all dimensions. By default, when a report is first built, the dimensions are located in the areas shown in Figure 6. That is, the "Date" dimension is located in the area of vertical dimensions (the column area), the dimensions "Rows", "Lookup 1", "Lookup 2" and "Lookup 3" are in the area of horizontal dimensions (the row area), and the "Row Type" dimension is in the area of "unexpanded" dimensions (the page area). If a dimension is in the last area, the data in the report will not be "expanded" by that dimension.

Each of these dimensions can be placed in any of the three areas. After dimensions are moved, the report is instantly rebuilt according to the new dimension configuration. For example, you can swap the date and the rows with directories. Or you can move one of the reference books to the vertical dimension area (see Fig. 7). In other words, the report in the OLAP-CUBE can be "twisted" to choose the output version that is most convenient for the user.

Fig. 7. An example of rebuilding a report after changing the dimension configuration in the "INTEGRAL" software package
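Moving dimensions between the row, column and page areas amounts to re-aggregating the same cells with a different grouping. The sketch below is a simplified model under assumed data: the dimension names, the cells and the `build_report` helper are hypothetical and do not describe how the INTEGRAL package is implemented.

```python
from collections import defaultdict

# Hypothetical cells with three dimensions and one value each.
cells = [
    {"date": "Jan", "row": "Sales", "lookup1": "Tea", "value": 10.0},
    {"date": "Jan", "row": "Sales", "lookup1": "Coffee", "value": 5.0},
    {"date": "Feb", "row": "Sales", "lookup1": "Tea", "value": 12.0},
]

def build_report(data, row_dims, col_dims):
    """Aggregate values into {row_key: {col_key: total}}.
    Dimensions listed in neither area stay in the page area and are summed over."""
    report = defaultdict(lambda: defaultdict(float))
    for cell in data:
        row_key = tuple(cell[d] for d in row_dims)
        col_key = tuple(cell[d] for d in col_dims)
        report[row_key][col_key] += cell["value"]
    return {r: dict(cols) for r, cols in report.items()}

# Default layout: date in columns, rows and the directory in rows.
default_view = build_report(cells, row_dims=("row", "lookup1"), col_dims=("date",))
# The directory moved to the column area.
moved_view = build_report(cells, row_dims=("row",), col_dims=("lookup1", "date"))
# The directory moved to the page area: the report is no longer expanded by it.
page_view = build_report(cells, row_dims=("row",), col_dims=("date",))
print(default_view)
print(moved_view)
print(page_view)
```

Each call rebuilds the report from the same cells, so changing the layout never changes the totals, only how they are broken down.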

The dimension configuration can be changed either in the main form of the CUBE or in the dimension map editor (see Fig. 8). In this editor, you can also drag and drop dimensions from one area to another with the mouse. In addition, you can swap dimensions within the same area.

In the same form, you can also configure some dimension parameters. For each dimension, you can customize the location of the totals, the sort order of the elements and the names of the elements (see Fig. 8). You can also specify which element name to display in the report: abbreviated (Name) or full (FullName).

Fig. 8. Dimension map editor of the "INTEGRAL" software package

Dimension parameters can also be edited directly in each dimension (see Fig. 9). To do this, click on the icon located on the button next to the name of the dimension.

Fig. 9. An example of editing directory 1 "Products and services" in

With this editor, you can select the elements that you want to show in the report. By default, all items are displayed in the report, but if necessary, some items or folders can be omitted. For example, if you need to display only one product group in the report, then all the rest must be unchecked in the dimension editor. After that, the report will contain only one product group (see Fig. 10).

You can also sort items in this editor. In addition, elements can be rearranged in various ways. After such a regrouping, the report is instantly rebuilt.

Fig. 10. An example of displaying only one product group (folder) in the report in the "INTEGRAL" software package

In the dimension editor, you can quickly create your own groups, drag elements from directories there, etc. By default, only the Other group is automatically created, but you can create other groups as well. Thus, using the dimension editor, you can configure which elements of reference books and in what order should be displayed in the report.

It should be noted that none of these rearrangements are saved. That is, after the report is closed or recalculated, all directories will be displayed in the report in accordance with the configured methodology.

In fact, all such changes could have been made initially when setting up the report lines.

For example, using restrictions, you can also specify which elements or groups of directories should be displayed in the report, and which should not.

Note: the topic of this article is discussed in more detail at the workshops "Business Budget Management" and "Setting up and automation of management accounting" conducted by the author of this article, Alexander Karpov.

If the user regularly needs to display only certain elements or directory folders in the report, it is better to make such settings in advance when creating report lines. If various combinations of directory elements in reports are important for the user, then no restrictions need to be set when setting up the methodology. All such restrictions can be quickly configured using the dimension editor.