
Creating a data warehouse model based on the corporate data model: what an enterprise data warehouse is and who to sell it to

This article focuses on data warehouse architecture: what to be guided by when building it, which approaches work, and why.

"The tale is a lie - but there is a hint in it ..."

Grandfather planted... a warehouse. And the warehouse grew and grew, bigger and bigger. But he never really knew how it worked. So grandfather decided to hold a review. He called grandmother, granddaughter, the cat and the mouse to the family council and said: "Our warehouse has grown. Data flows into it from all systems, tables beyond counting. Users concoct their reports. Everything seems fine - live and prosper. Except for one sorrow: nobody knows how it works. It devours disks beyond counting - you can't buy enough! And the users have taken to coming to me with all sorts of complaints: now a report freezes, now the data is stale. And worst of all - we bring reports to the tsar-father, and the numbers do not agree with each other. Heaven forbid the tsar gets angry - then neither you nor I will keep our heads. So I decided to gather you together and take counsel: what are we going to do?"

He cast his gaze over the meeting and asked:
- You, grandmother, do you know how our warehouse is arranged?
- No, grandfather, I don't. And how would I know? Look what stalwart lads guard it! You can't get near them. I went to visit them once, brought them pies. They ate the pies, wiped their moustaches and said: "Why have you come, grandma? What warehouse do you want? Just tell us what report you need - we'll make it for you! And do bring the pies more often - they're awfully tasty."
- And you, beloved granddaughter, do you know how our warehouse is arranged?
- No, grandpa, I don't. They gave me access to it once. I connected, I looked - tables beyond counting, and all sorts of schemas hidden away. My eyes ran wide... At first I was confused. Then I looked closer: some tables are empty, others are full, but only half. And the data seems to be duplicated. No wonder there aren't enough disks, with such redundancy!
- Well, and you, cat, what do you say about our warehouse? Is there anything good about it?
- How could I not, grandfather - I will. At granddaughter's request I tried to build a pilot in a separate schema - a small data mart. To understand which trade is profitable for our state: which goods the merchants do well with and pay their tribute on, replenishing the treasury, and which ones do very poorly. So I began picking data out of that warehouse for myself. I collected facts and tried to match them up against products. And what did I see, grandfather: the products seem to be the same, but look at the tables - they are different! I started combing them with granddaughter's comb. I combed and combed, and brought them to a certain uniformity, pleasing to the eye. But I rejoiced too early: the next day I ran my scripts to refresh that wonderful data in the mart - and everything fell apart on me! "How so?" I thought - granddaughter will be upset; today we were to show our pilot to the minister. How can we go with data like that?
- Sad tales you tell, cat. Well, and you, little mouse - did you really not try to find out about the warehouse? You are a lively, nimble, sociable one! What will you tell us?
- How could I not try, grandfather - I am a quiet mouse, but a nimble one. Once the cat asked me to get hold of the data model of our state warehouse - "all hope is on you, mouse," he said. Well, why wouldn't one do a good deed for good people (and cats)? I went to the castle, where the head of the warehouse keeps the data model hidden in a safe. And I hid. I waited for him to take that model out of the safe. As soon as he stepped out for coffee, I jumped onto the table. I looked at the model - and could not understand a thing! How so? I don't recognize our warehouse! We have thousands upon thousands of tables, endless data streams! And here - everything is harmonious and beautiful... He looked at that model a while - and put it back into the safe.
- Strange things indeed you have told us, mouse.
The grandfather fell into deep thought.
- What are we going to do, my friends? With a warehouse like this you won't live long... The users will soon lose their patience.

Whatever the grandfather from the tale decides - to build a new warehouse or to try to resuscitate the existing one - conclusions must be drawn before "rolling up our sleeves" again.
Let's set aside the organizational aspects - such as the danger of concentrating expertise in a narrow, closed group, the absence of control processes, the lack of transparency of the architecture of the systems used in the enterprise, and so on.
Today I would like to focus on building the architecture of a specific system (or group of systems) - data warehouses: what needs to be kept in focus first of all when an organization sets out to build such a complex and expensive system as a warehouse.

Debriefing

None of us, working on the creation and development of a system, wants it to turn out to be a "temporary shack", or a solution that will "wither away" in a year or two because it cannot meet the requirements and expectations of the customers and the business. However strong the bias towards "agile methodologies" may be today, it is much more pleasant to feel like a craftsman who makes violins than an artisan who whittles sticks for disposable drums.
Our intention sounds natural: to make solid, high-quality systems that will not demand regular "night vigils with a file", that we will not be ashamed of in front of end users, and that will not look like a "black box" to all the "uninitiated" who come after us.

To begin with, let's sketch out a list of typical problems that we regularly encounter when working with warehouses. Let's simply write down what we have, without yet trying to order or formalize it.

  1. In principle, we have a good warehouse: if you leave it alone, everything works. True, as soon as a change is required, "local collapses" begin.
  2. Data is loaded daily, according to the schedule, within one large process, in an 8-hour window. And that suits us. But if a failure suddenly occurs, it requires manual intervention - and then things can run unpredictably long, because a human has to take part in the process.
  3. We rolled out a release - expect problems.
  4. A single source failed to deliver data on time - and all the processes wait.
  5. Data integrity is enforced by the database - so our processes crash with an error when it is violated.
  6. We have a very large warehouse: 2000 tables in one common schema, and 3000 more in many other schemas. We have little idea by now of how they are arranged and why they appeared. So it is often hard to reuse anything, and many tasks have to be solved anew - it is easier and faster than understanding "someone else's code". As a result we have discrepancies and duplicated functionality.
  7. We expect the source to provide good-quality data. But it turns out that this is not the case. As a result, we spend a lot of time reconciling our final reports. We have even become quite good at it - we have a well-established process. True, it takes time. But users are used to it...
  8. The user does not always trust our reports and requires a justification of one figure or another. In some cases he is right, and in others he is not. But it is very difficult for us to justify them, since we have no means of "end-to-end analysis" (or data lineage).
  9. We could bring in additional developers. But we have a problem - how do we include them in the work? What is the most efficient way to parallelize jobs?
  10. How to develop the system gradually, without going into the development of the "core of the system" for a whole year?
  11. The data warehouse is tied to the corporate model. But we know for sure (we saw it at bank XYZ) that building a model can take forever (we went to bank XYZ and discussed business entities for six months, with no movement at all). Why is it needed at all? Or maybe we would be better off without it, if it causes so many problems? Maybe we can generate it somehow?
  12. We decided to maintain a model. But how do we systematically evolve the warehouse data model? Do we need "rules of the game", and what might they be? What will that give us? And what if we get the model wrong?
  13. Should we keep the data, or the history of its changes, if "the business does not need it"? We would not want to "store garbage" and complicate the use of this data for real tasks. Should the warehouse keep history? What kind? How does the warehouse work with time?
  14. Should we try to unify data in the warehouse if we have a master data management system? If there is MDM, does that mean the whole master data problem is now solved?
  15. We are expected to replace key accounting systems soon. Does the datastore need to be ready to change source? How can this be achieved?
  16. Do we need metadata? What do we mean by this? Where exactly can they be used? How can you implement it? Do I need to store them "in one place"?
  17. Our Customers are extremely unstable in their requirements and desires - something is constantly changing. In general, our business is very dynamic. While we are doing something, it already becomes unnecessary. How can we do it in such a way as to give the result as quickly as possible - like hot cakes?
  18. Users demand responsiveness. But we cannot run our main load processes often, because that loads the source systems (it hurts their performance) - so we hang additional data streams on the side, which pick up pointwise just what we need. True, there end up being many such streams, and later we discard some of the data. On top of that there will be a reconciliation problem. But there is no other way...
That is already quite a lot. And it is not a complete list - it is easy to extend. We will not hide it away in a drawer, but hang it in a conspicuous place, keeping these issues in focus as we work.
Our task is to arrive at a comprehensive solution in the end.

Antifragility

Looking at this list, one conclusion can be drawn. It is not difficult to create some "database for reporting", load data into it, and even build routine data-refresh processes on top. The system somehow starts to live, users appear, and with them come obligations and SLAs; new requirements arise, additional sources are connected, methodologies change - all of this has to be taken into account in the course of development.

After some time the picture is as follows:
"Here is the warehouse. And it works, as long as you don't touch it. Problems begin when we have to change something."

A change arrives whose impact we are unable to assess or comprehend (since we never built such tools into the system in the first place) - and, so as not to take risks, we do not touch what exists, but bolt on one more extension on the side, and another, and another - turning our solution into slums or, as they say in Latin America, "favelas", where even the police are afraid to enter.
There is a feeling of losing control over one's own system, of chaos. More and more hands are needed to maintain the existing processes and solve problems. And changes become harder and harder to make. In other words, the system becomes unstable to stress, maladaptive to change. On top of that there is a strong dependence on the characters who "know the fairway", since nobody else has a "map".

This property of an object - to break down under the influence of chaos, random events and shocks - Nassim Nicholas Taleb calls fragility. He also introduces the opposite concept: antifragility, when an object does not break down from stress and accidents, but directly benefits from them ("Antifragile: Things That Gain from Disorder").
Otherwise it can be called adaptability, or resilience to change.

What does this mean in our context? What are the "sources of chaos" for IT systems? And what does it mean to "benefit from chaos" in terms of IT architecture?
The first thought that comes to mind is changes that come from outside. What is the outside world for the system - for a warehouse in particular? First of all, of course, changes on the side of the warehouse's data sources:

  • changing the formats of incoming data;
  • replacement of some data source systems with others;
  • change of rules / platforms for systems integration;
  • changing the interpretation of data (formats are saved, the logic of working with data changes);
  • changing the data model if the integration is done at the data level (parsing the database transaction log files);
  • growth in data volumes - while there was little data in the source system and the load was low, you could extract it at any time with an arbitrarily heavy query; as data and load grew, strict restrictions appeared;
  • etc.
The source systems themselves, the composition of information and its structure, the type of integration interaction, as well as the very logic of working with data can change. Each system implements its own data model and approaches to working with them, which meet the goals and objectives of the system. And no matter how hard they try to unify industry models and reference practices, nuances will inevitably emerge. (And besides, the process of industry unification itself, for various reasons, is not making much progress.)
A culture of working with corporate data - the presence and governance of an information architecture, a unified semantic model, master data management (MDM) systems - somewhat eases the task of consolidating data in the warehouse, but does not eliminate the need for it.

No less critical changes are initiated by the warehouse consumers (requirements change):

  • previously the data was enough to build a report - now additional fields or a new data source need to be connected;
  • previously implemented data processing methods have become outdated - the algorithms, and everything they affect, need to be reworked;
  • previously everyone was satisfied with the current value of a dictionary attribute on the dashboard - now the value that was relevant at the moment of the analyzed fact/event is required;
  • a requirement has appeared for the depth of stored history that did not exist before - to keep data not for 2 years but for 10 years;
  • previously there was enough data as of the “end of the day / period” - now you need the data state “within the day”, or at the time of a certain event (for example, making a decision on a loan application - for Basel II);
  • earlier we were satisfied with reporting on data for yesterday (T-1) or later, now we need T0;
  • etc.
Both the integration interactions with the source systems and the requirements from the consumers of the warehouse data are external factors for the data warehouse: some source systems replace others, data volumes grow, the formats of incoming data change, user requirements change, etc. And all these are typical external changes for which our system - our repository - must be ready. With the right architecture, they shouldn't kill the system.

But that's not all.
Speaking about variability, we remember external factors first of all. After all, inside we can control everything - or so it seems to us, right? Yes and no. Yes, most of the factors outside our zone of influence are external. But there is also "internal entropy". And precisely because of its presence we sometimes need to return "to point zero" - to start the game over.
In life we often tend to start from scratch. Why is this so characteristic of us? And is it really that bad?
Applied to IT, for the system itself this can be very good - the ability to rethink individual decisions, especially when we can do it locally. Refactoring is the process of untangling the "web" that periodically appears as a system evolves. Going back to the beginning can be useful. But it has a price.
With competent architecture management, this price decreases - and the system development process itself becomes more controllable and transparent. A simple example: if the principle of modularity is observed, you can rewrite a separate module without affecting the external interfaces. And this cannot be done with a monolithic structure.

The antifragility of a system is determined by the architecture embedded in it. And it is this property that makes it adaptive.
When we talk about an adaptive architecture, we mean that the system is able to adapt to changes - not at all that we constantly change the architecture itself. On the contrary, the more stable the architecture, the fewer requirements entail its revision, and the more adaptive the system.

Solutions involving a revision of the entire architecture will have a much higher price. And you need to have very good reasons for their adoption. For example, such a rationale might be a requirement that cannot be implemented within the existing architecture. Then they say - a requirement has appeared that affects the architecture.
Thus, we also need to know our “limits of antifragility”. Architecture is not developed “in a vacuum” - it is based on current requirements and expectations. And if the situation changes fundamentally - we must understand that we have gone beyond the current architecture - and we need to revise it, work out a different solution - and think over the transition paths.
For example, we assumed that we would always need data in the warehouse as of the end of the day, and that we would pull data daily through standard system interfaces (a set of views). Then the risk management department came with a demand to receive data not at the end of the day, but at the moment the lending decision is made. There is no point trying to stretch what cannot be stretched - you just need to acknowledge this fact, and the sooner the better - and start working on an approach that will solve the problem.
There is a very fine line here - if we take into account only the "requirements in the moment" and do not look several steps ahead (and several years ahead), then we increase the risk of facing a requirement affecting architecture too late - and the price of our change will be very high. Looking a little ahead - within the boundaries of our horizon - has not hurt anyone yet.

The system from the "warehouse tale" is just one example of a very shaky system built on fragile design approaches. And when that is the case, destruction comes rather quickly - for this particular class of systems.
Why can I say so? The topic of warehouses is not new. The approaches and engineering practices developed over all this time were aimed precisely at keeping the system viable.
A simple example: one of the most common reasons that warehouse projects never take off is the attempt to build the warehouse on top of source systems that are still under development, without agreeing on integration interfaces - trying to fetch data directly from tables. As a result, development went on, the source database changed during that time - and the load streams in the warehouse became inoperative. It is too late to redo anything. And if you have not protected yourself by building several layers of tables inside the warehouse, you can throw everything away and start over. This is just one example, and one of the simplest.

Taleb's criterion for fragility and antifragility is simple. The main judge is time. If a system withstands the test of time and demonstrates its "vitality" and "indestructibility", it has the property of antifragility.
If, when designing a system, we take antifragility into account as a requirement, this pushes us toward approaches to building its architecture that make the system more adaptable both to "chaos from outside" and to "chaos from within". And ultimately the system will have a longer lifespan.
None of us wants to build "temporary shacks". And let's not deceive ourselves that there is no other way today. Looking a few steps ahead is normal for a person at any time, and especially in a crisis.

What is a data warehouse and why are we building it

An article on warehouse architecture assumes that the reader not only knows what a warehouse is, but also has some experience with such systems. Nevertheless, I consider it necessary to return to the origins, to the beginning of the path, because that is where the "fulcrum" of development lies.

How did people come to the idea that data warehouses are needed? And how are they different from just a "very large database"?
A long time ago, when there were simply "business data processing systems" in the world, there was no division of IT systems into classes such as front-end OLTP systems, back-office DSS, text processing systems, data warehouses, and so on.
It was during this time that Michael Stonebraker created Ingres, one of the first relational DBMSs.
And that was the time when the era of personal computers burst into the computer industry like a whirlwind and forever changed all the ideas of the IT community of that time.

At that time it was easy to find enterprise applications written on top of desktop-class DBMSs such as Clipper, dBase and FoxPro. And the market for client-server applications and DBMSs was only gathering momentum. One after another, database servers appeared that would occupy their niche in the IT space for a long time - Oracle, DB2, and others.
The term "database application" was in common use. What did such an application include? Simplified: some input forms through which users could enter information, some calculations launched "by button" or "on schedule", and some reports that could be viewed on screen or saved to files and sent to print.
“Nothing special - just a regular application, just a database,” one of my mentors remarked early on in my career. "So nothing special?" - I thought then.

If you look closely, there are some peculiarities after all. As the number of users grows, as the volume of incoming information grows, as the load on the system increases, its developers and designers resort to certain "tricks" in order to keep performance at an acceptable level. The very first is splitting the monolithic "business data processing system" into an accounting application that supports users online, and a separate application for batch data processing and reporting. Each of these applications has its own database and is even hosted on a separate database server instance, with different settings for the different load profiles - OLTP and DSS. And data flows are built between them.

Is that all? It would seem the problem is solved. What happens next?
Then companies grow, and their information needs multiply. The number of interactions with the outside world grows too. And as a result there is not one large application that fully automates all processes, but several different ones from different vendors. The number of systems that generate information - data source systems - in the company keeps increasing. And sooner or later there arises a need to see and compare information obtained from different systems. This is how data warehouses appear in a company - a new class of systems.
The generally accepted definition of this class of systems is as follows.

A data warehouse is a subject-oriented database specially designed and built for preparing reports and for business analysis, with the aim of supporting decision-making in an organization.
Thus, the consolidation of data from different systems - the ability to look at them in a certain "unified" way - is one of the key properties of systems of the data warehouse class. This is the reason warehouses emerged in the course of the evolution of IT systems.

Key features of data warehouses

Let's take a closer look. What are the key features of these systems? What makes data warehouses different from other enterprise IT systems?

First, these are large volumes. Very large. VLDB (Very Large Database) is what leading vendors call such systems when they give recommendations on the use of their products. Data from all of the company's systems flows into this large database and is stored there "forever and unchanged", as the textbooks say (in practice, life turns out to be more complicated).

Second, this is historical data. Data warehouses have been called the "corporate memory". Working with time in warehouses is quite interesting. In accounting systems the data is up to date at the current moment. Then the user performs some operation - and the data is updated. The history of changes may or may not be preserved, depending on the accounting practice. Take a bank account balance, for example. We may be interested in the current balance "now", at the end of the day, or at the moment of some event (for example, at the moment a score is calculated). While the first two are fairly easy to obtain, the last will most likely require special effort. A user working with the warehouse can refer to past periods, compare them with the current one, and so on. It is these time-related capabilities - obtaining the state of the data at various points on the time axis, to a certain depth into the past - that significantly distinguish data warehouses from accounting systems.
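As a minimal illustration of "the state of the data at a point on the time axis", here is a sketch of versioned history in the spirit of a slowly changing dimension. The table layout and the field names (valid_from, valid_to) are assumptions made for the example, not a prescribed standard:

```python
from datetime import date

# Hypothetical history of an account balance kept in the warehouse:
# each version is valid from 'valid_from' up to (but not including) 'valid_to'.
balance_history = [
    {"account": "42", "balance": 1000.0, "valid_from": date(2024, 1, 1),  "valid_to": date(2024, 1, 15)},
    {"account": "42", "balance": 1200.0, "valid_from": date(2024, 1, 15), "valid_to": date(9999, 12, 31)},
]

def balance_as_of(history, account, as_of):
    """Return the balance that was current at an arbitrary point on the time axis."""
    for version in history:
        if version["account"] == account and version["valid_from"] <= as_of < version["valid_to"]:
            return version["balance"]
    return None

print(balance_as_of(balance_history, "42", date(2024, 1, 10)))  # 1000.0 - state in the past
print(balance_as_of(balance_history, "42", date(2024, 2, 1)))   # 1200.0 - current state
```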

Third, this is consolidation and unification of data. For joint analysis to become possible, the data must be brought to a common form - a unified data model - and facts must be mapped to unified reference data. There can be several aspects and difficulties here. The first is conceptual: different people from different departments may understand different things under the same term, and, conversely, call the same thing by different names. How do you provide a "single view" while preserving the specific vision of a particular user group?

Fourth, this is work with data quality. In the process of loading data into the warehouse, it is cleansed and general transformations are performed. The general transformations should be done in one place and then used to build various reports. This avoids the inconsistencies that so annoy business users - especially executives who are brought figures from different departments that do not agree with each other. Poor data quality produces errors and discrepancies in reports, and the consequence is a drop in the level of user trust in the entire system and in the analytical service as a whole.

Architectural concept

Anyone who has come across a warehouse has most likely observed some kind of "layered structure", since it is this architectural paradigm that has taken root for systems of this class. And that is no coincidence. Warehouse layers can be perceived as separate components of the system, each with its own tasks, area of responsibility and "rules of the game".
Layered architecture is a means of dealing with the complexity of the system: each subsequent layer is abstracted from the complexities of the internal implementation of the previous one. This approach allows tasks of the same type to be singled out and solved in a uniform way, without re-inventing the "wheel" from scratch every time.
A conceptual architectural diagram is shown schematically in the figure. It is a simplified diagram that reflects only the key idea - the concept - without the "anatomical details" that would arise with deeper elaboration.

As shown in the diagram, the following layers are conceptually distinguished: three main layers, each containing a data storage area (indicated by a filled rectangle) and data loading software (conventionally indicated by arrows of the same color), plus an auxiliary service layer, which nonetheless plays a very important connecting role - managing data loads and controlling data quality.

Primary Data Layer (also called staging, or the operational layer) - intended for loading data from source systems and storing the primary information without transformations, in its original quality, with a complete history of changes preserved.
The task of this layer is to abstract the subsequent warehouse layers from the physical structure of the data sources, from the methods of data collection, and from the methods of extracting the delta of changes.

Core Data Layer - the warehouse core - the central component of the system, which distinguishes the warehouse from a mere "batch integration platform" or a "big data dump": its main role is consolidating data from different sources and reducing it to uniform structures and keys. It is during loading into the core that the main work on data quality and the general transformations, which can be quite complex, is done.
The task of this layer is to abstract its consumers from the specifics of the logical structure of the data sources and from the need to reconcile data from different systems, and to ensure data integrity and quality.
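As a toy sketch of "reducing to uniform keys": the same real-world entity usually carries different codes in different source systems, and the core maps them onto a single warehouse key. All names here (system names, codes, the key_map structure) are hypothetical, for illustration only:

```python
# Hypothetical cross-reference kept in the core: the same real-world product has
# different codes in different source systems; the core maps them to one warehouse key.
key_map = {
    ("billing", "PRD-001"): 1001,
    ("crm", "cake_std"): 1001,   # same product as PRD-001, just coded differently
    ("billing", "PRD-017"): 1002,
}

def to_core_key(source_system, source_code):
    """Translate a source-local code into the unified warehouse key (None if unmapped)."""
    return key_map.get((source_system, source_code))

print(to_core_key("crm", "cake_std"))      # 1001 - the unified key shared by both sources
print(to_core_key("billing", "PRD-001"))   # 1001
```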

Data Mart Layer - analytical data marts - the component whose main function is to convert data into structures convenient for analysis (if BI works with the marts, this is usually a dimensional model) or into structures matching the requirements of a consuming system.
As a rule, data marts take data from the core - as a reliable, verified source - i.e. they use the service of that component to bring the data into a single form. We will call such marts regular. In some cases marts can take data directly from staging, operating on primary data (in source keys). This approach is usually used for local tasks where consolidation of data from different systems is not required and where timeliness matters more than data quality; such marts are called operational. Some analytical indicators have very complex calculation methods, so for such non-trivial calculations and transformations so-called secondary marts are created.
The task of the data mart layer is to prepare data according to the requirements of a specific consumer - a BI platform, a group of users, or an external system.

Each of the layers described above consists of a persistent data storage area and a software module for loading and transforming data. The division into layers and areas is logical. Physically, the implementation of these components can vary - you can even use different platforms for storing or transforming data on different layers, if that is more efficient.
The storage areas contain technical tables (buffer tables) used in the process of data transformation, and target tables that the consuming component refers to. It is good practice to "cover" the target tables with views: this eases the subsequent maintenance and development of the system. The data in the target tables of all three layers is marked with special technical fields (meta-attributes), which are used to support the data loading processes and to enable an informational audit of the data flows in the warehouse.
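A minimal sketch of such meta-attribute stamping (the field names meta_process_id, meta_source_system and meta_load_ts are hypothetical - every warehouse defines its own convention):

```python
from datetime import datetime, timezone

def stamp_meta(row: dict, process_id: int, source_system: str) -> dict:
    """Add technical fields used for load control and data-flow audit."""
    return {
        **row,
        "meta_process_id": process_id,               # id of the load process that wrote the row
        "meta_source_system": source_system,         # where the record originally came from
        "meta_load_ts": datetime.now(timezone.utc),  # when the row landed in this layer
    }

business_row = {"account_id": 42, "balance": 1050.00}
print(stamp_meta(business_row, process_id=1017, source_system="core_banking"))
```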

In addition, a special component (or set of components) is distinguished that provides service functions for all layers. One of its key tasks is a controlling function: to provide "uniform rules of the game" for the system as a whole, while leaving the freedom to use different implementation options for each of the layers described above - including different technologies for loading and processing data, different storage platforms, etc. Let's call it the service layer. It contains no business data, but it has its own storage structures: a metadata area, an area for working with data quality, and possibly other structures, depending on the functions assigned to it.

Such a clear division of the system into separate components significantly increases the controllability of its development:

  • the complexity of the task posed to the developer of a given component's functionality is reduced (they do not have to simultaneously solve integration with external systems, think through data cleansing procedures, and work out the optimal presentation of data for consumers) - the task is easier to decompose, estimate, and deliver in small increments;
  • different performers (and even teams or contractors) can be brought into the work, because this approach allows tasks to be parallelized effectively while reducing their mutual influence;
  • the presence of persistent staging allows data sources to be connected quickly, without designing the entire core or the marts for the whole subject area, and the remaining layers can then be built out gradually according to priorities (moreover, the data will already be in the warehouse, available to system analysts, which greatly facilitates the subsequent development of the warehouse);
  • the presence of a core allows all the work with data quality (as well as possible blunders and errors) to be hidden from the marts and from the end user, and, most importantly, using this component as the single data source for the marts avoids data reconciliation problems, because common algorithms are implemented in one place;
  • separating out the marts makes it possible to account for the differences and specifics of data understanding that users of different departments may have, while designing them for BI requirements allows not only aggregated figures to be delivered, but data validity to be ensured by providing the ability to drill down to the primary indicators;
  • the presence of a service layer makes it possible to perform end-to-end data analysis (data lineage), to use unified data audit tools, common approaches to extracting the delta of changes, work with data quality, load management, monitoring and error diagnostics tools - and it speeds up problem resolution.
This approach to decomposition also makes the system more resistant to change (compared to a "monolithic structure") - it ensures its antifragility:
  • changes on the part of the source systems are handled in staging - in the core, only those streams that depend on the affected staging tables are modified, while the effect on the marts is minimal or absent;
  • changes in requirements on the part of consumers are handled, for the most part, in the marts (provided they do not require additional information that is not yet in the warehouse).
Next, we will go through each of the components presented above and take a look at them in a little more detail.

System core

Let's start from the middle - the core of the system, or the middle layer, labeled Core Layer. The core plays the role of data consolidation: bringing the data to uniform structures, reference data and keys. This is where the main work on data quality is carried out - cleansing, transformation, unification.

The presence of this component makes it possible to reuse the data streams that transform the primary data received from the source systems into a unified format, following common rules and algorithms, instead of re-implementing the same functionality separately for each application data mart - which, besides wasting resources, would also lead to discrepancies in the data.
The warehouse core is implemented in a data model that, in the general case, differs both from the models of the source systems and from the formats and structures of the consumers.

Warehouse Core Model and Enterprise Data Model

The main concern of the middle warehouse layer is stability. That is why the main focus here is on the data model. It is commonly referred to as the "corporate data model". Unfortunately, a kind of aura of myths and absurdities has formed around it, which sometimes leads to refusing to build it at all - and that is a mistake.

Myth 1. An enterprise data model is a huge model with thousands of entities (tables).
In reality: in any subject area, in any business domain, in the data of any company, even the most complex, there are only a few basic entities - 20 to 30.

Myth 2. There is no need to develop any "model of our own" - we buy an industry reference model and do everything according to it. We spend money, but we get a guaranteed result.
In reality: reference models can indeed be very useful, because they contain industry experience in modeling the given domain. From them you can glean ideas, approaches, and naming practices, and check the "depth of coverage" of the area so that nothing important is overlooked. But we are unlikely to be able to use such a model "out of the box", as is. This is the same myth as, say, buying an ERP (or CRM) system and implementing it without any tailoring. The value of such models is born in their adaptation to the realities of this particular business, this particular company.

Myth 3. Developing the core warehouse model can take many months, during which the project will effectively be frozen. Plus it requires an insane number of meetings and a lot of people.
In reality: the warehouse model can be developed together with the warehouse, iteratively, piece by piece. For uncovered areas, "extension points" or "stubs" are set up, and some "universal constructs" are applied. At the same time, you need to know when to stop, so that you do not end up with a super-universal thing of 4 tables into which it is hard to "put data" and (even harder) to get it out - and which is extremely suboptimal in terms of performance.

Developing the model does take time. But this is not time spent "drawing entities" - it is the time needed to analyze the subject area and understand how the data is arranged. That is why analysts are very closely involved in this process, and various business experts are engaged as well. And this is done pointwise, selectively - not by organizing meetings with an insane number of participants or sending out huge questionnaires.
Good business and systems analysis is key to building the core warehouse model. There is a lot to understand: where (in which systems) the data is generated, how it is structured, in which business processes it circulates, and so on. Quality analysis has never harmed a single system. Rather, the opposite: problems arise from the "blank spots" in our understanding.

Developing a data model is not a process of inventing something new. In fact, the data model already exists in the company, and the design process is more like an "excavation": the model is carefully extracted from the "soil" of corporate data and put into a structured form.

Myth 4. Our business is so dynamic, and everything changes so quickly, that it is useless for us to make a model - it will be outdated before we even put this part of the system into operation.
In reality: recall that the key factor for the core is stability - above all, the stability of the model's topology. Why? Because this component is central and influences everything else. Stability is a requirement for the core model too. If a model becomes outdated too quickly, it was designed incorrectly: the wrong approaches and "rules of the game" were chosen for its development. It is also a matter of quality analysis. The key entities of a corporate model change rarely.
But if, for a company that sells, say, confectionery, we decide to create "Sweets", "Cakes" and "Pies" directories instead of a "Products" directory - then when pizza appears in the list of goods, yes, a lot of new tables will have to be introduced. And that is purely a matter of approach.

Myth 5. Creating a corporate model is a very serious, complex and responsible undertaking. And it is scary to make a mistake.
In reality: although the core model should be stable, it is still not "cast in metal". Like any other design decision, its structure can be revised and modified; you just should not forget about this quality of it. But that does not mean that "you cannot even breathe on it", nor does it mean that temporary solutions and "stubs", whose rework should be planned, are unacceptable.

Myth 6. If our data source is, for example, a reference data system (or a master data management system - MDM), then, by rights, it should already correspond to the corporate model (especially if it was designed recently and has not had time to acquire "side growths", "traditions" and temporary shacks). It turns out that in this case we do not need a core model?
In reality: yes, in this case building the core model of the warehouse is greatly facilitated, since we follow a ready-made top-level conceptual model. But it is not eliminated at all. Why? Because when building the model of a particular system, its own rules apply: which types of tables to use (for each entity), how to version the data, with what granularity to keep history, which meta-attributes (technical fields) to use, and so on.

Besides, no matter how wonderful and all-embracing a reference data / MDM system we have, there will, as a rule, be nuances related to the existence of local directories "about the same thing" in other accounting systems. And this problem, whether we like it or not, will have to be solved in the warehouse - after all, this is where reporting and analytics are assembled.

Primary data layer (or historicized staging or operational layer)

In the diagram it is designated as the Primary Data Layer. The role of this component is integration with the source systems, loading and storing the primary data, and preliminary data cleansing: checking compliance with the format-and-logic control rules fixed in the "interface interaction agreement" with the source.
In addition, this component solves a problem that is very important for the warehouse - extracting the "true delta of changes", regardless of whether the source can track changes in its data and by what criterion they can be "caught". Once the data has landed in staging, the question of delta extraction is already settled for all the other layers, thanks to the marking with meta-attributes.
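Where a source cannot report its own changes, a common technique is to compare row hashes against the previously loaded state. The sketch below is a deliberately simplified, in-memory illustration of that idea (the field names and the md5-based hash are assumptions of the example, not a prescribed design):

```python
import hashlib

def row_hash(row: dict, key_fields: set) -> str:
    """Hash of all non-key attributes (md5 used for change detection, not security)."""
    payload = "|".join(f"{k}={row[k]}" for k in sorted(row) if k not in key_fields)
    return hashlib.md5(payload.encode("utf-8")).hexdigest()

def extract_delta(incoming: list, stored: dict, key_fields: set) -> list:
    """Return new and changed rows, regardless of whether the source reports changes itself."""
    delta = []
    for row in incoming:
        key = tuple(row[k] for k in sorted(key_fields))
        h = row_hash(row, key_fields)
        if stored.get(key) != h:   # unknown key or changed attributes
            delta.append(row)
            stored[key] = h
    return delta

# Stored state: key -> hash of the last loaded version of that row.
state = {("42",): row_hash({"account_id": "42", "status": "open"}, {"account_id"})}
batch = [{"account_id": "42", "status": "closed"}, {"account_id": "43", "status": "open"}]
print(extract_delta(batch, state, {"account_id"}))  # both rows form the delta
```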

The data in this layer is stored in structures as close as possible to the source system, in order to preserve the primary data in a form as close as possible to the original. Another name for this component is the "operational layer".
Why not simply use the well-established term "staging"? The point is that earlier, before the "era of big data and VLDB", disk space was very expensive, and the primary data, if preserved at all, was kept only for a limited period of time; the name "staging" often refers to a cleanable buffer.
Now technology has moved forward, and we can afford not only to store all the primary data, but to historize it with whatever granularity is feasible. This does not mean that we should not control data growth, and it does not remove the need to manage the information lifecycle, optimizing the cost of storage depending on the "temperature" of use - i.e. moving "cold", less-demanded data to cheaper media and storage platforms.

What does the presence of "historicized staging" give us:

  • the possibility of making mistakes (in structures, in transformation algorithms, in the granularity of history) - having fully historized primary data available within the warehouse, we can always reload our tables;
  • the opportunity to think - we can take our time working out a large fragment of the core in a given iteration of warehouse development, because the data will in any case already be in staging, with a common time horizon (there will be a single "point of reference" for history);
  • the possibility of analysis - we keep even data that no longer exists in the source (it could have been overwritten there, archived, etc.) - with us it remains available for analysis;
  • the possibility of an information audit - thanks to the most detailed primary information, we can later figure out how the load ran and why we ended up with these figures (for this we also need marking with meta-attributes, and the corresponding metadata the load runs on - which is handled by the service layer).
What difficulties can arise when building a "historicized staging":
  • it would be convenient to impose transactional integrity requirements on this layer, but practice shows that this is hard to achieve (meaning that in this area we do not guarantee referential integrity between parent and child tables) - integrity is reconciled on the subsequent layers;
  • this layer holds very large volumes (the largest in the warehouse, despite all the redundancy of the analytical structures) - and you need to be able to handle such volumes, both in loading and in querying (otherwise you can seriously degrade the performance of the entire warehouse).
What else is worth saying about this layer?
First, if we move away from the paradigm of "end-to-end loading processes", then the rule "the caravan moves at the speed of the last camel" no longer applies to us - or rather, we abandon the "caravan" principle and switch to the "conveyor" principle: take data from the source, put it into your layer, be ready to take the next portion. This means that
1) we do not wait for the processing to happen on other layers;
2) we are not dependent on the schedule for the provision of data by other systems.
Simply put, we schedule a loading process that takes data from one source via a specific way of connecting to it, checks it, extracts the delta - and puts the data into the target staging tables. And that's all.
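A minimal sketch of one such per-source "conveyor" step (the validation rule, the names and the in-memory staging dict are all illustrative assumptions):

```python
def passes_format_checks(row: dict) -> bool:
    """Format-and-logic control agreed in the interface contract (an illustrative rule)."""
    return row.get("account_id") is not None

def load_source(source_name: str, fetch, staging: dict) -> None:
    """Connect, validate, and land the portion into this source's staging table. Done.
    The process neither waits for downstream layers nor depends on other sources' schedules."""
    raw = fetch()
    valid = [r for r in raw if passes_format_checks(r)]
    staging.setdefault(source_name, []).extend(valid)

staging_area: dict = {}
load_source("core_banking", lambda: [{"account_id": 42}, {"account_id": None}], staging_area)
print(staging_area)  # only the valid row has landed; downstream layers run on their own schedule
```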

Second, these processes are, as you can see, very simple - trivially simple in terms of logic. This means they can be optimized and parameterized very well, reducing the load on our system and speeding up the process of connecting sources (development time).
For this to happen, you need to know the technological features of the platform this component runs on very well - and then you can make a very effective tool.

Data mart layer

The Data Mart Layer is responsible for preparing and providing data to end users - people or systems. At this level, the requirements of the consumer are taken into account as much as possible - both logical (conceptual) and physical. The service should provide exactly what is needed - no more, no less.

If the consumer is an external system, then, as a rule, it dictates the data structures it needs and the rules for collecting the information. A good approach is one in which the consumer is responsible for correctly collecting the data: the data warehouse prepares and populates the mart and provides the possibility of incremental collection (marking with meta-attributes for subsequent extraction of the delta of changes), and the consuming system then controls, and is responsible for, how it uses that mart. But there are peculiarities: when the consuming system has no active component for data collection, either an external component is needed to perform the integrating function, or the warehouse acts as an "integration platform" and ensures the correct incremental delivery of data further on - outside the warehouse. Many nuances surface here, and the rules of interface interaction must be thought through and understood by both parties (as always, where integration is concerned). As a rule, routine data cleansing/archiving is applied to such marts (it is rarely necessary to store this "transit data" for long).

The most important marts from the point of view of analytical tasks are the marts "for people" - more precisely, for the BI tools they work with.
However, there is a category of "especially advanced users" - analysts and data researchers - who need neither BI tools nor scheduled processes feeding external specialized systems. They need some kind of "common marts" and their own "sandbox", where they can create tables and transformations at their own discretion. In that case the warehouse's responsibility is to ensure these common marts are filled with data according to the schedule.
Separately we can single out such consumers as data mining tools - deep data analysis. These tools have their own data preparation requirements, and data scientists work with them too. For the warehouse, the task again comes down to supporting a service that loads certain marts in an agreed format.

However, back to the analytical marts. These are what interest warehouse designers in this data layer.
In my opinion, the best time-tested approach to designing data marts - the one almost all BI platforms are now "tuned" for - is Ralph Kimball's approach, known as dimensional modeling. There are many publications on this topic: the basic rules can be found, for example, in the book co-authored by Margy Ross, and of course in the works of the dimensional modeling guru himself. Another helpful resource is Kimball's design tips.
The dimensional approach to creating marts has been described and worked out so well - both by the "method's evangelists" and by the leading software vendors - that there is no point dwelling on it in detail here: the original source is always preferable.

I would like to put just one emphasis here. "Reporting and analytics" comes in different kinds. There is "heavy reporting" - pre-ordered reports generated as files and delivered to users through the provided delivery channels. And there are dashboards - BI panels. At their core these are web applications, and the response time expected of them is the same as for any other web application. This means the normal refresh time for a BI panel is seconds, not minutes, and it is important to keep this in mind when designing the solution. How can this be achieved? The standard optimization method: look at what the response time consists of and what we can influence. Where is the most time spent? On physical (disk) reads in the database and on transmitting data over the network. How do we reduce the amount of data read and transmitted per query? The answer is obvious and simple: either aggregate the data in advance, or apply filters to the large fact tables participating in the query, and avoid joining large tables to each other (fact tables should be reached only through dimensions).
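A toy illustration of the pre-aggregation idea - reading a few summary rows instead of scanning the detailed fact table (the fact rows and dimension names are invented for the example):

```python
from collections import defaultdict

# Hypothetical detailed fact rows; in a real mart this would be a fact table in the DBMS.
sales_fact = [
    {"date": "2024-01-01", "product": "cake", "region": "north", "amount": 120.0},
    {"date": "2024-01-01", "product": "pie",  "region": "north", "amount":  80.0},
    {"date": "2024-01-02", "product": "cake", "region": "south", "amount": 200.0},
]

def build_aggregate(fact_rows, dims):
    """Pre-aggregate the measure over a subset of dimensions so that a dashboard
    query reads a few summary rows instead of scanning the whole fact table."""
    agg = defaultdict(float)
    for row in fact_rows:
        agg[tuple(row[d] for d in dims)] += row["amount"]
    return dict(agg)

# Rollup by region only: far fewer rows than the detailed fact table.
print(build_aggregate(sales_fact, dims=("region",)))
# {('north',): 200.0, ('south',): 200.0}
```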

What is BI for? Why is it convenient? Why is the dimensional model effective?
BI allows the user to run so-called ad hoc queries. What does that mean? It means that we do not know the exact query in advance, but we do know which indicators, in which dimensions, the user may request. The user forms such a query by selecting the appropriate BI filters. And the task of the BI developer and the mart designer is to provide such application logic that the data is either filtered or aggregated, preventing the situation where too much data is requested and the application "hangs". Usually one starts with aggregated figures and then drills down into more detailed data, applying the necessary filters along the way.

It is not always enough to simply build the "right star" to get a structure convenient for BI. Sometimes you will need to apply denormalization somewhere (keeping an eye on how this affects loading), somewhere to create secondary marts and aggregates, and somewhere to add indexes or projections (depending on the DBMS).

Thus, through "trial and error", you can get a structure that is optimal for BI - which will take into account the peculiarities of both the DBMS and the BI platform, as well as the user's requirements for data presentation.
If we take the data from the "core", then such processing of storefronts will be local in nature, without affecting in any way the complex processing of the primary data obtained directly from the source systems - we only "shift" the data into a format convenient for BI. And we can afford to do this many times, in different ways, in accordance with different requirements. It is much easier and faster to do this on kernel data than to collect from the "primary" (the structure and rules of which, as we know, can also "float").

Service layer

The Service Layer is responsible for implementing common (service) functions that can be used to process data in the various warehouse layers - load management, data quality management, problem diagnostics and monitoring tools, etc.
The presence of this layer provides transparency and structure to the data flows in the warehouse.

This layer includes two data storage areas:

  • the metadata area - used by the data load control mechanism;
  • the data quality area - for implementing offline data quality checks (i.e. those not built directly into the ETL processes).
The load management process can be organized in different ways. One possible approach is this: we split the entire set of warehouse tables into modules (a module may include tables of only one layer). The tables in each module are loaded by a separate process - let's call it a control process. The control process is started on its own schedule. It orchestrates calls to atomic processes, each of which loads one target table, and also contains some common steps.
Obviously, it is easy enough to split the staging tables into modules - by source system, or rather by the connection points to them. But for the core this is already harder to do, since there we need to ensure data integrity, which means we have to take dependencies into account. That is, collisions will arise that need to be resolved - and there are different methods of resolving them.
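One possible sketch of a control process that resolves such dependencies: the module is described as a table-to-prerequisites map and the tables are loaded in topological order. The table names and module layout are hypothetical:

```python
from graphlib import TopologicalSorter

# Hypothetical core-layer module: each target table lists the core tables it depends on
# (staging prerequisites are assumed to be loaded by their own control processes).
module = {
    "core_customer":   set(),                               # no intra-module dependencies
    "core_account":    set(),
    "core_acct_owner": {"core_customer", "core_account"},   # link table needs both parents
}

def run_control_process(dependencies, load_table):
    """Control process: one atomic load per target table, parents before children."""
    for table in TopologicalSorter(dependencies).static_order():
        load_table(table)

run_control_process(module, load_table=lambda t: print(f"loading {t}"))
# core_customer and core_account are loaded first, core_acct_owner last
```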

An important point in load management is working out a consistent approach to error handling. Errors are classified by severity level. When a critical error occurs, the process must stop, and as soon as possible, because its occurrence signals a significant problem that can lead to data corruption in the warehouse. So load management is not only about starting processes, but also about stopping them - and about preventing untimely (erroneous) starts.
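A minimal sketch of such severity classification (the levels and the reaction policy are assumptions of the example; the actual rules are agreed per warehouse):

```python
from enum import Enum

class Severity(Enum):
    WARNING = 1   # log and continue
    ERROR = 2     # fail the current table, continue the rest of the module
    CRITICAL = 3  # stop the whole control process to avoid corrupting the warehouse

def handle(severity: Severity, table: str) -> None:
    """One possible error-handling policy for a load process."""
    if severity is Severity.CRITICAL:
        raise RuntimeError(f"critical failure while loading {table}: stopping all loads")
    print(f"{severity.name} on {table}: continuing")

handle(Severity.WARNING, "stg_accounts")   # WARNING on stg_accounts: continuing
```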

A special metadata structure is created for the operation of the service layer. This area stores information about the loading processes, the loaded data sets, the checkpoints used to maintain increments (which process has read up to which point), and other service information necessary for the functioning of the system.
It is important to note that all target tables in all layers are marked with a special set of meta-fields, one of which is the identifier of the process that updated the given row. For tables inside the warehouse, this process marking enables a uniform way of subsequently extracting the delta of changes. When loading data into the primary data layer the situation is more complicated: the delta extraction algorithm may differ for different loaded objects. On the other hand, the logic of processing the accepted changes and applying them to the target tables of the core and the marts is much more complex than for staging, where everything is fairly trivial and easy to parameterize into reusable standard steps (procedures).
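A simplified sketch of checkpoint-based increments kept in the metadata area (the watermark field and table names are illustrative, and a real implementation would persist the checkpoints in the warehouse rather than in a Python dict):

```python
# Metadata kept by the service layer: which process has read which source up to which point.
checkpoints = {}  # (process_name, source) -> last processed watermark

def read_increment(process_name, source, rows, watermark_field="meta_load_ts"):
    """Return only rows past the stored checkpoint and advance the checkpoint."""
    last = checkpoints.get((process_name, source))
    fresh = [r for r in rows if last is None or r[watermark_field] > last]
    if fresh:
        checkpoints[(process_name, source)] = max(r[watermark_field] for r in fresh)
    return fresh

rows = [{"id": 1, "meta_load_ts": "2024-01-01T10:00"},
        {"id": 2, "meta_load_ts": "2024-01-01T11:00"}]
print(read_increment("load_core_account", "staging.accounts", rows))  # both rows
print(read_increment("load_core_account", "staging.accounts", rows))  # [] - already processed
```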

I am not trying to cover this topic - organizing the loading - in full here; I am only highlighting the points worth paying attention to.
The approach described is just one option. It is quite responsive, and its "conceptual prototype" was the Toyota conveyor and the just-in-time system. That is, we move away from the widespread paradigm of an exclusively "nightly data load" and load in small portions during the day, as soon as the data is ready in the various sources: whatever has arrived gets loaded. At the same time we have many parallel processes running, and the "hot tail" of fresh data will constantly "flicker" and then even out over time. We must take this feature into account and, if necessary, build separate marts with "snapshots", where everything is already consistent. In other words, it is impossible to achieve both timeliness and consistency (integrity) at the same time; we need a balance - in some places one matters more, in others the other.

It is imperative to provide logging and monitoring facilities. A good practice is to use typed events, for which you can set different parameters and configure a notification system by subscribing to specific events. This matters because, when the system administrator's intervention is required, he should learn about it as early as possible and receive all the necessary diagnostic information. The logs can also be used to analyze problems after the fact and to investigate incidents, including data quality incidents. A sketch of such typed events is given below.
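A minimal sketch of typed load events with subscriptions, assuming made-up event types and handler names; a real monitoring stack would replace the print-based notification.

    import logging
    from dataclasses import dataclass, field
    from enum import Enum
    from typing import Callable, Dict, List

    class EventType(Enum):
        PROCESS_STARTED = "process_started"
        PROCESS_FINISHED = "process_finished"
        CRITICAL_ERROR = "critical_error"
        DQ_CHECK_FAILED = "dq_check_failed"

    @dataclass
    class Event:
        type: EventType
        process_name: str
        details: dict = field(default_factory=dict)

    _subscribers: Dict[EventType, List[Callable[[Event], None]]] = {}

    def subscribe(event_type: EventType, handler: Callable[[Event], None]) -> None:
        _subscribers.setdefault(event_type, []).append(handler)

    def publish(event: Event) -> None:
        # every event goes to the log; subscribers get only what they asked for
        logging.info("event=%s process=%s details=%s",
                     event.type.value, event.process_name, event.details)
        for handler in _subscribers.get(event.type, []):
            handler(event)

    # The administrator subscribes only to what requires intervention:
    subscribe(EventType.CRITICAL_ERROR,
              lambda e: print(f"NOTIFY admin: {e.process_name} failed: {e.details}"))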

Designing and Maintaining Warehouse Data Models

Why is it important to pay attention to the design of data models when developing any system where a database is involved (and especially in a warehouse)? Why not just throw a set of tables anywhere - even in a text editor? Why do we need "these pictures"?
Oddly enough, even experienced developers ask such questions.
Actually, yes, nothing prevents you from sketching tables and starting to use them - if the developer keeps in his head (!) a coherent overall picture of the structure he is building. But what if there are several developers? What if someone else uses these tables? And what if time passes, a person leaves this area, and then comes back to it?

Can you figure it out without a model? In principle, you can: figure it out, "sketch the pictures on a piece of paper", and tidy up the data. But it is much easier, clearer and faster to use a ready-made artifact - the data model - and to understand the "logic of its structure", i.e. to have common rules of the game.

And the most important thing is not even that. The most important thing is that when designing a model we are forced (simply without any alternative!) to study the subject area more closely and deeply: how the data is organized and how it is used in various business cases. The questions that we would easily have "pushed aside" as complex or "blurry", had we just thrown tables together without trying to design a precise model, we are forced to pose and resolve now, during analysis and design - not later, when we are building reports and trying to "reconcile the irreconcilable" and "reinvent the wheel" every time.

This approach is one of those engineering practices that allows you to create antifragile systems. Since they are clearly arranged, transparent, convenient for development, and also their "fragility limits" are immediately visible - you can more accurately estimate the "scale of disaster" when new requirements appear and the time required for redesign (if needed).
Thus, the data model is one of the main artifacts that must be maintained during the development of the system. Ideally, it should be "on the table" of every analyst, developer, etc. - everyone who participates in system development projects.

Designing data models is a large and separate topic. There are two main approaches to warehouse design.
For the core, the entity-relationship approach works well: a normalized (3NF) model is built on the basis of studying the subject area, or rather its selected part. The same "corporate model" that was discussed above comes into play here.

For designing data marts, the multidimensional model is suitable. This approach fits well with how business users think, because it is a model that is simple and convenient for human perception: people operate with understandable and familiar concepts of metrics (indicators) and the dimensions by which they are analyzed. This also makes it easy to structure the requirements-gathering process: we draw a set of "matrices of dimensions and indicators" while talking to representatives of the various departments, and then bring it all into one structure - the "analysis model": we form the "dimension bus" and define the facts that live on it. Along the way we work out hierarchies and aggregation rules. A minimal sketch of such a matrix is given below.
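A minimal sketch of a "dimensions and indicators" matrix and of deriving the shared (conformed) dimensions from it. The dimension and fact names are invented for illustration.

    # Which dimensions each set of indicators (fact) is defined on.
    DIMENSIONS = ["Date", "Product", "Customer", "Branch"]

    FACT_MATRIX = {
        "Sales":        {"measures": ["amount", "quantity"],
                         "dimensions": ["Date", "Product", "Customer", "Branch"]},
        "StockBalance": {"measures": ["quantity_on_hand"],
                         "dimensions": ["Date", "Product", "Branch"]},
    }

    def conformed_dimensions(facts: dict) -> set:
        """Dimensions shared by all facts - candidates for the common 'bus'."""
        sets = [set(f["dimensions"]) for f in facts.values()]
        return set.intersection(*sets)

    print(conformed_dimensions(FACT_MATRIX))  # -> {'Date', 'Product', 'Branch'} (in some order)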

Then it is easy to move on to the physical model, adding optimization elements that take the peculiarities of the DBMS into account. For Oracle that would be partitioning, a set of indexes, and so on. For Vertica other techniques are used: sorting, segmentation, partitioning.
Special denormalization may also be required - when we deliberately introduce redundancy into the data to improve query performance, at the cost of complicating data updates (since the redundancy has to be taken into account and maintained during loading). To improve performance we may also have to create additional aggregate tables, or use extra DBMS features such as projections in Vertica. An example of such an aggregate is sketched below.
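A hedged sketch of one such physical optimization: a pre-computed daily aggregate on top of a detailed fact table. The schema and table names are made up, and the DDL is kept generic; the exact syntax (and whether a projection or materialized view is a better fit) depends on the target DBMS.

    DAILY_SALES_AGG = """
    CREATE TABLE mart.sales_daily_agg AS
    SELECT d.calendar_date,
           p.product_id,
           SUM(f.amount)   AS amount,
           SUM(f.quantity) AS quantity
    FROM   mart.fact_sales f
    JOIN   mart.dim_date d    ON d.date_id    = f.date_id
    JOIN   mart.dim_product p ON p.product_id = f.product_id
    GROUP BY d.calendar_date, p.product_id;
    """
    # The price of this redundancy: the aggregate must be refreshed whenever
    # mart.fact_sales receives new or corrected rows.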

So, when modeling warehouse data, we actually solve several problems:

  • the task of building a conceptual (logical) model of the core - system and business analysis: researching the subject area, going into the details and taking into account the nuances of "live data" and how they are used in the business;
  • the task of building an analysis model - and then a conceptual (logical) model of the data marts;
  • the task of building physical models - data redundancy management, optimization taking into account the peculiarities of the DBMS for queries and data loading.
When developing conceptual models, we may not take into account the peculiarities of a particular DBMS, for which we are designing a database structure. Moreover, we can use one conceptual model to create several physical ones - for different DBMS.

Let's summarize.

  • A data model is not a collection of "pretty pictures", and the process of designing it is not a process of drawing them. The model reflects our understanding of the subject area, and compiling it is a process of studying and researching that area. This time is not wasted - and it is not spent on "drawing and painting" at all.
  • A data model is a design artifact, a way of exchanging information in a structured way between team members. To do this, it must be clear to everyone (this is provided by notation and explanation) and available (published).
  • The data model is not created once and frozen, but is created and developed in the process of system development. We set the rules for its development ourselves. And we can change them if we see - how to do it better, easier, more efficiently.
  • The data model (physical) allows you to consolidate and leverage a set of best practices aimed at optimization - i.e. use the techniques that have already worked for this DBMS.

Features of data warehouse projects


Let us dwell on the specifics of the projects within which a company builds and develops data warehouses, and look at them from the point of view of the architectural aspect. Why is it important to build an architecture for such projects, and from the very beginning? It is the presence of a well-thought-out architecture that gives a data warehouse project flexibility, allows work to be distributed efficiently among performers, and makes the result easier to predict and the process more predictable.

Data warehouse is custom software

A data warehouse is always a "custom development", not a boxed solution. Yes, there are industry-specific BI applications that include a reference data model, pre-configured ETL processes from common sources (for example, ERP systems), and a set of standard BI dashboards and reports. But in practice a warehouse is rarely implemented as a "box". I have been working with warehouses for about 10 years and have never seen such a story. There are always nuances related to the unique features of the company - both the business and the IT landscape. Therefore, hoping that the architecture will be provided by the "vendor" supplying the solution is somewhat reckless. The architecture of such systems often "matures" within the organization itself, or is shaped by the specialists of the contractor company that is the main executor of the project.

Data warehouse is an integration project

The data warehouse loads and processes information from many source systems, and in order to maintain "friendly relations" with them you need to treat them with great care. In particular, you need to minimize the load on the source systems, take their "availability and unavailability" windows into account, choose interaction interfaces that respect their architecture, and so on. Then the warehouse will be able to pick up data as early as possible and with the required frequency. Otherwise you will be "moved over" to a backup instance, which is not refreshed as frequently as the operational one.
In addition, it is necessary to take into account the "human factor". Integration is not only about the interaction of machines. It is also communication between people.

Data Warehouse is a Collaborative Project


In a large company, such a system can rarely be done by just one team. As a rule, several teams work here, each of which solves a specific problem.

The architecture should provide the ability to organize their parallel work, while maintaining its integrity and avoiding duplication of the same functionality in different places, by different people. In addition to unnecessary effort, such duplication can lead to data discrepancies later.

In addition, when so many people and teams, often geographically scattered, are involved in developing the system, the question inevitably arises of how to build communication and information exchange between them. The more standard and understandable the approaches and practices used, the easier, more convenient and more efficient it is to organize such work. Among other things, it is worth thinking about the composition of the "working artifacts", among which artifact number one for data warehouses is the data model (see the previous section).

The data warehouse has a longer life span than other systems

To clarify - the statement is true for a "live", working storage, integrated with key sources, possessing historical data and providing information and analytical services to many divisions of the company.

What grounds do I have for believing so?
Firstly, building a storage is a very resource-intensive process: in addition to the actual costs of equipment, licenses for the necessary technological software and development, almost all systems and divisions of the company are also involved in this. Repeating this entire process from scratch one more time is a very daring idea.

Secondly, if the warehouse has the right architecture, it can quite easily survive changes in the source systems, the appearance of new requirements from end users, and the growth of data volumes.
If the architecture is right and the information flows are transparent, then such a system can be developed for a long time without the risk of ending up in a situation where making a change is paralyzed by the difficulty of assessing its impact.

Gradual iterative development

The last thing that the Customer would like, getting involved in the story with the repository, is to freeze their requirements for a year or two, until a complete corporate data model is designed, all sources are fully connected, etc.

In the eyes of Customers, the data warehouse often looks like an absolute monster - the tasks, goals and development horizon of the system are so voluminous. And often the customer is afraid that "at the expense of his budget" the IT department will solve some "their problems." And again we are faced with the issue of interaction between people and the ability to calmly state our position and negotiate.

Competent architectural approaches allow you to develop the system iteratively, increasing the functionality gradually, without going into "development" for several years before starting to give a result.

Although it should be noted that "miracles do not happen" - the "start-up" also takes time. For warehouses it can be quite long, since these are large volumes of data and historical data from old periods, when the rules for processing information could differ from the current ones. Therefore, considerable time is needed for analytical work, for interaction with the source systems, and for a certain amount of "trial and error", including load tests on real data.

Data warehouses - "multi-project story"

It is difficult to single out one business customer for a data warehouse, and it is believed (not without reason) that the key success factor for a warehouse-building project is the support of the company's management - directly the first person.
A repository is rarely built and developed as part of a single project. Typically, there are different needs for data consolidation and analytics, behind them are different customers and user groups. Therefore, the repository is often developed within the framework of several parallel projects.

Balance of innovation and proven solutions

The topic of data warehouses is quite "ancient" (if such a word is applicable to such a young industry as IT) and rather conservative. Nevertheless, progress does not stand still, and the limitations that previously existed because of expensive and slow disks, expensive memory, etc. have now been removed. At the same time, the time has come to revisit some architectural approaches. Moreover, this applies both to technological platforms and to the architecture of the application systems built on top of them.

It is important to strike a balance here and maintain a fairly frugal attitude both to resources and to the stored information. Otherwise you can very quickly turn the warehouse into a semi-structured "dump" that, if it can be sorted out at all, will take a great deal of effort.
Yes, we have more opportunities now, but that does not mean we should deny all the accumulated and time-tested practices - which we know how and why to use - and throw ourselves headlong after the foggy ghost of "innovation".
Keeping a balance means using new methods and approaches where they open up new opportunities, but at the same time using old proven ones - to solve urgent problems that have not been canceled.
What can we do as developers and designers of application solutions? First of all, to know and understand the technological changes of the platforms on which we work, their capabilities, features and application boundaries.

Let's look at the DBMS as the most critical and important technological platform for storage.
Recently, there has been a clear drift of relational databases, created initially as "universal", towards specialization. For a long time, leading vendors have been releasing various options - for applications of different classes (OLTP, DSS & DWH). In addition, additional opportunities appear for working with text, geo-data, etc.

But it did not end there - products began to appear that were focused on a certain class of tasks from the start, i.e. specialized DBMSs. They may or may not use the relational model. What matters is that they are initially "tailored" not just to storing and processing "business information" in general, but to specific tasks.

Apparently, centralization and specialization are two complementary trends that periodically replace each other, ensuring development and balance, just like evolutionary (gradual) development and cardinal change. For example, in the 1990s Michael Stonebraker was one of the authors of the Third-Generation Database System Manifesto, which clearly expressed the idea that the world does not need another revolution in the database world. However, 10 years later he published works announcing the prerequisites for a new era in the DBMS world - based on specialization.
He focuses on the fact that common general-purpose DBMSs are built on a "one-size-fits-all" architecture that takes into account neither the changes in hardware platforms nor the division of applications into classes for which a more optimal solution can be devised than by implementing universal requirements.
And he began to develop a number of projects in line with this idea. One of them, C-Store, is a columnar DBMS designed in a shared-nothing (SN) architecture, created specifically for data-warehouse-class systems. This product was later commercialized as HP Vertica.

It seems that the development of data warehouses has now moved to a new stage. New technologies, approaches and tools are appearing. Studying, testing and applying them intelligently allows us to create really interesting and useful solutions, bring them to production, and enjoy the fact that our developments are used in real work and bring benefit.

Epilogue

In preparing this article I tried to focus primarily on architects, analysts and developers who work directly with data warehouses. But it turned out that the article inevitably "took the topic a little wider", and other categories of readers came into view. Some points will seem controversial, some unclear, some obvious. People are different - with different experience, background and positions.
For example, typical managerial questions such as "when to hire architects?" or "when to do architecture?" sound rather strange to us (developers, designers), because for us the architecture of a system appears with its birth, whether we are aware of it or not. And even if there is no formal architect role in a project, a sane developer always "switches on his own internal architect".

By and large, it doesn't matter who exactly performs the role of the architect - it is important that someone asks similar questions and investigates the answers. If the architect is clearly singled out, this only means that he is primarily responsible for the system and its development.
Why did I find the topic of "antifragility" relevant to this subject?

"The uniqueness of antifragility is that it allows us to work with the unknown, to do something in conditions when we do not understand what exactly we are doing, and to achieve success."/ Nassim N. Talb /
Therefore, the crisis and a high degree of uncertainty are not an excuse in favor of the absence of architecture, but factors that reinforce its need.

Zaitsev S.L., Ph.D.

Repeating groups

Repeating groups are attributes for which a single instance of an entity can have more than one value. For example, a person can have more than one skill. If, in terms of business requirements, we need to know the skill level for each skill, and each person can have only two skills, we can create the entity shown in Fig. 1.6: the entity PERSON with two attributes for storing a skill and the skill level for each.

Fig. 1.6. This example uses repeating groups.

The problem with repeating groups is that we cannot know exactly how many skills a person might have. In real life some people have one skill, some have several, and some have none yet. Figure 1.7 shows the model reduced to first normal form. Note the added Skill ID attribute, which uniquely identifies each SKILL.

Fig. 1.7. The model reduced to first normal form.

One fact in one place

If the same attribute is present in more than one entity and is not a foreign key, then this attribute is considered redundant. The logical model should not contain redundant data.

Redundancy requires additional space, but while memory efficiency is important, the real problem lies elsewhere. Ensuring that redundant data is synchronized is overhead, and you always run the risk of conflicting values.

In the previous example, SKILL depends on both Person ID and Skill ID. This means that a SKILL will not appear until there is a PERSON possessing that skill. It also makes it difficult to change a Skill Name: you have to find every entry with that Skill Name and change it for every PERSON who has the skill.

Figure 1.8 shows the model in second normal form. Note that the SKILL entity has been added, and the Skill Name attribute has been moved to it. The skill level remains, accordingly, at the intersection of PERSON and SKILL.

Fig. 1.8. In second normal form, the repeating group is moved to another entity. This provides the flexibility to add the required number of skills and to change the Skill Name or Skill Description in one place.

Each attribute depends on the key

Each attribute of an entity must depend on the primary key of that entity. In the previous example, School Name and Geographic Area are present in the PERSON table but do not describe the person (they depend on the school, not on the person's key). To achieve third normal form, these attributes must be moved to an entity where they will depend on the key. Figure 1.9 shows the model in third normal form.

Fig. 1.9. In third normal form, School Name and Geographic Area are moved to an entity where their values depend on the key.

Many-to-many relationships

Many-to-many relationships reflect the reality of the surrounding world. Note that in Figure 1.9 there is a many-to-many relationship between PERSON and SCHOOL: the relationship accurately reflects the fact that a PERSON can study in many SCHOOLS and a SCHOOL can teach many PERSONS. To achieve fourth normal form, an associative entity is created that eliminates the many-to-many relationship by generating a separate entry for each unique combination of school and person. Figure 1.10 shows the model in fourth normal form.

Fig. 1.10. In fourth normal form, the many-to-many relationship between PERSON and SCHOOL is resolved by introducing an associative entity, in which a separate entry is allocated for each unique combination of SCHOOL and PERSON.
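For readers who prefer to see the result as table definitions, here is a sketch of the structures from Figures 1.7-1.10 in generic DDL. The column names only approximate the attributes in the figures and are not part of the original text.

    NORMALIZED_DDL = """
    CREATE TABLE person (
        person_id   INTEGER PRIMARY KEY,
        person_name VARCHAR(100)
    );

    CREATE TABLE skill (                       -- 2NF: one fact in one place
        skill_id    INTEGER PRIMARY KEY,
        skill_name  VARCHAR(100)
    );

    CREATE TABLE person_skill (                -- skill level at the PERSON/SKILL intersection
        person_id   INTEGER REFERENCES person(person_id),
        skill_id    INTEGER REFERENCES skill(skill_id),
        skill_level INTEGER,
        PRIMARY KEY (person_id, skill_id)
    );

    CREATE TABLE school (                      -- 3NF: school attributes depend on the school key
        school_id       INTEGER PRIMARY KEY,
        school_name     VARCHAR(100),
        geographic_area VARCHAR(100)
    );

    CREATE TABLE person_school (               -- 4NF: associative entity for the M:N relationship
        person_id   INTEGER REFERENCES person(person_id),
        school_id   INTEGER REFERENCES school(school_id),
        PRIMARY KEY (person_id, school_id)
    );
    """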

Formal definitions of normal forms

The following definitions of normal forms can seem daunting. Think of them simply as formulas for achieving normalization. Normal forms are based on relational algebra and can be interpreted as mathematical transformations. Although this book is not devoted to a detailed discussion of normal forms, modelers are encouraged to take a deeper look at the subject.

In a given relation R, attribute Y is functionally dependent on attribute X - in symbolic form R.X -> R.Y (read "R.X functionally determines R.Y") - if and only if each value of X in R is associated with exactly one value of Y in R (at any given time). Attributes X and Y may be composite (Date C.J. An Introduction to Database Systems. 6th edition. Williams: 1999, 848 pp.).

A relation R is in first normal form (1NF) if and only if all domains belonging to it contain only atomic values (Date, ibid.).

A relation R is in second normal form (2NF) if and only if it is in 1NF and every non-key attribute is fully dependent on the primary key (Date, ibid.).

A relation R is in third normal form (3NF) if and only if it is in 2NF and no non-key attribute is transitively dependent on the primary key (Date, ibid.).

A relation R is in Boyce-Codd normal form (BCNF) if and only if every determinant is a candidate key.

NOTE Below is a brief explanation of some of the abbreviations used in Date's definitions.

MVD (multi-valued dependency) is a multi-valued dependency. Used only for entities with three or more attributes. In a multivalued dependency, the value of the attribute depends only on a portion of the primary key.

FD (functional dependency) - functional dependency. With functional dependency, the value of an attribute depends on the value of another attribute that is not part of the primary key.

JD (join dependency) - join dependency. With a join dependency, the primary key of the parent entity is traced down to at least third-level descendants while retaining the ability to be joined on the original key.

A relation R is in fourth normal form (4NF) if and only if, whenever an MVD exists in R, say A →→ B, all attributes of R are also functionally dependent on A. In other words, the only dependencies (FD or MVD) in R are of the form K → X (i.e. a functional dependency of attribute X on a candidate key K). Equivalently, R is in 4NF if it is in BCNF and all MVDs are in fact FDs (Date, ibid.).

For fifth normal form: a relation R satisfies the join dependency (JD) *(X, Y, ..., Z) if and only if R is equivalent to the join of its projections onto X, Y, ..., Z, where X, Y, ..., Z are subsets of the set of attributes of R.
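As a convenience, here is a compact restatement of the definitions above in standard notation. This summary is mine, not part of Date's text, and it omits the formal subtleties (trivial dependencies, candidate versus primary keys).

    % Functional and multi-valued dependencies, and the chain of normal forms
    \begin{align*}
    R.X \rightarrow R.Y &\iff \text{each value of } X \text{ in } R
          \text{ is associated with exactly one value of } Y \text{ in } R\\
    X \twoheadrightarrow Y &: \text{multi-valued dependency (the set of } Y
          \text{ values associated with } X \text{ does not depend on the other attributes)}\\
    \text{2NF} &: \text{1NF and every non-key attribute is fully dependent on the key}\\
    \text{3NF} &: \text{2NF and no non-key attribute is transitively dependent on the key}\\
    \text{BCNF} &: \text{every determinant is a candidate key}\\
    \text{4NF} &: \text{BCNF and every non-trivial MVD is in fact an FD}
    \end{align*}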

There are many other normal forms for complex data types and specific situations that are beyond the scope of this discussion. Any model development enthusiast would like to learn other normal forms as well.

Business normal forms

In his book (An Introduction to Information Engineering: From Strategic Planning to Information Systems. Reading, Massachusetts: Addison-Wesley, 1989), Clive Finkelstein took a different approach to normalization. He defines business normal forms in terms of the transformations needed to bring a model to those forms. Many modelers find this approach more intuitive and more pragmatic.

The first business normal form (1BNF) moves repeating groups into a separate entity. That entity gets its own name, and its primary (composite) key is formed from the primary key of the original entity and the repeating group.

The second business normal form (2BNF) moves attributes that are partially dependent on the primary key into a separate entity. The primary (composite) key of this entity is the primary key of the entity in which they were originally located, together with the additional keys on which the attribute fully depends.

The third business normal form (3BNF) moves attributes that do not depend on the primary key into a separate entity, where they are completely dependent on the primary key of that entity.

The fourth business normal form (4BNF) moves attributes that depend on the value of the primary key or are optional into a secondary entity, where they depend entirely on the value of the primary key, or where they must (necessarily) be present in that entity.

The fifth business normal form (5BNF) appears as a structural entity if there is a recursive or other dependency between instances of a secondary entity, or if a recursive dependency exists between instances of its primary entity.

Completed Logical Data Model

The completed logical model must satisfy the requirements of the third business normal form and include all the entities, attributes, and relationships necessary to support the data requirements and business rules associated with the data.

All entities must have names that describe their content and have a clear, concise, complete description or definition. A future post will cover an initial set of guidelines for the correct formation of entity names and descriptions.

Entities must have a complete set of attributes, so that every fact about each entity can be represented by its attributes. Each attribute must have a name that reflects its meaning, a logical data type, and a clear, short, complete description or definition. In a future post we will look at an initial set of guidelines for properly forming attribute names and descriptions.

Relationships should include a verb phrase that describes the relationship between the entities, along with characteristics such as cardinality and the necessity of existence or possibility of absence of the relationship.

NOTE The cardinality of a relationship describes the maximum number of secondary entity instances that can be associated with an instance of the original entity. The necessity of existence or possibility of absence of a relationship determines the minimum number of instances of a secondary entity that can be associated with an instance of the original entity.

Physical data model

Once you have created a complete and adequate logical model, you are ready to make the decision to choose an implementation platform. The choice of platform depends on the requirements for the use of data and the strategic principles of shaping the architecture of the corporation. Choosing a platform is a complex issue beyond the scope of this book.

In ERwin, a physical model is a graphical representation of a real-world database. The physical database will be made up of tables, columns, and relationships. The physical model depends on the platform chosen for implementation and the requirements for using the data. The physical model for IMS will be very different from that for Sybase. The physical model for OLAP reports will look different from the model for OLTP (online transaction processing).

The data modeler and database administrator (DBA) use the logical model, usage requirements, and corporate architecture policy to develop a physical data model. You can denormalize the physical model to improve performance, and create views to support usage requirements. The following sections detail the process of denormalizing and creating views.

This section provides an overview of the process of building a physical model, collecting data usage requirements, defining the components of a physical model, and providing reverse engineering. In the following publications, these issues are covered in more detail.

Collecting data usage requirements

You typically collect data usage requirements early on, during interviews and work sessions. The requirements should describe the user's use of the data as fully as possible. A superficial attitude and gaps in the physical model can lead to unplanned costs and delays in project implementation. Usage requirements include:

    Access and performance requirements

    Volumetric characteristics (an estimate of the amount of data to be stored), which allow the administrator to estimate the physical volume of the database

    Estimating the number of users who need concurrent access to data to help you design your database for acceptable performance levels

    Aggregates, pivots, and other calculated or derived data that can be considered candidates for storage in persistent data structures

    Requirements for reporting and standard queries to help the database administrator build indexes

    Views (persistent or virtual) that will assist the user when performing data aggregation or filtering operations.

In addition to the chairman, secretary, and users, the modeler, database administrator, and database architect must participate in the usage requirements session. The user's historical data requirements should be discussed. The length of time that data is retained has a significant impact on the size of the database. Often, older data is stored in a generalized form, and atomic data is archived or deleted.

Users should bring examples of queries and reports with them to the session. Reports must be strictly defined and must include the atomic values used for any total and summary fields.

Physical data model components

The components of a physical data model are tables, columns, and relationships. The entities of the logical model are likely to become tables in the physical model. Logical attributes become columns. Logical relationships become referential integrity constraints. Some logical relationships cannot be implemented in a physical database.

Reverse engineering

When a logical model is not available, it becomes necessary to recreate the model from the existing database. In ERwin this process is called reverse engineering. It can be done in several ways. The modeler can explore the data structures in the database and recreate the tables in a visual modeling environment. You can import data definition language (DDL) scripts into a tool that supports reverse engineering (such as ERwin). Advanced tools such as ERwin include functions that connect via ODBC to an existing database and create a model by directly reading the data structures. Reverse engineering with ERwin will be discussed in detail in a future post.
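To show what "directly reading data structures" amounts to, here is a hedged sketch of manual reverse engineering through a generic DB-API connection. It assumes a DBMS that exposes information_schema (for example PostgreSQL via psycopg2); the connection setup is omitted, and a modeling tool does essentially the same thing through ODBC/JDBC before drawing the recovered tables as a model.

    from collections import defaultdict

    def read_table_structures(conn, schema: str = "public") -> dict:
        """Return {table_name: [(column_name, data_type), ...]} for one schema."""
        query = """
            SELECT table_name, column_name, data_type
            FROM   information_schema.columns
            WHERE  table_schema = %s
            ORDER  BY table_name, ordinal_position
        """
        tables: dict = defaultdict(list)
        with conn.cursor() as cur:
            cur.execute(query, (schema,))
            for table_name, column_name, data_type in cur.fetchall():
                tables[table_name].append((column_name, data_type))
        return dict(tables)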

Using corporate functional boundaries

When building a logical model, it is important for the modeler to ensure that the new model is consistent with the corporate model. Using corporate functional boundaries means modeling data in the terms used within the corporation. The way data is used in a corporation changes faster than the data itself. In each logical model, data must be represented in a holistic way, regardless of the business domain it supports. Entities, attributes, and relationships must define business rules at the corporation level.

NOTE Some of my colleagues refer to these corporate functional boundaries as real-world modeling. Real-world modeling encourages the modeler to view information in terms of its actually inherent connections and relationships.

The use of corporate functional boundaries for a data model that is constructed appropriately provides the basis for supporting the information needs of any number of processes and applications, which enables the corporation to more efficiently exploit one of its most valuable assets - information.

What is an Enterprise Data Model?

Enterprise data model (EDM) contains entities, attributes, and relationships that represent the information needs of a corporation. EDM is usually categorized according to subject areas, which represent groups of entities related to supporting specific business needs. Some subject areas may cover specific business functions such as contract management, while others may include entities that describe products or services.

Each logical model must correspond to an existing subject area of the corporate data model. If the logical model does not meet this requirement, a subject area model must be added to the corporate model. This comparison ensures that the corporate model is improved or adjusted and that all logical modeling efforts are coordinated within the corporation.

EDM also includes specific entities that define the scope of values ​​for key attributes. These entities have no parents and are defined as independent. Independent entities are often used to maintain the integrity of relationships. These entities are identified by several different names such as code tables, reference tables, type tables, or classification tables. We will use the term "corporate business object". An enterprise business object is an entity that contains a set of attribute values ​​that are independent of any other entity. Corporate business objects should be used consistently within a corporation.

Building a corporate data model by augmenting

There are organizations where the corporate model has been built from start to finish in a single concerted effort. On the other hand, most organizations arrive at fairly complete corporate models by building them up gradually.

Building up means building something sequentially, layer by layer, just like an oyster grows a pearl. Each data model created provides a contribution to the formation of the EDM. Building an EDM in this way requires additional modeling steps to add new data structures and domains or augment existing data structures. This makes it possible to build an enterprise data model by augmenting, iteratively adding levels of detail and refinement.

Modeling methodology concept

There are several visual data modeling methodologies. ERwin supports two:

    IDEF1X (Integration Definition for Information Modeling - an integrated description of information models).

    IE (Information Engineering).

IDEF1X is a good methodology, and the use of its notation is widespread.

Integrated description of information models

IDEF1X is a highly structured data modeling methodology that extends the IDEF1 methodology adopted as a FIPS (Federal Information Processing Standards) standard. IDEF1X uses a highly structured set of modeling construct types and results in a data model that requires an understanding of the physical nature of the data before such information can be made available.

The rigid structure of IDEF1X forces the modeler to assign characteristics to entities that may not correspond to the realities of the surrounding world. For example, IDEF1X requires all entity subtypes to be exclusive. As a result, a person cannot be both a client and an employee, although real-world practice tells us otherwise.

Information engineering

Clive Finkelstein is often referred to as the father of information engineering, although similar concepts were also developed by James Martin (Martin, James. Managing the Database Environment. Upper Saddle River, New Jersey: Prentice Hall, 1983). Information engineering uses a business-driven approach to information management and a different notation to represent business rules. IE serves as an extension and development of the notation and core concepts of the ER methodology proposed by Peter Chen.

IE provides the infrastructure to support information requirements by integrating corporate strategic planning with information systems that are being developed. This integration allows the management of information resources to be more closely aligned with the long-term strategic prospects of the corporation. This business-driven approach has led many modelers to choose IE over other methodologies that tend to focus on short-term development challenges.

IE proposes a sequence of actions that leads a corporation to identify all of its information needs for collecting and managing data and identifying relationships between information objects. As a result, information requirements are clearly articulated based on management directives and can be directly translated into a management information system that will support strategic information needs.

Conclusion

Understanding how to use a data modeling tool like ERwin is only part of the problem. In addition, you must understand when data modeling tasks arise and how the information requirements and business rules that should be represented in the data model are collected. Work sessions involving domain experts, users, and information technology professionals provide the most favorable environment for collecting information requirements.

Building a good data model requires analyzing and researching the information requirements and business rules collected through work sessions and interviews. The resulting data model should be compared with the enterprise model, if possible, to ensure that it does not conflict with existing object models and includes all the required objects.

The data model consists of logical and physical models that represent the information requirements and business rules. The logical model should be reduced to third normal form. Third normal form limits insert, update, and delete anomalies in the data structures in order to support the "one fact in one place" principle. The collected information requirements and business rules should be analyzed, researched, and compared to the enterprise model to ensure they do not conflict with existing object models and include all the required objects.

In ERwin, the data model includes both logical and physical models. ERwin implements the ER approach and allows you to create logical and physical model objects to represent information requirements and business rules. Logical model objects include entities, attributes, and relationships. Physical model objects include tables, columns, and constraints on the integrity of relationships.

One of the following publications will cover the issues of identifying entities, defining entity types, choosing entity names and descriptions, as well as some techniques to avoid the most common modeling errors associated with the use of entities.



  • 1. Relational data model
    • 1.1 The relational data model. Basic definitions
    • 1.2 Operations on relations
  • 2. Corporate information systems
  • Bibliography

1. Relational data model

1.1 The relational data model. Basic definitions

In mathematical disciplines, the concept of "table" corresponds to the concept of a relation. A table reflects an object of the real world - an entity - and each of its rows reflects a specific instance of that entity. Each column has a name that is unique within the table. Rows have no names, their order is not defined, and their number is logically unlimited. One of the main advantages of the relational data model is homogeneity (each row of a table has the same format). It is up to the user to decide whether the corresponding entities are homogeneous; this settles the question of the model's suitability.

Basic concepts:

* A relation is a two-dimensional table containing some data.

* Entity - an object of any nature, data about which is stored in the database. Attributes are properties that characterize an entity (columns).

* The degree of a relation is the number of its columns.

* A relation schema is a list of attribute names, for example, EMPLOYEE (No., full name, year of birth, position, department).

* A domain is the set of admissible values of an attribute of a relation (a data type).

* A tuple is a table row.

* Cardinality (cardinality) - the number of rows in the table.

* A primary key is an attribute that uniquely identifies the rows of a relation. A primary key made up of several attributes is called a composite primary key. The primary key cannot be completely or partially empty (null). Keys that could be used as primary keys are called candidate or alternate keys.

* A foreign key is an attribute (or set of attributes) of one table that serves as the primary key of another table; it references the primary key of that other table.

Normalization is a process aimed at reducing the redundancy of information in a database. In addition to the data itself, various names, object names and expressions can also be normalized in the database.

A non-normalized database contains information in one or more different tables; this gives the impression that the inclusion of data in a particular table is not due to any apparent reasons. This state of affairs can have a negative impact on data security, the rational use of disk space, the speed of queries, the efficiency of updating the database and, perhaps most importantly, the integrity of the stored information. The database before normalization is a structure that has not logically been broken down into more manageable, smaller tables yet.

The normal form is a kind of indicator of the level, or depth, of database normalization. The normalization level of the database corresponds to the normal form in which it is located.

1.2 Operations on relations

To bring the table to the first normal form (1NF), two rules must be observed:

1. Atomicity or indivisibility. Each column must contain one indivisible value.

2. The table should not contain duplicate columns or groups of data.

For example, if a table stores a person's full address (street, city, postal code) in one field, it does not meet the 1NF rules, since one column contains several different values, which violates the atomicity rule. Likewise, if a database of movies contains the columns Actor1, Actor2, Actor3, it does not comply either, since the data repeats. A sketch of both fixes follows.
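Here is what those two 1NF fixes might look like in generic DDL. The table and column names are made up to match the examples above.

    BEFORE_1NF = """
    CREATE TABLE person (person_id INT PRIMARY KEY,
                         full_address VARCHAR(200));  -- street, city, zip in one field
    CREATE TABLE movie  (movie_id INT PRIMARY KEY,
                         actor1 VARCHAR(100), actor2 VARCHAR(100), actor3 VARCHAR(100));
    """

    AFTER_1NF = """
    CREATE TABLE person (person_id INT PRIMARY KEY,
                         street VARCHAR(100), city VARCHAR(100), postal_code VARCHAR(20));

    CREATE TABLE movie  (movie_id INT PRIMARY KEY);
    CREATE TABLE movie_actor (                 -- repeating columns moved to their own table
        movie_id   INT REFERENCES movie(movie_id),
        actor_name VARCHAR(100),
        PRIMARY KEY (movie_id, actor_name)
    );
    """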

Normalization should begin with checking the database structure for compliance with 1NF. All columns that are not atomic must be split into their constituent columns. If the table contains duplicate columns, they need to be moved out into a separate table.

To bring the table to first normal form, you should:

* Find all fields that contain multipart pieces of information.

* The data that can be broken down into component parts must be placed in separate fields.

* Move duplicate data into a separate table.

* Check if all tables match the conditions of the first normal form.

To bring tables to second normal form (2NF), they must already be in 1NF; normalization should proceed in order.

Now the 2NF condition must be met: every column that is not a key (including foreign keys) must depend on the entire primary key. Typically, columns whose values do not depend on the key are easy to identify. If the data in a column is not related to the key that describes the row, it should be separated into its own table, and the primary key returned to the old table. A sketch of such a split is shown below.
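A minimal sketch of removing a partial dependency; the order/product tables and column names are invented for illustration.

    BEFORE_2NF = """
    CREATE TABLE order_item (
        order_id      INT,
        product_id    INT,
        quantity      INT,
        product_name  VARCHAR(100),    -- depends only on product_id, i.e. on part of the key
        PRIMARY KEY (order_id, product_id)
    );
    """

    AFTER_2NF = """
    CREATE TABLE product (
        product_id   INT PRIMARY KEY,
        product_name VARCHAR(100)
    );
    CREATE TABLE order_item (
        order_id   INT,
        product_id INT REFERENCES product(product_id),
        quantity   INT,
        PRIMARY KEY (order_id, product_id)
    );
    """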

To bring the base to the second normal form, you need:

* Identify all columns that do not depend on the entire primary key of the table (i.e. depend on only part of it).

* Create the required fields in the tables being split out (for example, users and forums tables); select primary keys from existing fields or create new ones.

* Each table needs its own primary key.

* Create foreign keys and designate their relationships between tables. The final step of normalization to 2NF will be the allocation of foreign keys for communication with associated tables. The primary key of one table must be a foreign key in another.

Hints:

Another way to check a schema for 2NF is to look at the relationships between the tables. Ideally, all relationships should be one-to-many; many-to-many relationships need restructuring.

A properly normalized table will never have duplicate rows (two or more rows whose values ​​are not keys and contain the same data).

The database will be in third normal form if it has been converted to second normal form and every non-key column is independent of every other non-key column. If you have followed the normalization process correctly up to this point, there may be no questions about converting to 3NF. You should be aware that 3NF is violated if changing the value in one column requires a change in another column. A sketch of the typical fix is given below.
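A minimal sketch of removing such an interdependency (a transitive dependency); the employee/department names are illustrative.

    BEFORE_3NF = """
    CREATE TABLE employee (
        employee_id     INT PRIMARY KEY,
        department_id   INT,
        department_name VARCHAR(100)   -- depends on department_id, not on employee_id
    );
    """

    AFTER_3NF = """
    CREATE TABLE department (
        department_id   INT PRIMARY KEY,
        department_name VARCHAR(100)
    );
    CREATE TABLE employee (
        employee_id   INT PRIMARY KEY,
        department_id INT REFERENCES department(department_id)
    );
    """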

To bring the base to the third normal form, you need:

* Determine which fields of which tables have interdependencies, i.e. fields that depend more on each other than on the row as a whole.

* Create the corresponding tables. For each problematic column identified in step 1, create a separate table.

* Create or allocate primary keys. Each table must have a primary key.

* Create the required foreign keys that form any of the relationships.

In the fourth normal form, an additional rule is that it is necessary to exclude multivalued dependencies. In other words, all rows in the table must be independent of each other. The presence of some row X should not mean that row Y is also somewhere in this table.

2. Corporate information systems


A system (from the Greek systema - a whole, a compound made up of parts) is a set of elements that interact with each other, forming a certain integrity, unity. Here are some concepts that are often used to characterize a system.

1. A system element is a part of a system that has a specific functional purpose. Complex elements of systems, in turn, consisting of simpler interconnected elements, are often called subsystems.

2. Organization of the system - internal orderliness, consistency of interaction of system elements, manifested, in particular, in limiting the variety of states of elements within the system.

3. The structure of the system - the composition, order and principles of interaction of the elements of the system, which determine the basic properties of the system. If the individual elements of the system are spaced across different levels and the internal connections between the elements are organized only from higher to lower levels and vice versa, then we speak of the hierarchical structure of the system. Purely hierarchical structures are practically rare, therefore, somewhat expanding this concept, the hierarchical structure is usually understood as such structures, where, among other connections, hierarchical relationships are of paramount importance.

4. System architecture - a set of system properties that are essential for the user.

5. Integrity of the system - the fundamental irreducibility of the properties of the system to the sum of the properties of its individual elements (emergence of properties) and, at the same time, the dependence of the properties of each element on its place and function within the system.

An information system is "an interconnected set of means, methods and personnel used to store, process and issue information in order to achieve a set goal."

The Federal Law "On Information, Informatization and Information Protection" provides the following definition:

"Information system is an organizationally ordered set of documents (arrays of documents) and information technologies, including the use of computer technology and communications that implement information processes"

Classification by scale

In terms of scale, information systems are subdivided into the following groups:

* single;

* group;

* corporate.

A corporate information system is a scalable system designed for the integrated automation of all types of economic activities of large and medium-sized enterprises, including corporations consisting of a group of companies requiring unified management.

A corporate information system can be considered a system that automates more than 80% of the divisions of an enterprise.

Recently, many publications devoted to the use of information technology in the management of economic objects have used the term "corporate information systems", by which they mean the actual automated information systems of economic objects.

An automated information system (AIS) is a combination of various types of support, as well as specialists designed to automate the processing of accounting and analytical information. As a rule, the types of support are homogeneous for different systems in composition, which makes it possible to implement the principle of compatibility of systems in the course of their operation. In the process of studying AIS as a complex system, it is necessary to single out individual parts and elements and consider the features of their use at the stages of creation and operation.

Corporate information systems are an evolution of systems for workgroups, they are focused on large companies and can support geographically dispersed nodes or networks. Basically, they have a hierarchical structure of several levels. Such systems are characterized by a client-server architecture with specialization of servers or a multi-tier architecture. When developing such systems, the same database servers can be used as when developing group information systems. However, in large information systems, the most common servers are Oracle, DB2 and Microsoft SQL Server.

For group and corporate systems, the requirements for the reliability of operation and data safety are significantly increased. These properties are maintained by maintaining data, reference, and transactional integrity in the database servers.

Classification by scope

According to the scope of application, information systems are usually divided into four groups:

* transaction processing systems;

* decision making systems;

* information and reference systems;

* office information systems.

Bibliography

1. Agaltsov, V.P. Databases. In 2 volumes. Vol. 2. Distributed and Remote Databases: Textbook. - M.: ID FORUM, NITs INFRA-M, 2013.

2. Golitsyna, O.L., Maksimov, N.V., Popov, I.I. Databases: Textbook. - M.: Forum, 2012.

3. Karpova, I.P. Databases: Textbook. - SPb.: Piter, 2013.

4. Kirillov, V.V., Gromov, G.Yu. Introduction to Relational Databases. - SPb.: BHV-Petersburg, 2012.

5. Pirogov, V.Yu. Information Systems and Databases: Organization and Design: Textbook. - SPb.: BHV-Petersburg, 2009.

6. Fedorov, G.N. Information Systems. - M.: Academy, 2013.

7. Satunina, A.E., Sysoeva, L.A. Project Management of the Corporate Information System of the Enterprise. - M.: Finance and Statistics, Infra-M, 2009.


Industry data models

The main purpose of models is to facilitate orientation in the data space and help in highlighting the details that are important to business development. In today's environment, for a successful business, it is imperative to have a clear understanding of the links between the various components and to have a good idea of ​​the overall picture of the organization. Identification of all details and relationships using models allows the most efficient use of the time and tools for organizing the work of the company.

Data models are abstract models that describe how data is presented and accessed. Data models define data items and the relationships between them in a particular area. A data model is a navigation tool for both business and IT professionals that uses a specific set of symbols and words to accurately explain a specific class of real-world information. This allows for better communication within the organization and thus creates a more flexible and stable application environment.

The data model uniquely defines the meaning of the data, which in this case is structured data (as opposed to unstructured data such as, for example, an image, binary file, or text, where the meaning can be ambiguous).

As a rule, higher-level (more general) and lower-level (more detailed) models are distinguished. The upper level of modeling is the so-called conceptual data model, which gives the most general picture of how an enterprise or organization functions. A conceptual model includes the main concepts or subject areas critical to the functioning of the organization; usually their number does not exceed 12-15. Such a model describes the classes of entities that are important to the organization (business objects), their characteristics (attributes), and the associations between pairs of these classes (that is, relationships). Since the terminology of business modeling has not yet fully settled, in various English-language sources conceptual data models may also be called domain models or subject enterprise data models.

The next hierarchical level is the logical data model. It can also be called an enterprise data model or business model. These models contain data structures with their attributes and business rules and represent the information used by the enterprise from a business perspective. In such a model, data is organized in the form of entities and the relationships between them. The logical model presents data in a way that business users can easily understand. Within a logical model a data dictionary can be distinguished: a list of all entities with their precise definitions, which lets different categories of users share a common understanding of all the input and output information flows of the model. The next, lower level of modeling is the physical implementation of the logical model using specific software and technical platforms.
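To make the idea of a data dictionary concrete, a minimal sketch follows; the entity names and definitions in it are purely illustrative assumptions, not entries from any particular corporate model.

    # A hypothetical fragment of a data dictionary: each entity of the logical
    # model is paired with the single definition that all user groups agree on.
    # Entity names and definitions here are illustrative, not from a real model.
    data_dictionary = {
        "Customer": "A person or organization that buys or may buy the company's products.",
        "Order": "A customer's request for the delivery of specific products.",
        "Product": "A good or service that the company offers for sale.",
    }

    for entity, definition in data_dictionary.items():
        print(f"{entity}: {definition}")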

The logical model contains the detailed corporate business solution, which usually takes the form of a normalized model. Normalization is a process that ensures that each data item in the model has only one value and depends completely and uniquely on the primary key. Data items are organized into groups according to their unique identification. The business rules governing the data items must be fully incorporated into the normalized model, with prior verification and validation. For example, a data item such as Customer Name is likely to be split into First Name and Last Name and grouped with other related data items into a Customer entity with the primary key Customer ID.
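A minimal sketch of this normalization example, assuming hypothetical table and column names: Customer Name is split into first_name and last_name, and every item depends fully and only on the primary key.

    import sqlite3

    # Minimal sketch of the normalized Customer entity described above.
    # Table and column names are illustrative assumptions.
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,  -- unique identification of the group
        first_name  TEXT NOT NULL,        -- split out of "Customer Name"
        last_name   TEXT NOT NULL,        -- split out of "Customer Name"
        birth_date  TEXT                  -- another item fully dependent on the key
    );
    """)
    conn.execute("INSERT INTO customer VALUES (1, 'Anna', 'Petrova', '1985-03-12')")
    print(conn.execute("SELECT last_name FROM customer WHERE customer_id = 1").fetchone())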

The logical data model is independent of application technologies such as databases, networking technologies, or reporting tools, and the means of their physical implementation. There can be only one Enterprise Data Model in an organization. Logical models typically include thousands of entities, relationships, and attributes. For example, a data model for a financial institution or telecommunications company can contain about 3000 industry concepts.

It is important to distinguish between the logical and the semantic data model. The logical data model represents the enterprise-wide business solution, while the semantic data model represents an applied business solution. The same corporate logical data model can be implemented using different semantic models, i.e. semantic models can be viewed as the next level of modeling, closer to the physical models. Each of these semantic models represents a separate "slice" of the corporate data model in accordance with the requirements of a particular application. For example, in the corporate logical data model the Client entity will be fully normalized, while in the semantic model for a data mart it may be represented as a multidimensional structure.
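The contrast can be made concrete with a sketch: the same Client concept appears first as a normalized logical structure and then as a flattened dimension of the kind a data mart's semantic model might use in a star schema. All table and column names here are assumptions for illustration only.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    -- Corporate logical model: Client is fully normalized; addresses live in a
    -- separate table and reference the client by key.
    CREATE TABLE client (
        client_id   INTEGER PRIMARY KEY,
        first_name  TEXT,
        last_name   TEXT
    );
    CREATE TABLE client_address (
        address_id  INTEGER PRIMARY KEY,
        client_id   INTEGER REFERENCES client(client_id),
        city        TEXT,
        postal_code TEXT
    );

    -- Semantic model for a data mart: the same concept is denormalized into a
    -- single dimension table that a fact table of a star schema would reference.
    CREATE TABLE dim_client (
        client_key  INTEGER PRIMARY KEY,
        full_name   TEXT,
        city        TEXT,
        postal_code TEXT
    );
    """)
    print([row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")])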

A company has two ways to obtain a corporate logical data model: build it itself or use a ready-made industry model (industry logical data model). The difference in terms reflects only different approaches to building the same logical model. If a company develops and implements its own logical data model, such a model is usually simply called a corporate logical model. If the organization decides to use a ready-made product from a professional supplier, one speaks of an industry logical data model. The latter is a ready-made logical data model that reflects the functioning of a particular industry with a high degree of accuracy. An industry logical model is a domain-specific, integrated view of all the information that must reside in an enterprise data warehouse to answer both strategic and tactical business questions. Like any logical data model, the industry model is independent of application decisions. It also does not include derived data or other calculations intended to speed up data retrieval. As a rule, most of the logical structures of such a model translate well into an efficient physical implementation. Such models are developed by many suppliers for a wide variety of areas of activity: finance, manufacturing, tourism, healthcare, insurance, and so on.

An industry logical data model contains information that is common to the industry and therefore cannot be a comprehensive solution for a company. Most companies have to grow the model by an average of 25% by adding data items and expanding definitions. Out-of-the-box models contain only key data elements, and the rest of the elements must be added to the corresponding business objects during the installation of the model in the company.

Industry logical data models contain a significant amount of abstraction. Abstraction here means combining similar concepts under common names such as Event or Participant. This adds flexibility and uniformity to industry models: the concept of an Event, for instance, is applicable to any industry.

Business Intelligence Specialist Steve Hoberman identifies five factors to consider when deciding whether to acquire an industry data model. The first is the time and money needed to build the model. If an organization needs to achieve results quickly, then the industry model will be beneficial. Using an industry model may not immediately provide a picture of the entire organization, but it can save a significant amount of time. Instead of modeling itself, time will be spent linking existing structures to the industry model and discussing how best to customize it to the needs of the organization (for example, which definitions should be changed and which data items should be added).

The second factor is the time and money required to keep the model in good working order. If the enterprise data model is not part of a methodology that allows you to monitor compliance with its accuracy and compliance with modern standards, then such a model becomes outdated very quickly. The industry data model can prevent this risk from happening as it is kept up to date with external resources. Of course, changes taking place within the organization should be reflected in the model by the company itself, but industry changes will be reproduced in the model by its supplier.

The third factor is the risk and the modeling experience involved. Creating a corporate data model requires qualified resources from both the business and the IT staff. As a rule, managers know well either the work of the organization as a whole or the activities of a particular department; few have both broad (company-wide) and deep (departmental) knowledge of their business. Most managers know only one area well, so obtaining the overall corporate picture requires significant business resources. This also raises the demands on the IT staff: the more business resources are needed to create and test a model, the more experienced the analysts must be. They must not only know how to elicit information from the business staff, but also be able to reconcile points of view in contentious areas and present all this information in an integrated way. The person creating the model (in many cases the same analyst) must have good modeling skills: building enterprise logical models requires modeling "for the future" and the ability to literally translate a complex business "into boxes and lines".

On the other hand, the industry model allows outside expertise to be leveraged. Industry-specific logic models are built using proven modeling methodologies and teams of experienced professionals to avoid common and costly problems that can arise when developing enterprise data models within an organization.

The fourth factor is the existing application infrastructure and supplier relationships. If an organization already uses many tools from one supplier and has an established relationship with it, it makes sense to order the industry model from that supplier as well. Such a model will work smoothly with the supplier's other products.

The fifth factor is intra-industry exchange of information. If a company needs to communicate with other organizations working in the same field, then the industry model can be very useful in this situation. Organizations within the same industry use similar structural components and terminology. Nowadays, in most industries, companies are forced to exchange data in order to successfully conduct business.

The most effective are the industry models offered by professional suppliers. High efficiency of their use is achieved due to the significant level of detail and accuracy of these models. They usually contain many data attributes. In addition, the creators of these models not only have extensive modeling experience, but are also well versed in building models for a particular industry.

Industry data models provide companies with a single, integrated view of their business information. Many companies find it difficult to integrate their data, although integration is a prerequisite for most enterprise-wide projects. According to a study by The Data Warehousing Institute (TDWI), more than 69% of the organizations surveyed consider integration a significant barrier to the adoption of new applications. Conversely, implemented data integration brings the company tangible returns.

The industry data model, in addition to linking to existing systems, provides great benefits for enterprise-wide projects such as Enterprise Resource Planning (ERP), master data management, business intelligence, data quality improvement, and employee development.

Thus, industry logical data models are an effective tool for integrating data and obtaining a holistic view of the business. The use of logical models seems to be a necessary step towards the creation of corporate data warehouses.

Publications

  1. Steve Hoberman. Leveraging the Industry Logical Data Model as Your Enterprise Data Model.
  2. Claudia Imhoff. Fast-Tracking Data Warehousing & Business Intelligence Projects via Intelligent Data Modeling.

The purpose of the lecture

After studying the material of this lecture, you will know:

  • what an enterprise data model is;
  • how to transform an enterprise data model into a data warehouse model;
  • the essential elements of an enterprise data model;
  • the presentation layers of an enterprise data model;
  • an algorithm for transforming an enterprise data model into a multidimensional data warehouse model;

and learn to:

  • develop data warehouse models based on an organization's enterprise data model;
  • design a star schema using CASE tools;
  • partition the tables of a multidimensional model using CASE tools.

Enterprise data model

Introduction

The core of any data warehouse is its data model. Without a data model it is very difficult to organize the data in the warehouse, so data warehouse developers should invest time and effort in developing such a model. Building the warehouse data model is the responsibility of the warehouse designer.

Compared with the design of OLTP systems, the methodology of data warehouse design has a number of distinctive features related to the orientation of the warehouse data structures toward analysis tasks and information support of decision making. The warehouse data model must provide an effective solution to precisely these tasks.

The starting point for data warehouse design can be the so-called enterprise data model (corporate data model, EDM), which is created in the process of designing the organization's OLTP systems. When an enterprise data model is designed, an attempt is usually made to create, on the basis of business operations, a data structure that gathers and synthesizes all the information needs of the organization.

Thus, the enterprise data model contains the information needed to build a data warehouse model. Therefore, if such a model exists in the organization, the warehouse designer can begin the design by solving the problem of transforming the enterprise data model into the warehouse model.

Enterprise data model

How can the problem of transforming the enterprise data model into the data warehouse model be solved? To solve it, one first needs to have such a model, i.e. the enterprise data model must be built and documented. One also needs to understand what from this model, and how, should be transformed into the warehouse model.

Let us clarify, from the standpoint of a data warehouse designer, the concept of an enterprise data model. An enterprise data model is understood as a multi-level, structured description of the organization's subject areas, subject area data structures, business processes and business procedures, organizational data flows, state diagrams, data-process matrices, and other model representations used in the organization's activities. Thus, in the broadest sense, an enterprise data model is a set of models of various levels that characterize (model at some level of abstraction) the organization's activities; the content of the enterprise model therefore depends directly on which model constructions the organization has included in it.

The main elements of an enterprise data model are:

  • description of the subject areas of the organization (definition of areas of activity);
  • relationships between the subject areas defined above;
  • information data model (ERD, or entity-relationship model);
  • description for each subject area:
    • entity keys;
    • entity attributes;
    • subtypes and supertypes;
    • relationships between entities;
    • grouping attributes;
    • relationships between subject areas;
  • functional or business process model;
  • data flow diagrams;
  • state diagrams;
  • other models.

Thus, an enterprise data model contains the entities, attributes, and relationships that represent the information needs of the organization. Fig. 16.1 shows the main elements of an enterprise data model.
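The listed elements can be pictured as a tiny metamodel. The sketch below is only an illustration of how subject areas, entities, attributes, and relationships fit together; the class and field names are assumptions, not a standard notation.

    from dataclasses import dataclass, field
    from typing import List

    # Toy metamodel of the elements listed above; all names are illustrative.
    @dataclass
    class Attribute:
        name: str
        is_key: bool = False

    @dataclass
    class Entity:
        name: str
        attributes: List[Attribute] = field(default_factory=list)

    @dataclass
    class Relationship:
        from_entity: str
        to_entity: str
        verb: str  # e.g. "places", "pays"

    @dataclass
    class SubjectArea:
        name: str
        entities: List[Entity] = field(default_factory=list)
        relationships: List[Relationship] = field(default_factory=list)

    orders = SubjectArea(
        name="Order",
        entities=[Entity("Order", [Attribute("order_id", is_key=True),
                                   Attribute("order_date")])],
        relationships=[Relationship("Customer", "Order", "places")],
    )
    print(orders.name, [e.name for e in orders.entities])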

Presentation levels of the enterprise data model

An enterprise data model is subdivided into subject areas, which represent groups of entities related to supporting specific business needs. Some subject areas may cover specific business functions, such as contract management, while others may group entities that describe products or services.

Each logical model must correspond to an existing subject area of the enterprise data model. If a logical model does not meet this requirement, a subject area model must be added to it.

An enterprise data model usually has several presentation levels. The high level of the enterprise data model is a description of the organization's main subject areas and their relationships at the entity level. Fig. 16.2 shows a fragment of a top-level enterprise data model.


Fig. 16.2.

The diagram in the figure presents four subject areas: Customer, Account (invoice), Order, and Product. As a rule, only direct links between subject areas are shown at this level; they record, for example, the following fact: the customer pays the invoice for an order of goods. Details and indirect relationships are not shown at this level of the enterprise model.
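The direct links of such a top-level view can be written down in a few lines. The sketch below records only first-order relationships between the four subject areas; the verbs are assumptions inferred from the description, not labels taken from the figure.

    # Direct links between the four top-level subject areas described above.
    # Only first-order relationships are listed; indirect ones are omitted,
    # mirroring the top-level view of the enterprise data model.
    top_level_links = [
        ("Customer", "places", "Order"),
        ("Customer", "pays", "Account"),
        ("Account", "bills", "Order"),
        ("Order", "contains", "Product"),
    ]

    for source, verb, target in top_level_links:
        print(f"{source} --{verb}--> {target}")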

At the next, middle level of the enterprise data model, detailed information about the objects of the subject areas is shown: the keys and attributes of entities, their relationships, subtypes and supertypes, and so on. For each subject area of the top-level model there is one middle-level model. Fig. 16.3 shows the middle presentation level of the enterprise model for a fragment of the Order subject area.

Fig. 16.3 shows that the Order subject area includes several entities, defined through their attributes, and the relationships between them. The model allows one to answer questions such as when an order was placed, who placed it, who shipped it, who receives it, and a number of others. The diagram also shows that in this organization there are two types of orders: commercial orders (Commercial) and retail orders (Retail).
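Subtype/supertype structures like this Order hierarchy are often implemented with a shared primary key. The sketch below is one possible rendering under assumed table and column names; it is not the layout of the figure itself.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    -- Supertype: attributes shared by every order.
    CREATE TABLE order_header (
        order_id    INTEGER PRIMARY KEY,
        order_date  TEXT,
        customer_id INTEGER,
        order_type  TEXT CHECK (order_type IN ('COMMERCIAL', 'RETAIL'))
    );

    -- Subtypes: attributes specific to each kind of order, sharing the key.
    CREATE TABLE commercial_order (
        order_id    INTEGER PRIMARY KEY REFERENCES order_header(order_id),
        contract_no TEXT
    );
    CREATE TABLE retail_order (
        order_id    INTEGER PRIMARY KEY REFERENCES order_header(order_id),
        store_code  TEXT
    );
    """)
    print([r[0] for r in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")])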

Note that an enterprise data model can represent various aspects of the organization's activities with varying degrees of detail and completeness. If the corporate model represents all aspects of the organization's activities, it is also called an organization data model (enterprise data model).

From the standpoint of data warehouse design, an important factor in deciding to build the warehouse model from the enterprise data model is the degree of completeness of the enterprise data model.

An organization's enterprise data model is evolutionary in nature: it is constantly being developed and improved. Some subject areas of the enterprise data model may be well developed, while for others work may not yet have started. If a fragment of a subject area has not been worked out in the enterprise data model, there is no way to use the model as a starting point for data warehouse design in that area.

The degree of completeness of the enterprise model can be compensated for in data warehouse design as follows: since the warehouse development process is usually divided in time into a sequence of stages, its design can be synchronized with the completion of individual fragments of the organization's enterprise data model.

The lowest presentation layer of the enterprise data model contains information about the physical characteristics of the database objects that correspond to the logical data model of the middle presentation layer of the enterprise data model.
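At this level the model fixes such things as concrete data types, indexes, and storage parameters. The sketch below is a minimal illustration; platform-specific clauses such as tablespaces or partitioning options are omitted, and all names are assumptions.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    -- Physical level: exact column types and access structures are fixed here.
    CREATE TABLE order_line (
        order_id   INTEGER NOT NULL,
        line_no    INTEGER NOT NULL,
        product_id INTEGER NOT NULL,
        quantity   REAL,
        PRIMARY KEY (order_id, line_no)
    );
    -- An index is a physical-level decision: it changes access paths, not meaning.
    CREATE INDEX ix_order_line_product ON order_line (product_id);
    """)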
