
What a corporate data warehouse (Data Warehouse) is and who needs it. The corporate data model.

This article discusses data warehouse architecture: what to be guided by when building it, which approaches work, and why.

"The fairy tale is a lie - yes in it hint ..."

Grandfather planted... a warehouse. And the warehouse grew big, very big. The only trouble was that he never really knew how it was put together. So Grandfather decided to hold a review. He called Grandmother, Granddaughter, the Cat and the Mouse to a family council and put the matter to them: "Our warehouse has grown. Data flows in from every system, there are more tables than you can count. Users build their reports each in their own way. It would seem all is well - live and be happy. But one thing grieves me: nobody knows how the thing is put together. It demands disks beyond counting - we cannot keep up! And on top of that, users have taken to coming to me with all sorts of complaints: now a report hangs, now the data is stale. And worst of all - we bring our reports to the Tsar, and the numbers do not agree with one another. Heaven forbid the Tsar grows angry - then neither you nor I will keep our heads. So I decided to gather you and ask your advice: what shall we do?"

He looked around at the assembly and asked:
- Now then, Grandmother, do you know how our warehouse is put together?
- No, Grandfather, I do not. And how would I know? Look at the fine brave lads guarding it! What moustaches! There is no getting near. Once I went over to visit them and baked some pies. They ate the pies, wiped their moustaches and said: "Why did you come, Grandma? What do you want with the warehouse? Just tell us what report you need - we will make it! The main thing is, bring pies more often! They are awfully tasty."
- And you, beloved Granddaughter, do you know how our warehouse is put together?
- No, Grandfather, I do not. They gave me access to it once. I connected, I look - and there are tables beyond counting, tucked away in all sorts of different schemas. My eyes ran wild... At first I was simply lost. Then I looked closer - some of them are empty, others are filled, but only halfway. And the data seems to repeat itself. No wonder we cannot keep up with the disks, with redundancy like that!
- Well, Cat, what will you say about our warehouse? Is there anything good in it?
- How could I not say, Grandfather - I will. I tried to build a small pilot data mart in a separate little schema, so as to understand which trade is profitable for our state - which goods sell well for the merchants, who then pay their tribute and fill the treasury, and which sell wretchedly. So I began picking data out of that warehouse for myself. I gathered up the facts and tried to match them against the products. And what did I see, Grandfather - the products seem to be the same, but look into the tables and they are all different! So I set about combing them with my little comb. I combed and combed - and brought them to a certain uniformity, pleasing to the eye. But I rejoiced too soon: the next day I ran my wonderful scripts to refresh the data in the mart - and everything fell apart on me! "How can this be?" I think. The Granddaughter will be upset - today we were supposed to show our pilot to the minister. How can we go to him with data like this?
- Sad tales you tell, Cat. Well, Mouse, did you not try to find out anything about the warehouse? You are a lively lass, nimble and sociable! What will you tell us?
- How could I not try, Grandfather - of course I did; I may be a quiet mouse, but a nimble one. Once the Granddaughter asked the Cat to get hold of the data model of our state warehouse. And the Cat, naturally, came to me - you, Mouse, she says, are our only hope! Well, why not do a good turn for good people (and cats)? I set off for the castle where the head of the warehouse keeps the data model hidden in a safe. And I hid myself and waited for him to take that model out of the safe. The moment he stepped out for coffee - hop, I was on the table. I look at the model and cannot make out a thing! How can this be? I do not recognize our warehouse! We have thousands upon thousands of tables, data beyond measure! And here - everything is trim and pretty... He looked at that model for a while - and put it back in the safe.
- Strange things indeed you tell us, Mouse.
Grandfather fell into deep thought.
- What are we to do, my friends? With a warehouse like this we will not live happily much longer... The users will soon lose their patience altogether.

Whatever our fairy-tale Grandfather decides - to build a new warehouse or to try to resuscitate the existing one - before "rolling up the sleeves" again, it is worth drawing some conclusions.
Let us set aside the organizational aspects, such as the danger of concentrating expertise in a narrow closed group, the absence of control processes, and the lack of transparency of the system architecture used in the enterprise, and so on.
Today I would like to focus on building the architecture of a specific system (or group of systems) - data warehouses: what must be kept in focus first of all when an organization sets out to build such a complex and costly system as a warehouse.

Debriefing

None of us, working on the creation and development of a system, wants it to turn out to be a throwaway, or a solution that will "wither" in a year or two because it cannot meet the requirements and expectations of customers and the business. However strong the tilt toward "agile methodologies" may be today, it is far more pleasant to feel like a master who makes violins than like an artisan who whittles sticks for disposable drums.
Our intention sounds natural: to make systems that are solid and of high quality, that will not require regular "midnight rework with a file", that we will not be ashamed of in front of end users, and that will not look like a "black box" to all "uninitiated" successors.

To begin with, let us draw up a list of typical problems that we regularly face when working with warehouses. I will simply write down what there is - for now without trying to order or formalize it.

  1. In principle we have a good warehouse: if you do not touch it, everything works. True, as soon as you need to make a change, "local collapses" begin.
  2. Data is loaded daily, on schedule, within one large process, in an 8-hour window. And that suits us. But if a failure suddenly occurs, manual intervention is required - and then everything can run unpredictably long, because a person has to take part in the process.
  3. A release has been rolled out - expect problems.
  4. One single source failed to deliver its data on time - and all the processes are waiting.
  5. Data integrity is enforced by the database - so our processes crash with an error when it is violated.
  6. We have a very large warehouse: 2000 tables in one common schema, and 3000 more in many other schemas. We already have only a vague idea of how they are organized and why they appeared. So it is hard for us to reuse anything, and many tasks get solved all over again, because that is easier and faster than digging through "someone else's code". As a result we get discrepancies and duplicated functionality.
  7. We expect the sources to provide quality data. But it turns out that they do not. As a result we spend a lot of time reconciling our final reports. And we have become very good at it - we even have a well-established process. True, it takes time. But users are used to it...
  8. The user does not always trust our reports and demands a justification for this or that figure. Sometimes he is right, sometimes not. But it is very hard for us to justify the figures, because we have no means of "end-to-end analysis" (data lineage).
  9. We could bring in additional developers. But we have a problem: how do we include them in the work? How do we parallelize the work most effectively?
  10. How do we develop the system gradually, without spending a whole year building the "system core"?
  11. A data warehouse is associated with a corporate model. But we know for certain (we saw it at bank XYZ) that a model can be built for an infinitely long time (at bank XYZ they spent six months discussing business entities, without any movement). Why bother with it at all? Or maybe it is better to do without it, if it causes so many problems? Maybe it can somehow be generated?
  12. We have decided to maintain a model. But how do we develop the warehouse data model systematically? Do we need "rules of the game", and what might they be? What will that give us? And what if we make a mistake with the model?
  13. Should we keep the data, or the history of its changes, if "the business does not need it"? We would not want to "store garbage" and complicate the use of this data for real tasks. Should the warehouse preserve history? What kind? How should the warehouse work with time?
  14. Do we need to try to unify data in the warehouse if we have a master reference data (NSI) management system? If there is MDM, does that mean the whole master data problem is now solved?
  15. Key accounting systems will soon be replaced. Should the data warehouse be prepared for a change of source? How can this be achieved?
  16. Do we need metadata? What do we mean by that? Where exactly can it be used? How can it be implemented? Does it need to be stored "in one place"?
  17. Our customers are extremely unstable in their requirements and wishes - something is constantly changing. Our business in general is very dynamic. While we are doing something, it is already becoming unnecessary. How do we deliver results as quickly as possible - like hot pies?
  18. Users demand timeliness. But we cannot run our main load processes often, because that loads the source systems (and hurts their performance) - so we hang up additional data streams that pick up, point by point, exactly what we need. True, that makes a lot of streams, and later we throw part of that data away. On top of that, a convergence problem appears. But there is no other way...
That is already quite a lot. And it is not a complete list - it is easy to extend and develop. We will not hide it in a drawer, but hang it in a prominent place, keeping these questions in the focus of our attention as we work.
Our task is to arrive, in the end, at a comprehensive solution.

Antifragility

Looking at our list, one conclusion can be drawn. It is not hard to create some kind of "database for reporting", throw data into it, or even build some scheduled data update processes. The system somehow starts to live, users appear, and with them obligations and SLAs; new requirements arise, additional sources are connected, methodologies change - all of this has to be taken into account in the course of development.

After some time the picture is as follows:
"Here is the warehouse. And it works, as long as you do not touch it. Problems arise when we have to change something."

A change arrives whose impact we cannot assess or comprehend (because we did not build such instruments into the system from the start) - and in order not to take risks, we do not touch what exists, but build yet another extension off to the side, and another, and another - turning our solution into slums, or, as they say in Latin America, favelas, where even the police are afraid to go.
There is a feeling of losing control over one's own system, of chaos. More and more hands are needed to support the existing processes and solve problems. And changes become harder and harder to make. In other words, the system becomes unstable to stress, non-adaptive to change. On top of that, there is a strong dependence on the characters who "know the fairway", since there are no "charts".

This property of an object - to collapse under the influence of chaos, random events and shocks - Nassim Nicholas Taleb calls fragility. He also introduces the opposite concept: antifragility, when an object is not destroyed by stress and randomness, but gains direct benefit from them ("Antifragile: Things That Gain from Disorder").
Otherwise it can be called adaptability, or resilience to change.

What does this mean in our context? What are the "sources of chaos" for IT systems? And what does it mean "to benefit from chaos" from the point of view of IT architecture?
The first thought that comes to mind is changes that come from outside. What is the outside world for the system - for the warehouse in particular? First of all, of course, changes on the side of the data sources for the warehouse:

  • changes in the formats of incoming data;
  • replacement of some source systems with others;
  • changes in the rules/platforms of system integration;
  • changes in data interpretation (the formats are preserved, but the logic of working with the data changes);
  • changes in the data model, if integration is done at the data level (by parsing database transaction log files);
  • growth of data volumes - while there was little data in the source system and the load was low, you could pick the data up however you liked, with an arbitrarily heavy query; now the data and the load have grown, and strict restrictions apply;
  • etc.
Source systems, the information and its structure, the type of integration interaction, and the logic of working with the data can all change. Each system implements its own data model and its own approaches to working with it, which serve the goals and objectives of that system. And however hard people try to unify industry models and reference practices, nuances will inevitably surface. (Besides, the process of industry-wide unification itself, for various reasons, is not moving very fast.)
A corporate culture of working with data - the existence and control of an information architecture, a unified semantic model, master data management (MDM) - somewhat eases the task of consolidating data in the warehouse, but does not eliminate its necessity.

No less critical are the changes initiated by the consumers of the warehouse (changes in requirements):

  • a report that used to be built on the available data now requires additional fields or a new data source to be connected;
  • previously implemented data processing methods have become outdated - the algorithms, and everything they affect, need to be reworked;
  • previously, the current value of a reference attribute on the dashboard was sufficient - now the value that was relevant at the moment of the analyzed fact/event is required;
  • a requirement has appeared for a depth of stored history that did not exist before - to store data not for 2 years, but for 10 years;
  • previously, data "as of the end of the day/period" was enough - now the state of the data "within the day", or at the moment of a specific event (for example, the moment a decision is made on a loan application - for Basel II), is needed;
  • previously we were satisfied with reporting on yesterday's data (T-1) or later - now we need T0;
  • etc.
Both the integration interactions with the source systems and the requirements of the consumers of warehouse data are external factors for the data warehouse: some source systems are replaced by others, data volumes grow, the formats of incoming data change, user requirements change, and so on. All of these are typical external changes for which our system - our warehouse - must be ready. With the right architecture, they should not kill the system.

But that is not all.
Speaking of variability, we first of all remember external factors. After all, inside we control everything - or so it seems to us, right? Yes and no. Yes, most of the factors outside our zone of influence are external. But there is also "internal entropy". And it is precisely because of its presence that we sometimes need to go back "to point 0". To start the game over.
In life we often tend to start from scratch. Why is that characteristic of us? And is it bad?
Applied to IT: for the system itself this can be very good - the ability to reconsider individual decisions. Especially when we can do it locally. Refactoring is the process of untangling the "web" that periodically arises as the system develops. Going back "to the beginning" can be useful. But it has a price.
With competent architecture management this price goes down - and the process of system development itself becomes more controlled and transparent. A simple example: if the principle of modularity is observed, you can rewrite an individual module without touching its external interfaces. This cannot be done with a monolithic structure.
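To make the modularity point concrete, here is a minimal Python sketch (all names - SourceLoader, extract, load_to_staging - are invented for the example): as long as the external interface stays fixed, the implementation behind it can be rewritten without touching its consumers.

```python
from abc import ABC, abstractmethod
from typing import Iterable


class SourceLoader(ABC):
    """Stable external interface of a loading module: consumers depend only on this."""

    @abstractmethod
    def extract(self, since: str) -> Iterable[dict]:
        """Return records changed in the source since the given point in time."""


class FileSourceLoader(SourceLoader):
    """One possible implementation; it can be rewritten (e.g. replaced by an
    API-based loader) without touching any code that calls extract()."""

    def __init__(self, path: str):
        self.path = path

    def extract(self, since: str) -> Iterable[dict]:
        # Illustrative stub: a real implementation would parse the file
        # and filter the records changed after `since`.
        return []


def load_to_staging(loader: SourceLoader, since: str) -> int:
    """Downstream code sees only the interface, not the implementation."""
    return sum(1 for _ in loader.extract(since))
```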

The antifragility of a system is determined by the architecture laid into it. And it is precisely this property that makes the system adaptive.
When we talk about adaptive architecture, we mean that the system is capable of adapting to change - not at all that we constantly change the architecture itself. On the contrary: the more stable and robust the architecture, and the fewer requirements that entail its revision, the more adaptive the system.

Solutions that involve revising the entire architecture carry a much higher price, and to accept them you need very good reasons. Such a reason may be, for example, a requirement that cannot be implemented within the existing architecture. Then people say: a requirement has appeared that affects the architecture.
Thus we also need to know our "antifragility boundaries". Architecture is not developed "in a vacuum" - it rests on the current requirements and expectations. And if the situation fundamentally changes, we must recognize that we have gone beyond the limits of the current architecture, reconsider it, work out a different solution, and think through the transition paths.
For example, we assumed that the warehouse would always need data as of the end of the day, and that we would pick it up every day through the standard system interfaces (a set of views). Then a requirement came from the risk management unit to receive data not as of the end of the day, but at the moment a lending decision is made. There is no point trying to "stretch what cannot be stretched" - you simply need to acknowledge the fact, the sooner the better, and start working out an approach that will solve the problem.
There is a very fine line here: if we take into account only the "requirements of the moment" and do not look a few steps ahead (and a few years ahead), we increase the risk of running into an architecture-affecting requirement too late - and the price of our changes will be very high. Looking a little ahead, within the boundaries of our horizon, has never hurt anyone.

The system from the "warehouse fairy tale" is an example of a very rickety system, built on fragile design approaches. And when that happens, destruction comes rather quickly for this class of systems.
Why can I say so? The topic of warehouses is not new. The approaches and engineering practices developed over this time were aimed precisely at preserving the viability of the system.
A simple example: one of the most frequent reasons that warehouse projects fail "on take-off" is the attempt to build the warehouse on top of source systems that are themselves still under development, without agreeing on integration interfaces - trying to pick data straight out of the tables. As a result, development drags on, the source database changes in the meantime, and the load streams into the warehouse stop working. Redoing things is too late. And if you have not protected yourself by building several layers of tables inside the warehouse, everything may have to be thrown out and started over. This is just one example, and one of the simplest.

The criterion of fragility versus antifragility is simple. The chief judge is time. If a system withstands the test of time and demonstrates its "vitality" and "unsinkability", it has the property of antifragility.
If, when designing a system, we take antifragility into account as a requirement, this will push us toward approaches to building its architecture that make the system more adaptive both to "chaos outside" and to "chaos inside". Ultimately the system will live longer.
None of us wants to produce throwaway work. And there is no need to deceive ourselves that there is no other way. Looking a few steps ahead is normal for a person at any time, especially in a crisis.

What a data warehouse is and why we build it

An article devoted to warehouse architecture assumes that the reader not only knows what a warehouse is, but also has some experience with such systems. Nevertheless, I considered it necessary to return to the origins, to the beginning of the road, because that is where the "fulcrum" of development lies.

How did people come to the conclusion that data warehouses are needed? And how do they differ from simply a "very large database"?
Long ago, when the world knew only "business data processing systems", there was no division of IT systems into classes such as front-end OLTP systems, back-office DSS, text processing systems, data warehouses, and so on.
That was the time when Michael Stonebraker created the first relational DBMS, Ingres.
And that was the time when the era of personal computers burst like a whirlwind into the computer industry and forever changed the ideas of the IT community of those years.

Back then it was easy to come across corporate applications written on top of desktop-class DBMSs such as Clipper, dBase and FoxPro. The market of client-server applications and DBMSs was only gaining momentum. One after another, database servers appeared that would occupy their niche in the IT space for a long time - Oracle, DB2, and others.
And the term "database application" became widespread. What did such an application include? In simplified terms: some input forms through which users could enter information, some calculations that were launched "by a button" or "on a schedule", and some reports that could be viewed on screen or saved as files and sent to print.
"Nothing special - an ordinary application, it just has a database," one of my mentors remarked at an early stage of my career. "So is there anything special at all?" I wondered then.

If you look closely, there is indeed something special. As the users multiply and the volume of incoming information grows, as the load on the system increases, its designers and developers resort to certain "tricks" in order to keep performance at an acceptable level. The very first is the division of the monolithic "business data processing system" into an accounting application that supports users in on-line mode, and a separate application for batch processing of data and reporting. Each of these applications gets its own database and is even placed on a separate instance of the database server, with different settings for the different load profiles - OLTP and DSS. And data streams flow between them.

Is that all? It would seem the problem is solved. What happens next?
Next, companies grow and their information needs multiply. The number of interactions with the outside world grows. And in the end there is no longer one big application that fully automates all processes, but several different ones, from different vendors. The number of systems generating information - data source systems - in the company increases. Sooner or later there is a need to see and compare, side by side, the information received from different systems. This is how the data warehouse, a new class of systems, appears in the company.
The generally accepted definition of this class of systems is as follows.

Data Warehouse - a subject-oriented information database specifically designed and built to prepare reports and perform business analysis in order to support decision-making in an organization.
Thus, the consolidation of data from different systems - the ability to look at them in a certain "single" (unified) way - is one of the key properties of systems of the data warehouse class. This is the reason why warehouses appeared in the course of the evolution of IT systems.

Key features of data warehouses

Let us look closer. What key features do these systems have? What distinguishes a data warehouse from the other IT systems of an enterprise?

First, it is large volumes. Very large. VLDB - that is how leading vendors refer to such systems when they give recommendations on the use of their products. Data from all the company's systems flows into this big database and is stored there "forever and unchanged", as the textbooks write (in practice, life is more complicated).

Second, it is historical data - "corporate memory", as data warehouses are called. With regard to working with time, the warehouse is quite an interesting place. In accounting systems the data is current as of the present moment: the user performs an operation - and the data is updated. The history of changes may not be preserved at all - that depends on the accounting practice. Take, for example, a bank account balance. We may be interested in the current balance "right now", at the end of the day, or at the moment of a certain event (for example, at the moment a scoring value is calculated). While the first two are solved quite simply, the last will most likely require special effort. A user working with the warehouse can turn to past periods, compare them with the current one, and so on. It is precisely these time-related capabilities - obtaining the state of the data at various points of the time axis, to a certain depth into the past - that significantly distinguish data warehouses from accounting systems.
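One common way to support such point-in-time questions is to historize records with validity intervals (an SCD2-style technique). Below is a minimal sketch with invented field names (valid_from, valid_to) showing how "the balance as of a given moment" can then be answered - something a table that stores only the current state cannot do.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional, List


@dataclass
class BalanceVersion:
    account_id: str
    balance: float
    valid_from: datetime            # when this version became true
    valid_to: Optional[datetime]    # None = still the current version


def balance_as_of(history: List[BalanceVersion], account_id: str,
                  moment: datetime) -> Optional[float]:
    """Return the balance that was in effect at `moment`, e.g. when a scoring
    decision was made."""
    for v in history:
        if (v.account_id == account_id
                and v.valid_from <= moment
                and (v.valid_to is None or moment < v.valid_to)):
            return v.balance
    return None


history = [
    BalanceVersion("40817-001", 100.0, datetime(2024, 1, 1), datetime(2024, 1, 15)),
    BalanceVersion("40817-001", 250.0, datetime(2024, 1, 15), None),
]
print(balance_as_of(history, "40817-001", datetime(2024, 1, 10)))  # 100.0
```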

Third, it is consolidation and unification of data. For joint analysis to be possible, the data must be brought to a common form - a unified data model - and the facts must be matched against unified reference data. There can be several aspects and difficulties here. First of all, conceptual ones: different people from different divisions may understand different things under the same term, and conversely, may call the same thing in essence by different names. How do you provide a "single view" while preserving the specific vision of a particular group of users?

Fourth, it is working with data quality. In the process of loading data into the warehouse, the data is cleansed, and common transformations and conversions are performed. Common transformations should be done in one place and then reused to build various reports. This avoids the discrepancies that cause so much irritation among business users - especially among management, who get figures "on the table" from different departments that do not agree with one another. Poor data quality gives rise to errors and discrepancies in reports, and their consequence is a lower level of user trust in the entire system and in the entire analytical service as a whole.

Architectural concept

Everyone who has come across a warehouse has most likely observed a certain "layered structure", because it is precisely this architectural paradigm that has taken root for systems of this class. And not by chance. The layers of the warehouse can be perceived as separate components of the system, each with its own tasks, area of responsibility and "rules of the game".
Layered architecture is a means of coping with the complexity of the system: each subsequent layer is abstracted from the intricacies of the internal implementation of the previous one. This approach makes it possible to single out tasks of the same type and solve them uniformly, without reinventing the "wheel" from scratch every time.
A schematic conceptual architecture is shown in the figure. It is a simplified diagram that reflects only the key idea - the concept - without the "anatomical details" that arise with deeper elaboration of the particulars.

As shown in the diagram, the following layers are conceptually distinguished: three main layers, which contain the data storage areas (indicated by filled rectangles) and the data loading software (conventionally shown by arrows of the same color), and an auxiliary layer - the service layer - which nevertheless plays a very important binding role: managing data loading and controlling data quality.

Primary Data Layer - the layer of primary data (also staging, or the operational layer) - is designed for loading data from source systems and storing the primary information without transformation, in source quality, with support for the full history of changes.
The task of this layer is to abstract the subsequent storage layers from the physical structure of the data sources, the methods of data collection and the methods of extracting the change delta.

Core Data Layer - the warehouse core - the central component of the system, which distinguishes the warehouse from a mere "batch integration platform" or a "big data dump", since its main role is the consolidation of data from different sources, bringing them to unified structures and keys. It is during loading into the core that the main work on data quality and the common transformations, which can be quite complex, takes place.
The task of this layer is to abstract its consumers from the specifics of the logical structure of the data sources and from the need to match data from different systems, and to ensure data integrity and quality.

Data Mart Layer - analytical data marts (showcases) - the component whose main function is converting data into structures convenient for analysis (if BI works with the marts, that is usually a dimensional model), or into structures matching the requirements of a consumer system.
As a rule, the marts take data from the core, as a reliable and verified source - that is, they use the service of this component for bringing data to a unified form. We will call such marts regular. In some cases marts can take data directly from staging - operational primary data, in source keys. This approach is usually used for local tasks where consolidation of data from different systems is not required and where timeliness matters more than data quality. Such marts are called operational. Some analytical indicators have very complex calculation methods, so for such non-trivial calculations and transformations so-called secondary marts are created.
The task of the marts layer is preparing data according to the requirements of a specific consumer - a BI platform, a group of users, or an external system.

Each of the layers described above consists of a persistent data storage area and a software module for loading and transforming the data. This division into layers and areas is logical. Physically the implementation of these components may vary - you can even use different platforms for storing or transforming the data of different layers, if that is more efficient.
The storage areas contain technical (buffer) tables, used in the data transformation process, and target tables, which the consumer component works with. It is good practice to "cover" the target tables with views: this simplifies the subsequent maintenance and development of the system. The data in the target tables of all three layers is marked with special technical fields (meta attributes), which serve to support the data loading processes and make an information audit of the data flows in the warehouse possible.
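As an illustration, here is a possible (purely hypothetical) set of such meta attributes in a Python sketch; the field names are not a standard, just an example of what the marking might carry.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class MetaAttributes:
    """A possible set of technical fields added to every target table row.
    The names are illustrative, not a standard."""
    load_id: int              # identifier of the load process that wrote the row
    load_ts: datetime         # when the row was written
    source_system: str        # code of the originating source system
    delta_flag: str           # e.g. 'I' (insert), 'U' (update), 'D' (logical delete)


@dataclass
class CustomerRow:
    """A business row in a target table, carrying its meta attributes alongside."""
    customer_key: int
    name: str
    meta: MetaAttributes


row = CustomerRow(
    customer_key=42,
    name="ACME Ltd",
    meta=MetaAttributes(load_id=20240115001, load_ts=datetime.now(),
                        source_system="CRM", delta_flag="I"),
)
```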

A special component (or set of components) is also distinguished, which provides service functions for all layers. One of its key tasks is the control function - providing "uniform rules of the game" for the system as a whole, while leaving the freedom to use different implementation options for each of the layers described above, including different data loading and processing technologies, different storage platforms, and so on. We will call it the service layer. It contains no business data, but it has its own storage structures: an area of metadata, as well as an area for working with data quality (and possibly other structures, depending on the functions assigned to it).

Such a clear division of the system into separate components significantly improves the manageability of system development:

  • the complexity of the task given to the developer of a particular component is reduced (he does not have to simultaneously solve integration with external systems, devise data cleansing procedures and think about the optimal presentation of the data to consumers) - the task is easier to decompose, estimate and deliver in small increments;
  • different performers (and even teams or contractors) can be brought into the work, because this approach makes it possible to parallelize tasks effectively, reducing their mutual influence on one another;
  • the presence of persistent staging makes it possible to connect data sources quickly, without designing the entire core or the marts for the whole subject area, and then gradually build up the remaining layers according to priorities (and the data will already be in the warehouse, accessible to the system analysts, which significantly eases the tasks of subsequent warehouse development);
  • the presence of the core makes it possible to hide all the work on data quality (as well as possible gaps and errors) from the marts and from the end user, and most importantly, using this component as a single data source for the marts avoids data convergence problems, thanks to the common algorithms being implemented in one place;
  • singling out the marts makes it possible to account for the differences and specifics in how users from different divisions understand the data, while designing them for BI requirements allows not only issuing aggregated figures, but also ensuring verification of data reliability by providing drill-down to the primary indicators;
  • the presence of the service layer makes it possible to perform end-to-end data analysis (data lineage), use unified data audit tools, common approaches to extracting the change delta, working with data quality, load management, monitoring and error diagnostics tools, and it speeds up problem resolution.
This approach to decomposition also makes the system more resistant to change (compared with a "monolithic design") - it provides its antifragility:
  • changes on the side of the source systems are handled in staging - in the core, only the flows that depend on the affected staging tables are modified, and the effect on the marts is minimal or absent;
  • changes in consumer requirements are handled for the most part in the marts (unless they require additional information that is not yet in the warehouse).
Next, we will go through each of the components listed above and look at them in a little more detail.

The system core

Let's start "from the middle" - the core of the system or the middle layer. On designated as Core Layer. The kernel acts as data consolidation - bringing to unified structures, directories, keys. Here is the main work with data quality - cleaning, transformation, unification.

The presence of this component makes it possible to reuse the data flows that transform the primary data received from the source systems into a single format, following common rules and algorithms, instead of re-implementing the same functionality separately for each application mart, which, besides the inefficient use of resources, can also lead to data discrepancies.
The core is implemented in a data model that, in the general case, differs both from the models of the source systems and from the formats and structures of the consumers.

The warehouse core model and the corporate data model

The main requirement for the middle layer of the warehouse is stability. That is why the main emphasis here is placed on the data model. It is commonly called the "corporate data model". Unfortunately, a certain halo of myths and misconceptions has formed around it, which sometimes leads to abandoning its construction altogether - and in vain.

Myth 1. A corporate data model is a huge model consisting of thousands of entities (tables).
In reality: in any subject area, in any business domain, in the data of any company, even the most complex, there are only a few core entities - 20 to 30.

Myth 2. There is no need to develop any "model of our own" - we buy an industry reference model and do everything according to it. We spend money, but we get a guaranteed result.
In reality: reference models can indeed be very useful, because they contain the industry's experience of modeling this area. From them you can draw ideas, approaches and naming practices, and check the "depth of coverage" of the area so as not to miss something important. But we can hardly use such a model "out of the box", as is. This is the same kind of myth as buying an ERP (or CRM) system and implementing it without any "tailoring to yourself". The value of such models is born in their adaptation to the realities of this particular business, this particular company.

Myth 3. Developing the warehouse core model can take many months, during which the project will effectively be frozen. Besides, it requires an insane number of meetings and the participation of a great many people.
In reality: the warehouse model can be developed together with the warehouse, iteratively, piece by piece. "Extension points" or "stubs" are put in place for areas not yet covered - that is, certain "universal constructs" are applied. At the same time, you need to know where to stop, so that you do not end up with a super-universal thing of 4 tables into which it is hard to "put data in" and (even harder) to get it out, and which works extremely sub-optimally in terms of performance.

Time really is required to develop the model. But this is not time spent "drawing entities" - it is time needed to analyze the subject area and to understand the data. That is why analysts are so deeply involved in this process, and various business experts are brought in as well. And this is done pointwise, selectively - not by organizing meetings with an insane number of participants, sending out huge questionnaires, and so on.
Quality business and system analysis is the key thing when building the warehouse core model. You need to understand many things: where (in which systems) the data is born, how it is structured, in which business processes it circulates, and so on. Quality analysis has never yet harmed a single system. Rather the opposite - problems arise from the "blank spots" in our understanding.

Developing a data model is not a process of inventing and making up something new. In fact, the data model of the company already exists. And the process of designing it is more like "excavations". The model is carefully and painstakingly extracted from the "soil" of corporate data and put into a structured form.

Myth 4. Our business is so dynamic, and everything changes so quickly, that making a model is useless - it will become outdated before we even put this part of the system into operation.
In reality: recall that the key factor for the core is stability - above all, the stability of the model's topology. Why? Because this component is central and affects everything else. Stability is a requirement for the core model. If a model becomes outdated too quickly, it was designed incorrectly: the wrong approaches and "rules of the game" were chosen for its development. It is also a question of the quality of the analysis. The key entities of a corporate model change extremely rarely.
But if it occurs to us, for a company selling confectionery, to create the tables "Candies", "Cakes" and "Pies" instead of a generic "Products" entity - then when pizza appears in the list of goods, yes, a lot of new tables will have to be introduced. And that is purely a question of approach.
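The contrast between the two approaches can be sketched in a few lines of Python (the entities are, of course, invented for the example): with a generic Product entity and a product-type reference, a new assortment item is just new data, not a new structure.

```python
from dataclasses import dataclass


# Fragile approach (illustrative): one structure per product type.
# Adding "pizza" to the assortment means adding new structures everywhere.
@dataclass
class Candy:
    name: str
    price: float


@dataclass
class Cake:
    name: str
    price: float


# Stable approach: a generic Product entity with a product-type reference.
# A new assortment item is just a new row in a reference table.
@dataclass
class ProductType:
    code: str          # e.g. "CANDY", "CAKE", "PIE", "PIZZA"
    description: str


@dataclass
class Product:
    name: str
    price: float
    product_type: ProductType


pizza = Product("Margherita", 9.5, ProductType("PIZZA", "Pizza"))
```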

Myth 5. Creating a corporate model is a very serious, complex and responsible undertaking. And it is terrifying to make a mistake.
In reality: the core model should be stable, but it is still not "cast in metal". Like any other design decision, its structure can be reviewed and modified. You just should not forget about that property. But it does not at all mean that "you must not breathe on it". Nor does it mean that temporary solutions and "stubs", which should be scheduled for rework, are unacceptable.

Myth 6. If we have a data source - for example, a master reference data (NSI) system, or a master data management (MDM) system - then, in theory, it should correspond to the corporate model (especially if it was designed recently and has not had time to acquire "side growths", "traditions" and lean-tos). It turns out that in this case we do not need a core model?
In reality: yes, in this case building the warehouse core model is greatly simplified, because we follow a ready-made top-level conceptual model. But it is not eliminated at all. Why? Because when building the model of a specific system, certain rules of its own apply: which table types to use (for each entity), how to version the data, with what granularity to keep history, which meta attributes (technical fields) to use, and so on.

Besides, no matter how wonderful and all-encompassing our NSI and MDM systems are, as a rule there will be nuances connected with the existence of local reference data "about the same thing" in other accounting systems. And this problem, whether we like it or not, will have to be solved in the warehouse - after all, this is where reporting and analytics are assembled.

Primary data layer (historized staging, or the operational layer)

It is designated as the Primary Data Layer. The role of this component: integration with the source systems, loading and storing the primary data, and preliminary data cleansing - checking compliance with the rules of format and logical control fixed in the "interface interaction agreement" with the source.
In addition, this component solves a task that is very important for the warehouse - extracting the "true change delta", regardless of whether the source allows changes in the data to be tracked, and how (by which criterion they can be "caught"). As soon as the data has landed in staging, the question of the delta is already settled for all the other layers, thanks to the marking with meta attributes.
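If the source cannot report its own changes, one widely used way to extract the "true delta" is to compare a hash of each incoming row with the hash stored for that business key. A minimal sketch, with illustrative field names, is shown below.

```python
import hashlib
from typing import Dict, List, Tuple


def row_hash(row: dict, business_key: str = "id") -> str:
    """Hash of all attributes except the key; used to detect real changes."""
    payload = "|".join(f"{k}={row[k]}" for k in sorted(row) if k != business_key)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def detect_delta(incoming: List[dict],
                 stored_hashes: Dict[str, str]) -> List[Tuple[str, dict]]:
    """Compare a full extract against previously stored hashes and return
    only the rows that are new ('I') or genuinely changed ('U')."""
    delta = []
    for row in incoming:
        key, h = str(row["id"]), row_hash(row)
        if key not in stored_hashes:
            delta.append(("I", row))
        elif stored_hashes[key] != h:
            delta.append(("U", row))
        # unchanged rows are skipped - the "true delta" excludes them
    return delta


stored = {"1": row_hash({"id": 1, "name": "ACME", "city": "Riga"})}
incoming = [{"id": 1, "name": "ACME", "city": "Tallinn"},    # changed
            {"id": 2, "name": "Globex", "city": "Vilnius"}]  # new
print(detect_delta(incoming, stored))  # [('U', {...}), ('I', {...})]
```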

The data in this layer is stored in structures as close as possible to the source system - to preserve the primary data in a form as close as possible to its original appearance. Another name for this component is the "operational layer".
Why not simply use the established term "staging"? The point is that earlier, before the "era of big data and VLDB", disk space was very expensive - and the primary data, if it was kept at all, was kept only for a limited period of time. Often the name "staging" refers to a buffer that gets cleaned out.
Now technology has moved forward, and we can afford not only to keep all the primary data, but to historize it at whatever granularity is possible. This does not mean that we should not control data growth, and it does not cancel the need to manage the information life cycle, optimizing the cost of data storage depending on the "temperature" of use - that is, moving "cold data", which is less in demand, to cheaper storage media and platforms.

What gives us a "historically placed Steidzhin":

  • the ability to make mistakes (in structures, in transformation algorithms, in the granularity of history) - having fully historized primary data within the warehouse's reach, we can always reload our tables;
  • the ability to think - we do not have to rush into working out a large fragment of the core in the current iteration of warehouse development, because the data will in any case be in our staging, with a single time horizon (the "starting point of history" will be the same);
  • the ability to analyze - we keep even data that no longer exists in the source (it may have been lost there, moved to an archive, and so on) - for us it remains available for analysis;
  • the ability to perform an information audit - thanks to the most detailed primary information, we can later work out how the load ran and why we ended up with these particular figures (for this you need marking with meta attributes and the corresponding metadata describing how the load works - this is handled on the service layer).
What difficulties can arise when building a "historized staging":
  • it would be convenient to impose transactional integrity requirements on this layer, but practice shows that this is hard to achieve (which means that in this area we do not guarantee referential integrity of parent and child tables) - integrity is restored on the subsequent layers;
  • this layer holds very large volumes (the largest in the warehouse, despite all the redundancy of the analytical structures) - and you need to be able to handle such volumes, both in terms of loading and in terms of queries (otherwise you can seriously degrade the performance of the whole warehouse).
What else can be said about this layer?
First, if we move away from the paradigm of "end-to-end loading processes", then the rule "the caravan moves at the speed of the last camel" no longer applies to us. More precisely, we abandon the "caravan" principle and switch to the "conveyor" principle: take the data from the source, put it into your layer, be ready to take the next portion. This means that
1) we do not wait for processing to happen on the other layers;
2) we do not depend on the data provision schedules of other systems.
Simply put, we schedule a load process that takes data from one source, through a specific way of connecting to it, checks it, extracts the delta, and puts the data into the target staging tables. And that is all.
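A minimal sketch of such a self-contained, schedule-driven step is given below (all names are illustrative): it validates what it received, marks it with meta attributes and lands it in staging, knowing nothing about the core or the marts.

```python
from datetime import datetime
from typing import Callable, Iterable, List


def staging_load_step(source_name: str,
                      extract: Callable[[], Iterable[dict]],
                      validate: Callable[[dict], bool],
                      write_staging: Callable[[List[dict]], None]) -> dict:
    """One "conveyor" step for one source: extract, check, mark, land in staging.
    It can run on its own schedule, independently of other layers and sources."""
    load_id = int(datetime.now().strftime("%Y%m%d%H%M%S"))
    accepted, rejected = [], 0
    for row in extract():
        if validate(row):                  # format/logical control per the interface agreement
            row["_load_id"] = load_id      # meta attributes settle the delta question downstream
            row["_source"] = source_name
            accepted.append(row)
        else:
            rejected += 1
    write_staging(accepted)
    return {"source": source_name, "load_id": load_id,
            "loaded": len(accepted), "rejected": rejected}


# Example run with stubbed callables:
result = staging_load_step(
    "CRM",
    extract=lambda: [{"id": 1, "name": "ACME"}, {"id": None, "name": "bad"}],
    validate=lambda r: r["id"] is not None,
    write_staging=lambda rows: None,       # a real step would insert into staging tables
)
print(result)
```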

Second, these processes, as you can see, are very simple - trivial, one might say, in terms of logic. Which means they can be very well optimized and parameterized, reducing the load on our system and speeding up the process of connecting sources (reducing development time).
For that to happen, you need to know very well the technological features of the platform on which this component runs - and then you can make a very effective tool.

The layer of analytical data marts

The Data Mart Layer is responsible for preparing and providing data to the end consumers - people or systems. At this level, the requirements of the consumer are taken into account as fully as possible - both logical (conceptual) and physical. The service must provide exactly what is needed - no more, no less.

If the consumer is an external system, then as a rule it dictates the data structures it needs and the schedule for picking up the information. A good approach is considered to be one in which the consumer itself is responsible for correctly picking up the data: the warehouse has prepared the data, formed the mart, provided the possibility of incremental data extraction (marking with meta attributes for the subsequent identification of the change delta), and the consumer system then controls, and is responsible for, how it uses this mart. But there are particular cases: when the consumer system has no active component for data collection, either an external component is needed to perform the integrating function, or the warehouse acts as an "integration platform" and ensures the correct incremental shipment of the data onward, outside the warehouse. Many nuances surface here, and the rules of interface interaction must be thought through and clear to both sides (as always when it comes to integration). As a rule, scheduled cleansing/archiving of data is applied to such marts (it is rarely necessary to keep such "transit data" for a long time).

Of the greatest value from the point of view of analytical tasks are the marts "for people" - more precisely, for the BI tools they work with.
However, there is a category of "highly advanced users" - analysts, data scientists - who need neither BI tools nor scheduled processes for filling external specialized systems. They need some "common marts" and "a sandbox of their own", where they can create tables and transformations at their own discretion. In this case the warehouse's responsibility is to ensure that these common marts are filled in accordance with the schedule.
Separately, one can single out such consumers as data mining tools - tools for deep data analysis. They have their own data preparation requirements, and data science experts work with them too. For the warehouse, the task again reduces to supporting a service for loading certain marts in an agreed format.

However, let us return to the analytical marts. It is they that are of interest to warehouse designers and developers in this data layer.
In my opinion, the best approach to designing data marts - time-tested, and the one that practically all BI platforms are now "tuned" for - is Ralph Kimball's approach. It is known as dimensional modeling - multidimensional modeling. There are a great many publications on this topic. For example, the basic rules can be found in a publication by Margy Ross. And, of course, the books by the guru of multidimensional modeling himself can be recommended. Another useful resource is the "Kimball Design Tips".
The multidimensional approach to creating marts has been described and worked through so well - both by the "evangelists of the method" and by the leading software vendors - that there is no point in dwelling on it in any special detail here; the original sources are always preferable.

I would like to place only one accent. "Reporting and analytics" comes in different kinds. There is "heavy reporting" - pre-ordered reports that are generated as files and delivered to users via the provided delivery channels. And there are dashboards - BI dashboards. In essence, these are web applications, and the same response-time requirements apply to them as to any other web application: the normal refresh time of a BI panel is seconds, not minutes. It is important to remember this when developing a solution. How do you achieve it? The standard optimization method: look at what the response time is made up of and what we can influence. Where is the most time spent? On physical (disk) reads from the database and on transferring data over the network. How do you reduce the volume of data read and transferred per query? The answer is obvious and simple: the data must either be aggregated, or a filter must be applied to the large fact tables participating in the query, and joins between large tables must be excluded (the fact tables should be accessed only through dimensions).
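To make the query shape concrete, here is a small sketch (SQLite is used purely so the example is self-contained; tables and columns are invented): the fact table is reached only through dimension filters, and the aggregation happens in the database before anything travels to the BI tool.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, year INTEGER);
    CREATE TABLE fact_sales  (product_key INTEGER, date_key INTEGER, amount REAL);
    INSERT INTO dim_product VALUES (1, 'Candy'), (2, 'Cake');
    INSERT INTO dim_date VALUES (20240101, 2024), (20230101, 2023);
    INSERT INTO fact_sales VALUES (1, 20240101, 10.0), (2, 20240101, 25.0), (1, 20230101, 7.0);
""")

# Filter via small dimensions and aggregate in the database - do not drag raw fact
# rows into the BI tool, and never join two large fact tables to each other.
query = """
    SELECT p.category, SUM(f.amount) AS total_amount
    FROM fact_sales f
    JOIN dim_product p ON p.product_key = f.product_key
    JOIN dim_date d    ON d.date_key    = f.date_key
    WHERE d.year = 2024
    GROUP BY p.category
    ORDER BY p.category
"""
print(con.execute(query).fetchall())   # [('Cake', 25.0), ('Candy', 10.0)]
```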

What is BI? Why is it convenient? And why the multidimensional model?
BI allows the user to perform so-called "ad hoc queries". What does this mean? It means that we do not know the exact query in advance, but we do know which indicators, in which cuts, the user may request. The user forms such a query by choosing the corresponding BI filters. And the task of the BI developer and the mart designer is to provide such application logic that the data is either filtered or aggregated, preventing the situation where too much data is requested and the application "hangs". Usually one starts with aggregated figures and then drills down to more detailed data, setting the required filters along the way.

It is not always enough simply to build a "correct star" to get a structure convenient for BI. Sometimes you will need to apply denormalization somewhere (keeping an eye on how it will affect loading), and somewhere to create secondary marts and aggregates. Somewhere to add indexes or projections (depending on the DBMS).

Thus, by "samples and errors", it is possible to obtain a structure optimal for BI - which will be consistent with the features of both the DBMS and BI-platforms, as well as the data requirements for data presentation.
If we take the data from the "nucleus", then such processing of the showcase will be local in nature, without affecting the complex processing of primary data obtained directly from source systems - we only "shift" data into a convenient format for BI. And we can afford to make it many times, in various ways, in line with various requirements. On the nucleus data, it is much easier and faster than collecting from the "primary" (the structure and rules of which, as we know, can be "swim").

Service layer

The service layer (Service Layer) is responsible for implementing common (service) functions that can be used to process data in the various warehouse layers - load management, data quality management, problem diagnostics and monitoring tools, and so on.
The presence of this level ensures transparency and structure in the data flows of the warehouse.

This layer includes two storage areas:

  • the metadata area - used by the data load management mechanism;
  • the data quality area - for implementing off-line data quality checks (i.e. those not built directly into the ETL processes).
The load management process can be built in different ways. One possible approach is this: we divide the entire set of warehouse tables into modules. A module may include tables of only one layer. The tables belonging to each module are loaded within a separate process. Let us call it the controlling process. Its launch is put on its own schedule. The controlling process orchestrates calls to atomic processes, each of which loads one target table, and also contains some common steps.
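A minimal sketch of such a controlling process is shown below (all names are invented): it wraps the atomic loads of one module with common pre- and post-steps and can be scheduled independently of other modules.

```python
from typing import Callable, Dict, List

# A module groups target tables of one layer; each atomic step loads one table.
AtomicLoad = Callable[[], None]


def controlling_process(module_name: str,
                        atomic_loads: Dict[str, AtomicLoad]) -> List[str]:
    """Orchestrate the atomic loads of one module: common pre-steps,
    one call per target table, common post-steps."""
    completed = []
    print(f"[{module_name}] registering load run, checking preconditions")  # common step
    for table, load in atomic_loads.items():
        load()                                  # loads exactly one target table
        completed.append(table)
    print(f"[{module_name}] writing load statistics")                       # common step
    return completed


# Scheduled independently per module, e.g. one module per source connection point:
controlling_process("staging_crm", {
    "stg_customer": lambda: None,
    "stg_contract": lambda: None,
})
```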
Obviously, it is simple enough to divide the staging tables into modules - by source system, or rather by their connection points. But for the core this is already harder to do, because there we need to ensure data integrity, which means dependencies must be taken into account. That is, collisions will arise that have to be resolved. And there are different methods of resolving them.

An important point in load management is the development of a unified approach to error handling. Errors are classified by their level of criticality. When a critical error occurs, the process must stop, and as quickly as possible, because its occurrence signals a significant problem that can lead to data corruption in the warehouse. So load management is not only starting processes, but also stopping them, as well as preventing late starts (by mistake).
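A possible shape of such a unified approach, sketched in a few lines (the severity levels and names are illustrative): only critical errors stop the run, everything else is logged and reported.

```python
from enum import Enum


class Severity(Enum):
    INFO = 1        # log and continue
    WARNING = 2     # continue, but raise a notification event
    CRITICAL = 3    # stop the controlling process immediately


class CriticalLoadError(RuntimeError):
    pass


def handle(step: str, severity: Severity, message: str) -> None:
    """Unified error handling: only critical errors stop the load, because they
    may otherwise corrupt data in the warehouse."""
    print(f"{severity.name} in {step}: {message}")   # in reality: a typed event + notification
    if severity is Severity.CRITICAL:
        raise CriticalLoadError(f"{step}: {message}")


handle("stg_customer", Severity.WARNING, "3 rows rejected by format control")
# handle("core_customer", Severity.CRITICAL, "duplicate business keys detected")  # would stop the run
```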

For the service layer, a special metadata structure is created. This area stores information about the load processes, the loaded data sets, the checkpoints used to maintain increments (which process has read up to which point), and other service information needed for the system to function.
It is important to note that all target tables in all layers are marked with a special set of meta fields, one of which is the identifier of the process that updated the given row. For tables inside the warehouse, this process marking allows a unified way of subsequently identifying the change delta. When loading data into the primary data layer the situation is more complicated - the algorithm for extracting the delta may differ for different loaded objects. On the other hand, the logic of processing the received changes and rolling them onto the target tables of the core and the marts is much more complex than in staging, where everything is fairly trivial - easy to parameterize and to think through as standard, reusable steps (procedures).
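The way downstream loads can use that process identifier for incremental reads can be sketched as follows (field and parameter names are invented): each consumer keeps a checkpoint of the last load it has processed and takes only rows written by later loads.

```python
from typing import Dict, List


def read_increment(table_rows: List[dict],
                   checkpoints: Dict[str, int],
                   consumer: str) -> List[dict]:
    """Unified incremental read: take only rows written by load processes the
    consumer has not yet seen, then advance the consumer's checkpoint."""
    last_seen = checkpoints.get(consumer, 0)
    delta = [r for r in table_rows if r["_load_id"] > last_seen]
    if delta:
        checkpoints[consumer] = max(r["_load_id"] for r in delta)
    return delta


core_customer = [
    {"customer_key": 1, "name": "ACME",   "_load_id": 101},
    {"customer_key": 2, "name": "Globex", "_load_id": 102},
]
checkpoints = {"mart_sales": 101}
print(read_increment(core_customer, checkpoints, "mart_sales"))  # only the row from load 102
```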

I do not set out here to cover the topic of load organization in full - I am only placing the accents that deserve attention.
The approach described is just one of the options. It is quite adaptive. Its "conceptual prototype" was the Toyota conveyor and the "just-in-time" system. That is, we move away from the widespread paradigm of an exclusively "nightly load" and load in small portions during the day, as the data becomes ready in the various sources: whatever has arrived gets loaded. At the same time we have many parallel processes running. The "hot tail" of fresh data will constantly "flicker" and then even out after a while. We must take this feature into account and, if necessary, build custom marts with "cuts" where everything is already consistent. That is, it is impossible to achieve both timeliness and consistency (integrity) at the same time. A balance is needed - in one place one matters more, elsewhere the other.

It is extremely important to provide logging and monitoring tools. Good practice is the use of typed events, for which you can set different parameters and configure a notification system - subscriptions to certain events. It is very important that when the system administrator's intervention is required, he learns about it as early as possible and receives all the necessary diagnostic information. The logs can also be used for "after-the-fact" analysis of problems and for investigating incidents of system malfunction, including data quality incidents.

Designing and maintaining warehouse data models

Why, when developing any system that involves a database (and a data warehouse especially), is it important to pay attention to the design of data models? Why not just throw together a set of tables anywhere - even in a text editor? Why do we need "these pictures"?
Oddly enough, even experienced developers raise such questions.
Actually, yes, nothing prevents you from sketching out tables and starting to use them. If... the developer holds in their head (!) a coherent overall picture of the structure being built. But what if there are several developers? What if someone else uses these tables? What if time passes, a person leaves this area, and then comes back to it?

Can you manage without a model? In principle, you can: figure things out, "scribble pictures on a piece of paper", and "walk through" the data. But it is much easier, more precise, and faster to use a ready artifact - the data model - and to understand the "logic of its construction", i.e. to have common rules of the game.

And that is not even the main thing. The most important thing is that while designing the model we are forced (simply no way around it!) to study the subject area more closely and deeply - the specifics of the data and its use in various business cases. The questions we would easily have pushed aside as complicated or "murky" when just throwing tables together without trying to design a model, we are forced to pose and resolve now, during analysis and design - not later, when we are building reports and wondering how to combine the incompatible and "reinvent the wheel" every time.

This approach is one of those engineering practices that make it possible to create antifragile systems. They are clearly organized, transparent, convenient to develop, and their "boundaries of fragility" are immediately visible - so you can assess the "scale of the disaster" more accurately when new requirements appear, along with the time required for a redesign (if one is needed).
Thus, the data model is one of the main artifacts that must be maintained throughout the development of the system. Ideally, it should be "on the desk" of every analyst, developer, and so on - everyone who takes part in the project.

Designing data models is a separate and very extensive topic. Two main approaches are used when designing warehouses.
For the core, the entity-relationship approach works well: a normalized (3NF) model is built on the basis of studying the subject area, or rather the selected part of it. This is where the same "corporate model" discussed above comes into play.

When designing analytical marts, the multidimensional model is a good fit. It maps well onto the way business users think, because it is a model that is simple and convenient for human perception: people operate with the understandable and familiar notions of metrics (indicators) and the cuts (dimensions) by which they are analyzed. This lets us build the requirements-gathering process simply and clearly: we draw a set of "matrices of cuts and indicators" while talking to representatives of the various departments. Then we reduce them into one structure - the "analysis model": we form a "bus of dimensions" and define the facts that are defined on them. Along the way we work out hierarchies and aggregation rules.
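A sketch of reducing the per-department "matrices of cuts and indicators" into a single bus of conformed dimensions and a list of facts; the department names, dimensions, and measures are invented for illustration.

```python
# Requirements collected per department: which measures they need, by which cuts (dimensions).
requirements = {
    "sales":     {"measures": ["revenue", "units_sold"], "dimensions": ["date", "product", "store"]},
    "logistics": {"measures": ["units_shipped"],         "dimensions": ["date", "product", "warehouse"]},
    "finance":   {"measures": ["revenue", "margin"],     "dimensions": ["date", "product", "legal_entity"]},
}

# Conformed "bus of dimensions": every dimension mentioned anywhere, defined once.
dimension_bus = sorted({d for req in requirements.values() for d in req["dimensions"]})

# Facts: measures mapped onto the set of dimensions on which they are defined.
facts = {}
for req in requirements.values():
    for measure in req["measures"]:
        facts.setdefault(measure, set()).update(req["dimensions"])

print("dimension bus:", dimension_bus)
for measure, dims in facts.items():
    print(f"fact {measure}: defined on {sorted(dims)}")
```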

From there it is quite easy to move on to the physical model, adding optimization elements that take the specifics of the DBMS into account. For Oracle this will be partitioning, a set of indexes, and so on; for Vertica other techniques will be used - sorting, segmentation, partitioning.
Deliberate denormalization may also be required - when we intentionally introduce redundancy into the data to improve query speed, at the cost of complicating updates (because the redundancy has to be accounted for and maintained during loading). To improve speed we may also have to create additional aggregate tables, or use extra DBMS capabilities such as projections in Vertica.
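A sketch of a deliberately redundant aggregate built on top of an atomic fact table to speed up typical queries, at the price of having to rebuild it on every load. Plain Python stands in for DBMS-specific DDL (partitions, projections), and all names and figures are illustrative.

```python
from collections import defaultdict

# Atomic fact rows: (date, product, store, revenue)
fact_sales = [
    ("2024-01-01", "tea",    "store_1", 100.0),
    ("2024-01-01", "coffee", "store_1", 250.0),
    ("2024-01-02", "tea",    "store_2", 120.0),
]

def rebuild_monthly_aggregate(rows):
    """Aggregate by (month, product): redundant data that must be refreshed on every load."""
    agg = defaultdict(float)
    for date, product, _store, revenue in rows:
        month = date[:7]                  # "YYYY-MM"
        agg[(month, product)] += revenue
    return dict(agg)

print(rebuild_monthly_aggregate(fact_sales))
# {('2024-01', 'tea'): 220.0, ('2024-01', 'coffee'): 250.0}
```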

So, when modeling warehouse data we actually solve several tasks:

  • building the conceptual (logical) model of the core - system and business analysis: studying the subject area, going into the details, and accounting for the nuances of "live data" and its use in the business;
  • building the analysis model - and, further, the conceptual (logical) model of the marts;
  • building the physical models - managing data redundancy and optimizing for queries and data loading with the specifics of the DBMS in mind.
When developing conceptual models we may ignore the specifics of the particular DBMS for which we are designing the database structure. Moreover, one conceptual model can be used to produce several physical models - for different DBMSs.

To summarize:

  • A data model is not a set of "pretty pictures", and designing it is not the process of drawing them. The model reflects our understanding of the subject area, and preparing it is the process of studying and researching that area. This is where the time goes - not on "drawing and coloring".
  • A data model is a project artifact, a way of sharing information in a structured form between team members. For this it must be understandable to everyone (which is ensured by notation and explanations) and accessible (published).
  • The data model is not created once and frozen; it is created and evolves as the system develops. We set the rules for its evolution ourselves, and we can change them if we see how to make things better, simpler, or more effective.
  • The (physical) data model makes it possible to consolidate and reuse a set of best practices aimed at optimization - i.e. the techniques already known to work for the given DBMS.

Features of data warehouse projects


Let us dwell on the features of projects in which data warehouses are built and developed, and look at them from the point of view of the architectural aspect. Why is it important to build the architecture for such projects, and from the very beginning? It is precisely a well-thought-out architecture that gives a data warehouse project its flexibility, allows the work to be distributed effectively among the performers, and makes the result easier to predict and the process more manageable.

The data warehouse is custom-built

A data warehouse is always a custom development, not a boxed solution. Yes, there are industry-specific BI applications that include a reference data model, preconfigured ETL processes from common sources (for example, ERP systems), and a set of typical BI dashboards and reports. But in practice a warehouse is extremely rarely implemented as a "box". I have worked with warehouses for about ten years and have never seen such a story. There are always nuances tied to the unique features of the company - both its business and its IT landscape. So hoping that the architecture will be supplied by the "vendor" delivering the solution is somewhat rash. The architecture of such systems usually "matures" inside the organization itself, or is shaped by the specialists of the contractor acting as the main performer on the project.

Data warehouse is an integration project.

The data warehouse loads and processes information from many source systems, and to keep "friendly relations" with them you have to treat them with great care. In particular, you need to minimize the load on the source systems, take their availability and unavailability windows into account, choose interaction interfaces that suit their architecture, and so on. Then the warehouse will be able to pick up data as early as possible and with the required frequency. Otherwise you will be "moved over" to a standby contour that is not refreshed with the most up-to-date periodicity.
In addition, the "human factor" has to be taken into account. Integration is not only the interaction of machines; it is also communication between people.

Data warehouse is a collective project.


In a large company such a system is rarely built by a single team. As a rule, several teams work on it, each solving its own task.

The architecture must allow their work to be organized in parallel while preserving integrity and avoiding duplication of the same functionality in different places by different people. Besides the wasted effort, such duplication can later lead to discrepancies in the data.

Moreover, when so many people and teams are involved in developing the system, the question inevitably arises of how to organize communication and information exchange between them. The more standard and understandable the approaches and practices used, the easier, more convenient, and more efficient it is to set up this work. Among other things, it is worth thinking about the set of "working artifacts", and for data warehouses the data models are artifact number one (see the previous section).

Data warehouse has a longer service life compared to other systems.

Let me clarify: the statement is true for a "living", working warehouse that is integrated with the key sources, holds historical data, and provides information and analytical services to many of the company's divisions.

What grounds do I have for thinking so?
First, building a warehouse is a very resource-intensive process: besides the cost of hardware, licenses for the necessary technology software, and development, almost all of the company's systems and divisions are involved. Repeating the whole process from scratch would be a very bold venture.

Second, if the warehouse has the right architecture, it can fairly easily survive changes in the source systems, the emergence of new requirements from end users, and the growth of data volumes.
If the architecture is right and the information flows are transparent, such a system can be developed for quite a long time without the risk of ending up paralyzed when making changes because impact assessment has become too difficult.

Gradual iterative development

The last thing a customer who has embarked on a warehouse story wants is to freeze their requirements for a year or two until the full corporate data model is designed, all sources are connected in full, and so on.

In the eyes of customers, the data warehouse often looks like an absolute monster - such is the volume of tasks, goals, and the horizon of the system's development. And the customer is often afraid that "at the expense of their budget" the IT department will solve "its own tasks". Once again we run into the question of interaction between people and the ability to state one's position calmly and to negotiate.

Competent architectural approaches allow the system to be developed iteratively, increasing functionality gradually, without disappearing into "development" for several years before starting to deliver results.

That said, "miracles do not happen" - the start-up also takes time. For a warehouse it can be considerable, since these are large volumes of data and historical data from old periods, when the information processing rules may have differed from the current ones. So enough time is needed for analytical work, interaction with the source systems, and a series of "trials and errors", including load tests on real data.

Data Warehouse - Multi-Project Story

It is difficult to single out one business customer for a data warehouse, and it is believed (not without reason) that the key success factor in building a warehouse is the support of the company's leadership - of the very first person directly.
A warehouse is rarely built and developed within a single project. As a rule, there are various needs for data consolidation and analytics, with different customers and user groups behind them. So the warehouse often evolves within several projects running in parallel.

Balance of innovation and proven solutions

The warehouse topic is quite "ancient" (if that word is applicable to such a young industry as IT) and rather conservative. Nevertheless, progress does not stand still, and the constraints that used to exist because of expensive and slow disks, expensive memory, and so on, have now been removed. At the same time, it is time to reconsider some architectural approaches - and this applies both to technology platforms and to the architecture of the applied systems built on top of them.

It is important to keep the balance - and to maintain a reasonably "ecological" attitude both to resources and to the stored information. Otherwise the repository can very quickly turn into a poorly structured "dump" that can be understood, if at all, only with considerable effort.
Yes, we have more possibilities now, but that does not mean we should discard all the accumulated and proven practices - the ones we know how and why to use - and "go all in", driven only by the foggy ghost of "innovation".
Keeping the balance means using new methods and approaches where they open up new possibilities, while using the old proven ones to solve the pressing tasks that nobody has cancelled.
What can we do as developers and designers of applied solutions? First of all, know and understand the technological changes in the platforms we work on - their capabilities, features, and limits of applicability.

Let's look at the DBMS as the most critical and important technology platform.
Recently, databases that were originally created as "universal" have been drifting towards specialization. The leading vendors have long been releasing different options for different classes of applications (OLTP, DSS & DWH). In addition, extra capabilities appear for working with text, geodata, and so on.

But it does not end there: products have appeared that were focused on a certain class of tasks from the start, i.e. specialized DBMSs. They may or may not use the relational model. What matters is that they are initially "sharpened" not simply for storing and processing "business information" in general, but for particular tasks.

Apparently, centralization and specialization are two complementary trends that periodically replace one another, ensuring development and balance - just like evolutionary (gradual) development and cardinal change. Back in the 1990s, Michael Stonebraker was one of the authors of the Third-Generation Database System Manifesto, in which the idea was clearly voiced that the world does not need another revolution in the database world. However, ten years later he published work announcing the prerequisites for the start of a new era in the DBMS world - based precisely on specialization.
He points out that common general-purpose DBMSs are built on a "one-size-fits-all" architecture that takes into account neither the changes in hardware platforms, nor the division of applications into classes for which a more optimal solution can be devised than by implementing universal requirements.
He then started developing a number of projects in line with this idea. One of them is C-Store - a column-oriented DBMS designed in the shared-nothing (SN) architecture, originally created specifically for data-warehouse-class systems. This product later received commercial development as HP Vertica.

It seems that the data warehousing topic has now moved on to a new round of development. New technologies, approaches, and tools are appearing. Studying them, trying them out, and applying them sensibly lets us create really interesting and useful solutions - and bring them into production, enjoying the fact that your work is used in real life and brings benefit.

Epilogue

While preparing this article I tried to focus primarily on architects, analysts, and developers who work directly with data warehouses. But it turned out that I inevitably "took the topic a little wider", and other categories of readers came into view. Some points will seem debatable, some unclear, some obvious. People are different - with different experience, background, and position.
For example, typical management questions - "When should architects be brought in?", "When should we deal with architecture?", "Architecture - won't it be too expensive?" - sound rather strange to us (developers, designers), because for us the architecture of a system appears with its birth, whether we realize it or not. And even if there is no formal architect role on the project, a normal developer always "switches on their inner architect".

By and large it does not matter who exactly plays the role of the architect - what matters is that someone poses such questions and investigates the answers to them. If an architect is explicitly designated, it only means that they bear primary responsibility for the system and its development.
Why did I find the topic of antifragility relevant to this subject?

"The uniqueness of anti farmhouse is that it allows us to work with the unknown, do something in conditions when we do not understand what exactly we are doing, - and seek success" / Nasim N. Table /
Therefore, the crisis and a high degree of uncertainty is not an excuse for the lack of architecture, but factors that enhance its need.

Zaitsev S.L., Cand. Sc. (Physics and Mathematics)

Repeating groups

Repeating groups are attributes for which a single instance of an entity can have more than one value. For example, a person can have more than one skill. If, from the point of view of business requirements, we need to know the proficiency level for each skill, and each person can have only two skills, we can create the entity shown in Fig. 1.6. Here the entity PERSON has two attributes for storing skills and two for the corresponding proficiency levels.

Fig. 1.6. This example uses repeating groups.

The problem with repeating groups is that we cannot know in advance how many skills a person may have. In real life some people have one skill, some have several, and some have none yet. Figure 1.7 shows the model reduced to first normal form. Note the added Skill identifier, which uniquely identifies each SKILL.

Fig. 1.7. The model reduced to first normal form.
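A sketch of the same restructuring in code: the repeating group (skill_1, skill_2) limits a person to exactly two skill "slots", while the first-normal-form variant keeps one skill fact per row; the attribute names are illustrative, not taken from the figures.

```python
from dataclasses import dataclass

# With a repeating group: exactly two skill slots per person, no more, no fewer.
@dataclass
class PersonWithRepeatingGroup:
    person_id: int
    name: str
    skill_1: str | None = None
    level_1: str | None = None
    skill_2: str | None = None
    level_2: str | None = None

# First normal form: one row per (person, skill), any number of skills per person.
@dataclass
class PersonSkill:
    person_id: int        # part of the composite key
    skill_id: int         # part of the composite key
    level: str

person_skills = [
    PersonSkill(1, 10, "expert"),
    PersonSkill(1, 11, "novice"),
    PersonSkill(2, 10, "intermediate"),
]
```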

One fact in one place

If the same attribute is present in more than one entity and is not a foreign key, that attribute is considered redundant. The logical model must not contain redundant data.

Redundancy requires additional space, but although memory efficiency matters, the real problem lies elsewhere. Guaranteed synchronization of redundant data carries overhead, and you are always at risk of conflicting values.

In the previous example, SKILL depends on the Person identifier and on the Skill identifier. This means a SKILL cannot appear until a PERSON possessing that skill appears. It also makes changing a skill name harder: you have to find every record with that skill name and change it for every person who has the skill.

Figure 1.8 shows the model in second normal form. Note that the SKILL entity has been added, and the skill NAME attribute has been moved into that entity. The skill level stays, accordingly, at the intersection of PERSON and SKILL.

Fig. 1.8. In second normal form the repeating group has been moved to a separate entity. This gives the flexibility to add as many skills as needed and to change a skill's name or description in one place.

Each attribute depends on the key

Every attribute of an entity must depend on the primary key of that entity. In the previous example, School name and Geographic area are present in the PERSON table but do not describe the person. To achieve third normal form, these attributes must be moved to an entity where they will depend on the key. Figure 1.9 shows the model in third normal form.

Fig. 1.9. In third normal form, School name and Geographic region have been moved to an entity where their values depend on the key.

Many-to-many relationships

Many-to-many relationships reflect the reality of the surrounding world. Note that in Figure 1.9 there is a many-to-many relationship between PERSON and SCHOOL: the relationship accurately reflects the fact that a PERSON can study at many SCHOOLS and a SCHOOL can teach many PERSONS. An associative entity is created to reach fourth normal form; it eliminates the many-to-many relationship by forming a separate record for each unique combination of school and person. Figure 1.10 shows the model in fourth normal form.

Fig. 1.10. In fourth normal form the many-to-many relationship between PERSON and SCHOOL is resolved by introducing an associative entity, in which a separate record is assigned to each unique combination of SCHOOL and PERSON.
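A sketch of the associative entity that resolves the many-to-many relationship, one row per unique (person, school) combination; the attribute names are illustrative.

```python
from dataclasses import dataclass

@dataclass
class Person:
    person_id: int
    name: str

@dataclass
class School:
    school_id: int
    name: str
    geographic_area: str

@dataclass
class PersonSchool:
    """Associative entity: one record per unique combination of PERSON and SCHOOL."""
    person_id: int     # part of the composite key, refers to Person
    school_id: int     # part of the composite key, refers to School

enrollments = [
    PersonSchool(1, 100),   # person 1 studied at two schools...
    PersonSchool(1, 101),
    PersonSchool(2, 100),   # ...and school 100 taught two people
]
```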

Formal definitions of normal forms

The following definitions of normal forms may seem intimidating. Think of them simply as formulas for achieving normalization. Normal forms are based on relational algebra and can be interpreted as mathematical transformations. Although this book is not devoted to a detailed discussion of normal forms, model developers are encouraged to study the subject more deeply.

In a given relation R, attribute Y is functionally dependent on attribute X - in symbols, R.X -> R.Y (read "R.X functionally determines R.Y") - if and only if each value of X in R is associated with exactly one value of Y in R (at any given moment in time). Attributes X and Y may be composite (Date C. J. An Introduction to Database Systems. 6th edition. Williams, 1999, 848 p.).

Relation R corresponds to first normal form (1NF) if and only if all of its domains contain only atomic values (Date, ibid.).

Relation R corresponds to second normal form (2NF) if and only if it corresponds to 1NF and every non-key attribute fully depends on the primary key (Date, ibid.).

Relation R corresponds to third normal form (3NF) if and only if it corresponds to 2NF and no non-key attribute transitively depends on the primary key (Date, ibid.).

Relation R corresponds to Boyce-Codd normal form (BCNF) if and only if every determinant is a candidate key.

NOTE Below is a brief explanation of some of the abbreviations used in Date's definitions.

MVD (multi-valued dependency). Defined only for entities with three or more attributes. In a multi-valued dependency, the value of an attribute depends on only a part of the primary key.

FD (functional dependency). In a functional dependency, the value of an attribute depends on the value of another attribute that is not part of the primary key.

JD (join dependency). With a join dependency, the primary key of the parent entity can be traced down to descendants of at least the third level while retaining the ability to be used in a join by the original key.

A relation corresponds to fourth normal form (4NF) if and only if, for every MVD in R, for example A ->-> B, all attributes of R depend functionally on A. In other words, only dependencies (FD or MVD) of the form K -> X are present (i.e. a functional dependency of attribute X on candidate key K). Accordingly, R meets the requirements of 4NF if it corresponds to BCNF and all MVDs are in fact FDs (Date, ibid.).

For fifth normal form, relation R satisfies a join dependency (JD) *(X, Y, ..., Z) if and only if R is equivalent to the join of its projections onto X, Y, ..., Z, where X, Y, ..., Z are subsets of the set of attributes of R.
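A compact restatement of the definitions above in notation; this is only a summary of the cited definitions, not a replacement for Date's full treatment.

```latex
\begin{align*}
&\text{FD:}\quad X \to Y \iff \text{each value of } X \text{ in } R \text{ is associated with exactly one value of } Y;\\
&\text{1NF:}\quad \text{all domains of } R \text{ contain only atomic values};\\
&\text{2NF:}\quad R \in \text{1NF and every non-key attribute fully depends on the primary key};\\
&\text{3NF:}\quad R \in \text{2NF and no non-key attribute depends transitively on the primary key};\\
&\text{BCNF:}\quad \text{every determinant of } R \text{ is a candidate key};\\
&\text{4NF:}\quad R \in \text{BCNF and every MVD } A \twoheadrightarrow B \text{ in } R \text{ is in fact an FD } A \to B;\\
&\text{5NF:}\quad R \text{ satisfies } *(X, Y, \dots, Z) \iff R = \pi_X(R) \bowtie \pi_Y(R) \bowtie \dots \bowtie \pi_Z(R).
\end{align*}
```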

There are many other normal forms for complex data types and specific situations that are beyond the scope of this discussion. Every modeling enthusiast would do well to study the other normal forms as well.

Business Normal Forms

In his book, Clive Finkelstein (Finkelstein C. An Introduction to Information Engineering: From Strategic Planning to Information Systems. Reading, Massachusetts: Addison-Wesley, 1989) applied a different approach to normalization. He defines business normal forms in terms of the steps for reaching those forms. Many model developers find this approach more intuitive and pragmatic.

The first business normal form (1BNF) moves repeating groups into another entity. That entity gets its own name, and its primary (composite) key is made up of attributes of the original entity and of its repeating group.

The second business normal form (2BNF) moves attributes that partially depend on the primary key into another entity. The primary (composite) key of that entity is the primary key of the entity where the attributes were originally located, plus the additional keys on which the attribute fully depends.

The third business normal form (3BNF) moves attributes that do not depend on the primary key into another entity, where they fully depend on the primary key of that entity.

The fourth business normal form (4BNF) moves attributes that depend on the value of the primary key or are optional into a secondary entity, where they fully depend on the primary key value or where they must be present in that entity.

The fifth business normal form (5BNF) appears as a structural entity if there is a recursive or other dependency between instances of the secondary entity, or if a recursive dependency exists between instances of its primary entity.

Completed logical data model

The completed logical model must satisfy the requirements of the third business normal form and include all the entities, attributes, and relationships needed to support the data requirements and the business rules associated with the data.

All entities must have names that describe their content, and a clear, short, complete description or definition. One of the following publications will cover an initial set of recommendations for the proper formation of entity names and descriptions.

Entities must have a complete set of attributes, so that every fact about each entity can be represented by its attributes. Each attribute must have a name that reflects its values, a logical data type, and a clear, short, complete description or definition. In one of the following publications we will cover an initial set of recommendations for the proper formation of attribute names and descriptions.

Relationships must include a verb construction that describes the relationship between the entities, together with characteristics such as multiplicity, mandatory existence, or the possibility of the relationship being absent.

NOTE The multiplicity of a relationship describes the maximum number of instances of the secondary entity that can be associated with an instance of the original entity. Mandatory existence or the possibility of absence determines the minimum number of instances of the secondary entity that can be associated with an instance of the original entity.

Physical data model

Once a complete and adequate logical model has been created, you are ready to decide on the implementation platform. The choice of platform depends on the data usage requirements and on the strategic principles shaping the corporation's architecture. Choosing a platform is a complex problem that goes beyond the scope of this book.

In ERwin, the physical model is a graphical representation of the actually implemented database. The physical database consists of tables, columns, and relationships. The physical model depends on the platform chosen for implementation and on the data usage requirements. A physical model for IMS will differ seriously from the same model for Sybase, and a physical model for OLAP reports will look different from a model for OLTP (online transaction processing).

The data model developer and the database administrator (DBA) use the logical model, the customer's requirements, and the strategic principles of the corporation's architecture to develop the physical data model. The physical model can be denormalized to improve performance, and views can be created to support usage requirements. Subsequent sections consider the process of denormalization and view creation in detail.

This section gives an overview of the process of building the physical model, of gathering data usage requirements, of the components of the physical model, and of reverse engineering. The following publications cover these questions in more detail.

Collecting data requirements

You usually gather data usage requirements at the early stages, during interviews and working sessions. The requirements should define the user's use of the data as fully as possible. A superficial attitude and gaps in the physical model can lead to unplanned costs and delays in the project timeline. Usage requirements include:

    Access and performance requirements

    Volumetric characteristics (an estimate of the amount of data to be stored), which allow the administrator to gauge the physical size of the database

    An estimate of the number of users who need concurrent access to the data, which helps design the database with an acceptable level of performance in mind

    Totals, summaries, and other calculated or derived data that can be considered candidates for storage in persistent data structures

    Requirements for reports and standard queries, which help the database administrator build indexes

    Views (persistent or virtual) that will help the user when performing join or filter operations.

In addition to the chairperson, the secretary, and the users, the usage requirements session should involve the model developer, the database administrator, and the database architect. The users' requirements for historical data must be discussed. The length of time data is retained has a significant effect on the size of the database. Often, older data is stored in summarized form and atomic data is archived or deleted.

Users should bring examples of queries and reports to the session. Reports must be strictly defined and must include the atomic values used for any total and summary fields.

Components of the physical data model

The components of the physical data model are tables, columns, and relationships. The entities of the logical model will most likely become tables in the physical model, logical attributes will become columns, and logical relationships will become referential integrity constraints. Some logical relationships cannot be implemented in a physical database.
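A sketch of that logical-to-physical mapping: an entity with attributes becomes a table with columns, and a relationship becomes a referential constraint. The generated DDL is deliberately generic and would need to be adapted to a specific DBMS; all names are illustrative.

```python
def table_ddl(entity: str, attributes: dict, foreign_keys: dict | None = None) -> str:
    """Entity -> table, attributes -> columns, relationships -> referential constraints."""
    columns = [f"  {name} {datatype}" for name, datatype in attributes.items()]
    constraints = [
        f"  FOREIGN KEY ({column}) REFERENCES {ref_table}"
        for column, ref_table in (foreign_keys or {}).items()
    ]
    body = ",\n".join(columns + constraints)
    return f"CREATE TABLE {entity} (\n{body}\n);"

print(table_ddl(
    "person_skill",
    {"person_id": "INTEGER NOT NULL", "skill_id": "INTEGER NOT NULL", "level": "VARCHAR(30)"},
    foreign_keys={"person_id": "person", "skill_id": "skill"},
))
```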

Reverse engineering

When a logical model is not available, it becomes necessary to recreate the model from the existing database. In ERwin this process is called reverse engineering. It can be done in several ways: the model developer can examine the data structures in the database and recreate the tables in the visual modeling environment, or a data definition language (DDL) script can be imported into a tool that supports reverse engineering (for example, ERwin). Advanced tools such as ERwin also include functions that connect via ODBC to an existing database and create a model by reading the data structures directly. Reverse engineering with ERwin will be discussed in detail in one of the following publications.

Use of corporate functional boundaries

When building a logical model, it is important for the model developer to make sure the new model matches the corporate model. Using corporate functional boundaries means modeling the data in the terms used within the corporation. The way data is used in the corporation changes faster than the data itself. In every logical model the data should be presented holistically, regardless of the business subject area it supports. Entities, attributes, and relationships should define business rules at the corporation level.

NOTE Some of my colleagues call these corporate functional boundaries "real-world modeling". Real-world modeling encourages the model developer to consider information in terms of the relationships and dependencies that are really inherent in it.

Using corporate functional boundaries for a properly built data model provides the foundation for supporting the information needs of any number of processes and applications, which enables the corporation to exploit one of its most valuable assets - information - efficiently.

What is a corporate data model?

A corporate data model (EDM - Enterprise Data Model) contains the entities, attributes, and relationships that represent the information needs of the corporation. The EDM is usually divided into subject areas, which represent groups of entities supporting specific business needs. Some subject areas may cover specific business functions such as contract management; others may combine entities that describe products or services.

Each logical model must correspond to an existing subject area of the corporate data model. If a logical model does not meet this requirement, a model defining the subject area must be added to it. This comparison ensures that the corporate model is improved or corrected and that all logical modeling efforts within the corporation are coordinated.

The EDM also includes specific entities that define the range of values for key attributes. These entities have no parents and are defined as independent. Independent entities are often used to maintain relationship integrity. They are known by several different names, such as code tables, reference tables, type tables, or classification tables. We will use the term "corporate business object". A corporate business object is an entity that contains a set of attribute values that do not depend on any other entity. Corporate business objects should be used uniformly within the corporation.

Building a corporate data model by extension

There are organizations where the corporate model was built from start to finish in a single, coordinated effort. On the other hand, most organizations build fairly complete corporate models incrementally.

Building up means constructing something consistently, layer by layer, just as an oyster grows a pearl. Each data model created contributes to the formation of the EDM. Building an EDM this way requires additional modeling steps to add new data structures and subject areas or to extend existing data structures. This makes it possible to build a corporate data model incrementally, iteratively adding levels of detail and refinement.

The concept of modeling methodology

There are several methodologies for visual data modeling. ERwin supports two of them:

    IDEF1X (Integration Definition for Information Modeling - an integrated description of information models).

    IE (Information Engineering).

IDEF1X is a good methodology, and its notation is widely used.

Integrated description of information models

IDEF1X is a highly structured data modeling methodology that extends the IDEF1 methodology adopted as a FIPS (Federal Information Processing Standards) standard. IDEF1X uses a strictly structured set of modeling construct types and leads to a data model that requires an understanding of the physical nature of the data before such information can be made available.

The rigid structure of IDEF1X forces the model developer to assign to entities characteristics that may not correspond to the realities of the world. For example, IDEF1X requires all entity subtypes to be exclusive. This leads to the conclusion that a person cannot be both a client and an employee at the same time - whereas real practice tells us otherwise.

Information engineering

Clive Finkelstein is often called the father of information engineering, although similar concepts were developed jointly with James Martin (Martin, James. Managing the Database Environment. Upper Saddle River, New Jersey: Prentice Hall, 1983). Information engineering uses a business-driven approach to managing information and applies a different notation for representing business rules. IE serves as an extension and development of the notation and basic concepts of the ER methodology proposed by Peter Chen.

IE provides the infrastructure for information support by integrating corporate strategic planning with the information systems being developed. Such integration allows the management of information resources to be tied more closely to the corporation's long-term strategic prospects. This business-requirements-driven approach leads many model developers to choose IE over other methodologies that focus mainly on solving immediate development tasks.

IE offers a sequence of actions that leads the corporation to identify all of its information needs for collecting and managing data and to identify the interrelationships between information objects. As a result, information requirements are clearly formulated on the basis of management directives and can be translated directly into a management information system that will support strategic information needs.

Conclusion

Understanding how to use a data modeling tool such as ERwin is only part of the problem. In addition, you must understand when data modeling tasks are being solved and how the requirements for information and the business rules that must be represented in the data model are gathered. Holding working sessions provides the most favorable conditions for gathering information requirements in an environment that includes subject area experts, users, and information technology specialists.

Building a good data model requires analyzing and researching the information requirements and business rules gathered during working sessions and interviews. The resulting data model should be compared with the corporate model, if possible, to make sure it does not conflict with existing object models and includes all the required objects.

The data model consists of logical and physical models that represent the information requirements and business rules. The logical model must be reduced to third normal form. Third normal form constrains insert, update, and delete anomalies in the data structures in order to support the "one fact in one place" principle. The gathered information requirements and business rules must be analyzed and researched; they must be compared with the corporate model to make sure they do not conflict with existing object models and that all the necessary objects are included.

The ERwin data model includes both a logical and a physical model. ERwin implements the ER approach and lets you create logical and physical model objects to represent information requirements and business rules. Logical model objects include entities, attributes, and relationships. Physical model objects include tables, columns, and referential integrity constraints.

One of the following publications will cover questions of identifying entities, determining entity types, choosing entity names and descriptions, as well as some techniques for avoiding the most common modeling errors associated with the use of entities.


IT specialists are increasingly turning their attention to data management solutions based on standard industry data models and business solution templates. Ready-to-use comprehensive physical data models and business analysis report templates for specific areas of activity make it possible to unify the information component of an enterprise's activities and significantly accelerate the implementation of business processes. Solution templates let service providers exploit non-standard information capabilities hidden in existing systems, thereby reducing project timelines, costs, and risks. For example, real projects show that a data model and business solution templates can reduce development effort by 50%.

An industry logical model is an object-oriented, integrated, and logically structured representation of all the information that must be in a corporate data warehouse in order to answer both strategic and tactical business questions. The main purpose of such models is to make it easier to navigate the data space and to help identify the parts that matter for business development. In today's conditions, to run a business successfully it is absolutely necessary to have a clear understanding of the links between the various components and a good idea of the overall picture of the organization. Identifying all the parts and connections with the help of models allows the most effective use of the time and tools for organizing the company's work.

A data model is understood as an abstract model describing how data is represented and accessed. Data models define the data elements and the relationships between them in a particular area. A data model is a navigation tool for both business and IT professionals that uses a specific set of symbols and words to accurately explain a certain class of real-world information. This improves mutual understanding within the organization and thus creates a more flexible and stable environment for running applications.


An example of a GIS data model for state authorities and local governments.

Today it is strategically important for software and service providers to be able to respond quickly to changes in the industry associated with technological novelties, the lifting of government restrictions, and the growing complexity of supply chains. Along with changes in the business model, the complexity and cost of the information technologies needed to support the company's activities grow. Data management is especially difficult in an environment where corporate information systems, as well as functional and business requirements, are constantly changing.

Industry data models are designed to help relieve and optimize this process and to bring the IT approach up to the modern level.

Industry data models from Esri

The data models for the Esri ArcGIS platform are working templates for use in GIS projects and for creating data structures for different application areas. Building a data model includes creating a conceptual design and a logical and physical structure, which can then be used to build a personal or corporate geodatabase. ArcGIS provides tools for creating and managing the database schema, and data model templates are used to quickly launch a GIS project for various applications and industries. Esri specialists, together with the user community, have spent a significant amount of time developing a number of templates that enable a quick start on designing an enterprise geodatabase. These projects are described and documented at support.esri.com/datamodels. Below, in the order in which they are mentioned on that site, are the Esri industry models with a meaningful translation of their names:

  • Address registry
  • Agriculture
  • Meteorology
  • Basic spatial data
  • Biodiversity
  • Building interiors
  • Accounting for greenhouse gases
  • Administrative boundaries
  • Armed forces. Intelligence service
  • Energy (including the new ArcGIS Multispeak protocol)
  • Environmental structures
  • Emergency services. Fire safety
  • Forest Cadastre
  • Forestry
  • Geology
  • GIS National Level (E-GOV)
  • Groundwater and wastewater
  • Health
  • Archeology and Protection of Memorial Places
  • National security
  • Hydrology
  • International Hydrographic Organization (IHO). S-57 format for ENC
  • Irrigation
  • Land Registry
  • Municipal government
  • Marine navigation
  • State Cadastre
  • Oil and gas structures
  • Pipelines
  • Raster storage
  • Bathymetry, seabed relief
  • Telecommunications
  • Transport
  • Water supply, sewage, housing and communal services

These models have all the necessary hallmarks of an industry standard, namely:

  • they are freely accessible;
  • they are not tied to a "favorite" vendor's technology;
  • they were created as a result of implementing real projects;
  • they were created with the participation of industry specialists;
  • they are intended to provide information interoperability between different products and technologies;
  • they do not contradict other standards and regulatory documents;
  • they are used in implemented projects around the world;
  • they are designed to work with information across the entire life cycle of the system being created, not just of the project itself;
  • they can be extended to customer needs without losing compatibility with other projects and/or models;
  • they are accompanied by additional materials and examples;
  • they are used in the methodological guidelines and technical materials of various industrial companies;
  • there is a large community of participants, and access to the community is open to everyone;
  • there are many references to the data models in publications of recent years.

Esri specialists are part of expert groups of independent bodies that recommend the use of various industry models, such as PODS (Pipeline Open Data Standards - an open standard for the oil and gas industry; there is currently a PODS implementation as an Esri geodatabase, PODS Esri Spatial 5.1.1), or the geodatabase (GDB) from ArcGIS for Aviation, which takes into account ICAO and FAA recommendations as well as the EXM 5.0 navigation data exchange standard. In addition, there are recommended models that strictly correspond to existing industry standards, such as S-57 and ArcGIS for Maritime (marine and coastal objects), as well as models created from completed Esri Professional Services projects that have become "de facto" standards in their respective areas. For example, GIS for the Nation and Local Government influenced the NSDI and INSPIRE standards, and Hydro and Groundwater are actively used in the freely available ArcHydro package and in commercial products by third-party firms. It should be noted that Esri also supports "de facto" standards such as NHDI. All the proposed data models are documented and ready for use in the enterprise's IT processes. The accompanying materials for the models include:

  • UML entity-relationship diagrams;
  • data structures, domains, reference tables;
  • ready-made geodatabase templates in ArcGIS GDB format;
  • sample data and sample applications;
  • sample data loading scripts, sample analysis utilities;
  • reference documentation on the proposed data structure.

Esri summarizes its experience in building industry models in the form of books and localizes the published materials. Esri CIS has localized and published the following books:

  • Geospatial service-oriented architecture (SOA);
  • Design of geodatabases for transport;
  • Corporate geoinformation systems;
  • GIS: new energy of electrical and gas enterprises;
  • Oil and gas on a digital map;
  • Modeling Our World. The Esri guide to geodatabase design;
  • Thinking about GIS. GIS Planning: Manual for managers;
  • Geographical information systems. Basics;
  • GIS for administrative and economic management;
  • Web GIS. Principles and application;
  • System design strategies, 26th edition;
  • 68 issues of ArcReview magazine with publications of companies and GIS systems;
  • ... and many other thematic notes and publications.

For example, the book "Modeling Our World..." (in translation) is a comprehensive guide and reference on data modeling in GIS in general, and on the geodatabase data model in particular. The book shows how to arrive at correct data modeling decisions - decisions involved in every aspect of a GIS project: from designing the geodatabase and collecting data to spatial analysis and visual presentation. It describes in detail how to design a geographic database appropriate to the project, configure database functionality without programming, manage workflows in complex projects, model a variety of network structures such as river, transport, or electrical networks, incorporate data into geographic analysis and display, and create 3D GIS data models. The book "Designing geodatabases for transport" contains methodological approaches tested on a large number of projects that fully comply with the legislative requirements of Europe and the United States, as well as with international standards. And the book "GIS: new energy of electrical and gas enterprises" uses real examples to show the advantages that a corporate GIS can give an energy supplier, covering aspects such as customer service, network operation, and other business processes.


Some of the books - translations and original editions - published in Russian by Esri CIS and DATA+. They cover both conceptual issues related to GIS technology and many applied aspects of modeling and deploying GIS of various scales and purposes.

Let us consider the application of industry models using the example of the BISDM (Building Interior Space Data Model) version 3.0. BISDM is a development of the more general BIM (Building Information Model) and is intended for use in the design, construction, operation, and decommissioning of buildings and structures. Used in GIS, it allows geodata to be exchanged effectively with other platforms and to interact with them. It belongs to the general group of FM tasks (facility management of an organization's infrastructure). Let us list the main advantages of the BISDM model, whose use makes it possible to:

  • organize the exchange of information in a heterogeneous environment according to uniform rules;
  • obtain a "physical" embodiment of the BIM concept and recommended rules for managing a construction project;
  • maintain a single repository throughout the building's life cycle (from design to decommissioning);
  • coordinate the work of the various specialists on the project;
  • visualize the planned schedule and construction stages for all participants;
  • give a preliminary estimate of cost and construction time (4D and 5D data);
  • monitor the progress of the project;
  • ensure high-quality operation of the building, including maintenance and repairs;
  • become part of the asset management system, including functions for analyzing how efficiently space is used (rental, warehousing, employee management);
  • perform calculations and manage building energy-efficiency tasks;
  • model the movement of flows of people.

BISDM defines the rules for working with spatial data at the level of the building's interior spaces, including their purpose and types of use, the communications laid, the installed equipment, maintenance and repairs performed, incident logging, and relationships with other company assets. The model helps to create a unified repository of geographic and non-geographic data. The experience of leading global companies was used to identify the entities and to model, at the geodatabase (GDB) level, the spatial and logical relationships of all the physical elements that form both the building itself and its interior spaces. Following the BISDM principles makes it possible to significantly simplify integration with other systems. At the first stage this usually means integration with CAD. Then, during the building's operation, data exchange with ERP and EAM systems (SAP, TRIRIGA, Maximo, etc.) is used.


Visualization of BISDM structural elements in ArcGIS.

When BISDM is used, the customer/owner of the facility receives a continuous exchange of information from the idea of creating the facility through the development of the full design, construction monitoring with up-to-date information by the time of commissioning, control of parameters during operation, and even during reconstruction or decommissioning of the facility. Following the BISDM paradigm, the GIS and the GDB created with it become a common data repository for related systems. Data created and maintained by third-party systems often ends up in the GDB; this has to be taken into account when designing the architecture of the system being created.

At a certain stage, the accumulated "critical mass" of information allows a jump to a new qualitative level. For example, when the design stage of a new building is complete, the GIS can automatically visualize overview 3D models, compile a list of installed equipment, calculate the length of the engineering networks laid, perform a series of checks, and even give a preliminary financial estimate of the project cost.

Note again that when BISDM and ArcGIS are used together, 3D models can be built automatically from the accumulated data, since the GDB contains a complete description of the facility, including Z-coordinates, floor assignment, types of element connections, equipment installation methods, materials, staff movement routes, the functional purpose of each element, and so on. It must be kept in mind that after the initial import of all design materials into the BISDM GDB, additional information still has to be filled in for:

  • placing 3D models of objects and equipment in their designated locations;
  • collecting information about the cost of materials and the order of their laying and installation;
  • checking clearances against the dimensions of the non-standard equipment being installed.

Thanks to ArcGIS, importing additional 3D objects and reference data from external sources is simplified, since the ArcGIS Data Interoperability module lets you create procedures for importing such data and placing it correctly inside the model. All formats used in the industry are supported, including IFC, AutoCAD Revit, and Bentley MicroStation.

Industry data models from IBM

IBM provides a set of data warehouse management tools and models for various areas of activity:

  • IBM Banking and Financial Markets Data Warehouse (finance)
  • IBM Banking Data Warehouse
  • IBM Banking Process and Service Models
  • IBM Health Plan Data Model (healthcare)
  • IBM Insurance Information Warehouse (insurance)
  • IBM Insurance Process and Service Models
  • IBM Retail Data Warehouse (retail)
  • IBM Telecommunications Data Warehouse (telecommunications)
  • InfoSphere Warehouse Pack:
    - for Customer Insight (for customer understanding)
    - for Market and Campaign Insight (for market and campaign understanding)
    - for Supply Chain Insight (for supply chain understanding).

For example, the IBM Banking and Financial Markets Data Warehouse model is designed to address the specific problems of the banking industry from the data point of view, and IBM Banking Process and Service Models from the point of view of processes and SOA (service-oriented architecture). For the telecommunications industry, IBM offers the IBM Information Framework (IFW) and IBM Telecommunications Data Warehouse (TDW) models. They help significantly speed up the creation of analytical systems and reduce the risks associated with developing business analysis applications, managing corporate data, and organizing data warehouses, taking into account the specifics of the telecommunications industry. The capabilities of IBM TDW cover the entire spectrum of the telecommunication services market - from Internet providers and cable network operators offering wired and wireless telephony, data transmission, and multimedia content, to transnational companies providing telephone, satellite, long-distance, and international services, as well as global network organizations. Today TDW is used by large and small providers of wired and wireless services all over the world.

The InfoSphere Warehouse Pack for Customer Insight tool provides structured and easy-to-implement business content for a growing number of business projects and industries, including banking, insurance, finance, health insurance programs, telecommunications, retail, and distribution. For business users, InfoSphere Warehouse Pack for Market and Campaign Insight helps maximize the effectiveness of market analysis and marketing campaigns thanks to a step-by-step development process that takes the specifics of the business into account. With InfoSphere Warehouse Pack for Supply Chain Insight, organizations can obtain up-to-date information on supply chain operations.


The position of ESRI within the IBM solution architecture.

IBM pays special attention to its approach for electric power companies and utilities. To satisfy growing consumer demands, a more flexible architecture than the one in use today is required, as well as a standard industry object model that simplifies the free exchange of information. This will improve the communication capabilities of energy companies, enabling interaction in a more economical mode, and will give new systems better visibility of all the necessary resources regardless of where they are located within the organization. The basis for this approach is SOA (service-oriented architecture), a component model that maps the functions of departments to the services of various applications, which can be reused. The "services" of such components exchange data via interfaces without tight coupling, hiding from the user the complexity of the systems behind them. In this mode, enterprises can easily add new applications regardless of the software vendor, operating system, programming language, or other internal characteristics of the software. The SAFE (Solution Architecture for Energy) concept, implemented on the basis of SOA, allows an electric power company to obtain a standards-based, holistic view of its infrastructure.

ESRI ArcGIS® is a world-leading software platform for geographic information systems (GIS) that provides creation and management of the digital assets of electric power, gas transmission, distribution, and telecommunication networks. ArcGIS makes it possible to carry out the most complete inventory of the components of an electrical distribution network, taking their spatial location into account. ArcGIS significantly extends the IBM SAFE architecture, providing the tools, applications, workflows, analytics, and information and integration capabilities needed to manage an intelligent energy enterprise. ArcGIS within IBM SAFE makes it possible to obtain information about infrastructure, assets, customers, and employees with accurate data on their location from various sources, as well as to create, store, and process geographic information about enterprise assets (poles, pipelines, wires, transformers, cable ducts, etc.). ArcGIS inside the SAFE infrastructure makes it possible to dynamically link the main business applications, combining data from GIS, SCADA, and customer service systems with external information such as traffic intensity, weather conditions, or satellite imagery. Energy enterprises use such combined information for various purposes, from the common operational picture to object inspection, maintenance, analysis, and network planning.

The information components of a power supply enterprise can be modeled using several levels, ranging from the lowest, physical, level to the upper, most complex level of business process logic. These levels can be integrated to meet typical industry requirements, for example automated meter reading and supervisory control and data acquisition (SCADA) systems. Having built the SAFE architecture, energy supply companies take significant steps toward a common open object model called the Common Information Model for energy companies and utilities (CIM). This model provides the necessary basis for moving many enterprises toward a service-oriented architecture, since it encourages the use of open standards for structuring data and objects. Because all systems use the same objects, the confusion and inflexibility associated with different implementations of the same objects are reduced to a minimum. Thus the definition of the "customer" object and other important business objects is unified across all systems of the energy supply enterprise. With CIM, service providers and consumers can now use a common data structure, making it easier to outsource expensive business components, since CIM establishes a common basis on which information exchange can be built.

Conclusion

Complex industry data models provide companies with a single, integrated view of their business information. Many companies find it difficult to integrate their data, although this is a prerequisite for most enterprise-wide projects. According to a study by The Data Warehousing Institute (TDWI), more than 69% of the organizations surveyed found integration to be a significant barrier to implementing new applications. By contrast, implementing data integration brings tangible income and efficiency gains.

A properly constructed model unambiguously defines the meaning of the data, which in this case is structured data (as opposed to unstructured data such as an image, a binary file, or text, whose meaning can be ambiguous). The most effective industry models are offered by professional vendors, including ESRI and IBM. The high return on using their models is achieved thanks to their significant level of detail and accuracy; they usually contain many data attributes. In addition, ESRI and IBM specialists not only have extensive modeling experience but are also skilled at building models for particular industries.


Database architecture

The CMD schema is a description of the structure of the data model from the administrator's point of view.

The NMD schema is a description of the internal, or physical, model. It describes the physical placement of data on storage media and stores direct instructions on placing the data in memory (volumes, disks).

The CMD schema describes the structure of the data, the records, and the fields.

All DBMS support three main types of data models:

1. Hierarchical model. It assumes a certain root record from which branches descend.

Not all objects are conveniently described this way: there are no cross-links in the hierarchy, and it is characterized by a great redundancy of information.

2. Network model. It allows all the complexity of relationships to be represented correctly.

The model is convenient for representing links with data of the external environment, but less convenient for description in the database, which forces the user to do additional work studying navigation over the relationships.

3. Relational model. It is based on the mathematical term relation, which is simply a table - for example, a rectangular two-dimensional one.

The relational data structure was developed in the late 1960s by a number of researchers, among whom the most significant contribution was made by IBM employee Edgar Codd. In the relational approach, data are presented in the form of two-dimensional tables - the form most natural for humans. For data processing, Codd proposed using the apparatus of set theory: union, intersection, difference, and Cartesian product.

Data type - this concept has the same meaning as in programming languages (i.e., the data type determines the internal representation in the computer's memory and the way an instance of the data is stored, as well as the set of values an instance of the data can take and the set of permissible operations on the data). All modern databases support special data types intended for storing integers, floating-point fractional numbers, characters and strings, and calendar dates. Many database servers also implement other types; for example, the InterBase server has a special data type for storing large binary arrays of information (BLOB).

Domain - this is the potential set of values of a simple data type; it resembles a data subtype in some programming languages. A domain is defined by two elements: a data type and a logical expression applied to the data. If this expression evaluates to true, the data instance belongs to the domain.

Relation - this is a two-dimensional table of a special kind, consisting of a heading and a body.

Heading - this is a fixed set of attributes, each of which is defined on some domain, with a one-to-one correspondence between the attributes and their defining domains.

Each attribute is defined on its own domain; the domain might be, for example, the data type "integer" with the logical condition n > 0. The heading does not change over time, unlike the body of the relation. The body of the relation is a set of tuples, each of which is a set of "attribute - value" pairs.

The cardinality of a relation is the number of its tuples, and the degree of a relation is the number of its attributes.

The degree of a relation is constant for a given relation, whereas its cardinality changes over time. The cardinality of a relation is also called its cardinal number.

The above concepts are theoretical and are used in developing the language tools and software systems of relational DBMSs. In everyday work their informal equivalents are used instead:

relation - table;

attribute - column or field;

tuple - record or row.

Thus, the degree of a relation is the number of columns in the table, and the cardinal number is the number of rows.
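As a minimal illustration of these informal equivalents, the sketch below (Python; the attribute names and sample tuples are invented for the example) represents a relation as a heading plus a set of tuples and computes its degree and cardinality:

```python
# A relation: a heading (attribute names) plus a body (a set of tuples).
# The attribute names and the sample data are invented for illustration only.
heading = ("part_code", "material_code", "unit", "rate")

body = {
    ("P-001", "M-10", "kg", 2.5),
    ("P-002", "M-10", "kg", 1.0),
    ("P-003", "M-21", "m",  4.0),
}

degree = len(heading)      # number of attributes = number of columns
cardinality = len(body)    # number of tuples     = number of rows

print(f"degree = {degree}, cardinality = {cardinality}")
# degree = 4, cardinality = 3
```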

Since a relation is a set, and in classical set theory a set cannot by definition contain coinciding elements, a relation cannot have two identical tuples. Therefore, for any relation there is always a set of attributes that uniquely identifies a tuple. Such a set of attributes is called a key.

The key must meet the following requirements:

· Must be unique;

· Must be minimal, that is, the removal of any attribute from the key leads to a violation of uniqueness.

As a rule, the number of attributes in a key is less than the degree of the relation; in the extreme case, however, the key may contain all attributes, since the combination of all attributes satisfies the uniqueness condition. Usually a relation has several keys. Of all the keys of a relation (also called candidate keys), one is chosen as the primary key. When choosing the primary key, preference is usually given to the key with the smallest number of attributes. It is also impractical to use keys with long string values.

In practice, a special numeric attribute is often used as the primary key: an auto-incremented field whose value can be generated by a trigger (a trigger is a special procedure invoked when changes are made to the database) or by a special mechanism defined in the DBMS.
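A minimal sketch of such a surrogate key, assuming SQLite as the DBMS (here the key is generated by the DBMS itself rather than by a trigger; the table and column names are invented for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# The integer primary key is generated automatically by the DBMS.
cur.execute("""
    CREATE TABLE person (
        person_id INTEGER PRIMARY KEY AUTOINCREMENT,
        full_name TEXT NOT NULL
    )
""")

cur.execute("INSERT INTO person (full_name) VALUES (?)", ("Ivanov I.I.",))
cur.execute("INSERT INTO person (full_name) VALUES (?)", ("Petrov P.P.",))
conn.commit()

print(cur.execute("SELECT person_id, full_name FROM person").fetchall())
# [(1, 'Ivanov I.I.'), (2, 'Petrov P.P.')]
```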

The basic concepts described in this chapter do not relate to any particular database implementation and are common to all of them. Thus, these concepts form the basis of a general model, which is called the relational data model.

The founder of the relational approach, Date, established that the relational model consists of three parts:

· Structural;

· Manipulating;

· Integrity.

The structural part of the model fixes relations as the only data structure used in the relational model.

The manipulation part fixes two basic mechanisms for manipulating relational databases - relational algebra and relational calculus.

The integrity part defines a mechanism for ensuring that data is not destroyed. It contains two main integrity requirements for relational databases - entity integrity and referential integrity.

The entity integrity requirement states that any tuple of any relation must be distinguishable from any other tuple of that relation; in other words, every relation must have a primary key. This requirement is satisfied if the basic properties of relations hold.

In the data manipulation language, as well as in the query language, a mathematical apparatus called relational algebra is implemented, for which the following operations are defined:

1. Standard set operations: ∩ - intersection, ∪ - union, \ - difference, × - Cartesian product.

2. Specific operations: projection, selection (restriction), join, division.

a. Union.

R1 (part code SD, material code SM, unit of measure EI, consumption rate NR)

R2 (SD, SM, EI, NR)

We need to find the union of the sets R1 and R2. In this operation the degree is preserved, while the cardinality of the resulting set does not exceed the sum of the cardinalities of R1 and R2.

b. Intersection.

Selects the tuples that coincide in both relations.

c. Difference.

Excludes from R1 the tuples that coincide with tuples of R2.

d. Cartesian product.

Here tuples are concatenated: each row of one set is concatenated with each row of the other.

For two given sets, the degree of the Cartesian product equals the sum of the degrees of the operands, and its cardinality equals the product of their cardinalities; in this example the result has 12 rows and 5 columns.
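A minimal Python sketch of these four operations over relations represented as sets of tuples (all sample data is invented; the operand degrees and cardinalities are chosen so that the Cartesian product yields the 12 rows and 5 columns mentioned above):

```python
from itertools import product

# Union-compatible relations over (SD, SM, EI, NR); the sample tuples are invented.
R1 = {("D1", "M1", "kg", 2.0), ("D2", "M1", "kg", 1.5), ("D3", "M2", "m", 4.0)}
R2 = {("D3", "M2", "m", 4.0), ("D4", "M3", "pcs", 7.0)}

union        = R1 | R2          # degree preserved, |result| <= |R1| + |R2|
intersection = R1 & R2          # the coinciding tuples
difference   = R1 - R2          # tuples of R1 not present in R2

# Cartesian product: every tuple of A is concatenated with every tuple of B.
A = {("a1", 1), ("a2", 2), ("a3", 3), ("a4", 4)}          # degree 2, cardinality 4
B = {("x", 10, True), ("y", 20, False), ("z", 30, True)}  # degree 3, cardinality 3
cartesian = {ta + tb for ta, tb in product(A, B)}         # 12 tuples of 5 values each

print(len(union), len(intersection), len(difference), len(cartesian))
# 4 1 2 12
```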


Topic V. Corporate databases

V.1. Data organization in corporate systems. Corporate databases.

V.2. DBMS and structural solutions in corporate systems.

V.3. Internet/Intranet technologies and corporate solutions for database access.

V.1. Data organization in corporate systems. Corporate databases

The corporate database is the central link of a corporate information system and makes it possible to create a single information space for the corporation (Fig. 1.1).

There are various database definitions.

A database is understood as a collection of logically related information organized so as to form a unified set of data stored in the storage devices of a computer. This set serves as the source data for the tasks solved during the operation of automated control systems, data processing systems, and information and computing systems.

The term database can be briefly formulated as a set of logically related data intended for shared use.

A database can also be understood as a collection of data stored together with such minimal redundancy that they can be used in an optimal manner for one or more applications.

The purpose of creating databases as a form of data storage is to build a data system that does not depend on the algorithms (software) adopted, the technical means used, or the physical location of the data in the computer. A database assumes multipurpose use (multiple users, many forms of documents and queries of a single user).

Basic requirements for databases:

  • Completeness of data representation. The data in the database must adequately represent all the information about the object and must be sufficient for the data processing system (SOD).
  • Database integrity. The data must be preserved during processing by the data processing system and in any situations arising in the course of work.
  • Flexibility of the data structure. The database must allow the data structures to be changed without violating its integrity and completeness when external conditions change.
  • Realizability. There must be an objective representation of the variety of objects, their properties, and relationships.
  • Accessibility. Differentiation of access to data must be ensured.
  • Minimal redundancy. The database must have minimal redundancy in the data representing any object.

Knowledge is understood as a combination of facts, regularities, and heuristic rules with the help of which a task can be solved.

A knowledge base (KB) is a set of databases and rules obtained from decision makers. The knowledge base is an element of expert systems.

Different methods of data representation should be distinguished.

Physical data are the data stored in the computer's memory.

A logical representation of data corresponds to the user's view of the physical data. The difference between the physical data and the corresponding logical representation is that the latter reflects some important relationships between the physical data.

A corporate database is understood as a database that, in one form or another, combines all the necessary data and knowledge about the organization being automated. In corporate information systems, the concept of integrated databases, in which the principle of single entry and multiple use of information is implemented, has found its most concentrated expression.

Fig. 1.1. The structure of interaction between departments and the corporation's information resources.

Corporate databases can be concentrated (centralized) or distributed.

A concentrated (centralized) database is a database that is physically stored in the storage devices of a single computer. Fig. 1.2 presents a diagram of a server application for accessing databases on various platforms.

Fig. 1.2. Diagram of a heterogeneous centralized database.

The centralization of information processing made it possible to eliminate such disadvantages of traditional file systems as incompleteness, inconsistency, and redundancy of data. However, as databases grow and, especially, when they are used in geographically dispersed organizations, problems appear. For example, for a concentrated database located at a node of a telecommunications network through which the various departments of the organization access the data, the following difficulties arise as the volume of information and the number of transactions grow:

  • Large data exchange stream;
  • High traffic on the network;
  • Low reliability;
  • Low overall performance.

Although with a concentrated database it is easier to ensure the safety, integrity, and consistency of information during updates, the listed problems create certain difficulties. Data decentralization is proposed as a possible solution to these problems. Decentralization achieves:

  • A higher degree of processing concurrency due to load distribution;
  • Better use of local data when performing remote queries;
  • Lower costs;
  • Easier management of local databases.

The cost of creating a network whose nodes are workstations (small computers) is much lower than the cost of creating a similar system using a large computer. Figure 1.3 shows a logical diagram of a distributed database.

Fig.1.3. Distributed corporation database.

Let's give the following definition of a distributed database.

A distributed database is a set of information and files (relations) stored in different nodes of an information network and logically connected in such a way as to constitute a single set of data (the connection can be functional or via copies of the same file). Thus, it is a set of databases that are logically connected with each other but physically located on several machines belonging to the same computer network.

The most important requirements for the characteristics of a distributed database are as follows:

  • Scalability;
  • Compatibility;
  • Support for various data models;
  • Portability;
  • Location transparency;
  • Autonomy of the distributed database nodes (site autonomy);
  • Processing of distributed queries;
  • Execution of distributed transactions;
  • Support for a homogeneous security system.

Location transparency allows users to work with databases without knowing anything about their location. Node autonomy means that each database can be maintained independently of the others. A distributed query is a query (SQL statement) during whose execution objects (tables or views) of different databases are accessed. When distributed transactions are executed, concurrency control is performed over all the databases involved. Oracle7 uses two-phase commit technology to execute distributed transactions.
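A conceptual sketch of the two-phase commit idea (this is not Oracle's actual implementation; the participant class and its voting interface are invented for illustration):

```python
class Participant:
    """A database node taking part in a distributed transaction (illustrative only)."""
    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit

    def prepare(self):
        # Phase 1: the node prepares its local changes and votes on the outcome.
        return self.can_commit

    def commit(self):
        print(f"{self.name}: COMMIT")

    def rollback(self):
        print(f"{self.name}: ROLLBACK")


def two_phase_commit(participants):
    # Phase 1 (prepare): collect a vote from every node.
    votes = [p.prepare() for p in participants]
    # Phase 2 (decision): commit only if every node voted "yes", otherwise roll back everywhere.
    if all(votes):
        for p in participants:
            p.commit()
        return "committed"
    for p in participants:
        p.rollback()
    return "rolled back"


print(two_phase_commit([Participant("oracle_node"), Participant("db2_node")]))
print(two_phase_commit([Participant("oracle_node"),
                        Participant("db2_node", can_commit=False)]))
```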

The databases that make up a distributed database need not be homogeneous (i.e., maintained by the same DBMS) or processed in the environment of the same operating system and/or on computers of the same type. For example, one database may be an Oracle database on a Sun computer running Sun OS (UNIX), a second a DB2 database on an IBM 3090 mainframe running MVS, and a third may be maintained by SQL/DS, also on an IBM mainframe but running VM. Only one condition is mandatory: all machines with databases must be accessible over the network they belong to.

The main task of a distributed database is to distribute data over the network and provide access to it. This problem can be solved in the following ways:

  • Each node stores and uses its own data set, which is available for remote queries. Such a distribution is called partitioned.
  • Some data frequently used at remote nodes may be duplicated. Such a distribution is called partially duplicated.
  • All data are duplicated at every node. Such a distribution is called fully duplicated.
  • Some files can be split horizontally (a subset of records is selected) or vertically (a subset of attribute fields is selected), and the selected subsets are stored in different nodes along with the unsplit data. Such a distribution is called split (fragmented); both kinds of splitting are sketched after this list.
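A minimal sketch of horizontal and vertical splitting of a file of records (the record layout and the splitting criteria are invented for illustration):

```python
# A "file" of records; the fields and the split criteria are invented for illustration.
employees = [
    {"id": 1, "name": "Ivanov",  "dept": "north", "salary": 1200},
    {"id": 2, "name": "Petrov",  "dept": "south", "salary": 1500},
    {"id": 3, "name": "Sidorov", "dept": "north", "salary": 1100},
]

# Horizontal fragmentation: a subset of records goes to each node.
node_north = [r for r in employees if r["dept"] == "north"]
node_south = [r for r in employees if r["dept"] == "south"]

# Vertical fragmentation: a subset of fields goes to each node;
# the key ("id") is kept in every fragment so a record can be reassembled.
node_hr      = [{"id": r["id"], "name": r["name"]}     for r in employees]
node_payroll = [{"id": r["id"], "salary": r["salary"]} for r in employees]

print(len(node_north), len(node_south), node_hr[0], node_payroll[0])
```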

When creating a distributed database at a conceptual level, the following tasks have to be solved:

  • It is necessary to have a single conceptual schema of the entire network. This ensures logical transparency of the data for the user, who as a result can form a query to the entire database from a single terminal (as if working with a centralized database).
  • A diagram is required that determines the location of data on the network. This will ensure the transparency of data placement, due to which the user may not specify where to forward the request to get the required data.
  • It is necessary to solve the problem of heterogeneity of distributed databases. Distributed databases can be homogeneous or heterogeneous in terms of hardware and software. The heterogeneity problem is relatively easy to solve if the distributed database is heterogeneous in hardware but homogeneous in software (the same DBMS in every node). If different DBMSs are used in the nodes of the distributed system, means for converting data structures and languages are necessary. This must ensure transparency of conversion across the nodes of the distributed database.
  • It is necessary to solve the problem of dictionaries. To ensure all types of transparency in a distributed database, programs manage numerous dictionaries and reference books are needed.
  • It is necessary to define the methods for performing queries in a distributed database. Methods for performing queries in a distributed database differ from similar methods in centralized bases, since individual parts of the requests must be performed at the location of the appropriate data and transmit partial results to other nodes; At the same time, all processes should be coordinated.
  • It is necessary to solve the problem of parallel execution of queries. In a distributed base, a complex mechanism of simultaneous processing is needed, which, in particular, should provide synchronization during information updates, which guarantees the consistency of the data.
  • A developed methodology for distributing and placing data, including splitting, is necessary; it is one of the basic requirements for a distributed database.

Database machines are one of the actively developing new directions in computing system architecture and a powerful means of non-numeric information processing. Database machines are used to solve non-numeric tasks such as storing, searching, and transforming documents and facts, and working with objects. Following the definition of data as digital and graphic information about objects of the surrounding world, numeric and non-numeric processing put different content into the concept of data. In numeric processing the objects are variables, vectors, matrices, multidimensional arrays, constants, and so on, whereas in non-numeric processing the objects can be files, records, fields, hierarchies, networks, relations, etc. Non-numeric processing is interested directly in information about objects (for example, a specific employee or a group of employees) rather than in the employee file as such. The employee file is not indexed here to select a particular person; what matters is the content of the desired record. Non-numeric processing usually deals with huge volumes of information. In various applications, operations such as the following can be performed on these data:

  • raise the salary to all employees of the company;
  • calculate the bank interest on the accounts of all customers;
  • make changes to the list of all goods in stock;
  • find the desired abstract of all texts stored in the library or in the bibliographic information and search system;
  • find a description of the desired contract in the file containing legal documents;
  • view all files containing patent descriptions and find a patent (if any) similar to the one proposed anew.

To implement database machines, parallel and associative architectures are used as an alternative to the single-processor von Neumann structure, which makes it possible to work with large volumes of information in real time.

Database machines are also important in connection with the study and application of artificial intelligence concepts such as knowledge representation, expert systems, logical inference, pattern recognition, etc.

Information storages (data warehouses). Today many recognize that most companies already operate several databases and, for successful work with information, not just different databases but different generations of DBMS. According to statistics, each organization uses an average of 2.5 different DBMSs. The need became obvious to "isolate" the business - or rather, the people dealing with this business - from the technological features of the databases, and to provide users with a single view of corporate information regardless of where it is physically stored. This stimulated the emergence of data warehousing technology (Data Warehousing, DW).

The main goal of DW is to create a single logical representation of the data contained in different databases or, in other words, a unified corporate data model.

The new round of DW development became possible thanks to the improvement of information technologies in general, in particular the emergence of new types of databases based on parallel query processing, which in turn relied on advances in the field of parallel computers. Query builders with an intuitive graphical interface were created, making it easy to build complex queries to a database. Various middleware provided communication between databases of different types, and, finally, data storage devices became dramatically cheaper.

A data bank may also be present in the corporation's structure.

A data bank is a functional and organizational component of automated control systems and information and computing systems that provides centralized information support for a group of users or for the set of tasks solved in the system.

A data bank is considered an information and reference system whose main purpose is:

  • in the accumulation and maintenance in the working state of the set of information that make up the information base of the entire automated system or a set of tasks solved in it;
  • in the issuance of the required task or user data;
  • in providing collective access to stored information;
  • in ensuring the necessary management of the use of information contained in the information base.

Thus, a modern data bank is a complex software and hardware system that includes technical, system, and network facilities, databases and DBMSs, and information retrieval systems for various purposes.

V.2. DBMS and structural solutions in corporate systems

Database and Knowledge Management Systems

An important component of modern information systems are database management systems (DBMS).

A DBMS is a complex of software and language tools intended for creating, maintaining, and using databases.

A database management system provides data processing systems with access to databases. As noted, the DBMS plays an important role when creating corporate information systems and an especially important role when creating information systems that use distributed information resources based on modern network computer technologies.

The main feature of modern DBMSs is that they support technologies such as:

  • Client / server technology.
  • Support for database languages: a schema definition language (SDL - Schema Definition Language), a data manipulation language (DML - Data Manipulation Language), and integrated languages such as SQL (Structured Query Language), QBE (Query-By-Example), and QMF (Query Management Facility) - an advanced peripheral tool for query specification and report generation for DB2; etc.
  • Direct data management in external memory.
  • Management of RAM buffers.
  • Transaction management: OLTP technology (On-Line Transaction Processing) and OLAP technology (On-Line Analytical Processing) for DW.
  • Protection and integrity of data. Use of the system is allowed only to users entitled to access the data. When users perform operations on the data, the consistency (integrity) of the stored data is maintained. This is important in corporate multi-user information systems.
  • Logging (journaling).

Modern DBMSs should ensure the requirements for databases listed above. In addition, they must satisfy the following principles:

  • Data independence.
  • Universality. The DBMS must have powerful tools for supporting a conceptual data model to display custom logical representations.
  • Compatibility. The DBMS should maintain performance in the development of software and hardware.
  • Non-redundancy of data. Unlike file systems, the database must be a single set of integrated data.
  • Data protection. The DBMS must provide protection against unauthorized access.
  • Data integrity. The DBMS must prevent users from violating the integrity of the database.
  • Concurrency management. The DBMS must protect the database from inconsistency in shared-access mode. To ensure a consistent database state, all user requests (transactions) must be executed in a certain order.
  • The DBMS must be universal. It must maintain different data models on a single logical and physical basis.
  • The DBMS must support both centralized and distributed databases and, thus, become an important link of computing networks.

Considering DBMSs as a class of software products oriented toward maintaining databases in automated systems, we can distinguish the two most significant features that determine the types of DBMS. According to them, a DBMS can be viewed from two points of view:

  • their capabilities in relation to distributed (corporate) bases;
  • their relationship to the type of data model implemented in the DBMS.

In relation to corporate (distributed) databases, the following types of DBMS can be distinguished:

  • "Desktop" DBMSs. These products are primarily oriented toward working with personal ("desktop") data. They have command sets for sharing common databases, but small ones (on the scale of a small office). First of all, these are DBMSs such as Access, dBase, Paradox, and FoxPro. Why do Access, dBase, Paradox, and FoxPro have unsatisfactory access to corporate data? The point is that there is no simple way to overcome the barrier between personal and corporate data. The essence is not even that the mechanism of a personal-data (or small-office) DBMS is oriented toward accessing data through many gateways, internetworking products, etc. The problem is that these mechanisms usually involve complete file transfers and lack developed index support, as a result of which server queues practically grind to a halt in large systems.
  • Specialized high-performance multi-user DBMSs. Such DBMSs are characterized by a multi-user system kernel, a data manipulation language, and the following functions typical of developed multi-user DBMSs:
  • buffer pool management;
  • a transaction queue processing system;
  • multi-user data locking mechanisms;
  • transaction logging;
  • access differentiation mechanisms.

These are DBMSs such as Oracle, DB2, SQL Server, Informix, Sybase, ADABAS, Titanium, and others; they provide a wide range of services for processing corporate databases.

When working with databases, the transaction mechanism is used.

A transaction is a logical unit of work.

A transaction is a sequence of data manipulation operators executed as a whole (all or nothing) that transfers the database from one consistent state to another consistent state.

A transaction has four important properties, known as the ACID properties:

  • (A) Atomicity. A transaction is executed as an atomic operation: either the entire transaction is executed, or it is not executed at all.
  • (C) Consistency. A transaction transfers the database from one consistent (integral) state to another consistent (integral) state. Within the transaction, database consistency may be violated.
  • (I) Isolation. Transactions of different users must not interfere with each other (for example, as if they were executed strictly one after another).
  • (D) Durability. If a transaction has been completed, the results of its work must be preserved in the database, even if the system fails at the next moment.

A transaction usually starts automatically from the moment the user attaches to the DBMS and continues until one of the following events occurs:

  • The COMMIT WORK command is issued (commit the transaction).
  • The ROLLBACK WORK command is issued (roll back the transaction).
  • The user has disconnected from the DBMS.
  • A system failure has occurred.

For the user, a transaction usually has an atomic nature; in fact, it is a complex mechanism of interaction between the user (application) and the database. The software of corporate systems uses real-time transaction processing mechanisms (On-Line Transaction Processing systems, OLTP); in particular, accounting programs, software for receiving and processing client orders, and financial applications produce a great deal of information. These systems are designed (and correspondingly optimized) for processing large volumes of data, executing complex transactions, and intensive read/write operations.
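A minimal sketch of this all-or-nothing behaviour, assuming SQLite and an invented account-transfer scenario:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER NOT NULL)")
conn.execute("INSERT INTO account VALUES (1, 100), (2, 50)")
conn.commit()

def transfer(conn, amount, fail=False):
    try:
        # Both updates form one logical unit of work: a transfer between two accounts.
        conn.execute("UPDATE account SET balance = balance - ? WHERE id = 1", (amount,))
        conn.execute("UPDATE account SET balance = balance + ? WHERE id = 2", (amount,))
        if fail:
            raise RuntimeError("simulated failure before COMMIT")
        conn.commit()        # commit the transaction
    except Exception:
        conn.rollback()      # roll back the transaction: all or nothing

transfer(conn, 30, fail=True)
print(conn.execute("SELECT id, balance FROM account ORDER BY id").fetchall())
# [(1, 100), (2, 50)]  -- both updates were undone, the previous consistent state is restored

transfer(conn, 30)
print(conn.execute("SELECT id, balance FROM account ORDER BY id").fetchall())
# [(1, 70), (2, 80)]
```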

Unfortunately, the information stored in the databases of OLTP systems is of little use to ordinary users (because of the high degree of table normalization, specific data presentation formats, and other factors). Therefore, data from the various information "conveyors" is sent (in the sense of copied) to a warehouse for storage, sorting, and subsequent delivery to the consumer. In information technology, the role of such warehouses is played by information storages (data warehouses).

Delivery of information to the end user is handled by real-time analytical processing systems (On-Line Analytical Processing, OLAP), which provide exceptionally simple access to data through convenient tools for generating queries and analyzing results. In OLAP systems, the value of the information product is increased by applying a variety of methods of analysis and statistical processing. In addition, these systems are optimized for speed of data retrieval and collection of summarized information, and they are oriented toward ordinary users (they have an intuitive interface). Whereas an OLTP system answers simple questions like "What was the sales level of product N in region M in January 199x?", an OLAP system is ready for more complex user queries, for example: "Give an analysis of sales of product N in all regions against the plan for the second quarter in comparison with the two previous years."

Client/server architecture

In modern distributed information processing systems, client/server technology occupies the central place. In a client/server architecture, data processing is divided between the client computer and the server computer, which communicate over a network. This division of data processing is based on grouping of functions. As a rule, the database server computer is dedicated to performing database operations, while the client computer runs application programs. Figure 2.1 shows a simple client/server architecture that includes a computer acting as the server and another computer acting as its client. Each machine performs different functions and has its own resources.

Fig. 2.1. A client/server architecture system: a database on the server computer, connected over a network to several IBM-compatible PCs running applications.

The main function of the client computer is to run the application (the user interface and presentation logic) and to communicate with the server when the application requires it.

A server is an object (computer) that provides services to other objects at their request.

As follows from the term, the main function of the server computer is to serve the needs of the client. The term "server" is used to designate two different groups of functions: file server and database server (below, these terms mean, depending on the context, either the software implementing the given group of functions or the computers running this software). File servers are not intended for performing database operations; their main function is sharing files among multiple users, i.e., providing simultaneous access for many users to files on the file server computer. An example of a file server is the Novell NetWare operating system. A database server can be installed and run on a file server computer. The Oracle DBMS in the form of an NLM (Network Loadable Module) runs in the NetWare environment on a file server.

A local network server must have resources corresponding to its functional purpose and to the needs of the network. Note that, given the orientation toward the open systems approach, it is more correct to speak of logical servers (meaning a set of resources and software that provide services over these resources), which do not necessarily reside on different computers. A feature of a logical server in an open system is that if, for reasons of efficiency, it becomes advisable to move the server to a separate computer, this can be done without any modification either of the server itself or of the application programs using it.

One of the important requirements for the server is that the operating system hosting the database server must be multitasking (and preferably, though not necessarily, multi-user). For example, the Oracle DBMS installed on a personal computer running MS-DOS (or PC-DOS), which does not satisfy the multitasking requirement, cannot be used as a database server. The same Oracle DBMS installed on a computer with the multitasking (though not multi-user) OS/2 operating system can be a database server. Many varieties of UNIX, MVS, VM, and some other operating systems are both multitasking and multi-user.
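A crude sketch of the division of labour between client and server (a toy TCP server that owns an in-memory SQLite database and a client that only sends requests; the host, port, and table are invented for the example):

```python
import socket
import sqlite3
import threading
import time

HOST, PORT = "127.0.0.1", 50007   # invented address and port for the illustration

def server():
    # The server owns the database and performs all database operations itself.
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE product (name TEXT, price INTEGER)")
    db.executemany("INSERT INTO product VALUES (?, ?)", [("bread", 30), ("milk", 45)])
    with socket.socket() as s:
        s.bind((HOST, PORT))
        s.listen(1)
        conn, _ = s.accept()
        with conn:
            name = conn.recv(1024).decode()                      # the client's request
            row = db.execute("SELECT price FROM product WHERE name = ?",
                             (name,)).fetchone()
            conn.sendall(str(row[0] if row else -1).encode())    # only the result travels back

threading.Thread(target=server, daemon=True).start()
time.sleep(0.2)   # crude wait for the server to start listening

# The client runs the application logic and only sends requests over the network.
with socket.socket() as c:
    c.connect((HOST, PORT))
    c.sendall(b"milk")
    print("price of milk:", c.recv(1024).decode())   # price of milk: 45
```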

Distributed computing

The term "distributed computing" is often used to denote two different, albeit complementary, concepts:

  • Distributed database;
  • Distributed data processing.

The application of these concepts makes it possible to organize access for end users, using various means, to information stored on several machines.

There are many types of servers:

  • Database server;
  • Print server;
  • Remote Access Server;
  • Fax server;
  • Web server, etc.

The client/server technology rests on such basic technologies as:

  • Operating system technologies, the concept of open systems interaction, and the creation of object-oriented environments for running programs;
  • Telecommunication technologies;
  • Network technologies;
  • Graphical user interface technology (GUI);
  • Etc.

Advantages of client-server technology:

  • Client/server technology makes computing possible in heterogeneous computing environments. Platform independence: access to heterogeneous network environments that include computers of different types with different operating systems.
  • Independence from data sources: access to information in heterogeneous databases. Examples of such systems are DB2, SQL/DS, Oracle, Sybase.
Zaitsev S.L., Cand. Sc. (Physics and Mathematics)

Repeating groups

Repeating groups are attributes for which a single instance of an entity can have more than one value. For example, a person can have more than one skill. If, from the standpoint of business requirements, we need to know the proficiency level for each skill, and each person can have only two skills, we can create the entity shown in Fig. 1.6. Here the entity PERSON is present with two attributes for storing skills and the proficiency level for each.

Fig. 1.6. This example uses repeating groups.

The problem with repeating groups is that we cannot know exactly how many skills a person may have. In real life, some people have one skill, some have several, and some have none yet. Figure 1.7 shows the model brought to first normal form. Note the added Skill identifier, which uniquely identifies each SKILL.

Fig. 1.7. The model brought to first normal form.

One fact in one place

If the same attribute is present in more than one entity and is not a foreign key, the attribute is considered redundant. The logical model must not contain redundant data.

Redundancy requires additional space, but although memory efficiency is important, the real problem lies elsewhere. Guaranteed synchronization of redundant data requires overhead, and you always run the risk of conflicting values.

In the previous example, SKILL depends on the Person identifier and on the Skill identifier. This means that a SKILL does not appear until a PERSON possessing that skill appears. It also makes changing a skill name more difficult: you have to find every record containing the skill name and change it for every person who has that skill.

Figure 1.8 shows the model in second normal form. Note that the SKILL entity has been added and the skill NAME attribute has been moved into it. The skill level remains, accordingly, at the intersection of PERSON and SKILL.

Fig. 1.8. In second normal form, the repeating group is moved into a separate entity. This provides the flexibility to add as many skills as needed and to change a skill's name or description in one place.
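A sketch of the structure in Fig. 1.8 as SQL tables (SQLite dialect; the column names are chosen for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- The repeating group (the skill name) is moved into its own entity.
    CREATE TABLE person (
        person_id INTEGER PRIMARY KEY,
        full_name TEXT NOT NULL
    );
    CREATE TABLE skill (
        skill_id   INTEGER PRIMARY KEY,
        skill_name TEXT NOT NULL          -- changed in exactly one place
    );
    -- The skill level stays at the intersection of PERSON and SKILL.
    CREATE TABLE person_skill (
        person_id INTEGER NOT NULL REFERENCES person(person_id),
        skill_id  INTEGER NOT NULL REFERENCES skill(skill_id),
        level     TEXT,
        PRIMARY KEY (person_id, skill_id)
    );
""")
```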

Each attribute depends on the key

Each attribute of an entity must depend on the primary key of that entity. In the previous example, School name and Geographic region are present in the PERSON table but do not describe the person. To achieve third normal form, these attributes must be moved into an entity where they will depend on the key. Figure 1.9 shows the model in third normal form.

Fig. 1.9. In third normal form, School name and Geographic region are moved into an entity where their values depend on the key.

Many-to-many relationships

Many-to-many relationships reflect the reality of the surrounding world. Note that in Figure 1.9 there is a many-to-many relationship between PERSON and SCHOOL. The relationship accurately reflects the fact that a PERSON can study in many SCHOOLs and that many PERSONs can study in one SCHOOL. To achieve fourth normal form, an associative entity is created, eliminating the many-to-many relationship by forming a separate record for each unique combination of school and person. Figure 1.10 shows the model in fourth normal form.

Fig. 1.10. In fourth normal form, the many-to-many relationship between PERSON and SCHOOL is resolved by introducing an associative entity in which a separate record is assigned to each unique combination of SCHOOL and PERSON.
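The same idea for the many-to-many relationship between PERSON and SCHOOL, sketched as an associative table (SQLite dialect; column names are invented for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (person_id INTEGER PRIMARY KEY, full_name TEXT NOT NULL);
    CREATE TABLE school (
        school_id         INTEGER PRIMARY KEY,
        school_name       TEXT NOT NULL,
        geographic_region TEXT
    );
    -- The associative entity: one row per unique (person, school) combination.
    CREATE TABLE person_school (
        person_id INTEGER NOT NULL REFERENCES person(person_id),
        school_id INTEGER NOT NULL REFERENCES school(school_id),
        PRIMARY KEY (person_id, school_id)
    );
""")
```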

Formal definitions of normal forms

The following definitions of normal forms may seem intimidating. Think of them simply as formulas for achieving normalization. Normal forms are based on relational algebra and can be interpreted as mathematical transformations. Although this book is not devoted to a detailed discussion of normal forms, model developers are advised to study this question more deeply.

In a given relation R, attribute Y functionally depends on attribute X; in symbolic form, R.X -> R.Y (read "R.X functionally determines R.Y") if and only if each value of X in R is associated with exactly one value of Y in R (at any given moment in time). Attributes X and Y may be composite (Date C. J. An Introduction to Database Systems. 6th edition. Williams, 1999, 848 p.).

The relation R is in first normal form (1NF) if and only if all domains belonging to it contain only atomic values (Date, ibid.).

The relation R is in second normal form (2NF) if and only if it is in 1NF and every non-key attribute fully depends on the primary key (Date, ibid.).

The relation R is in third normal form (3NF) if and only if it is in 2NF and no non-key attribute transitively depends on the primary key (Date, ibid.).

The relation R is in Boyce-Codd normal form (BCNF) if and only if every determinant is a candidate key.

NOTE Below is a brief explanation of some abbreviations used in Date's definitions.

MVD (Multi-Valued Dependency) - multivalued dependency. Used only for entities with three or more attributes. In a multivalued dependency, the value of an attribute depends only on part of the primary key.

FD (Functional Dependency) - functional dependency. In a functional dependency, the value of an attribute depends on the value of another attribute that is not part of the primary key.

JD (Join Dependency) - join dependency. With a join dependency, the primary key of the parent entity can be traced to descendants at least to the third level while retaining the ability to be used in a join on the original key.

A relation is in fourth normal form (4NF) if and only if, whenever a multivalued dependency exists, for example A ->> B, all attributes of R also functionally depend on A. In other words, only dependencies (FD or MVD) of the form K -> X are present (i.e., a functional dependency of attribute X on the candidate key K). Accordingly, R satisfies 4NF if it is in BCNF and all MVDs are in fact FDs (Date, ibid.).

For fifth normal form, the relation R satisfies the join dependency (JD) * (X, Y, ..., Z) if and only if R is equivalent to the join of its projections onto X, Y, ..., Z, where X, Y, ..., Z are subsets of the set of attributes of R.

There are many other normal forms for complex data types and specific situations that go beyond our discussion. Every modeling enthusiast would do well to study the other normal forms as well.

Business Normal Forms

In his book, Clive Finkelstein (Finkelstein Cl. An Introduction to Information Engineering: From Strategic Planning to Information Systems. Reading, Massachusetts: Addison-Wesley, 1989) took a different approach to normalization. He defines business normal forms in terms of reduction to those forms. Many model developers consider this approach more intuitive and pragmatic.

The first business normal form (1BNF) moves repeating groups into a separate entity. That entity receives its own name and its primary (composite) key attributes from the original entity and its repeating group.

The second business normal form (2BNF) moves attributes that partially depend on the primary key into a separate entity. The primary (composite) key of that entity is the primary key of the entity in which the attributes were originally located, together with the additional keys on which the attributes fully depend.

The third business normal form (3BNF) moves attributes that do not depend on the primary key into a separate entity, where they fully depend on that entity's primary key.

The fourth business normal form (4BNF) moves attributes that depend on the value of the primary key or are optional into a secondary entity, where they fully depend on the value of the primary key or where they must (obligatorily) be present in that entity.

The fifth business normal form (5BNF) appears as a structural entity if there is a recursive or other dependency between instances of a secondary entity, or if a recursive dependency exists between instances of its primary entity.

Completed logical data model

The completed logical model must satisfy the requirements of third business normal form and include all the entities, attributes, and relationships necessary to support the data requirements and the business rules associated with the data.

All entities must have names that describe their content and a clear, concise, complete description or definition. A future publication will consider an initial set of recommendations for properly forming entity names and descriptions.

Entities must have a complete set of attributes, so that every fact about each entity can be represented by its attributes. Each attribute must have a name that reflects its values, a logical data type, and a clear, concise, complete description or definition. A future publication will consider an initial set of recommendations for properly forming attribute names and descriptions.

Relationships should include a verb construction that describes the relationship between the entities, along with characteristics such as multiplicity, the necessity of existence, or the possibility of the relationship's absence.

NOTE The multiplicity of a relationship describes the maximum number of instances of the secondary entity that can be associated with an instance of the original entity. The necessity of existence or the possibility of absence of a relationship serves to determine the minimum number of instances of the secondary entity that can be associated with an instance of the original entity.

Physical data model

After creating a complete and adequate logical model, you are ready to decide on the choice of the implementation platform. The selection of the platform depends on the requirements for the use of data and strategic principles for the formation of the corporation architecture. The choice of platform is a difficult problem that goes beyond the framework of this book.

In ERwin, the physical model is a graphical representation of the actually implemented database. The physical database consists of tables, columns and relationships. The physical model depends on the platform chosen for implementation and on the data usage requirements. The physical model for IMS will differ substantially from the same model for Sybase. The physical model for OLAP reports will look different from the model for OLTP (online transaction processing).

The data model developer and the database administrator (DBA) use the logical model, the usage requirements and the strategic principles shaping the corporation's architecture to develop the physical data model. You can denormalize the physical model to improve performance and create views to support usage requirements. The following sections consider the processes of denormalization and view creation in detail.
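To make the denormalization and view trade-off concrete, here is a minimal hedged sketch (the table, column and view names are invented for this illustration and are not from the ERwin material): an order total is stored redundantly to speed up reporting, and a view presents joined, filtered data to users without changing the base tables:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE Customer (customer_no INTEGER PRIMARY KEY, customer_name TEXT NOT NULL);
    CREATE TABLE SalesOrder (
        order_no    INTEGER PRIMARY KEY,
        customer_no INTEGER NOT NULL REFERENCES Customer(customer_no),
        order_total REAL    NOT NULL   -- denormalized: derivable from line items,
                                       -- stored here to speed up reporting queries
    );
    -- A view supporting a user requirement: large orders joined to customer names,
    -- hiding the physical structure from the user.
    CREATE VIEW v_large_orders AS
    SELECT o.order_no, c.customer_name, o.order_total
    FROM   SalesOrder o
    JOIN   Customer   c ON c.customer_no = o.customer_no
    WHERE  o.order_total > 1000;
    """)

    conn.execute("INSERT INTO Customer VALUES (1, 'Acme')")
    conn.execute("INSERT INTO SalesOrder VALUES (10, 1, 2500.0)")
    print(conn.execute("SELECT * FROM v_large_orders").fetchall())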

This section provides an overview of the process of building a physical model and collecting data usage requirements, and defines the components of the physical model and reverse engineering. These topics are covered in more detail in the following publications.

Collecting data requirements

Usually you collect data usage requirements at the early stages, during interviews and working sessions. The requirements should define the use of data by users as fully as possible. A superficial attitude and gaps in the physical model can lead to unplanned costs and delays in the project schedule. Usage requirements include:

    Access and performance requirements

    Volumetric characteristics (an estimate of the amount of data to be stored), which allow the administrator to estimate the size of the physical database (a simple sizing sketch follows this list)

    An estimate of the number of users who need simultaneous access to the data, which helps you design the database for an acceptable level of performance

    Totals, summaries and other calculated or derived data that can be considered candidates for storage in persistent data structures

    Reporting and standard query requirements, which help the database administrator define indexes

    Views (persistent or virtual) that will assist the user when performing join or filter operations.
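As a simple illustration of the volumetric estimate mentioned in the list above (all numbers are hypothetical placeholders, not figures from the text), a back-of-the-envelope sizing calculation might look like this:

    # Hypothetical volumetric estimate for sizing a physical table.
    rows_initial   = 5_000_000     # current number of rows (assumed)
    avg_row_bytes  = 220           # average row size including overhead (assumed)
    yearly_growth  = 0.25          # 25% more rows per year (assumed)
    index_overhead = 0.4           # indexes add roughly 40% on top of the data (assumed)
    years          = 3

    rows = rows_initial * (1 + yearly_growth) ** years
    size_gb = rows * avg_row_bytes * (1 + index_overhead) / 1024**3
    print(f"~{rows:,.0f} rows, ~{size_gb:.1f} GB after {years} years")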

In addition to the chairperson, the secretary and the users, the model developer, the database administrator and the database architect should take part in the usage requirements session. The users' requirements for historical data must be discussed. The length of time for which data is kept has a significant effect on the size of the database. Often, older data is stored in summarized form, while atomic data is archived or deleted.

Users should bring examples of queries and reports to the session. Reports must be strictly defined and must include the atomic values used for all total and summary fields.

Components of the physical data model

The components of the physical data model are tables, columns and relationships. The entities of the logical model are likely to become tables in the physical model. Logical attributes will become columns. Logical relationships will become referential integrity constraints. Some logical relationships cannot be implemented directly in a physical database.
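A hedged sketch of this logical-to-physical mapping (the names are invented for illustration): a one-to-many relationship becomes a foreign-key constraint, while a many-to-many logical relationship has no direct physical equivalent and is resolved through an associative table:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    PRAGMA foreign_keys = ON;

    CREATE TABLE Product  (product_code TEXT    PRIMARY KEY, product_name  TEXT NOT NULL);
    CREATE TABLE Supplier (supplier_no  INTEGER PRIMARY KEY, supplier_name TEXT NOT NULL);

    -- The many-to-many relationship "Supplier supplies Product" is implemented
    -- as an associative table carrying two referential integrity constraints.
    CREATE TABLE ProductSupplier (
        product_code TEXT    NOT NULL REFERENCES Product(product_code),
        supplier_no  INTEGER NOT NULL REFERENCES Supplier(supplier_no),
        PRIMARY KEY (product_code, supplier_no)
    );
    """)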

Reverse engineering

When a logical model is not available, it becomes necessary to recreate the model from the existing database. In ERwin this process is called reverse engineering. Reverse engineering can be done in several ways. The model developer can examine the data structures in the database and recreate the tables in the visual modeling environment. You can import a data definition language (DDL) script into a tool that supports reverse engineering (for example, ERwin). Advanced tools such as ERwin include functions that connect via ODBC to an existing database and build the model by reading the data structures directly. Reverse engineering with ERwin will be discussed in detail in one of the following publications.
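ERwin's own reverse-engineering functions are out of scope here, but the underlying idea of reading data structures straight from a live database catalog can be sketched generically (a hedged illustration using SQLite's catalog as a stand-in for an ODBC source; the sample table is invented):

    import sqlite3

    def reverse_engineer(conn: sqlite3.Connection) -> dict:
        """Read table and column structures directly from the database catalog."""
        model = {}
        tables = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table'").fetchall()
        for (table,) in tables:
            # PRAGMA table_info returns (cid, name, type, notnull, dflt_value, pk)
            cols = conn.execute(f"PRAGMA table_info({table})").fetchall()
            model[table] = [(c[1], c[2], bool(c[5])) for c in cols]
        return model

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Customer (customer_no INTEGER PRIMARY KEY, name TEXT)")
    print(reverse_engineer(conn))
    # -> {'Customer': [('customer_no', 'INTEGER', True), ('name', 'TEXT', False)]}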

Use of corporate functional boundaries

When building a logical model, it is important for the model developer to make sure that the new model is consistent with the corporate model. Using corporate functional boundaries means modeling data in the terms used within the corporation. The way data is used in a corporation changes faster than the data itself. In every logical model, data should be represented in an integral way, regardless of the business subject area it supports. Entities, attributes and relationships should define business rules at the level of the corporation.

NOTE: Some of my colleagues refer to these corporate functional boundaries as real-world modeling. Real-world modeling encourages the model developer to look at information in terms of the relationships and dependencies that are actually inherent in it.

Using corporate functional boundaries for an appropriately built data model provides a foundation for supporting the information needs of any number of processes and applications, which enables the corporation to exploit efficiently one of its most valuable assets: information.

What is a corporate data model?

The corporate data model (EDM, Enterprise Data Model) contains the entities, attributes and relationships that represent the information needs of the corporation. An EDM is usually subdivided into subject areas, which represent groups of entities related to supporting specific business needs. Some subject areas may cover specific business functions, such as contract management, while others group entities that describe products or services.

Each logical model must correspond to an existing subject area of the corporate data model. If a logical model does not meet this requirement, a subject-area-defining model must be added to it. This comparison ensures that the corporate model is improved or corrected and that all logical modeling efforts within the corporation are coordinated.

The EDM also includes specific entities that define the domain of values for key attributes. These entities have no parents and are defined as independent. Independent entities are often used to maintain referential integrity. They go by several different names, such as code tables, reference tables, type tables or classification tables. We will use the term "corporate business object". A corporate business object is an entity that contains a set of attribute values that do not depend on any other entity. Corporate business objects should be used uniformly within the corporation.
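A minimal hedged sketch of a corporate business object used as a code table (the status codes below are invented for illustration): every model that needs a contract status references this single entity instead of redefining the codes locally:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    PRAGMA foreign_keys = ON;

    -- Independent entity: no parent, just the corporation-wide domain of values.
    CREATE TABLE ContractStatusCode (
        status_code TEXT PRIMARY KEY,
        description TEXT NOT NULL
    );
    INSERT INTO ContractStatusCode VALUES
        ('DRAFT', 'Draft'), ('ACTIVE', 'Active'), ('CLOSED', 'Closed');

    -- Any entity across the corporation references the same code table.
    CREATE TABLE Contract (
        contract_no INTEGER PRIMARY KEY,
        status_code TEXT NOT NULL REFERENCES ContractStatusCode(status_code)
    );
    """)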

Building a corporate data model by extension

There are organizations in which the corporate model was built from start to finish as a single concerted effort. But most organizations build reasonably complete corporate models incrementally, by extension.

Building by extension means constructing something sequentially, layer by layer, just as an oyster grows a pearl. Each data model that is created contributes to the formation of the EDM. Building the EDM in this way requires additional modeling steps to add new data structures and subject areas or to extend existing data structures. This makes it possible to build the corporate data model incrementally, iteratively adding levels of detail and refinement.

The concept of modeling methodology

There are several methodologies for visual data modeling. ERwin supports two of them:

    IDEF1X (Integration Definition for Information Modeling).

    IE (Information Engineering).

IDEF1X is a sound methodology, and its notation is widely used.

Integrated description of information models

IDEF1X is a highly structured data modeling methodology that extends the IDEF1 methodology adopted as a FIPS (Federal Information Processing Standards) standard. IDEF1X uses a strictly structured set of modeling construct types and leads to a data model that requires an understanding of the physical nature of the data before such information can be made available.

The rigid structure of IDEF1X forces the model developer to assign characteristics to entities that may not correspond to the realities of the world around us. For example, IDEF1X requires all entity subtypes to be exclusive. As a result, a person cannot be a customer and an employee at the same time, even though real practice tells us otherwise.
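A common workaround, shown here only as a hedged sketch (it is not part of the IDEF1X or IE notation described in the text), is to model non-exclusive subtypes through a person/role structure, so that the same person can hold the customer and employee roles simultaneously:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    PRAGMA foreign_keys = ON;

    CREATE TABLE Person   (person_id INTEGER PRIMARY KEY, full_name TEXT NOT NULL);
    CREATE TABLE Customer (person_id INTEGER PRIMARY KEY REFERENCES Person(person_id),
                           customer_no TEXT NOT NULL);
    CREATE TABLE Employee (person_id INTEGER PRIMARY KEY REFERENCES Person(person_id),
                           hire_date  TEXT NOT NULL);
    """)

    # The same person participates in both subtypes, which exclusive subtyping forbids.
    conn.execute("INSERT INTO Person VALUES (1, 'Jane Doe')")
    conn.execute("INSERT INTO Customer VALUES (1, 'C-001')")
    conn.execute("INSERT INTO Employee VALUES (1, '2024-01-15')")
    print(conn.execute("""
        SELECT p.full_name FROM Person p
        JOIN Customer c ON c.person_id = p.person_id
        JOIN Employee e ON e.person_id = p.person_id
    """).fetchall())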

Information engineering

Clive Finkelstein is often called the father of information engineering, although similar concepts were developed together with him by James Martin (Martin, James. Managing the Database Environment. Upper Saddle River, New Jersey: Prentice Hall, 1983). Information engineering takes a business-driven approach to managing information and uses a different notation for representing business rules. IE serves as an extension and development of the notation and the basic concepts of the ER methodology proposed by Peter Chen.

IE provides an infrastructure for information support by integrating corporate strategic planning with the information systems being developed. Such integration allows the management of information resources to be tied more closely to the long-term strategic prospects of the corporation. This business-requirements-driven approach leads many model developers to choose IE over other methodologies that focus mainly on solving immediate development tasks.

IE proposes a sequence of actions that leads the corporation to identify all of its information needs for collecting and managing data and to identify the relationships between information objects. As a result, information requirements are clearly formulated on the basis of management directives and can be translated directly into a management information system that will support strategic information needs.

Conclusion

Understanding how to use a data modeling tool such as ERwin is only part of the problem. In addition, you must understand when data modeling tasks are being solved and how the requirements for information and the business rules that must be represented in the data model are collected. Holding working sessions provides the most favorable conditions for collecting information requirements in an environment that includes subject-area experts, users and information technology specialists.

Building a good data model requires analyzing and examining the requirements for information and the business rules collected during working sessions and interviews. The resulting data model should, whenever possible, be compared with the corporate model to make sure that it does not conflict with existing object models and that it includes all the required objects.

The data model consists of logical and physical models that represent the information requirements and business rules. The logical model must be reduced to third normal form. Third normal form restricts insert, update and delete anomalies in the data structures in order to support the "one fact in one place" principle. The collected information requirements and business rules must be analyzed and examined. They must be compared with the corporate model to ensure that they do not conflict with existing object models and that they include all the required objects.

The ERwin data model includes both logical and physical models. ERwin implements the ER approach and allows you to create logical and physical model objects to represent information requirements and business rules. Logical model objects include entities, attributes and relationships. Physical model objects include tables, columns and referential integrity constraints.

One of the following publications will cover identifying entities, determining entity types, choosing entity names and descriptions, as well as some techniques that help to avoid the most common modeling errors associated with the use of entities.

Increasingly, IT specialists are turning their attention to data management solutions based on industry-standard data models and business solution templates. Ready-to-use comprehensive physical data models and business intelligence reports for specific areas of activity make it possible to unify the information component of an enterprise's operations and significantly speed up the execution of business processes. Solution templates let service providers exploit the non-standard information capabilities hidden in existing systems, thereby reducing project timelines, costs and risks. For example, real projects show that data models and business solution templates can cut development effort by 50%.

An industry logical model is an object-oriented, integrated and logically structured representation of all the information that must be present in a corporate data warehouse in order to answer both strategic and tactical business questions. The main purpose of such models is to make it easier to navigate the data space and to help identify the details that matter for business development. In today's conditions, running a business successfully absolutely requires a clear understanding of the relationships between the various components and a good grasp of the overall picture of the organization. Identifying all the parts and relationships with the help of models allows the most effective use of the time and tools for organizing the company's work.

Data models are abstract models that describe how data is represented and accessed. Data models define the data elements and the relationships between them in a particular domain. A data model is a navigation tool for both business and IT professionals, one that uses a specific set of symbols and words to explain a certain class of real-world information accurately. This improves mutual understanding within the organization and thus creates a more flexible and stable environment for running applications.


An example of a GIS model for state authorities and local governments.

Today it is strategically important for software and service providers to be able to respond quickly to changes in the industry driven by technological innovations, the lifting of government restrictions and the growing complexity of supply chains. Along with changes in the business model, the complexity and cost of the information technology required to support the company's activities grow. Managing data is especially difficult in an environment where corporate information systems, as well as functional and business requirements, are constantly changing.

Industry data models are designed to help relieve and optimize this process and to bring the IT approach up to the modern level.

Industry data models from Esri

The data models for the Esri ArcGIS platform are working templates for use in GIS projects and for creating data structures for different application areas. Forming a data model includes creating a conceptual design and a logical and physical structure, which can then be used to build a personal or corporate geodatabase. ArcGIS provides tools for creating and managing a database schema, and data model templates are used to launch a GIS project quickly for different applications and industries. Esri specialists, together with the user community, have spent a significant amount of time developing a number of templates that make it possible to start designing an enterprise geodatabase quickly. These projects are described and documented on the support.esri.com/datamodels website. Below, in the order in which they are listed on that site, are the translated names of the Esri industry models:

  • Address registry
  • Agriculture
  • Meteorology
  • Basic spatial data
  • Biodiversity
  • Building interiors
  • Accounting for greenhouse gases
  • Delineation of administrative boundaries
  • Armed forces. Intelligence service
  • Energy (including the new ArcGIS Multispeak protocol)
  • Environmental structures
  • Emergency management. Fire safety
  • Forest Cadastre
  • Forestry
  • Geology
  • GIS National Level (E-GOV)
  • Groundwater and wastewater
  • Health
  • Archeology and Protection of Memorial Places
  • National security
  • Hydrology
  • International Hydrographic Organization (IHO). S-57 format for ENC
  • Irrigation
  • Land Registry
  • Municipal government
  • Marine navigation
  • State Cadastre
  • Oil and gas structures
  • Pipelines
  • Raster storage
  • Bathymetry, seabed relief
  • Telecommunications
  • Transport
  • Water supply, sewage, housing and communal services

These models have all the necessary characteristics of an industry standard, namely:

  • they are freely accessible;
  • they are not tied to a "favorite" vendor's technology;
  • they were created as a result of implementing real projects;
  • they were created with the participation of industry specialists;
  • they are intended to provide information interaction between different products and technologies;
  • they do not contradict other standards and regulatory documents;
  • they are used in completed projects around the world;
  • they are designed to work with information over the entire life cycle of the system being created, not just of the project itself;
  • they can be extended to meet customer needs without losing compatibility with other projects and/or models;
  • they are accompanied by additional materials and examples;
  • they are used in the methodological guidelines and technical materials of various industrial companies;
  • there is a large community of participants, and access to the community is open to all;
  • there are many references to the data models in publications of recent years.

Esri specialists are part of the expert groups of independent bodies that recommend using various industry models, such as PODS (Pipeline Open Data Standards, an open standard for the oil and gas industry; there is currently a PODS implementation as an Esri geodatabase, PODS Esri Spatial 5.1.1) or the geodatabase (BGD) from ArcGIS for Aviation, which takes into account ICAO and FAA recommendations as well as the AIXM 5.0 navigation data exchange standard. In addition, there are recommended models that strictly follow existing industry standards, such as S-57 and ArcGIS for Maritime (marine and coastal objects), as well as models created from the results of completed Esri Professional Services projects that have become de facto standards in their respective areas. For example, GIS for the Nation and Local Government has influenced the NSDI and INSPIRE standards, and Hydro and Groundwater (hydrology and groundwater) are actively used in the freely available Arc Hydro package and in commercial products of third-party firms. It should be noted that Esri also supports de facto standards such as NHDI. All the proposed data models are documented and ready for use in the enterprise's IT processes. The accompanying materials for the models include:

  • UML diagrams of entity relationships;
  • data structures, domains, lookup tables;
  • ready-made geodatabase templates in ArcGIS GDB format;
  • sample data and sample applications;
  • sample data-loading scripts and sample analysis utilities;
  • reference documentation for the proposed data structure.

Esri summarizes its experience of building industry models in the form of books and localizes the published materials. Esri CIS has localized and published the following books:

  • Geospatial service-oriented architecture (SOA);
  • Designing geodatabases for transportation;
  • Corporate geoinformation systems;
  • GIS: the new energy of electric and gas utilities;
  • Oil and gas on a digital map;
  • Modeling Our World: the Esri guide to geodatabase design;
  • Thinking About GIS: GIS planning, a guide for managers;
  • Geographic information systems. Fundamentals;
  • GIS for administrative and economic management;
  • Web GIS. Principles and applications;
  • System Design Strategies, 26th edition;
  • 68 issues of the ArcReview magazine with publications by companies and about GIS systems;
  • ... and many other thematic notes and publications.

For example, the book "Modeling Our World..." (in translation) is a comprehensive guide and reference for data modeling in GIS in general, and for the geodatabase data model in particular. The book shows how to arrive at the right data modeling decisions, decisions that affect every aspect of a GIS project: from designing the database and collecting data to spatial analysis and visual presentation. It describes in detail how to design a geographic database appropriate to the project, configure the database's functionality without programming, manage workflows in complex projects, model a variety of network structures such as river, transport or electrical networks, incorporate data into the geographic analysis and display process, and create 3D GIS data models. The book "Designing Geodatabases for Transportation" contains methodological approaches tested on a large number of projects that fully comply with the legislative requirements of Europe and the United States as well as with international standards. And the book "GIS: The New Energy of Electric and Gas Utilities" uses real examples to show the advantages that a corporate GIS can give an energy supply company, covering aspects such as customer service, network operation and other business processes.


Some of the books, translated and original editions, published in Russian by Esri CIS and DATA+. They cover both conceptual issues related to GIS technology and many applied aspects of modeling and deploying GIS of different scales and purposes.

Let us consider the application of industry models using the BISDM (Building Interior Space Data Model) version 3.0 as an example. BISDM is a development of the more general BIM (Building Information Model) and is intended for use in the design, construction, operation and decommissioning of buildings and structures. Used in a GIS, it makes it possible to exchange geodata effectively with other platforms and to interact with them. It belongs to the general group of FM tasks (facility management). Let us list the main advantages of the BISDM model, the use of which makes it possible to:

  • organize the exchange of information in a heterogeneous environment according to unified rules;
  • obtain a "physical" embodiment of the BIM concept and the recommended rules for managing a construction project;
  • maintain a single repository over the entire life cycle of the building (from design to decommissioning);
  • coordinate the work of the various specialists involved in the project;
  • visualize the planned schedule and the construction stages for all participants;
  • give a preliminary estimate of the cost and duration of construction (4D and 5D data);
  • monitor the progress of the project;
  • ensure high-quality operation of the building, including maintenance and repairs;
  • become part of the asset management system, including functions for analyzing the efficiency of space usage (rental, warehousing, staff management);
  • perform calculations for and manage the building's energy efficiency;
  • model the movement of flows of people.

BISDM defines the rules for working with spatial data at the level of the building's interior spaces, including their purpose and types of use, the utilities laid, the installed equipment, the repairs and maintenance performed, incident logging, and relationships with the company's other assets. The model helps to create a single repository of geographic and non-geographic data. The experience of leading global companies was used to identify the entities and to model, at the geodatabase (BGD) level, the spatial and logical relationships of all the physical elements forming both the building itself and its interior spaces. Following the BISDM principles makes it possible to significantly simplify integration with other systems. At the first stage this usually means integration with CAD. Then, during the building's operation, data exchange with ERP and EAM systems (SAP, TRIRIGA, Maximo, etc.) is used.


Visualization of BISDM structural elements in ArcGIS.

When BISDM is used, the customer/owner of the facility benefits from information exchange along the whole chain: from the idea of creating the facility, through the development of the full design and construction monitoring, to up-to-date information at the moment of commissioning, control of parameters during operation, and even during the reconstruction or decommissioning of the facility. Following the BISDM paradigm, the GIS and the BGD it generates become a common data repository for related systems. Data created and maintained by third-party systems often end up in the BGD. This must be taken into account when designing the architecture of the system being created.

At a certain stage, the accumulated "critical mass" of information makes it possible to move to a new level of quality. For example, once the design stage of a new building is completed, the GIS can automatically visualize overview 3D models, compile a list of the installed equipment, calculate the total length of the utility networks laid, perform a series of checks and even give a preliminary financial estimate of the project cost.

Note again that when BISDM and ArcGIS are used, 3D models can be built automatically from the accumulated data, since the BGD contains a complete description of the facility, including Z-coordinates, floor assignment, element connection types, equipment installation methods, materials, staff routes, the functional purpose of each element, and so on. It should be borne in mind that after the initial import of all the design materials into the BISDM BGD, additional information needs to be entered in order to:

  • place 3D models of objects and equipment in their designated locations;
  • collect information about the cost of materials and the procedure for laying and installing them;
  • check clearances against the dimensions of the non-standard equipment being installed.

Using ArcGIS simplifies the import of additional 3D objects and reference data from external sources, because the ArcGIS Data Interoperability module allows you to create procedures for importing such data and placing it correctly inside the model. All the formats used in this industry are supported, including IFC, AutoCAD Revit and Bentley MicroStation.

Industry data models from IBM

IBM provides a set of tools and data warehouse management models for various areas of activity:

  • IBM Banking and Financial Markets Data Warehouse (finance)
  • IBM Banking Data Warehouse
  • IBM Banking Process and Service Models
  • IBM Health Plan Data Model (healthcare)
  • IBM Insurance Information Warehouse (insurance)
  • IBM Insurance Process and Service Models
  • IBM Retail Data Warehouse (retail)
  • IBM Telecommunications Data Warehouse (telecommunications)
  • InfoSphere Warehouse Packs:
    - for Customer Insight (for understanding customers)
    - for Market and Campaign Insight (for understanding the market and campaigns)
    - for Supply Chain Insight (for understanding the supply chain).

For example, the IBM Banking and Financial Markets Data Warehouse model is designed to solve the specific problems of the banking industry from the point of view of data, and IBM Banking Process and Service Models from the point of view of processes and SOA (service-oriented architecture). For the telecommunications industry there are the IBM Information FrameWork (IFW) and IBM Telecommunications Data Warehouse (TDW) models. They help to significantly speed up the creation of analytical systems and to reduce the risks associated with developing business intelligence applications, managing corporate data and organizing data warehouses with the specifics of the telecommunications industry in mind. The capabilities of IBM TDW cover the entire spectrum of the telecommunications services market: from Internet providers and cable network operators offering wireline and wireless telephony, data transmission and multimedia content services, to transnational companies providing telephone, satellite, long-distance and international services, as well as global network organizations. Today, TDW is used by large and small providers of wireline and wireless services worldwide.

The tool called InfoSphere Warehouse Pack for Customer Insight provides structured, easy-to-implement business content for a growing number of business projects and industries, including banking, insurance, finance, health insurance programs, telecommunications, retail and distribution. For business users, InfoSphere Warehouse Pack for Market and Campaign Insight helps maximize the effectiveness of market analysis and marketing campaigns through a step-by-step development process that takes the specifics of the business into account. With InfoSphere Warehouse Pack for Supply Chain Insight, organizations can obtain up-to-date information on supply chain operations.


Esri's position within the IBM solution architecture.

Special attention in the IBM approach is paid to electric power companies and housing and utilities enterprises. To satisfy growing consumer demands, a more flexible architecture is needed than the one in use today, as well as a standard industry object model that simplifies the free exchange of information. This will increase the communication capabilities of energy companies, enabling them to interact in a more cost-effective mode, and will give new systems better visibility of all the necessary resources, regardless of where they are located within the organization. The basis for this approach is SOA (service-oriented architecture), a component model that maps the functions of departments to the services of various applications that can be reused. The "services" of such components exchange data through interfaces without tight coupling, hiding from the user the complexity of the systems behind them. In this mode, enterprises can easily add new applications regardless of the software vendor, the operating system, the programming language or other internal characteristics of the software. The SAFE (Solution Architecture for Energy) concept is implemented on the basis of SOA; it allows electric power companies to obtain a holistic, standards-based view of their infrastructure.

Esri ArcGIS® is a world-leading software platform for geographic information systems (GIS) that provides for the creation and management of the digital assets of electric power, gas transmission, distribution and telecommunication networks. ArcGIS makes it possible to carry out the most complete inventory of the components of an electric distribution network, taking their spatial location into account. ArcGIS significantly extends the IBM SAFE architecture by providing the tools, applications, workflows, analytics and information-integration capabilities needed to manage an intelligent energy enterprise. ArcGIS within IBM SAFE makes it possible to obtain information about infrastructure, assets, customers and employees with accurate data on their location from various sources, and to create, store and process geographic information about enterprise assets (poles, pipelines, wires, transformers, cable ducts and so on). ArcGIS inside the SAFE infrastructure allows the main business applications to be combined dynamically, merging data from GIS, SCADA and customer service systems with external information such as traffic intensity, weather conditions or satellite imagery. Energy enterprises use this combined information for a variety of purposes, from the common operational picture to facility inspection, maintenance, network analysis and planning.

The information components of a power supply enterprise can be modeled using several levels, ranging from the lowest, physical, level to the upper, most complex, level of business process logic. These levels can be integrated to meet typical industry requirements, for example automated meter reading and supervisory control and data acquisition (SCADA). By building the SAFE architecture, energy supply companies take significant steps toward promoting an industry-wide open object model known as the Common Information Model for Energy and Utilities (CIM). This model provides the necessary foundation for moving many enterprises toward a service-oriented architecture, since it encourages the use of open standards for structuring data and objects. Because all systems use the same objects, the confusion and inflexibility associated with different implementations of the same objects are reduced to a minimum. Thus, the definition of the "customer" object and of other important business objects will be unified across all the systems of the energy supply enterprise. With CIM, suppliers and consumers of services can now use a common data structure, which makes it easier to outsource expensive business components, since CIM establishes a common basis on which the exchange of information can be built.

Conclusion

Comprehensive industry data models provide companies with a single, integrated view of their business information. Many companies find it hard to integrate their data, even though this is a prerequisite for most enterprise-wide projects. According to a study by The Data Warehousing Institute (TDWI), more than 69% of the organizations surveyed found integration to be a significant barrier to implementing new applications. By contrast, implementing data integration brings tangible revenue and efficiency gains.

A properly constructed model unambiguously defines the meaning of the data, which in this case is structured data (as opposed to unstructured data such as an image, a binary file or text, where the meaning can be ambiguous). The most effective industry models are offered by professional vendors, including Esri and IBM. The high return on using their models is achieved thanks to their significant level of detail and accuracy. They usually contain many data attributes. In addition, Esri and IBM specialists not only have extensive modeling experience but are also skilled at building models for specific industries.



