Explanation 2: the uniqueness of data is the basis for realizing global data interconnection.
In a small setting such as a class or a work group, each person can be distinguished by name alone. Across an entire country, however, there are too many people and too many duplicate names, so a name by itself cannot identify anyone precisely. Before the big data era, the data in a relational database was used only within one organization, so it could be identified easily. Put that same data into a big data environment, and it becomes unidentifiable. In a big data environment, every piece of data about a person must therefore contain a national ID number; this is what gives the data its uniqueness.
Relational databases use an "ID" column to guarantee the uniqueness of data within each table. They consider only uniqueness inside a single table, not uniqueness in a big data environment. For example, many medical information systems identify a patient's records only by "outpatient number" and "hospital admission number" and do not record the patient's national ID number. Querying one patient's medical history in a national medical big data environment then becomes extremely difficult, because the records carry no national ID number and may be scattered across the millions of tables produced by the 978,000 medical institutions nationwide.
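A minimal sketch of this collision, with hypothetical field names and invented example ID numbers, is shown below: two hospitals independently assign the same outpatient number, and only a nationally unique key disambiguates the pooled records.

```python
# Two hospitals independently assign outpatient number "A-1001".
hospital_1_record = {"outpatient_no": "A-1001", "diagnosis": "hypertension"}
hospital_2_record = {"outpatient_no": "A-1001", "diagnosis": "diabetes"}

# Pooled nationally, the records collide: same key, different patients.
pooled = [hospital_1_record, hospital_2_record]
matches = [r for r in pooled if r["outpatient_no"] == "A-1001"]
print(len(matches))  # 2 -- ambiguous: which patient is which?

# Adding the patient's national ID number (values here are invented)
# makes each record unique in any environment, not just inside one
# hospital's own database.
hospital_1_record["national_id"] = "110101199001011234"
hospital_2_record["national_id"] = "440301198505054321"
matches = [r for r in pooled if r.get("national_id") == "110101199001011234"]
print(len(matches))  # 1 -- unambiguous across all institutions
```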
In a big data environment, the "uniqueness of data" for every kind of thing is a very important problem; it is the key to making data identifiable. For example, in the information systems of manufacturers and distributors, the code for a given item must be globally unique, unified and standard for its data to remain identifiable in a big data environment. At present this has not been achieved internationally: each enterprise's information system has its own coding scheme, and different enterprises assign different codes to the same commodity. This creates great difficulties for data interconnection and for big data analysis.
Qualified big data should work like this: if you buy a box of medicine at a drugstore, the unique code on the box should let you trace its entire production and circulation history, including which manufacturer produced it, when it was produced, when it left the factory, and which middlemen it passed through.
What the world economy needs most is global data interconnection, that is, for the data in the information systems of all enterprises worldwide to be mutually accessible; in other words, for the information systems of any two enterprises to be able to send and receive data about any commodity promptly. The actual situation today is that each enterprise has its own product coding rules. When an enterprise receives an order, someone must manually convert the order data into a form its own information system can recognize before the system can process it; only a very small number of enterprise information systems can directly process data sent by upstream enterprises. The fundamental reason global data does not flow is that current data lacks uniqueness, and there is no internationally unified, standard commodity coding scheme to support that uniqueness.
To track the circulation of a commodity around the world, the uniqueness of data is the foundation. Data about a single commodity will appear in the information systems of millions of enterprises worldwide, and only a big data identification code embodying that uniqueness can pick the commodity's data out of millions of information systems accurately. Globally unified encoding and decoding for big data (the big data identification code) is therefore a very important and complex task. In international trade, globally unified coding of orders and commodities is the basis for the interconnection of commodity data.
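As a hedged illustration only, one could compose such a globally unique code from fixed segments; the layout below (country / issuer / item / serial) is an invented assumption, not a published standard from the text.

```python
def make_commodity_code(country: str, issuer: str, item: str, serial: int) -> str:
    """Concatenate fixed-width segments so the code is unique worldwide."""
    return f"{country}-{issuer}-{item}-{serial:010d}"

# Hypothetical issuer and item identifiers:
code = make_commodity_code("CN", "200056", "AMOX500", 42)
print(code)  # CN-200056-AMOX500-0000000042

# Any information system in the world that sees this code refers to
# exactly one box of medicine, enabling end-to-end traceability.
```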
As far as enterprises are concerned, in the big data era the international, national and industry standards for order and commodity data are the basis on which enterprises worldwide can achieve data interconnection. Without such standards, enterprises cannot enter the big data era.
Explanation 3: the attribution of data is a key feature distinguishing big data from small data.
From the perspective of relational database theory, adding a "data source" item introduces a large amount of redundant data. In the big data era, however, the data to be processed comes from millions of information systems, so it is essential to state clearly where each piece of data comes from; otherwise the data cannot be told apart. In a big data environment the "data source" is critical, indispensable data. The purpose of attaching a "data source" item to every piece of data is to let the data express its complete meaning independently wherever it ends up. Data is like physical things: everything in human society has an owner, and data should have one too.
One key indicator distinguishing big data from small data is whether the data contains its "data source". Data without a "data source" is small data, not qualified structured big data. This is difficult for relational database experts to accept, and it is also a sign of whether a database technician's thinking has shifted to the big data era. Big data faces hundreds of thousands of organizations, millions of information systems, tens of millions of tables and trillions of records; in such an environment, the absence of a "data source" would cause enormous chaos. With a "data source", the number of lines of program code can be greatly reduced, and both data exchange and data sharing depend on it.
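A minimal sketch of the "data source" item, with hypothetical field names and an invented institution name:

```python
record_without_source = {"patient_national_id": "110101199001011234",
                         "diagnosis": "hypertension"}
# Inside one hospital this is fine; pooled nationally, nobody can tell
# which of the 978,000 institutions produced it.

record_with_source = dict(record_without_source,
                          data_source="City Hospital No. 1 outpatient system")
# With the source attached, the record carries its complete meaning
# wherever it travels, and exchange/sharing logic can stay generic.
print(record_with_source["data_source"])
```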
Explanation 4: the standardization of data is the key to realizing universal query.
The structured big data communication protocol was founded on imitating the memory, association and thinking of the human brain. The work began in 1982, with the hope that computers could imitate the brain's associative function (that is, query). The technology the human brain uses to process data can be called "super-high-fidelity data processing". The claim that "the standardization of data is the key to realizing universal query" needs to be understood from that perspective. At present people interpret what "data" is from the perspective of computer technology; in fact it is most appropriate to interpret it from the perspective of human memory, association and thinking.
The human brain is the best "computer" in nature, and what it stores is truly qualified "data": super-high-fidelity data. The data in the brain is analog, almost undistorted, and real; it faithfully reflects all kinds of things in nature and is a microcosm of them. The relationships between pieces of data in the brain are established naturally, based on the natural attributes of things, and can faithfully reflect the subtle relationships among things in nature. This is the foundation of the brain's super capabilities.
The data in a computer is dead; the information in the human brain is alive. The brain can break through time and space, activate the "things" stored in it at any moment, and replay scenes from the past. A computer can also play movies, but it cannot form associations for everything in a movie, while the brain can link one scene to another. Recalling the Imperial Palace and the Great Wall in Beijing, the brain can in the blink of an eye also recall the Huangpu River in Shanghai and Huangguoshu in Guizhou; it can "cross a thousand years in an instant and the nine regions in a blink". There is no inherent relationship between pieces of data in a computer, but any information entering the brain automatically forms associations with related information already there, and those associations are based on the natural attributes of things.
The brain's super-high-fidelity data processing comprises four technologies: 1. super-high-fidelity data acquisition; 2. super-high-fidelity data storage and reproduction; 3. super-high-fidelity relationship-building between data (forming associations); 4. super-high-fidelity use of the relationships between data (that is, processing data through association).
Current technology can imitate the brain's data acquisition and its data storage and reproduction fairly well. However, existing technology cannot fully realize, or even imitate, the brain's relationship-building between data or its use of those relationships, and these two technologies are the foundation of the brain's super capabilities.
Super-high-fidelity data acquisition: the brain collects data through vision, hearing, touch, smell, taste, pain and other senses.
Super-high-fidelity data storage and reproduction: the brain not only stores data with super high fidelity, as if "moving" natural things into the brain, but can also break through time and space to reproduce past things at will (association). The data in the brain is a microcosm of real, concrete things in nature.
Super-high-fidelity relationship-building between data: the brain not only collects and stores data; more importantly, it automatically lets data form associations by similarity, by contiguity and by simultaneity. These associations are established naturally according to the natural attributes of things. The brain stores not just the data but also the natural relationships between pieces of data, which existing technology finds difficult to imitate.
Super-high-fidelity use of the relationships between data (data processing): a computer processes only digital signals, while the brain processes analog signals. The brain processes its super-high-fidelity analog data (that is, it thinks) by means of association by similarity, simultaneity, contiguity and so on. Existing technology cannot fully imitate this; it can be imitated only partially.
The following examples illustrate the brain's super-high-fidelity data processing in more detail. They show natural things, the attributes of things, how the brain associates and reasons according to those attributes, and how associations between data are established according to the natural attributes of things.
1. "people can tell whether you are knocking on iron or wood by listening to the sound." This is because, in the memory of the human brain, the sound of knocking on iron has been naturally associated with the iron, and the sound of knocking on wood has been naturally associated with wood. These messages are received by people in daily life. Therefore, people can associate with the corresponding things through sound. The computer can also store audio and video files, but the computer can not achieve the natural relationship between sound and image, nor can it flexibly recognize sound and image.
2. "I can tell whether the egg is good or not by gently throwing it up a few times in my hand." This is because when a good pine flower egg is thrown in the hand, the palm will feel a slight tremor, while raw and cooked eggs will not tremble, and the bad pine flower egg will not tremble. In my brain's memory, the tremor has naturally established a connection with the pine flower egg.
3. "when you buy an egg, you can judge whether the egg is good or bad by gently shaking it in your hand." A bad egg, or an egg that has been kept for a long time, shakes gently with your hand, and the yolk and white in the egg will move, while the yolk and white in the good egg will not. In my brain's memory, this information about eggs has naturally established a connection with the quality of eggs.
4. "when you see the trees moving outside the window, you know it's windy." The information of the wind blowing trees has been stored in the human brain.
5. "when you see the tree moving outside the window, you know that someone is shaking the tree." Because people shake trees and the wind blows trees is different. When the wind blows the trees, many trees move. When a man shakes a tree, only one tree is moving, and the other trees are not moving. Moreover, the tree movement caused by people shaking the tree is different from the tree movement caused by the wind.
Compared with the human brain, the data in a relational database is almost 100% distorted. A relational database establishes relationships between data artificially. Relational database theory counts this as its most prominent advantage, but it is actually the most fatal defect, because artificially imposed relationships destroy the natural relationships among things. A relational database cannot form associations based on the natural attributes of things the way the brain does. Another celebrated advantage of relational databases, minimal data redundancy, is likewise a fatal flaw: in reducing redundancy, the relational database severely distorts the data, and severely distorted data cannot establish relationships naturally according to the natural attributes of things.
A relational database stores data in separate tables, which severs the relationships among things and among their natural attributes: data about the same kind of thing goes into one table, data about different kinds into different tables. The brain, by contrast, classifies things by their natural attributes; things with the same attributes are the same kind of thing. Plastic pots, plastic cups, plastic bags and plastic buckets have different shapes, and the brain can group them by the natural properties of plastic; for plastic cups, glass cups and steel cups, it can group them by the natural properties of "cup". The data in the brain is effectively all in one table, and the brain can classify it with great flexibility according to the natural attributes of things.
"data" is not just a code name or symbol, the real "data" should be the epitome of specific things in nature. The human brain can naturally associate the sound of knocking iron with iron, and relational databases cannot allow "data" to achieve such a natural connection.
The structured big data communication protocol imitates the brain's super-high-fidelity data processing. It resolutely eradicates the artificial relationships of the relational database and lets data establish natural relationships independently, according to the natural attributes of things. The relationships in a relational database are man-made and destroy the natural relationships among things. If we want computers to approach the brain's powers of thought, we must, like the brain, minimize the distortion of data so that natural relationships can form from the natural attributes of things, and we must eradicate artificial relationships, because they inevitably destroy the natural relationships between data.
The concept of "data" in computers is very narrow. "data" should not only be "numbers" and "codes", but also a true reflection of things in nature, and more importantly, it should also reflect the natural relationship between "data" and "data". The "mobile phone" in the computer is only a number, while the "mobile phone" in the human brain is the true reflection of the real "mobile phone". The brain receives a great deal of various signals about the "mobile phone" through vision, hearing and touch. Qualified "data" should have the least degree of distortion, can fully reflect specific things, but also truly reflect the natural relationship between things. The data in relational database can not truly reflect the natural relationship between data and data. The relationship between data and data must not be established artificially, but should be established naturally by the natural attributes of things themselves. Structured big data communication protocol is to reduce the distortion of data as much as possible through a certain amount of "data redundancy", and to establish a "natural relationship" between "data" and "data" according to the natural attributes of things.
"Information system name, database name, table name and field name" should use standardized, unified and standardized natural language and try not to use code in order to realize "association". The name of the information system, the name of the database, the name of the table and the name of the field are all very important things and have important meanings. Designers of relational database systems are used to using codes, English abbreviations and Hanyu pinyin abbreviations as database names, table names and field names. As a result, ordinary users do not understand the data in the relational database. A relational database ignores this information because it deals with small data. In big data's environment, this information is very important and cannot be defaulted.
In the structured big data communication protocol, to make data independent, complete and identifiable, the information system name, database name and table name are attached to every piece of data. These names are in effect the "classification" of the thing, that is, attributes of the thing. The approach is incomprehensible to relational database experts because it adds a great deal of data redundancy. Between "low data redundancy" and "data independence, integrity, identifiability and zero coupling with the system", the protocol chooses the latter, so that ordinary people with no technical background can understand the true meaning of the data. A sketch of what such a row might look like follows.
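This is a hedged sketch of one row in a "universal data structure table"; the exact column set is an assumption pieced together from the text, with invented names and values.

```python
universal_row = {
    "big_data_id": "CN-MED-110101199001011234",            # uniqueness
    "information_system": "XYZ Hospital Information System",  # attribution
    "database": "outpatient_db",                           # attribution
    "table": "diagnosis_records",                          # attribution
    "thing_attribute": "diagnosis",        # standardized attribute name
    "attribute_value": "hypertension",
    "recorded_at": "2020-06-01",                           # timeliness
}
# Redundant by relational standards, but self-describing: the row keeps
# its full meaning even outside the system that produced it.
```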
A relational database has very little data redundancy, but the price is that ordinary people cannot understand its data, and the data is meaningful only inside its own database: once separated from it, the data loses its meaning. Relational data must be translated by a large body of application software before ordinary users can understand it.
If the data in a database is standardized, it can automatically establish natural "association" relationships (implemented through indexes) via the "thing attribute" and "thing attribute value" columns of the "universal data structure table". Because every information system built on the structured big data communication protocol stores its data in one or more "universal data structure tables" with exactly the same structure, it is easy to write a general-purpose "universal query" tool. For example, if the medical information systems across the country were built on the protocol, a patient's medical history could easily be retrieved from the national medical big data center through the patient's national ID number, because every record in the history would contain that number (the big data identification code) and all related data could be "associated" through it. Today's medical data does not necessarily contain the patient's national ID number, so querying a patient's history across the information systems of hospitals nationwide is very difficult.
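A minimal sketch of such a "universal query", assuming every system exposes rows of the identical structure shown earlier; table contents and names are illustrative.

```python
def universal_query(tables, **criteria):
    """Scan any number of universal-structure tables with one code path."""
    for table in tables:                 # tables from different systems
        for row in table:                # identical row structure everywhere
            if all(row.get(k) == v for k, v in criteria.items()):
                yield row

hospital_a = [{"national_id": "110101199001011234",
               "thing_attribute": "diagnosis",
               "attribute_value": "hypertension",
               "information_system": "Hospital A"}]
hospital_b = [{"national_id": "110101199001011234",
               "thing_attribute": "prescription",
               "attribute_value": "amlodipine",
               "information_system": "Hospital B"}]

# One call gathers the patient's history across institutions:
for row in universal_query([hospital_a, hospital_b],
                           national_id="110101199001011234"):
    print(row["information_system"], row["attribute_value"])
```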
The fundamental purpose of the structured big data communication protocol's heavy use of data redundancy to give data the 12 technical characteristics is to turn it into "high-fidelity data": the redundancy compensates for distortion. Only high-fidelity data allows an information system to achieve "super-high-fidelity data processing" like the human brain.
Explanation 5: efficient mining and universal query can be realized without ETL transformation.
Mining the nation's current medical data will be very difficult because the data in today's information systems is non-standard and non-uniform. The medical industry, for example, has millions of tables and hundreds of billions of records, each table with a different structure, and mining or querying so many differently structured tables requires writing a great many programs. If the information systems of medical institutions nationwide were instead designed according to the structured big data communication protocol, mining and querying their data would be easy, because they would all use the "universal data structure table", in which the data is standard, standardized and unified.
Table 5: comparison of data mining and query under the two approaches

1. Number and structure of tables
Current national medical information systems built on relational databases: millions of tables, each with a different structure.
Systems built with the structured big data communication protocol: millions of tables, every one with exactly the same structure, all using the "universal data structure table".

2. Amount of data
Current systems: hundreds of billions of records.
Protocol-based systems: hundreds of billions of records.

3. ETL and data mining
Current systems: because the data of the various medical institutions is non-standard and non-uniform, ETL is extremely difficult and data mining very costly. Differences in the recording of gender, symptom names, disease names and drug names make mining, statistics and analysis very hard.
Protocol-based systems: standard, standardized, nationally unified data is used from the system design stage through data collection and data generation, so no ETL is needed, and mining, statistics and analysis are very easy.

4. Querying a patient's medical history (example)
Current systems: querying millions of differently structured tables nationwide requires a great many programs at very high cost. Medical institutions record patient data keyed by hospital admission number and outpatient number, but each hospital codes these differently and with no common rules, so checking a patient's history nationwide is difficult: one must first query the information systems of the 978,000 medical institutions by the patient's name and ID number to find whether records exist, look up the corresponding admission and outpatient numbers, and then query the history from the various tables using those numbers. (Note: because there was no concept of "uniqueness of data" and no big data identification code, the same patient's data takes different forms and different identifiers in different institutions and cannot stay unique.)
Protocol-based systems: millions of tables of data, all with exactly the same structure, so with some technical processing a general query tool lets users query as if everything were in a single table. Because every record about a patient contains the patient's national ID number, all of it can be retrieved through that number, and the difficulty and workload of querying drop dramatically. (Note: this again reflects the power of the "uniqueness of data" and the big data identification code.)

5. Universal query
Current systems: querying data from millions of differently structured tables makes universal query impossible.
Protocol-based systems: after technical processing it is as if there were only one table; writing one general software tool achieves universal query.
"big data's most critical technology is query technology": big data is characterized by its large size, so it is particularly difficult to obtain the required data. Therefore, querying the required data from big data is the most critical. Then there is the analysis and statistics of the queried data. Therefore, it can be said that "big data is the query". Big data's preliminary work is to prepare for the query, big data's later work is to statistics and analyze the data obtained from the query, and big data's various work is carried out with the query as the center.
Explanation 6: the 12 technical characteristics of structured big data provide a technical guarantee of big data's authenticity.
Big data is a resource as important as oil. Authenticity is its foundation; big data that has lost its authenticity is data garbage. In the big data era, ensuring the authenticity of big data is therefore a very important task.
In the small data era, information systems mainly processed data internal to each organization, and each organization controlled the authenticity of its own data. In the big data era, data circulates not only within organizations but among organizations at home and abroad, so big data's authenticity, notarized status and authority must be guaranteed: big data must be given the same legal force as official documents. The structured big data communication protocol guarantees authenticity from the technical side. The uniqueness of data is the key to controlling the authenticity of data; uniqueness is embodied in the big data identification code, so controlling the identification code controls the authenticity of big data. The big data identification code is the "ID card" of a thing's data: no matter what environment the data is in, its identification code is unique.

Big data is not only data, codes and symbols; it is also a resource, like a commodity, like goods, like property, and it should be managed the way resources, commodities, goods and property are managed. Just as the flow of goods and people needs traffic police, so does the flow of data. The state manages commodities through agencies such as the Bureau of Industry and Commerce and the Customs, and the authenticity of big data likewise needs to be managed by comparable methods. It would be appropriate for a national big data center run by such agencies (or by the courts, the Ministry of Public Security, industry and information authorities, and so on) to manage and control the authenticity of big data.
The big data identification codes for commodities and orders are issued and put on record by each country's national big data center. The national big data center is responsible for examining the qualifications of organizations; only organizations that pass its examination are eligible to obtain identification codes for commodities, orders and the like. The center is responsible only for issuing identification codes, not for verifying the authenticity of the commodity or order data itself. When a dispute arises over authenticity, the center's "data police" examine the authenticity of the data, impose penalties according to the findings, and record the results, much as drivers are responsible for their own conduct and traffic police appear only when there is an accident.
Orders and official documents that have obtained a big data identification code should be filed with the national big data center or a third-party notary organization. Orders and documents filed with a third-party notary organization have the same legal effect as if they had been stamped with an official seal. This saves a great deal of paper and also saves the delivery time of orders and documents.
After obtaining the big data identification code for a commodity, an enterprise uploads the commodity's data to the national big data center for the record. The enterprise's customers can then retrieve that data from the center using the commodity's code.
With globally unified coding, enterprise information systems can send and receive orders directly and interpret their contents. The data in an order is stored in a "universal data structure table", giving it the 12 technical characteristics of structured big data. The "thing attributes" in an order (analogous to field names) must be globally uniform; since they are expressed differently in different languages, global standards are also needed so that the thing attributes in every language map to the international standard. A general software tool for interpreting and translating data can then be designed, and such tools can automatically translate orders between languages, as sketched below.
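A hedged sketch of the order-translation idea: if each language's attribute names map to one international standard code, a generic tool can translate orders mechanically. The standard codes and mappings below are invented for illustration.

```python
STANDARD_ATTRS = {          # language-specific names -> standard code
    "quantity": "ATTR-0001", "数量": "ATTR-0001",
    "unit_price": "ATTR-0002", "单价": "ATTR-0002",
}
DISPLAY = {"ATTR-0001": "quantity", "ATTR-0002": "unit_price"}

def translate_order(order: dict) -> dict:
    """Normalize any language's attribute names via the standard codes."""
    return {DISPLAY[STANDARD_ATTRS[k]]: v for k, v in order.items()}

chinese_order = {"数量": 100, "单价": 9.5}
print(translate_order(chinese_order))  # {'quantity': 100, 'unit_price': 9.5}
```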
The current problem: the information systems of enterprises worldwide cannot interconnect, because the data coding each system adopts is neither uniform nor standard. Enterprise systems cannot directly send and receive order data, so order data must be entered into each system manually.
The advantage of the big data identification code: it realizes global data interconnection, using timely, accurate and comprehensive data flow to keep the flow of commodities smooth. With identification codes, an enterprise can track the sales and inventory of its goods worldwide across hundreds of thousands or millions of information systems. Interconnecting enterprise information systems benefits every enterprise up and down the supply chain and safeguards the production and circulation of goods.
Certification by the national big data center of the qualification of organizations and individuals to use big data identification codes: any organization or individual can obtain the qualification, but must first pass the national center's examination. Those who pass are issued a legally valid "big data electronic seal" and, once certified, may use the various functions tied to the identification code and publish the related information. The notarized status and authority of the national big data center is what guarantees the authenticity of big data; only with that authenticity can big data be widely used in every field.
The big data identification code has wide application in product anti-counterfeiting and drug supervision. An enterprise can apply for an identification code and a verification code for each item. After buying the item, the user submits its identification code from a mobile phone and receives the verification code back: if it matches the code on the item, the item is genuine; otherwise it is fake. Alternatively, scanning the item's QR code reveals directly whether it is counterfeit.
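A minimal sketch of that check, assuming a national registry keyed by identification code; the registry contents and codes are hypothetical.

```python
NATIONAL_REGISTRY = {  # big data identification code -> verification code
    "CN-200056-AMOX500-0000000042": "7K3QZ",
}

def verify(big_data_code: str, printed_code: str) -> bool:
    """True if the code printed on the item matches the registry."""
    return NATIONAL_REGISTRY.get(big_data_code) == printed_code

print(verify("CN-200056-AMOX500-0000000042", "7K3QZ"))  # True  -> genuine
print(verify("CN-200056-AMOX500-0000000042", "XXXXX"))  # False -> fake
```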
With the big data identification code, all kinds of certificates can be managed easily, and verification is very convenient: a certificate's information can be checked at the national big data center using its identification code. It could be used to manage, for example, an enterprise's qualifications and certifications, an individual's certificates, notarization certificates, ID cards, commodity inspection certificates, marriage certificates, graduation certificates and driver's licenses (no need to show the physical license; stating the number or showing the QR code suffices). One might not even need to issue separate certificates at all, just a single big data certificate.
With big data identification code, you can easily manage "contracts, documents, contracts, IOUs, statements, various commitments, bills, orders, bidding documents, bid documents" and so on. Big data Center can also become a huge file management system. The International big data Center is the highest management organization of big data in the world, composed of various countries, is responsible for the formulation of global big data standards and norms, and sets rules for big data around the world.
Explanation 7: the data generated by information systems built on the structured big data communication protocol is cumulative.
The initial idea behind the structured big data communication protocol: big data means a large amount of data, and the various industries already hold a great deal of small data, so can those small datasets simply be added together and called big data? They can be called big data, but not qualified big data, because mining such data is very difficult. How, then, can small data be accumulated into qualified big data, and why can't today's data be? Because the data produced by a relational database is not real data at all; it can only be called code. To really understand big data, we must first be clear about what "data" is and what a "code" is.
The definition of data: information that the relevant professionals can understand directly is real data. For example, medical data should be directly understandable by medical professionals, with no further notes or explanations; chemical data should be directly understandable by chemists, likewise without annotation.
The definition of a code: information that the relevant professionals cannot understand directly is a code; they must use applications and software tools to translate, interpret and annotate it before its true meaning emerges.
With a relational database, what ordinary users see is data that the information system has already interpreted, translated and annotated, not the raw data in the database. The raw relational data lacks identifiability, independence and integrity: presented directly to ordinary users, it cannot be "identified", because it cannot express its meaning independently and completely.
The definition of qualified data: only data that can express its proper meaning independently (the independence of data, without depending on software for interpretation) and completely (the integrity of data), and that people and other information systems can identify (the identifiability of data), is qualified data. The data in a relational database has none of these characteristics, because it is tightly coupled to the system: it is bound to the relational database system and the application system, and once separated from them it becomes unidentifiable, meaningless data.
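A hedged illustration of this "code versus data" distinction, with invented values: a bare relational row is a "code", meaningless once separated from its schema and application, while a self-describing record remains "data".

```python
bare_row = (17, 3, 250)   # unidentifiable without the table definition

self_describing = {
    "information_system": "Pharmacy POS (hypothetical)",
    "database": "sales_db",
    "table": "order_lines",
    "thing": "amoxicillin 500mg, box",
    "thing_attribute": "quantity_sold",
    "attribute_value": 3,
}
# Any professional (or foreign system) can read this record directly,
# with no translating application: it is independent and complete.
```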
In terms of the 12 technical characteristics of structured big data, the data in a relational database can be described this way: because its "data" is inseparable from the relational database system and application system (it lacks zero coupling with the system), the data cannot express itself independently (it lacks independence) or completely (it lacks integrity), and it cannot be identified by other information systems.
From the above analysis we can conclude: because relational data is highly coupled to its system, it becomes unidentifiable and meaningless once separated from the relational database system and application system, and is therefore not cumulative. Since current information systems are almost all built on relational databases, the data they generate cannot become qualified big data by accumulation.
The reason information systems built on relational databases are hard to interconnect is that their data is not "portable": it cannot be moved directly from one system to another. This is the "Variety" problem among big data's 4V characteristics. If all information systems stored data in the "universal data structure table", the Variety problem would be easily solved. At present only the universal data structure table can give data "structural unity" and "portability" and decouple it from the information system.
The structured big data communication protocol was created to solve the problems of the relational database, with the goal of transforming relational data into qualified big data. The solution is to use the "universal data structure table" to decouple the data, give it structural unity, and make it identifiable by giving it independence, integrity, standardization, uniqueness and attribution.
Existing technology can give data identifiability, independence, integrity, zero coupling with the system and structural unity, but existing technology alone cannot make data truly cumulative and portable. The structured big data communication protocol achieves real accumulation and portability by adding uniqueness, attribution and standardization, which also effectively addresses the velocity problem in big data's 4V. The method of giving data uniqueness, attribution and standardization is the protocol's core technology, created specifically for transforming small data into big data; it looks untechnical, but it is critical.
The importance of data standardization to big data: in the small data era, information systems were basically used within one organization. In the big data era, interconnecting information systems and mining data across them has become a very prominent problem, so making data standard is essential. Without international, national and industry big data standards, the big data era cannot arrive. The reason we stress data standards so strongly is that the structured big data communication protocol derives from the brain's association and super-high-fidelity data processing: only when all data is standardized can associations form automatically and naturally according to the natural attributes of things, at which point the velocity problem of big data's 4V is easily solved.

Countless practitioners have tried all kinds of ways to solve data mining at its root; one fundamental obstacle is that the data in current information systems is non-standard. If the data in every information system were standardized and unified, data mining would be very easy. Data standardization is an utterly familiar concept, but the power behind it is enormous: only when it is pushed to the extreme, so that all data is standard, standardized and unified, does its full strength show. A data standard is easy to talk about but very hard to build, demanding great manpower and material resources, and this has become a key factor constraining big data.
On the surface, the "uniqueness of data" and the "attribution of data" carry no technical content at all: they merely add two data items, two attributes, to the data. From the small data perspective this is true, because small data era systems mainly process data within one organization, where uniqueness is not a technology and attribution only adds redundancy. In the big data era, however, these two items are of epoch-making significance: they are the key that lets small data become big data. Data lacking them is not qualified structured big data; only with these two labels is small data qualified to enter the big data era.
The importance of attribution to big data: the scope of small data is a single organization and a single information system, while big data's scope is global, facing millions of information systems. The purpose of adding attribution to data is to ensure that the data stays unchanged and undistorted wherever it is placed. Without attribution, data is distorted when transplanted into other systems, and there is no way to know where in the big data environment it was found. Attribution is fundamental to the identifiability, accumulation and portability of data.
The importance of uniqueness to big data: uniqueness not only lets data be grasped quickly and accurately in a big data environment, it also lets computers imitate the brain's associative function. The big data environment is vast, national or global, and uniqueness ensures that computers can quickly and accurately seize data from the ends of the earth. Without it, capturing data on a global scale is very difficult. For example, an enterprise's commodity A will appear in hundreds of thousands of retail stores worldwide; without a big data identification code, the enterprise can hardly capture commodity A's inventory and sales data from those hundreds of thousands of information systems. Uniqueness leaves data nowhere to hide and nowhere to flee; without it, the same data takes on different guises in different information systems, like the shape-shifting Baigujing. Adding uniqueness to data is tantamount to fitting it with a tracker.
The relationships among the 12 technical characteristics of data: accumulation and portability are realized through (1) identifiability, (2) independence, (3) integrity, (4) standardization, (5) zero coupling with the system, (6) structural unity, (7) uniqueness and (8) attribution. Zero coupling between data and system is realized through identifiability, independence, integrity, standardization and structural unity. The identifiability of data is realized through independence, integrity, standardization, uniqueness and attribution.
Why can the data generated by systems designed under the structured big data communication protocol accumulate into qualified big data? Because all the data shares one structure and is standard, it can be mined without ETL. Accumulation is guaranteed by the data's uniqueness, attribution, identifiability, independence, integrity, standardization, zero coupling with the system and structural unity; data possessing all of these is cumulative.
Explanation 8: the portability of data provides convenience for the interconnection of information systems.
Current information systems are hard to interconnect because their data is highly coupled to the system: separated from the relational database system and application system, the data becomes meaningless. By optimizing the data, the structured big data communication protocol gives it (1) identifiability, (2) independence, (3) integrity, (4) standardization, (5) zero coupling with the system, (6) structural unity, (7) uniqueness, (8) attribution, (9) timeliness and (10) authenticity, and data with these technical attributes also gains portability. Portable data means the same thing in any information system and never changes, so it can be sent directly to any system, achieving interconnection.
Explanation 9: the structured big data communication protocol can provide a communication protocol for data interconnection between database systems.
The communication protocol for data interconnection between database systems:
1. A universal data structure table must be established in each database, and the structure of the universal data structure table must be completely identical across all database systems.
2. The structured data to be sent must possess the 12 technical characteristics: 1, uniqueness; 2, attribution; 3, identifiability; 4, independence; 5, integrity; 6, standardization; 7, zero coupling with the system; 8, structural unity; 9, accumulation; 10, portability; 11, timeliness; 12, authenticity.
As long as these two conditions are met, data can be interconnected between any databases: because both the sender and the receiver store data in a universal data structure table, the receiver can write incoming data directly into the universal data structure table in its own database, as sketched below.
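A minimal sketch of the exchange step under the two conditions above: both sides keep a structurally identical universal data structure table, so received rows are written in unchanged. The transport (JSON here) and field names are illustrative assumptions.

```python
import json

sender_table = [{
    "big_data_id": "CN-200056-AMOX500-0000000042",
    "information_system": "Factory ERP", "database": "orders_db",
    "table": "order_lines", "thing_attribute": "quantity",
    "attribute_value": 100,
}]

# The sender serializes its universal-structure rows...
payload = json.dumps(sender_table)

# ...and the receiver appends them directly to its own universal table,
# with no per-partner mapping or ETL, because the structure is identical.
receiver_table: list = []
receiver_table.extend(json.loads(payload))
print(receiver_table[0]["attribute_value"])  # 100
```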