What is data lineage in big data?

2025-01-16 Update From: SLTechnology News&Howtos

Shulou (Shulou.com) 06/03

Today I was chatting with a colleague who works as a tester:

Me: What project have you been working on lately?

He: Testing big data blood relationships.

Me: What?

He: Blood relationships.

Me: Whose blood relationships?

He: Big data's blood relationships.

Me: And what is a blood relationship here?

He: It's the data's blood relationship.

Me: ...

See, that is how a conversation gets killed. I could not help thinking to myself (no wonder you are bald and don't have a girlfriend).

I hurried back to my desk and asked Google. After reading through the various answers, I can sum them up in two sentences:

We usually process raw data in many steps before finally producing new data, and along the way we produce many tables. The chain of relationships between these tables is called data lineage (literally the data's "blood relationship", sometimes translated as "data consanguinity").

Data lineage testing checks the data quality at each link as the data flows along that chain.

The term also has several synonyms:

Data lineage = data provenance = data pedigree

In the real world, each of us descends from our ancestors, generation after generation, and that is what forms the blood relationships among people.

In the information age, a huge amount of data is produced every moment; this is what we usually call big data. Processing, combining, and transforming that data produces new data, so the old and new data are naturally related. We call these relationships data lineage.

Put plainly, data lineage is the chain through which a piece of data was generated: where the data came from and which processes and stages it has gone through.

Here is a more down-to-earth example:

Take Taobao. After customers buy items on Taobao's web pages, the data is stored in a back-end database table, table A. When we want to see which items were the most popular in a given month, we have to process and summarize the raw data in the database, producing an intermediate table B that stores the result of that stage. If the logic is more complex, we keep processing and keep producing intermediate tables, until we arrive at the final table used by the front-end presentation; call it table C.

Table A is then the original source of the data in table C, its ancestor. The chain from table A to table B and on to table C is the data lineage of table C.
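To make the idea concrete, here is a minimal sketch of how the A, B, C chain could be recorded and queried. The `UPSTREAM` mapping and the `lineage` function are names invented for this illustration; they are not taken from any particular lineage tool.

```python
# Direct upstream tables for each table, following the Taobao example:
# A (raw orders) -> B (intermediate summary) -> C (final presentation table).
UPSTREAM = {
    "A": [],
    "B": ["A"],
    "C": ["B"],
}


def lineage(table, upstream=UPSTREAM):
    """Return every ancestor of `table`, nearest ancestors first."""
    seen = []
    queue = list(upstream.get(table, []))
    while queue:
        parent = queue.pop(0)
        if parent not in seen:
            seen.append(parent)
            queue.extend(upstream.get(parent, []))
    return seen


print(lineage("C"))  # ['B', 'A']
```

Storing only the direct parents of each table keeps the record small; the full ancestry is recovered by walking the mapping.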

During data processing, from the data source to the final output, any link can introduce data quality problems. For example, if the source data itself is of low quality and no later step detects and handles the problem, the bad data is passed on to the target table, whose quality is then low as well. It is equally possible that some processing step handles the data inappropriately, degrading data quality in every step that follows.

Therefore, every link in the data lineage should detect and handle data quality problems, so that downstream data inherits good genes, that is, high data quality.
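Checking quality at every link could look roughly like the sketch below. The rules in `check_rows` (an `item_id` must be present, a `quantity` must not be negative) and the sample rows are assumptions made up for the illustration; real rules depend on the business data.

```python
def check_rows(rows):
    """Return the problems found in one stage's rows (hypothetical rules)."""
    problems = []
    for i, row in enumerate(rows):
        if row.get("item_id") is None:
            problems.append(f"row {i}: missing item_id")
        quantity = row.get("quantity")
        if quantity is not None and quantity < 0:
            problems.append(f"row {i}: negative quantity")
    return problems


def first_bad_stage(stages):
    """Check stages in processing order; return the first failing one, else None."""
    for name, rows in stages:
        problems = check_rows(rows)
        if problems:
            return name, problems
    return None


# Made-up sample data: the intermediate table B has lost an item_id.
stages = [
    ("A", [{"item_id": 1, "quantity": 2}, {"item_id": 2, "quantity": 1}]),
    ("B", [{"item_id": 1, "quantity": 2}, {"item_id": None, "quantity": 1}]),
    ("C", [{"item_id": 1, "quantity": 2}]),
]
print(first_bad_stage(stages))  # ('B', ['row 1: missing item_id'])
```

Running the same checks at each stage, in processing order, points at the earliest link where quality breaks down rather than only at the final table.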

A typical data lineage analysis:

Now suppose you are a data engineer who needs to produce a final table, Table X, to meet a business requirement.

Whether for clearer program logic or for better performance, you use MapReduce (MR), Spark, or Hive to generate many intermediate tables along the way.

The figure below shows the entire data flow that you spent time implementing, where:

Table X is the table that is finally delivered to the business side.

The blue Tables A-E are the raw data.

Tables F-I are the intermediate tables that you computed yourself, with programs you wrote.

Table J is a result table produced by someone else. Following the principle of not duplicating development work, you will often reuse tables that colleagues have already built.

After a while, the business side feels that one field in the data you provided always looks wrong. In effect, they suspect your data is at fault, and they need you to trace the field back to its source.

First you find the abnormal field in Table X, then trace it to Table I, from Table I to Table G, and from Table G back to Table D, where you finally discover that the source data was abnormal for a few days. Or you trace the abnormal field from Table X to Table J, the table maintained by a colleague, and then work backwards until you find the table that went wrong at some point in the process.

This is the process of data lineage analysis.

By now you probably have a good idea of what data lineage is.

Data lineage is not actually difficult; the concept just sounds lofty. Testing it is much like ordinary SQL work, except that the syntax is that of components such as Hive, Sqoop, and Pig rather than standard SQL.
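The backward trace just described can be sketched in a few lines. The original figure is not available, so the edges in `UPSTREAM` are an assumption that covers only the tables named in the walkthrough; `trace_anomaly` and the `abnormal` set are likewise hypothetical. For simplicity the sketch follows only the first abnormal parent at each step and assumes the lineage has no cycles.

```python
# Hypothetical edges reconstructed from the walkthrough, not from the figure.
UPSTREAM = {
    "X": ["I", "J"],
    "I": ["G"],
    "G": ["D"],
    "J": [],
    "D": [],
}


def trace_anomaly(table, is_abnormal, upstream=UPSTREAM):
    """Follow abnormal upstream tables from `table` back to the earliest one."""
    path = [table]
    while True:
        bad_parents = [p for p in upstream.get(path[-1], []) if is_abnormal(p)]
        if not bad_parents:
            return path
        path.append(bad_parents[0])


# Pretend that a manual check found the bad field in these tables.
abnormal = {"X", "I", "G", "D"}
print(trace_anomaly("X", lambda t: t in abnormal))  # ['X', 'I', 'G', 'D']
```

In practice `is_abnormal` would run a query against each table; here it is a simple set lookup so that the example stays self-contained.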
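As an illustration of how SQL-like such a test can be, the sketch below reconciles a source table with a table derived from it. SQLite stands in for Hive here only so that the example runs anywhere; the table names, columns, and sample rows are invented, and Hive's syntax differs in places.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_a (item_id INTEGER, quantity INTEGER);
    INSERT INTO table_a VALUES (1, 2), (1, 3), (2, 1);
    -- table_b is derived from table_a, like an intermediate table.
    CREATE TABLE table_b AS
        SELECT item_id, SUM(quantity) AS total_quantity
        FROM table_a
        GROUP BY item_id;
""")


def reconcile(conn):
    """Return the grand totals of the source table and the derived table."""
    source_total = conn.execute(
        "SELECT SUM(quantity) FROM table_a"
    ).fetchone()[0]
    derived_total = conn.execute(
        "SELECT SUM(total_quantity) FROM table_b"
    ).fetchone()[0]
    return source_total, derived_total


source_total, derived_total = reconcile(conn)
print(source_total == derived_total)  # True when nothing was lost in between
```

A mismatch between the two totals would suggest that rows were dropped or duplicated somewhere between the two tables.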
