

The Ten-Step Principle for Solving Data Quality Problems


I. Related concepts

1.1 Data quality

The extent to which a set of inherent attributes of data meets the requirements of data consumers.

1) Inherent attributes of the data

Authenticity: the data truly reflects the objective world; timeliness: the data is updated as that world changes; relevance: the data is what data consumers care about and need.

2) Requirements that high-quality data meets (from the consumer's point of view)

Available: the data is there when the consumer needs it; timely: the data is obtained and updated promptly when needed; complete: the data is complete, with nothing omitted; secure: the data is protected from unauthorized access and manipulation; understandable: the data is understandable and interpretable; correct: the data is a true reflection of the real world.

1.2 Data quality management

Data quality management refers to identifying, measuring, monitoring, and giving early warning of the quality problems that may arise at each stage of the data life cycle: planning, acquisition, storage, sharing, maintenance, application, and retirement. It further improves data quality by raising the organization's level of management.

II. Evaluation dimensions

Any improvement starts from an assessment: only by knowing where the problems lie can improvement be carried out. Data quality assessment and management are usually measured along the following common dimensions:

1) Completeness

Completeness refers to whether the data is complete and whether anything is missing. Data can be missing at two levels: entire records may be absent, or individual fields within a record may be empty. Record-level completeness is generally audited by counting records and unique values; field-level completeness is audited by counting NULLs. The proportion of null values in a field is usually fairly stable, so a clear rise in the null ratio is a strong sign that the field's records have gone wrong and information is missing. In short, completeness can be measured with record counts, averages, unique-value counts, null ratios, and similar statistics.
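As a minimal sketch (Python with pandas; the `email` column, the 10% baseline, and the 5% tolerance are all hypothetical), a null-ratio audit of a field might look like this:

```python
import pandas as pd

def null_ratio(df: pd.DataFrame, column: str) -> float:
    """Fraction of rows where `column` is missing."""
    return df[column].isna().mean()

def completeness_alert(df: pd.DataFrame, column: str,
                       baseline: float, tolerance: float = 0.05):
    """Flag the field when its null ratio clearly exceeds the historical baseline."""
    ratio = null_ratio(df, column)
    return ratio, ratio > baseline + tolerance

# Toy data: two of four emails are missing.
df = pd.DataFrame({"user_id": [1, 2, 3, 4],
                   "email": ["a@x.com", None, None, "d@x.com"]})
ratio, alarm = completeness_alert(df, "email", baseline=0.10)
print(f"null ratio = {ratio:.2f}, alarm = {alarm}")  # null ratio = 0.50, alarm = True
```

In practice the baseline would come from historical runs of the same check, matching the observation above that null ratios are normally stable.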

2) Standardization

Standardization refers to whether records conform to specifications and are stored in the prescribed format (for example, standard coding rules). Auditing normativity is an important and complex part of a data quality audit. The test is essentially whether the data matches its own definition, so it can be measured as the ratio of compliant records: for instance, the proportion of values that fall outside an enumerated value range, or the proportion of records whose attribute values do not conform to their coding rules.
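For illustration, a compliance-ratio check against an assumed coding rule (the "AB-1234" pattern here is hypothetical, not from the original text) could be sketched as:

```python
import pandas as pd

# Assumed coding rule: two capital letters, a dash, four digits (e.g. "AB-1234").
CODE_RULE = r"[A-Z]{2}-\d{4}"

def compliance_ratio(values: pd.Series, pattern: str) -> float:
    """Fraction of non-null values that fully match the coding rule."""
    non_null = values.dropna().astype(str)
    if non_null.empty:
        return 1.0
    return non_null.str.fullmatch(pattern).mean()

codes = pd.Series(["AB-1234", "XY-0001", "bad_code", None])
print(f"compliance ratio = {compliance_ratio(codes, CODE_RULE):.2f}")  # 0.67
```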

3) Consistency

Consistency refers to whether the data is logically coherent: one or more data items within a dataset stand in a defined logical relationship. A consistency check verifies attributes that are logically related, for example that when attribute A takes a certain value, attribute B falls within a specific range. The metric is again a compliance rate.
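A sketch of such a cross-field rule, under the hypothetical assumption that a shipped order must carry a ship date:

```python
import pandas as pd

def consistency_rate(df: pd.DataFrame) -> float:
    """Rule: when order_status is "shipped", ship_date must be present."""
    applicable = df[df["order_status"] == "shipped"]
    if applicable.empty:
        return 1.0  # rule is vacuously satisfied
    return applicable["ship_date"].notna().mean()

df = pd.DataFrame({
    "order_status": ["shipped", "shipped", "pending"],
    "ship_date":    ["2024-01-05", None, None],
})
print(f"consistency rate = {consistency_rate(df):.2f}")  # 0.50
```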

4) Accuracy

Accuracy measures which data and information are incorrect or out of date. The difference from standardization is that normativity is concerned with conformity and uniformity, while accuracy is concerned with errors: if a value falls outside its defined range and the range itself is correct, the value is simply wrong.

Accuracy problems can affect individual records or an entire dataset. When a field is wrong across the whole dataset, the error is relatively easy to find, for example with averages and medians. When a dataset contains individual outliers, auditing the maximum and minimum values, or drawing a box plot, makes the anomalies obvious at a glance.

Other accuracy problems include garbled characters and truncated values, which can be found by examining value distributions: since data records generally follow a roughly normal distribution, items with an abnormally small share are likely to be problematic. The hardest case is values that are wrong but close to normal, showing no obvious anomaly; these can only be found by comparison with other sources or statistics.
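As one concrete illustration of the box-plot approach, an IQR-based outlier audit (a generic sketch, not a prescribed implementation) might look like:

```python
import pandas as pd

def iqr_outliers(values: pd.Series) -> pd.Series:
    """Return the values outside the box-plot fences (1.5 * IQR past the quartiles)."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return values[(values < lower) | (values > upper)]

ages = pd.Series([23, 25, 27, 24, 26, 250])  # 250 is presumably a data-entry error
print(iqr_outliers(ages))  # flags only the value 250
```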

5) Timeliness

Timeliness is the interval between when data is generated and when it becomes viewable, also known as data latency. Some real-time analysis and decision-making needs hour-level or even minute-level data, which places extremely high demands on timeliness, so timeliness is also an element of data quality. A typical rule, for example, defines the latest date by which a table must arrive each month.
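A minimal latency check, assuming a hypothetical six-hour SLA between generation and availability:

```python
from datetime import datetime, timedelta

def is_on_time(event_time: datetime, load_time: datetime,
               max_delay: timedelta = timedelta(hours=6)) -> bool:
    """True if the data became viewable within `max_delay` of being generated."""
    return load_time - event_time <= max_delay

generated = datetime(2024, 3, 1, 0, 0)   # when the partition's data was produced
loaded = datetime(2024, 3, 1, 7, 30)     # when the table actually landed
print(is_on_time(generated, loaded))     # False: 7.5 h exceeds the assumed 6 h SLA
```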

6) Uniqueness

Uniqueness measures which data, or which attributes of the data, are duplicated: it detects accidental repetition of specific fields, records, or datasets within or across systems.
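A sketch of both measurements on a hypothetical key field:

```python
import pandas as pd

def uniqueness_ratio(df: pd.DataFrame, key: str) -> float:
    """Share of rows carrying a distinct value of the key field."""
    return df[key].nunique() / len(df)

df = pd.DataFrame({"customer_id": [101, 102, 102, 103]})
print(f"uniqueness = {uniqueness_ratio(df, 'customer_id'):.2f}")  # 0.75

# List the accidental duplicates for follow-up.
print(df[df.duplicated(subset="customer_id", keep=False)])  # both rows with id 102
```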

7) Rationality

Rationality judges whether the data is correct from the standpoint of business logic. The same practices used for normativity and consistency can be applied here.

8) Redundancy

Redundancy refers to whether unnecessary duplicate data exists across the various layers of data.

9) Accessibility

Accessibility refers to whether the data is easy to obtain, easy to understand and easy to use.

III. Influencing factors

The factors that affect data quality mainly come from four aspects: information factors, technical factors, process factors and management factors.

1) Information factors

Typical causes here include incorrect description or understanding of metadata, properties of data measures that cannot be guaranteed (such as inconsistent data source specifications), and inappropriate change frequency.

2) Technical factors

These are data quality problems caused by faults in the technical links of data processing; they can arise during data creation, acquisition, transmission, loading, use, and maintenance.

3) Process factors

These are data quality problems caused by poorly designed system and manual operating procedures, arising mainly in the processes of data creation, transmission, loading, use, maintenance, and auditing.

4) Management factors

These are data quality problems caused by personnel quality and management mechanisms, for example gaps or defects in staff training, personnel management, or reward and penalty measures.

IV. Methods for solving quality problems

You can follow the ten-step principle below (this part is excerpted from the public materials of Yushifang).

Figure 1 (not reproduced here)

4.1 Define business requirements and approach

Find out which business activities are affected by data quality problems, or where improved data quality would bring better business benefits; evaluate these business requirements and rank them by importance, making them the goal and scope of this round of data quality improvement. Only by defining the business requirements and approach can we ensure that the data quality problems being solved are tied to business needs and genuinely solve business problems.

4.2 Analyze the information environment

Refine the defined business requirements; identify how they relate to data, data specifications, processes, organizations, and technologies (systems, software, etc.); define the information life cycle; and determine the data sources and scope. Analyzing the information environment not only supports the later root-cause analysis but also gives a more comprehensive and intuitive picture of the data problems and the current situation.

4.3 Assess data quality

Extract data from the relevant sources, design evaluation dimensions around the defined business requirements, and use suitable tools to complete the assessment. Present the results accurately as charts or reports so that leaders and business staff can understand the actual state of data quality clearly and intuitively; this keeps data problems tied to business needs and wins the attention and support of those stakeholders.

4.4 Assess business impact

Understand how low-quality data affects the business, why that matters, and what business value fixing these issues would bring. More complex evaluation methods take longer but do not necessarily yield proportionally better results, so choose the method carefully. In addition, archive the business impact assessment results promptly so that the problem can still be traced even after it fades from attention over time.

4.5 Determine the root cause

Determine the root cause of a data problem before correcting it. A problem can have many possible root causes, and what surfaces first is often only a symptom rather than the true cause of the bad data. During analysis, keep tracing the data to localize the issue, or ask "why" several times in succession until the root cause emerges, so that the problem is cured permanently rather than patched temporarily.

4.6 Develop an improvement plan

Building on the detailed problem analysis and cause determination of the previous steps, this step develops a sound data quality improvement plan, including recommendations for fixing known data problems and for preventing similar bad data in the future.

4.7 Prevent future data errors

Following the design of the improvement plan, put measures in place to prevent bad data from arising in the future.

4.8 Correct current data errors

Fix existing data problems according to the design of the improvement plan. This step is mostly "dirty work," but it is essential to reaching the final quality goal.

4.9 Implement controls and monitoring

Monitor continuously to determine whether the desired results have been achieved.

4.10 Communicate actions and results

Communicate results and progress throughout the project to keep the overall effort moving forward.

V. Data quality product design

5.1 Data product value

A complete method for organizing check standards, with indicator rule templates; automated checking, handling, and problem notification, enabling unattended operation; a comprehensive data analysis mechanism that accelerates problem resolution; a standardized problem management process and system that manages each stage of a problem precisely; and a sound mechanism for sharing solutions to quality problems, closing the loop of data governance.

5.2 Problem-handling flow

Determine rules (data quality indicators) → discover problems (data quality checks) → raise problems (quality problem alarms) → solve problems (quality problem analysis) → summarize problems (problem management process).

5.3 Main function modules

1) Quality assessment

Provides a full range of data quality assessment capabilities covering duplication, relevance, correctness, completeness, consistency, compliance, and so on, giving the data a "physical examination" to identify and understand quality issues. With the evaluation system as the reference, the data must be collected, analyzed, and monitored to provide comprehensive and reliable information about its quality. Collection points are placed at key nodes of the data transfer pipeline, with collection rules configured according to the system's quality requirements; quality data gathered and statistically analyzed at each collection point yields that point's data analysis report.

2) Check execution

Provides the ability to generate check scripts from configured measurement rules and check methods, and supports both built-in scheduled execution of those scripts and execution driven by third-party scheduling tools.
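As a rough sketch (hypothetical, not the product's actual API) of how configured rules might be collected for a scheduler to execute:

```python
from typing import Callable, Dict
import pandas as pd

# Registry mapping a rule name to a check function that returns a metric.
CHECKS: Dict[str, Callable[[pd.DataFrame], float]] = {}

def register(name: str):
    """Decorator that adds a check function to the registry."""
    def wrap(fn: Callable[[pd.DataFrame], float]):
        CHECKS[name] = fn
        return fn
    return wrap

@register("email_null_ratio")
def email_null_ratio(df: pd.DataFrame) -> float:
    return df["email"].isna().mean()

def run_all_checks(df: pd.DataFrame) -> Dict[str, float]:
    """Entry point a scheduler (cron, Airflow, etc.) would invoke on each run."""
    return {name: fn(df) for name, fn in CHECKS.items()}

df = pd.DataFrame({"email": ["a@x.com", None]})
print(run_all_checks(df))  # {'email_null_ratio': 0.5}
```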

3) Quality control

The system provides an alarm mechanism: thresholds are set on check rules or methods, and rules that exceed their thresholds trigger alarms and notifications at different severity levels.
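A minimal sketch of such tiered thresholds (the 5% and 20% levels are hypothetical):

```python
def classify_alarm(metric: float, warn: float, critical: float) -> str:
    """Map a check result to an alarm level; higher metric values are worse."""
    if metric >= critical:
        return "CRITICAL"
    if metric >= warn:
        return "WARNING"
    return "OK"

# e.g. null ratio of a mandatory field: warn at 5%, escalate at 20%.
print(classify_alarm(0.08, warn=0.05, critical=0.20))  # WARNING
```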

4) Problem management

Supports process-based handling of data problems: standardizing the handling mechanism and steps, strengthening problem verification, and improving data quality. Problems are discovered through the quality evaluation and quality data collection systems; we then need to respond promptly, trace each problem's cause and formation mechanism, take improvement measures suited to the problem type, and afterwards continuously track and verify the effect on data quality, forming a positive feedback loop of continuous improvement.

Establish data standards or intake standards at the source, standardize data definitions, and set up processes and systems to monitor the quality of data transformations during transfer. Solve problems where they are found, and do not let problem data flow downstream.

5) Quality reports

The system provides a rich set of APIs for custom development of data quality reports, and also ships with common built-in quality reports.

6) Quality analysis

Provides a variety of problem analysis capabilities, including lineage analysis, impact analysis, and full-chain analysis, to locate the root causes of problems.

Author: Han Feng

First published on the author's personal WeChat official account "Han Feng Channel".

Source: Yixin Institute of Technology
