
Which two types of data quality problems does data analysis need to solve?

Updated 2025-02-23. Source: SLTechnology News&Howtos (shulou)


This article looks at the two types of data quality problems that data analysis has to solve, and at how to budget for each of them.

To solve problems systematically and efficiently, we have to divide and conquer, and that starts with understanding each problem before attacking it. The same applies to improving data quality: each kind of problem calls for a different approach at a different stage.

When a data quality improvement program starts, knowing how many incorrect or duplicate entries the database contains is not enough. We also need to know how the different types of errors are distributed across the data sources we collect from.

According to a blog post by Jim Barker, data quality problems fall into two types. This article explains how the two types differ and how to use the distinction to decide where the main share of the budget should go.
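As a rough illustration of looking at the distribution rather than a single total, the sketch below tallies errors by source and by kind. The issue log, the source names (`crm`, `billing`), and the error labels are all hypothetical, invented for this example.

```python
from collections import Counter

# Hypothetical issue log: one (source system, kind of error) pair per finding.
issue_log = [
    ("crm", "duplicate"),
    ("crm", "invalid_value"),
    ("billing", "duplicate"),
    ("crm", "duplicate"),
    ("billing", "stale_record"),
]


def tally_by_source(log):
    """Count how many errors of each kind were found in each source."""
    return Counter(log)


counts = tally_by_source(issue_log)

# Print one line per (source, kind) combination, sorted for readability.
for (source, kind), n in sorted(counts.items()):
    print(f"{source:8} {kind:14} {n}")
```

A breakdown like this shows which source produces which kind of error, which a single total cannot.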

Two types of data quality problems

Jim Barker, known as "Dr. Data," borrows a simple medical concept to define data quality problems. His blog post describes the two types and how they combine, and it has caught the interest of data analysts who struggle to track down the hidden defects that drag down the quality of their databases.

Type I data quality problems can be detected with automated tools. Type II data quality problems are far harder to see: everyone knows they exist, but they cannot be detected, let alone fixed, without knowledge of the specific context.

The differences can be summarized as follows:

Type I data quality problems require "knowing what": testing the completeness, consistency, uniqueness, and validity of the data. Data quality software finds these problems easily, and they can even be found by hand. No deep background knowledge or data analysis experience is needed: a value that fails a check on any of the four attributes can be judged wrong. For example, if a gender field contains the value 3, we can check it against the list of valid values and decide immediately whether it is acceptable.
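The gender-field example can be sketched as a few rule-based checks. This is a minimal sketch, not a real data quality tool: the field names, the set of allowed gender codes, and the sample records are assumptions made up for illustration.

```python
# Hypothetical rules: which fields must be filled in, and which gender codes are allowed.
REQUIRED_FIELDS = ("customer_id", "gender")
VALID_GENDER_CODES = {"M", "F", "U"}


def find_type1_issues(records):
    """Return (row index, problem) pairs for missing, duplicate and invalid values."""
    issues = []
    seen_ids = set()
    for i, rec in enumerate(records):
        # Completeness: every required field must have a value.
        for field in REQUIRED_FIELDS:
            if rec.get(field) in (None, ""):
                issues.append((i, f"missing {field}"))
        # Uniqueness: the same customer_id must not appear twice.
        cid = rec.get("customer_id")
        if cid not in (None, ""):
            if cid in seen_ids:
                issues.append((i, "duplicate customer_id"))
            seen_ids.add(cid)
        # Validity: a filled-in gender must be one of the allowed codes.
        gender = rec.get("gender")
        if gender not in (None, "") and gender not in VALID_GENDER_CODES:
            issues.append((i, f"invalid gender {gender!r}"))
    return issues


records = [
    {"customer_id": 1, "gender": "F"},
    {"customer_id": 2, "gender": 3},    # 3 is not an allowed code
    {"customer_id": 2, "gender": "M"},  # repeats customer_id 2
    {"customer_id": 4, "gender": ""},   # gender left empty
]
issues = find_type1_issues(records)
for row, problem in issues:
    print(row, problem)
```

Every check here is a fixed rule that needs no knowledge of the customer, which is what makes this kind of problem cheap to automate.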

Type II data quality problems require "knowing why": testing the timeliness, consistency, and accuracy of the data. Finding them takes research skill, insight, and experience; they cannot simply be looked up. These data sets often look fine on the surface, but the errors hide in the details and take time to discover. Jim gives the example of a retiree's employment record: unless we know that the person retired long ago, we cannot see that the record is wrong.

So the key to solving these data quality problems is a comprehensive, strategic approach rather than an isolated, one-sided view. When data quality is poor, we need both automated and manual ways to fix it. As the saying goes, when it rains, it pours.

Cost allocation

So how do we solve Type I and Type II data quality problems? Do they cost about the same to fix, or are the costs completely different?

The important point is that validation rules for Type I problems can be defined logically, which means we can write software to find and report them. Automated repair is fast and inexpensive, and it can be combined with manual review. Because Type I problems are essentially validation failures in the fields of a table, fixing the field-level problems fixes the Type I problems.

In our experience, Type I problems account for roughly 80% of data quality problems but consume only about 20% of the cost.

Type II problems often need several inputs before they can be detected, flagged, and eliminated. Every customer in our CRM system has a purchase date, but that date may be wrong or may disagree with the invoice or the shipping list. Only experts who scrutinize the records can resolve such problems and correct the CRM system by hand.

Enterprises normally find it hard to allocate resources sensibly in two situations: during rapid growth and during a period of heavy staff turnover. Type II problems are the minority, perhaps only the remaining 20% of data problems, yet they are likely to consume more than 80% of the budget. If the enterprise is losing many experienced people and can do nothing about it, Type II problems become even harder to handle, because the people who could fix them manually are no longer there.
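Software cannot decide which date is right, but it can narrow down what the experts have to look at. The sketch below compares CRM purchase dates with invoice dates and flags disagreements for manual review. The customer IDs, the dates, and the one-day tolerance are hypothetical values chosen for illustration.

```python
from datetime import date

# Hypothetical extracts: customer ID -> purchase date (CRM) and invoice date (billing).
crm_dates = {
    "C001": date(2024, 3, 1),
    "C002": date(2024, 3, 5),
    "C003": date(2024, 4, 2),
}
invoice_dates = {
    "C001": date(2024, 3, 1),
    "C002": date(2024, 3, 9),
}


def flag_for_review(crm, invoices, tolerance_days=1):
    """List records whose CRM purchase date disagrees with the invoice date."""
    flagged = []
    for cid, purchased in sorted(crm.items()):
        invoiced = invoices.get(cid)
        if invoiced is None:
            flagged.append((cid, "no invoice found"))
        elif abs((purchased - invoiced).days) > tolerance_days:
            flagged.append((cid, f"purchase {purchased} vs invoice {invoiced}"))
    return flagged


flagged = flag_for_review(crm_dates, invoice_dates)
for cid, reason in flagged:
    print(cid, reason)
```

The script only produces a review list. Deciding whether the CRM or the invoice is wrong still takes someone who knows the business.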
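To see what the 80/20 split implies per problem, here is a small worked calculation. It takes the article's round figures at face value (80% of problems for 20% of cost, and the reverse), and the budget and problem count are hypothetical numbers chosen only to make the arithmetic concrete.

```python
from fractions import Fraction

# Round figures from the text, in percent.
TYPE1_SHARE_OF_PROBLEMS, TYPE1_SHARE_OF_COST = 80, 20
TYPE2_SHARE_OF_PROBLEMS, TYPE2_SHARE_OF_COST = 20, 80


def cost_per_problem(budget, total_problems, share_of_problems, share_of_cost):
    """Average cost of fixing one problem of the given type."""
    problems = Fraction(total_problems * share_of_problems, 100)
    cost = Fraction(budget * share_of_cost, 100)
    return cost / problems


# Hypothetical program: a budget of 100,000 spread over 1,000 known problems.
budget, total_problems = 100_000, 1_000
type1 = cost_per_problem(budget, total_problems,
                         TYPE1_SHARE_OF_PROBLEMS, TYPE1_SHARE_OF_COST)
type2 = cost_per_problem(budget, total_problems,
                         TYPE2_SHARE_OF_PROBLEMS, TYPE2_SHARE_OF_COST)

print(f"Type I:  {float(type1):.0f} per problem")
print(f"Type II: {float(type2):.0f} per problem")
print(f"Ratio:   {float(type2 / type1):.0f}x")
```

Under these assumptions a Type II problem costs 16 times as much to fix as a Type I problem. The ratio depends only on the percentage split, not on the budget or the problem count.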

Improving accuracy

To improve data accuracy, we must treat Type I and Type II problems as separate problems and work on them at the same time. Type I problems offer quick wins, but Type II problems can only be solved with human expertise.

Over time, the data in a database goes out of date, so keeping it timely takes continuous effort. Data can be cleaned in the database or at the point of use, but Type I errors still need watching because they are reintroduced by imports and exports, corruption, manual editing, and human error. Type II problems arise naturally at this stage: data that passed validation and review may still be wrong today, because it was correct only when it was recorded and the context in which it is used has since changed.
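One modest way to keep an eye on timeliness is to record when each entry was last confirmed and to list the ones that have not been checked for a long time. The sketch below assumes a hypothetical `last_verified` field and a one-year threshold. Neither comes from the original post.

```python
from datetime import date, timedelta


def stale_records(records, today, max_age_days=365):
    """Return the IDs of records not verified within the last max_age_days."""
    cutoff = today - timedelta(days=max_age_days)
    return [r["id"] for r in records if r["last_verified"] < cutoff]


# Hypothetical employment records. Both pass field validation, but E1 has
# not been confirmed for years and may describe someone who has since retired.
records = [
    {"id": "E1", "status": "employed", "last_verified": date(2019, 5, 1)},
    {"id": "E2", "status": "employed", "last_verified": date(2024, 11, 20)},
]

to_recheck = stale_records(records, today=date(2025, 2, 23))
print(to_recheck)
```

An old verification date does not prove a record is wrong. It only marks the record as a candidate for someone with the right context to recheck.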

Ensure data integrity

Complete data helps us see the whole picture and drives decisions. As noted earlier, finding Type I problems is relatively simple, inexpensive, and fast. If your business has not yet adopted data quality software to address Type I problems, it should do so now to avoid wasted resources, brand damage, and a poor public image later.

For Type II problems, the key is to understand why they happen and take steps to prevent them. In day-to-day work, staff turnover and employee negligence often lead to poor data quality. Misallocated resources can also increase the number of Type II problems over time, and the cost of fixing them multiplies, because finding them in the data takes an expert eye.

Those are the two types of data quality problems that data analysis has to solve. Some of these points are likely to come up in day-to-day work.

