Do you know some of the most common mistakes in data mining?

2025-01-16 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/03 Report--

According to Dr. Elder's summary, the top 10 mistakes include:

0. Lack of data (Lack Data)

1. Paying too much attention to training (Focus on Training)

2. Relying on only one technique (Rely on One Technique)

3. Asking the wrong question (Ask the Wrong Question)

4. Listening (only) to the data (Listen (only) to the Data)

5. Using information from the future (Accept Leaks from the Future)

6. Discarding cases that should not be ignored (Discount Pesky Cases)

7. Credulous extrapolation (Extrapolate)

8. Trying to answer every question (Answer Every Inquiry)

9. Careless sampling (Sample Casually)

10. Putting too much faith in the best model (Believe the Best Model)

Ten mistakes that are easy to make in data mining

Details are as follows:

0. Lack of data (Lack Data)

For classification problems or estimation problems, there is often a lack of accurately labeled cases.

For example: in fraud detection (Fraud Detection), there may be only a handful of fraudulent transactions among millions, and many of those are not correctly labeled, which takes considerable effort to correct before modeling; in credit scoring (Credit Scoring), potential high-risk customers must be tracked over a long period (say, two years) to accumulate enough scoring samples.

1. Pay too much attention to training (Focus on Training)

IDMer: Just as athletic training increasingly emphasizes game-like conditions, purely closed-door practice often produces athletes who shine in training but fall apart in competition. In fact, only the model's score on out-of-sample data is really meaningful! Otherwise you might as well use a lookup table.

For example: cancer detection (Cancer detection). Doctors and researchers at MD Anderson (1993) used neural networks for cancer detection and were surprised to find that as training time grew (from days to weeks), performance on the training set improved only slightly, while performance on the test set dropped significantly. Researchers in machine learning and computer science often try to make a model perform as well as possible on known data, which usually leads to overfitting (overfit).

Solution:

The typical way to solve this problem is re-sampling (Re-Sampling). Resampling techniques include bootstrap, cross-validation, jackknife, leave-one-out, and so on.
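A minimal pure-Python sketch of one such technique, k-fold cross-validation, applied to a toy constant-mean "model" (the data and the learner here are hypothetical, chosen only to keep the loop visible):

```python
import random

def k_fold_cv(xs, ys, k, fit, predict):
    """Estimate out-of-sample error with k-fold cross-validation."""
    idx = list(range(len(xs)))
    random.Random(0).shuffle(idx)          # shuffle before splitting
    folds = [idx[i::k] for i in range(k)]  # k disjoint folds
    fold_errors = []
    for fold in folds:
        train = [i for i in idx if i not in fold]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        # Score only on the held-out fold, never on the training rows.
        mse = sum((predict(model, xs[i]) - ys[i]) ** 2 for i in fold) / len(fold)
        fold_errors.append(mse)
    return sum(fold_errors) / k            # average held-out error

def fit_mean(xs, ys):
    # Toy "model": just the training mean (a constant learner).
    return sum(ys) / len(ys)

def predict_mean(model, x):
    return model

xs = list(range(6))
ys = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
cv_error = k_fold_cv(xs, ys, k=3, fit=fit_mean, predict=predict_mean)
```

The key discipline is the same regardless of the learner: the rows used to fit the model never contribute to its score.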

2. Rely on only one technique (Rely on One Technique)

IDMer: this mistake has something in common with mistake 10; please refer to that solution as well. Without comparison there is no better or worse; the dialectical idea is fully reflected here. "When a child holds a hammer, the whole world looks like a nail." To do the job well you need a complete toolbox. Do not simply trust the result of a single method; at the very least, compare it with traditional methods such as linear regression or linear discriminant analysis.

Result: according to statistics from the journal Neural Networks, over the past three years only one in six articles achieved both of the following: testing on a hold-out set independent of the training samples, and comparing against other widely used methods.

Solution:

Use a range of good tools and methods. (Each additional tool or method may bring at most a 5% to 10% improvement.)
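As a sketch of the comparison idea, the fragment below pits two techniques, a constant baseline and a least-squares line, against each other on the same held-out data (all numbers are hypothetical; in real work you might compare, say, a tree ensemble against linear regression):

```python
# Hypothetical training and held-out data; the true relation is y = 2x + 1.
train_x, train_y = [0, 1, 2, 3], [1.0, 3.0, 5.0, 7.0]
test_x, test_y = [4, 5], [9.0, 11.0]

# Technique A: constant baseline (predict the training mean).
mean_y = sum(train_y) / len(train_y)
err_baseline = sum((mean_y - y) ** 2 for y in test_y) / len(test_y)

# Technique B: least-squares line through the training points.
n = len(train_x)
mx, my = sum(train_x) / n, sum(train_y) / n
slope = (sum((x - mx) * (y - my) for x, y in zip(train_x, train_y))
         / sum((x - mx) ** 2 for x in train_x))
intercept = my - slope * mx
err_line = sum((slope * x + intercept - y) ** 2
               for x, y in zip(test_x, test_y)) / len(test_y)

# Keep whichever technique wins on the held-out data.
best = "line" if err_line < err_baseline else "baseline"
```

The point is not which model wins here, but that the comparison is made on data neither model was fitted to.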

3. Asked the wrong question (Ask the Wrong Question)

IDMer: generally speaking, classification algorithms report classification accuracy as the standard for measuring model quality, but in real projects we hardly ever look at that metric. Why? Because it is not what we actually care about.

A) The goal of the project: be sure to aim at the right target.

For example: fraud detection (focus on the positive cases!) (Shannon Lab's analysis of international long-distance calls): do not try to classify fraudulent versus non-fraudulent behavior across calls in general; instead, focus on characterizing normal calls, and then detect abnormal calls against that baseline.
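That "describe normal, then flag deviations" idea can be sketched in a few lines. This is a deliberately tiny one-feature stand-in (hypothetical call durations); the Shannon Lab work used far richer features of international calls:

```python
import statistics

# Hypothetical durations (minutes) of calls known to be normal.
normal_calls = [3.1, 2.9, 3.0, 3.2, 2.8, 3.1, 3.0, 2.9]

# Characterize "normal" with simple summary statistics.
mu = statistics.mean(normal_calls)
sigma = statistics.pstdev(normal_calls)

def is_anomalous(duration, k=3.0):
    """Flag a call that deviates more than k sigmas from normal behavior."""
    return abs(duration - mu) > k * sigma

# Two ordinary calls and one 47-minute outlier.
flags = [is_anomalous(d) for d in [3.0, 2.95, 47.0]]
```

Notice that no fraudulent examples were needed to build the detector; only the normal behavior was modeled.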

B) the goal of the model: let the computer do what you want it to do

Most researchers indulge in making the model converge to minimize error, which gives them a sense of mathematical beauty. But what the computer should be doing is improving the business, not merely maximizing the model's computational accuracy.

4. Listen (only) to the data (Listen (only) to the Data)

IDMer: there is nothing wrong with "letting the data speak". The key is to remember another saying: listen to all sides and you will be enlightened; heed only one side and you will be left in the dark. If data plus tools could solve every problem, what would we still need people for?

4a. Opportunistic data: the data itself can only help the analyst find significant results; it cannot tell you whether those results are right or wrong.

4b. Designed experiments: some experimental designs mix in artificial components, and the results of such experiments are often not credible.

5. Using future information (Accept Leaks from the Future)

IDMer: it seems impossible, but it is an easy mistake to make in practice, especially when you face thousands of variables. Being serious, careful and organized is a basic requirement for data mining practitioners.

Forecasting (Forecast) example: a neural network was built to forecast the Bank of Chicago's interest rate on a given day, and the model reached 95% accuracy; however, that day's interest rate had been used as an input variable to the model. Another example from the financial industry: a 3-day moving average was used for forecasting, but the midpoint of the moving-average window was set at today, so the average included tomorrow's value.

Solution:

Look closely at any variable that makes the results extraordinarily good; it may be one that cannot or should not be used directly. Time-stamp the data to prevent misuse.
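The moving-average leak described above can be made concrete: a centred window reaches one step into the future, while a trailing window uses only values already known at prediction time (toy prices, purely illustrative):

```python
# Hypothetical daily prices, indexed by day.
prices = [10.0, 11.0, 12.0, 13.0, 14.0]

def centred_ma(xs, i):
    # Averages xs[i-1], xs[i], xs[i+1]: it reads xs[i+1],
    # information from the future -- a leak when used as a feature for day i.
    return (xs[i - 1] + xs[i] + xs[i + 1]) / 3

def trailing_ma(xs, i):
    # Averages xs[i-2], xs[i-1], xs[i]: only values known at time i.
    return (xs[i - 2] + xs[i - 1] + xs[i]) / 3

leaky = centred_ma(prices, 2)    # averages days 1..3 (includes day 3)
safe = trailing_ma(prices, 2)    # averages days 0..2 only
```

Time-stamping every column makes this kind of audit mechanical: any feature whose timestamp is later than the prediction time is a leak.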

6. Abandoned cases that should not be ignored (Discount Pesky Cases)

IDMer: is it "better to be the head of a chicken than the tail of a phoenix", or "the great recluse hides in the city, the small recluse hides in the wilderness"? Different attitudes toward life can lead to equally wonderful lives, and different data may contain equally important value. Outliers can lead to wrong results (for example, a misplaced decimal point in a price), but they may also be the answer to the question (such as the ozone hole). So examine these anomalies carefully. The most exciting phrase in research is not "Aha!" but "That's a little strange." Inconsistencies in the data may be clues to solving the problem, and digging deeper may crack a big business problem.

For example, in direct-mail marketing, inconsistent home addresses discovered while merging and cleaning the data may point to a new marketing opportunity.

Solution:

Visualization can help you check whether a large number of assumptions actually hold.
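Alongside visualization, a quick programmatic screen helps surface pesky cases for inspection. A sketch using the common 1.5 × IQR rule on hypothetical prices, where 2.49 might be a misplaced decimal point for 249 (the point is to inspect flagged records, not silently delete them):

```python
import statistics

# Hypothetical prices; one suspicious value among ordinary ones.
prices = [240, 251, 247, 255, 2.49, 249, 252, 246]

# quantiles(n=4) returns the three quartile cut points [Q1, median, Q3].
q1, _, q3 = statistics.quantiles(prices, n=4)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Records outside the whiskers deserve a human look.
suspects = [p for p in prices if p < low or p > high]
```

Whether a suspect is a data-entry error or an ozone-hole-style discovery can only be decided by examining it, which is exactly the point of this section.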

7. Credulous extrapolation (Extrapolate)

IDMer: again a dialectical point: things are constantly developing and changing. People with little experience tend to jump to conclusions, and even when counterexamples appear, they are reluctant to give up their original ideas. A mantra about dimensionality: intuition that works in low dimensions is often meaningless in high dimensions.

Solution: think like the theory of evolution. There are no correct conclusions, only increasingly accurate ones.

8. Try to answer all the questions (Answer Every Inquiry)

IDMer: it is a bit like how I encourage myself when climbing a mountain: "I don't know when I will reach the top, but I know each step brings me closer to the finish line." "I don't know" is itself a meaningful model output. The model may not answer every question with 100% accuracy, but at least it can help us estimate the likelihood of a given outcome.

9. Careless sampling (Sample Casually)

9a. Sampling down. For example, MD Direct Mail ran a response-prediction analysis and found the proportion of non-responding customers in the data set far too high (1 million direct-mail customers in total, of whom more than 99% did not respond). So the modeler sampled as follows: put all responders into the sample set, then systematically sample the non-responders, taking every 10th record, until the sample reached 100,000. But the model produced the following rule: everyone living in Ketchikan, Wrangell and Ward Cove, Alaska will respond to the mailing. This is obviously a questionable conclusion. (The problem lies in the sampling method: the original data set was sorted by zip code, and the every-10th systematic sample happened to take no non-responders from those three areas, hence the conclusion.)

Solution to 9a: "shake it before drinking!" First shuffle the original data set to ensure the randomness of the sample.

9b. Sampling up. For example, in credit scoring, because the proportion of defaulting customers is generally very low, modelers often artificially inflate that proportion (for example, by giving the defaulting customers five times their normal weight). During modeling it was found that as the model grew more and more complex, its accuracy at identifying defaulters kept rising, but the false-alarm rate on normal customers rose as well. (The problem lies in how the data set was partitioned: the weights of the defaulting customers were increased before the original data set was split into training and test sets.)

Solution to 9b: split the data set first, and only then increase the weight of the defaulting customers in the training set.
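Both sampling fixes can be sketched together: shuffle before any systematic sampling, and split into train/test before up-weighting the rare class (the records below are hypothetical stand-ins, sorted by zip code as in the MD Direct Mail story):

```python
import random

# Hypothetical (zip_code, responded) records, sorted by zip;
# responders happen to cluster in the low zip codes.
records = [(z, z < 100) for z in range(1000)]

# Bad: every-10th record from the sorted list inherits the zip-code order.
biased = records[::10]

# Good (fix 9a): shuffle first, then sample.
rng = random.Random(42)
shuffled = records[:]
rng.shuffle(shuffled)
sample = shuffled[:100]

# Good (fix 9b): split first, then up-weight the rare class
# only in the training partition, never in the test partition.
split = int(0.8 * len(shuffled))
train, test = shuffled[:split], shuffled[split:]
train_weights = [5.0 if responded else 1.0 for _, responded in train]
```

Because the test partition keeps its natural class proportions, the error rates measured on it remain honest estimates of real-world performance.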

10. Put too much faith in the best model (Believe the Best Model)

IDMer: the same old saying: "there is no best, only better!" And interpretability is not always necessary: a model that seems not entirely correct or explainable can sometimes still be useful, while some of the variables used by the "best" model can distract attention too much (uninterpretability is sometimes an advantage). In general, many candidate variables look very similar to one another, yet the structure of the "best" model may look completely different from its close competitors. Note, however, that structural similarity does not imply functional similarity.

Solution: ensembling multiple models may yield better and more stable results.
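A minimal sketch of the ensembling idea: averaging a few imperfect "models" (here reduced to hypothetical point predictions) can cancel their individual biases:

```python
# Hypothetical ground truth and three single-model predictions.
truth = 10.0
predictions = {
    "model_a": 12.0,   # biased high
    "model_b": 8.0,    # biased low
    "model_c": 10.5,   # nearly right
}

# The ensemble prediction is the simple average of the members.
ensemble = sum(predictions.values()) / len(predictions)

errors = {name: abs(p - truth) for name, p in predictions.items()}
ensemble_error = abs(ensemble - truth)
```

In this toy setup the averaged prediction lands closer to the truth than any single member, which is the intuition behind bagging and model averaging; it also tends to make results more stable across resamples of the data.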

Conclusion

Thank you for reading. If there are any deficiencies, you are welcome to criticize and correct them.

