In-depth analysis of the Apriori algorithm for data mining association rules


01. Background and basic concepts of association rule mining

In the dataset shown below, each row of the table represents one purchase record. Note that we only care about whether an item appears in a record, not how many units were bought; purchasing ten cartons of milk counts the same as purchasing one.

TID | Items
----|----------------------------
 1  | bread, milk
 2  | bread, diapers, beer, eggs
 3  | milk, diapers, beer, cola
 4  | bread, milk, diapers, beer
 5  | bread, milk, diapers, cola

The set of all items appearing in the data records is called the total itemset. For the table above it is:

S = {milk, bread, diapers, beer, eggs, cola}

Association rules

An association rule is written X -> Y, where X and Y are two disjoint non-empty itemsets. For example, {beer} -> {diapers} is an association rule.

The strength of an association rule is measured by its support and its confidence.

Support

support(X -> Y) = (number of records in which the items of X and Y appear together) / (total number of records). For example: support({beer} -> {diapers}) = (number of records containing both beer and diapers) / (number of records) = 3/5 = 60%.

Confidence

confidence(X -> Y) = (number of records in which the items of X and Y appear together) / (number of records containing X). For example: confidence({diapers} -> {beer}) = (number of records containing both beer and diapers) / (number of records containing diapers) = 3/4 = 75%.

Summary

The higher the support and the confidence, the stronger the rule; association rule mining is the task of finding all rules that reach a given strength.
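To make the two measures concrete, here is a minimal Python sketch (an addition, not part of the original article) that computes them over the five transactions in the table above; the helper names support and confidence are illustrative.

```python
# Minimal sketch: support and confidence over the example transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

def support(x, y):
    # Fraction of records that contain every item of X and Y together.
    both = x | y
    return sum(1 for t in transactions if both <= t) / len(transactions)

def confidence(x, y):
    # Records containing X and Y together, divided by records containing X.
    hits_xy = sum(1 for t in transactions if (x | y) <= t)
    hits_x = sum(1 for t in transactions if x <= t)
    return hits_xy / hits_x

print(support({"beer"}, {"diapers"}))     # 0.6  -> 60%
print(confidence({"diapers"}, {"beer"}))  # 0.75 -> 75%
```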

02. An exhaustive algorithm for mining association rules

Association rule mining

Given a transaction dataset T, find all association rules whose support >= min_support and whose confidence >= min_confidence.

Exhaustive algorithm

The brute-force approach enumerates every possible itemset combination and tests whether each one satisfies the conditions. An itemset with n elements has 2^n - 1 non-empty subsets, so the time complexity is O(2^n). For the total itemset S = {milk, bread, diapers, beer, eggs, cola} in the table above, n = 6 and there are 63 combinations to check.

For simplicity, suppose the total itemset of product IDs is {1, 2, 3}. All possible non-empty combinations are then:

{1}, {2}, {3}, {1, 2}, {1, 3}, {2, 3}, {1, 2, 3}

There are 7 of them (2^3 - 1). Each combination is checked in turn to find the association rules that meet the support and confidence requirements.
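As a quick illustration (an added snippet, not from the original article), this enumeration can be reproduced with itertools:

```python
from itertools import combinations

items = [1, 2, 3]
# Every non-empty subset: C(3,1) + C(3,2) + C(3,3) = 2**3 - 1 = 7
subsets = [set(c)
           for r in range(1, len(items) + 1)
           for c in combinations(items, r)]
print(len(subsets))  # 7
print(subsets)       # [{1}, {2}, {3}, {1, 2}, {1, 3}, {2, 3}, {1, 2, 3}]
```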

An ordinary supermarket stocks well over 10,000 items, so an exponential-time algorithm cannot solve the problem within acceptable time.

How to mine the qualifying association rules quickly is therefore the central problem of association mining.

03. Apriori: an optimized algorithm for association rule mining

Mining association rules is carried out in two steps:

1) generate frequent itemsets

In this stage, all itemsets that meet the minimum support are found; these are called frequent itemsets.

2) generate rules

Rules that satisfy the minimum confidence are then generated from the frequent itemsets found in the previous step; the resulting rules are called strong rules.

Most of the time in association rule mining is spent on the first step, generating frequent itemsets; because relatively few frequent itemsets are found, step 2) takes far less time than step 1).
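As a sketch of step 2), the snippet below (an illustrative addition) derives strong rules from a table of frequent-itemset support counts; the counts dictionary and the min_confidence value are hypothetical stand-ins for the output of step 1).

```python
from itertools import combinations

# Hypothetical support counts produced by step 1) on the example data.
counts = {
    frozenset({"diapers"}): 4,
    frozenset({"beer"}): 3,
    frozenset({"diapers", "beer"}): 3,
}
min_confidence = 0.7

# Split each frequent itemset into antecedent X and consequent Y, and keep
# the rule X -> Y when count(X and Y together) / count(X) clears the threshold.
for itemset in (s for s in counts if len(s) > 1):
    for r in range(1, len(itemset)):
        for x in map(frozenset, combinations(itemset, r)):
            y = itemset - x
            conf = counts[itemset] / counts[x]
            if conf >= min_confidence:
                print(f"{set(x)} -> {set(y)} (confidence {conf:.0%})")
```

On these counts it prints {diapers} -> {beer} with 75% confidence and {beer} -> {diapers} with 100% confidence.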

To shorten the time spent generating frequent itemsets, sets that cannot possibly be frequent must be eliminated as early as possible. The Apriori algorithm prunes them using two laws.

Two laws

From high to low: if a set is a frequent itemset, then all of its subsets are frequent itemsets. For example, if {A, B} is a frequent itemset, then its subsets {A} and {B} are also frequent itemsets.

From low to high: if a set is not a frequent itemset, then none of its supersets is a frequent itemset. For example, if {A} is not a frequent itemset, then none of its supersets, such as {A, B} or {A, B, C}, can be a frequent itemset.

The law with practical value is the second one: building from low-level frequent itemsets up to high-level ones. Suppose the support of the 2-itemset {A, B} does not reach the threshold, so it is not frequent. Then no 3-itemset such as {A, B, C}, nor anything larger, can be frequent: if one were, then by the first law its subset {A, B} would have to be frequent, contradicting the assumption.

First, the 1-item candidate itemsets are counted, candidates that fail the support threshold are discarded, and the frequent 1-itemsets are obtained. From these, the 2-item candidates are generated and filtered the same way to obtain the frequent 2-itemsets, and so on level by level. Law 2 prunes candidates along the way: for example, if {milk, beer} is not a frequent itemset, then neither {milk, beer, diapers} nor {milk, beer, bread} can be one.

Apriori algorithm

Apriori is a candidate-elimination algorithm: it generates a set of candidates, eliminates those that fail the preset support and confidence thresholds (using Law 2 to prune), and loops until no new candidates can be generated.

Pseudo code of the algorithm:
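The original pseudocode figure did not survive extraction. In its place is a minimal Python sketch of the frequent-itemset stage: candidates are generated level by level, and Law 2 appears as the prune step. The function name apriori and the parameter min_support are illustrative, not from the article.

```python
from itertools import combinations

def apriori(transactions, min_support):
    # Return {frequent itemset: support}, built level by level.
    n = len(transactions)
    items = {item for t in transactions for item in t}
    level = {frozenset({i}) for i in items}  # level-1 candidates
    frequent = {}
    while level:
        # Count each candidate in one pass over the records.
        counts = {c: sum(1 for t in transactions if c <= t) for c in level}
        survivors = {c: k / n for c, k in counts.items() if k / n >= min_support}
        frequent.update(survivors)
        # Join step: merge k-itemsets that differ in one item into (k+1)-candidates.
        keys = list(survivors)
        candidates = {a | b for a in keys for b in keys if len(a | b) == len(a) + 1}
        # Prune step (Law 2): drop candidates with any infrequent k-subset.
        level = {c for c in candidates
                 if all(frozenset(s) in survivors
                        for s in combinations(c, len(c) - 1))}
    return frequent

transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
for itemset, sup in apriori(transactions, min_support=0.6).items():
    print(set(itemset), f"{sup:.0%}")
```

With min_support = 0.6 on the example data, this yields the frequent 1-itemsets {bread}, {milk}, {diapers}, {beer} and the frequent 2-itemsets {bread, milk}, {bread, diapers}, {milk, diapers}, {diapers, beer}; no 3-itemset reaches the threshold.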

This article has summarized Apriori, the classical algorithm for mining association rules. It exploits the law that if a set is not a frequent itemset, then none of its supersets is a frequent itemset, and it works from the bottom up, mining the frequent itemsets at every level that meet the support and confidence thresholds.

Conclusion

Thank you for reading. If there are any deficiencies, criticism and corrections are welcome.
