How to Deeply Understand the Apriori Association Analysis Algorithm in Python


Today I will talk with you about how to deeply understand the Apriori association analysis algorithm in Python. Many people may not know much about it, so I have summarized the following content in the hope that you will get something out of this article.

There is a famous story about a supermarket in the United States that displayed two seemingly unrelated items, beer and diapers, together, and ended up increasing the sales of both. The name of that supermarket is Walmart.

Doesn't that seem a little strange? The case did turn out to have a sound explanation: American wives often asked their husbands to buy diapers for the children on their way home from work, and the husbands would grab their favorite beer while picking up the diapers. But that is, after all, an after-the-fact explanation; what deserves more attention is how to discover such association rules between items in the first place. This Python tutorial introduces how to use the Apriori algorithm to find association rules between items.

I. Overview of Association Analysis

Finding association rules between items means uncovering the latent relationships among them. There are two steps; take the supermarket as an example.

1. Find the sets of items that frequently appear together; we call these frequent itemsets. For example, a supermarket's frequent itemsets might be {{beer, diaper}, {egg, milk}, {banana, apple}}.

2. On the basis of the frequent itemsets, use an association rule algorithm to find the association relationships among those items.

In short: first find the frequent itemsets, then mine association rules from them.

Why find the frequent itemsets first? Take the supermarket again. What is the purpose of looking for association rules between items? To increase sales of goods. If few people buy an item in the first place, then no matter how much you boost it through associations, the absolute gain will stay small. So, for both efficiency and value, priority must go to finding associations among the items people buy frequently.

Since finding item association rules takes two steps, let's take them one at a time. This article first introduces how to use Apriori to find the frequent itemsets; the next article will then mine the item associations from the frequent itemsets that Apriori produces.

II. Basic Concepts of the Apriori Algorithm

Before introducing the Apriori algorithm, we need to understand a few concepts. Don't worry, we will illustrate them with the example below.

These are some of the purchase records in a supermarket:

Transaction  Items
0            milk, onion, nutmeg, kidney bean, egg, yogurt
1            dill, onion, nutmeg, kidney bean, egg, yogurt
2            milk, apple, kidney bean, egg
3            milk, unicorn, corn, kidney bean, yogurt
4            corn, onion, onion, kidney bean, ice cream, egg

1. Several Concepts in Association Analysis

Support: support can be understood as an item's current popularity. It is calculated as:

Support(A) = (number of records containing item A) / (total number of records)

Using the supermarket records above as an example: there are five transactions in total, and milk appears in three of them, so the support of {milk} is 3/5. The support of {egg} is 4/5. Milk and egg appear together in two transactions, so the support of {milk, egg} is 2/5.
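To make the arithmetic concrete, here is a minimal sketch (not from the original tutorial) of the support formula; the records list and the support() helper are names introduced here purely for illustration.

# A minimal sketch of the support formula over the five transactions above.
records = [
    {'milk', 'onion', 'nutmeg', 'kidney bean', 'egg', 'yogurt'},
    {'dill', 'onion', 'nutmeg', 'kidney bean', 'egg', 'yogurt'},
    {'milk', 'apple', 'kidney bean', 'egg'},
    {'milk', 'unicorn', 'corn', 'kidney bean', 'yogurt'},
    {'corn', 'onion', 'kidney bean', 'ice cream', 'egg'},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

print(support({'milk'}, records))         # 3/5 = 0.6
print(support({'egg'}, records))          # 4/5 = 0.8
print(support({'milk', 'egg'}, records))  # 2/5 = 0.4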

Confidence: confidence expresses how likely you are to also buy item B once you have bought item A. It is calculated as:

Confidence(A -> B) = (number of records containing both A and B) / (number of records containing A)

For example, we already know that {milk, egg} was purchased together twice and that egg was purchased four times, so Confidence(egg -> milk) = 2/4.

Lift: lift describes how much selling one item raises the sales rate of another. It is calculated as:

Lift(A -> B) = Confidence(A -> B) / Support(B)

For example: above we calculated Confidence(egg -> milk) = 2/4 = 0.5, and the support of milk is Support(milk) = 3/5 = 0.6, so Lift(egg -> milk) = 0.5 / 0.6 ≈ 0.83.

When Lift(A -> B) is greater than 1, the more of item A that sells, the more of item B sells too. A lift equal to 1 means items A and B are uncorrelated. Finally, a lift below 1 means that buying A suppresses sales of B.
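Continuing the illustrative sketch from above (reusing its records list and support() helper; confidence() and lift() are likewise hypothetical names), the two remaining measures can be computed the same way:

def confidence(a, b, transactions):
    """Confidence(A -> B) = support(A and B together) / support(A)."""
    return support(a | b, transactions) / support(a, transactions)

def lift(a, b, transactions):
    """Lift(A -> B) = confidence(A -> B) / support(B)."""
    return confidence(a, b, transactions) / support(b, transactions)

print(confidence({'egg'}, {'milk'}, records))  # (2/5) / (4/5) = 0.5
print(lift({'egg'}, {'milk'}, records))        # 0.5 / 0.6 ≈ 0.83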

Support is what Apriori itself uses; confidence and lift come into play in the next article, when we look for item association rules.

2. Introduction to the Apriori Algorithm

What Apriori does is find the frequent itemsets among items according to their support. As we saw above, the higher the support, the more popular the itemset. So how is the support threshold determined? Subjectively: we give Apriori a minimum-support parameter, and Apriori returns the frequent itemsets whose support reaches that minimum.

At this point some readers may notice: since we know the formula for support, couldn't we simply enumerate every combination of items and compute the support of each?

Yes, in principle. Enumerating every combination would indeed find all the frequent itemsets. The problem is that it takes far too long and is far too inefficient: with N items there are 2^N - 1 itemsets to check, so every additional item doubles the work (20 items already give 2^20 - 1 = 1,048,575 combinations). Apriori is an efficient algorithm for finding frequent itemsets. Its core is the following sentence:

If an itemset is frequent, so are all its subsets.

This sentence may look useless, but its contrapositive is very useful:

If an itemset is infrequent, then all of its supersets are also infrequent.

For example, once we find that the itemset {A, B} is infrequent, then every superset of {A, B}, such as {A, B, C}, {A, B, D}, and so on, is also infrequent and can be ignored without calculation.

Using this idea, the Apriori algorithm can discard a large number of infrequent itemsets and greatly reduce the amount of computation.

3. Apriori Algorithm Flow

To use the Apriori algorithm, we need to provide two inputs: the dataset and the minimum support. We already know that Apriori examines combinations of items; how does it do so without enumerating everything? Level by level. First consider the single items and remove those whose support is below the minimum; then combine the survivors into two-item sets and again eliminate the combinations that fail the threshold; and so on, until no more combinations can be formed. A rough sketch of this flow follows below.
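Here is a simplified, from-scratch sketch of that level-wise flow. It is only an illustration under the assumptions stated in the comments, not the mlxtend implementation used later in this article, and apriori_sketch is a hypothetical name.

def apriori_sketch(transactions, min_support):
    """Grow itemsets level by level, pruning by minimum support."""
    n = len(transactions)
    items = {i for t in transactions for i in t}
    # level 1: single items that meet the minimum support
    current = [frozenset([i]) for i in sorted(items)
               if sum(i in t for t in transactions) / n >= min_support]
    frequent = list(current)
    k = 2
    while current:
        # candidates of size k, built as unions of frequent (k-1)-itemsets;
        # a full Apriori would additionally drop any candidate that
        # contains an infrequent subset before counting support
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # keep only the candidates that meet the minimum support
        current = [c for c in candidates
                   if sum(c <= t for t in transactions) / n >= min_support]
        frequent.extend(current)
        k += 1
    return frequent

transactions = [
    {'milk', 'onion', 'nutmeg', 'kidney bean', 'egg', 'yogurt'},
    {'dill', 'onion', 'nutmeg', 'kidney bean', 'egg', 'yogurt'},
    {'milk', 'apple', 'kidney bean', 'egg'},
    {'milk', 'unicorn', 'corn', 'kidney bean', 'yogurt'},
    {'corn', 'onion', 'kidney bean', 'ice cream', 'egg'},
]
for itemset in apriori_sketch(transactions, min_support=0.6):
    print(sorted(itemset))

On the five-transaction dataset with a minimum support of 0.6, this prints the same eleven frequent itemsets (five singles, five pairs, one triple) that mlxtend reports in the result table further below.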

Now let's put the Apriori algorithm into practice.

III. Apriori Algorithm in Practice

Let's walk through a simple example with Apriori. The library used here is mlxtend.

Before showing the code, let's introduce the parameters of the apriori function.

def apriori(df, min_support=0.5, use_colnames=False, max_len=None)

The parameters are as follows:

df: needless to say, this is our dataset (a one-hot encoded DataFrame).

min_support: the minimum support threshold.

use_colnames: defaults to False, in which case the returned itemsets are identified by column index; if True, the item names are shown instead.

max_len: the maximum size of the item combinations. Defaults to None, meaning no limit; if you only need combinations of at most two items, set it to 2 (see the example right after this list).
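For instance, a hypothetical call restricting the search to pairs could look like this, assuming df is the one-hot encoded DataFrame built in the example below:

# only search for frequent itemsets of at most 2 items
freq_pairs = apriori(df, min_support=0.6, use_colnames=True, max_len=2)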

OK, let's use a simple example to see how to use the Apriori algorithm to find frequent itemsets.

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori

# set up the dataset
records = [
    ['milk', 'onion', 'nutmeg', 'kidney bean', 'egg', 'yogurt'],
    ['dill', 'onion', 'nutmeg', 'kidney bean', 'egg', 'yogurt'],
    ['milk', 'apple', 'kidney bean', 'egg'],
    ['milk', 'unicorn', 'corn', 'kidney bean', 'yogurt'],
    ['corn', 'onion', 'onion', 'kidney bean', 'ice cream', 'egg'],
]

# one-hot encode the transactions
te = TransactionEncoder()
te_ary = te.fit(records).transform(records)
df = pd.DataFrame(te_ary, columns=te.columns_)

# find frequent itemsets with Apriori
freq = apriori(df, min_support=0.6, use_colnames=True)

First of all, the goods need to be one-hot encoded and represented as Boolean values. So-called one-hot encoding, intuitively speaking, is a scheme that uses as many bits as there are states, with exactly one bit set to 1 and all the others 0. For example, ice cream appears only in the final transaction and in none of the others, so ice cream can be represented as [0, 0, 0, 0, 1].

The encoded data here is as follows:

   ice cream  onion  milk   unicorn  corn   nutmeg  kidney bean  apple  dill   yogurt  egg
0  False      True   True   False    False  True    True         False  False  True    True
1  False      True   False  False    False  True    True         False  True   True    True
2  False      False  True   False    False  False   True         True   False  False   True
3  False      False  True   True     True   False   True         False  False  True    False
4  True       True   False  False    True   False   True         False  False  False   True

We set the minimum support to 0.6, so only itemsets with support of at least 0.6 count as frequent itemsets. The final result is as follows:

    support  itemsets
0   0.6      (onion)
1   0.6      (milk)
2   1.0      (kidney bean)
3   0.6      (yogurt)
4   0.8      (egg)
5   0.6      (kidney bean, onion)
6   0.6      (onion, egg)
7   0.6      (milk, kidney bean)
8   0.6      (yogurt, kidney bean)
9   0.8      (kidney bean, egg)
10  0.6      (kidney bean, onion, egg)
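As a small follow-up not in the original article: the returned object is an ordinary pandas DataFrame whose itemsets column holds frozensets, so it can be filtered with standard pandas operations, for example to keep only the two-item frequent itemsets:

# each itemset is a frozenset, so len() gives its size
freq['length'] = freq['itemsets'].apply(len)
# keep only the two-item frequent itemsets
print(freq[freq['length'] == 2])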

Today we introduced several concepts used in association analysis, namely support, confidence, and lift. We then described what the Apriori algorithm does and how it finds the frequent itemsets efficiently. Finally, we used the Apriori algorithm to find the frequent itemsets in an example.

After reading the above, do you have a better understanding of the Apriori association analysis algorithm in Python? If you want to learn more, please follow the industry information channel. Thank you for your support.
