How to use Apriori algorithm in R language 07/16 Update SLTechnology News&Howtos

2025-07-16 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/02 Report--

This article explains in detail how to use the Apriori algorithm in the R language. I hope you will have a working understanding of the topic after reading it.

I. Concepts

Association analysis is used to discover meaningful relationships hidden in large datasets. The discovered relationships can be expressed as association rules or frequent itemsets.

Itemset: in association analysis, a set containing 0 or more items is called an itemset. If an itemset contains k items, it is called a k-itemset. For example, {beer, diapers, milk, peanuts} is a 4-itemset. The empty set is an itemset that does not contain any items.

Association rule: an implication expression of the form X → Y, where X and Y are disjoint itemsets, that is, X ∩ Y = ∅. The strength of an association rule can be measured by its support and confidence.

Support: how often an itemset or rule appears across all transactions, which determines how often a rule is applicable to a given dataset. σ(X) denotes the support count of itemset X, that is, the number of transactions that contain X, and N denotes the total number of transactions.

Support of itemset X: s(X) = σ(X) / N; support of rule X → Y: s(X → Y) = σ(X ∪ Y) / N

Confidence: how frequently Y occurs in transactions that contain X. c(X → Y) = σ(X ∪ Y) / σ(X)

Support is an important measure because a rule with low support may occur only by chance and is usually of little practical interest. Therefore, support is typically used to filter out uninteresting rules.

Confidence measures how reliable the inference made by a rule is. For a given rule X → Y, the higher the confidence, the more likely Y is to appear in transactions that contain X. In other words, confidence estimates the conditional probability P(Y | X) of Y given X.
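The definitions above can be checked by hand on a small example. The following is a minimal sketch in base R; the five toy transactions and the helper name `support_count` are illustrative assumptions and do not come from the Groceries data.

```r
# Toy transactions (hypothetical data, for illustration only)
transactions <- list(
  c("bread", "milk"),
  c("bread", "diapers", "beer", "eggs"),
  c("milk", "diapers", "beer", "cola"),
  c("bread", "milk", "diapers", "beer"),
  c("bread", "milk", "diapers", "cola")
)

# sigma(X): number of transactions that contain every item of X
support_count <- function(itemset, txns) {
  sum(vapply(txns, function(txn) all(itemset %in% txn), logical(1)))
}

n <- length(transactions)   # N, the total number of transactions
x <- c("milk", "diapers")   # antecedent X
y <- "beer"                 # consequent Y

# s(X -> Y) = sigma(X u Y) / N
s_rule <- support_count(c(x, y), transactions) / n
# c(X -> Y) = sigma(X u Y) / sigma(X)
c_rule <- support_count(c(x, y), transactions) / support_count(x, transactions)

print(c(support = s_rule, confidence = c_rule))
```

Here two of the five transactions contain {milk, diapers, beer} and three contain {milk, diapers}, so the rule {milk, diapers} → {beer} has support 2/5 and confidence 2/3.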

II. Application of the Apriori algorithm in R

The implementation of the Apriori algorithm in R is included in the arules package. This article does not cover the implementation of the algorithm; it only uses the arules package to mine association rules.

1. Data source: we use the Groceries dataset included in the arules package, which contains one month of shopping data from a real-world supermarket, 9835 transactions in total. Assuming the supermarket is open 12 hours a day, this works out to about 9835 / 30 / 12 ≈ 27 transactions per hour, which suggests a medium-sized supermarket.

> library(arules)  # load the arules package
> data(Groceries)
> Groceries
transactions in sparse format with
 9835 transactions (rows) and
 169 items (columns)

2. Explore and prepare data:

(1) Each row of the transactional data is a single transaction, and each record is a comma-separated list of any number of items. The inspect() function shows the supermarket's transaction records and the item names in each transaction; the summary() function shows basic information about the dataset.

> inspect(Groceries[1:5])  # view the first five transactions of the Groceries dataset
  items
1 {citrus fruit,semi-finished bread,margarine,ready soups}
2 {tropical fruit,yogurt,coffee}
3 {whole milk}
4 {pip fruit,yogurt,cream cheese,meat spreads}
5 {other vegetables,whole milk,condensed milk,long life bakery product}

> summary(Groceries)
transactions as itemMatrix in sparse format with
 9835 rows (elements/itemsets/transactions) and
 169 columns (items) and a density of 0.02609146

most frequent items:
      whole milk other vegetables       rolls/buns             soda           yogurt          (Other)
            2513             1903             1809             1715             1372            34055

element (itemset/transaction) length distribution:
sizes
   1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18   19   20   21   22   23   24   26   27   28   29   32
2159 1643 1299 1005  855  645  545  438  350  246  182  117   78   77   55   46   29   14   14    9   11    4    6    1    1    1    1    3    1

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  1.000   2.000   3.000   4.409   6.000  32.000

includes extended item information - examples:
       labels  level2           level1
1 frankfurter sausage meat and sausage
2     sausage sausage meat and sausage
3  liver loaf sausage meat and sausage

> itemFrequency(Groceries[, 1:3])  # itemFrequency() shows the proportion of transactions that contain each item
frankfurter     sausage  liver loaf
0.058973055 0.093950178 0.005083884

Analysis:

① The density value of 0.02609146 (2.6%) is the proportion of non-zero cells in the matrix.

The dataset has 9835 rows (transactions) and 169 columns (distinct items), so the matrix has 9835 × 169 = 1662115 cells. We can conclude that a total of 1662115 × 0.02609146 ≈ 43367 items were purchased in 30 days. It follows that each transaction contains on average 43367 / 9835 = 4.409 items. The Mean value in the summary output (Mean = 4.409) confirms that this calculation is correct.

② most frequent items: lists the most frequently purchased items in the transactional data. Whole milk was purchased in 2513 of the 9835 transactions, so whole milk appears in 25.6% of all transactions.

③ element (itemset/transaction) length distribution: presents a set of statistics on the size of transactions. A total of 2159 transactions contain only one item, and one transaction contains 32 items. As the quantiles show, 25% of transactions contain two or fewer items, and about half of the transactions contain three or fewer items.
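The arithmetic in the analysis above can be reproduced with a few lines of base R. This is only a sketch; the variable names are illustrative, and the numbers are copied from the summary() output.

```r
# Figures taken from the summary(Groceries) output
n_transactions <- 9835
n_items <- 169
density <- 0.02609146

cells <- n_transactions * n_items           # total cells in the sparse matrix
purchased <- round(cells * density)         # non-zero cells, i.e. items purchased
mean_basket <- purchased / n_transactions   # average number of items per transaction
milk_share <- 2513 / n_transactions         # share of transactions with whole milk

print(cells)                   # 1662115
print(purchased)               # 43367
print(round(mean_basket, 3))   # 4.409
print(round(milk_share, 3))    # 0.256
```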

(2) Visualizing item support: the item frequency plot

To present these statistics visually, you can use the itemFrequencyPlot() function to generate a bar chart showing the proportion of transactions that contain each item. Because the dataset contains so many items, they cannot all be displayed at once, so you can use the support or topN parameter to limit the plot to a subset of items.
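The two ways of restricting the plot might look like the following sketch. It assumes the arules package is installed; the threshold 0.1, the value topN = 20, and the variable names `item_support` and `frequent_items` are illustrative choices.

```r
library(arules)
data(Groceries)

# Plot only items that appear in at least 10% of transactions
itemFrequencyPlot(Groceries, support = 0.1)

# Plot the 20 most frequent items, in decreasing order of support
itemFrequencyPlot(Groceries, topN = 20)

# The same information as numbers: items whose support is at least 0.1
item_support <- itemFrequency(Groceries)
frequent_items <- names(item_support[item_support >= 0.1])
print(frequent_items)
```

The support parameter filters by a minimum proportion of transactions, while topN fixes the number of bars regardless of their support.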

(3) Visualizing transaction data: plotting the sparse matrix

The entire sparse matrix can be visualized by using the image () function.

> image(Groceries[1:5])  # draws a matrix with 5 rows and 169 columns; a black cell means the item (column) was purchased in that transaction (row)

In the resulting image, the first row (transaction) contains four items (black squares). This visualization is a useful tool for data exploration. First, it may help to identify potential data problems: because each column represents an item, a column that is filled from top to bottom indicates an item that was purchased in every transaction, which may point to a problem in the data. Second, patterns in the image may help to reveal interesting groups of transactions or items, especially when the data is sorted in a meaningful way. For example, if transactions are sorted by date, the pattern of black squares may reveal seasonal effects on the quantity or type of items people buy. This visualization is not useful for very large transaction datasets, because it is difficult to find interesting patterns when the cells are too small.

3. Train the model

> grocery_rules <- apriori(Groceries)
> grocery_rules
set of 0 rules

With the default setting support = 0.1, an item must appear in at least 0.1 * 9835 = 983.5 transactions to be considered. In the previous analysis, we found that only 8 items have support >= 0.1, so it is not surprising that the default settings do not produce any rules.

One way to approach the support setting is to think about the minimum number of transactions required before a pattern would be considered interesting. For example, if an item is purchased twice a day, that is, in about 60 transactions a month, it may be of interest to us. Based on this we can calculate the required support: support = 60 / 9835 ≈ 0.006.
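The reasoning about the support threshold can be written out as a short calculation. This is a sketch in base R; the variable names are illustrative.

```r
n_transactions <- 9835   # transactions in the Groceries dataset
days <- 30               # the data covers one month
min_per_day <- 2         # "interesting" if purchased at least twice a day

min_count <- min_per_day * days             # minimum number of transactions
min_support <- min_count / n_transactions   # corresponding support threshold

# For comparison: transactions required by a support of 0.1
default_min_count <- ceiling(0.1 * n_transactions)

print(min_count)               # 60
print(round(min_support, 3))   # 0.006
print(default_min_count)       # 984
```

Working from a minimum transaction count makes the threshold easier to justify than picking a support value directly.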

About confidence: if it is set too low, we may be overwhelmed by a large number of unreliable rules; if it is set too high, we will be limited to obvious rules, which prevents us from finding interesting patterns. The appropriate confidence level depends on the goal of the analysis. We can start with a conservative value and, if no useful rules are found, reduce the confidence to broaden the search.

In this case, we start with a confidence of 0.25, which means that a rule must be correct at least 25% of the time to be included in the results. This excludes the most unreliable rules.

minlen = 2 means that a rule must contain at least two items, which prevents useless rules that are created simply because an item is purchased frequently. For example, in the analysis above we found that the support of whole milk is 25.6%, so the following rule is likely to appear: {} => {whole milk}. Such a rule is meaningless.
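The effect of minlen can be seen by mining once with minlen = 1 and looking for rules with an empty left-hand side. This is a sketch that assumes the arules package is installed; the variable names `rules_len1` and `empty_lhs` are illustrative, and control = list(verbose = FALSE) is only used to suppress progress output.

```r
library(arules)
data(Groceries)

# Same support and confidence as in the text, but allowing rules of length 1
rules_len1 <- apriori(
  Groceries,
  parameter = list(support = 0.006, confidence = 0.25, minlen = 1),
  control = list(verbose = FALSE)
)

# Keep only rules whose left-hand side is empty, i.e. {} => {item}
empty_lhs <- rules_len1[size(lhs(rules_len1)) == 0]
inspect(empty_lhs)
```

A rule with an empty left-hand side has a confidence equal to the support of its right-hand side, so it only says that the item is frequent. Setting minlen = 2 removes such rules.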

Finally, based on the above analysis, we determine the following parameter settings:

> grocery_rules <- apriori(Groceries, parameter = list(support = 0.006, confidence = 0.25, minlen = 2))
> grocery_rules
set of 463 rules

4. Evaluate the performance of the model

> summary(grocery_rules)
set of 463 rules

rule length distribution (lhs + rhs):sizes
  2   3   4
150 297  16

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  2.000   2.000   3.000   2.711   3.000   4.000

summary of quality measures:
    support           confidence          lift
 Min.   :0.006101   Min.   :0.2500   Min.   :0.9932
 1st Qu.:0.007117   1st Qu.:0.2971   1st Qu.:1.6229
 Median :0.008744   Median :0.3554   Median :1.9332
 Mean   :0.011539   Mean   :0.3786   Mean   :2.0351
 3rd Qu.:0.012303   3rd Qu.:0.4495   3rd Qu.:2.3565
 Max.   :0.074835   Max.   :0.6600   Max.   :3.9565

mining info:
      data ntransactions support confidence
 Groceries          9835   0.006       0.25

The rule length is the number of items in the antecedent (lhs) plus the consequent (rhs): 150 rules contain only 2 items, 297 rules contain 3 items, and 16 rules contain 4 items.

> inspect(grocery_rules[1:5])
  lhs             rhs                support     confidence lift
1 {pot plants} => {whole milk}       0.006914082 0.4000000  1.565460
2 {pasta}      => {whole milk}       0.006100661 0.4054054  1.586614
3 {herbs}      => {root vegetables}  0.007015760 0.4312500  3.956477
4 {herbs}      => {other vegetables} 0.007727504 0.4750000  2.454874
5 {herbs}      => {whole milk}       0.007727504 0.4750000  1.858983

Here we need to explain lift. Lift measures how much more likely an item is to be purchased, relative to its general purchase rate, given that another item has been purchased. For example, the first rule {pot plants} => {whole milk} has lift = 1.565, which means that customers who buy pot plants are 1.565 times as likely to buy whole milk as customers in general.

Interpretation of the first rule: if a customer buys pot plants, he will also buy whole milk, with a support of 0.0069 and a confidence of 0.4000. The rule covers about 0.7% of transactions, and a customer who buys pot plants has a 40% probability of also buying whole milk. We know from the analysis above that 25.6% of customers purchased whole milk, so the lift is 0.40 / 0.256 ≈ 1.56, which is consistent with the results shown. Note: the column labeled support gives the support of the rule, not the support of the lhs or rhs.

Lift is defined as lift(X → Y) = P(Y | X) / P(Y), and lift(X → Y) is the same as lift(Y → X).

If the lift value is greater than 1, the two items are purchased together more often than would be expected by chance. A large lift value is a strong indicator that a rule is important and reflects a real relationship between items.
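The lift of the rule {pot plants} => {whole milk} and its symmetry can be checked from transaction counts. This is a sketch in base R. The count 2513 for whole milk comes from the summary output; the counts 68 and 170 are not printed in the article and are derived here from the reported support (0.006914082 × 9835) and confidence (68 / 0.4), so treat them as assumptions.

```r
n <- 9835           # total number of transactions
count_xy <- 68      # transactions with pot plants and whole milk (derived)
count_x <- 170      # transactions with pot plants (derived)
count_y <- 2513     # transactions with whole milk (from summary output)

# lift(X -> Y) = P(Y | X) / P(Y)
lift_xy <- (count_xy / count_x) / (count_y / n)
# lift(Y -> X) = P(X | Y) / P(X)
lift_yx <- (count_xy / count_y) / (count_x / n)

print(round(lift_xy, 3))   # 1.565
print(round(lift_yx, 3))   # 1.565
```

Both directions reduce to P(X and Y) / (P(X) P(Y)), which is why the two values agree.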

5. Improve the performance of the model

(1) sort the set of association rules

Depending on the goals of the market basket analysis, the most useful rules are probably those with high support, confidence, and lift. The arules package contains a sort() function that reorders the list of rules when the by parameter is set to "support", "confidence", or "lift". By default the sort is in descending order; specify decreasing = FALSE to reverse it.

> inspect(sort(grocery_rules, by = "lift")[1:10])
    lhs                                             rhs                  support     confidence lift
3   {herbs}                                      => {root vegetables}    0.007015760 0.4312500  3.956477
57  {berries}                                    => {whipped/sour cream} 0.009049314 0.2721713  3.796886
450 {tropical fruit,other vegetables,whole milk} => {root vegetables}    0.007015760 0.4107143  3.768074
174 {beef,other vegetables}                      => {root vegetables}    0.007930859 0.4020619  3.688692
285 {tropical fruit,other vegetables}            => {pip fruit}          0.009456024 0.2634561  3.482649
176 {beef,whole milk}                            => {root vegetables}    0.008032537 0.3779904  3.467851
284 {pip fruit,other vegetables}                 => {tropical fruit}     0.009456024 0.3618677  3.448613
282 {pip fruit,yogurt}                           => {tropical fruit}     0.006405694 0.3559322  3.392048
319 {citrus fruit,other vegetables}              => {root vegetables}    0.010371124 0.3591549  3.295045
455 {other vegetables,whole milk,yogurt}         => {tropical fruit}     0.007625826 0.3424658  3.263712

(2) extract a subset of association rules: we can extract the rules we are interested in through the subset () function.
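A minimal sketch of how subset() might be used on the mined rules is given below. It assumes the arules package is installed. The filter conditions and the variable names `pip_rules` and `strong_pip_rules` are assumptions chosen for illustration and are not necessarily the conditions used to produce any particular output in this article.

```r
library(arules)
data(Groceries)

grocery_rules <- apriori(
  Groceries,
  parameter = list(support = 0.006, confidence = 0.25, minlen = 2),
  control = list(verbose = FALSE)
)

# Rules that contain "pip fruit" anywhere, in the lhs or the rhs
pip_rules <- subset(grocery_rules, items %in% "pip fruit")

# Conditions can be combined: "pip fruit" in the lhs and lift above 2
strong_pip_rules <- subset(grocery_rules, lhs %in% "pip fruit" & lift > 2)

# Show the three strongest of these rules by lift
inspect(head(sort(strong_pip_rules, by = "lift"), 3))
```

The keyword items matches an item anywhere in the rule, while lhs and rhs restrict the match to one side. Quality measures such as support, confidence, and lift can be used directly in the condition.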

> fruit_rules
set of 21 rules
> inspect(fruit_rules[1:5])
    lhs                           rhs                support     confidence lift
127 {pip fruit}                => {tropical fruit}   0.020437214 0.2701613  2.574648
128 {pip fruit}                => {other vegetables} 0.026131164 0.3454301  1.785237
129 {pip fruit}                => {whole milk}       0.030096594 0.3978495  1.557043
281 {tropical fruit,pip fruit} => {yogurt}           0.006405694 0.3134328  2.246802
282 {pip fruit,yogurt}         => {tropical fruit}   0.006405694 0.3559322  3.392048

That is all on how to use the Apriori algorithm in the R language. I hope the content above is of some help to you.
