2025-03-13 Update, from SLTechnology News&Howtos
Shulou (Shulou.com) 06/01 Report
In this issue, we look at the principles of the Apriori algorithm. The article analyzes the algorithm from a professional point of view; we hope you get something out of reading it.
Association rule mining is an important class of algorithms in data mining. In 1993, R. Agrawal et al. posed the problem of mining association rules between itemsets in customer transaction data; the core of their solution is a recursive algorithm based on the idea of two-stage frequent itemset generation. In the usual taxonomy, these rules are single-dimensional, single-level, Boolean association rules, and the classic algorithm for mining them is the Apriori algorithm.
The Apriori algorithm divides the discovery of association rules into two steps. The first step iteratively retrieves all frequent itemsets in the transaction database, that is, itemsets whose support is not lower than a user-defined threshold. The second step constructs, from those frequent itemsets, the rules that satisfy the user's minimum confidence. Of these, mining all frequent itemsets is the core of the algorithm and accounts for most of the computation.
Apriori is a commonly used algorithm for mining association rules in data. It finds itemsets that occur frequently in a dataset, and identifying the patterns in these sets helps us make decisions. For example, in supermarket shopping data or e-commerce purchase data, once we find the frequent itemsets, a supermarket can optimize product placement and an e-commerce site can optimize warehouse locations, saving costs and increasing revenue. Let's summarize the Apriori algorithm.
1. Evaluation criteria of frequent itemsets
What kind of data makes a frequent itemset? You might say this is simple: just scan by eye, and whatever appears together many times is a frequent itemset! True enough, but there are two problems. First, when the amount of data is very large, we cannot find frequent itemsets by eye; this is what gave rise to association rule mining algorithms such as Apriori, PrefixSpan, and CBA. Second, we lack a standard for what counts as frequent. For example, if A and B appear together three times in 10 records, can we say that A and B form a frequent itemset? We therefore need criteria for evaluating frequent itemsets.
The commonly used evaluation criteria for frequent itemsets are support, confidence, and lift.
Support is the proportion of records in the dataset in which several associated items appear together, in other words the probability of the items occurring jointly. If we have two items X and Y whose association we want to analyze, the corresponding support is:

Support(X, Y) = P(XY) = number(XY) / number(All Samples)
By analogy, if we have three items X, Y, and Z whose association we want to analyze, the corresponding support is:

Support(X, Y, Z) = P(XYZ) = number(XYZ) / number(All Samples)
Generally speaking, data with high support does not necessarily form a frequent itemset, but data with very low support certainly does not.
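As a quick illustration of the support formula, here is a minimal Python sketch; the basket contents are made up for the example:

```python
# Toy transaction data; each set is one shopping basket (hypothetical example).
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

print(support({"bread", "milk"}, transactions))            # in 2 of 4 baskets -> 0.5
print(support({"milk", "diapers", "beer"}, transactions))  # in 1 of 4 baskets -> 0.25
```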
Confidence reflects the probability that one item appears given that another has appeared, i.e., a conditional probability. If we have two items X and Y whose association we want to analyze, the confidence of X given Y is:

Confidence(X ⇐ Y) = P(X|Y) = P(XY) / P(Y)
This extends to the confidence among more than two items. For example, for three items X, Y, and Z, the confidence of X given Y and Z is:

Confidence(X ⇐ YZ) = P(X|YZ) = P(XYZ) / P(YZ)
For example, suppose that in shopping data the confidence of tissues given chicken feet is 40% and the support is 1%. This means that 1% of users buy both chicken feet and tissues, while 40% of the users who buy chicken feet also buy tissues.
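These two numbers can be reproduced from raw counts. The basket totals below are assumptions chosen only so that they match the stated 1% support and 40% confidence:

```python
# Hypothetical counts consistent with the shopping example in the text:
# out of 1000 baskets, 25 contain chicken feet, and 10 contain both
# chicken feet and tissues.
n_total = 1000
n_chicken_feet = 25
n_both = 10

# Support(chicken feet, tissues) = number(both) / number(all samples)
support_both = n_both / n_total                        # -> 0.01, i.e. 1%

# Confidence(tissues <= chicken feet) = P(both) / P(chicken feet)
conf_tissues_given_feet = n_both / n_chicken_feet      # -> 0.4, i.e. 40%

print(support_both, conf_tissues_given_feet)
```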
Lift is the ratio of the probability of X given Y to the overall probability of X, that is:

Lift(X ⇐ Y) = P(X|Y) / P(X)
Lift reflects the strength of the association between X and Y: a lift greater than 1 indicates that X ⇐ Y is a useful strong association rule, while a lift of 1 or less indicates that it is not. In the special case where X and Y are independent, the lift equals 1, because in that case P(X|Y) = P(X).
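A tiny sketch of the lift formula, with made-up probabilities, showing that independence gives a lift of exactly 1:

```python
def lift(p_x_given_y, p_x):
    """Lift(X <= Y) = P(X|Y) / P(X); equals 1 when X and Y are independent."""
    return p_x_given_y / p_x

# Independent case: P(X|Y) == P(X), so the lift is exactly 1.
print(lift(0.3, 0.3))  # -> 1.0

# Positive association: X is twice as likely when Y is present.
print(lift(0.6, 0.3))  # -> 2.0
```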
In general, to select frequent itemsets from a dataset you need to set the evaluation criteria yourself. The most common choice is a user-defined support threshold, or a combination of support and confidence thresholds.
2. Apriori algorithm thought
The Apriori algorithm uses support as the criterion for judging frequent itemsets. Its goal is to find the maximal frequent k-itemsets. This carries two meanings. First, we must find the itemsets that meet the support threshold, and there may be many of them. Second, among those, we want the frequent itemsets of maximum size. For example, if we find frequent itemsets AB and ABE that both meet the support threshold, we discard AB and keep only ABE, because AB is a frequent 2-itemset while ABE is a frequent 3-itemset. So how, concretely, does Apriori mine the maximal frequent k-itemsets?
Apriori searches iteratively. It first forms candidate itemsets and computes their supports, prunes away the itemsets below the support threshold, and obtains the frequent itemsets. It then joins the remaining frequent itemsets to produce new candidate itemsets, again filters out those below the threshold to get the true frequent itemsets, and so on, iterating until no frequent (k+1)-itemset can be found; the set of frequent k-itemsets from the last successful round is the output of the algorithm.
As you can see, the algorithm is quite simple. Each iteration consists of three steps: scanning the data to compute the supports of the candidate frequent itemsets, pruning to obtain the true frequent itemsets, and joining them to generate the next round of candidate frequent itemsets.
Let's take a look at this simple example:
Our dataset D has four records: 134, 235, 1235, and 25. We now use the Apriori algorithm to find the frequent k-itemsets with the minimum support set to 50%. First we generate the candidate frequent 1-itemsets, which contain all five items that appear in the data, and compute the support of each item. After pruning, item 4 is cut because its support is only 25%. Our frequent 1-itemsets are therefore 1, 2, 3, and 5. We then join them to generate the candidate frequent 2-itemsets, six in all: 12, 13, 15, 23, 25, and 35. This ends the first iteration.
Entering the second iteration, we scan the dataset to compute the supports of the candidate frequent 2-itemsets and then prune: 12 and 15 are screened out because their support is only 25%, leaving the true frequent 2-itemsets 13, 23, 25, and 35. Joining these produces the candidate frequent 3-itemsets 123, 125, 135, and 235. Computing their supports, we find that 123, 125, and 135 each have support of only 25%, so they are pruned, and the true frequent 3-itemset is the single set 235. Since we can no longer join to produce candidate frequent 4-itemsets, the iteration ends, and the final result is the frequent 3-itemset 235.
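The pruning in the second iteration can be checked with a few lines of Python over the walkthrough's dataset (records 134, 235, 1235, and 25, written as sets):

```python
from itertools import combinations

# The example dataset D from the walkthrough, one set per record.
D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]

def support(itemset):
    """Fraction of records in D that contain every item in `itemset`."""
    return sum(1 for t in D if set(itemset) <= t) / len(D)

# Supports of the six candidate 2-itemsets built from the frequent items {1,2,3,5}.
for c in combinations([1, 2, 3, 5], 2):
    print(c, support(c))
# (1,2) and (1,5) come out at 0.25 and are pruned;
# 13, 23, 25, and 35 survive with support >= 0.5.
```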
3. Apriori algorithm flow
Let's summarize the flow of the Apriori algorithm.
Input: dataset D, support threshold α
Output: the maximal frequent k-itemsets
1) Scan the whole dataset to obtain all items that appear; these form the candidate frequent 1-itemsets. Set k = 1; the frequent 0-itemset is the empty set.
2) Mine the frequent k-itemsets:
a) Scan the data and compute the support of each candidate frequent k-itemset.
b) Remove the candidates whose support is below the threshold, obtaining the frequent k-itemsets. If the frequent k-itemsets are empty, return the set of frequent (k-1)-itemsets as the result and stop. If the frequent k-itemsets contain only one itemset, return that set of frequent k-itemsets as the result and stop.
c) From the frequent k-itemsets, generate the candidate frequent (k+1)-itemsets by joining.
3) Set k = k + 1 and go back to step 2.
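The steps above can be sketched in Python as follows. This is a minimal illustration of the scan/prune/join loop, not an optimized implementation:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return the maximal frequent k-itemsets: iterate scan -> prune -> join
    until no frequent k-itemset remains, then return the last surviving level."""
    n = len(transactions)
    # Step 1: the candidate frequent 1-itemsets are all items that appear.
    items = sorted({i for t in transactions for i in t})
    candidates = [frozenset([i]) for i in items]
    last_frequent = []
    while candidates:
        # Step 2a: scan the data to count each candidate's support.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        # Step 2b: prune candidates below the support threshold.
        frequent = [c for c, cnt in counts.items() if cnt / n >= min_support]
        if not frequent:
            break  # no frequent k-itemset: keep the (k-1)-itemsets as the result
        last_frequent = frequent
        # Step 2c: join frequent k-itemsets to build candidate (k+1)-itemsets.
        k = len(frequent[0]) + 1
        merged = {a | b for a, b in combinations(frequent, 2) if len(a | b) == k}
        candidates = sorted(merged, key=sorted)
    return [sorted(c) for c in last_frequent]

# The walkthrough dataset: records 134, 235, 1235, 25 with 50% minimum support.
print(apriori([{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}], 0.5))  # -> [[2, 3, 5]]
```

Run on the example from section 2, it reproduces the frequent 3-itemset 235.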
As the steps show, the Apriori algorithm scans the whole dataset in every iteration, so when the dataset is very large and contains many distinct items, the algorithm becomes quite inefficient.
4. Summary of the Apriori algorithm
Apriori is a classic algorithm for mining frequent itemsets, and many later algorithms build on it, including FP-Tree, GSP, and CBA. These algorithms use Apriori's ideas but improve on them, mining data more efficiently, so Apriori itself is now rarely used for mining directly. Still, understanding Apriori is a prerequisite for understanding the algorithms derived from it, and the algorithm itself is not complex, so it is well worth studying.
Unfortunately, scikit-learn has no module for frequent itemset mining, which is a pity; perhaps it will be added in a later version.
The above covers the principles of the Apriori algorithm. If you have similar questions, the analysis above may help you understand; if you want to learn more, you are welcome to follow the industry information channel.