This article explains how to perform association analysis with the Apriori algorithm. The content is fairly detailed; interested readers can use it for reference, and I hope you find it helpful.
Finding implicit relationships between items in large data sets is called association analysis or association rule learning.
1. Apriori algorithm
(1) Association analysis
Association analysis is the task of finding interesting relationships in large-scale data sets. These relationships can take two forms: frequent itemsets or association rules. A frequent itemset is a collection of items that often appear together, and an association rule suggests that there may be a strong relationship between two items.
A sample transaction dataset:
Transaction 0: soy milk, lettuce
Transaction 1: lettuce, diaper, wine, beet
Transaction 2: soy milk, diaper, wine, orange juice
Transaction 3: lettuce, soy milk, diaper, wine
Transaction 4: lettuce, soy milk, diaper, orange juice
The support of an itemset is defined as the fraction of records in the dataset that contain the itemset. For example, {soy milk} appears in 4 of the 5 transactions above, so its support is 4/5. Three of the five transactions contain {soy milk, diaper}, so the support of {soy milk, diaper} is 3/5. Support is defined on itemsets, so you can set a minimum support and retain only the itemsets that meet it.
Confidence (also translated as credibility) is defined for an association rule such as {diaper} -> {wine}. The confidence of this rule is defined as support({diaper, wine}) / support({diaper}). From the table above, the support of {diaper, wine} is 3/5 and the support of {diaper} is 4/5, so the confidence of {diaper} -> {wine} is (3/5) / (4/5) = 3/4 = 0.75. This means the rule holds for 75% of the records that contain "diaper".
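As a quick check of these definitions, here is a minimal sketch (the transactions list and the support helper are illustrative, not part of the algorithm's code) that computes these numbers directly:

transactions = [
    {'soy milk', 'lettuce'},
    {'lettuce', 'diaper', 'wine', 'beet'},
    {'soy milk', 'diaper', 'wine', 'orange juice'},
    {'lettuce', 'soy milk', 'diaper', 'wine'},
    {'lettuce', 'soy milk', 'diaper', 'orange juice'},
]

def support(itemset):
    # Fraction of transactions that contain every item in itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

print(support({'soy milk', 'diaper'}))                    # 0.6, i.e. 3/5
print(support({'diaper', 'wine'}) / support({'diaper'}))  # 0.75, confidence of {diaper} -> {wine}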
Infrequent itemsets obey a useful property: if an itemset is infrequent, then every superset containing it is also infrequent. For example, if {beet} fails the minimum support, there is no need to count {beet, diaper} or any other set containing beet.
The Apriori algorithm is a method for finding frequent itemsets. Its two input parameters are the minimum support and the data set. The algorithm first generates a list of candidate itemsets containing all individual items. It then scans the transactions to see which itemsets meet the minimum support; those that do not are removed. The remaining sets are combined to generate candidate itemsets containing two elements. The transactions are scanned again to remove itemsets that do not meet the minimum support, and the process repeats until no candidate itemsets remain.
The pseudo code is as follows:
For each transaction tran in the dataset:
    For each candidate itemset can:
        Check whether can is a subset of tran:
            If so, increment the count of can
For each candidate itemset:
    If its support is not less than the minimum support, retain the itemset
Return the list of all frequent itemsets
(1) Construct the initial candidate itemset list
def createC1(dataSet):
    # Build the initial candidate list C1: one candidate per distinct item
    C1 = []
    for transaction in dataSet:
        for item in transaction:
            if [item] not in C1:
                C1.append([item])
    C1.sort()
    # frozenset is hashable, so candidates can later be used as dict keys
    return list(map(frozenset, C1))
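As a quick illustration, assume a small numeric dataset (this sample dataSet is illustrative):

dataSet = [[1, 3, 4], [2, 3, 5], [1, 2, 3, 5], [2, 5]]
C1 = createC1(dataSet)
print(C1)
# [frozenset({1}), frozenset({2}), frozenset({3}), frozenset({4}), frozenset({5})]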
(2) Build the frequent itemsets whose support meets the minimum
def scanD(D, Ck, minSupport):
    # Count how many transactions in D contain each candidate in Ck
    ssCnt = {}
    for tid in D:
        for can in Ck:
            if can.issubset(tid):
                if can not in ssCnt:
                    ssCnt[can] = 1
                else:
                    ssCnt[can] += 1
    numItems = float(len(D))
    retList = []
    supportData = {}
    for key in ssCnt:
        support = ssCnt[key] / numItems
        if support >= minSupport:
            retList.insert(0, key)
        supportData[key] = support
    # Return the itemsets that meet the minimum support, plus the computed
    # support of every counted candidate
    return retList, supportData
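Continuing the sketch with the illustrative dataSet above, filtering C1 at a minimum support of 0.5 might look like this:

D = list(map(set, dataSet))  # scanD needs sets so that issubset() works
L1, suppData = scanD(D, C1, 0.5)
print(L1)  # item 4 appears in only 1 of 4 transactions, so it is dropped:
           # e.g. [frozenset({5}), frozenset({2}), frozenset({3}), frozenset({1})]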
The overall procedure, in pseudocode:
While the number of itemsets in the current frequent list is greater than 0:
    Build a list of candidate itemsets of k items
    Scan the data to check which itemsets are frequent
    Keep the frequent itemsets and build a list of candidate itemsets of k+1 items
# Build the candidate itemsets of size k from the frequent itemsets in Lk
def aprioriGen(Lk, k):  # creates Ck
    retList = []
    lenLk = len(Lk)
    for i in range(lenLk):
        for j in range(i + 1, lenLk):
            # Join two sets only when their first k-2 elements are equal,
            # so each k-item candidate is generated exactly once
            L1 = list(Lk[i])[:k - 2]
            L2 = list(Lk[j])[:k - 2]
            L1.sort(); L2.sort()
            if L1 == L2:
                retList.append(Lk[i] | Lk[j])  # set union
    return retList
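For example, joining the single-item frequent sets from the sketch above (output order may vary):

C2 = aprioriGen(L1, 2)
# With k=2 the first k-2 = 0 elements always match, so every pairwise union is produced:
print(C2)  # e.g. [frozenset({2, 5}), frozenset({3, 5}), frozenset({1, 5}), ...]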
# Overall Apriori function: returns all frequent itemsets that meet the
# minimum support, together with their support values
def apriori(dataSet, minSupport=0.5):
    C1 = createC1(dataSet)
    D = list(map(set, dataSet))
    L1, supportData = scanD(D, C1, minSupport)
    L = [L1]
    k = 2
    while len(L[k - 2]) > 0:
        Ck = aprioriGen(L[k - 2], k)
        Lk, supK = scanD(D, Ck, minSupport)  # scan the dataset to get Lk
        supportData.update(supK)  # merge the new support values into supportData
        L.append(Lk)
        k += 1
    return L, supportData
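Running the whole pipeline on the illustrative dataSet (a sketch; the exact ordering inside each list may vary):

L, supportData = apriori(dataSet, minSupport=0.5)
for k, Lk in enumerate(L, start=1):
    print('frequent %d-item sets:' % k, Lk)
# The final entry is an empty list; that is what terminates the while loop.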
(2) Mining association rules from frequent itemsets
We quantify association rules in a similar way, using confidence. The confidence of a rule P -> H is defined as support(P | H) / support(P). In Python, the | operator performs set union, corresponding to the mathematical symbol U; P | H contains every element that appears in P or in H. The support of every frequent itemset was already computed in the previous section, so obtaining a rule's confidence only requires looking up those support values and performing a division.
This allows pruning: if a rule does not meet the minimum confidence requirement, then no rule formed by moving additional items from its antecedent into its consequent will meet it either.
def generateRules(L, supportData, minConf=0.7):  # supportData is the dict returned by scanD
    bigRuleList = []
    for i in range(1, len(L)):  # only itemsets with two or more items can yield rules
        for freqSet in L[i]:
            H1 = [frozenset([item]) for item in freqSet]  # single-item consequents
            if i > 1:
                rulesFromConseq(freqSet, H1, supportData, bigRuleList, minConf)
            else:
                calcConf(freqSet, H1, supportData, bigRuleList, minConf)
    return bigRuleList
def calcConf(freqSet, H, supportData, brl, minConf=0.7):
    prunedH = []  # consequents whose rules pass the confidence threshold
    for conseq in H:
        # Confidence of the rule (freqSet - conseq) -> conseq
        conf = supportData[freqSet] / supportData[freqSet - conseq]
        if conf >= minConf:
            print(freqSet - conseq, '-->', conseq, 'conf:', conf)
            brl.append((freqSet - conseq, conseq, conf))
            prunedH.append(conseq)
    return prunedH
def rulesFromConseq(freqSet, H, supportData, brl, minConf=0.7):
    m = len(H[0])
    if len(freqSet) > (m + 1):  # the consequent can still grow
        Hmp1 = aprioriGen(H, m + 1)  # candidate consequents of size m+1
        Hmp1 = calcConf(freqSet, Hmp1, supportData, brl, minConf)
        if len(Hmp1) > 1:  # need at least two consequents to merge further
            rulesFromConseq(freqSet, Hmp1, supportData, brl, minConf)
That concludes this walkthrough of association analysis with the Apriori algorithm. I hope the content above is helpful and teaches you something new; if you found the article worthwhile, feel free to share it with others.