This article explains how to perform association analysis with the Apriori algorithm. The content is fairly detailed; interested readers can use it for reference, and I hope you find it helpful.
Finding implicit relationships between items in large data sets is called association analysis or association rule learning.
1. Apriori algorithm
(1) Association analysis
Association analysis is the task of finding interesting relationships in large-scale data sets. These relationships can take two forms: frequent itemsets or association rules. A frequent itemset is a collection of items that often appear together, and an association rule suggests that there may be a strong relationship between two items.
A sample transaction dataset:
Transaction 0: soy milk, lettuce
Transaction 1: lettuce, diaper, wine, beet
Transaction 2: soy milk, diaper, wine, orange juice
Transaction 3: lettuce, soy milk, diaper, wine
Transaction 4: lettuce, soy milk, diaper, orange juice
The support of an itemset is defined as the fraction of records in the dataset that contain the itemset. For example, {soy milk} appears in 4 of the 5 transactions above, so its support is 4/5. Three of the five transactions contain {soy milk, diaper}, so the support of {soy milk, diaper} is 3/5. Support is defined on itemsets, so you can set a minimum support and retain only the itemsets that meet it.
Confidence (also translated as credibility) is defined for an association rule such as {diaper} -> {wine}. The confidence of this rule is defined as support({diaper, wine}) / support({diaper}). From the table above, the support of {diaper, wine} is 3/5 and the support of {diaper} is 4/5, so the confidence of {diaper} -> {wine} is (3/5) / (4/5) = 3/4 = 0.75. This means the rule holds for 75% of the records that contain "diaper".
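As a quick check of these definitions, here is a minimal sketch (the transactions list and the support helper are illustrative, not part of the algorithm's code) that computes these numbers directly:

transactions = [
    {'soy milk', 'lettuce'},
    {'lettuce', 'diaper', 'wine', 'beet'},
    {'soy milk', 'diaper', 'wine', 'orange juice'},
    {'lettuce', 'soy milk', 'diaper', 'wine'},
    {'lettuce', 'soy milk', 'diaper', 'orange juice'},
]

def support(itemset):
    # Fraction of transactions that contain every item in itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

print(support({'soy milk', 'diaper'}))                    # 0.6, i.e. 3/5
print(support({'diaper', 'wine'}) / support({'diaper'}))  # 0.75, confidence of {diaper} -> {wine}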
Infrequent itemsets obey a useful property: if an itemset is infrequent, then every superset containing it is also infrequent. For example, if {beet} fails the minimum support, there is no need to count {beet, diaper} or any other set containing beet.
The Apriori algorithm is a method for finding frequent itemsets. Its two input parameters are the minimum support and the data set. The algorithm first generates a list of candidate itemsets containing all individual items. It then scans the transactions to see which itemsets meet the minimum support; those that do not are removed. The remaining sets are combined to generate candidate itemsets containing two elements. The transactions are scanned again to remove itemsets that do not meet the minimum support, and the process repeats until no candidate itemsets remain.
The pseudo code is as follows:
For each transaction tran in the dataset:
    For each candidate itemset can:
        Check whether can is a subset of tran:
            If so, increment the count of can
For each candidate itemset:
    If its support is not less than the minimum support, retain the itemset
Return the list of all frequent itemsets
(1) Construct the initial candidate itemset list
def createC1(dataSet):
    # Build the initial candidate list C1: one candidate per distinct item
    C1 = []
    for transaction in dataSet:
        for item in transaction:
            if [item] not in C1:
                C1.append([item])
    C1.sort()
    # frozenset is hashable, so candidates can later be used as dict keys
    return list(map(frozenset, C1))
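As a quick illustration, assume a small numeric dataset (this sample dataSet is illustrative):

dataSet = [[1, 3, 4], [2, 3, 5], [1, 2, 3, 5], [2, 5]]
C1 = createC1(dataSet)
print(C1)
# [frozenset({1}), frozenset({2}), frozenset({3}), frozenset({4}), frozenset({5})]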
(2) Build the frequent itemsets whose support meets the minimum
def scanD(D, Ck, minSupport):
    # Count how many transactions in D contain each candidate in Ck
    ssCnt = {}
    for tid in D:
        for can in Ck:
            if can.issubset(tid):
                if can not in ssCnt:
                    ssCnt[can] = 1
                else:
                    ssCnt[can] += 1
    numItems = float(len(D))
    retList = []
    supportData = {}
    for key in ssCnt:
        support = ssCnt[key] / numItems
        if support >= minSupport:
            retList.insert(0, key)
        supportData[key] = support
    # Return the itemsets that meet the minimum support, plus the computed
    # support of every counted candidate
    return retList, supportData
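Continuing the sketch with the illustrative dataSet above, filtering C1 at a minimum support of 0.5 might look like this:

D = list(map(set, dataSet))  # scanD needs sets so that issubset() works
L1, suppData = scanD(D, C1, 0.5)
print(L1)  # item 4 appears in only 1 of 4 transactions, so it is dropped:
           # e.g. [frozenset({5}), frozenset({2}), frozenset({3}), frozenset({1})]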
The overall procedure, in pseudocode:
While the number of itemsets in the current frequent list is greater than 0:
    Build a list of candidate itemsets of k items
    Scan the data to check which itemsets are frequent
    Keep the frequent itemsets and build a list of candidate itemsets of k+1 items
# Build the candidate itemsets of size k from the frequent itemsets in Lk
def aprioriGen(Lk, k):  # creates Ck
    retList = []
    lenLk = len(Lk)
    for i in range(lenLk):
        for j in range(i + 1, lenLk):
            # Join two sets only when their first k-2 elements are equal,
            # so each k-item candidate is generated exactly once
            L1 = list(Lk[i])[:k - 2]
            L2 = list(Lk[j])[:k - 2]
            L1.sort(); L2.sort()
            if L1 == L2:
                retList.append(Lk[i] | Lk[j])  # set union
    return retList
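For example, joining the single-item frequent sets from the sketch above (output order may vary):

C2 = aprioriGen(L1, 2)
# With k=2 the first k-2 = 0 elements always match, so every pairwise union is produced:
print(C2)  # e.g. [frozenset({2, 5}), frozenset({3, 5}), frozenset({1, 5}), ...]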
# Overall Apriori function: returns all frequent itemsets that meet the
# minimum support, together with their support values
def apriori(dataSet, minSupport=0.5):
    C1 = createC1(dataSet)
    D = list(map(set, dataSet))
    L1, supportData = scanD(D, C1, minSupport)
    L = [L1]
    k = 2
    while len(L[k - 2]) > 0:
        Ck = aprioriGen(L[k - 2], k)
        Lk, supK = scanD(D, Ck, minSupport)  # scan the dataset to get Lk
        supportData.update(supK)  # merge the new support values into supportData
        L.append(Lk)
        k += 1
    return L, supportData
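Running the whole pipeline on the illustrative dataSet (a sketch; the exact ordering inside each list may vary):

L, supportData = apriori(dataSet, minSupport=0.5)
for k, Lk in enumerate(L, start=1):
    print('frequent %d-item sets:' % k, Lk)
# The final entry is an empty list; that is what terminates the while loop.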
(2) Mining association rules from frequent itemsets
We quantify association rules in a similar way, using confidence. The confidence of a rule P -> H is defined as support(P | H) / support(P). In Python, the | operator performs set union, corresponding to the mathematical symbol U; P | H contains every element that appears in P or in H. The support of every frequent itemset was already computed in the previous section, so obtaining a rule's confidence only requires looking up those support values and performing a division.
This allows pruning: if a rule does not meet the minimum confidence requirement, then no rule formed by moving additional items from its antecedent into its consequent will meet it either.
def generateRules(L, supportData, minConf=0.7):  # supportData is the dict returned by scanD
    bigRuleList = []
    for i in range(1, len(L)):  # only itemsets with two or more items can yield rules
        for freqSet in L[i]:
            H1 = [frozenset([item]) for item in freqSet]  # single-item consequents
            if i > 1:
                rulesFromConseq(freqSet, H1, supportData, bigRuleList, minConf)
            else:
                calcConf(freqSet, H1, supportData, bigRuleList, minConf)
    return bigRuleList
def calcConf(freqSet, H, supportData, brl, minConf=0.7):
    prunedH = []  # consequents whose rules pass the confidence threshold
    for conseq in H:
        # Confidence of the rule (freqSet - conseq) -> conseq
        conf = supportData[freqSet] / supportData[freqSet - conseq]
        if conf >= minConf:
            print(freqSet - conseq, '-->', conseq, 'conf:', conf)
            brl.append((freqSet - conseq, conseq, conf))
            prunedH.append(conseq)
    return prunedH
def rulesFromConseq(freqSet, H, supportData, brl, minConf=0.7):
    m = len(H[0])
    if len(freqSet) > (m + 1):  # the consequent can still grow
        Hmp1 = aprioriGen(H, m + 1)  # candidate consequents of size m+1
        Hmp1 = calcConf(freqSet, Hmp1, supportData, brl, minConf)
        if len(Hmp1) > 1:  # need at least two consequents to merge further
            rulesFromConseq(freqSet, Hmp1, supportData, brl, minConf)
That concludes this walkthrough of association analysis with the Apriori algorithm. I hope the content above is helpful and teaches you something new; if you found the article worthwhile, feel free to share it with others.