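Today I studied the Apriori algorithm, which is used to generate frequent itemsets, and found it impressive. (Reference: Robert Layton's data mining book.)
I am writing it down here to reinforce what I learned, so only the core code is shown.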
# Use the first 200 users as the training set
ratings = all_ratings[all_ratings['UserID'].isin(range(200))]
# Keep only the favorable ratings
favorable_ratings = ratings[ratings['Favorable']]
# Map each user to the set of movies that user rated favorably
favorable_reviews_by_users = dict((k, frozenset(v.values)) for k, v in favorable_ratings.groupby("UserID")["MovieID"])
# print(favorable_reviews_by_users)
len(favorable_reviews_by_users)  # number of users in the final training set: 199
# Count how many users liked each movie
num_favorable_by_movie = ratings[["MovieID", "Favorable"]].groupby("MovieID").sum()
# print(num_favorable_by_movie)
num_favorable_by_movie.sort_values("Favorable", ascending=False)[:5]

from collections import defaultdict

def find_frequent_itemsets(favorable_reviews_by_users, k_1_itemsets, min_support):
    counts = defaultdict(int)
    for user, reviews in favorable_reviews_by_users.items():
        for itemset in k_1_itemsets.keys():
            if itemset.issubset(reviews):
                for other_reviewed_movie in reviews - itemset:
                    # | is set union. Because frozensets are unordered, the same
                    # superset built from different subsets maps to a single key.
                    # Note: a user is counted once per matching (k-1)-subset, so
                    # this count can exceed the number of distinct users.
                    current_superset = itemset | frozenset((other_reviewed_movie,))
                    counts[current_superset] += 1
    return dict([(itemset, frequency) for itemset, frequency in counts.items() if frequency >= min_support])

import sys
frequent_itemsets = {}
min_support = 50
# print(type(frequent_itemsets))  # dict
# k = 1: single movies with more than min_support favorable reviews
frequent_itemsets[1] = dict((frozenset((movie_id,)), row["Favorable"]) for movie_id, row in num_favorable_by_movie.iterrows() if row["Favorable"] > min_support)
# print(frequent_itemsets)
# exit()
print("There are {} movies with more than {} favorable reviews".format(len(frequent_itemsets[1]), min_support))
sys.stdout.flush()
# Loop over increasing itemset lengths until no frequent itemsets remain
for k in range(2, 20):
    # k is the number of movies in each itemset; the output makes this clear
    cur_frequent_itemsets = find_frequent_itemsets(favorable_reviews_by_users, frequent_itemsets[k-1], min_support)
    if len(cur_frequent_itemsets) == 0:
        print("Did not find any frequent itemsets of length {}".format(k))
        sys.stdout.flush()
        break
    else:
        print("I found {} frequent itemsets of length {}".format(len(cur_frequent_itemsets), k))
        # print(cur_frequent_itemsets)
        sys.stdout.flush()
        frequent_itemsets[k] = cur_frequent_itemsets
# Drop the itemsets of length 1; a single movie cannot be split into a rule
del frequent_itemsets[1]
print(frequent_itemsets[9])
There are 20 movies with more than 50 favorable reviews
I found 145 frequent itemsets of length 2
I found 508 frequent itemsets of length 3
I found 989 frequent itemsets of length 4
I found 1146 frequent itemsets of length 5
I found 811 frequent itemsets of length 6
I found 338 frequent itemsets of length 7
I found 73 frequent itemsets of length 8
I found 5 frequent itemsets of length 9
Did not find any frequent itemsets of length 10
{frozenset({260, 356, 2571, 2028, 1196, 1198, 593, 1210, 858}): 72, frozenset({260, 356, 296, 2571, 1198, 47, 527, 593, 318}): 72, frozenset({260, 296, 2571, 2028, 1196, 1198, 593, 1210, 858}): 72, frozenset({260, 356, 296, 2571, 1198, 527, 593, 858, 318}): 81, frozenset({260, 356, 296, 2571, 1196, 593, 1210, 858, 318}): 72}
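One detail in this output is worth checking: the 9-item counts are 72 and 81, that is 9 × 8 and 9 × 9. That is consistent with find_frequent_itemsets incrementing a superset once for every matching (k-1)-subset of a user's reviews, rather than once per user. Below is a minimal sketch, not the book's code, of a variant that counts each user at most once per candidate superset. The function name `find_frequent_itemsets_once` and the toy data are assumptions made purely for illustration.

```python
from collections import defaultdict


def find_frequent_itemsets_once(favorable_reviews_by_users, k_1_itemsets, min_support):
    """Hypothetical variant: each user adds at most 1 to any candidate superset."""
    counts = defaultdict(int)
    for user, reviews in favorable_reviews_by_users.items():
        # Collect the candidate supersets this user supports, without duplicates
        supported = set()
        for itemset in k_1_itemsets:
            if itemset.issubset(reviews):
                for other_movie in reviews - itemset:
                    supported.add(itemset | frozenset((other_movie,)))
        for superset in supported:
            counts[superset] += 1
    return {itemset: freq for itemset, freq in counts.items() if freq >= min_support}


# Toy data (made up): three users and the movies each one liked
toy_reviews = {
    1: frozenset({10, 20, 30}),
    2: frozenset({10, 20}),
    3: frozenset({10, 30}),
}
toy_1_itemsets = {frozenset({10}): 3, frozenset({20}): 2, frozenset({30}): 2}

toy_2_itemsets = find_frequent_itemsets_once(toy_reviews, toy_1_itemsets, min_support=2)
print(toy_2_itemsets)
```

On this toy data the pair {10, 20} gets a count of 2 (users 1 and 2), whereas counting once per matching subset would give 4. Whether to switch to per-user counting depends on how min_support is meant to be read; with per-user counts, a threshold of 50 would mean 50 distinct users.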
--------------------------
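The code above produced the frequent itemsets; the next step is to generate rules from them. A rule consists of a premise and a conclusion.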
# Build a list of candidate rules
candidate_rules = []
for itemset_length, itemset_counts in frequent_itemsets.items():
    for itemset in itemset_counts.keys():
        for conclusion in itemset:
            # Set difference: the premise is the itemset with the conclusion removed
            premise = itemset - set((conclusion,))
            candidate_rules.append((premise, conclusion))
print("There are {} candidate rules".format(len(candidate_rules)))
The candidate rules are now created. Note that each conclusion contains only a single recommended movie.
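To make the premise/conclusion split concrete, here is a small sketch on made-up data: an itemset of k movies yields k candidate rules, one for each choice of conclusion. The `toy_frequent_itemsets` dictionary and its movie IDs are hypothetical and are not taken from the ratings data.

```python
# Made-up frequent itemsets, keyed by itemset length like frequent_itemsets
toy_frequent_itemsets = {
    2: {frozenset({10, 20}): 2},
    3: {frozenset({10, 20, 30}): 2},
}

toy_candidate_rules = []
for itemset_length, itemset_counts in toy_frequent_itemsets.items():
    for itemset in itemset_counts:
        for conclusion in itemset:
            # The premise is the itemset with the conclusion removed
            premise = itemset - {conclusion}
            toy_candidate_rules.append((premise, conclusion))

# A 2-item set gives 2 rules and a 3-item set gives 3, so 5 in total
print("There are {} candidate rules".format(len(toy_candidate_rules)))
for premise, conclusion in toy_candidate_rules:
    print("{} -> {}".format(sorted(premise), conclusion))
```

Each printed line reads as "a user who liked the movies on the left may also like the movie on the right".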
Original post: https://www.cnblogs.com/a-runner/p/12283952.html