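In NBA media coverage, sports reporters tend to focus on a handful of players. Wearing our data science hats, we can't help wondering what sets those players apart from everyone else, so let's use data science to explore the question. The dataset for this article, nba_2013.csv, contains NBA player performance for the 2013-2014 season.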
import pandas as pd
import numpy as np
nba = pd.read_csv("nba_2013.csv")
nba.head(3)
'''
          player pos  age bref_team_id   g  gs    mp   fg  fga    fg.  \
0     Quincy Acy  SF   23          TOT  63   0   847   66  141  0.468
1   Steven Adams   C   20          OKC  81  20  1197   93  185  0.503
2    Jeff Adrien  PF   27          TOT  53  12   961  143  275  0.520

   ...  drb  trb  ast  stl  blk  tov   pf  pts     season  season_end
0  ...  144  216   28   23   26   30  122  171  2013-2014        2013
1  ...  190  332   43   40   57   71  203  265  2013-2014        2013
2  ...  204  306   38   24   36   39  108  362  2013-2014        2013

[3 rows x 31 columns]
'''
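Point guards are usually the organizers of a team's offense. By controlling the ball they decide when to pass and to whom, and they handle the ball more than anyone else on the court. A point guard has to bring the ball safely from the backcourt to the frontcourt and then get it to teammates so that they have a chance to score. A competent point guard must be able to bring the ball past half court without trouble when guarded one-on-one, and must pass well enough to put the ball where it needs to go most of the time: sometimes to an open shooter, sometimes to a better spot from which to keep the ball moving.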
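# Keep only the point guards; .copy() avoids pandas' SettingWithCopyWarning when columns are added
point_guards = nba[nba['pos'] == 'PG'].copy()
# Points per game = total points / games played
point_guards['ppg'] = point_guards['pts'] / point_guards['g']
# Sanity check, make sure ppg = pts/g
point_guards[['pts', 'g', 'ppg']].head(5)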
'''
    pts   g        ppg
24  930  71  13.098592
29  150  20   7.500000
30  660  79   8.354430
38  666  72   9.250000
50  378  55   6.872727
'''
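The NBA tracks a dedicated statistic called the assist/turnover ratio: a player's total assists divided by their total turnovers. It is a useful indicator of how well a point guard is doing the job.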
import matplotlib.pyplot as plt

# Drop players with zero turnovers so the ratio never divides by zero
point_guards = point_guards[point_guards['tov'] != 0].copy()
point_guards['atr'] = point_guards['ast'] / point_guards['tov']
plt.scatter(point_guards['ppg'], point_guards['atr'], c='y')
plt.title("Point Guards")
plt.xlabel('Points Per Game', fontsize=13)
plt.ylabel('Assist Turnover Ratio', fontsize=13)
plt.show()
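To make the ratio concrete, here is a small sketch that computes it by hand for the first two rows of the nba.head(3) preview (28 assists and 30 turnovers; 43 assists and 71 turnovers). The helper name `assist_turnover_ratio` is hypothetical, and returning `None` for zero turnovers is just one possible way to handle the undefined case; the article's own code drops those players instead.

```python
def assist_turnover_ratio(assists, turnovers):
    """Return assists / turnovers, or None when there are no turnovers."""
    if turnovers == 0:
        # The ratio is undefined here; the article filters such rows out instead.
        return None
    return assists / turnovers


# Values copied from the first two rows of the nba.head(3) preview
first = assist_turnover_ratio(28, 30)
second = assist_turnover_ratio(43, 71)
print(round(first, 3), round(second, 3))

# A hypothetical player with no turnovers
print(assist_turnover_ratio(5, 0))
```

A ratio below 1 means the player turned the ball over more often than they assisted, which is why the statistic is read as "higher is better".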
num_clusters = 5
# Use numpy's random function to pick num_clusters row labels;
# replace=False keeps the same player from being chosen twice
random_initial_points = np.random.choice(point_guards.index, size=num_clusters, replace=False)
# Use the random labels to create the centroids (.loc replaces the removed .ix indexer)
centroids = point_guards.loc[random_initial_points]
plt.scatter(point_guards['ppg'], point_guards['atr'], c='yellow')
plt.scatter(centroids['ppg'], centroids['atr'], c='red')
plt.title("Centroids")
plt.xlabel('Points Per Game', fontsize=13)
plt.ylabel('Assist Turnover Ratio', fontsize=13)
plt.show()
def centroids_to_dict(centroids):
    dictionary = dict()
    # iterating counter we use to generate a cluster_id
    counter = 0
    # iterate a pandas data frame row-wise using .iterrows()
    for index, row in centroids.iterrows():
        coordinates = [row['ppg'], row['atr']]  # a plain list: [ppg, atr]
        dictionary[counter] = coordinates
        counter += 1
    return dictionary

centroids_dict = centroids_to_dict(centroids)
import math

# Compute the Euclidean distance between two points
def calculate_distance(centroid, player_values):  # both arguments are lists
    root_distance = 0
    for x in range(0, len(centroid)):
        difference = centroid[x] - player_values[x]
        squared_difference = difference**2
        root_distance += squared_difference
    euclid_distance = math.sqrt(root_distance)
    return euclid_distance
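As a quick sanity check of the distance function above, the sketch below restates the same logic and compares it with the standard library's `math.dist` (available from Python 3.8). The 3-4-5 triangle and the second pair of points are made-up inputs, not values from the dataset.

```python
import math


def calculate_distance(centroid, player_values):
    """Euclidean distance between two equal-length lists (same logic as above)."""
    squared_sum = 0
    for c, p in zip(centroid, player_values):
        squared_sum += (c - p) ** 2
    return math.sqrt(squared_sum)


# A 3-4-5 right triangle: the distance from the origin to (3, 4) is 5
d = calculate_distance([0, 0], [3, 4])
print(d)  # prints 5.0

# The hand-written loop should agree with math.dist on arbitrary points
a, b = [13.1, 2.0], [7.5, 1.5]
print(math.isclose(calculate_distance(a, b), math.dist(a, b)))  # prints True
```

Note that `ppg` and `atr` are on different scales, so in this two-feature distance the points-per-game difference will usually dominate.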
# Return the key of the cluster whose centroid is closest to the given row
def assign_to_cluster(row):
    lowest_distance = -1
    closest_cluster = -1
    for cluster_id, centroid in centroids_dict.items():
        df_row = [row['ppg'], row['atr']]
        euclidean_distance = calculate_distance(centroid, df_row)
        if lowest_distance == -1:
            lowest_distance = euclidean_distance
            closest_cluster = cluster_id
        elif euclidean_distance < lowest_distance:
            lowest_distance = euclidean_distance
            closest_cluster = cluster_id
    return closest_cluster

# Create a new column that stores each player's cluster number
point_guards['cluster'] = point_guards.apply(lambda row: assign_to_cluster(row), axis=1)
def visualize_clusters(df, num_clusters):
    colors = ['b', 'g', 'r', 'c', 'm', 'y', 'k']
    for n in range(num_clusters):
        clustered_df = df[df['cluster'] == n]
        # Cycle through the palette so each cluster gets its own color
        plt.scatter(clustered_df['ppg'], clustered_df['atr'], c=colors[n % len(colors)])
    plt.xlabel('Points Per Game', fontsize=13)
    plt.ylabel('Assist Turnover Ratio', fontsize=13)
    plt.show()

visualize_clusters(point_guards, num_clusters)
def recalculate_centroids(df):
    new_centroids_dict = dict()
    for cluster_id in range(0, num_clusters):
        values_in_cluster = df[df['cluster'] == cluster_id]
        # Calculate new centroid using mean of values in the cluster
        new_centroid = [np.average(values_in_cluster['ppg']), np.average(values_in_cluster['atr'])]
        new_centroids_dict[cluster_id] = new_centroid
    return new_centroids_dict

# First iteration: recompute the centroids, then reassign every player
centroids_dict = recalculate_centroids(point_guards)
point_guards['cluster'] = point_guards.apply(lambda row: assign_to_cluster(row), axis=1)
visualize_clusters(point_guards, num_clusters)

# Second iteration: the same two steps again
centroids_dict = recalculate_centroids(point_guards)
point_guards['cluster'] = point_guards.apply(lambda row: assign_to_cluster(row), axis=1)
visualize_clusters(point_guards, num_clusters)
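The code above repeats the recalculate-and-reassign pair by hand. A minimal sketch of running the same two steps until no point changes cluster might look like the following. It works on plain lists rather than the `point_guards` DataFrame; the helper names (`assign`, `recompute`, `run_until_stable`) and the toy points are made up for illustration, and keeping the old centroid when a cluster ends up empty is an assumption rather than something the article specifies.

```python
import math


def assign(points, centroids):
    """Return, for each point, the index of its nearest centroid."""
    return [min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points]


def recompute(points, labels, centroids):
    """Return the mean of each cluster; an empty cluster keeps its old centroid."""
    new_centroids = []
    for k, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == k]
        if members:
            new_centroids.append([sum(axis) / len(members) for axis in zip(*members)])
        else:
            new_centroids.append(old)
    return new_centroids


def run_until_stable(points, centroids, max_iterations=100):
    """Alternate the two steps until the assignments stop changing."""
    labels = assign(points, centroids)
    for iteration in range(1, max_iterations + 1):
        centroids = recompute(points, labels, centroids)
        new_labels = assign(points, centroids)
        if new_labels == labels:
            return labels, centroids, iteration
        labels = new_labels
    return labels, centroids, max_iterations


# Made-up (ppg, atr) pairs, not taken from nba_2013.csv
toy_points = [[5.0, 1.5], [6.0, 1.8], [7.0, 2.0],
              [18.0, 2.5], [20.0, 3.0], [22.0, 2.8]]
# Deliberately poor start: both centroids sit in the low-scoring group
initial_centroids = [[5.0, 1.5], [6.0, 1.8]]

labels, centroids, iterations = run_until_stable(toy_points, initial_centroids)
print(labels, iterations)  # prints [0, 0, 0, 1, 1, 1] 2
```

On this toy input the assignments stop changing after two passes, even though both starting centroids were in the same group.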
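Watching the first few iterations, the clusters change only slightly from one pass to the next. That stability cuts both ways: the algorithm settles down quickly, but where it settles depends heavily on the randomly chosen initial centroids.
To address this, the K-Means implementation in sklearn does some smart things, such as repeating the clustering several times with freshly chosen centroids each time, which introduces far less bias than relying on a single choice of initial centroids.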
from sklearn.cluster import KMeans

# n_init=10 runs k-means with 10 different centroid seeds and keeps the best result
kmeans = KMeans(n_clusters=num_clusters, n_init=10)
kmeans.fit(point_guards[['ppg', 'atr']])
point_guards['cluster'] = kmeans.labels_
visualize_clusters(point_guards, num_clusters)
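The restart idea can be sketched in plain Python: run K-Means several times from different random starting centroids and keep the run with the lowest inertia (the sum of squared distances from each point to its assigned centroid). This is only an illustration of the idea, not sklearn's actual implementation, which among other differences uses k-means++ seeding by default. The names `kmeans_once` and `best_of_restarts` and the toy points are hypothetical.

```python
import math
import random


def kmeans_once(points, k, rng, max_iterations=100):
    """One k-means run from k randomly chosen points; returns (inertia, labels)."""
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iterations):
        new_labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:
            break  # assignments are stable, so the run has converged
        labels = new_labels
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # assumption: an empty cluster keeps its previous centroid
                centroids[c] = [sum(axis) / len(members) for axis in zip(*members)]
    inertia = sum(math.dist(p, centroids[label]) ** 2
                  for p, label in zip(points, labels))
    return inertia, labels


def best_of_restarts(points, k, restarts=10, seed=1):
    """Run k-means `restarts` times and return the (inertia, labels) with lowest inertia."""
    rng = random.Random(seed)
    runs = [kmeans_once(points, k, rng) for _ in range(restarts)]
    return min(runs, key=lambda run: run[0])


# Made-up (ppg, atr) pairs forming three loose groups, not taken from nba_2013.csv
toy_points = [[5.0, 1.5], [6.0, 1.8], [7.0, 2.0],
              [12.0, 3.5], [13.0, 3.8], [14.0, 3.6],
              [20.0, 2.5], [21.0, 2.8], [22.0, 2.6]]

single_inertia, _ = best_of_restarts(toy_points, k=3, restarts=1)
best_inertia, best_labels = best_of_restarts(toy_points, k=3, restarts=10)
print(single_inertia, best_inertia)
print(best_labels)
```

Because both calls use the same seed, the first of the ten restarts is identical to the single run, so the best-of-ten inertia can never be worse than the single-run inertia.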
Original article: http://blog.csdn.net/zm714981790/article/details/51247945