In this report, I use Python to analyse trends in the movie market.
Packages: pandas, NumPy, Matplotlib, seaborn, json
IDE: PyCharm
Major questions:
1. Data import and cleaning
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import json
import numpy as np

sns.set_style('darkgrid')

moviesdf = pd.read_csv('movies.csv')
movdf = pd.read_csv('credits.csv')
1) Fill missing values
# Inspect the rows with a missing release date, then fill them in
null = moviesdf["release_date"].isnull()
moviesdf.loc[null, :]
moviesdf['release_date'] = moviesdf['release_date'].fillna('2017-11-01')
2) Convert data type
Date
# Parse the date strings; values that cannot be parsed become NaT
moviesdf['release_date'] = pd.to_datetime(moviesdf['release_date'],
                                          format='%Y-%m-%d',
                                          errors='coerce')
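As a small illustration of what `errors='coerce'` does in the conversion above, the following sketch parses a few made-up values (they are not taken from movies.csv). Entries that do not match the format, and missing entries, end up as NaT rather than raising an error.

```python
import pandas as pd

# Illustrative values only; not taken from movies.csv.
dates = pd.Series(['2009-12-10', 'not a date', None])

parsed = pd.to_datetime(dates, format='%Y-%m-%d', errors='coerce')

# Unparseable and missing entries are both turned into NaT.
n_missing = int(parsed.isna().sum())
first_year = parsed.dt.year.iloc[0]
```

Counting the NaT values after the conversion is a quick way to check whether any dates were silently dropped.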
JSON into strings
#genres
moviesdf['genres'] = moviesdf['genres'].apply(json.loads)
for index, i in zip(moviesdf.index, moviesdf['genres']):
    l = []
    for j in range(len(i)):
        l.append((i[j]['name']))
    moviesdf.loc[index, 'genres'] = str(l)
#keywords
moviesdf['keywords'] = moviesdf['keywords'].apply(json.loads)
for index, i in zip(moviesdf.index, moviesdf['keywords']):
    l = []
    for j in range(len(i)):
        l.append((i[j]['name']))
    moviesdf.loc[index, 'keywords'] = str(l)
#production_companies
moviesdf['production_companies'] = moviesdf['production_companies'].apply(json.loads)
for index, i in zip(moviesdf.index, moviesdf['production_companies']):
    l = []
    for j in range(len(i)):
        l.append((i[j]['name']))
    moviesdf.loc[index, 'production_companies'] = str(l)
#production_countries
moviesdf['production_countries'] = moviesdf['production_countries'].apply(json.loads)
for index, i in zip(moviesdf.index, moviesdf['production_countries']):
    l = []
    for j in range(len(i)):
        l.append((i[j]['name']))
    moviesdf.loc[index, 'production_countries'] = str(l)
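The four loops above repeat the same steps, so they could be folded into one helper. The sketch below is only one possible way to do that: `names_from_json` is a hypothetical name, and the sample string is illustrative input in the same shape as the genres column, not a row from the dataset.

```python
import json


def names_from_json(cell):
    """Return the 'name' field of every object in a JSON array string."""
    return [item['name'] for item in json.loads(cell)]


# Illustrative input shaped like one cell of the genres column.
sample = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'
names = names_from_json(sample)

# Same string form that the loops above store in each cell.
as_text = str(names)
```

On the raw columns (before `json.loads` has been applied), something like `moviesdf[col].apply(lambda c: str(names_from_json(c)))` for each of the four column names would likely give the same result as the loops.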
2. Data processing and visualising
Summarise genres in a list
# Turn "['Action', 'Adventure']" back into a list of names (spaces inside names are removed too)
moviesdf['genres'] = moviesdf['genres'].str.strip('[]').str.replace(' ', '').str.replace("'", '')
moviesdf['genres'] = moviesdf['genres'].str.split(',')
list1 = []
for i in moviesdf['genres']:
    list1.extend(i)
genres = pd.Series(list1).value_counts().sort_values(ascending=False)
genres[:10]
genres = pd.DataFrame(genres[:10])
# Name the count column explicitly; its default name differs between pandas versions
genres.columns = ['total']
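The flatten-and-count step above can also be written with `explode()` and `value_counts()`. This is a minimal sketch on made-up rows, assuming each cell already holds a list of genre names as it does after the split; `df`, `counts` and `top` are names chosen for the example.

```python
import pandas as pd

# Illustrative rows; each cell is a list of genre names.
df = pd.DataFrame({'genres': [['Action', 'Drama'], ['Drama'], ['Comedy', 'Drama']]})

# explode() yields one row per (movie, genre) pair;
# value_counts() returns the counts sorted in descending order.
counts = df['genres'].explode().value_counts()

# Keep the ten most common genres in a one-column DataFrame.
top = counts.head(10).to_frame(name='total')
```

Passing `name='total'` to `to_frame` sets the column name directly, so no rename is needed afterwards.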
1) Barplot: Genres of movies & Amount
f, ax = plt.subplots(figsize=(12, 10))
g = sns.barplot(y=genres.index, x="total", data=genres, palette="Blues_d", ax=ax)
plt.show()

2) Q1: How do movie genres change over time?
years = []
for x in moviesdf["release_date"]:
    year = x.year
    years.append(year)
Years = pd.Series(years)
moviesdf['year'] = Years
moviesdf['year'].head()
min_year = moviesdf['year'].min()
max_year = moviesdf['year'].max()
liste_genres = set()
for s in moviesdf['genres']:
    liste_genres = set().union(s, liste_genres)
liste_genres = list(liste_genres)
liste_genres
genre_df = pd.DataFrame(index=liste_genres, columns=range(min_year, max_year + 1))
genre_df = genre_df.fillna(value=0)
year = np.array(moviesdf['year'])
z = 0
for i in moviesdf['genres']:
    split_genre = list(i)
    for j in split_genre:
        genre_df.loc[j, year[z]] = genre_df.loc[j, year[z]] + 1
    z += 1
genre_df
plt.figure(figsize=(15, 8))
plt.plot(genre_df.T)
plt.title('Number of movies per genre by year')
plt.xticks(range(1910, 2020, 5))
plt.legend(genre_df.index)
plt.show()

*The number of movies in each genre grows over time, with a boom from 1975 to 1995.
*After 1995, dramas, comedies and thrillers increase dramatically.
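The genre-by-year table behind this plot is built with an explicit counting loop. A shorter alternative, sketched here on made-up rows, is `pd.crosstab` on the exploded data. The column names `genres` and `year` match the report; the values and the name `genre_by_year` are illustrative.

```python
import pandas as pd

# Illustrative rows only; 'genres' holds lists and 'year' holds release years.
df = pd.DataFrame({
    'genres': [['Action', 'Drama'], ['Drama'], ['Comedy']],
    'year': [1995, 1995, 1996],
})

# One row per (movie, genre) pair, then count the pairs per genre and year.
long = df.explode('genres')
genre_by_year = pd.crosstab(long['genres'], long['year'])
```

Unlike the loop, `crosstab` only creates columns for years that actually occur in the data, so a `reindex` over the full year range would be needed to get a column for every year. Calling `genre_by_year.T.plot()` would draw one line per genre.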
3) Q2: How do Universal Pictures and Paramount Pictures compare?
plt.figure(figsize=(7, 4))
two = ['Universal Pictures', 'Paramount Pictures']
num = [77015832, 70100000]
plt.bar(np.arange(len(two)), num, color='c', width=0.1, align='center')
plt.ylabel('revenue')
plt.xticks(np.arange(len(two)), two)
plt.title('Universal Pictures VS Paramount Pictures')
plt.grid(True)
plt.show()

*As of 2017, Universal Pictures has slightly higher revenue than Paramount Pictures.
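The two revenue figures in the chart are hard-coded, and the report does not say how they were computed. The sketch below shows one way such a per-company figure might be derived. It assumes a numeric `revenue` column and assumes the mean is the statistic of interest; neither is confirmed by the report. The rows are made up and `mean_revenue` is a hypothetical helper.

```python
import pandas as pd

# Illustrative rows and revenue values; not taken from movies.csv.
df = pd.DataFrame({
    'production_companies': [
        "['Universal Pictures', 'Amblin Entertainment']",
        "['Paramount Pictures']",
        "['Universal Pictures']",
    ],
    'revenue': [100, 40, 60],
})


def mean_revenue(frame, company):
    """Mean revenue of the rows whose company string mentions `company`."""
    mask = frame['production_companies'].str.contains(company, regex=False)
    return frame.loc[mask, 'revenue'].mean()


universal = mean_revenue(df, 'Universal Pictures')
paramount = mean_revenue(df, 'Paramount Pictures')
```

Swapping `.mean()` for `.median()` or `.sum()` gives a different comparison, so the choice of statistic should be stated next to the chart.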
4) Q3: How do movies based on novels compare with original movies?
keylist = ['based on novel', 'original']
nums = [197, 4606]
plt.figure(figsize=(7, 4))
plt.bar(np.arange(len(keylist)), nums, color='c', width=0.1, align='center')
plt.ylabel('Amount', fontsize=12)
plt.xticks(np.arange(len(keylist)), keylist, fontsize=12)
plt.title('Original VS Based on novel', fontsize=14)
plt.grid(True)
plt.show()

*As of 2017, most movies are original rather than based on a novel.
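The two counts in the chart are hard-coded. A plausible way to obtain them, sketched below, is to flag every movie whose keyword string contains 'based on novel' and treat the rest as original. That definition of "original" is an assumption, since the report does not state it. The rows are illustrative.

```python
import pandas as pd

# Illustrative rows; each cell is the string form of a keyword list,
# as produced by the cleaning step earlier in this report.
df = pd.DataFrame({'keywords': [
    "['based on novel', 'spy']",
    "['space', 'alien']",
    "[]",
]})

# regex=False treats the pattern as plain text rather than a regular expression.
is_novel = df['keywords'].str.contains('based on novel', regex=False)

n_novel = int(is_novel.sum())
n_original = int((~is_novel).sum())
```

With this definition the two counts always add up to the number of rows, which is a simple sanity check on the result.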
Original article: https://www.cnblogs.com/zfkepic/p/12208083.html