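Step 1: Define the requirements
1. Analyze how the data source is organized
2. Get the access links to the detailed information for Douban's high-rated movies
3. Use those detail URLs to fetch all the information
4. Join the two tables from steps 2 and 3 into a single table and save it to Excel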
Step 2: Analyze where the data is stored
Where the Douban high-rated movies are stored:
Source page link:
url = 'https://movie.douban.com/explore#!type=movie&tag=%E8%B1%86%E7%93%A3%E9%AB%98%E5%88%86&sort=recommend&page_limit=20&page_start=0'
From this page, the link that actually loads the data can be found:
url = 'https://movie.douban.com/j/search_subjects?type=movie&tag=%E8%B1%86%E7%93%A3%E9%AB%98%E5%88%86&sort=recommend&page_limit=20&page_start=0'
Changing page_limit=xxxx returns more entries; once page_limit reaches 500, the number of movies stops increasing.
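The query string of the data-loading link can also be assembled from its parts, which makes it easier to vary page_limit. The following is a minimal sketch using only the standard library. The helper name build_search_url is hypothetical, and the tag value is simply the percent-decoded form of the tag in the link above.

```python
from urllib.parse import urlencode

BASE = 'https://movie.douban.com/j/search_subjects'

def build_search_url(page_limit=20, page_start=0, tag='豆瓣高分'):
    # urlencode percent-encodes the Chinese tag as UTF-8 bytes
    params = {
        'type': 'movie',
        'tag': tag,
        'sort': 'recommend',
        'page_limit': page_limit,
        'page_start': page_start,
    }
    return BASE + '?' + urlencode(params)

print(build_search_url(page_limit=500))
```

With the default arguments the helper reproduces the data-loading link shown above character for character.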
So this URL can be used to get the title and access link of every high-rated movie:
import requests
import pandas as pd

# Request URL
url = 'https://movie.douban.com/j/search_subjects?type=movie&tag=%E8%B1%86%E7%93%A3%E9%AB%98%E5%88%86&sort=recommend&page_limit=1000&page_start=0'
# Set the request headers
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'}
# The endpoint returns JSON; .json() decodes it into a dict
r = requests.get(url, headers=headers, timeout=30).json()
columns = ['title', 'rate', 'id', 'url']
movie_info = pd.DataFrame(r['subjects'], columns=columns)
movie_info.head(2)
Because the data is stored as JSON, it is parsed to extract the fields that are needed.
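To illustrate the parsing step without hitting the network, here is a small sketch on a made-up payload. Only the top-level 'subjects' key and the four column names come from the code above; the records themselves, including the extra 'cover' field, are placeholders.

```python
import json

# Made-up payload that mimics the shape used above: a top-level
# 'subjects' list of movie records (all values are placeholders)
sample = '''
{"subjects": [
  {"title": "Movie A", "rate": "9.0", "id": "1", "url": "https://example.com/1", "cover": "a.jpg"},
  {"title": "Movie B", "rate": "8.8", "id": "2", "url": "https://example.com/2", "cover": "b.jpg"}
]}
'''

columns = ['title', 'rate', 'id', 'url']
data = json.loads(sample)  # JSON text -> Python dicts and lists
# Keep only the wanted fields from each record
rows = [{key: subject.get(key) for key in columns} for subject in data['subjects']]
for row in rows:
    print(row)
```

Passing columns to pd.DataFrame has the same filtering effect: fields that are not listed are left out of the table.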
Next, use the movie ids obtained above to build the link for each movie's detailed information:
url_info = 'https://movie.douban.com/j/subject_abstract?subject_id=' + id
Implementation:
def pick(items, i):
    # Return items[i], or '/' when the list has fewer than i + 1 entries
    return items[i] if i < len(items) else '/'

m_info = []
for movie_id in movie_info['id']:
    url_info = 'https://movie.douban.com/j/subject_abstract?subject_id=' + movie_id
    subject = requests.get(url_info, headers=headers, timeout=30).json()['subject']
    info = {}
    # Up to three actors; missing slots are filled with '/'
    info['actors1'] = pick(subject['actors'], 0)
    info['actors2'] = pick(subject['actors'], 1)
    info['actors3'] = pick(subject['actors'], 2)
    info['directors'] = pick(subject['directors'], 0)
    info['duration'] = subject['duration']
    info['rate'] = subject['rate']
    # Up to three genres; missing slots are filled with '/'
    info['types1'] = pick(subject['types'], 0)
    info['types2'] = pick(subject['types'], 1)
    info['types3'] = pick(subject['types'], 2)
    info['region'] = subject['region']
    info['release_year'] = subject['release_year']
    m_info.append(info)
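The extraction loop has to cope with movies that list fewer than three actors or genres. As an offline illustration, the sketch below flattens one made-up record into a row. The helper names pad and flatten_subject are hypothetical, and the sample values are placeholders rather than real Douban data.

```python
def pad(items, n, fill='/'):
    # Return exactly n items: truncate long lists, fill short ones
    items = list(items)[:n]
    return items + [fill] * (n - len(items))

def flatten_subject(subject):
    # Spread the list-valued fields over fixed, numbered columns
    a1, a2, a3 = pad(subject.get('actors', []), 3)
    t1, t2, t3 = pad(subject.get('types', []), 3)
    return {
        'actors1': a1, 'actors2': a2, 'actors3': a3,
        'directors': pad(subject.get('directors', []), 1)[0],
        'duration': subject.get('duration', '/'),
        'types1': t1, 'types2': t2, 'types3': t3,
        'region': subject.get('region', '/'),
        'release_year': subject.get('release_year', '/'),
    }

# Placeholder record with only two actors and one genre
sample = {
    'actors': ['Actor X', 'Actor Y'],
    'directors': ['Director Z'],
    'duration': '120 min',
    'types': ['Drama'],
    'region': 'Somewhere',
    'release_year': '2000',
}
row = flatten_subject(sample)
print(row)
```

Filling every missing slot with the same placeholder keeps all rows the same width, so they convert cleanly into a table.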
Use pandas to convert the detailed information into a table:
df_info = pd.DataFrame(m_info)
# Drop the duplicate column (movie_info already has 'rate')
del df_info['rate']
# join aligns the two tables on their row index
movie_data = movie_info.join(df_info)

# Write to Excel
movie_data.to_excel('豆瓣高分电影500部.xlsx', index=False)
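Note that join matches rows by index, so it relies on the detail list holding exactly one entry per movie, in the same order as movie_info. If the movie id were stored with each detail record (an assumption; the code above does not do this), the two tables could be matched on that key instead. A minimal sketch with made-up data:

```python
import pandas as pd

# Placeholder tables; only the column name 'id' mirrors the code above
movie_info = pd.DataFrame({
    'title': ['Movie A', 'Movie B', 'Movie C'],
    'id': ['1', '2', '3'],
})
# Details arrive in a different order, and movie '2' is missing
df_info = pd.DataFrame({
    'id': ['3', '1'],
    'region': ['Region Z', 'Region X'],
})

# A left merge keeps every movie and matches details by id, not position
movie_data = movie_info.merge(df_info, on='id', how='left')
print(movie_data)
```

Movies without a matching detail record end up with missing values in the detail columns rather than shifting other rows out of place.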
Original article: https://www.cnblogs.com/syd123/p/12271509.html