1. Install the Scrapy module for PyCharm: a few modules need to be installed before Scrapy itself. Option 1: lxml -> zope.interface -> pyopenssl -> twisted -> scrapy. Option 2: wheel (for installing .whl files), lxml (used for XPath extraction), Twisted, pywin32.
2. Create a Scrapy project from PyCharm: PyCharm cannot create a Scrapy project directly, it has to be done from the command line. In PyCharm's Terminal, cd into the target directory, then run scrapy startproject scrapyspider to create a Scrapy project named scrapyspider.
3. Create a Scrapy project from cmd: add the folder containing scrapy.exe to the PATH environment variable, cd into the target directory, then run scrapy startproject scrapyspider to create a Scrapy project named scrapyspider. A sketch of these commands is shown right after this list.
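As a reference only, the whole setup from option 2 above plus project creation might look like this on the command line (package names as listed above; assuming pip is on PATH):

pip install wheel
pip install lxml
pip install Twisted
pip install pywin32
pip install scrapy
cd D:\projects            # any directory that should hold the project
scrapy startproject scrapyspider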
scrapyspider/
    scrapy.cfg
    scrapyspider/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            ...
1. The files under the project directory. items.py defines the container for the scraped data:
import scrapy

class MovieItem(scrapy.Item):  # subclass scrapy.Item, the ready-made container class; named MovieItem to match the imports used below
    # define the fields for your item here like:
    # name = scrapy.Field()
    movieid = scrapy.Field()
    moviename = scrapy.Field()
    directors = scrapy.Field()
    actors = scrapy.Field()
    posterPath = scrapy.Field()
    plotSummary = scrapy.Field()
    averageratings = scrapy.Field()
    numRatings = scrapy.Field()
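As a quick sanity check, a Scrapy Item behaves much like a dict, so the fields defined above are filled and read with ordinary subscript syntax (the values here are made up):

item = MovieItem()
item['moviename'] = 'Toy Story'
item['averageratings'] = 3.9
print(dict(item))    # {'moviename': 'Toy Story', 'averageratings': 3.9}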
1. To create a Spider, you must subclass the scrapy.spiders.Spider class:
from scrapy.spiders import Spider
2. Define attributes and methods according to your needs:
2.1. name: identifies the Spider. The name must be unique; different Spiders may not share the same name.
2.2. start_urls: the list of URLs the Spider starts crawling from.
2.3. The parse() method: when it is called, the Response object generated for each initial URL after the download finishes is passed to it as the only argument. The method is responsible for parsing the response data, extracting data (producing items), and generating Request objects for URLs that need further processing.
2.4. Other methods can be defined as needed: start_requests, make_requests_from_url, update_settings, handles_request, close, etc.
2.5. Reading the settings.py configuration:
The old form, from scrapy.conf import settings, is replaced by the project helper function: from scrapy.utils.project import get_project_settings; settings = get_project_settings() (a short sketch follows below).
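A minimal sketch of reading a value from settings.py this way; BOT_NAME is just an example key that every generated project defines:

from scrapy.utils.project import get_project_settings

settings = get_project_settings()
bot_name = settings.get('BOT_NAME')   # read any key defined in settings.py
print(bot_name)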
Set name, start_urls and a few variables:
name = "demo"
movie_id = 1
#handle_httpstatus_list = [401]
allowed_domains = ["movielens.org"]
start_urls = ["https://movielens.org/api/movies/"]
The parse() method (the top of the spider file also needs import scrapy, import json and from tutorial.items import MovieItem for the code below to run):
def parse(self, response):
    #filename = response.url.split("/")[-2]
    #filename = "movies"
    #with open(filename, 'ab') as f:
    #    f.write(response.body)
    item = MovieItem()
    entity = json.loads(response.body)                 # the API returns JSON
    movie = entity['data']['movieDetails']['movie']
    item['movieid'] = entity['data']['movieDetails']['movieId']
    item['moviename'] = movie['title']
    item['directors'] = ",".join(movie['directors'])
    item['actors'] = ",".join(movie['actors'])
    item['posterPath'] = movie['posterPath']
    item['plotSummary'] = movie['plotSummary']
    item['averageratings'] = movie['avgRating']
    item['numRatings'] = movie['numRatings']
    yield item
    # on the first response this schedules a request for every remaining movie id
    while self.movie_id < 140215:
        self.movie_id += 1
        url = self.start_urls[0] + str(self.movie_id)
        yield scrapy.Request(url, dont_filter=True, callback=self.parse)
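The spider is then started from the project directory with the standard Scrapy command; the name demo comes from the name attribute set above, and -o is optional and simply exports the scraped items to a file:

scrapy crawl demo
scrapy crawl demo -o movies.json   # optional: also dump the items to JSON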
pipelines.py (note that mysql.connector also has to be imported for the database code below):
import json
import mysql.connector
from tutorial.items import MovieItem
class TutorialPipeline(object):
    def __init__(self):
        self.conn = mysql.connector.connect(user='root', password='123456', database='how2java')
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        insert_sql = """
            insert into movie(movieid, moviename, directors, actors, posterPath, plotSummary, averageratings, numRatings)
            VALUES (%s, %s, %s, %s, %s, %s, %s, %s)
        """
        self.cursor.execute(insert_sql, (
            item["movieid"], item["moviename"], item["directors"], item["actors"], item["posterPath"],
            item["plotSummary"], item["averageratings"], item["numRatings"]))
        self.conn.commit()
        return item   # return the item so any later pipelines can still process it
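For this pipeline to actually run, it also has to be enabled in settings.py. A minimal sketch; the module path assumes the tutorial package name used in the imports above, and 300 is just an ordinary priority value:

ITEM_PIPELINES = {
    'tutorial.pipelines.TutorialPipeline': 300,
}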
Original article: https://www.cnblogs.com/yinminbo/p/11825503.html