Scrapy crawler

Date: 2020-01-30 10:22:50

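Console commands

scrapy startproject <project_name>    # create a new project

scrapy crawl XX    # run the spider named XX

scrapy crawl quotes -o quotes.json    # run the quotes spider and save the items as JSON

scrapy crawl quotes -o quotes.jl    # same, saved as JSON Lines (one item per line)

scrapy shell http://www.scrapyd.cn    # open an interactive shell on the fetched page

scrapy genspider example example.com    # create a spider named example for the domain example.com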
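The two -o commands above differ only in output format: quotes.json holds a single JSON array, while quotes.jl holds one JSON object per line. Below is a minimal sketch of reading both with the standard library. The helper names and the sample items are made up for illustration, and the files are written by hand here rather than by Scrapy.

```python
import json
import tempfile
from pathlib import Path


def load_json_feed(path):
    """Read a .json feed: the whole file is a single JSON array of items."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def load_jsonlines_feed(path):
    """Read a .jl feed: every non-empty line is one JSON object."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Hypothetical items standing in for what a quotes spider might yield.
sample_items = [
    {"text": "first quote", "author": "A"},
    {"text": "second quote", "author": "B"},
]

with tempfile.TemporaryDirectory() as tmp:
    json_path = Path(tmp) / "quotes.json"
    jl_path = Path(tmp) / "quotes.jl"
    # Write the same items in both layouts.
    json_path.write_text(json.dumps(sample_items), encoding="utf-8")
    jl_path.write_text(
        "\n".join(json.dumps(item) for item in sample_items) + "\n",
        encoding="utf-8",
    )
    from_json = load_json_feed(json_path)
    from_jl = load_jsonlines_feed(jl_path)

print(from_json == from_jl)
```

The line-oriented layout can be read one item at a time, which tends to be more convenient for large crawls.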

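Scrapy selectors

.get()    # first match as a string, or None if nothing matched

.getall()    # every match, as a list of strings

.extract_first()    .extract()    # older names for .get() and .getall() on the list returned by .css() or .xpath()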

CSS

.intro    # class="intro"

#firstname    # id="firstname"

tag::attr(name)    # attribute value, e.g. "a::attr(href)", "img::attr(src)"

tag::text    # "a::text" is the text directly inside <a>; "a *::text" is the text of every element nested inside <a>

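div, p    # selects all <div> elements and all <p> elements

div p    # selects all <p> elements inside a <div>, at any depth

div > p    # selects all <p> elements whose parent is a <div>

div + p    # selects each <p> placed immediately after a <div> (same parent)

[target]    # selects all elements that have a target attribute

[target=blank]    # target is exactly "blank"

[target~=blank]    # target is a space-separated list containing the word "blank"

[target|=blank]    # target is exactly "blank" or starts with "blank-"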

string()    # XPath function: returns an element's whole text, nested text included, joined into one string

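XPath

/    # selects from the root node

//    # selects matching nodes anywhere below the current node, regardless of their position

@    # selects attributes, e.g. "//@href"; tag[@attr="value"] keeps only the tags whose attribute has that value

//text()    # text content of the nodes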

Regular expressions

https://docs.python.org/3/library/re.html
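A minimal sketch of post-processing extracted text with the re module documented at the link above. The input strings and the pattern are assumptions chosen for illustration. Scrapy selectors also provide .re() and .re_first(), which apply a pattern to the extracted text directly.

```python
import re

# Hypothetical string of the kind a selector's .get() might return.
born = "Born: March 14, 1879 in Ulm, Germany"

# Named groups keep the pieces of the date readable.
DATE_RE = re.compile(r"(?P<month>[A-Z][a-z]+) (?P<day>\d{1,2}), (?P<year>\d{4})")

match = DATE_RE.search(born)
birth = match.groupdict() if match else {}

# findall returns every non-overlapping match as a list of strings.
numbers = re.findall(r"\d+", born)

# Collapse runs of whitespace, a common clean-up step for scraped text.
cleaned = re.sub(r"\s+", " ", "  too   many\n spaces ").strip()
```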

Example (template)

import scrapy


class AuthorSpider(scrapy.Spider):
    name = 'author'

    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # follow links to author pages
        for href in response.css('.author + a::attr(href)'):
            yield response.follow(href, self.parse_author)

        # follow pagination links
        for href in response.css('li.next a::attr(href)'):
            yield response.follow(href, self.parse)

    def parse_author(self, response):
        def extract_with_css(query):
            return response.css(query).get(default='').strip()

        yield {
            'name': extract_with_css('h3.author-title::text'),
            'birthdate': extract_with_css('.author-born-date::text'),
            'bio': extract_with_css('.author-description::text'),
        }

Source: https://www.cnblogs.com/puddingsmall/p/12242183.html
