爬虫3：html页面+webdriver模块+demo

时间：2016-05-21 01:14:54 阅读：232 评论：0 收藏：0 [点我收藏+]

　　保密性好的网站，不能使用request请求页面信息，这样可以使用webdriver模块先开启一个浏览器，然后爬去信息，甚至还可以click等操作对页面操作，再爬取。

　　demo 一般流程：

　　1）包含selenium 模块

from selenium import webdriver
from selenium.webdriver.common.keys import Keys

　　2）设置采用火狐浏览器（chrome也可以）

driver = webdriver.Firefox()

　　3）get方式打开（为了保密，url省略）

driver.get("http://www.---------------")

　　4）css方式筛选

elements = driver.find_elements_by_css_selector("span.c9.ng-binding")

　　5）由于webdriver模块的筛选功能不是很好用，这里推荐转成html形式，然后使用beautifulsoap筛选

html = driver.page_source

　　6）BeautifulSoup筛选信息-find_all 和 css 选择器方式更好用

from bs4 import BeautifulSoup
import re

soup = BeautifulSoup(html)
# soup.find_all(‘div‘,text=re.compile(u"信息"))[0]
for i in soup.select(‘a[href*="human"]‘):
    print i

爬虫3：html页面+webdriver模块+demo

原文：http://www.cnblogs.com/rongyux/p/5513780.html

踩

(0)

评论一句话评论（0）

分享档案

更多>

2021年09月23日 (328)
2021年09月24日 (313)
2021年09月17日 (191)
2021年09月15日 (369)
2021年09月16日 (411)
2021年09月13日 (439)
2021年09月11日 (398)
2021年09月12日 (393)
2021年09月10日 (160)
2021年09月08日 (222)