RCurl网络数据抓取

时间：2015-10-14 14:14:00 阅读：375 评论：0 收藏：0 [点我收藏+]

观察基础信息（服务器信息和提交给服务器的信息）

d=debugGatherer()
xpath="http://123.sogou.com/"
url=getURL(xpath,debugfuNction=d$update,verbose=T)
cat(d$value()[1])#服务器地址以及端口号
cat(d$value()[2])#服务器返回的头信息
cat(d$value()[3])#提交给服务器的头信息

观察是否连接到该网址。

curl=getCurlHandle()
url=getURL(xpath,curl=curl,httpheader=myheader)
getCurlInfo(curl)$response.code

显示为200 表示获取成功。

有时候网页获取信息不全，可能是头信息导致的错误

#设置头信息
myheader<-c(
"User-Agent"="Mozilla/5.0 (Linux; U; Android 2.3.3; zh-cn; HTC_DesireS_S510e Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
"Accept"="text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"Accept-Language"="en-us",
"Connection"="keep-alive",
"Accept-Charset"="GB2312,utf-8;q=0.7,*;q=0.7"
)

xpath="http://t.dianping.com/list/guangzhou?q=%E7%94%B5%E5%BD%B1"
url=getURL(xpath,httpheader=myheader)

可以观察增加头信息和不添加头信息之间的区别

有时候网页获取信息乱码，总共三种处理方法。第一，增加参数 .encoding（观察html的编码情况）第二，可以尝试设置头信息去解决。第三，windows 出现乱码问题，需在Linux系统下执行

通过与XML包的结合也可以直接抓取表格信息

xpath="http://www.hbksw.com/html/13/26369.shtml"
url=getURL(xpath,httpheader=myheader,.encoding="gb2312")
write(url,"f://url.txt")
doc<-htmlParse(url,asText=T)
tables<-readHTMLTable(doc,which=4);tables

正则表达式的一些使用

# \ 转义字符 . 除了换行后的任意字符 ^ 开头 $ 结尾 * 0个或者多个
# + 一个或者多个？ 0个或者一个
#正则表达式的匹配
pattern="[A-Za-z0-9\\._%+-]+@[A-Za-z0-9\\._%+-]+\\.[A-Za-z]{2,4}"
list=c("sunshine@.163.com","niubi","421946059@qq.com")
list1<-paste(list,collapse=",")
grepl(pattern,list)
grep(pattern,list1)
regexpr(pattern,list1)
regexec(pattern,list1)
gregexpr(pattern,list1)

通过正则表达式抓取到自己想要数据的位置，通过字符串分割去提取

RCurl网络数据抓取

原文：http://www.cnblogs.com/gwz922/p/4877255.html

踩

(0)

评论一句话评论（0）

分享档案

更多>

2021年09月23日 (328)
2021年09月24日 (313)
2021年09月17日 (191)
2021年09月15日 (369)
2021年09月16日 (411)
2021年09月13日 (439)
2021年09月11日 (398)
2021年09月12日 (393)
2021年09月10日 (160)
2021年09月08日 (222)