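Find a phone on a JD.com product page and copy its URL.
2.1) Crawling code
import requests

url = "https://item.jd.com/100003534811.html"
r = requests.get(url)
print(r.status_code)  # 200 means the request succeeded
print(r.text[:1000])  # print only the first 1000 characters
2.2) Response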
<!DOCTYPE HTML> <html lang="zh-CN"> <head> <!-- shouji --> <meta http-equiv="Content-Type" content="text/html; charset=gbk" /> <title>【小米Redmi K20 Pro】小米 Redmi K20Pro 4800万超广角三摄 8GB+128GB 冰川蓝 骁龙855 全网通4G 双卡双待 全面屏拍照智能游戏手机【行情 报价 价格 评测】-京东</title> <meta name="keywords" content="MIRedmi K20 Pro,小米Redmi K20 Pro,小米Redmi K20 Pro报价,MIRedmi K20 Pro报价"/> <meta name="description" content="【小米Redmi K20 Pro】京东JD.COM提供小米Redmi K20 Pro正品行货,并包括MIRedmi K20 Pro网购指南,以及小米Redmi K20 Pro图片、Redmi K20 Pro参数、Redmi K20 Pro评论、Redmi K20 Pro心得、Redmi K20 Pro技巧等信息,网购小米Redmi K20 Pro上京东,放心又轻松" /> <meta name="format-detection" content="telephone=no"> <meta http-equiv="mobile-agent" content="format=xhtml; url=//item.m.jd.com/product/100003534811.html"> <meta http-equiv="mobile-agent" content="format=html5; url=//item.m.jd.com/product/100003534811.html"> <meta http-equiv="X-UA-Compatible" content="IE=Edge"> <link rel="canonical" href="//item.jd.com/100003534811.html"/> <link
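The response above declares charset=gbk in a <meta> tag, but requests derives r.encoding from the HTTP Content-Type header, not from the page body. The sketch below illustrates the fallback that applies when that header carries no charset. It assumes requests.utils.get_encoding_from_headers behaves as in current requests releases, and the two header dictionaries are made-up examples, not values captured from the JD server.

```python
from requests.utils import get_encoding_from_headers

# Hypothetical header sets for illustration; not captured from JD.
with_charset = {"content-type": "text/html; charset=gbk"}
without_charset = {"content-type": "text/html"}

# When the header names a charset, requests uses it.
declared = get_encoding_from_headers(with_charset)

# For text/* without a charset, requests falls back to ISO-8859-1,
# which would garble Chinese text that is really encoded as GBK.
fallback = get_encoding_from_headers(without_charset)

print(declared, fallback)
```

This is one reason the try/except version of the crawl sets r.encoding = r.apparent_encoding before reading r.text.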
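import requests

url = "https://item.jd.com/100003534811.html"
try:
    r = requests.get(url)
    # raise_for_status() does not raise when the status code is 200
    r.raise_for_status()
    # guess the encoding from the body instead of trusting the headers
    r.encoding = r.apparent_encoding
    print(r.text[:1000])
except requests.RequestException:
    print("Crawl failed")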
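The same pattern can be wrapped in a function so that later crawls reuse it. This is only a sketch: the name get_html_text, the timeout parameter, and the choice to return None on failure are assumptions added here, not part of the original code.

```python
import requests


def get_html_text(url, timeout=10):
    """Return the decoded page text, or None if the request fails.

    The function name and the timeout default are illustrative choices.
    """
    try:
        r = requests.get(url, timeout=timeout)
        r.raise_for_status()              # raises HTTPError for 4xx/5xx
        r.encoding = r.apparent_encoding  # guess the encoding from the body
        return r.text
    except requests.RequestException:
        return None


# A URL without a scheme is rejected by requests before anything is sent,
# so this call exercises the failure path without touching the network.
result = get_html_text("item.jd.com/100003534811.html")
print(result)
```

Returning None instead of printing a message lets the caller decide how to report the failure.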
Find a book on an Amazon page and copy its URL.
2.1) Crawling code
import requests

url = "https://www.amazon.cn/dp/B01H36S9MO/ref=sr_1_1?__mk_zh_CN=%E4%BA%9A%E9%A9%AC%E9%80%8A%E7%BD%91%E7%AB%99&keywords=%E7%99%BD%E8%AF%B4&qid=1565584830&s=gateway&sr=8-1"
r = requests.get(url)
print(r.status_code)
2.2) Interpreting the status code
The status code is 503 rather than 200, so the request did not succeed.
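To see how raise_for_status() treats these two codes without contacting any server, one can build a bare Response object and set its status by hand. This is a sketch for illustration only: the helper name describe_status is made up, and a hand-built Response has no URL or reason, so the error text differs from what a real request would show.

```python
import requests


def describe_status(status_code):
    """Report whether raise_for_status() accepts the given status code."""
    resp = requests.Response()      # empty response, nothing is sent
    resp.status_code = status_code  # set the status by hand
    try:
        resp.raise_for_status()
    except requests.HTTPError as exc:
        return f"error: {exc}"
    return "ok"


print(describe_status(200))
print(describe_status(503))
```

A 200 passes silently, while a 503 raises requests.HTTPError, which is what sends the try/except version of the crawl into its except branch.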
2.3) Printing the response text
<!DOCTYPE html> <!--[if lt IE 7]> <html lang="zh-CN" class="a-no-js a-lt-ie9 a-lt-ie8 a-lt-ie7"> <![endif]--> <!--[if IE 7]> <html lang="zh-CN" class="a-no-js a-lt-ie9 a-lt-ie8"> <![endif]--> <!--[if IE 8]> <html lang="zh-CN" class="a-no-js a-lt-ie9"> <![endif]--> <!--[if gt IE 8]><!--> <html class="a-no-js" lang="zh-CN"><!--<![endif]--><head> <meta http-equiv="content-type" content="text/html; charset=UTF-8"> <meta charset="utf-8"> <meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1"> <title dir="ltr">Amazon CAPTCHA</title> <meta name="viewport" content="width=device-width"> <link rel="stylesheet" href="https://images-na.ssl-images-amazon.com/images/G/01/AUIClients/AmazonUI-3c913031596ca78a3768f4e934b1cc02ce238101.secure.min._V1_.css"> <script> if (true === true) { var ue_t0 = (+ new Date()), ue_csm = window, ue = { t0: ue_t0, d: function() { return (+new Date() - ue_t0); } }, ue_furl = "fls-cn.amazon.cn", ue_mid = "AAHKV2X7AFYLW", ue_sid = (document.cookie.match(/session-id=([0-9-]+)/) || [])[1], ue_sn = "opfcaptcha.amazon.cn", ue_id = ‘7M7370PKHPW590MJV57S‘; } </script> </head> <body> <!-- To discuss automated access to Amazon data please contact api-services-support@amazon.com. For information about migrating to our APIs refer to our Marketplace APIs at https://developer.amazonservices.com.cn/index.html/ref=rm_c_sv, or our Product Advertising API at https://associates.amazon.cn/gp/advertising/api/detail/main.html/ref=rm_c_ac for advertising use cases. 
--> <!-- Correios.DoNotSend --> <div class="a-container a-padding-double-large" style="min-width:350px;padding:44px 0 !important"> <div class="a-row a-spacing-double-large" style="width: 350px; margin: 0 auto"> <div class="a-row a-spacing-medium a-text-center"><i class="a-icon a-logo"></i></div> <div class="a-box a-alert a-alert-info a-spacing-base"> <div class="a-box-inner"> <i class="a-icon a-icon-alert"></i> <h4>请输入您在下方看到的字符</h4> <p class="a-last">抱歉,我们只是想确认一下当前访问者并非自动程序。为了达到最佳效果,请确保您浏览器上的 Cookie 已启用。</p> </div> </div> <div class="a-section"> <div class="a-box a-color-offset-background"> <div class="a-box-inner a-padding-extra-large"> <form method="get" action="/errors/validateCaptcha" name=""> <input type=hidden name="amzn" value="3vXJDVQq+SKJ44y9xdfMeA==" /><input type=hidden name="amzn-r" value="/dp/B01H36S9MO/ref=sr_1_1?__mk_zh_CN=%E4%BA%9A%E9%A9%AC%E9%80%8A%E7%BD%91%E7%AB%99&keywords=%E7%99%BD%E8%AF%B4&qid=1565584830&s=gateway&sr=8-1" /> <div class="a-row a-spacing-large"> <div class="a-box"> <div class="a-box-inner"> <h4>请输入您在这个图片中看到的字符:</h4> <div class="a-row a-text-center"> <img src="https://images-na.ssl-images-amazon.com/captcha/xzqdsmvh/Captcha_ngaflmibnn.jpg"> </div> <div class="a-row a-spacing-base"> <div class="a-row"> <div class="a-column a-span6"> <label for="captchacharacters">输入字符</label> </div> <div class="a-column a-span6 a-span-last a-text-right"> <a onclick="window.location.reload()">换一张图</a> </div> </div> <input autocomplete="off" spellcheck="false" id="captchacharacters" name="field-keywords" class="a-span12" autocapitalize="off" autocorrect="off" type="text"> </div> </div> </div> </div> <div class="a-section a-spacing-extra-large"> <div class="a-row"> <span class="a-button a-button-primary a-span12"> <span class="a-button-inner"> <button type="submit" class="a-button-text">继续购物</button> </span> </span> </div> </div> </form> </div> </div> </div> </div> <div class="a-divider a-divider-section"><div class="a-divider-inner"></div></div> <div 
class="a-text-center a-spacing-small a-size-mini"> <a href="https://www.amazon.cn/gp/help/customer/display.html/ref=footer_claim?ie=UTF8&nodeId=200347160">使用条件</a> <span class="a-letter-space"></span> <span class="a-letter-space"></span> <span class="a-letter-space"></span> <span class="a-letter-space"></span> <a href="https://www.amazon.cn/gp/help/customer/display.html/ref=footer_privacy?ie=UTF8&nodeId=200347130">隐私声明</a> </div> <div class="a-text-center a-size-mini a-color-secondary"> © 1996-2015, Amazon.com, Inc. or its affiliates <script> if (true === true) { document.write(‘<img src="https://fls-cn.amaz‘+‘on.cn/‘+‘1/oc-csi/1/OP/requestId=7M7370PKHPW590MJV57S&js=1" />‘); }; </script> <noscript> <img src="https://fls-cn.amazon.cn/1/oc-csi/1/OP/requestId=7M7370PKHPW590MJV57S&js=0" /> </noscript> </div> </div> <script> if (true === true) { var elem = document.createElement("script"); elem.src = "https://images-cn.ssl-images-amazon.com/images/G/01/csminstrumentation/csm-captcha-instrumentation.min._V" + (+ new Date()) + "_.js"; document.getElementsByTagName(‘head‘)[0].appendChild(elem); } </script> </body></html>
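The returned page is titled "Amazon CAPTCHA", and its HTML comment points automated clients to the Marketplace APIs, so the server treated this request as automated access and refused to serve the product page. Because the server did return a page, the failure is not a network error.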
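A crawler can check for this situation programmatically rather than by reading the HTML by eye. The sketch below uses two markers taken from the response printed above (the page title and the form action); the function name looks_like_captcha is hypothetical, and Amazon may change these markers at any time.

```python
import re

# Markers copied from the response printed above.
CAPTCHA_TITLE = "Amazon CAPTCHA"
CAPTCHA_FORM_ACTION = "/errors/validateCaptcha"


def looks_like_captcha(html):
    """Return True if the HTML resembles the CAPTCHA page shown above."""
    match = re.search(r"<title[^>]*>(.*?)</title>", html,
                      re.IGNORECASE | re.DOTALL)
    title = match.group(1).strip() if match else ""
    return title == CAPTCHA_TITLE or CAPTCHA_FORM_ACTION in html


sample = '<html><head><title dir="ltr">Amazon CAPTCHA</title></head></html>'
print(looks_like_captcha(sample))
```

Calling it on r.text after a request makes it possible to tell a blocked request from a real product page even when the status code alone is not conclusive.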
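2.4) Printing the request headers
# Print the headers that were sent to the Amazon site
print(r.request.headers)

# Header contents
{'User-Agent': 'python-requests/2.21.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
The printed headers show that our crawler faithfully told the server that the request came from the python-requests library. If the Amazon server inspects the source of each request, requests like this one are rejected.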
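The default User-Agent can also be inspected without sending a request. A minimal sketch, assuming the installed requests version exposes requests.utils.default_headers() as current releases do:

```python
import requests

# Headers that requests attaches when the caller supplies none.
defaults = requests.utils.default_headers()

# The value has the form "python-requests/<version>"; the version
# depends on the installed release.
print(defaults["User-Agent"])
```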
2.5) Modifying the request headers
kv = {'user-agent': 'Mozilla/5.0'}
r = requests.get(url, headers=kv)
print(r.status_code)
print(r.request.headers)
Output:
200
{'user-agent': 'Mozilla/5.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
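To confirm which headers would go out before actually sending anything, the request can be prepared through a Session. This is a sketch rather than the original author's method; the URL is a shortened form of the one used above and serves only as a placeholder, since nothing is sent.

```python
import requests

# Shortened form of the book URL used above; a placeholder, never requested.
url = "https://www.amazon.cn/dp/B01H36S9MO"
kv = {"user-agent": "Mozilla/5.0"}

session = requests.Session()
# prepare_request() merges the session's default headers with ours
# and returns a PreparedRequest; no network traffic takes place.
prepared = session.prepare_request(requests.Request("GET", url, headers=kv))

print(prepared.headers["User-Agent"])
print(dict(prepared.headers))
```

Header names are matched case-insensitively, which is why the lowercase 'user-agent' key replaces the default 'User-Agent' entry instead of being sent alongside it.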
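2.6) Difference from the JD crawl
Unlike the JD crawl, the Amazon crawl has to modify the User-Agent header field so that the request looks like it comes from a browser.
import requests

url = "https://www.amazon.cn/dp/B01H36S9MO/ref=sr_1_1?__mk_zh_CN=%E4%BA%9A%E9%A9%AC%E9%80%8A%E7%BD%91%E7%AB%99&keywords=%E7%99%BD%E8%AF%B4&qid=1565584830&s=gateway&sr=8-1"
try:
    # present a browser-like User-Agent instead of python-requests
    kv = {'user-agent': 'Mozilla/5.0'}
    r = requests.get(url, headers=kv)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.text[:1000])
except requests.RequestException:
    print("Crawl failed")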
Original article: https://www.cnblogs.com/Robin5/p/11339005.html