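I: Introduction to HttpClient
HttpClient began as a subproject of Apache Jakarta Commons and is now part of the Apache HttpComponents project. It is an efficient, up-to-date, feature-rich client-side toolkit for the HTTP protocol. The 4.5.x line covered in this article targets HTTP/1.0 and HTTP/1.1.
The Hyper-Text Transfer Protocol (HTTP) is perhaps the most significant protocol used on the Internet today. Web services, network-enabled appliances and the growth of network computing continue to expand the role of HTTP beyond user-driven web browsers, while increasing the number of applications that require HTTP support. Although the java.net package provides basic functionality for accessing resources via HTTP, it does not provide the full flexibility or functionality that many applications need. HttpClient seeks to fill this void by providing an efficient, up-to-date and feature-rich package that implements the client side of the most recent HTTP standards and recommendations. Designed for extension while providing robust support for the base HTTP protocol, HttpClient may be of interest to anyone building HTTP-aware client applications such as web browsers, web service clients, or systems that leverage or extend HTTP for distributed communication.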
HttpClient home page: http://hc.apache.org/
HttpClient downloads: http://hc.apache.org/downloads.cgi
Version 4.5.x (used in this article): http://hc.apache.org/httpcomponents-client-4.5.x/
Official tutorial: http://hc.apache.org/httpcomponents-client-4.5.x/tutorial/html/index.html
Maven dependency:
<dependency>
<groupId>org.apache.httpcomponents</groupId>
<artifactId>httpclient</artifactId>
<version>4.5.2</version>
</dependency>
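II: HttpClient usage flow
Sending a request and receiving a response with HttpClient is simple. It generally takes the following steps:
1. Create an HttpClient instance.
2. Create the request object (HttpGet for a GET request, HttpPost for a POST request).
3. Execute the request with the client to obtain the response.
4. Read the entity returned in the response.
5. Close the response and the client to release their resources.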
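III: HelloWorld program
1. Create the HelloWorld program
import java.io.IOException;

import org.apache.http.HttpEntity;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class HelloWorld2 {
    public static void main(String[] args) throws ClientProtocolException, IOException {
        // 1. Create the HttpClient instance
        CloseableHttpClient httpClient = HttpClients.createDefault();
        // 2. Create the HttpGet instance (the request)
        HttpGet httpGet = new HttpGet("http://www.java1234.com");
        // 3. Let the client execute the GET request
        CloseableHttpResponse response = httpClient.execute(httpGet);
        // 4. Get the returned entity and read the page content as UTF-8 text
        HttpEntity entity = response.getEntity();
        String content = EntityUtils.toString(entity, "utf-8");
        System.out.println("Page content: " + content);
        // 5. Close the resources (skipped if an exception is thrown earlier)
        response.close();
        httpClient.close();
    }
}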
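Step 5 above only runs when nothing throws before it. CloseableHttpClient and CloseableHttpResponse both implement Closeable, so they can be declared in a try-with-resources statement instead. The sketch below does not call HttpClient at all: it uses a made-up stand-in class, Resource, purely to show the order in which try-with-resources closes things, with and without a failure in the body.

```java
import java.util.ArrayList;
import java.util.List;

class CloseOrderSketch {
    // Hypothetical stand-in for a closeable resource such as a client or a response.
    static class Resource implements AutoCloseable {
        private final String name;
        private final List<String> log;

        Resource(String name, List<String> log) {
            this.name = name;
            this.log = log;
            log.add("open " + name);
        }

        @Override
        public void close() {
            log.add("close " + name);
        }
    }

    // Opens two resources, optionally fails inside the body, and returns the event log.
    static List<String> run(boolean failInBody) {
        List<String> log = new ArrayList<>();
        try (Resource client = new Resource("client", log);
             Resource response = new Resource("response", log)) {
            if (failInBody) {
                throw new IllegalStateException("simulated failure");
            }
            log.add("read body");
        } catch (IllegalStateException e) {
            // Both resources are already closed when this handler runs.
            log.add("caught " + e.getMessage());
        }
        return log;
    }

    public static void main(String[] args) {
        System.out.println(run(false));
        System.out.println(run(true));
    }
}
```

In both runs the response is closed before the client, which is the same order the manual close() calls use.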
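2. Create an HttpGet request
Add the dependencies (fluent-hc provides the fluent API and httpmime adds multipart support; the examples below only use classes that ship with httpclient itself):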
<dependency>
<groupId>org.apache.httpcomponents</groupId>
<artifactId>fluent-hc</artifactId>
<version>4.5.5</version>
</dependency>
<dependency>
<groupId>org.apache.httpcomponents</groupId>
<artifactId>httpmime</artifactId>
<version>4.5.5</version>
</dependency>
import java.io.IOException;

import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class MyTest {
    public static void main(String[] args) {
        get();
    }

    private static void get() {
        // Create the HttpClient client (like opening a browser)
        CloseableHttpClient httpClient = HttpClients.createDefault();
        // Create the HttpGet request (like typing the URL)
        HttpGet httpGet = new HttpGet("http://localhost:8080/content/page?draw=1&start=0&length=10");
        // Ask for a persistent connection
        httpGet.setHeader("Connection", "keep-alive");
        // Set the User-Agent (imitates a browser; this is not a proxy setting)
        httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36");
        // Set the Cookie
        httpGet.setHeader("Cookie", "UM_distinctid=16442706a09352-0376059833914f-3c604504-1fa400-16442706a0b345; CNZZDATA1262458286=1603637673-1530123020-%7C1530123020; JSESSIONID=805587506F1594AE02DC45845A7216A4");
        // Send the request (like pressing Enter)
        CloseableHttpResponse httpResponse = null;
        try {
            // Execute the request and get the response
            httpResponse = httpClient.execute(httpGet);
            HttpEntity httpEntity = httpResponse.getEntity();
            // Print the response body
            System.out.println(EntityUtils.toString(httpEntity));
        } catch (IOException e) {
            e.printStackTrace();
        } finally { // Always close the connection, whatever happened
            if (httpResponse != null) {
                try {
                    httpResponse.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
            if (httpClient != null) {
                try {
                    httpClient.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
    }
}
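The GET example writes its query string (draw=1&start=0&length=10) directly into the URL. That works for plain ASCII values, but values containing spaces, & or non-ASCII characters need to be URL-encoded first. Here is a minimal sketch using only the JDK's URLEncoder; the class and method names (QueryStringSketch, encode, build) are made up for illustration and are not part of HttpClient.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

class QueryStringSketch {
    // URL-encodes a single name or value as UTF-8 (form encoding: a space becomes '+').
    static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    // Joins the parameters as name=value pairs separated by '&', in map iteration order.
    static String build(Map<String, String> params) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> entry : params.entrySet()) {
            if (sb.length() > 0) {
                sb.append('&');
            }
            sb.append(encode(entry.getKey())).append('=').append(encode(entry.getValue()));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("draw", "1");
        params.put("start", "0");
        params.put("length", "10");
        System.out.println("http://localhost:8080/content/page?" + build(params));
    }
}
```

The resulting string can be passed to new HttpGet(...) exactly like the hand-written URL.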
3. Create an HttpPost request
import java.io.IOException;
import java.io.UnsupportedEncodingException;
import java.util.ArrayList;
import java.util.List;

import org.apache.http.HttpEntity;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.entity.UrlEncodedFormEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.message.BasicNameValuePair;
import org.apache.http.util.EntityUtils;

public class MyTest {
    public static void main(String[] args) {
        post();
    }

    private static void post() {
        // Create the HttpClient client
        CloseableHttpClient httpClient = HttpClients.createDefault();
        // Create the HttpPost request
        HttpPost httpPost = new HttpPost("http://localhost:8080/content/page");
        // Ask for a persistent connection
        httpPost.setHeader("Connection", "keep-alive");
        // Set the User-Agent (imitates a browser; this is not a proxy setting)
        httpPost.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36");
        // Set the Cookie
        httpPost.setHeader("Cookie", "UM_distinctid=16442706a09352-0376059833914f-3c604504-1fa400-16442706a0b345; CNZZDATA1262458286=1603637673-1530123020-%7C1530123020; JSESSIONID=805587506F1594AE02DC45845A7216A4");
        // Build the form parameters as key-value pairs
        List<BasicNameValuePair> params = new ArrayList<BasicNameValuePair>();
        params.add(new BasicNameValuePair("draw", "1"));
        params.add(new BasicNameValuePair("start", "0"));
        params.add(new BasicNameValuePair("length", "10"));
        CloseableHttpResponse httpResponse = null;
        try {
            // Attach the parameters to the request as a URL-encoded form body
            httpPost.setEntity(new UrlEncodedFormEntity(params, "UTF-8"));
            httpResponse = httpClient.execute(httpPost);
            HttpEntity httpEntity = httpResponse.getEntity();
            // Print the response body
            System.out.println(EntityUtils.toString(httpEntity));
        } catch (UnsupportedEncodingException e) {
            e.printStackTrace();
        } catch (ClientProtocolException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        } finally { // Always close the connection, whatever happened
            try {
                if (httpResponse != null) {
                    httpResponse.close();
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
            try {
                if (httpClient != null) {
                    httpClient.close();
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}
IV: Fetching web pages while imitating a browser
1. Set the User-Agent request header to imitate a browser (Chrome in this case)
httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36");
2. Get the response Content-Type
// Get the response Content-Type; getName() returns the header name, getValue() returns the header value
entity.getContentType().getValue();
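The Content-Type value often carries the page's character set, for example text/html; charset=UTF-8, which is handy when deciding how to decode the body. The following is a rough sketch that pulls the charset out of such a string with plain JDK calls. ContentTypeSketch and charsetOf are hypothetical names, and the parsing is deliberately simplistic (it does not handle every form of parameter quoting).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ContentTypeSketch {
    // Returns the charset named in a Content-Type value, or the fallback if none is usable.
    static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        String[] parts = contentType.split(";");
        // parts[0] is the media type; the remaining parts are parameters.
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            if (param.regionMatches(true, 0, "charset=", 0, 8)) {
                String name = param.substring(8).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported charset name.
                    return fallback;
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=UTF-8", StandardCharsets.ISO_8859_1));
        System.out.println(charsetOf("image/jpeg", StandardCharsets.ISO_8859_1));
    }
}
```

In the examples of this article the input string would come from entity.getContentType().getValue().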
3. Get the response status
response.getStatusLine().getStatusCode();
200 -- OK
403 -- Forbidden (access denied)
500 -- Internal server error
404 -- Page not found
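A crawler usually cares less about the exact code than about its class, which is given by the first digit: 2xx means success, 4xx means the request was at fault, 5xx means the server was at fault. A small sketch of that grouping follows; StatusCodeSketch and category are illustrative names, and the int would come from response.getStatusLine().getStatusCode().

```java
class StatusCodeSketch {
    // Maps an HTTP status code to its class, based on the first digit.
    static String category(int code) {
        switch (code / 100) {
            case 1: return "informational";
            case 2: return "success";
            case 3: return "redirection";
            case 4: return "client error";
            case 5: return "server error";
            default: return "unknown";
        }
    }

    public static void main(String[] args) {
        int[] codes = {200, 403, 404, 500};
        for (int code : codes) {
            System.out.println(code + " -> " + category(code));
        }
    }
}
```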
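4. Example
import java.io.IOException;

import org.apache.http.HttpEntity;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

public class Demo2 {
    public static void main(String[] args) throws ClientProtocolException, IOException {
        // 1. Create the HttpClient instance
        CloseableHttpClient httpClient = HttpClients.createDefault();
        // 2. Create the HttpGet instance (the request)
        HttpGet httpGet = new HttpGet("http://www.tuicool.com");
        // Set the User-Agent request header to imitate a browser (Chrome in this case)
        httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36");
        // 3. Let the client execute the GET request
        CloseableHttpResponse response = httpClient.execute(httpGet);
        // Print the response status code
        System.out.println("Status:" + response.getStatusLine().getStatusCode());
        // 4. Get the returned entity
        HttpEntity entity = response.getEntity();
        // Print the Content-Type; getName() returns the header name, getValue() the header value
        System.out.println("Content-Type:" + entity.getContentType().getValue());
        // Read the page content (left disabled here; needs org.apache.http.util.EntityUtils)
        // String content = EntityUtils.toString(entity, "utf-8");
        // System.out.println("Page content: " + content);
        // 5. Close the resources
        response.close();
        httpClient.close();
    }
}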
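V: Downloading an image with HttpClient
import java.io.File;
import java.io.IOException;
import java.io.InputStream;

import org.apache.commons.io.FileUtils;
import org.apache.http.HttpEntity;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

public class Demo1 {
    public static void main(String[] args) throws ClientProtocolException, IOException {
        // 1. Create the HttpClient instance
        CloseableHttpClient httpClient = HttpClients.createDefault();
        // 2. Create the HttpGet instance (the request)
        HttpGet httpGet = new HttpGet("http://www.java1234.com/uploads/allimg/161105/1-161105150121954.jpg");
        // Set the User-Agent request header to imitate a browser (Chrome in this case)
        httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36");
        // 3. Let the client execute the GET request
        CloseableHttpResponse response = httpClient.execute(httpGet);
        // 4. Get the returned entity
        HttpEntity entity = response.getEntity();
        if (entity != null) {
            // Print the entity's content type
            System.out.println("Content-Type:" + entity.getContentType().getValue());
            // Get the entity's input stream
            InputStream inputStream = entity.getContent();
            // Copy the stream into a new file (FileUtils comes from Apache Commons IO, a separate dependency)
            FileUtils.copyToFile(inputStream, new File("E://mysource/picture/aaa.jpg"));
        }
        // 5. Close the resources
        response.close();
        httpClient.close();
    }
}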
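The image example relies on FileUtils from Apache Commons IO to write the stream to disk. If you would rather not add that dependency, the JDK's java.nio.file.Files can do the same job. The sketch below uses a ByteArrayInputStream in place of entity.getContent() so that it runs without network access; SaveStreamSketch, save and roundTrip are made-up names.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class SaveStreamSketch {
    // Copies the stream to the target file, creating parent directories as needed.
    // Returns the number of bytes written.
    static long save(InputStream in, Path target) throws IOException {
        Path parent = target.toAbsolutePath().getParent();
        if (parent != null) {
            Files.createDirectories(parent);
        }
        return Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
    }

    // Writes the bytes to a file in a fresh temp directory and reads them back.
    static byte[] roundTrip(byte[] data) {
        try {
            Path target = Files.createTempDirectory("demo").resolve("picture").resolve("aaa.jpg");
            try (InputStream in = new ByteArrayInputStream(data)) {
                save(in, target);
            }
            return Files.readAllBytes(target);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] fakeImage = {(byte) 0xFF, (byte) 0xD8, (byte) 0xFF, (byte) 0xD9};
        System.out.println(roundTrip(fakeImage).length + " bytes written and read back");
    }
}
```

With HttpClient, the call would be save(entity.getContent(), Paths.get(...)) with whatever target path you choose.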
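VI: Using a proxy IP with HttpClient
When crawling web pages, some target sites have anti-crawler mechanisms: they block the IP of clients that visit too frequently or in too regular a pattern.
Proxy IPs come in several kinds: transparent proxies, anonymous proxies, distorting proxies, and elite (high-anonymity) proxies.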
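1. Transparent proxy
REMOTE_ADDR = Proxy IP
HTTP_VIA = Proxy IP
HTTP_X_FORWARDED_FOR = Your IP
A transparent proxy does keep your IP address out of REMOTE_ADDR, but the server can still find out who you are from HTTP_X_FORWARDED_FOR.
2. Anonymous proxy
REMOTE_ADDR = Proxy IP
HTTP_VIA = Proxy IP
HTTP_X_FORWARDED_FOR = Proxy IP
An anonymous proxy is a step up from a transparent one: others can tell that you are using a proxy, but not who you are.
3. Distorting proxy
REMOTE_ADDR = Proxy IP
HTTP_VIA = Proxy IP
HTTP_X_FORWARDED_FOR = Random IP address
As with an anonymous proxy, others can still tell that you are using a proxy, but they are given a fake IP address, which makes the disguise more convincing.
4. Elite proxy (high-anonymity proxy)
REMOTE_ADDR = Proxy IP
HTTP_VIA = not determined
HTTP_X_FORWARDED_FOR = not determined
As you can see, an elite proxy gives others no indication that a proxy is in use at all, so it is the best choice.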
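The four header patterns above amount to a small decision table, which can be written down as code from the server's point of view. This is only a sketch of the table: ProxyTypeSketch and classify are hypothetical names, the IP addresses are documentation-range placeholders, and the handling of a Via header without X-Forwarded-For is an assumption that the table does not cover.

```java
class ProxyTypeSketch {
    enum ProxyType { TRANSPARENT, ANONYMOUS, DISTORTING, ELITE }

    // via and xForwardedFor are the header values seen by the server (null if absent);
    // clientIp is the real client address and proxyIp is the proxy's address.
    static ProxyType classify(String via, String xForwardedFor, String clientIp, String proxyIp) {
        if (xForwardedFor == null) {
            // No forwarded address: elite if the proxy also hides itself.
            // Assumption: a Via header alone is treated like an anonymous proxy.
            return via == null ? ProxyType.ELITE : ProxyType.ANONYMOUS;
        }
        if (xForwardedFor.equals(clientIp)) {
            return ProxyType.TRANSPARENT; // the real client IP is exposed
        }
        if (xForwardedFor.equals(proxyIp)) {
            return ProxyType.ANONYMOUS; // only the proxy IP is visible
        }
        return ProxyType.DISTORTING; // some other (fake) address is reported
    }

    public static void main(String[] args) {
        String client = "203.0.113.7";
        String proxy = "198.51.100.9";
        System.out.println(classify(proxy, client, client, proxy));
        System.out.println(classify(proxy, proxy, client, proxy));
        System.out.println(classify(proxy, "192.0.2.55", client, proxy));
        System.out.println(classify(null, null, client, proxy));
    }
}
```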
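So where do proxy IPs come from? A web search turns up plenty of proxy IP sites. They usually offer some free proxies, but paying for a commercial API is more convenient; one example is http://www.66ip.cn/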
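5. Example
import java.io.IOException;

import org.apache.http.HttpEntity;
import org.apache.http.HttpHost;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class Demo1 {
    public static void main(String[] args) throws ClientProtocolException, IOException {
        // 1. Create the HttpClient instance
        CloseableHttpClient httpClient = HttpClients.createDefault();
        // 2. Create the HttpGet instance (the request)
        HttpGet httpGet = new HttpGet("http://www.tuicool.com");
        // Route the request through a proxy (IP address and port)
        HttpHost proxy = new HttpHost("42.121.15.99", 3128);
        RequestConfig config = RequestConfig.custom().setProxy(proxy).build();
        httpGet.setConfig(config);
        // Set the User-Agent request header to imitate a browser (Chrome in this case)
        httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36");
        // 3. Let the client execute the GET request
        CloseableHttpResponse response = httpClient.execute(httpGet);
        // 4. Get the returned entity and read the page content as UTF-8 text
        HttpEntity entity = response.getEntity();
        String content = EntityUtils.toString(entity, "utf-8");
        System.out.println("Page content: " + content);
        // 5. Close the resources
        response.close();
        httpClient.close();
    }
}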
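VII: HttpClient connect timeout and read timeout
When HttpClient executes an HTTP request, two separate waits are involved: the time to establish the connection and the time to read the content.
The connect time runs from the moment HttpClient starts sending the request until the connection to the target host is established. The connect timeout caps this wait.
The read time covers the period after HttpClient has connected to the target server, while it is fetching the response data. The read (socket) timeout caps how long the client waits for data to arrive, that is, the maximum period of inactivity between two consecutive data packets.
Overseas Maven repository address (used as the test URL below): http://central.maven.org/maven2/
Example: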
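import java.io.IOException;

import org.apache.http.HttpEntity;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class Demo1 {
    public static void main(String[] args) throws ClientProtocolException, IOException {
        // 1. Create the HttpClient instance
        CloseableHttpClient httpClient = HttpClients.createDefault();
        // 2. Create the HttpGet instance (the request)
        HttpGet httpGet = new HttpGet("http://central.maven.org/maven2/");
        // Set the connect timeout and the read timeout
        RequestConfig config = RequestConfig.custom()
                .setConnectTimeout(1000) // connect timeout, in milliseconds
                .setSocketTimeout(1000)  // read (socket) timeout, in milliseconds
                .build();
        httpGet.setConfig(config);
        // Set the User-Agent request header to imitate a browser (Chrome in this case)
        httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36");
        // 3. Let the client execute the GET request
        CloseableHttpResponse response = httpClient.execute(httpGet);
        // 4. Get the returned entity and read the page content as UTF-8 text
        HttpEntity entity = response.getEntity();
        String content = EntityUtils.toString(entity, "utf-8");
        System.out.println("Page content: " + content);
        // 5. Close the resources
        response.close();
        httpClient.close();
    }
}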
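To see the difference between the two timeouts without depending on a remote site, here is a small sketch built on plain java.net sockets rather than HttpClient. A local ServerSocket accepts the TCP connection at the operating-system level but never sends any data, so the connect succeeds quickly and the read then runs into its timeout. TimeoutSketch and probe are made-up names.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

class TimeoutSketch {
    // Connects to a local server that never writes, then tries to read one byte.
    static String probe(int connectTimeoutMs, int readTimeoutMs) {
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket socket = new Socket()) {
            // Connect phase: bounded by the connect timeout.
            socket.connect(
                    new InetSocketAddress(server.getInetAddress(), server.getLocalPort()),
                    connectTimeoutMs);
            // Read phase: bounded by the socket (read) timeout.
            socket.setSoTimeout(readTimeoutMs);
            try {
                int b = socket.getInputStream().read(); // blocks, nothing is ever sent
                return "read returned " + b;
            } catch (SocketTimeoutException e) {
                return "read timeout";
            }
        } catch (SocketTimeoutException e) {
            return "connect timeout";
        } catch (IOException e) {
            return "io error: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(probe(1000, 200));
    }
}
```

In the HttpClient example, setConnectTimeout plays the role of the timeout passed to connect(), and setSocketTimeout plays the role of setSoTimeout().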
Original article: https://www.cnblogs.com/itzlg/p/10699496.html