
Java Network Programming: Web Crawlers


Preface

Frankly, I know that writing a crawler in Java is far more involved than doing it in Python, thanks to Python's head start and its wealth of third-party libraries. But at my current level, Java is what I have to work with.

Approach:
Use HttpClient to request the page and get the HTML source; parse the data with jsoup (see the sketch below).
Use HttpClient to request the page and get the HTML source; parse the data with regular expressions.
Note:
jsoup is a Java HTML parser that can parse a URL or raw HTML text directly.
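
Here is a minimal sketch of the jsoup side of the first approach (the URL is only a placeholder; in this sketch jsoup issues the HTTP request itself, while in the project HttpClient handles that part):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupQuickStart {
    public static void main(String[] args) throws Exception {
        // Fetch and parse the page in one call
        Document doc = Jsoup.connect("http://www.baidu.com/").get();
        System.out.println(doc.title());
        // Print the text and the absolute target of every link on the page
        for (Element a : doc.select("a[href]")) {
            System.out.println(a.text() + " -> " + a.absUrl("href"));
        }
    }
}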

My school asked us to do a project from home, and I couldn't think of anything particularly novel. Crawlers have always seemed a bit magical to me, so I registered a crawler project. Whatever comes of it, I want to finish it properly. In the project description I wrote that my crawler can fetch audio, video, text, images, and HTML responses. The project is small, but the bugs keep coming, so I have decided to write this up as I build it....

What the data is for

Many people online say crawling is illegal. In fact, as long as we crawl for learning purposes and do not harm anyone else's interests, we stay on the right side of the law; the technology itself is innocent, it is people who misuse it.

Quite the opposite, in fact: crawlers have many uses. Baidu itself is a crawler. Once your personal website or blog gets some traffic, you will find it can be found through Baidu search; that is because Baidu has crawled your site and added the information to its search engine index.

Under the influence of big data, data is money.

Finance needs market analysis, e-commerce needs product research, and all kinds of rankings depend on data analysis. The tool that collects that data is the crawler.

Project screenshots

(screenshots omitted)
The UI is built with JavaFX, which I am still rather fuzzy about... The project will be packaged and released later: double-click the .exe file to start it. It goes into a network drive together with the source code, linked at the end of this article. I will break the project's features down one by one so that each can be used on its own when needed.

URL

URL is short for Uniform Resource Locator; it denotes the address of a resource on the Internet, and through a URL we can access all kinds of resources on the network.

A URL object always represents an absolute URL, but it can be constructed from an absolute URL, a relative URL, or a partial URL (see the example below).
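
For example (a small sketch; example.com is just a placeholder host):

import java.net.URL;

public class UrlDemo {
    public static void main(String[] args) throws Exception {
        URL base = new URL("https://example.com/docs/index.html");  // an absolute URL
        URL resolved = new URL(base, "images/logo.png");            // a relative URL resolved against the base
        System.out.println(resolved);   // https://example.com/docs/images/logo.png
        System.out.println(resolved.getProtocol() + " | " + resolved.getHost() + " | " + resolved.getPath());
    }
}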

Going online from a Java program

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;

public class t {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://www.baidu.com/");
        // Build the URL object
        URLConnection connection = url.openConnection();
        // Open the connection
        connection.connect();
        // Access the resource; read it back through an IO stream
        BufferedReader br = new BufferedReader(new InputStreamReader(
                connection.getInputStream(), "UTF-8"));
        String line = null;
        while (null != (line = br.readLine())) {
            System.out.println(line);
        }
        br.close();
    }
}

A URL lets you reach a remote resource. Its openConnection() method creates and returns a URLConnection object associated with the URL it is called on; the call, and the connection itself, may throw an IOException. Before connecting you can also configure the connection, as in the sketch below.
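
A hedged sketch of that configuration step (the User-Agent string and timeout values are just illustrative; real sites may need different ones):

import java.net.URL;
import java.net.URLConnection;

public class ConnectionConfig {
    public static void main(String[] args) throws Exception {
        URLConnection connection = new URL("http://www.baidu.com/").openConnection();
        // Some sites reject the default Java User-Agent, so present a browser-like one
        connection.setRequestProperty("User-Agent", "Mozilla/5.0");
        // Fail fast instead of hanging when the site does not answer
        connection.setConnectTimeout(5000);
        connection.setReadTimeout(5000);
        connection.connect();
        System.out.println("Content-Type: " + connection.getContentType());
    }
}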


Getting the text of a web page

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.List;
import java.util.Scanner;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

/**
 * Utility class for grabbing web page data.
 * @author asus-pc
 */
public class GrabUrl {

    /** Download the raw HTML of a page as a UTF-8 string. */
    public static String getUrlText(String url) throws Exception {
        URL getUrl = new URL(url);
        HttpURLConnection connection = (HttpURLConnection) getUrl.openConnection();
        connection.connect();
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(connection.getInputStream(), "utf-8"));
        StringBuffer buffer = new StringBuffer();
        String line = null;
        while ((line = reader.readLine()) != null) {
            buffer.append(line).append("\n");
        }
        reader.close();
        connection.disconnect();
        return buffer.toString();
    }

    /** Recursively collect the text of a node and all of its children. */
    private static String extractText(Node node) {
        if (node instanceof TextNode) {
            return ((TextNode) node).text();
        }
        List<Node> children = node.childNodes();
        StringBuffer buffer = new StringBuffer();
        for (Node child : children) {
            buffer.append(extractText(child));
        }
        return buffer.toString();
    }

    /** Strip the tags from an HTML string and keep only its text. */
    public static String html2Str(String html) {
        Document doc = Jsoup.parse(html);
        return extractText(doc);
    }

    /** Test */
    public static void main(String[] args) {
        try {
            System.out.println("请输入网址:");
            Scanner scanner = new Scanner(System.in);
            String urlString = scanner.next();
            String html = getUrlText(urlString);
            System.out.println(html2Str(html));
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

——————————————————————————————————

Jar dependency (Maven):
<dependency>
  <groupId>org.jsoup</groupId>
  <artifactId>jsoup</artifactId>
  <version>1.8.3</version>
</dependency>
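
As a side note, jsoup can already do what extractText/html2Str do above: Document.text() returns the visible text of the whole document. A roughly equivalent sketch (not part of the original project):

import org.jsoup.Jsoup;

public class Html2Text {
    // Roughly equivalent to GrabUrl.html2Str: parse the HTML and return only its text
    public static String html2Str(String html) {
        return Jsoup.parse(html).text();
    }

    public static void main(String[] args) {
        System.out.println(html2Str("<p>Hello <b>world</b></p>"));  // prints: Hello world
    }
}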

Getting a site's response headers

import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class GetHttpHead {
    public static String HeadMessage(String Url) {
        String result = "非正常的响应信息!";        // default result if the site does not respond normally
        URL url;
        try {
            url = new URL(Url);                     // build the URL object
        } catch (MalformedURLException e) {
            e.printStackTrace();
            System.out.println("非正常访问的URL");
            return result;
        }
        URLConnection conn;
        try {
            conn = url.openConnection();
        } catch (IOException e) {
            e.printStackTrace();
            System.out.println("系统抛出了异常信息");
            return result;
        }
        // Header fields come back as a map of name -> list of values;
        // the status line ("HTTP/1.1 200 OK") is stored under the null key
        Map<String, List<String>> headers = conn.getHeaderFields();
        Set<String> keys = headers.keySet();
        StringBuilder body = new StringBuilder();
        boolean ok = false;
        for (String key : keys) {
            String val = conn.getHeaderField(key);
            body.append(key).append("    ").append(val).append("\n");
            if ("HTTP/1.1 200 OK".equals(val)) {
                ok = true;
            }
        }
        if (ok) {
            result = "站点可以正常访问\n响应信息:";
        }
        System.out.println(conn.getLastModified());

        return result + body;
    }

//    public static void main(String[] args) throws IOException {
//
//        System.out.println( HeadMessage("http://z.ahdy.top"));
//    }
}
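
If all you need to know is whether the site answers normally, HttpURLConnection can report the numeric status code directly, which avoids matching the status line as a string. A sketch, not part of the original project:

import java.net.HttpURLConnection;
import java.net.URL;

public class StatusCheck {
    public static void main(String[] args) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL("http://z.ahdy.top").openConnection();
        conn.setRequestMethod("HEAD");       // ask only for the headers, not the body
        conn.setConnectTimeout(5000);
        conn.setReadTimeout(5000);
        int code = conn.getResponseCode();   // 200 means the site responded normally
        System.out.println("HTTP status: " + code);
        conn.disconnect();
    }
}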

Downloading a site's images

With thanks to the original author!

package GetImg;

import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileOutputStream;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.jsoup.Jsoup;
import org.jsoup.helper.StringUtil;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

/**
 *
 * @ClassName: ImgTool
 * @Description: Image tool. Set URL first; for the given page it can
 *                  1. return a list of all image URLs
 *                  2. write the image URLs into a .txt file
 *                  3. save all images into the img folder
 * @author penny
 * @date 2020-03-07 13:07:49
 *
 */
public class ImgTool {
    // Fields
    /** Target URL */
    public static String URL = "http://z.ahdy.top/";
    private int imgNumbs = 0;
    public static List<String> downloadMsg = new ArrayList<String>();
    public String imgUrlTxt = "imgURLs.txt";
    public static String regex = "^((https|http|ftp|rtsp|mms)?://)"
            + "?(([0-9a-z_!~*'().&=+$%-]+: )?[0-9a-z_!~*'().&=+$%-]+@)?" // ftp user@
            + "(([0-9]{1,3}\\.){3}[0-9]{1,3}" // IP-style URL - 199.194.52.184
            + "|" // allow both IP and DOMAIN
            + "([0-9a-z_!~*'()-]+\\.)*" // domain prefix - www.
            + "([0-9a-z][0-9a-z-]{0,61})?[0-9a-z]\\." // second-level domain
            + "[a-z]{2,6})" // first level domain - .com or .museum
            + "(:[0-9]{1,4})?" // port - :80
            + "((/?)|" // a slash isn't required if there is no file name
            + "(/[0-9a-z_!~*'().;?:@&=+$,%#-]+)+/?)$";

    private ImgTool() {
    }

    private static ImgTool instance = new ImgTool();

    /** 获取ImgTool 单例 */
    public static ImgTool getInstance() {
        return instance;
    }

    public List<String> getURLs() {
        return getURLs(null);
    }
    public boolean isURL(String str) {
        if (StringUtil.isBlank(str)) {
            return false;
        }
        // Alternative patterns that could also be used:
        // String regex = "^(?:https?://)?[\\w]{1,}(?:\\.?[\\w]{1,})+[\\w-_/?&=#%:]*$";
        // String regex = "^([hH][tT]{2}[pP]:/*|[hH][tT]{2}[pP][sS]:/*|[fF][tT][pP]:/*)(([A-Za-z0-9-~]+).)+([A-Za-z0-9-~\\/])+(\\?{0,1}(([A-Za-z0-9-~]+\\={0,1})([A-Za-z0-9-~]*)\\&{0,1})*)$";
        Pattern pattern = Pattern.compile(regex);
        Matcher matcher = pattern.matcher(str);
        return matcher.matches();
    }
    /***
     * @Title: getURLs
     * @Description: collect the image URLs matched by the given cssQuery selector
     * @param cssQuery CSS (or jQuery-like) selector syntax; see the Jsoup Selector docs:
     *        <a href="https://jsoup.org/apidocs/org/jsoup/select/Selector.html"></a>
     * @return List of image URLs
     */
    public List<String> getURLs(String cssQuery) {
        List<String> urls = null;
        Document doc;
        Elements imgElements ;
        if (!isURL(URL)) {
            return null;
        }
        if(StringUtil.isBlank(cssQuery)){
            cssQuery="img";
        }
        try {
            doc = Jsoup.connect(URL).get();
        } catch (IOException e) {
            e.printStackTrace();
            return null;
        }
        if(doc==null)return null;
        imgElements = doc.select(cssQuery);
        urls = new ArrayList<String>();
        for (Object eleObj : imgElements) {
            //"(https?|ftp|http)://[-A-Za-z0-9+&@#/%?=~_|!:,.;]+[-A-Za-z0-9+&@#/%=~_|]"
            Pattern pattern = Pattern.compile("(https?|http)://[-A-Za-z0-9+&@#/%?=~_|!:,.;]+[-A-Za-z0-9+&@#/%=~_|]");
            Matcher matcher = pattern.matcher(eleObj.toString());
            if (matcher.find()) {
                String url = matcher.group();
                urls.add(url);
            }
        }
        imgNumbs = imgElements.size();
        return urls;
    }


    /**
     * @Title: createImgURLTxt
     * @Description: write the matched image URLs into imgURLs.txt
     * @param cssQuery CSS (or jQuery-like) selector; defaults to "img".
     *        See the Jsoup Selector docs:
     *        <a href="https://jsoup.org/apidocs/org/jsoup/select/Selector.html"></a>
     */
    public String createImgURLTxt(String cssQuery) {
        long start = System.currentTimeMillis();
        List<String> urls;
        urls = getURLs(cssQuery);
        BufferedWriter os = null;
        File urlsFiles = new File("D:\\work\\imgURLs.txt");
        if (!urlsFiles.exists()) {
            try {
                urlsFiles.createNewFile();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
        try {
            os = new BufferedWriter(new FileWriter(urlsFiles));
            if(urls==null)return null;
            for (int i = 0; i < urls.size(); i++) {
                os.write(urls.get(i) + "\n");
            }
            String result = "执行完毕,生成imgURLs.txt,耗时"
                    + (System.currentTimeMillis() - start) / 1000 + "s";
            System.out.println(result);
            return result;
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            try {
                if (os != null)
                    os.close();

            } catch (IOException e) {
                e.printStackTrace();
            }
        }
        return null;
    }

    /**
     * @Title: createImgs
     * @Description: download every image listed in imgURLs.txt into D:\work\img
     * @param cssQuery currently unused here; the URLs are read from the txt file
     * @throws IOException
     */
    public void createImgs(String cssQuery) throws IOException {
        long startTime = System.currentTimeMillis();
        downloadMsg.add("Your images is downloading ");
        BufferedReader br = null;
        OutputStream out = null;
        InputStream in = null;
        ArrayList<String> imgList = null;
        HttpURLConnection con = null;

        String url;              // URL of the file to download
        int fileSize = 0;        // size of a single file
        int totalFileNum = 0;    // total number of files
        int downLoadFileNum = 0; // files not yet downloaded
        long totalTime = 0;      // total time / s
        long singleTime = 0;     // time for a single file / ms

        br = new BufferedReader(new FileReader(imgUrlTxt));
        imgList = new ArrayList<String>();
        while ((url = br.readLine()) != null) {
            imgList.add(url);
        }
        downLoadFileNum=totalFileNum= imgList.size();
        downloadMsg.add("总文件数"+(totalFileNum));
        for (String listUrl : imgList) {
            startTime = System.currentTimeMillis();
            String fileName = listUrl.substring(listUrl.lastIndexOf('/') + 1); // extract the file name
            URL imgUrl = new URL(listUrl.trim());
            if (con != null)
                con.disconnect();
            con = (HttpURLConnection) imgUrl.openConnection();
            con.setRequestMethod("GET");
            con.setDoInput(true);
            con.setConnectTimeout(1000 * 30);
            con.setReadTimeout(1000 * 30);
            fileSize = con.getContentLength();
            con.connect();
            try {
                in = con.getInputStream();
                File file = new File("D:\\work\\img" + File.separator, fileName);
                if (!file.exists()) {
                    file.getParentFile().mkdirs();
                    file.createNewFile();
                }
                out = new BufferedOutputStream(new FileOutputStream(file));
                // Wrap the input stream once, instead of once per read call
                BufferedInputStream bin = new BufferedInputStream(in);
                int len = 0;
                byte[] buff = new byte[1024 * 1024];
                while ((len = bin.read(buff)) != -1) {
                    out.write(buff, 0, len);
                }
                out.flush();
                downLoadFileNum--;
            } catch (Exception e) {
                e.printStackTrace();
            } finally {
                if (br != null)
                    try {
                        br.close();
                    } catch (IOException e) {
                        e.printStackTrace();
                    }
                if (in != null)
                    try {
                        in.close();
                    } catch (IOException e) {
                        e.printStackTrace();
                    }
                if (out != null)
                    try {
                        out.close();
                    } catch (IOException e) {
                        e.printStackTrace();
                    }
                singleTime=System.currentTimeMillis() - startTime;
                totalTime+=singleTime;
                downloadMsg.add("文件名" + fileName + "  文件大小" + fileSize
                        +"  未下载文件数"+(downLoadFileNum)
                        +"  下载耗时"+ singleTime + "ms" );
                System.out.println(downloadMsg.get(downloadMsg.size()-1));
            }
        }
        downloadMsg.add("总耗时"+totalTime/1000+"s");
    }

    /**
     * @Title: main
     * @Description: test
     * @param args
     * @throws IOException
     */
    public static void main(String[] args) throws IOException {
        ImgTool img = ImgTool.getInstance();
        img.createImgURLTxt("img");
        List<String> urls = img.getURLs("img");
        for (Object str : urls) {

            System.out.println(str.toString());
        }
        img.createImgs(null);
    }
}
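
For comparison, the regex over eleObj.toString() in getURLs can be replaced by jsoup's own absUrl(), which also resolves relative src attributes against the page URL. A hedged sketch of that alternative (not the project's code):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ImgUrlSketch {
    // Collect absolute image URLs from a page; jsoup resolves relative src values for us
    public static List<String> getImgUrls(String pageUrl) throws IOException {
        Document doc = Jsoup.connect(pageUrl).get();
        List<String> urls = new ArrayList<String>();
        for (Element img : doc.select("img[src]")) {
            String abs = img.absUrl("src");   // empty string when the src cannot be resolved
            if (!abs.isEmpty()) {
                urls.add(abs);
            }
        }
        return urls;
    }

    public static void main(String[] args) throws IOException {
        for (String u : getImgUrls("http://z.ahdy.top/")) {
            System.out.println(u);
        }
    }
}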

Getting the HTML

import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClientBuilder;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
public class GetPage {
    public static String getPage(String url) {
        File file = new File("D:\\work");
        FileWriter fwriter = null;
        HttpClientBuilder httpClientBuilder = HttpClients.custom();
        CloseableHttpClient client = httpClientBuilder.build();
        HttpGet request = new HttpGet(url);
        String content = null;
        String IoMessage = "";
        try {
            CloseableHttpResponse response = client.execute(request);
            HttpEntity entity = response.getEntity();
            content = EntityUtils.toString(entity);
        } catch (IOException e) {
            e.printStackTrace();
            return IoMessage;   // request failed, nothing to write
        }
        try {
            if (!file.exists()) {
                file.mkdirs();  // create the directory
            }
            // The 'true' argument appends instead of overwriting the old content;
            // drop it if you want the file to be overwritten
            file = new File(file + "\\" + "page" + ".txt"); // an existing folder is not recreated
            fwriter = new FileWriter(file, true);
            fwriter.write(content);
        } catch (IOException ex) {
            ex.printStackTrace();
        } finally {
            try {
                if (fwriter != null) {
                    fwriter.flush();
                    fwriter.close();
                    IoMessage = "\n数据已抓取:内容写入至D:\\work\\page.txt\n";
                }
            } catch (IOException ex) {
                ex.printStackTrace();
            }
        }
        return IoMessage + content;
    }

//    public static void main(String[] args) {
//
//        System.out.println( getPage("https://study.163.com"));
//    }
}
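
The same fetch can also be written with try-with-resources, so the client, the response, and the writer are always closed and a failed request can never lead to writing null. A sketch under the same output-path assumption (D:\work\page.txt), not the project's original code:

import java.io.FileWriter;
import java.io.IOException;

import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class GetPageSketch {
    public static String getPage(String url) throws IOException {
        try (CloseableHttpClient client = HttpClients.createDefault();
             CloseableHttpResponse response = client.execute(new HttpGet(url))) {
            String content = EntityUtils.toString(response.getEntity(), "UTF-8");
            // Append the page source to the same file the original code uses
            try (FileWriter writer = new FileWriter("D:\\work\\page.txt", true)) {
                writer.write(content);
            }
            return content;
        }
    }
}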

Getting video listings (Tencent Video)

import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class TencentVideo {
   static FileWriter fileWrite=null;
    public static void getTencentVideoName(String url) {
        List<Map<String,String>> resultList = new ArrayList<Map<String,String>>();
        File file=new File("D:\\work");
        if (!file.exists()){
            file.mkdirs();
        }
        file=new File(file+File.separator+"MoveMessage"+".txt");

        try {
           fileWrite= new FileWriter(file, true);
        } catch (IOException e) {
            e.printStackTrace();
        }


        Document document = null;
        int pageSize = 30;
        int index = 1;
        try {
            for(int i = 0 ; i < 167; i ++) {
                String urlget =  url + (i*pageSize);
                Thread.sleep(1000);
                System.out.println("URL:" + urlget.toString());
                document = Jsoup.connect(urlget).userAgent("Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36")
                        // attach cookie information
                        .cookie("auth", "token")
                        // set the timeout
                        .timeout(30000)
                        // request the URL with get(); post() would also work
                        .get();
                Elements elements = document.select("li.list_item");
                if(elements == null || "".equals(elements.toString())) {
                    break;
                }
                for (Element ele : elements) {
                    Map<String,String> obj = new HashMap<String,String>();
                    Elements name = ele.select("strong.figure_title");
                    String nameStr = name.select("a").attr("title");
                    String leader  = ele.select("div.figure_desc").text();
                    String count  = ele.select("div.figure_count").text();
                    String remark = ele.select("span.figure_info").text();
                    String score = ele.select("div.figure_score").text();
                    System.out.println("================== " + index + " =====================");
                    System.out.println("名称:" + nameStr.toString());
                    System.out.println("主演:" + leader.toString());
                    System.out.println("评分:" + score.toString());
                    System.out.println("描述:" + remark.toString());
                    System.out.println("点播量:" + count.toString());
                    obj.put("name", nameStr);
                    obj.put("lead", leader);
                    obj.put("desc", remark);
                    obj.put("score", score);
                    obj.put("dianbo", count);
                    resultList.add(obj);
                    fileWrite.write(String.valueOf(obj)+"\n");
                    index ++;
                }
            }
            //new ExportExcel().exportTencentExcle(resultList);
        } catch (IOException e) {
            e.printStackTrace();
        }catch (Exception ae) {
            ae.printStackTrace();
        }finally {
            try {
                fileWrite.flush();
                fileWrite.close();
            } catch (IOException e) {
                e.printStackTrace();
            }

        }

    }

    public static void main(String[] args) {
        // Movies
        //getTencentVideoName("http://v.qq.com/x/list/movie?itype=-1&offset=");
        // TV series
//        getTencentVideoName("http://v.qq.com/x/list/tv?feature=-1&offset=");
        // Animation
//        getTencentVideoName("http://v.qq.com/x/list/cartoon?itype=-1&offset=");
        // Children's
//        getTencentVideoName("http://v.qq.com/x/list/child?iarea=-1&offset=");
        // Variety shows
//        getTencentVideoName("http://v.qq.com/x/list/variety?exclusive=-1&offset=");
        // Concerts
//        getTencentVideoName("http://v.qq.com/x/list/music?istate=2&offset=");
        // Documentaries
//        getTencentVideoName("http://v.qq.com/x/list/doco?itrailer=-1&offset=");
        // Exclusive movies
//        getTencentVideoName("https://v.qq.com/x/list/movie?characteristic=5&offset=");
        // Exclusive TV series
        getTencentVideoName("https://v.qq.com/x/list/tv?feature=44&offset=");
    }
}

Getting all the links on a site

import java.io.*;
import java.net.HttpURLConnection;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WebCrawlerDemo {

    public static void main(String[] args) throws IOException {
        WebCrawlerDemo webCrawlerDemo = new WebCrawlerDemo();
        webCrawlerDemo.myPrint("https://v.qq.com");
    }

    public String myPrint(String baseUrl) throws IOException {
        Map<String, Boolean> oldMap = new LinkedHashMap<String, Boolean>(); // key: link, value: visited yet?
        FileWriter fileWriter = null;
        String oldLinkHost = "";  // host
        File file = new File("D:\\work");
        if (!file.exists()) {
            file.mkdirs();
        }
        file = new File(file + File.separator + "LinkUrl" + ".txt");
        fileWriter = new FileWriter(file, true);
        Pattern p = Pattern.compile("(https?://)?[^/\\s]*"); // e.g. http://www.zifangsky.cn
        Matcher m = p.matcher(baseUrl);
        if (m.find()) {
            oldLinkHost = m.group();
        }

        oldMap.put(baseUrl, false);
        try {
            oldMap = crawlLinks(oldLinkHost, oldMap);
        } catch (IOException e) {
            e.printStackTrace();
        }
        StringBuffer stringBuffer=new StringBuffer();
        for (Map.Entry<String, Boolean> mapping : oldMap.entrySet()) {
            stringBuffer.append("链接:" + mapping.getKey()+"\n");
            System.out.println("链接:" + mapping.getKey());


        }
        fileWriter.write(stringBuffer.toString());
        fileWriter.flush();
        fileWriter.close();

        return stringBuffer.toString();
    }

    /**
     * Crawl every reachable page link on a site, using a breadth-first strategy:
     * keep issuing GET requests against links that have not been visited yet,
     * until a full pass over the collection turns up no new links, at which
     * point no more links can be discovered and the task is finished.
     *
     * @param oldLinkHost the host, e.g. http://www.zifangsky.cn
     * @param oldMap      the set of links still to traverse
     * @return all links that were discovered
     */
    private Map<String, Boolean> crawlLinks(String oldLinkHost,
                                            Map<String, Boolean> oldMap) throws IOException {
        Map<String, Boolean> newMap = new LinkedHashMap<String, Boolean>();
        String oldLink = "";

        for (Map.Entry<String, Boolean> mapping : oldMap.entrySet()) {
            System.out.println("link:" + mapping.getKey() + "--------check:"
                    + mapping.getValue());

            // If this link has not been visited yet
            if (!mapping.getValue()) {
                oldLink = mapping.getKey();
                // Issue a GET request
                try {
                    URL url = new URL(oldLink);
                    HttpURLConnection connection = (HttpURLConnection) url
                            .openConnection();
                    connection.setRequestMethod("GET");
                    connection.setConnectTimeout(2000);
                    connection.setReadTimeout(2000);

                    if (connection.getResponseCode() == 200) {
                        InputStream inputStream = connection.getInputStream();
                        BufferedReader reader = new BufferedReader(
                                new InputStreamReader(inputStream, "UTF-8"));
                        String line = "";
                        Pattern pattern = Pattern
                                .compile("<a.*?href=[\"‘]?((https?://)?/?[^\"‘]+)[\"‘]?.*?>(.+)</a>");
                        Matcher matcher = null;
                        while ((line = reader.readLine()) != null) {
                            matcher = pattern.matcher(line);
                            if (matcher.find()) {
                                String newLink = matcher.group(1).trim(); // the link
                                // String title = matcher.group(3).trim(); // the title
                                // Check whether the extracted link starts with http
                                if (!newLink.startsWith("http")) {
                                    if (newLink.startsWith("/"))
                                        newLink = oldLinkHost + newLink;
                                    else
                                        newLink = oldLinkHost + "/" + newLink;
                                }
                                // Strip a trailing / from the link
                                if (newLink.endsWith("/"))
                                    newLink = newLink.substring(0, newLink.length() - 1);
                                // De-duplicate and drop links that belong to other sites
                                if (!oldMap.containsKey(newLink)
                                        && !newMap.containsKey(newLink)
                                        && newLink.startsWith(oldLinkHost)) {
                                    // System.out.println("temp2: " + newLink);
                                    newMap.put(newLink, false);
                                }
                            }
                        }
                    }
                } catch (MalformedURLException e) {
                    e.printStackTrace();
                } catch (IOException e) {
                    e.printStackTrace();
                }

                try {
                    Thread.sleep(1000);
                } catch (InterruptedException e) {
                    e.printStackTrace();
                }
                oldMap.replace(oldLink, false, true);
            }
        }
        // New links were found, keep traversing
        if (!newMap.isEmpty()) {
            oldMap.putAll(newMap);
            oldMap.putAll(crawlLinks(oldLinkHost, oldMap));  // Map semantics guarantee there will be no duplicate entries
        }
        return oldMap;
    }

}
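
One caveat with the regex approach above is that it only matches <a> tags that sit entirely on one line. A jsoup-based sketch of the same per-page link extraction (an alternative, not the author's code) that handles multi-line tags and relative hrefs:

import java.io.IOException;
import java.util.LinkedHashSet;
import java.util.Set;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class LinkSketch {
    // Collect the same-site links found on one page; jsoup resolves relative hrefs for us
    public static Set<String> linksOnPage(String pageUrl, String host) throws IOException {
        Document doc = Jsoup.connect(pageUrl).timeout(2000).get();
        Set<String> links = new LinkedHashSet<String>();
        for (Element a : doc.select("a[href]")) {
            String link = a.absUrl("href");
            if (link.startsWith(host)) {                          // drop links to other sites
                if (link.endsWith("/")) {
                    link = link.substring(0, link.length() - 1);  // strip a trailing slash
                }
                links.add(link);
            }
        }
        return links;
    }

    public static void main(String[] args) throws IOException {
        for (String link : linksOnPage("https://v.qq.com", "https://v.qq.com")) {
            System.out.println(link);
        }
    }
}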

Final screenshots

(screenshot omitted)

Project download

The network drive contains both the packaged, runnable program and the project source code.
Tencent Weiyun

Packaging the JavaFX application

Why package at all? Suppose a friend who cannot program wants to use the program you developed, and they do not have a JDK installed. Once the project is packaged and released, the program can run on any computer.

Using IDEA as the example

See here for details

Launching the jar

1. Copy over the jre folder from inside the JDK directory, i.e. the runtime environment a Java program needs.

2. Put the packaged jar next to it.

3. The last, and key, step: create a batch (.bat) file, open it in an editor, and write the launch script into it:

start  jre\bin\java.exe  -jar  Web_Spider.jar

Note: the name of the .jar file must match whatever name your packaged jar actually has; the jar name in the script is not fixed.

Here java.exe runs with a DOS command window attached; if you do not need the console, change it to javaw.exe.
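
For example, the window-less variant would look like this (same jar name as above):

start  jre\bin\javaw.exe  -jar  Web_Spider.jar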

About packaging into an .exe executable

A JavaFX program can be packaged into an .exe executable; see tutorials online for the details.


Original post: https://www.cnblogs.com/effortfordream/p/13301718.html
