Java web crawler problem with jsoup

Problem description

I'm having trouble getting my web crawler class to work for my Java class project. Here is the prompt:

In this project, you will create a web crawler class.

Two notes before we get into how to handle the class:

Make sure to sleep for at least 0.5 seconds between hitting each link. This is to make sure you don't end up DDOSing whichever site you want to crawl. To put your program to sleep, see the following Oracle documentation: https://docs.oracle.com/javase/tutorial/essential/concurrency/sleep.html To avoid traversing ad-heavy sites, we will be using Wikipedia links.

The class needs to do two things:

Have the functionality to traverse 1000 links. Have a function that counts words, that is, every time you see a particular word, increment a number associated with that word (hint: a Set might be a good choice here). Note: your function should count words, not HTML elements/attributes.

I don't necessarily care how you implement this project, as long as you have a class that, at the end of your 1000-link traversal:
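Two of the prompt's requirements translate fairly directly into code: the pause is a Thread.sleep call (which throws the checked InterruptedException), and the word count is naturally a map from word to count. A minimal sketch of both, just for orientation (text here stands for the page's visible text):

    // (a) The polite delay: Thread.sleep throws the checked InterruptedException.
    try {
        Thread.sleep(500); // 500 ms = the 0.5 s minimum between requests
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // restore the interrupt flag and stop politely
    }

    // (b) The word count: a map from each word to how many times it has been seen.
    Map<String, Integer> counts = new HashMap<>();
    for (String word : text.split("\\s+")) {
        counts.merge(word, 1, Integer::sum); // starts at 1 the first time, then increments
    }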


Here is the code I have so far; I'm not entirely sure what I'm doing wrong. It immediately gets all kinds of errors that seem related to data types (example below), even though I thought I had it right. I've had no luck researching this, since I'm very new to programming.

// Crawler.java

package edu.umsl;

import java.io.IOException;
import java.util.HashMap;
import java.util.HashSet;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Crawler {
    // define max number of pages to visit
    private static final int MAX_PAGES = 1000;
    // collect all titles
    private HashSet<String> titles = new HashSet<>();
    // keep track of url visited
    private HashSet<String> urlVisited = new HashSet<>();
    // keep track of words and count
    private HashMap<String, Integer> map = new HashMap<>();

    // recursive function to crawl web
    public void getLinks(String startURL) {
        // condition to end the recursion
        if ((titles.size() < MAX_PAGES) && !urlVisited.contains(startURL)) {
            // add new url to set
            urlVisited.add(startURL);
            try {
                Document doc = Jsoup.connect(startURL).get();
                Elements linksFromPage = doc.select("a[href]");
                // take all text to count words
                String title = doc.select("title").first().text();
                titles.add(title);
                String text = doc.body().text();
                CountWords(text);
                for (Element link : linksFromPage) {
                    if (titles.size() <= MAX_PAGES) {
                        Thread.sleep(1000);
                        getLinks(link.attr("abs:href"));
                    } else {
                        System.out.println("URL couldnt visit");
                        System.out.println(startURL + ", " + urlVisited.size());
                    }
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
//           catch (InterruptedException e) {
//               e.printStackTrace();
//           }
            catch (Exception e) {
                e.printStackTrace();
            }
        }
    }

    // method to print all titles
    public void PrintAllTitles() {
        for (String t : titles) {
            System.out.println(t);
        }
    }

    // method to print word and count
    public void PrintAllWordsAndCount() {
        for (String key : map.keySet()) {
            System.out.println(key + " : " + map.get(key));
        }
    }

    private void CountWords(String text) {
        String[] lines = text.split(" ");
        for (String word : lines) {
            if (map.containsKey(word)) {
                int val = map.get(word);
                val += 1;
                map.remove(word);
                map.put(word, val);
            } else {
                map.put(word, 1);
            }
        }
    }
}

// Driver.java

package edu.umsl;


public class Driver {

   public static void main(String[] args) {
  
       Crawler c = new Crawler();
       c.getLinks("https://en.wikipedia.org/wiki/Science");
       System.out.println("*******************************Printing all titles*******************************");
       c.PrintAllTitles();
       System.out.println("*******************************Printing all Words*******************************");
       c.PrintAllWordsAndCount();
   }

}

Here is the first error I get. All the other errors are the same, except this line is repeated additionally: at edu.umsl.Crawler.getLinks(Crawler.java:50)

org.jsoup.UnsupportedMimeTypeException: Unhandled content type. Must be text/*, application/xml, or application/*+xml. Mimetype=image/svg+xml, URL=https://upload.wikimedia.org/wikipedia/commons/3/37/People_icon.svg
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:772)
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:707)
at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:297)
at org.jsoup.helper.HttpConnection.get(HttpConnection.java:286)
at edu.umsl.Crawler.getLinks(Crawler.java:33)
at edu.umsl.Crawler.getLinks(Crawler.java:50)
at edu.umsl.Crawler.getLinks(Crawler.java:50)
... (the getLinks frame above repeats about 30 more times) ...
at edu.umsl.Driver.main(Driver.java:8)
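For context, that exception is jsoup refusing to parse a response that is not HTML or XML: once absolute URLs are followed, the crawler also hits image links like the People_icon.svg one above. Purely as a sketch of one possible guard (placed inside the existing try block), the request can be executed first and only parsed when the content type really is HTML:

    // Fetch first, parse only if the response really is HTML.
    // ignoreContentType(true) keeps jsoup from throwing UnsupportedMimeTypeException on its own.
    org.jsoup.Connection.Response response = Jsoup.connect(startURL)
            .ignoreContentType(true)
            .execute();
    String contentType = response.contentType();
    if (contentType != null && contentType.startsWith("text/html")) {
        Document doc = response.parse(); // now safe to treat as an HTML document
        // ... select the links and count the words as before
    }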

UPDATE: I set the attribute key in link.attr to "abs:href" instead of "a[href]".

That solved the error! Thank you!

Tags: java, web, web-crawler, cumulative-sum

Solution


UPDATE: I set the attribute key in link.attr to "abs:href" instead of "a[href]".

That solved the error! Thank you!

This happened on line 50 of Crawler.java.
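The underlying distinction is that "a[href]" is a CSS selector, something passed to select to find anchor elements, while "abs:href" is an attribute key, something passed to attr to get the href resolved against the page's base URI. A small illustration:

    Elements links = doc.select("a[href]");      // CSS selector: every <a> element that has an href
    for (Element link : links) {
        String relative = link.attr("href");     // may be relative, e.g. "/wiki/Physics"
        String absolute = link.attr("abs:href"); // resolved, e.g. "https://en.wikipedia.org/wiki/Physics"
    }

Passing "a[href]" to attr instead, as in the updated code below, asks for a literal attribute named a[href]; no such attribute exists, so attr returns an empty string, which is what later triggers the "Must supply a valid URL" error.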

Updated code:

// Crawler.java

package edu.umsl;

import java.io.IOException;
import java.util.HashMap;
import java.util.HashSet;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Crawler {

    // define max number of pages to visit
    private static final int MAX_PAGES = 1000;
    // collect all titles
    private HashSet<String> titles = new HashSet<>();
    // keep track of url visited
    private HashSet<String> urlVisited = new HashSet<>();
    // keep track of words and count
    private HashMap<String, Integer> map = new HashMap<>();

    // recursive function to crawl web
    public void getLinks(String startURL) {

        // condition to end the recursion
        if ((titles.size() < MAX_PAGES) && !urlVisited.contains(startURL)) {

            //add new url to set
            urlVisited.add(startURL);

            try {

                Document doc = Jsoup.connect(startURL).get();
                Elements linksFromPage = doc.select("a[href]");

                // take all text to count words
                String title = doc.select("title").first().text();

                titles.add(title);
                String text = doc.body().text();

                CountWords(text);

                for (Element link : linksFromPage) {

                    if(titles.size() <= MAX_PAGES) {

                        //sleep 1 sec before hitting another link. uncomment this to wait for a sec
                        //Thread.sleep(1000);
                        getLinks(link.attr("a[href]"));
                     }
                    else {
                        System.out.println("URL couldnt visit");
                        System.out.println(startURL + ", " + urlVisited.size());
                    }

                }

            } catch (IOException e) {

                e.printStackTrace();

            }
//           catch (InterruptedException e) {
//
//               e.printStackTrace();
//
//           }
            catch (Exception e) {

                e.printStackTrace();
            }
        }

    }

    //method to print all titles
    public void PrintAllTitles() {

        for (String t : titles) {
            System.out.println(t);
        }
    }

    //method to print word and count
    public void PrintAllWordsAndCount() {

        for (String key : map.keySet()) {

            System.out.println(key + " : " + map.get(key));
        }
    }

    private void CountWords(String text) {

        String[] lines = text.split(" ");

        for (String word : lines) {

            if (map.containsKey(word)) {
                int val = map.get(word);
                val += 1;
                map.remove(word);
                map.put(word, val);
            } else {
                map.put(word, 1);
            }

        }
    }
}

Now my problem is that it only traverses one link, and it still counts some HTML elements and attributes that I want to ignore. The program does finish, but with the following error (a sketch of one possible adjustment follows the program output below):

java.lang.IllegalArgumentException: Must supply a valid URL
    at org.jsoup.helper.Validate.notEmpty(Validate.java:102)
    at org.jsoup.helper.HttpConnection.url(HttpConnection.java:127)
    at org.jsoup.helper.HttpConnection.connect(HttpConnection.java:70)
    at org.jsoup.Jsoup.connect(Jsoup.java:73)
    at edu.umsl.Crawler.getLinks(Crawler.java:33)
    at edu.umsl.Crawler.getLinks(Crawler.java:50)
    at edu.umsl.Driver.main(Driver.java:8)
*******************************Printing all titles*******************************
Science - Wikipedia
*******************************Printing all Words*******************************
half : 6
21–73. : 1
Western : 15
diverges : 1
Sheet". : 1
ten : 1
sake : 1
Robbins, : 1
energy : 4
Engineering". : 1
(SCAMs): : 1
" : 1
p.3—Lindberg, : 1
Fleck,] : 1
BCE : 3
& : 12
) : 1
require : 2
Harold; : 1
- : 1
completion, : 1
Strauss : 2
/ : 3
1 : 4
2 : 3
3 : 4
4 : 2
5 : 2
6 : 1
7 : 4
8 : 1
9 : 1
: : 3
A : 26
role : 7
G : 1
1000 : 1
Widespread : 1
H : 2
Abi : 1
I : 4
K : 1
O : 1
result : 2
R : 2
same : 7
Lightman, : 1
(1932) : 1
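As for the two leftover problems, here is a rough sketch of one way they might be handled; the isCrawlable helper is hypothetical and the countWords variant simply reworks the existing CountWords, so this is not the only valid approach. The idea is to pass the resolved absolute URL into the recursive call, skip empty or non-article URLs before connecting, and normalize case and punctuation before counting so tokens like "&" or "Sheet"." no longer show up as separate words:

    // Hypothetical helper: only visit regular Wikipedia article pages.
    private boolean isCrawlable(String url) {
        // empty strings are what cause "Must supply a valid URL"
        return url != null && url.startsWith("https://en.wikipedia.org/wiki/") && !url.contains("#");
    }

    // Inside the loop over linksFromPage (needs the InterruptedException catch re-enabled):
    String next = link.attr("abs:href");   // attribute key, resolves to an absolute URL
    if (isCrawlable(next) && !urlVisited.contains(next)) {
        Thread.sleep(500);                 // at least 0.5 s between requests
        getLinks(next);
    }

    // A word counter that lower-cases and splits on anything that is not a letter or digit,
    // so "Science," and "science" are counted together and "&" or "/" are ignored.
    private void countWords(String text) {
        for (String word : text.toLowerCase().split("[^\\p{L}\\p{Nd}]+")) {
            if (!word.isEmpty()) {
                map.merge(word, 1, Integer::sum);
            }
        }
    }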
