Java dynamic crawling with jsoup and regular expressions

dobestself-994395 2015-06-30 14:14

1. jsoup is an HTML parser for Java. It can parse a URL directly or parse HTML text. Official site: http://jsoup.org/

2. Parsing a URL and extracting content

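import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// url is a String holding the address of the page to fetch
Document doc = Jsoup
        .connect(url)
        .userAgent(
                "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:37.0) Gecko/20100101 Firefox/37.0") // set the User-Agent
        .timeout(5000) // set the connection timeout in milliseconds
        .get();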

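import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.jsoup.select.Elements;

// Class fields: the pattern reads "published by the Central Meteorological Observatory at HH:mm"
private static String regEx_publishDate = "由中央气象台\\s*(\\d+):(\\d+)\\s*发布的";
private static Pattern pattern_publishDate = Pattern
        .compile(regEx_publishDate);

// Select elements by class name, then by tag name
Elements elements = doc.getElementsByClass("desc");
Elements subelements = elements.get(0).getElementsByTag("li");
// eachDayElement and firstElement are Element objects obtained elsewhere (not shown here)
Elements dayElements = eachDayElement.getElementsByTag("tr");
Elements firstSubElements = firstElement.getElementsByTag("td");

// Take the text of the first matching element and extract the publish time
String text = elements.get(0).text();
Matcher matcher = pattern_publishDate.matcher(text);
if (matcher.find()) {
    int hour = Integer.parseInt(matcher.group(1));
    int minute = Integer.parseInt(matcher.group(2));
}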

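3. The jsoup jar must be on the classpath.

4. Regex basics: \s matches any whitespace character, \S matches any non-whitespace character, \d matches a digit, + repeats the preceding item one or more times, and * repeats it zero or more times. In a Java string literal each backslash must be doubled, so \d is written as \\d.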
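
A small sketch of how these metacharacters behave, using only `java.util.regex` from the standard library. The class name `MetacharDemo` and the sample inputs are assumptions chosen for illustration.

```java
import java.util.regex.Pattern;

class MetacharDemo {
    // Returns true when the entire input matches the regex.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(matches("\\d+", "2015"));    // one or more digits
        System.out.println(matches("\\d+", ""));        // + needs at least one digit
        System.out.println(matches("\\d*", ""));        // * also accepts zero digits
        System.out.println(matches("\\s*", " \t\n"));   // space, tab and newline are whitespace
        System.out.println(matches("\\S+", "jsoup"));   // no whitespace anywhere
        System.out.println(matches("\\S+", "js oup"));  // the space interrupts the \S+ run
    }
}
```

The six calls print true, false, true, true, true, false in that order. Note that `Pattern.matches` requires the whole input to match, whereas `Matcher.find()` looks for a match anywhere in the text.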

demo:

(\\d{4})-(\\d{2})-(\\d{2})\\s+(\\d{2}):(\\d{2})发布
(\\S+过敏\\S+):\\s+(\\S+)\\s+(\\S+)
\\s+(感冒\\S+):\\s+(\\S+)\\s+(\\S+)
\\s*(\\S+)\\s*
首要污染物:\\s*(\\S+)\\s*
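
As an illustration of how such a pattern might be used, the sketch below applies the first one (a date and time followed by 发布, "published") and reads the capture groups as integers. The class name `PublishDateDemo`, the method `parse`, and the sample input are hypothetical; only the pattern comes from the list above.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PublishDateDemo {
    // First pattern from the list: yyyy-MM-dd HH:mm followed by "发布" (published)
    private static final Pattern PUBLISH_DATE = Pattern.compile(
            "(\\d{4})-(\\d{2})-(\\d{2})\\s+(\\d{2}):(\\d{2})发布");

    // Returns {year, month, day, hour, minute}, or null when nothing matches.
    static int[] parse(String text) {
        Matcher m = PUBLISH_DATE.matcher(text);
        if (!m.find()) {
            return null;
        }
        int[] fields = new int[m.groupCount()];
        for (int i = 0; i < fields.length; i++) {
            // Capture groups are numbered from 1; group 0 is the whole match
            fields[i] = Integer.parseInt(m.group(i + 1));
        }
        return fields;
    }

    public static void main(String[] args) {
        // Hypothetical sample input
        int[] fields = parse("2015-06-30 14:14发布");
        System.out.println(Arrays.toString(fields));
    }
}
```

For the sample input this prints [2015, 6, 30, 14, 14]. Because the pattern uses \\s+ between the date and the time, at least one whitespace character is required there.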

Regular expression syntax reference:

https://msdn.microsoft.com/zh-cn/library/ae5bf541%28v=vs.80%29.aspx
