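python - How do I make a web crawler that parses links named "patch" or "fix"?
Problem description
I am trying to program the application task for a Debian GSoC project. I can already parse text files downloaded from the Internet, but I am having trouble downloading patches from the links on a page by scraping it, especially for the first page in the list: the BugZilla site on sourceware.org.
Here is the code I tried: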
#!/usr/bin/env python3
# This program uses Python 3; do not run it with Python 2.
import requests
from bs4 import BeautifulSoup
import re
import os

PAGES_CAH = [
    "https://sourceware.org/bugzilla/show_bug.cgi?id=23685",
    "https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;h=f055032e4e922f1e1a5e11026c7c2669fa2a7d19",
    "https://github.com/golang/net/commit/4b62a64f59f73840b9ab79204c94fee61cd1ba2c",
    "http://www.ca.tcpdump.org/cve/0002-test-case-files-for-CVE-2015-2153-2154-2155.patch",
]
patches = []

def searchy(pages):
    for link in pages:
        global patches
        if "github.com" in link and "commit" in link:  # detect whether the page is a GitHub commit
            if 'patch' not in link:  # detect if it's a patch page or not
                link = link + '.patch'  # add .patch to the link if it lacks it
            request = requests.get(link)  # connect to page
            patches.append(request.text)  # download patch to patches variable
        elif ".patch" in link:  # any other page with ".patch" in the link is downloaded like GitHub patches by default
            request = requests.get(link)  # connect to page
            patches.append(request)  # download patch to patches variable
        else:
            request = requests.get(link)  # connect to page
            soup = BeautifulSoup(request.text, "lxml")  # turn the page into something parsable
            if "sourceware.org/git" in link:  # if it's from sourceware.org's git
                patch_link = soup.find_all('a', string="patch")  # find all patch links
                patch_request = requests.get(patch_link[0])  # connect to patch link
                patches.append(patch_request.text)  # download patch
            elif "sourceware.org/bugzilla" in link:  # if it's from sourceware's bugzilla
                patch_link_possibilities = soup.find('a', id="attachment_table")  # find all links from the attachment table
                local_patches_links = patch_link_possibilities.find_all(string="patch")  # find all links with the "patch" name
                local_fixes_links = patch_link_possibilities.find_all(string="fix")  # find all links with the "fix" name
                for lolpatch in local_patches_links:  # for each local patch in the local patch links list
                    patch_request = requests.get(lolpatch)  # connect to page
                    patches.append(patch_request.text)  # download patch
                for fix in local_fixes_links:  # for each fix in the local fix links list
                    patch_request = requests.get(fix)  # connect to page
                    patches.append(patch_request.text)  # download patch

searchy(PAGES_CAH)
print(patches)
Solution
You can try the :contains pseudo-class selector to find links whose text contains "patch". This requires BeautifulSoup 4.7.1 or later.
import requests
from bs4 import BeautifulSoup
url = 'https://sourceware.org/bugzilla/show_bug.cgi?id=23685'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'lxml')
links = [item['href'] for item in soup.select('a:contains(patch)')]
print(links)
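If the :contains selector is not available, the same filtering can be done in plain Python. As far as I know, newer Soup Sieve releases deprecate :contains in favour of :-soup-contains. The following is only a sketch: SAMPLE_HTML, KEYWORDS and keyword_links are made-up names for illustration, and the markup is a hypothetical stand-in for the real bug page, so it runs without network access. It also matches case-insensitively, which is a choice made here and not something the selector above does for you.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a bug page's attachment table.
SAMPLE_HTML = """
<table id="attachment_table">
  <tr><td><a href="attachment.cgi?id=11262">proposed patch</a></td></tr>
  <tr><td><a href="attachment.cgi?id=11263">Fix for the crash</a></td></tr>
  <tr><td><a href="attachment.cgi?id=11264&amp;action=edit">Details</a></td></tr>
  <tr><td><a name="c1">patch without href</a></td></tr>
</table>
"""

KEYWORDS = ("patch", "fix")


def keyword_links(html, keywords=KEYWORDS):
    """Return hrefs of <a> tags whose visible text contains any keyword."""
    # html.parser ships with Python, so lxml is not needed for this sketch.
    soup = BeautifulSoup(html, "html.parser")
    hrefs = []
    # href=True skips anchors that have no href attribute at all.
    for anchor in soup.find_all("a", href=True):
        text = anchor.get_text().lower()
        if any(word in text for word in keywords):
            hrefs.append(anchor["href"])
    return hrefs


links = keyword_links(SAMPLE_HTML)
print(links)
```

Filtering with href=True also avoids the KeyError that item['href'] raises when a matching anchor has no href attribute.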
You can extend this with the CSS "or" syntax (a comma-separated selector list):
links = [item['href'] for item in soup.select('a:contains(patch), a:contains(fix)')]
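The href values collected this way can be relative to the page, so they may need to be resolved against the page URL before they can be downloaded. A minimal sketch using only the standard library follows. The helper name absolute_links and the sample hrefs are hypothetical and not taken from the real page.

```python
from urllib.parse import urljoin


def absolute_links(page_url, hrefs):
    """Resolve hrefs against page_url, dropping duplicates but keeping order."""
    seen = set()
    resolved = []
    for href in hrefs:
        full = urljoin(page_url, href)  # leaves absolute URLs unchanged
        if full not in seen:
            seen.add(full)
            resolved.append(full)
    return resolved


page_url = "https://sourceware.org/bugzilla/show_bug.cgi?id=23685"
# Hypothetical hrefs, shaped like relative attachment links.
hrefs = [
    "attachment.cgi?id=11262",
    "attachment.cgi?id=11262",
    "/bugzilla/attachment.cgi?id=11263",
]
urls = absolute_links(page_url, hrefs)
print(urls)
```

Each resulting URL can then be passed to requests.get, and the response's .text appended to the list of patches.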