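python - Scraping tables with and without an ID using BeautifulSoup in Python
Problem description
I am trying to scrape several pages, all of which contain tables. However, the table on the first URL carries the identifier .table-translations (a CSS class selector), while the table on the other has none, so it is not scraped.
If I leave the selector out, nothing is scraped either.
How can I use BeautifulSoup to scrape the data whether or not the table has an ID?
Here is my code: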
import requests
from bs4 import BeautifulSoup
urls = ['http://www.mongols.eu/mongolian-language/mongolian-tale-six-silver-stars', 'http://www.mongols.eu/mongolian-language/mongolian-tale-yanzin-jaal']
for url in urls:
    print(url)
    out_fileName = url.rsplit('/', 1)[-1]
    out_mn = out_fileName + "_mn.txt"
    out_en = out_fileName + "_en.txt"
    soup = BeautifulSoup(requests.get(url).content, 'html.parser')
    all_data = []
    for row in soup.select('.table-translations tr')[1:]:
        mongolian, english = map(lambda t: t.get_text(strip=True), row.select('td')[1:])
        all_data.append((mongolian, english))
    for row in all_data:
        with open(out_mn, "a") as text_file:
            text_file.write(row[0] + "\n")
        with open(out_en, "a") as text_file:
            text_file.write(row[1] + "\n")
Solution
This script gets all the translations from both URLs. If other pages have a different structure, it will need to be adjusted:
import requests
from bs4 import BeautifulSoup
urls = ['http://www.mongols.eu/mongolian-language/mongolian-tale-six-silver-stars', 'http://www.mongols.eu/mongolian-language/mongolian-tale-yanzin-jaal']
for url in urls:
    print(url)
    soup = BeautifulSoup(requests.get(url).content, 'html.parser')
    all_data = []
    # Select every table row regardless of the table's class; skip the first row
    for row in soup.select('tr')[1:]:
        tds = [*map(lambda t: t.get_text(strip=True), row.select('td'))]
        if len(tds) == 3:
            # Three cells: ignore the first one, keep the last two
            mongolian, english = tds[1:]
        else:
            # Otherwise the row is expected to hold exactly two cells
            mongolian, english = tds
        print(mongolian)
        print(english)
        print('-' * 80)
        all_data.append((mongolian, english))
Prints:
http://www.mongols.eu/mongolian-language/mongolian-tale-six-silver-stars
Зургаан мөнгөн мичид
Six silver stars
--------------------------------------------------------------------------------
Эрт урьд цагт зургаан өнчин хүүхэд товцог толгой дээр наадан суудаг юм санжээ.
Long ago, there were six orphan brothers playing on the top of a hill.
--------------------------------------------------------------------------------
... and so on.
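The answer's loop prints the pairs but no longer writes the two output files the question asked for. The following is a minimal sketch of how the same row logic could be wrapped in a function and combined with file output. The names extract_pairs and save_pairs are hypothetical helpers, not part of the original answer, and the sketch assumes (as the answer does) that the first row is a header and that data rows hold either two or three cells. Rows with any other cell count are skipped instead of raising an unpacking error.

```python
from pathlib import Path

from bs4 import BeautifulSoup


def extract_pairs(html):
    """Return (mongolian, english) tuples from the table rows in `html`.

    Hypothetical helper: assumes the first <tr> is a header and that data
    rows have either 3 cells (first one ignored) or 2 cells.
    """
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for row in soup.select("tr")[1:]:  # skip the first row, as the answer does
        cells = [td.get_text(strip=True) for td in row.select("td")]
        if len(cells) == 3:
            cells = cells[1:]  # drop the leading cell
        if len(cells) != 2:
            continue  # unexpected layout: skip rather than crash
        pairs.append((cells[0], cells[1]))
    return pairs


def save_pairs(pairs, stem, out_dir="."):
    """Write pairs to <stem>_mn.txt and <stem>_en.txt and return both paths.

    Files are opened once, in write mode and as UTF-8, so re-running the
    script does not append duplicate lines.
    """
    out_dir = Path(out_dir)
    mn_path = out_dir / f"{stem}_mn.txt"
    en_path = out_dir / f"{stem}_en.txt"
    with open(mn_path, "w", encoding="utf-8") as mn_file, \
            open(en_path, "w", encoding="utf-8") as en_file:
        for mongolian, english in pairs:
            mn_file.write(mongolian + "\n")
            en_file.write(english + "\n")
    return mn_path, en_path
```

Inside the original loop this could be used as pairs = extract_pairs(requests.get(url).content) followed by save_pairs(pairs, url.rsplit('/', 1)[-1]). Note that the question's code opened the files in append mode ("a") for every row; write mode is a design choice here, not something the answer prescribes.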