How do you iterate over HTML links in a table to extract data from the table?

Problem description

I am trying to go through the table at https://bgp.he.net/report/world. I want to follow each HTML link that leads to a country page, grab the data there, and then iterate to the next listing. I am using Beautiful Soup and can already get the data I want, but I can't quite figure out how to iterate over the column of HTML links.

from bs4 import BeautifulSoup
import requests


headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:56.0) Gecko/20100101 Firefox/56.0'}

url = "https://bgp.he.net/country/LC"
html = requests.get(url, headers=headers)

country_ID = url[-2:]  # the last two characters of the URL are the country code

soup = BeautifulSoup(html.text, 'html.parser')

data = []
for row in soup.find_all("tr")[1:]:  # skip the header row
    cells = row.find_all('td')
    data.append({
        'ASN': cells[0].text,
        'Country': country_ID,
        "Name": cells[1].text,
        "Routes V4": cells[3].text,
        "Routes V6": cells[5].text
    })

with open('table_attempt.txt', 'w') as r:
    for item in data:
        r.write(str(item))
        r.write("\n")

print(data)

I would like to be able to collect the data for every country into one written text file.

Tags: python, html, json, web-scraping, beautifulsoup

Solution

You can iterate over the main table on the world report page and send a request to scrape each country's "report" listing:

import requests, re
from bs4 import BeautifulSoup as soup

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:56.0) Gecko/20100101 Firefox/56.0'}

def scrape_report(_id):
    # Fetch a single country page and parse its ASN table.
    _d = soup(requests.get(f'https://bgp.he.net/country/{_id}', headers=headers).text, 'html.parser')
    _headers = [i.text for i in _d.find_all('th')]
    # The first <tr> is the header row and has no <td> cells, so discard it.
    _, *data = [[i.text for i in b.find_all('td')] for b in _d.find_all('tr')]
    return [dict(zip(_headers, i)) for i in data]

# Scrape the world report listing; the second cell of each row holds the country code.
d = soup(requests.get('https://bgp.he.net/report/world', headers=headers).text, 'html.parser')
_, *_listings = [[re.sub(r'[\t\n]+', '', i.text) for i in b.find_all('td')] for b in d.find_all('tr')]
final_result = [{**dict(zip(['Name', 'Country', 'ASN'], [a, b, c])), 'data': scrape_report(b)}
                for a, b, c, *_ in _listings]
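
Since the goal is to collect each country's data into a written text file, you can dump final_result once the scrape finishes. A minimal sketch, assuming final_result from the snippet above (the filename world_report.txt is arbitrary; one JSON object per line keeps the file easy to read back):

import json

# Write one JSON object per line: the listing fields plus the scraped country rows.
with open('world_report.txt', 'w') as f:
    for country in final_result:
        f.write(json.dumps(country) + '\n')

Each line can later be parsed back with json.loads, which is easier to work with than the str(dict) output in the original attempt.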
