Scraping table data from a wiki with Beautiful Soup and Python

Problem description

How can I extract the Alpha-3 codes from the first two tables on the following wiki page using Beautiful Soup in Python?

https://en.wikipedia.org/wiki/List_of_territorial_entities_where_English_is_an_official_language

from bs4 import BeautifulSoup as bs
import requests
import pandas as pd

r = requests.get('https://en.wikipedia.org/wiki/List_of_territorial_entities_where_English_is_an_official_language')
soup = bs(r.content, 'lxml')

table = soup.find_all('table', class_='wikitable')[0]

output_rows = []
for table_row in table.findAll('tr'):
    columns = table_row.findAll('td')
    output_row = []
    for column in columns:
        output_row.append(column.text)
    output_rows.append(output_row)

output_rows[1][2].rstrip('\n')
output_rows[2][2].rstrip('\n')
output_rows[3][2].rstrip('\n')
output_rows[4][2].rstrip('\n')

Tags: python-3.x, web-scraping, beautifulsoup

Solution


Use pandas to fetch the tables, then stack the first tables into one frame (if you want all of the data), or pull out just the Alpha-3 column.

import pandas as pd

url = 'https://en.wikipedia.org/wiki/List_of_territorial_entities_where_English_is_an_official_language'
dfs = pd.read_html(url)

# Stack the first tables into one frame. DataFrame.append was removed
# in pandas 2.0, so use pd.concat instead.
df = pd.concat(dfs[:3], sort=True).reset_index(drop=True)

alpha3 = list(df['Alpha-3 code'].dropna())
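The concat-and-dropna step can be illustrated on small stand-in frames (the data below is made up for illustration, not taken from the real wiki tables):

```python
import pandas as pd

# Two stand-in tables with slightly different columns, mimicking what
# read_html returns for consecutive wiki tables (hypothetical data).
t1 = pd.DataFrame({'Country': ['Australia', 'New Zealand'],
                   'Alpha-3 code': ['AUS', 'NZL']})
t2 = pd.DataFrame({'Country': ['Canada'], 'Alpha-3 code': ['CAN'],
                   'Region': ['Americas']})

# Stack the tables; columns are aligned by name, missing cells become NaN.
df = pd.concat([t1, t2], sort=True).reset_index(drop=True)

# dropna() discards rows where the code is missing.
alpha3 = list(df['Alpha-3 code'].dropna())
print(alpha3)  # ['AUS', 'NZL', 'CAN']
```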

Output:

print(alpha3)
['AUS', 'NZL', 'GBR', 'USA', 'ATG', 'BHS', 'BRB', 'BLZ', 'BWA', 'BDI', 'CMR', 'CAN', 'COK', 'DMA', 'SWZ', 'FJI', 'GMB', 'GHA', 'GRD', 'GUY', 'IND', 'IRL', 'JAM', 'KEN', 'KIR', 'LSO', 'LBR', 'MWI', 'MLT', 'MHL', 'MUS', 'FSM', 'NAM', 'NGA', 'NIU', 'PAK', 'PLW', 'PNG', 'PHL', 'RWA', 'KNA', 'LCA', 'VCT', 'WSM', 'SYC', 'SLE', 'SGP', 'SLB', 'ZAF', 'SSD', 'SDN', 'TZA', 'TON', 'TTO', 'TUV', 'UGA', 'VUT', 'ZMB', 'ZWE']
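If you prefer to stay with BeautifulSoup as in the question, the same idea can be sketched by locating the Alpha-3 column index from the header row and reading that cell from each body row. The snippet below runs on an inline HTML stand-in so it needs no network request; the real page's markup may differ:

```python
from bs4 import BeautifulSoup

# Inline stand-in for a wiki table (the live page's markup may differ).
html = """
<table class="wikitable">
  <tr><th>Country</th><th>Alpha-2 code</th><th>Alpha-3 code</th></tr>
  <tr><td>Australia</td><td>AU</td><td>AUS</td></tr>
  <tr><td>New Zealand</td><td>NZ</td><td>NZL</td></tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')
codes = []
for table in soup.find_all('table', class_='wikitable')[:2]:
    # Work out which column holds the Alpha-3 code from the header row.
    headers = [th.get_text(strip=True) for th in table.find_all('th')]
    idx = headers.index('Alpha-3 code')
    for row in table.find_all('tr'):
        cells = row.find_all('td')
        if len(cells) > idx:          # skip the header-only row
            codes.append(cells[idx].get_text(strip=True))

print(codes)  # ['AUS', 'NZL']
```

Looking the index up by header name keeps the extraction working even if extra columns are added before the Alpha-3 column.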
