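python - How can I scrape and index multiple tables from one page in Python?
Problem description
I am trying to match district numbers to Chicago's community areas using this Wikipedia page: https://en.wikipedia.org/wiki/Community_areas_in_Chicago
I know how to do it one table at a time, but I believe a loop would make the task easier.
However, the tables do not contain the side names, so I may have to match them more manually with a join or a dictionary.
The code below works, but it scrapes all the tables into one, so I cannot tell the "sides" apart.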
import pandas as pd
url = 'https://en.wikipedia.org/wiki/Community_areas_in_Chicago'
df_list = []
for i in range(0, 9):
    # Take the i-th table on the page, using its first row as the header.
    df_list.append(pd.read_html(url, header=0)[i])
df = pd.concat(df_list).drop_duplicates()
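Main task: I would like to scrape all the tables with an additional index column that is unique to each table (the side name would be perfect). Can pandas do that?
A minor question: there are 9 districts, but when I use a (0:8) range the last table is missing, and I don't know why. Is there a way to automate this range with something like len?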
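Solution
The issue with read_html() is that it is great when you need to parse <table> tags, but it does not pick up anything outside the <table> tags. The side names sit in headings above the tables, so you need to use BeautifulSoup to be more specific about how you grab the data.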
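To illustrate the point on something small, here is a minimal sketch that pairs each table with the heading above it. The inline HTML is a hypothetical miniature page, not the real Wikipedia markup, and the variable names are made up for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature page: two headings, each followed by one table.
html = """
<h3>Central[edit]</h3>
<table><tr><td>08</td><td>Near North Side</td></tr></table>
<h3>North Side[edit]</h3>
<table><tr><td>05</td><td>North Center</td></tr></table>
"""

soup = BeautifulSoup(html, "html.parser")

labelled = []
for table in soup.find_all("table"):
    # find_previous searches backwards through the document from the table,
    # so it returns the closest <h3> that appears before it.
    heading = table.find_previous("h3").get_text()
    side = heading.split("[")[0].strip()  # drop the "[edit]" suffix
    cells = [td.get_text(strip=True) for td in table.find_all("td")]
    labelled.append((side, cells))

print(labelled)
```

The heading text never appears inside a <table> element, which is why a table-only parser cannot supply it. The full scraper for the Wikipedia page applies the same lookup to every table.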
import pandas as pd
import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/Community_areas_in_Chicago'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

columns = ['Community area by side', 'Sub community area by side', 'Number', 'Community area', 'Neighborhoods']
records = []
for table in soup.find_all('table'):
    # The side name is the nearest <h3> heading above the table; drop the "[edit]" suffix.
    main_area = table.find_previous('h3').text.split('[')[0].strip()
    # Some tables have a <caption> naming a sub-area; others have none.
    caption = table.find('caption')
    sub_area = caption.text.strip() if caption else 'N/A'
    for row in table.find_all('tr'):
        data = row.find_all('td')
        # Header rows and rows from unrelated tables have fewer than three <td> cells.
        if len(data) < 3:
            continue
        number = data[0].text.strip()
        com_area = data[1].text.strip()
        # One output row per neighborhood; keep a single empty entry if there is no list.
        n_list = [each.text.strip() for each in data[2].find_all('li')] or ['']
        for each in n_list:
            records.append([main_area, sub_area, number, com_area, each])

# Build the frame once at the end; DataFrame.append was removed in pandas 2.0.
results_df = pd.DataFrame(records, columns=columns)
Output:
print(results_df.to_string())
    Community area by side  Sub community area by side  Number  Community area   Neighborhoods
0   Central                 N/A                         08      Near North Side  Cabrini–Green
1   Central                 N/A                         08      Near North Side  The Gold Coast
2   Central                 N/A                         08      Near North Side  Goose Island
3   Central                 N/A                         08      Near North Side  Magnificent Mile
4   Central                 N/A                         08      Near North Side  Old Town
5   Central                 N/A                         08      Near North Side  River North
6   Central                 N/A                         08      Near North Side  River West
7   Central                 N/A                         08      Near North Side  Streeterville
8   Central                 N/A                         32      Loop             Loop
9   Central                 N/A                         32      Loop             New Eastside
10  Central                 N/A                         32      Loop             South Loop
11  Central                 N/A                         32      Loop             West Loop Gate
12  Central                 N/A                         33      Near South Side  Dearborn Park
13  Central                 N/A                         33      Near South Side  Printer's Row
14  Central                 N/A                         33      Near South Side  South Loop
15  Central                 N/A                         33      Near South Side  Prairie Avenue Historic District
16  North Side              North Side                  05      North Center     Horner Park
17  North Side              North Side                  05      North Center     Roscoe Village
18  North Side              North Side                  06      Lake View        Boystown
19  North Side              North Side                  06      Lake View        Lake View East
20  North Side              North Side                  06      Lake View        Graceland West
21  North Side              North Side                  06      Lake View        South East Ravenswood
22  North Side              North Side                  06      Lake View        Wrigleyville
23  North Side              North Side                  07      Lincoln Park     Old Town Triangle
24  North Side              North Side                  07      Lincoln Park     Park West
25  North Side              North Side                  07      Lincoln Park     Ranch Triangle
26  North Side              North Side                  07      Lincoln Park     Sheffield Neighbors
27  North Side              North Side                  07      Lincoln Park     Wrightwood Neighbors
28  North Side              North Side                  21      Avondale         Belmont Gardens
29  North Side              North Side                  21      Avondale         Chicago's Polish Village
30  North Side              North Side                  21      Avondale         Kosciuszko Park
31  North Side              North Side                  22      Logan Square     Belmont Gardens
32  North Side              North Side                  22      Logan Square     Bucktown
33  North Side              North Side                  22      Logan Square     Kosciuszko Park
34  North Side              North Side                  22      Logan Square     Palmer Square
35  North Side              Far North side              01      Rogers Park      East Rogers Park
36  North Side              Far North side              02      West Ridge       Arcadia Terrace
37  North Side              Far North side              02      West Ridge       Peterson Park
38  North Side              Far North side              02      West Ridge       West Rogers Park
39  North Side              Far North side              03      Uptown           Buena Park
40  North Side              Far North side              03      Uptown           Argyle Street
41  North Side              Far North side              03      Uptown           Margate Park
42  North Side              Far North side              03      Uptown           Sheridan Park
43  North Side              Far North side              04      Lincoln Square   Ravenswood
44  North Side              Far North side              04      Lincoln Square   Ravenswood Gardens
...
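On the two pandas questions raised in the post, the following is a rough sketch rather than a tested recipe for the live page. It assumes the per-table frames are already in a list (as pd.read_html returns them) and that the side names have been gathered separately. The frames and names below are shortened, hypothetical stand-ins.

```python
import pandas as pd

# Stand-ins for the frames that one call to pd.read_html(url, header=0) would
# return, and for side names collected separately (hypothetical, shortened data).
tables = [
    pd.DataFrame({"Number": ["08", "32"], "Community area": ["Near North Side", "Loop"]}),
    pd.DataFrame({"Number": ["05"], "Community area": ["North Center"]}),
    pd.DataFrame({"Number": ["01"], "Community area": ["Rogers Park"]}),
]
sides = ["Central", "North Side", "Far North side"]

# range() excludes its stop value, so range(0, 8) ends at index 7 and a ninth
# table at index 8 is skipped. Sizing the range from the list avoids that.
indices = list(range(len(tables)))

# keys= adds an outer index level naming the table each row came from;
# reset_index(level="Side") turns that level into an ordinary column.
df = (
    pd.concat(tables, keys=sides, names=["Side", "row"])
    .reset_index(level="Side")
    .reset_index(drop=True)
)
print(df)
```

Calling pd.read_html once and reusing the returned list also avoids downloading the page again on every loop iteration. Passing the list straight to pd.concat removes the need for an index range altogether.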