How to scrape and index multiple tables from one page in Python?

Problem description

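I am trying to match district numbers to Chicago's community areas using this Wikipedia page: https://en.wikipedia.org/wiki/Community_areas_in_Chicago

I know how to do it one table at a time, but I am sure there is a loop that would make this task easier.

However, the tables do not contain the side names, so I may have to match them in a more manual way, with a join or a dictionary.

The code below works, but it merges all the tables into one, so I cannot tell the "sides" apart.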

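import pandas as pd

url = 'https://en.wikipedia.org/wiki/Community_areas_in_Chicago'

df_list = []
for i in range(0, 9):
    # Take the i-th table found on the page
    df_list.append(pd.read_html(url, header=0)[i])

# Stack all the tables into a single DataFrame
df = pd.concat(df_list).drop_duplicates()

  1. Main task: I would like to scrape all the tables with an additional index column that is unique to each table (the side name would be perfect). Can pandas do this?

  2. A minor issue: there are 9 districts, but when I use the (0:8) range, the last table is missing, and I don't know why. Is there a way to automate this range with something like len?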

Tags: python, python-3.x, pandas, loops, web-scraping

Solution


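The problem with read_html() is that it is great when you need to parse <table> tags, but it does not pick up anything outside of the <table> tags. Since the side names are in the headings above the tables, you need to use BeautifulSoup to be more specific about how to get the data.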
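To illustrate the idea on a small scale, here is a minimal sketch that pairs each table with the heading that precedes it. The inline HTML is a made-up, simplified stand-in for the page structure (an assumption, not the real markup), and the `labels` list is a hypothetical name used only for this example.

```python
from bs4 import BeautifulSoup

# Simplified, made-up HTML: each table is preceded by an <h3> with the side name
html = """
<h3>Central[edit]</h3>
<table><tr><th>Number</th><th>Community area</th></tr>
<tr><td>08</td><td>Near North Side</td></tr></table>
<h3>North Side[edit]</h3>
<table><caption>Far North side</caption>
<tr><th>Number</th><th>Community area</th></tr>
<tr><td>01</td><td>Rogers Park</td></tr></table>
"""

soup = BeautifulSoup(html, 'html.parser')

labels = []
for table in soup.find_all('table'):
    # Walk backwards from the table to the closest <h3> and drop the "[edit]" suffix
    heading = table.find_previous('h3')
    side = heading.get_text().split('[')[0].strip() if heading else 'N/A'

    # The caption is optional, so fall back to 'N/A' when it is missing
    caption = table.find('caption')
    sub = caption.get_text().strip() if caption else 'N/A'

    labels.append((side, sub))

print(labels)
```

The heading lookup is what read_html() cannot do by itself, because the heading sits outside the <table> tag.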

import pandas as pd
import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/Community_areas_in_Chicago'
response = requests.get(url)

soup = BeautifulSoup(response.text, 'html.parser')

columns = ['Community area by side', 'Sub community area by side', 'Number', 'Community area', 'Neighborhoods']
records = []
for table in soup.find_all('table'):
    # The side name is the text of the nearest <h3> heading before the table
    main_area = table.find_previous('h3').text.split('[')[0].strip()

    # The sub-area comes from the table's <caption>, when it has one
    caption = table.find('caption')
    sub_area = caption.text.strip() if caption else 'N/A'

    for row in table.find_all('tr'):
        data = row.find_all('td')
        if len(data) < 3:
            # Skip header rows and rows without a neighborhoods cell
            continue

        number = data[0].text.strip()
        com_area = data[1].text.strip()

        # One output row per neighborhood; keep the area even if the list is empty
        n_list = [each.text.strip() for each in data[2].find_all('li')] or ['']
        for each in n_list:
            records.append([main_area, sub_area, number, com_area, each])

# Build the DataFrame once; DataFrame.append was removed in pandas 2.0
results_df = pd.DataFrame(records, columns=columns)

Output:

print (results_df.to_string())
    Community area by side Sub community area by side Number          Community area                     Neighborhoods
0                  Central                        N/A     08         Near North Side                     Cabrini–Green
1                  Central                        N/A     08         Near North Side                    The Gold Coast
2                  Central                        N/A     08         Near North Side                      Goose Island
3                  Central                        N/A     08         Near North Side                  Magnificent Mile
4                  Central                        N/A     08         Near North Side                          Old Town
5                  Central                        N/A     08         Near North Side                       River North
6                  Central                        N/A     08         Near North Side                        River West
7                  Central                        N/A     08         Near North Side                     Streeterville
8                  Central                        N/A     32                    Loop                              Loop
9                  Central                        N/A     32                    Loop                      New Eastside
10                 Central                        N/A     32                    Loop                        South Loop
11                 Central                        N/A     32                    Loop                    West Loop Gate
12                 Central                        N/A     33         Near South Side                     Dearborn Park
13                 Central                        N/A     33         Near South Side                     Printer's Row
14                 Central                        N/A     33         Near South Side                        South Loop
15                 Central                        N/A     33         Near South Side  Prairie Avenue Historic District
16              North Side                 North Side     05            North Center                       Horner Park
17              North Side                 North Side     05            North Center                    Roscoe Village
18              North Side                 North Side     06               Lake View                          Boystown
19              North Side                 North Side     06               Lake View                    Lake View East
20              North Side                 North Side     06               Lake View                    Graceland West
21              North Side                 North Side     06               Lake View             South East Ravenswood
22              North Side                 North Side     06               Lake View                      Wrigleyville
23              North Side                 North Side     07            Lincoln Park                 Old Town Triangle
24              North Side                 North Side     07            Lincoln Park                         Park West
25              North Side                 North Side     07            Lincoln Park                    Ranch Triangle
26              North Side                 North Side     07            Lincoln Park               Sheffield Neighbors
27              North Side                 North Side     07            Lincoln Park              Wrightwood Neighbors
28              North Side                 North Side     21                Avondale                   Belmont Gardens
29              North Side                 North Side     21                Avondale          Chicago's Polish Village
30              North Side                 North Side     21                Avondale                   Kosciuszko Park
31              North Side                 North Side     22            Logan Square                   Belmont Gardens
32              North Side                 North Side     22            Logan Square                          Bucktown
33              North Side                 North Side     22            Logan Square                   Kosciuszko Park
34              North Side                 North Side     22            Logan Square                     Palmer Square
35              North Side             Far North side     01             Rogers Park                  East Rogers Park
36              North Side             Far North side     02              West Ridge                   Arcadia Terrace
37              North Side             Far North side     02              West Ridge                     Peterson Park
38              North Side             Far North side     02              West Ridge                  West Rogers Park
39              North Side             Far North side     03                  Uptown                        Buena Park
40              North Side             Far North side     03                  Uptown                     Argyle Street
41              North Side             Far North side     03                  Uptown                      Margate Park
42              North Side             Far North side     03                  Uptown                     Sheridan Park
43              North Side             Far North side     04          Lincoln Square                        Ravenswood
44              North Side             Far North side     04          Lincoln Square                Ravenswood Gardens 
...
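On the minor question about the range: range(0, 8) stops at index 7, so the ninth table is never read, while range(len(tables)) always covers every table. If the side names are supplied by hand, pd.concat can also attach them as a label column. The sketch below uses two small hand-built DataFrames as stand-ins for the list that pd.read_html would return; `tables` and `side_names` are hypothetical names chosen for the illustration.

```python
import pandas as pd

# Stand-ins for the list of DataFrames returned by pd.read_html(url, header=0)
tables = [
    pd.DataFrame({'Number': ['08', '32'], 'Community area': ['Near North Side', 'Loop']}),
    pd.DataFrame({'Number': ['05', '06'], 'Community area': ['North Center', 'Lake View']}),
]

# Hypothetical labels supplied by hand, one per table, in the same order
side_names = ['Central', 'North Side']

# range(len(tables)) visits every index, whatever the number of tables
indices = list(range(len(tables)))

# keys= tags the rows of each table with its label as an outer index level
combined = pd.concat(tables, keys=side_names, names=['Side', 'Row'])

# Turn the 'Side' level into a regular column and discard the per-table row numbers
combined = combined.reset_index(level='Side').reset_index(drop=True)

print(combined)
```

This only works when the labels are already known, which is why the BeautifulSoup approach above is needed to read them from the page itself.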
