Wiki Scraping: Missing Data

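Problem description

I'm trying to extract the table from https://en.wikipedia.org/wiki/Megacity as my first foray into the world of scraping (in full transparency, I took this code from a blog I read). I got the program to run, but instead of the cities I get \n, and \n also appears in every other field. Question: why do I have \n at the end of every field, and why is my first field (City) blank? The code and part of the output are listed below.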

import requests
scrapeLink = 'https://en.wikipedia.org/wiki/Megacity'
page = requests.get(scrapeLink)

from bs4 import BeautifulSoup
soup = BeautifulSoup(page.content, 'html.parser')

megaTable = soup.find_all('table')[1]


rowValList = []    
for i in range(len(megaTable.find_all('td'))):
    rowVal = megaTable.find_all('td')[i].get_text()
    rowValList.append(rowVal)

cityList = []
for i in range(0, len(rowValList), 6):
    cityList.append(rowValList[i])

countryList = []
for i in range(1, len(rowValList), 6):
    countryList.append(rowValList[i])

contList = []
for i in range(2, len(rowValList), 6):
    contList.append(rowValList[i])

popList = []
for i in range(3, len(rowValList), 6):
    popList.append(rowValList[i])

import pandas as pd

megaDf = pd.DataFrame()
megaDf['City'] = cityList
megaDf['Country'] = countryList
megaDf['Continent'] = contList
megaDf['Population'] = popList
megaDf

Output

Tags: python

Solution


The reason is that the city names are not inside td tags but inside th tags.

<th scope="row"><a href="/wiki/Bangalore" title="Bangalore">Bangalore</a></th>

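The first td you are reading is actually the image column, which is why the City field comes out blank. You can select the city name by getting the th tag instead.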
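
To make this concrete, here is a minimal sketch that parses a single hand-written row instead of the live page. The markup in SAMPLE_ROW is a simplified assumption modelled on the th snippet above (the image file name is made up), so the real page may differ in detail. It shows the city sitting in the th cell, the first td holding only the image, and each cell's text ending in a newline.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified row modelled on the markup quoted above.
# The image file name is invented for illustration.
SAMPLE_ROW = """
<table>
<tr>
<th scope="row"><a href="/wiki/Bangalore" title="Bangalore">Bangalore</a>
</th>
<td><img src="example.jpg" alt=""/>
</td>
<td>India
</td>
<td>Asia
</td>
</tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_ROW, "html.parser")
row = soup.find("tr")

# The city name lives in the <th> cell, not in any <td>.
header_text = row.th.get_text()

# The <td> cells start with the image column, which has no text of its own.
cell_texts = [td.get_text() for td in row.find_all("td")]

print(repr(header_text))
print(cell_texts)

# .strip() removes the trailing newline left over from the HTML source.
print(header_text.strip(), [text.strip() for text in cell_texts])
```

In this sample the first td yields only whitespace, which matches the blank City column in the question.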

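Additionally, you can simplify your scraper by first getting the table's rows and then selecting the necessary tags (th and td) for each row. Calling .strip() on each cell's text removes the trailing \n: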

import requests
from bs4 import BeautifulSoup

scrapeLink = "https://en.wikipedia.org/wiki/Megacity"
page = requests.get(scrapeLink)


soup = BeautifulSoup(page.content, "html.parser")

megaTable = soup.find_all("table")[1]

cities = []
# [2:] skips the first two `tr` rows, which contain the headers
for row in megaTable.find_all("tr")[2:]:
    city = row.th.get_text().strip()
    tds = row.find_all("td")
    country = tds[1].get_text().strip()
    continent = tds[2].get_text().strip()
    population = tds[3].get_text().strip()
    cities.append({
        "city": city,
        "country": country,
        "continent": continent,
        "population": population,
    })

print(cities)

This produces a list of dictionaries like:

[
    {
        "city": "Bangalore",
        "country": "India",
        "continent": "Asia",
        "population": "12,200,00"
    },
    # and so on
]
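
The loop above assumes every row after the first two has a th cell and at least four td cells. As a hedged variation, the sketch below skips any row that does not match that shape instead of relying on a fixed slice. The helper name parse_rows and the contents of SAMPLE_TABLE are hypothetical and only stand in for the real table.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature table: a header row, one data row, and one row
# without a <th> cell. All values are placeholders, not real data.
SAMPLE_TABLE = """
<table>
<tr><th>City</th><th>Image</th><th>Country</th><th>Continent</th><th>Population</th></tr>
<tr><th scope="row">Example City</th><td></td><td>Exampleland</td><td>Asia</td><td>1,000,000</td></tr>
<tr><td colspan="4">Footnote row without a header cell</td></tr>
</table>
"""


def parse_rows(table):
    """Collect one dict per data row, skipping rows of any other shape."""
    records = []
    for row in table.find_all("tr"):
        header = row.find("th")
        cells = row.find_all("td")
        # Header rows have no <td>; footnote rows have no <th>.
        if header is None or len(cells) < 4:
            continue
        records.append({
            "city": header.get_text(strip=True),
            "country": cells[1].get_text(strip=True),
            "continent": cells[2].get_text(strip=True),
            "population": cells[3].get_text(strip=True),
        })
    return records


soup = BeautifulSoup(SAMPLE_TABLE, "html.parser")
records = parse_rows(soup.find("table"))
print(records)
```

Passing strip=True to get_text() has the same effect here as calling .strip() on the result.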

You can then convert the list to a DataFrame:

import pandas as pd

df = pd.DataFrame(cities)
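
As a possible follow-up, the population values arrive as strings with thousands separators, so they may need converting before any numeric work. The sketch below uses hypothetical placeholder records in the same shape as the list built above. It assumes every value contains only digits and commas; cells that carry footnote markers or other text would need extra cleaning first.

```python
import pandas as pd

# Hypothetical records in the same shape as the scraped list of dicts.
# The names and numbers are placeholders, not real data.
records = [
    {"city": "Example City", "country": "Exampleland",
     "continent": "Asia", "population": "1,000,000"},
    {"city": "Sample Town", "country": "Samplestan",
     "continent": "Africa", "population": "2,500,000"},
]

df = pd.DataFrame(records)

# Drop the thousands separators, then convert the column to integers.
# Assumes each value holds only digits and commas.
df["population"] = (
    df["population"].str.replace(",", "", regex=False).astype(int)
)

print(df)
print(df["population"].sum())
```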
