Creating a table in Python with BeautifulSoup

Problem description

I'm new to Python, and I have a question about building a table from scraped data using Beautiful Soup. This is the code I'm using:

import requests
page=requests.get("https://www.opensecrets.org/lobby/lobbyist.php?id=Y0000008510L&year=2018")
from bs4 import BeautifulSoup
soup=BeautifulSoup(page.content, 'lxml')
table=soup.find('table',{'id':'lobbyist_summary'})
for row in table:
    cells=row.find_all('a')
    rn=cells[0].get_text()

The error is:

AttributeError: 'NavigableString' object has no attribute 'find_all'

The output of print(table) looks like this:

[<a href="firmsum.php?id=D000037635&amp;year=2018">Ballard Partners</a>, <a href="clientsum.php?id=F203227&amp;year=2018">Advanced Roofing Inc</a>, <a href="clientsum.php?id=F214670&amp;year=2018">Africell Holding</a>, <a href="clientsum.php?id=D000023883&amp;year=2018">Amazon.com</a>, ...]

I would (eventually) like to end up with a table in which each element of interest sits in its own column, so that it looks like:

[[firmsum,D000037635,2018,Ballard Partners],[clientsum,F203227,2018,Advanced Roofing Inc],[clientsum,F214670,2018,Africell Holding],[clientsum,D000023883, 2018, Amazon.com]...]

Tags: python, parsing, beautifulsoup

Solution
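
Some context on the error in the question: iterating directly over a BeautifulSoup Tag yields all of its children, and those include the whitespace text nodes (NavigableString objects) between tags, which have no find_all method. The sketch below illustrates this with a small made-up HTML snippet (the markup is an assumption standing in for the real page, and it uses the built-in 'html.parser' so that lxml is not required). Calling find_all on the table itself, as the steps below do, avoids the text nodes.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Hypothetical markup standing in for the real page
html = """
<table id="lobbyist_summary">
<tr><td><a href="firmsum.php?id=D000037635&amp;year=2018">Ballard Partners</a></td></tr>
<tr><td><a href="clientsum.php?id=F203227&amp;year=2018">Advanced Roofing Inc</a></td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", {"id": "lobbyist_summary"})

# Iterating over a Tag walks its direct children, text nodes included
text_nodes = [child for child in table if isinstance(child, NavigableString)]

# find_all searches only for tags, so the text nodes never show up here
names = [a.get_text() for a in table.find_all("a")]
```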


Create four empty lists, one per column:

col1List = list()
col2List = list()
col3List = list()
col4List = list()

First, let's get the values for column 4:

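# Take the second <tr> of the table (index 1) and collect every <a> tag inside it
trs = table.find_all('tr')[1]
tds = trs.find_all('a')

# The link text is the value for column 4
for a in tds:
    col4List.append(a.get_text())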

This gives:

['Ballard Partners', 'Advanced Roofing Inc', 'Africell Holding',....]

Now, let's extract the values for the first 3 columns from the href of each link:

hrefVal = trs.find_all('a')

for i in hrefVal:
    hVal = i.get('href')                # e.g. 'firmsum.php?id=D000037635&year=2018'
    col11 = hVal.split('.php?id=', 1)   # ['firmsum', 'D000037635&year=2018']
    col1 = col11[0]
    col1List.append(col1)
    col22 = col11[1].split('&', 1)      # ['D000037635', 'year=2018']
    col2 = col22[0]
    col2List.append(col2)
    col33 = col22[1].split('=', 1)      # ['year', '2018']
    col3 = col33[1]
    col3List.append(col3)
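
The chained split calls depend on the exact order of the query parameters. As an alternative sketch (not the method used in this answer), the standard library's urllib.parse can pull out the same three pieces by name. The helper name split_href is hypothetical, and the sketch assumes every href has both an id and a year parameter.

```python
from pathlib import PurePosixPath
from urllib.parse import parse_qs, urlsplit


def split_href(href):
    """Return (page, id, year) from an href like 'firmsum.php?id=X&year=2018'."""
    parts = urlsplit(href)
    page = PurePosixPath(parts.path).stem  # 'firmsum.php' -> 'firmsum'
    query = parse_qs(parts.query)          # maps each parameter name to a list of values
    return page, query["id"][0], query["year"][0]


example = split_href("firmsum.php?id=D000037635&year=2018")
```

Inside the loop, the three values returned for each href could be appended to col1List, col2List and col3List in the same way as before.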

Now, let's put all the lists into a DataFrame so that it looks tidy:

import pandas as pd

df = pd.DataFrame()
df['Col1'] = col1List
df['Col2'] = col2List
df['Col3'] = col3List
df['Col4'] = col4List

If I print the first few rows, it looks the way you wanted:

Col1        Col2        Col3    Col4
firmsum     D000037635  2018    Ballard Partners
clientsum   F203227     2018    Advanced Roofing Inc
clientsum   F214670     2018    Africell Holding
clientsum   D000023883  2018    Amazon.com
clientsum   D000000192  2018    American Health Care Assn
clientsum   D000021839  2018    American Road & Transport Builders Assn
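
If the nested-list form shown in the question is preferred over a DataFrame, the four lists can be combined row by row with zip. This is a minimal sketch; the sample values are copied from the first two rows of the output above and stand in for the fully populated lists.

```python
# Sample values standing in for the lists filled in the earlier steps
col1List = ["firmsum", "clientsum"]
col2List = ["D000037635", "F203227"]
col3List = ["2018", "2018"]
col4List = ["Ballard Partners", "Advanced Roofing Inc"]

# zip pairs up the i-th entry of each list; each resulting tuple becomes one row
rows = [list(row) for row in zip(col1List, col2List, col3List, col4List)]
```

With the DataFrame already built, df.values.tolist() should produce the same nested list.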
