Extracting column content using only Python's BeautifulSoup so that all columns of each row are on the same line

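Problem description

I have the following Python snippet, which runs in Jupyter Notebooks. The challenge I'm facing is extracting only the rows of column data.

Here is the snippet: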

    from bs4 import BeautifulSoup as bs
    import pandas as pd
    import requests  # needed for requests.get below

    page = requests.get("http://lib.stat.cmu.edu/datasets/boston")
    page
    soup = bs(page.content)
    soup
    allrows = soup.find_all("p")
    print(allrows)

Tags: python-3.x, beautifulsoup

Solution


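I'm a little unclear about what you're after, but I assume it's each row of data from the URL you provided. I couldn't find a way to parse that data with Beautiful Soup itself, because the page is plain text rather than tagged HTML, but I did find a way to separate the rows using .split():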

    from bs4 import BeautifulSoup as bs 
    import pandas as pd
    import requests

    page = requests.get("http://lib.stat.cmu.edu/datasets/boston")
    soup = bs(page.content, "html.parser") # name the parser explicitly to avoid a warning

    text = soup.text # turn soup into text
    text_split = text.split('\n\n') # split the page into 3 sections
    data = text_split[2] # rows of data

    # create df column titles using variable titles on page
    col_titles = text_split[1].split('\n')
    df = pd.DataFrame(columns=range(14))
    df.columns = col_titles[1:]

    # each record is wrapped across two lines of text,
    # so join consecutive pairs of lines into complete rows
    rows = data.strip().split('\n')
    complete_row = [
        rows[i] + ' ' + rows[i + 1]
        for i in range(0, len(rows) - 1, 2)
    ]

    # loop through rows of data, split on whitespace and append to df
    # (the values are still strings at this point)
    for row in complete_row:
        df.loc[len(df)] = row.split()

    # finished dataframe
    df
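
If the line pairing feels fragile, here is a minimal sketch of the same idea that ignores line breaks altogether: flatten the data section into whitespace-separated tokens and regroup them by the number of variables. It uses only the standard library so it can be tried without a network connection. `SAMPLE` is a hand-made, shortened stand-in for the page text and `parse_wrapped_table` is a hypothetical helper name. I'm assuming the real page keeps the same three blank-line-separated sections described above.

```python
# A sketch only: SAMPLE is a made-up, shortened stand-in for the page text,
# and parse_wrapped_table is a hypothetical helper, not part of any library.
SAMPLE = """\
 Title line of the data set description.

 Variables in order:
 A        first variable
 B        second variable
 C        third variable

 1.0  2.0
  3.0
 4.0  5.0
  6.0
"""


def parse_wrapped_table(text):
    """Return (names, records) from text laid out like SAMPLE."""
    # Sections are separated by blank lines: description, variables, data.
    _, variables, data = text.strip().split("\n\n")[:3]

    # Skip the "Variables in order:" heading; the short name is the first token.
    names = [line.split()[0] for line in variables.splitlines()[1:]]

    # Ignore line breaks entirely: flatten to tokens, then regroup by width.
    tokens = data.split()
    width = len(names)
    if len(tokens) % width != 0:
        raise ValueError("token count is not a multiple of the column count")
    records = [
        [float(tok) for tok in tokens[i:i + width]]
        for i in range(0, len(tokens), width)
    ]
    return names, records


names, records = parse_wrapped_table(SAMPLE)
print(names)
print(records)
```

With the real page you would pass `soup.text` (or `page.text`) instead of `SAMPLE`, and `pd.DataFrame(records, columns=names)` should then give a numeric DataFrame with short column names.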
