Parsing and extracting data to pandas using BeautifulSoup

Question

I'm trying to scrape some data off a website, but am new to Python/HTML and could use some help.

Here's the part of the code that works:

from bs4 import BeautifulSoup
import requests
page_link ='http://www.some-website.com'
page_response = requests.get(page_link, timeout=5)
page_content = BeautifulSoup(page_response.content, "html.parser")
data = page_content.find(id='yyy')
print(data)

This successfully grabs the data I'm trying to scrape, which, when printed, appears as follows:

<div class="generalData" id="yyy">
<div class="generalDataBox">

<div class="rowText">
<label class="some-class-here" title="some-title-here">
Title Name
</label>
<span class="" id="">###</span>
</div>

<div class="rowText">
<label class="same-class-here" title="another-title-here">
Another Title Name
</label>
<span class="" id="">###2</span>
</div>

... more rows here ...

</div></div>

What is the best way to get this into a pandas dataframe? Ideally, it would have two columns: one with the label name (i.e. 'Title Name' or 'Another Title Name' above), another column with the data (i.e. ### and ###2 above).

Thanks!

Tags: web-scraping, beautifulsoup

Answer


First, the extraction part:

html = """<div class="generalData" id="yyy">
<div class="generalDataBox">

<div class="rowText">
<label class="same-class-here" title="some-title-here">Title Name</label>
<span class="" id="">###</span>
</div>

<div class="rowText">
<label class="same-class-here" title="another-title-here">Another Title Name</label>
<span class="" id="">###2</span>
</div>"""

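from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')

titleList = []
hashList = []

# Walk each row so that every label is paired with the span in the same row,
# instead of matching labels and spans by their position in the whole page
for row in soup.find_all('div', class_='rowText'):
    label = row.find('label')
    span = row.find('span')
    if label is None or span is None:
        continue  # skip rows that lack a label or a value
    # strip=True drops the newlines that surround the label text in the page source
    titleList.append(label.get_text(strip=True))
    hashList.append(span.get_text(strip=True))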

Now that you have extracted what you want (in this case, the values for the two columns), use pandas to put them into a dataframe.

import pandas as pd

df = pd.DataFrame()
df['Title'] = titleList
df['Hash'] = hashList

Output:

                Title  Hash
0          Title Name   ###
1  Another Title Name  ###2
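
Putting the two steps together, here is a minimal sketch that goes from the HTML to the dataframe in one function. It assumes every row is a div with class rowText holding one label and one span, as in the question's markup. The function name rows_to_dataframe and the sample_html string are made up for illustration; in the question's own script, the HTML would come from page_response.content instead.

```python
from bs4 import BeautifulSoup
import pandas as pd


def rows_to_dataframe(html, container_id="yyy"):
    """Build a Title/Hash DataFrame from the rowText divs under container_id.

    Hypothetical helper: the name and the default id are illustrative only.
    """
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find(id=container_id)
    if container is None:
        # No matching container; return an empty frame with the expected columns
        return pd.DataFrame(columns=["Title", "Hash"])

    records = []
    for row in container.find_all("div", class_="rowText"):
        label = row.find("label")
        span = row.find("span")
        if label is None or span is None:
            continue  # skip rows that lack a label or a value
        records.append({
            "Title": label.get_text(strip=True),
            "Hash": span.get_text(strip=True),
        })
    return pd.DataFrame(records, columns=["Title", "Hash"])


# Sample markup copied from the question, including the newlines inside each label
sample_html = """<div class="generalData" id="yyy">
<div class="generalDataBox">

<div class="rowText">
<label class="some-class-here" title="some-title-here">
Title Name
</label>
<span class="" id="">###</span>
</div>

<div class="rowText">
<label class="same-class-here" title="another-title-here">
Another Title Name
</label>
<span class="" id="">###2</span>
</div>

</div></div>"""

df = rows_to_dataframe(sample_html)
print(df)
```

Collecting one dict per row keeps each title next to its value, so a row that is missing either element is skipped as a whole rather than shifting the two columns out of step.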
