web-scraping - Parsing and extracting data to pandas using BeautifulSoup
Problem description
I'm trying to scrape some data off a website, but am new to Python/HTML and could use some help.
Here's the part of the code that works:
from bs4 import BeautifulSoup
import requests
page_link ='http://www.some-website.com'
page_response = requests.get(page_link, timeout=5)
page_content = BeautifulSoup(page_response.content, "html.parser")
data = page_content.find(id='yyy')
print(data)
This successfully grabs the data I'm trying to scrape, which when printed appears as follows
<div class="generalData" id="yyy">
<div class="generalDataBox">
<div class="rowText">
<label class="some-class-here" title="some-title-here">
Title Name
</label>
<span class="" id="">###</span>
</div>
<div class="rowText">
<label class="same-class-here" title="another-title-here">
Another Title Name
</label>
<span class="" id="">###2</span>
</div>
... more rows here ...
</div></div>
What is the best way to get this into a pandas dataframe? Ideally, it would have two columns: one with the label name (i.e. 'Title Name' or 'Another Title Name' above), another column with the data (i.e. ### and ###2 above).
Thanks!
Solution
First, the extraction part:
html = """<div class="generalData" id="yyy">
<div class="generalDataBox">
<div class="rowText">
<label class="same-class-here" title="some-title-here">Title Name</label>
<span class="" id="">###</span>
</div>
<div class="rowText">
<label class="same-class-here" title="another-title-here">Another Title Name</label>
<span class="" id="">###2</span>
</div>"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
# Walk each row so that every label is paired with the span in the same row
titleList = []
hashList = []
for row in soup.find_all('div', class_='rowText'):
    titleList.append(row.find('label').get_text(strip=True))
    hashList.append(row.find('span').get_text(strip=True))
Now that you have extracted whatever you need, in this case the values for the two columns, we use pandas to put them into a DataFrame.
import pandas as pd
df = pd.DataFrame()
df['Title'] = titleList
df['Hash'] = hashList
Output:
                Title  Hash
0          Title Name   ###
1  Another Title Name  ###2
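The two steps above can also be folded into one helper. The following is only a sketch under a few assumptions: the function name `rows_to_dataframe` and its `row_class` parameter are made up for illustration, each row is assumed to be a `div` with class `rowText` holding one `label` and one `span`, and the column names `Title` and `Hash` are carried over from the answer. Pairing by row rather than by label class means it still works when the labels do not all share a class, as in the question's markup.

```python
from bs4 import BeautifulSoup
import pandas as pd


def rows_to_dataframe(html, row_class="rowText"):
    """Build a two-column DataFrame from label/span pairs.

    Hypothetical helper for illustration; not part of bs4 or pandas.
    """
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for row in soup.find_all("div", class_=row_class):
        label = row.find("label")
        span = row.find("span")
        # Skip rows missing either element instead of raising AttributeError
        if label is None or span is None:
            continue
        records.append({
            "Title": label.get_text(strip=True),  # strip=True drops the newlines around the text
            "Hash": span.get_text(strip=True),
        })
    # Passing columns keeps the column order fixed even when no rows matched
    return pd.DataFrame(records, columns=["Title", "Hash"])


# Sample markup modelled on the question (label classes differ on purpose)
sample_html = """<div class="generalData" id="yyy">
<div class="generalDataBox">
<div class="rowText">
<label class="some-class-here" title="some-title-here">
Title Name
</label>
<span class="" id="">###</span>
</div>
<div class="rowText">
<label class="same-class-here" title="another-title-here">
Another Title Name
</label>
<span class="" id="">###2</span>
</div>
</div></div>"""

df = rows_to_dataframe(sample_html)
print(df)
```

For the live page in the question, the same helper could be called as `rows_to_dataframe(str(data))`, where `data` is the element returned by `page_content.find(id='yyy')`.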