python - How to extract HTML tables and add a new column with a constant value taken from an earlier tag?
Problem description
Solution
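I hope I understood your question correctly. This script scrapes every table found on the page into a single DataFrame and saves it to a CSV file. The Product column is filled with the text of the most recent h1 strong heading that appears before each table row: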
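import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://oilpriceng.net/03-09-2019/'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

# `last` holds the text of the most recent heading; it becomes the Product value.
data, last = {'Enterprise': [], 'Price': [], 'Product': []}, ''

# Select headings and table rows together so they come back in document order.
for tag in soup.select('h1 strong, tr:has(td.vc_table_cell)'):
    if tag.name == 'strong':
        # A new product heading: remember it for the rows that follow.
        last = tag.get_text(strip=True)
    else:
        # A table row: expects exactly two cells (enterprise, price).
        a, b = tag.select('td')
        a, b = a.get_text(strip=True), b.get_text(strip=True)
        # Skip rows with an empty first cell and the 'DEPOT PRICE' header row.
        if a and b != 'DEPOT PRICE':
            data['Enterprise'].append(a)
            data['Price'].append(b)
            data['Product'].append(last)

df = pd.DataFrame(data)
print(df)
df.to_csv('data.csv')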
Prints:
            Enterprise         Price  Product
0            AVIDOR PH        ₦190.0      AGO
1            SHORELINK                    AGO
2    BULK STRATEGIC PH        ₦190.0      AGO
3                  TSL                    AGO
4              MASTERS                    AGO
..                 ...           ...      ...
165             CHIPET        ₦132.0      PMS
166               BOND                    PMS
167           RAIN OIL                    PMS
168               MENJ        ₦133.0      PMS
169              NIPCO  ₦ 2,9000,000      LPG

[170 rows x 3 columns]
data.csv (screenshot from LibreOffice)
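The "remember the last heading, attach it to every following row" idea in the script above can be tried offline, without the network or third-party packages. The following is only a minimal sketch using the standard library's html.parser. The SAMPLE_HTML markup and the CarryForwardParser class are hypothetical stand-ins, not the real page structure or part of the original answer:

```python
from html.parser import HTMLParser

# Hypothetical, simplified markup standing in for the real page (an assumption).
SAMPLE_HTML = """
<h1><strong>AGO</strong></h1>
<table>
  <tr><td>DEPOT</td><td>DEPOT PRICE</td></tr>
  <tr><td>AVIDOR PH</td><td>₦190.0</td></tr>
  <tr><td>SHORELINK</td><td></td></tr>
</table>
<h1><strong>PMS</strong></h1>
<table>
  <tr><td>DEPOT</td><td>DEPOT PRICE</td></tr>
  <tr><td>CHIPET</td><td>₦132.0</td></tr>
</table>
"""


class CarryForwardParser(HTMLParser):
    """Collect (enterprise, price, product) rows, tagging each row with the
    text of the most recent <strong> heading seen before it."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished (enterprise, price, product) tuples
        self.last = ''      # most recent heading text (the constant column)
        self._cells = None  # cells of the <tr> being read, or None
        self._buf = None    # text pieces of the current <strong>/<td>, or None

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._cells = []
        elif tag in ('strong', 'td'):
            self._buf = []

    def handle_data(self, data):
        # Only keep text that sits inside a <strong> or <td>.
        if self._buf is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == 'strong' and self._buf is not None:
            self.last = ''.join(self._buf).strip()
            self._buf = None
        elif tag == 'td' and self._buf is not None:
            if self._cells is not None:
                self._cells.append(''.join(self._buf).strip())
            self._buf = None
        elif tag == 'tr' and self._cells is not None:
            # Same filter as the script: two cells, non-empty name, no header row.
            if len(self._cells) == 2:
                a, b = self._cells
                if a and b != 'DEPOT PRICE':
                    self.rows.append((a, b, self.last))
            self._cells = None


parser = CarryForwardParser()
parser.feed(SAMPLE_HTML)
parser.close()
rows = parser.rows
print(len(rows), 'rows parsed')
```

The resulting list of tuples could then be handed to pd.DataFrame(rows, columns=['Enterprise', 'Price', 'Product']) to get the same three-column layout as the script. Rows with an empty price cell are kept, as in the output shown earlier.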