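python - Convert HTML that contains no table into a pandas DataFrame
Question
I have some HTML that I want to read with pandas. The problem is that the HTML is not a table, even though it looks like one on the website. This is what I have: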
table = '''
<div id="companyResults">
<div class="col-md-12 titles">
<div class="col-md-6"> </div>
<div class="col-md-4">LOCATION</div>
<div class="col-md-2 last">SALES REVENUE ($M)</div>
</div>
<div class="col-md-12 data">
<div class="col-md-6">
<a href="/business-directory/company-profiles.shenzhen_zhaoji_optical_co_ltd.bcf9d7eb4856eb739ec66272a6d9a361.html">
Shenzhen Zhaoji Optical Co., Ltd.</a>
</div>
<div class="col-md-4">
<div class="show-mobile">Country:</div>
Shenzhen,
Guangdong,
<br/>
China</div>
<div class="col-md-2 last">
<div class="show-mobile">Sales Revenue ($M):</div>
</div>
</div>
<div class="col-md-12 data">
<div class="col-md-6">
<a href="/business-directory/company-profiles.foxconn_industrial_internet_co_ltd.0d4c40a311dbfb1169684a21caa8794c.html">
Foxconn Industrial Internet Co., Ltd.</a>
</div>
<div class="col-md-4">
<div class="show-mobile">Country:</div>
Shenzhen,
Guangdong,
<br/>
China</div>
<div class="col-md-2 last">
<div class="show-mobile">Sales Revenue ($M):</div>
$40,833.44M</div>
</div>
<div class="col-md-12 data">
<div class="col-md-6">
<a href="/business-directory/company-profiles.boe_technology_group_co_ltd.61b87aa6bc863b69d8d7689703a3ac52.html">
BOE Technology Group Co., Ltd.</a>
</div>
<div class="col-md-4">
<div class="show-mobile">Country:</div>
Beijing,
Beijing,
<br/>
China</div>
<div class="col-md-2 last">
<div class="show-mobile">Sales Revenue ($M):</div>
$16,495.55M</div>
</div>
<div class="col-md-12 data">
<div class="col-md-6">
<a href="/business-directory/company-profiles.futong_group_co_ltd.85c12cb0d89005d1280cd3c0c13879ff.html">
Futong Group Co., Ltd.</a>
</div>
<div class="col-md-4">
<div class="show-mobile">Country:</div>
Hangzhou,
Zhejiang,
<br/>
China</div>
<div class="col-md-2 last">
<div class="show-mobile">Sales Revenue ($M):</div>
</div>
</div>
<div class="col-md-12 data">
<div class="col-md-6">
<a href="/business-directory/company-profiles.ofilm_group_co_ltd.515f10b35d850547d16fb6d6875a57d9.html">
OFILM Group Co., Ltd.</a>
</div>
<div class="col-md-4">
<div class="show-mobile">Country:</div>
Shenzhen,
Guangdong,
<br/>
China</div>
<div class="col-md-2 last">
<div class="show-mobile">Sales Revenue ($M):</div>
$5,355.25M</div>
</div>
'''
I would like output that looks like this:
                                          LOCATION                    SALES REVENUE ($M)
0  Shenzhen Zhaoji Optical Co., Ltd.      Shenzhen, Guangdong, China
1  Foxconn Industrial Internet Co., Ltd.  Shenzhen, Guangdong, China  $40,833.44M
2  BOE Technology Group Co., Ltd.         Beijing, Beijing, China     $16,495.55M
3  Futong Group Co., Ltd.                 Hangzhou, Zhejiang, China
4  OFILM Group Co., Ltd.                  Shenzhen, Guangdong, China  $5,355.25M
I tried:
pd.read_html(str(table))
But got this:
ValueError: No tables found
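So what is the best way to achieve this? PS: adding more detail to each row (for example the href) would be welcome, but it is not required.
Update: URL
Solution
You might want to try this: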
import pandas as pd
import requests
from bs4 import BeautifulSoup
from tabulate import tabulate

url = "https://www.dnb.com/business-directory/company-information.semiconductorelectronic-component-manufacturing.cn.html?page=1"
# Send a browser-like User-Agent with the request.
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:84.0) Gecko/20100101 Firefox/84.0",
}
page = requests.get(url, headers=headers).text

# Each result row on the page is a <div class="col-md-12 data">.
soup = BeautifulSoup(page, "html5lib").find_all("div", class_="col-md-12 data")

# Company name: the text of the link in each row.
companies = [d.find("a").getText(strip=True) for d in soup]

# Location: drop the "Country:" label, then tidy the comma-separated parts.
countries = [
    ", ".join(
        c.strip() for c in
        d.find(
            "div", class_="col-md-4"
        ).getText(strip=True).rsplit(":")[-1].split(",")
    ) for d in soup
]

# Revenue: drop the "Sales Revenue ($M):" label; rows without a value give "".
revenue = [
    d.find("div", class_="col-md-2 last").getText(strip=True).rsplit(":")[-1]
    for d in soup
]

columns = ["Company", "Location", "Revenue"]
df = pd.DataFrame(
    list(zip(companies, countries, revenue)),
    columns=columns,
)
print(tabulate(df, headers=columns, tablefmt="pretty"))
Sample output for page 1:
+----+------------------------------------------------------------+----------------------------+-------------+
| | Company | Location | Revenue |
+----+------------------------------------------------------------+----------------------------+-------------+
| 0 | Shenzhen Zhaoji Optical Co., Ltd. | Shenzhen, Guangdong, China | |
| 1 | Foxconn Industrial Internet Co., Ltd. | Shenzhen, Guangdong, China | $40,833.44M |
| 2 | BOE Technology Group Co., Ltd. | Beijing, Beijing, China | $16,495.55M |
| 3 | Futong Group Co., Ltd. | Hangzhou, Zhejiang, China | |
| 4 | OFILM Group Co., Ltd. | Shenzhen, Guangdong, China | $5,355.25M |
| 5 | Universal Scientific Industrial (Shanghai) Co., Ltd. | Shanghai, Shanghai, China | $5,287.83M |
| 6 | Huizhou Jinyang Electronics Co., Ltd. | Huizhou, Guangdong, China | |
| 7 | Goertek Inc. | Weifang, Shandong, China | $5,018.67M |
| 8 | AUX Group Co., Ltd. | Ningbo, Zhejiang, China | |
| 9 | Jinko Solar Co., Ltd | Shangrao, Jiangxi, China | |
| 10 | Samsung Display Dongguan Co.,Ltd | Dongguan, Guangdong, China | |
| 11 | Wuhan Zhongqiao Electric Co., Ltd. | Wuhan, Hubei, China | |
| 12 | Trina Solar Co., Ltd. | Changzhou, Jiangsu, China | |
| 13 | Lingyi iTech (Guangdong) Company | Jiangmen, Guangdong, China | $3,399.16M |
| 14 | Jcet Group Co., Ltd. | Jiangyin, Jiangsu, China | $3,343.79M |
| 15 | TPV Electronics (Fujian) Co., Ltd. | Fuqing, Fujian, China | |
| 16 | Tianma Microelectronics Co., Ltd. | Shenzhen, Guangdong, China | $3,277.77M |
| 17 | Fortech Electronics (Suzhou) Co., Ltd. | Suzhou, Jiangsu, China | |
| 18 | JingAo Solar Co., Ltd. | Xingtai, Hebei, China | |
| 19 | Suzhou Dongshan Precision Manufacturing Co., Ltd. | Suzhou, Jiangsu, China | $2,637.61M |
| 20 | Holitech Technology Co.,Ltd. | Jian, Jiangxi, China | $2,629.38M |
| 21 | Ezhou Jianfeng Heavy Industry Machinery Co., Ltd. | Ezhou, Hubei, China | |
| 22 | Beijing BOE Display Technology Co., Ltd. | Beijing, Beijing, China | |
| 23 | Avary Holding (Shenzhen) Co., Limited | Shenzhen, Guangdong, China | $2,523.89M |
| 24 | Bright Oceans Corporation | Beijing, Beijing, China | $2,499.47M |
| 25 | Tunghsu Optoelectronic Technology Co.,Ltd. | Beijing, Beijing, China | $2,491.36M |
| 26 | Wuxi Taiji Industry Limited Corporation | Wuxi, Jiangsu, China | $2,404.47M |
| 27 | Tianjin Zhonghuan Semiconductor Co., Ltd. | Tianjin, Tianjin, China | $2,400.15M |
| 28 | Tpk Touch Solutions (Xiamen) Inc. | Xiamen, Fujian, China | |
| 29 | Mektec Manufacturing Corporation (Zhuhai) Ltd. | Zhuhai, Guangdong, China | |
| 30 | Truly Opto-Electronics Ltd. | Shanwei, Guangdong, China | |
| 31 | Guangdong HEC Technology Holding Co., Ltd. | Dongguan, Guangdong, China | $2,098.86M |
| 32 | Lingyi Technology (Shenzhen) Co., Ltd. | Shenzhen, Guangdong, China | |
| 33 | Zhejiang Longji Leye Photovoltaic Technology Co., Ltd. | Quzhou, Zhejiang, China | |
| 34 | Shengyi Technology Co., Ltd. | Dongguan, Guangdong, China | $1,881.96M |
| 35 | Shenzhen Kaifa Technology Co., Ltd. | Shenzhen, Guangdong, China | $1,879.50M |
| 36 | Shanghai Huahong(Group) Co.,Ltd | Shanghai, Shanghai, China | |
| 37 | Wuhan P&S Information Technology Co.,Ltd. | Wuhan, Hubei, China | $1,866.39M |
| 38 | Longi Solar Technology Co.,Ltd. | Xian, Shaanxi, China | |
| 39 | Sungrow Power Supply Co., Ltd. | Hefei, Anhui, China | $1,720.88M |
| 40 | Henan Shuangchen Electronic Science & Technology Co., Ltd. | Zhoukou, Henan, China | |
| 41 | Fujian Furi Electronics Co., Ltd. | Fuzhou, Fujian, China | $1,617.07M |
| 42 | Risen Energy Co.,Ltd | Ningbo, Zhejiang, China | $1,564.92M |
| 43 | Dongguan Fuqiang Electronics Co.,Ltd. | Dongguan, Guangdong, China | |
| 44 | Hongfujin Precision Industry (Shenzhen) Co., Ltd. | Shenzhen, Guangdong, China | |
| 45 | Gcl-Poly (Su Zhou) Energy Limited | Suzhou, Jiangsu, China | |
| 46 | Shennan Circuits Co., Ltd. | Shenzhen, Guangdong, China | $1,495.80M |
| 47 | Futaihua Industry (Shenzhen) Co., Ltd. | Shenzhen, Guangdong, China | |
| 48 | Hefei JA Solar Technology Co., Ltd. | Hefei, Anhui, China | |
| 49 | Foxconn Kunshan Computer Connector Co., Ltd. | Kunshan, Jiangsu, China | |
+----+------------------------------------------------------------+----------------------------+-------------+
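The question also asks, optionally, for extra detail per row such as the link's href. The following is a minimal sketch of one way to do that, not part of the original answer. It assumes the markup matches the snippet in the question, parses a shortened two-row copy of it offline instead of fetching the page, and uses the built-in "html.parser" backend. The `cell_text` helper and the `Href` column name are made up for illustration.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Shortened two-row copy of the question's markup, so the sketch runs offline.
html = """
<div id="companyResults">
  <div class="col-md-12 data">
    <div class="col-md-6">
      <a href="/business-directory/company-profiles.shenzhen_zhaoji_optical_co_ltd.bcf9d7eb4856eb739ec66272a6d9a361.html">
        Shenzhen Zhaoji Optical Co., Ltd.</a>
    </div>
    <div class="col-md-4">
      <div class="show-mobile">Country:</div>
      Shenzhen,
      Guangdong,
      <br/>
      China</div>
    <div class="col-md-2 last">
      <div class="show-mobile">Sales Revenue ($M):</div>
    </div>
  </div>
  <div class="col-md-12 data">
    <div class="col-md-6">
      <a href="/business-directory/company-profiles.foxconn_industrial_internet_co_ltd.0d4c40a311dbfb1169684a21caa8794c.html">
        Foxconn Industrial Internet Co., Ltd.</a>
    </div>
    <div class="col-md-4">
      <div class="show-mobile">Country:</div>
      Shenzhen,
      Guangdong,
      <br/>
      China</div>
    <div class="col-md-2 last">
      <div class="show-mobile">Sales Revenue ($M):</div>
      $40,833.44M</div>
  </div>
</div>
"""


def cell_text(cell):
    """Hypothetical helper: cell text without the mobile-only label, whitespace collapsed."""
    for label in cell.select("div.show-mobile"):
        label.decompose()  # remove "Country:" / "Sales Revenue ($M):" from the tree
    return " ".join(cell.get_text(" ").split())


soup = BeautifulSoup(html, "html.parser")

rows = []
for row in soup.select("div.col-md-12.data"):
    link = row.select_one("div.col-md-6 a")
    rows.append({
        "Company": link.get_text(strip=True),
        "Href": link["href"],  # relative URL, exactly as it appears in the markup
        "Location": cell_text(row.select_one("div.col-md-4")),
        "Revenue": cell_text(row.select_one("div.col-md-2.last")),
    })

df = pd.DataFrame(rows)
print(df)
```

Building a list of dicts keeps each row's fields together, so a row with a missing value cannot shift the other columns the way three separately built lists could. The hrefs are relative, so they would need to be joined with the site's base URL before being requested.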