Convert HTML that does not contain a table into a pandas DataFrame

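Problem description

I have some HTML that I want to read with pandas. The problem is that the HTML is not a table, even though it looks like one on the website. This is what I have: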

table = '''
<div id="companyResults">
<div class="col-md-12 titles">
<div class="col-md-6"> </div>
<div class="col-md-4">LOCATION</div>
<div class="col-md-2 last">SALES REVENUE ($M)</div>
</div>
<div class="col-md-12 data">
<div class="col-md-6">
<a href="/business-directory/company-profiles.shenzhen_zhaoji_optical_co_ltd.bcf9d7eb4856eb739ec66272a6d9a361.html">
                                        Shenzhen Zhaoji Optical Co., Ltd.</a>
</div>
<div class="col-md-4">
<div class="show-mobile">Country:</div>
                                Shenzhen,
                                Guangdong,
                                <br/>
                                China</div>
<div class="col-md-2 last">
<div class="show-mobile">Sales Revenue ($M):</div>
</div>
</div>
<div class="col-md-12 data">
<div class="col-md-6">
<a href="/business-directory/company-profiles.foxconn_industrial_internet_co_ltd.0d4c40a311dbfb1169684a21caa8794c.html">
                                        Foxconn Industrial Internet Co., Ltd.</a>
</div>
<div class="col-md-4">
<div class="show-mobile">Country:</div>
                                Shenzhen,
                                Guangdong,
                                <br/>
                                China</div>
<div class="col-md-2 last">
<div class="show-mobile">Sales Revenue ($M):</div>
                                $40,833.44M</div>
</div>
<div class="col-md-12 data">
<div class="col-md-6">
<a href="/business-directory/company-profiles.boe_technology_group_co_ltd.61b87aa6bc863b69d8d7689703a3ac52.html">
                                        BOE Technology Group Co., Ltd.</a>
</div>
<div class="col-md-4">
<div class="show-mobile">Country:</div>
                                Beijing,
                                Beijing,
                                <br/>
                                China</div>
<div class="col-md-2 last">
<div class="show-mobile">Sales Revenue ($M):</div>
                                $16,495.55M</div>
</div>
<div class="col-md-12 data">
<div class="col-md-6">
<a href="/business-directory/company-profiles.futong_group_co_ltd.85c12cb0d89005d1280cd3c0c13879ff.html">
                                        Futong Group Co., Ltd.</a>
</div>
<div class="col-md-4">
<div class="show-mobile">Country:</div>
                                Hangzhou,
                                Zhejiang,
                                <br/>
                                China</div>
<div class="col-md-2 last">
<div class="show-mobile">Sales Revenue ($M):</div>
</div>
</div>
<div class="col-md-12 data">
<div class="col-md-6">
<a href="/business-directory/company-profiles.ofilm_group_co_ltd.515f10b35d850547d16fb6d6875a57d9.html">
                                        OFILM Group Co., Ltd.</a>
</div>
<div class="col-md-4">
<div class="show-mobile">Country:</div>
                                Shenzhen,
                                Guangdong,
                                <br/>
                                China</div>
<div class="col-md-2 last">
<div class="show-mobile">Sales Revenue ($M):</div>
                                $5,355.25M</div>
</div>
'''

I want output that looks like this:

                                                            LOCATION  \
0      Shenzhen Zhaoji Optical Co., Ltd.  Shenzhen, Guangdong, China   
1  Foxconn Industrial Internet Co., Ltd.  Shenzhen, Guangdong, China   
2         BOE Technology Group Co., Ltd.     Beijing, Beijing, China   
3                 Futong Group Co., Ltd.   Hangzhou, Zhejiang, China   
4                  OFILM Group Co., Ltd.  Shenzhen, Guangdong, China   

  SALES REVENUE ($M)  
0                     
1        $40,833.44M  
2        $16,495.55M  
3                     
4         $5,355.25M  

I tried:

pd.read_html(str(table))

but got this:

ValueError: No tables found

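So what is the best way to achieve this? PS: suggestions for adding more detail to each row (such as the href or anything else) are welcome, but not required.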

Update: URL

Tags: python, html, beautifulsoup

Solution

You might want to try this:

import pandas as pd
import requests
from bs4 import BeautifulSoup
from tabulate import tabulate

url = "https://www.dnb.com/business-directory/company-information.semiconductorelectronic-component-manufacturing.cn.html?page=1"
# Send a browser-like User-Agent header with the request.
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:84.0) Gecko/20100101 Firefox/84.0",
}
page = requests.get(url, headers=headers).text
# Each company is one <div class="col-md-12 data"> row.
# The "html5lib" parser requires the html5lib package to be installed.
soup = BeautifulSoup(page, "html5lib").find_all("div", class_="col-md-12 data")

# Company name: the text of the link in each row.
companies = [d.find("a").getText(strip=True) for d in soup]

# Location: drop the "Country:" label, then tidy the whitespace around the commas.
countries = [
    ", ".join(
        c.strip() for c in
        d.find(
            "div", class_="col-md-4"
        ).getText(strip=True).rsplit(":")[-1].split(",")
    ) for d in soup
]

# Revenue: drop the "Sales Revenue ($M):" label; empty when the row has no figure.
revenue = [
    d.find("div", class_="col-md-2 last").getText(strip=True).rsplit(":")[-1]
    for d in soup
]

columns = ["Company", "Location", "Revenue"]
df = pd.DataFrame(
    list(zip(companies, countries, revenue)),
    columns=columns,
)

# Pretty-print the DataFrame as a bordered text table.
print(tabulate(df, headers=columns, tablefmt="pretty"))

Sample output for page 1:

+----+------------------------------------------------------------+----------------------------+-------------+
|    |                          Company                           |          Location          |   Revenue   |
+----+------------------------------------------------------------+----------------------------+-------------+
| 0  |             Shenzhen Zhaoji Optical Co., Ltd.              | Shenzhen, Guangdong, China |             |
| 1  |           Foxconn Industrial Internet Co., Ltd.            | Shenzhen, Guangdong, China | $40,833.44M |
| 2  |               BOE Technology Group Co., Ltd.               |  Beijing, Beijing, China   | $16,495.55M |
| 3  |                   Futong Group Co., Ltd.                   | Hangzhou, Zhejiang, China  |             |
| 4  |                   OFILM Group Co., Ltd.                    | Shenzhen, Guangdong, China | $5,355.25M  |
| 5  |    Universal Scientific Industrial (Shanghai) Co., Ltd.    | Shanghai, Shanghai, China  | $5,287.83M  |
| 6  |           Huizhou Jinyang Electronics Co., Ltd.            | Huizhou, Guangdong, China  |             |
| 7  |                        Goertek Inc.                        |  Weifang, Shandong, China  | $5,018.67M  |
| 8  |                    AUX Group Co., Ltd.                     |  Ningbo, Zhejiang, China   |             |
| 9  |                    Jinko Solar Co., Ltd                    |  Shangrao, Jiangxi, China  |             |
| 10 |              Samsung Display Dongguan Co.,Ltd              | Dongguan, Guangdong, China |             |
| 11 |             Wuhan Zhongqiao Electric Co., Ltd.             |    Wuhan, Hubei, China     |             |
| 12 |                   Trina Solar Co., Ltd.                    | Changzhou, Jiangsu, China  |             |
| 13 |              Lingyi iTech (Guangdong) Company              | Jiangmen, Guangdong, China | $3,399.16M  |
| 14 |                    Jcet Group Co., Ltd.                    |  Jiangyin, Jiangsu, China  | $3,343.79M  |
| 15 |             TPV Electronics (Fujian) Co., Ltd.             |   Fuqing, Fujian, China    |             |
| 16 |             Tianma Microelectronics Co., Ltd.              | Shenzhen, Guangdong, China | $3,277.77M  |
| 17 |           Fortech Electronics (Suzhou) Co., Ltd.           |   Suzhou, Jiangsu, China   |             |
| 18 |                   JingAo Solar Co., Ltd.                   |   Xingtai, Hebei, China    |             |
| 19 |     Suzhou Dongshan Precision Manufacturing Co., Ltd.      |   Suzhou, Jiangsu, China   | $2,637.61M  |
| 20 |                Holitech Technology Co.,Ltd.                |    Jian, Jiangxi, China    | $2,629.38M  |
| 21 |     Ezhou Jianfeng Heavy Industry Machinery Co., Ltd.      |    Ezhou, Hubei, China     |             |
| 22 |          Beijing BOE Display Technology Co., Ltd.          |  Beijing, Beijing, China   |             |
| 23 |           Avary Holding (Shenzhen) Co., Limited            | Shenzhen, Guangdong, China | $2,523.89M  |
| 24 |                 Bright Oceans Corporation                  |  Beijing, Beijing, China   | $2,499.47M  |
| 25 |         Tunghsu Optoelectronic Technology Co.,Ltd.         |  Beijing, Beijing, China   | $2,491.36M  |
| 26 |          Wuxi Taiji Industry Limited Corporation           |    Wuxi, Jiangsu, China    | $2,404.47M  |
| 27 |         Tianjin Zhonghuan Semiconductor Co., Ltd.          |  Tianjin, Tianjin, China   | $2,400.15M  |
| 28 |             Tpk Touch Solutions (Xiamen) Inc.              |   Xiamen, Fujian, China    |             |
| 29 |       Mektec Manufacturing Corporation (Zhuhai) Ltd.       |  Zhuhai, Guangdong, China  |             |
| 30 |                Truly Opto-Electronics Ltd.                 | Shanwei, Guangdong, China  |             |
| 31 |         Guangdong HEC Technology Holding Co., Ltd.         | Dongguan, Guangdong, China | $2,098.86M  |
| 32 |           Lingyi Technology (Shenzhen) Co., Ltd.           | Shenzhen, Guangdong, China |             |
| 33 |   Zhejiang Longji Leye Photovoltaic Technology Co., Ltd.   |  Quzhou, Zhejiang, China   |             |
| 34 |                Shengyi Technology Co., Ltd.                | Dongguan, Guangdong, China | $1,881.96M  |
| 35 |            Shenzhen Kaifa Technology Co., Ltd.             | Shenzhen, Guangdong, China | $1,879.50M  |
| 36 |              Shanghai Huahong(Group) Co.,Ltd               | Shanghai, Shanghai, China  |             |
| 37 |         Wuhan P&S Information Technology Co.,Ltd.          |    Wuhan, Hubei, China     | $1,866.39M  |
| 38 |              Longi Solar Technology Co.,Ltd.               |    Xian, Shaanxi, China    |             |
| 39 |               Sungrow Power Supply Co., Ltd.               |    Hefei, Anhui, China     | $1,720.88M  |
| 40 | Henan Shuangchen Electronic Science & Technology Co., Ltd. |   Zhoukou, Henan, China    |             |
| 41 |             Fujian Furi Electronics Co., Ltd.              |   Fuzhou, Fujian, China    | $1,617.07M  |
| 42 |                    Risen Energy Co.,Ltd                    |  Ningbo, Zhejiang, China   | $1,564.92M  |
| 43 |           Dongguan Fuqiang Electronics Co.,Ltd.            | Dongguan, Guangdong, China |             |
| 44 |     Hongfujin Precision Industry (Shenzhen) Co., Ltd.      | Shenzhen, Guangdong, China |             |
| 45 |             Gcl-Poly (Su Zhou) Energy Limited              |   Suzhou, Jiangsu, China   |             |
| 46 |                 Shennan Circuits Co., Ltd.                 | Shenzhen, Guangdong, China | $1,495.80M  |
| 47 |           Futaihua Industry (Shenzhen) Co., Ltd.           | Shenzhen, Guangdong, China |             |
| 48 |            Hefei JA Solar Technology Co., Ltd.             |    Hefei, Anhui, China     |             |
| 49 |        Foxconn Kunshan Computer Connector Co., Ltd.        |  Kunshan, Jiangsu, China   |             |
+----+------------------------------------------------------------+----------------------------+-------------+
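
The question also asks, as an optional extra, for more detail per row such as the href. One possible way to get it is sketched below. It is only a sketch under stated assumptions: it works offline on the first two rows of the snippet from the question rather than on the live page, it uses the built-in "html.parser" backend instead of html5lib, and the helper name parse_rows and the "Href" column name are made up for illustration. Instead of splitting on ":", it removes the mobile-only label divs before reading the text.

```python
import pandas as pd
from bs4 import BeautifulSoup

# First two data rows of the snippet from the question, with the outer div closed.
SAMPLE_HTML = """
<div id="companyResults">
<div class="col-md-12 data">
<div class="col-md-6">
<a href="/business-directory/company-profiles.shenzhen_zhaoji_optical_co_ltd.bcf9d7eb4856eb739ec66272a6d9a361.html">
        Shenzhen Zhaoji Optical Co., Ltd.</a>
</div>
<div class="col-md-4">
<div class="show-mobile">Country:</div>
        Shenzhen,
        Guangdong,
        <br/>
        China</div>
<div class="col-md-2 last">
<div class="show-mobile">Sales Revenue ($M):</div>
</div>
</div>
<div class="col-md-12 data">
<div class="col-md-6">
<a href="/business-directory/company-profiles.foxconn_industrial_internet_co_ltd.0d4c40a311dbfb1169684a21caa8794c.html">
        Foxconn Industrial Internet Co., Ltd.</a>
</div>
<div class="col-md-4">
<div class="show-mobile">Country:</div>
        Shenzhen,
        Guangdong,
        <br/>
        China</div>
<div class="col-md-2 last">
<div class="show-mobile">Sales Revenue ($M):</div>
        $40,833.44M</div>
</div>
</div>
"""


def parse_rows(html):
    """Return one dict per company row (hypothetical helper, not part of any library)."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    # class_="data" matches every div whose class list contains "data".
    for row in soup.find_all("div", class_="data"):
        # Remove the mobile-only labels ("Country:", "Sales Revenue ($M):").
        for label in row.find_all("div", class_="show-mobile"):
            label.decompose()
        link = row.find("a")
        location_cell = row.find("div", class_="col-md-4")
        revenue_cell = row.find("div", class_="last")
        rows.append({
            "Company": link.get_text(strip=True),
            # Collapse the line breaks and indentation inside the location cell.
            "Location": " ".join(location_cell.get_text(" ", strip=True).split()),
            "Revenue": revenue_cell.get_text(strip=True),
            "Href": link["href"],
        })
    return rows


df = pd.DataFrame(
    parse_rows(SAMPLE_HTML),
    columns=["Company", "Location", "Revenue", "Href"],
)
print(df[["Company", "Location", "Revenue"]])
```

The same function should work on the downloaded page text as long as the live markup matches the snippet in the question, which is an assumption worth checking. The hrefs are relative, so they would need to be joined with the site's base URL before use.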
