Unable to parse HTML in Python

Question

Why am I unable to parse this HTML page into CSV?

url='c:/x/x/x/xyz.html' #html(home page of www.cloudtango.org) data is stored inside a local drive



with open(url, 'r',encoding='utf-8') as f:
    html_string = f.read()

soup= bs4.BeautifulSoup('html_string.parser')
data1= html_string.find_all('td',{'class':'company'})
full=[]
for each in data1:
    comp= each.find('img')['alt']
    desc= each.find_next('td').text
    dd={'company':comp,'description':desc}
    full.append(dd)

Error:

AttributeError: 'str' object has no attribute 'find_all'

Tags: python, html, python-3.x, web-scraping

Solution


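html_string is a plain str, and str has no .find_all() method; that method belongs to the parsed BeautifulSoup object. The question's code also passes the literal text 'html_string.parser' to BeautifulSoup, so the file contents are never parsed. Pass the markup and the parser name as two separate arguments, BeautifulSoup(html_string, 'html.parser'), and call .find_all() on the resulting soup.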
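As a minimal sketch of that fix, the snippet below parses a small inline HTML string and then calls find_all() on the soup. The markup and the company name are made up for illustration; they only stand in for the contents of the local file and assume the same td/img layout the question's code expects.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the contents of the local HTML file.
html_string = """
<table>
  <tr>
    <td class="company"><img src="a.png" alt="Example Co"></td>
    <td>An example description.</td>
  </tr>
</table>
"""

# Pass the markup and the parser name as two separate arguments.
soup = BeautifulSoup(html_string, "html.parser")

# find_all() is a method of the parsed soup, not of the raw string.
cells = soup.find_all("td", {"class": "company"})

full = []
for cell in cells:
    full.append({
        "company": cell.find("img")["alt"],
        "description": cell.find_next("td").text,
    })

print(full)
```

The only changes from the question's code are the two-argument BeautifulSoup call and calling find_all() on soup rather than on html_string.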

To get the information from the specified URL, you can use the following example:

import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://www.cloudtango.org/"
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0"
}
soup = BeautifulSoup(requests.get(url, headers=headers).content, "html.parser")

data1 = soup.find_all("td", {"class": "company"})

full = []
for each in data1:
    comp = each.find("img")["alt"]
    desc = each.find_next("td").text
    dd = {"company": comp, "description": desc}
    full.append(dd)

print(pd.DataFrame(full))

Prints:

                                company                                                                                                                                                                                                description
0                BlackPoint IT Services       BlackPoint’s comprehensive range of Managed IT Services is designed to help you improve IT quality, efficiency and reliability -and save you up to 50% on IT cost. Providing IT solutions for more …
1                  ICC Managed Services        The ICC Group is a global and independent IT solutions company, providing a comprehensive, customer focused service to the SME, enterprise and public sector markets.  \r\n\r\nICC deliver a full …
2                           First Focus      First Focus is Australia’s best managed service provider for medium sized organisations. With tens of thousands of end users supported across hundreds of customers, First Focus has the experience …

...and so on.

EDIT: To read from a local file:

import pandas as pd
from bs4 import BeautifulSoup

with open('your_file.html', 'r') as f_in:
    soup = BeautifulSoup(f_in.read(), "html.parser")

data1 = soup.find_all("td", {"class": "company"})

full = []
for each in data1:
    comp = each.find("img")["alt"]
    desc = each.find_next("td").text
    dd = {"company": comp, "description": desc}
    full.append(dd)

print(pd.DataFrame(full))
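
The question asks for CSV output, while the examples above only print the DataFrame. One possible way to finish the job is pandas' DataFrame.to_csv(). The sketch below uses made-up rows in the same shape as the full list, and the file name companies.csv in the comment is an assumption, not something from the question.

```python
import pandas as pd

# Sample rows in the same shape as the `full` list (values are made up).
full = [
    {"company": "Example Co", "description": "First description."},
    {"company": "Another Co", "description": "Second, with a comma."},
]

df = pd.DataFrame(full)

# With no path argument, to_csv() returns the CSV text as a string.
# index=False leaves out the numeric row index column.
csv_text = df.to_csv(index=False)
print(csv_text)

# To write a file instead (hypothetical file name):
# df.to_csv("companies.csv", index=False, encoding="utf-8")
```

Fields that contain commas are quoted automatically, so descriptions with punctuation should survive the round trip.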
