python - Unable to parse HTML in Python
Problem description
Why can't I parse this HTML page into a CSV?
url='c:/x/x/x/xyz.html' #html(home page of www.cloudtango.org) data is stored inside a local drive
with open(url, 'r',encoding='utf-8') as f:
    html_string = f.read()
soup= bs4.BeautifulSoup('html_string.parser')
data1= html_string.find_all('td',{'class':'company'})
full=[]
for each in data1:
    comp= each.find('img')['alt']
    desc= each.find_next('td').text
    dd={'company':comp,'description':desc}
    full.append(dd)
Error:
AttributeError: 'str' object has no attribute 'find_all'
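Solution
html_string is a plain str, which has no .find_all() method. The search has to run on the BeautifulSoup object instead. Note also that BeautifulSoup('html_string.parser') parses the literal text 'html_string.parser' rather than the file contents; the markup and the parser name must be passed as two separate arguments, BeautifulSoup(html_string, 'html.parser'), and find_all() is then called on the resulting soup.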
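As a minimal sketch of that fix, the snippet below parses a small inline HTML string in place of the saved page. The markup in html_string is a hypothetical stand-in (the real page layout is only assumed to match the td/img structure used in the question), and the names cells, names and descriptions are illustrative.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the contents of the saved page
html_string = """
<table>
  <tr>
    <td class="company"><img src="logo.png" alt="Example IT"></td>
    <td>Managed services provider</td>
  </tr>
</table>
"""

# Pass the markup first and the parser name second,
# then search the soup object, not the raw string
soup = BeautifulSoup(html_string, "html.parser")
cells = soup.find_all("td", {"class": "company"})

# Same extraction as in the question: alt text of the logo,
# then the text of the next <td> in document order
names = [cell.find("img")["alt"] for cell in cells]
descriptions = [cell.find_next("td").text for cell in cells]
print(names, descriptions)
```

Calling html_string.find_all(...) here would raise the same AttributeError as in the question, because only the soup object exposes find_all().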
To fetch the information from the given URL, you can use the following example:
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://www.cloudtango.org/"
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0"
}
soup = BeautifulSoup(requests.get(url, headers=headers).content, "html.parser")

data1 = soup.find_all("td", {"class": "company"})
full = []
for each in data1:
    comp = each.find("img")["alt"]
    desc = each.find_next("td").text
    dd = {"company": comp, "description": desc}
    full.append(dd)

print(pd.DataFrame(full))
Prints:
company description
0 BlackPoint IT Services BlackPoint’s comprehensive range of Managed IT Services is designed to help you improve IT quality, efficiency and reliability -and save you up to 50% on IT cost. Providing IT solutions for more …
1 ICC Managed Services The ICC Group is a global and independent IT solutions company, providing a comprehensive, customer focused service to the SME, enterprise and public sector markets. \r\n\r\nICC deliver a full …
2 First Focus First Focus is Australia’s best managed service provider for medium sized organisations. With tens of thousands of end users supported across hundreds of customers, First Focus has the experience …
...and so on.
Edit: to read from a local file:
import pandas as pd
from bs4 import BeautifulSoup

# Read the saved page; utf-8 matches the encoding used in the question
with open('your_file.html', 'r', encoding='utf-8') as f_in:
    soup = BeautifulSoup(f_in.read(), "html.parser")

data1 = soup.find_all("td", {"class": "company"})
full = []
for each in data1:
    comp = each.find("img")["alt"]
    desc = each.find_next("td").text
    dd = {"company": comp, "description": desc}
    full.append(dd)

print(pd.DataFrame(full))
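The question asks for CSV output, which the examples above stop short of. One possible way to finish the job, sketched here with the standard library csv module, is to write the list of dicts directly. The helper write_rows, the sample rows and the file name companies.csv are all hypothetical; the sample rows stand in for the scraped full list.

```python
import csv
import io


def write_rows(rows, out):
    """Write dicts with 'company' and 'description' keys as CSV to a file-like object."""
    writer = csv.DictWriter(out, fieldnames=["company", "description"])
    writer.writeheader()
    writer.writerows(rows)


# Hypothetical sample rows standing in for the scraped `full` list
full = [
    {"company": "Example IT", "description": "Managed services, cloud hosting"},
    {"company": "Sample Co", "description": "Help desk support"},
]

# Write to an in-memory buffer so the result can be inspected
buffer = io.StringIO()
write_rows(full, buffer)
csv_text = buffer.getvalue()
print(csv_text)

# To write to disk instead (hypothetical file name):
# with open("companies.csv", "w", newline="", encoding="utf-8") as f_out:
#     write_rows(full, f_out)
```

Fields containing commas or line breaks are quoted automatically by the csv module. Since the answer already builds a DataFrame, pd.DataFrame(full).to_csv("companies.csv", index=False) should give an equivalent file.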