python - BeautifulSoup and single quotes in attributes
Problem description
I am trying to read an HTML page and extract some information from it. On one of the lines, the information I need is inside an image's alt attribute, like so:
<img src='logo.jpg' alt='info i need'>
The problem is that, when parsing this, BeautifulSoup surrounds the contents of alt with double quotes instead of using the single quotes already present. As a result, I get something like this:
<img alt="\'info" i="" need="" src="\'logo.jpg\'"/>
Currently, my code consists of this:
name = row.find("td", {"class": "logo"}).find("img")["alt"]
This should return "info i need" but currently returns "\'info". What could I be doing wrong? Is there a setting I need to change for BeautifulSoup to parse this correctly?
Edit: my code looks something like this (I also tried the standard HTML parser, but it made no difference):
import sys
import urllib.request
import time
from html.parser import HTMLParser
from bs4 import BeautifulSoup

def main():
    url = 'https://myhtml.html'
    with urllib.request.urlopen(url) as page:
        text = str(page.read())
        html = BeautifulSoup(page.read(), "lxml")
    table = html.find("table", {"id": "info_table"})
    rows = table.find_all("tr")
    for row in rows:
        if row.find("th") is not None:
            continue
        info = row.find("td", {"class": "logo"}).find("img")["alt"]
        print(info)

if __name__ == '__main__':
    main()
and the HTML:
<div class="table_container">
<table class="info_table" id="info_table">
<tr>
<th class="logo">Important infos</th>
<th class="useless">Other infos</th>
</tr>
<tr >
<td class="logo"><img src='Logo.jpg' alt='info i need'><br></td>
<td class="useless">
<nobr>useless info</nobr>
</td>
</tr>
<tr >
<td class="logo"><img src='Logo2.jpg' alt='info i need too'><br></td>
<td class="useless">
<nobr>useless info</nobr>
</td>
</tr>
Solution
Sorry, I am not able to add comments.
I tested your case, and the output looks correct to me.
HTML:
<html>
<body>
<td class="logo">
<img src='logo.jpg' alt='info i need'>
</td>
</body>
</html>
Python:
from bs4 import BeautifulSoup

with open("myhtml.html", "r") as html:
    soup = BeautifulSoup(html, 'html.parser')
    name = soup.find("td", {"class": "logo"}).find("img")["alt"]
    print(name)
Output:
info i need
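I think your problem is an encoding issue that occurs when the file is written back to HTML.
Please provide the complete code and more information:
- the HTML
- your Python code
Update:
I tested your code, and it does not work at all :/ After reworking it, I was able to get the desired output.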
import sys
import urllib.request
import time
from html.parser import HTMLParser
from bs4 import BeautifulSoup

def main():
    url = 'https://code.mytesturl.net'
    with urllib.request.urlopen(url) as page:
        soup = BeautifulSoup(page, "html.parser")
        name = soup.find("td", {"class": "logo"}).find("img")["alt"]
        print(name)

if __name__ == '__main__':
    main()
Possible problems:
- Maybe your parser should be html.parser
- Which Python version / bs version are you using?
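Another possible cause, which this thread does not confirm, is the `str(page.read())` call in the question's code. In Python 3, `str()` applied to a bytes object returns its repr (`b'...'`). When the bytes contain both single and double quotes, the single quotes are backslash-escaped, so the parser no longer sees them as attribute delimiters. Note also that a second `page.read()` on the same response returns empty bytes, because the stream has already been consumed. The sketch below uses the standard library's `html.parser` instead of BeautifulSoup so that it runs without third-party packages; `AltCollector` and `img_alts` are hypothetical helper names, and the sample markup is taken from the question.

```python
from html.parser import HTMLParser


class AltCollector(HTMLParser):
    """Collect the alt attribute of every <img> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.alts = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.alts.append(dict(attrs).get("alt"))


def img_alts(markup):
    """Parse a string of HTML and return the list of img alt values."""
    parser = AltCollector()
    parser.feed(markup)
    parser.close()
    return parser.alts


# Bytes as they would come back from page.read(); the markup mixes
# double-quoted and single-quoted attributes, as in the question.
raw = b"""<td class="logo"><img src='Logo.jpg' alt='info i need'><br></td>"""

as_repr = str(raw)             # repr of the bytes: b'...' with \' escapes
as_text = raw.decode("utf-8")  # the actual markup as text

bad = img_alts(as_repr)    # alt is cut off at the first space
good = img_alts(as_text)   # alt is the full attribute value

print(as_repr)
print(bad)
print(good)
```

If this is what happens in the question's code, decoding the bytes (or passing the bytes or the response object straight to BeautifulSoup, as in the reworked code above) avoids the escaped quotes.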