BeautifulSoup and single quotes in attributes

Problem description

I am trying to read an HTML page and extract some information from it. In one of the lines, the information I need is inside an image's alt attribute, like so:

<img src='logo.jpg' alt='info i need'>

The problem is that, when parsing this, BeautifulSoup wraps the contents of alt in double quotes instead of using the single quotes already present. As a result, I get something like this:

<img alt="\'info" i="" need="" src="\'logo.jpg\'"/>

Currently, my code is this:

name = row.find("td", {"class": "logo"}).find("img")["alt"]

This should return "info i need" but currently returns "\'info". What could I be doing wrong? Is there a setting I need to change to make BeautifulSoup parse this correctly?

Edit: my code looks something like this (I also tried the standard HTML parser, but it made no difference):

import sys
import urllib.request
import time
from html.parser import HTMLParser
from bs4 import BeautifulSoup

def main():     
    url = 'https://myhtml.html'
    with urllib.request.urlopen(url) as page:
        text = str(page.read())
        html = BeautifulSoup(page.read(), "lxml")

        table = html.find("table", {"id": "info_table"})
        rows = table.find_all("tr")

        for row in rows:
            if row.find("th") is not None:
                continue
            info = row.find("td", {"class": "logo"}).find("img")["alt"]
            print(info) 


if __name__ == '__main__':
    main()

and the html:

<div class="table_container">
<table class="info_table" id="info_table">
<tr>
   <th class="logo">Important infos</th>
   <th class="useless">Other infos</th>
</tr>
<tr >
   <td class="logo"><img src='Logo.jpg' alt='info i need'><br></td>
   <td class="useless">
      <nobr>useless info</nobr>
   </td>
</tr>
<tr >
   <td class="logo"><img src='Logo2.jpg' alt='info i need too'><br></td>
   <td class="useless">
      <nobr>useless info</nobr>
   </td>
</tr>

Tags: python, beautifulsoup

Solution


Sorry, I can't add comments.

I tested your case, and the output looks correct to me.

HTML:

<html>
    <body>
        <td class="logo">
            <img src='logo.jpg' alt='info i need'>
        </td>
    </body>
</html>

Python:

from bs4 import BeautifulSoup

with open("myhtml.html", "r") as html:
    soup = BeautifulSoup(html, 'html.parser')
    name = soup.find("td", {"class": "logo"}).find("img")["alt"]
    print(name)

Returns:

info i need

I think your problem is an encoding issue that arises when the file is written back to HTML.
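
One way the backslash-escaped quotes shown in the question could arise, assuming the parsed markup went through str() as in the question's `text = str(page.read())` line: calling str() on a bytes object returns its repr, not the decoded text, and that repr escapes single quotes when the data contains both quote styles. The sketch below is only an illustration. It uses the standard library's HTMLParser instead of BeautifulSoup to stay dependency-free, and the names `AltCollector` and `first_alt` are hypothetical helpers, not part of any library.

```python
from html.parser import HTMLParser

# Bytes as urllib.request would return them; the markup mixes both quote styles.
raw = b"""<td class="logo"><img src='logo.jpg' alt='info i need'></td>"""

as_repr = str(raw)             # repr of the bytes object: b'...' with \' escapes
as_text = raw.decode("utf-8")  # the actual markup as text


class AltCollector(HTMLParser):
    """Hypothetical helper: records the alt attribute of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.alts = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.alts.append(dict(attrs).get("alt"))


def first_alt(markup):
    """Return the alt value of the first <img> in markup, or None."""
    parser = AltCollector()
    parser.feed(markup)
    parser.close()
    return parser.alts[0] if parser.alts else None


print(as_repr)             # note the backslash before each single quote
print(first_alt(as_text))  # info i need
print(first_alt(as_repr))  # not the intended value: the quotes are no longer delimiters
```

If this is the cause, passing the raw bytes (or the decoded text) to the parser instead of the str() result should avoid the stray backslashes.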

Please provide your complete code and more information:

  • the HTML
  • your Python code

Update:

I tested your code, and it does not work at all. After reworking it, I was able to get the desired output:

import urllib.request
from bs4 import BeautifulSoup

def main():
    url = 'https://code.mytesturl.net'
    with urllib.request.urlopen(url) as page:
        # Pass the response object straight to BeautifulSoup so the
        # markup is read once, as bytes, without going through str().
        soup = BeautifulSoup(page, "html.parser")
        name = soup.find("td", {"class": "logo"}).find("img")["alt"]
        print(name)


if __name__ == '__main__':
    main()
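
A likely reason the question's version does not run: it calls `page.read()` twice, once for `text` and once inside the `BeautifulSoup(...)` call. A file-like stream is consumed by the first full read, so the second call typically returns empty bytes and the parser receives no markup. A minimal sketch of that behavior, using `io.BytesIO` as a stand-in for the HTTP response (an assumption made so the example runs offline):

```python
import io

# io.BytesIO stands in for the response object; both are file-like byte streams.
page = io.BytesIO(b"<img src='logo.jpg' alt='info i need'>")

first = page.read()   # consumes the whole stream
second = page.read()  # nothing is left to read

print(first)
print(second)  # b''
```

Reading once and reusing the result, or handing the response object directly to the parser as in the reworked code, sidesteps this.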

Possible issues:
Maybe your parser should be html.parser.
Which Python version and bs4 version are you using?
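
To answer the version question, something like the following could be run in the same environment as the scraper. It is a small sketch using only the standard library; `installed_version` is a hypothetical helper name, and "beautifulsoup4" and "lxml" are the distribution names as published on PyPI.

```python
import sys
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


print("Python:", sys.version.split()[0])
for dist in ("beautifulsoup4", "lxml"):
    print(dist + ":", installed_version(dist) or "not installed")
```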

