首页 > 解决方案 > 如何使用 BeautifulSoup 抓取非 HTML 标签

问题描述

我正在尝试从具有标签的网站中删除数据,<a&#32;href="https: evisa.mfa.am ">例如,查看此网站

BeautifulSoup 有什么方法可以从非 html 标签中提取数据?

这是来自上述链接的整个 html 页面的片段

<br/>2.&#32;Airlines&#32;must&#32;provide&#32;advance&#32;passenger&#32;information&#32;of&#32;scheduled&#32;arrival&#32;of&#32;nationals&#32;of&#32;Antigua&#32;and&#32;Barbuda&#32;and&#32;resident&#32;diplomats.&#32;<br/><br/><b>ARGENTINA</b>&#32;-&#32;published&#32;02.04.2020&#32;<br/>Passengers&#32;are&#32;not&#32;allowed&#32;to&#32;enter&#32;Argentina&#32;until&#32;12&#32;April&#32;2020.<br/><br/><b>ARMENIA</b>&#32;-&#32;published&#32;22.03.2020&#32;<br/>1.&#32;Nationals&#32;of&#32;China&#32;(People's&#32;Rep.)&#32;with&#32;a&#32;normal&#32;passport&#32;are&#32;no&#32;longer&#32;visa&#32;exempt.&#32;<br/>2.&#32;Nationals&#32;of&#32;Iran&#32;can&#32;no&#32;longer&#32;obtain&#32;a&#32;visa&#32;on&#32;arrival.&#32;They&#32;must&#32;obtain&#32;a&#32;visa&#32;or&#32;an&#32;e-visa&#32;prior&#32;to&#32;their&#32;arrival&#32;in&#32;Armenia.&#32;The&#32;e-visa&#32;can&#32;be&#32;obtained&#32;at&#32;<a&#32;href="https://evisa.mfa.am/">https://evisa.mfa.am/</a>&#32;<br/>3.&#32;Passengers&#32;who&#32;have&#32;been&#32;in&#32;Austria,&#32;Belgium,&#32;China&#32;(People's&#32;Rep.),&#32;Denmark,&#32;France,&#32;Germany,&#32;Iran,&#32;Italy,&#32;Japan,&#32;Korea&#32;(Rep.),&#32;Netherlands,&#32;Norway,&#32;Spain,&#32;Sweden,&#32;Switzerland&#32;or&#32;United&#32;Kingdom&#32;in&#32;the&#32;past&#32;14&#32;days&#32;are&#32;not&#32;allowed&#32;to&#32;enter&#32;Armenia.<br/>-&#32;This&#32;does&#32;not&#32;apply&#32;to&#32;nationals&#32;or&#32;residents&#32;of&#32;Armenia.<br/>-&#32;This&#32;does&#32;not&#32;apply&#32;to&#32;spouses&#32;or&#32;children&#32;of&#32;nationals&#32;of&#32;Armenia.<br/>-&#32;This&#32;does&#32;not&#32;apply&#32;to&#32;employees&#32;of&#32;foreign&#32;diplomatic&#32;missions&#32;and&#32;consular&#32;institutions.<br/>-&#32;This&#32;does&#32;not&#32;apply&#32;to&#32;representations&#32;of&#32;official&#32;international&#32;missions&#32;or&#32;organizations.<br/>4.&#32;Nationals&#32;of&#32;Armenia&#32;who&#32;have&#32;been&#32;in&#32;Austria,&#32;Belgium,&#32;China&#32;(People's&#32;Rep.),&#32;Denmark,&#32;France,&#32;Germany,&#32;Iran,&#32;Italy,&#32;Japan,&#32;Korea&#32;(Rep.),&#32;Netherlands,&#32;Norway,&#32;Spain,&#32;Sweden,&#32;Switzerland&#32;or&#32;United&#32;Kingdom&#32;in&#32;the&#32;past&#32;14&#32;days&#32;must&#32;undergo&#32;14-days&#32;of&#32;quarantine&#32;or&#32;self-isolation&#32;regime.

标签: pythonbeautifulsoup

解决方案


这就是所谓的AMP字符,你可以看看这里了解它是什么。

不要使用html.parser. 只需使用真实的parser,例如lxmlhtml5lib

from bs4 import BeautifulSoup
import requests

r = requests.get(
    "https://www.iatatravelcentre.com/international-travel-document-news/1580226297.htm")
soup = BeautifulSoup(r.content, 'html5lib')


print(soup.prettify())

推荐阅读