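python - How to scrape non-HTML tags with BeautifulSoup
Problem description
I am trying to scrape data from a website that contains tags such as <a href="https://evisa.mfa.am/">
For example, see this website.
Is there any way for BeautifulSoup to extract data from non-HTML tags?
Here is a snippet of the full HTML page from the link above: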
<br/>2. Airlines must provide advance passenger information of scheduled arrival of nationals of Antigua and Barbuda and resident diplomats. <br/><br/><b>ARGENTINA</b> - published 02.04.2020 <br/>Passengers are not allowed to enter Argentina until 12 April 2020.<br/><br/><b>ARMENIA</b> - published 22.03.2020 <br/>1. Nationals of China (People's Rep.) with a normal passport are no longer visa exempt. <br/>2. Nationals of Iran can no longer obtain a visa on arrival. They must obtain a visa or an e-visa prior to their arrival in Armenia. The e-visa can be obtained at <a href="https://evisa.mfa.am/">https://evisa.mfa.am/</a> <br/>3. Passengers who have been in Austria, Belgium, China (People's Rep.), Denmark, France, Germany, Iran, Italy, Japan, Korea (Rep.), Netherlands, Norway, Spain, Sweden, Switzerland or United Kingdom in the past 14 days are not allowed to enter Armenia.<br/>- This does not apply to nationals or residents of Armenia.<br/>- This does not apply to spouses or children of nationals of Armenia.<br/>- This does not apply to employees of foreign diplomatic missions and consular institutions.<br/>- This does not apply to representations of official international missions or organizations.<br/>4. Nationals of Armenia who have been in Austria, Belgium, China (People's Rep.), Denmark, France, Germany, Iran, Italy, Japan, Korea (Rep.), Netherlands, Norway, Spain, Sweden, Switzerland or United Kingdom in the past 14 days must undergo 14-days of quarantine or self-isolation regime.
Solution
This is what is called an AMP (ampersand) character, an HTML character entity.
Don't use html.parser. Just use a real parser, such as lxml or html5lib:
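from bs4 import BeautifulSoup
import requests

# Fetch the page over HTTP.
r = requests.get(
    "https://www.iatatravelcentre.com/international-travel-document-news/1580226297.htm")
# Parse with html5lib instead of html.parser (the html5lib package must be installed).
soup = BeautifulSoup(r.content, 'html5lib')
print(soup.prettify())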
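Once the page is parsed, the link itself still has to be pulled out. The following is only a minimal offline sketch, not part of the original answer. It assumes the anchor tag reaches you entity-escaped (as &lt;a ...&gt;), which is a guess at what the question describes. The names raw and extract_links are hypothetical, and the html.unescape step is an addition that may be unnecessary if the parser already gives you real <a> tags.

```python
import html
from bs4 import BeautifulSoup

# Fragment adapted from the snippet in the question. The anchor tag is
# entity-escaped here on purpose (an assumption about the raw page).
raw = (
    "The e-visa can be obtained at "
    '&lt;a href="https://evisa.mfa.am/"&gt;https://evisa.mfa.am/&lt;/a&gt; <br/>'
)

def extract_links(markup, parser="html5lib"):
    """Return the href of every <a> tag, decoding entities first."""
    # Turn &lt; and &gt; back into < and > so the tag becomes real markup.
    decoded = html.unescape(markup)
    soup = BeautifulSoup(decoded, parser)
    # href=True skips anchors that have no href attribute.
    return [a["href"] for a in soup.find_all("a", href=True)]

links = extract_links(raw)
print(links)
```

The parser is left as a parameter so that lxml can be swapped in. For a live page, the same function could be given r.text from the request above.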