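python - Using BeautifulSoup to find text encoded with Angular's encoder
Problem description
I am trying to load the text of a web document in Python (3.7) using bs4. However, I am not getting all of the paragraph text. Here is what I tried: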
import bs4
import requests
url = 'https://www.londonstockexchange.com/news-article/SMT/transaction-in-own-shares/14885136'
page = requests.get(url)
# soup = bs4.BeautifulSoup(page.text, 'html.parser')  # <-- adding html.parser does not work either
soup = bs4.BeautifulSoup(page.content)
soup.find_all('p') # Only returns the text at the bottom of the document
soup.find_all('p', {'class': 'q'}) # returns nothing
soup.getText() # Does not return the document text either
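There is a shadow-root before the document text. I don't know what it does, or whether it affects the result.
When I save page.content to a text file, the document text is there as plain text, but it looks quite different from ordinary HTML tags, for example:
&l;p class=\&q;q\&q;&g;On 2 March 2021 the Company purchased&a;#160;1,000,000 ordinary shares at a price of&a;#160;1,193.07 p The shares purchased will be held in Treasury.&l;/p&g;\r\n&l;p class=\&q;v\&q;&g;Following the transaction the Company holds&a;#160;48,356,074 shares in Treasury.&l;/p&g
My question: what causes this behaviour, and how can I extract the text from the document?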
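Solution
The text is encoded with Angular's custom encoder and sits inside a script tag (the one with id ng-lseg-state). Once the escape sequences are undone, the contents of this tag can be loaded with json. Then find the article text in the resulting dictionary and parse it again as HTML with BeautifulSoup, for example: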
import json
import requests
from bs4 import BeautifulSoup

url = 'https://www.londonstockexchange.com/news-article/SMT/transaction-in-own-shares/14885136'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')

# The article is stored as escaped JSON inside the tag with id "ng-lseg-state".
raw = soup.select_one('#ng-lseg-state').string
# Undo the escaping. '&a;' goes last so an escaped ampersand is not unescaped twice.
for escaped, char in (('&q;', '"'), ('&l;', '<'), ('&g;', '>'), ('&s;', "'"), ('&a;', '&')):
    raw = raw.replace(escaped, char)
data = json.loads(raw)

key = 'G.{{api_endpoint}}/api/v1/pages?parameters=newsId%3D14885136&path=news-article'
article_html = data[key]['body']['components'][1]['content']['newsArticle']['value']
text = BeautifulSoup(article_html, 'html.parser')
print(text.find('body').get_text(strip=True, separator='\n'))
This prints:
RNS Number : 9260Q
Scottish Mortgage Inv Tst PLC
02 March 2021
Scottish Mortgage Investment Trust PLC
Legal Entity Identifier:
213800G37DCS3Q9IJM38
Purchase of Own Securities
On 2 March 2021 the Company purchased 1,000,000 ordinary shares at a price of 1,193.07 p The shares purchased will be held in Treasury.
Following the transaction the Company holds 48,356,074 shares in Treasury.
The shares in issue less the total number of shares in Treasury are 1,436,424,806
The above figure ( 1,436,424,806 ) may be used by shareholders as the denominator for the calculations by which they will determine if they are required to notify their interest in, or a change to their interest in, Scottish Mortgage Investment Trust PLC under the FCA's Disclosure and Transparency Rules.
Baillie Gifford & Co Limited
Company Secretaries
2 March 2021
Regulated Information Classification:
Acquisition or disposal of the
issuer's own shares
This information is provided by RNS, the news service of the London Stock Exchange. RNS is approved by the Financial Conduct Authority to act as a Primary Information Provider in the United Kingdom. Terms and conditions relating to the use and distribution of this information may apply. For further information, please contact
rns@lseg.com
or visit
www.rns.com
.
RNS may use your IP address to confirm compliance with the terms and conditions, to analyse how you engage with the information contained in this communication, and to share such analysis on an anonymised basis with others as part of our commercial services. For further information about how RNS and the London Stock Exchange use the personal data you provide us, please see our
Privacy Policy
.
END
POSDKBBDOBKDONK
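The lookup in the answer depends on a long dictionary key that contains the news id, and on a fixed position in the components list, so it only fits this one article. The following is a minimal sketch of a more reusable variant, not the site's or Angular's official API. The names unescape_state, find_key and sample are hypothetical, the escape table is assumed to consist only of the five sequences handled in the answer, and the sample string is a hand-made stand-in rather than real site data.

```python
import json

# Assumed escape table: the five sequences handled in the answer.
# '&a;' comes last so that an escaped ampersand is never unescaped twice
# (for example '&a;s;' must become '&s;', not an apostrophe).
ESCAPES = (('&q;', '"'), ('&s;', "'"), ('&l;', '<'), ('&g;', '>'), ('&a;', '&'))


def unescape_state(raw):
    """Reverse the '&q;'-style escaping applied to the embedded JSON."""
    for escaped, char in ESCAPES:
        raw = raw.replace(escaped, char)
    return raw


def find_key(node, wanted):
    """Depth-first search for the first value stored under `wanted`."""
    if isinstance(node, dict):
        if wanted in node:
            return node[wanted]
        children = node.values()
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_key(child, wanted)
        if found is not None:
            return found
    return None


# Hand-made stand-in for the contents of the script tag (not real site data).
sample = ('{&q;page&q;: {&q;components&q;: [{}, {&q;newsArticle&q;: '
          '{&q;value&q;: &q;&l;p class=\\&q;q\\&q;&g;Baillie Gifford &a;amp; Co&l;/p&g;&q;}}]}}')

data = json.loads(unescape_state(sample))
article_html = find_key(data, 'newsArticle')['value']
print(article_html)  # prints: <p class="q">Baillie Gifford &amp; Co</p>
```

The resulting article_html string can then be handed to BeautifulSoup exactly as in the answer. Searching for the 'newsArticle' key assumes that the page state contains only one such entry; if it contains several, this sketch returns the first one it meets.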