Finding text encoded with Angular's encoder using BeautifulSoup

Problem description

I am trying to load the text of a web document using bs4 in Python (3.7). However, I do not get all of the text inside the paragraph tags. This is what I tried:

import bs4
import requests

url = 'https://www.londonstockexchange.com/news-article/SMT/transaction-in-own-shares/14885136'
page = requests.get(url)
# soup = bs4.BeautifulSoup(page.text, 'html.parser')   <-- adding the html.parser does not work either
soup = bs4.BeautifulSoup(page.content)
soup.find_all('p')  # Only returns the text at the bottom of the document
soup.find_all('p', {'class': 'q'})  # returns nothing
soup.getText()  # Does not return the document text either

There is a shadow-root before the document text. I don't know what it does, or whether it affects the result.

When I save page.content to a text file, the document text is there as text, but it looks quite different from normal HTML tags, for example:

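&l;p class=\&q;q\&q;&g;On 2 March 2021 the Company purchased&a;#160;1,000,000 ordinary shares at a price of &a;#160;1,193.07 p The shares purchased will be held in Treasury.&l;/p&g;\r\n&l;p class=\&q;v\&q;&g;Following the transaction the Company holds&a;#160;48,356,074 shares in Treasury.&l;/p&g;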

My question: what causes this behaviour, and how can I extract the text from the document?

Tags: python, python-3.x, beautifulsoup

Solution


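The text is encoded with Angular's custom encoder and can be found in a script tag. After cleaning it up, you can load the data in this tag with the json module. Then find the article text in the resulting dictionary and parse the HTML again with BeautifulSoup, for example: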

import json
from bs4 import BeautifulSoup
import requests

url = 'https://www.londonstockexchange.com/news-article/SMT/transaction-in-own-shares/14885136'
page = requests.get(url)
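soup = BeautifulSoup(page.content, 'html.parser')

# The data is stored as escaped JSON in the script tag with id "ng-lseg-state".
# Undo the escaping; '&a;' is replaced last so that an escaped ampersand cannot
# combine with the characters after it into another escape sequence.
data = (soup.select_one('#ng-lseg-state').string
        .replace('&q;', '"').replace('&l;', '<').replace('&g;', '>')
        .replace('&s;', "'").replace('&a;', '&'))
data = json.loads(data)

# The article HTML sits under this key; parse it again with BeautifulSoup.
text = BeautifulSoup(data['G.{{api_endpoint}}/api/v1/pages?parameters=newsId%3D14885136&path=news-article']['body']['components'][1]['content']['newsArticle']['value'], 'html.parser')

print(text.find('body').get_text(strip=True, separator='\n'))

This prints: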

RNS Number : 9260Q
Scottish Mortgage Inv Tst PLC
02 March 2021
Scottish Mortgage Investment Trust PLC
Legal Entity Identifier:
213800G37DCS3Q9IJM38
Purchase of Own Securities
On 2 March 2021 the Company purchased 1,000,000 ordinary shares at a price of  1,193.07 p The shares purchased will be held in Treasury.
Following the transaction the Company holds 48,356,074 shares in Treasury.
The shares in issue less the total number of shares in Treasury are 1,436,424,806
The above figure ( 1,436,424,806 ) may be used by shareholders as the denominator for the calculations by which they will determine if they are required to notify their interest in, or a change to their interest in, Scottish Mortgage Investment Trust PLC under the FCA's Disclosure and Transparency Rules.
Baillie Gifford & Co Limited
Company Secretaries
2 March 2021
Regulated Information Classification:
Acquisition or disposal of the
issuer's own shares
This information is provided by RNS, the news service of the London Stock Exchange. RNS is approved by the Financial Conduct Authority to act as a Primary Information Provider in the United Kingdom. Terms and conditions relating to the use and distribution of this information may apply. For further information, please contact
rns@lseg.com
or visit
www.rns.com
.
RNS may use your IP address to confirm compliance with the terms and conditions, to analyse how you engage with the information contained in this communication, and to share such analysis on an anonymised basis with others as part of our commercial services. For further information about how RNS and the London Stock Exchange use the personal data you provide us, please see our
Privacy Policy
.
END
POSDKBBDOBKDONK
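
As an alternative to the chained replace() calls and the hard-coded dictionary key, the decoding and lookup could be sketched as below. This is a minimal sketch under stated assumptions, not the method used in the answer: the function names decode_state and find_key and the shortened sample string are hypothetical, and the escape table only contains the five sequences that the code above replaces. A single regex pass avoids any dependence on replacement order, and the recursive search avoids relying on the exact key or on the position of the component in the list.

```python
import json
import re

# Escape sequences replaced in the answer's code, mapped back to the
# characters they stand for.
_ESCAPES = {'&a;': '&', '&q;': '"', '&s;': "'", '&l;': '<', '&g;': '>'}
_ESCAPE_RE = re.compile(r'&[aqslg];')


def decode_state(raw):
    """Undo the escaping in a single pass and parse the result as JSON."""
    return json.loads(_ESCAPE_RE.sub(lambda m: _ESCAPES[m.group(0)], raw))


def find_key(node, key):
    """Yield every value stored under `key` anywhere in nested dicts/lists."""
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                yield v
            yield from find_key(v, key)
    elif isinstance(node, list):
        for item in node:
            yield from find_key(item, key)


# Hypothetical, shortened stand-in for the contents of the script tag.
sample = ('{&q;page&q;:{&q;newsArticle&q;:{&q;value&q;:'
          '&q;&l;p class=\\&q;q\\&q;&g;Price: &a;#160;1,193.07 p&l;/p&g;&q;}}}')

state = decode_state(sample)
article = next(find_key(state, 'newsArticle'))
print(article['value'])
```

With the real page, the string passed to decode_state would be soup.select_one('#ng-lseg-state').string, and article['value'] could then be handed to BeautifulSoup(..., 'html.parser') as in the answer.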
