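python-3.x - How can I get more than 20 news headline links from one section of the Reuters website (e.g. Middle East) using Python?
Problem description
I am trying to scrape all the news headlines related to the Middle East on the Reuters website. Link to the page: https://www.reuters.com/subjects/middle-east
The page automatically shows earlier headlines as I scroll down, but when I look at the page source it only contains the latest 20 headline links.
I looked for a "next" or "previous" hyperlink, which usually exists for this kind of problem, but unfortunately this page has no such hyperlink.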
import requests
from bs4 import BeautifulSoup
import re

url = 'https://www.reuters.com/subjects/middle-east'
result = requests.get(url)
content = result.content
soup = BeautifulSoup(content, 'html.parser')

# Gets all the links on the page source
links = []
for hl in soup.find_all('a'):
    if re.search('article', hl['href']):
        links.append(hl['href'])

# The first link is the page itself and so we skip it
links = links[1:]

# The urls are repeated and so we only keep the unique instances
urls = []
for url in links:
    if url not in urls:
        urls.append(url)

# The number of urls is limited to 20 (THE PROBLEM!)
print(len(urls))
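My experience with all of this is very limited, but my best guess is that JavaScript (or whatever language the page uses) loads the earlier results as you scroll down, and that I may need some Python module to deal with this.
The code goes on to extract other details from each link, but that is not relevant to the question posted here.
Solution
You can use selenium with the Keys.PAGE_DOWN option to scroll down first and then get the page source. If you like, you can then feed it to BeautifulSoup.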
import re
import time

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

# Selenium 4 style: the driver path is passed through Service, and
# find_element(By.TAG_NAME, ...) replaces find_element_by_tag_name
browser = webdriver.Chrome(service=Service('/path/to/chromedriver'))
browser.get("https://www.reuters.com/subjects/middle-east")
time.sleep(1)

elem = browser.find_element(By.TAG_NAME, "body")

# Press PAGE_DOWN repeatedly so the page loads older headlines
no_of_pagedowns = 25
while no_of_pagedowns:
    elem.send_keys(Keys.PAGE_DOWN)
    time.sleep(0.2)
    no_of_pagedowns -= 1

source = browser.page_source
browser.quit()
soup = BeautifulSoup(source, 'html.parser')

# Gets all the links on the page source (href=True skips anchors without an href)
links = []
for hl in soup.find_all('a', href=True):
    if re.search('article', hl['href']):
        links.append(hl['href'])

# The first link is the page itself and so we skip it
links = links[1:]

# The urls are repeated and so we only keep the unique instances
urls = []
for url in links:
    if url not in urls:
        urls.append(url)

# After scrolling, more than the initial 20 urls are available
print(len(urls))
Output
40
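The answer scrolls a fixed 25 times, so the number of links it collects depends on that constant. As a rough sketch of one alternative (not part of the original answer), the loop below keeps scrolling until the page height stops growing. The helper names `scroll_until_stable` and `unique_article_links`, and the `pause` and `max_rounds` parameters, are hypothetical and introduced only for illustration. The only Selenium call assumed is `execute_script`, so `driver` is expected to be a WebDriver such as the `browser` object above.

```python
import re
import time


def scroll_until_stable(driver, pause=0.5, max_rounds=50):
    """Scroll to the bottom repeatedly until the page height stops growing.

    `driver` is assumed to be a Selenium WebDriver (any object exposing
    execute_script works). Returns how many scrolls loaded more content.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    rounds = 0
    while rounds < max_rounds:
        # Jump to the current bottom of the page to trigger lazy loading
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the page time to fetch older headlines
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was appended, so stop
        last_height = new_height
        rounds += 1
    return rounds


def unique_article_links(hrefs):
    """Keep hrefs containing 'article', dropping duplicates in first-seen order."""
    matching = [h for h in hrefs if h and re.search('article', h)]
    # dict.fromkeys preserves insertion order, so the first occurrence wins
    return list(dict.fromkeys(matching))
```

With the answer's objects this might be used as `scroll_until_stable(browser)` before reading `browser.page_source`, followed by `unique_article_links(a.get('href') for a in soup.find_all('a'))`. A slow connection could make the page look stable before new items arrive, so `pause` may need tuning, and `max_rounds` is there to keep an endless feed from looping forever.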