python - Need help identifying the HTML tags so I can extract all the relevant headlines, links and img URLs. My code currently shows only 1
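Problem description
I use the Requests library to access the website and BeautifulSoup to parse the HTML. I want my scraper to pull at least 4 headlines from the site, each with its link and image URL. I know the data sits in an HTML tag, but I cannot work out which tag to target. I have posted what I have so far. The code displays only the first headline, its image URL and its link.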
from bs4 import BeautifulSoup
import requests

# User-Agent header so the request looks like it comes from a browser
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.101'}
# Website to be scraped
source = requests.get('https://www.jse.co.za/', headers=headers).text
# print(source) - verify that the HTML for the page was returned
soup = BeautifulSoup(source, 'lxml')  # lxml parser
# print(soup.prettify()) - check that the HTML has been parsed
for item in soup.find_all('div', {'class': 'view-content row row-flex'})[0:4]:  # first four matches
    text = item.find('h3', {'class': 'card__title'}).text.strip()
    img = item.find('img', {'class': 'media__image'})
    link = item.find('a')
    article_link = link.attrs['href']
    print('ARTICLE HEADLINE')
    print(text)
    print('IMAGE URL')
    print(img['data-src'])
    print('LINK TO ARTICLE')
    print(article_link)
    print()
Output
# I was hoping to see 4 headlines
ARTICLE HEADLINE
South Africa offers investment opportunities to Asia Pacific investors
# I was hoping to see at least 4 image URLs
IMAGE URL
/sites/default/files/styles/standard_lg/public/medial/images/2021-06/Web_Banner_0.jpg?h=4ae650de&itok=hdGEy5jA
# I was hoping to scrape at least 4 links
LINK TO ARTICLE
/news/market-news/south-africa-offers-investment-opportunities-asia-pacific-investors
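Solution
Looking at the JSE site, each news item is listed in an `article` tag that also carries the `card` class. I suggest splitting the page on those tags with `for article in soup.find_all('article')` and then pulling each inner item out of the individual article.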
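To illustrate why the original loop printed only one result, here is a minimal sketch on hypothetical, simplified markup. Only the tag and class names are taken from this thread; the HTML itself and the `/news/...` paths are made up for the example, and the live page may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one wrapper div holding several article cards
SAMPLE_HTML = """
<div class="view-content row row-flex">
  <article class="card"><h3 class="card__title"><a href="/news/a">First</a></h3></article>
  <article class="card"><h3 class="card__title"><a href="/news/b">Second</a></h3></article>
  <article class="card"><h3 class="card__title"><a href="/news/c">Third</a></h3></article>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# The wrapper div occurs once, so a loop over it runs a single iteration
containers = soup.find_all("div", {"class": "view-content row row-flex"})
# Each news item is its own <article>, so this yields one element per item
articles = soup.find_all("article", class_="card")

print(len(containers), len(articles))
for article in articles:
    print(article.h3.get_text(strip=True), article.a["href"])
```

In this sample the wrapper matches once while the articles match three times, so slicing the wrapper list with `[0:4]` can never produce more than one iteration.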
Update: a complete working example.
from bs4 import BeautifulSoup
import requests

base_url = 'https://www.jse.co.za'
source = requests.get(base_url).text
print("Got source")
soup = BeautifulSoup(source, 'html.parser')
print("Parsed source")
# Every news item is an <article> element with the "card" class
articles = soup.find_all("article", class_="card")
print(f"Number of articles found: {len(articles)}")
for article in articles:
    print("----------------------------------------------------")
    headline = article.h3.text.strip()
    # The href and data-src values are site-relative, so prefix the base URL
    link = base_url + article.a['href']
    text = article.find("div", class_="field--type-text-with-summary").text.strip()
    img_url = base_url + article.picture.img['data-src']
    print(headline)
    print(link)
    print(text)
    print("Image: " + img_url)
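If some cards lack an image or a link, the attribute lookups in the example above would raise an error. The following is a hedged variation, not the answerer's code: `extract_articles`, its `limit` parameter and the sample markup are hypothetical names introduced here, and falling back to `src` when `data-src` is missing is an assumption about how the site might lazy-load images.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://www.jse.co.za"


def extract_articles(html, base_url=BASE_URL, limit=4):
    """Return up to `limit` dicts holding headline, link and image URL."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for article in soup.find_all("article", class_="card")[:limit]:
        heading = article.find("h3")
        anchor = article.find("a", href=True)
        image = article.find("img")
        # Prefer the data-src attribute seen in the thread, fall back to src (assumption)
        img_src = (image.get("data-src") or image.get("src")) if image else None
        results.append({
            "headline": heading.get_text(strip=True) if heading else None,
            # urljoin turns site-relative paths into absolute URLs
            "link": urljoin(base_url, anchor["href"]) if anchor else None,
            "image": urljoin(base_url, img_src) if img_src else None,
        })
    return results


# Hypothetical markup for illustration; the second card has no image
SAMPLE_HTML = """
<article class="card">
  <picture><img class="media__image" data-src="/files/one.jpg"></picture>
  <h3 class="card__title"><a href="/news/one">One</a></h3>
</article>
<article class="card">
  <h3 class="card__title"><a href="/news/two">Two</a></h3>
</article>
"""

for row in extract_articles(SAMPLE_HTML):
    print(row)
```

To try it against the live site, pass `requests.get(BASE_URL).text` as the `html` argument. The site's markup may have changed since this thread was written, so the selectors may need adjusting.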