python - Not able to scrape data which is not visible on screen but is part of the slider/carousel
Problem description
I am not able to scrape data from a website when that data is part of a slider/carousel. When I run my script, it scrapes only the first item from the carousel. It does not go through the other pages inside that carousel.
The website I am trying to scrape is:
My Python script:
soup = BeautifulSoup(response, 'html.parser')
divTag = soup.find_all("a", class_=['sc-VigVT', 'eJWBx'])
for tag in divTag:
    tdTags = tag.find_all("h3", class_=['sc-jAaTju', 'iNsSAY'])
    for tag in tdTags:
        print(tag.text)
Output:
Kunal Bahl and Rohit Bansal reveal the inside story of the Snapdeal turnaround
There are 7 carousel items, but I can get only the first one. I cannot get the data from pages 2 to 7 of the carousel/slider.
Please check the image below for what I am referring to (red circle):
Solution
The carousel is generated by JavaScript from JSON data hardcoded in a script on the page. Specifically, this JSON is assigned as follows:
window.__REDUX_STATE__= { ..... }
This suggests that the site uses Redux to manage the state of the app.
We can extract this JSON with the following script:
import requests
from bs4 import BeautifulSoup
import json
import pprint

r = requests.get('https://yourstory.com/')
prefix = "window.__REDUX_STATE__="
soup = BeautifulSoup(r.content, "html.parser")

# get the redux state (json)
data = [
    json.loads(t.text[len(prefix):])
    for t in soup.find_all('script')
    if "__REDUX_STATE__" in t.text
]

# get only the section with cardType == "CarouselCard"
carouselCards = [
    t["data"]
    for t in data[0]["home"]["sections"]
    if ("cardType" in t) and (t["cardType"] == "CarouselCard")
][0]

# print all cards
pprint.pprint(carouselCards)

# get the title, link path & thumbnail of each card
print([
    (t["title"], t["path"], t["metadata"]["thumbnail"])
    for t in carouselCards
])
The JSON has a sections array inside the home field. One of the objects in that array has a cardType of CarouselCard, and its data field holds the cards you are looking for.
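The lookup described above can be sketched on its own, without a network call. This is only a minimal illustration: the helper names parse_state_assignment and carousel_cards and the sample script text are hypothetical, and the use of raw_decode (so that anything after the JSON object, such as a trailing semicolon, is ignored) is an assumption about what the page might contain, not something confirmed for this site.

```python
import json

PREFIX = "window.__REDUX_STATE__="


def parse_state_assignment(script_text):
    """Parse the JSON object assigned to window.__REDUX_STATE__ (hypothetical helper)."""
    text = script_text.strip()
    if not text.startswith(PREFIX):
        raise ValueError("script does not assign window.__REDUX_STATE__")
    payload = text[len(PREFIX):].lstrip()
    # raw_decode parses one JSON value and ignores whatever follows it
    state, _end = json.JSONDecoder().raw_decode(payload)
    return state


def carousel_cards(state):
    """Return the data list of the first section whose cardType is CarouselCard."""
    for section in state.get("home", {}).get("sections", []):
        if section.get("cardType") == "CarouselCard":
            return section.get("data", [])
    return []


# Made-up sample that mimics the structure described above
SAMPLE_SCRIPT = (
    'window.__REDUX_STATE__={"home":{"sections":['
    '{"cardType":"OtherCard","data":[]},'
    '{"cardType":"CarouselCard","data":[{"title":"Example story","path":"/example"}]}'
    ']}};'
)

cards = carousel_cards(parse_state_assignment(SAMPLE_SCRIPT))
print([card["title"] for card in cards])
```

Returning an empty list when no carousel section is found avoids the IndexError that indexing with [0] would raise if the page layout changes.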
Also, in the JSON, the carousel section starts like this:
{
    "type":"content",
    "dataAPI":"/api/v2/featured_stories?brand=yourstory&key=CURATED_SET",
    "dataAttribute":"featured",
    "cardType":"CarouselCard",
    "data":[]
}
So presumably you could also fetch the cards directly from the API: https://yourstory.com/api/v2/featured_stories?brand=yourstory&key=CURATED_SET
import requests

r = requests.get('https://yourstory.com/api/v2/featured_stories?brand=yourstory&key=CURATED_SET')

# get the title, link path & thumbnail of each story
print([
    (t["title"], t["path"], t["metadata"]["thumbnail"])
    for t in r.json()["stories"]
])
This approach is more straightforward.
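Rather than hardcoding the API URL, it could in principle be derived from the dataAPI field of the carousel section shown earlier. The sketch below assumes that field is a path relative to https://yourstory.com/; the helper name carousel_api_urls and the sample state are hypothetical, and no request is sent.

```python
from urllib.parse import urljoin

BASE_URL = "https://yourstory.com/"


def carousel_api_urls(state, base_url=BASE_URL):
    """Build absolute API URLs for every CarouselCard section that has a dataAPI path."""
    urls = []
    for section in state.get("home", {}).get("sections", []):
        if section.get("cardType") == "CarouselCard" and section.get("dataAPI"):
            urls.append(urljoin(base_url, section["dataAPI"]))
    return urls


# Sample state built from the section fragment quoted in the answer
sample_state = {
    "home": {
        "sections": [
            {
                "type": "content",
                "dataAPI": "/api/v2/featured_stories?brand=yourstory&key=CURATED_SET",
                "dataAttribute": "featured",
                "cardType": "CarouselCard",
                "data": [],
            }
        ]
    }
}

print(carousel_api_urls(sample_state))
```

Each resulting URL can then be passed to requests.get as in the script above.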