Not able to scrape data that is not visible on screen but is part of the slider/carousel

Problem description

I am not able to scrape data from a website when that data is part of a slider/carousel. When I run my script, it scrapes only the first item in the carousel and does not go through the remaining pages.

The website I am trying to scrape is:

www.yourstory.com

My Python script:

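import requests
from bs4 import BeautifulSoup

# Assumed: the original snippet does not show how `response` was obtained
response = requests.get('https://yourstory.com/').text
soup = BeautifulSoup(response, 'html.parser')

# Anchors matching either of the two class names
divTag = soup.find_all("a", class_=['sc-VigVT', 'eJWBx'])

for tag in divTag:
    tdTags = tag.find_all("h3", class_=['sc-jAaTju', 'iNsSAY'])

    # Print the headline text of each matching <h3>
    for heading in tdTags:
        print(heading.text)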

Output:

Kunal Bahl and Rohit Bansal reveal the inside story of the Snapdeal turnaround

There are 7 carousel items, but I can get only the first one. I cannot get the data from the 2nd to the 7th pages of the carousel/slider.

Please see the image below for what I am referring to (red circle):

[image]

Tags: python, web-scraping, beautifulsoup

Solution


The carousel is generated by JavaScript from JSON data hardcoded in the page's JS. More precisely, this JSON is assigned with:

window.__REDUX_STATE__= { ..... }

FYI, this suggests the site uses Redux to manage the state of the app.

We can extract this JSON with the following script:

import requests
from bs4 import BeautifulSoup
import json
import pprint

r = requests.get('https://yourstory.com/')

prefix = "window.__REDUX_STATE__="
soup = BeautifulSoup(r.content, "html.parser")

# Get the Redux state (JSON) from the script tag that defines it
data = [
    json.loads(t.text[len(prefix):])
    for t in soup.find_all('script')
    if "__REDUX_STATE__" in t.text
]

# Keep only the section with cardType == "CarouselCard"
carouselCards = [
    t["data"]
    for t in data[0]["home"]["sections"]
    if ("cardType" in t) and (t["cardType"] == "CarouselCard")
][0]

# Print all cards
pprint.pprint(carouselCards)

# Get the title, link path and thumbnail image of each card
print([
    (t["title"], t["path"], t["metadata"]["thumbnail"])
    for t in carouselCards
])

The JSON has a sections array inside the home field. One of the section objects in that array has a cardType of CarouselCard, and its data field holds the cards you are looking for.

Also, in the JSON, the carousel section starts like this:
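The slice `t.text[len(prefix):]` in the script above only works when the script body starts exactly with the prefix and nothing follows the JSON. Below is a slightly more defensive sketch, assuming the assignment may be preceded by whitespace or followed by a `;`. It relies on `json.JSONDecoder.raw_decode`, which parses one JSON value and ignores whatever comes after it. The helper names `parse_embedded_state` and `carousel_cards`, the `ListCard` section, and the sample text are made up for illustration; the real page may look different.

```python
import json

PREFIX = "window.__REDUX_STATE__="

def parse_embedded_state(script_text, prefix=PREFIX):
    """Return the JSON value assigned after `prefix`, or None if absent."""
    start = script_text.find(prefix)
    if start == -1:
        return None
    # Skip the prefix and any whitespace before the JSON value.
    payload = script_text[start + len(prefix):].lstrip()
    # raw_decode returns (value, end_index); trailing text such as ';'
    # is simply left unparsed.
    state, _end = json.JSONDecoder().raw_decode(payload)
    return state

def carousel_cards(state):
    """Return the "data" list of the first CarouselCard section, or []."""
    for section in state.get("home", {}).get("sections", []):
        if section.get("cardType") == "CarouselCard":
            return section.get("data", [])
    return []

# Hypothetical, heavily trimmed script body used only to exercise the helpers.
sample_script_text = """
window.__REDUX_STATE__= {"home": {"sections": [
  {"cardType": "ListCard", "data": []},
  {"cardType": "CarouselCard", "data": [
    {"title": "Story A", "path": "/a", "metadata": {"thumbnail": "a.jpg"}},
    {"title": "Story B", "path": "/b", "metadata": {"thumbnail": "b.jpg"}}
  ]}
]}};
"""

state = parse_embedded_state(sample_script_text)
titles = [card["title"] for card in carousel_cards(state)]
print(titles)
```

In the script above, `script_text` would be the text of the script tag that contains `__REDUX_STATE__`.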

{
    "type":"content",
    "dataAPI":"/api/v2/featured_stories?brand=yourstory&key=CURATED_SET",
    "dataAttribute":"featured",
    "cardType":"CarouselCard",
    "data":[]
}
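The `dataAPI` value is a path relative to the site root. As a small sketch, assuming the API is served from the same origin as the page, it can be resolved to a full URL with `urllib.parse.urljoin`. The helper name `section_api_url` is hypothetical, and the `section` dict simply copies the fields shown above.

```python
from urllib.parse import parse_qs, urljoin, urlsplit

# Assumption: the API lives on the same origin as the home page.
BASE_URL = "https://yourstory.com/"

# Fields copied from the carousel section shown above.
section = {
    "type": "content",
    "dataAPI": "/api/v2/featured_stories?brand=yourstory&key=CURATED_SET",
    "dataAttribute": "featured",
    "cardType": "CarouselCard",
}

def section_api_url(section, base_url=BASE_URL):
    """Resolve the section's relative dataAPI path against the site origin."""
    return urljoin(base_url, section["dataAPI"])

url = section_api_url(section)
query = parse_qs(urlsplit(url).query)
print(url)
print(query)
```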

So I suppose you could also get the cards directly from the API: https://yourstory.com/api/v2/featured_stories?brand=yourstory&key=CURATED_SET

import requests

r = requests.get('https://yourstory.com/api/v2/featured_stories?brand=yourstory&key=CURATED_SET')

# Get the title, link path and thumbnail image of each story
print([
    (t["title"], t["path"], t["metadata"]["thumbnail"])
    for t in r.json()["stories"]
])

This approach is more straightforward.
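The list comprehension above raises a `KeyError` as soon as one story lacks a field. Here is a minimal sketch of a more tolerant version, assuming the payload has the `stories` list used above and that `path` is relative to the site root. The function name `story_rows` and the sample payload are hypothetical and only stand in for what `r.json()` would return.

```python
from urllib.parse import urljoin

# Assumption: story paths are relative to the site root.
BASE_URL = "https://yourstory.com/"

def story_rows(payload, base_url=BASE_URL):
    """Return (title, absolute link, thumbnail) tuples, skipping incomplete stories."""
    rows = []
    for story in payload.get("stories", []):
        title = story.get("title")
        path = story.get("path")
        if not title or not path:
            continue  # Nothing useful to print without a title and a link.
        # "metadata" may be missing or null; fall back to an empty dict.
        thumbnail = (story.get("metadata") or {}).get("thumbnail")
        rows.append((title, urljoin(base_url, path), thumbnail))
    return rows

# Hypothetical payload, shaped like the fields the snippet above reads.
sample_payload = {
    "stories": [
        {"title": "Story A", "path": "/story-a",
         "metadata": {"thumbnail": "https://example.com/a.jpg"}},
        {"title": "Story B", "path": "/story-b", "metadata": None},
        {"title": "No link"},
    ]
}

rows = story_rows(sample_payload)
for row in rows:
    print(row)
```

With a live response, `story_rows(r.json())` would be the equivalent call. Passing `timeout=10` to `requests.get` and calling `r.raise_for_status()` before parsing would also keep a stalled or failed request from going unnoticed.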

