Unable to extract data from a JavaScript-rendered page using BeautifulSoup?

Problem description

Hi all, I am trying to extract data from https://newslab.malaysiakini.com/covid-19/en

import requests
from bs4 import BeautifulSoup

page = requests.get("https://newslab.malaysiakini.com/covid-19/en")

soup = BeautifulSoup(page.content, 'html.parser')

# These names are CSS classes, not an id, so match them with a CSS selector
option_tags = soup.select_one(".uk-grid.uk-grid-small.uk-width-auto.uk-flex.uk-flex-middle.uk-flex-center")

patient_items = option_tags.find_all(class_="patient")

first = patient_items[0]
print(first.prettify())

I am unable to extract the results. It seems that html.parser does not get the data that I can see in the Google Chrome console. Can anyone help with this?

Tags: python, web-scraping, beautifulsoup, html-parsing

Solution


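The site makes additional requests after the initial request to https://newslab.malaysiakini.com/covid-19/en. Because requests only downloads that initial HTML and does not run JavaScript, BeautifulSoup never sees the data loaded afterwards. Those additional URLs probably contain what you are looking for.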
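One way to confirm this is to count how often a class appears in the raw HTML that the server returns. The following is a minimal sketch using only the Python standard library; the names ClassCounter and count_class are hypothetical helpers introduced here, not part of any library.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Count start tags whose class attribute contains a given class name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.count += 1


def count_class(html, class_name):
    """Return how many elements in the given HTML text carry class_name."""
    counter = ClassCounter(class_name)
    counter.feed(html)
    return counter.count
```

For example, count_class(requests.get(url).text, "patient") can be compared with what the browser shows. If it returns 0 while the browser's element inspector lists such elements, they are being added by JavaScript after the page loads.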

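The link below probably contains all the information you are looking for, except the GPS coordinates. The locations are harder to get: they appear to be compiled into some JavaScript and data tags.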

https://m5.malaysiakini.com/en/tag/covid-19?alt=json contains all the stories shown on the Google map/list, in JSON format. For example:

{
    "title": "Tabligh particiapants: Foreigners the cause of Covid-19 spread, not fair to blame locals",
    "sid": 514832,
    "image_feat": ["https://i.newscdn.net/publisher-c1a3f893382d2b2f8a9aa22a654d9c97/2020/03/9b6ba685820341c1cfc4f7d7faff7ba0.jpg"],
    "image_feat_single": "https://i.newscdn.net/publisher-c1a3f893382d2b2f8a9aa22a654d9c97/2020/03/9b6ba685820341c1cfc4f7d7faff7ba0.jpg",
    "summary": "<p>Most of us went to the hospital for testing as soon we were given the directive, says a participant.</p>",
    "author": "",
    "author_array": [],
    "author_display": "no",
    "date_pub": 1584321043,
    "date_pub2": "1584321043000",
    "date_pubh": "2020-03-16 09:10:43+08:00",
    "category": "news",
    "comment_count": 0,
    "tags": ["health", "coronavirus", "covid-19", "tabligh gathering", "infection"],
    "free": false,
    "redirect": "",
    "date_modh": "2020-03-16 09:10:43+08:00"
}
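A rough sketch of reading that endpoint directly with requests instead of parsing HTML. The example above shows only a single story object, so the shape of the surrounding JSON is an assumption: rather than guess a key name, the hypothetical helper iter_stories walks the whole decoded document and yields every object that has both "title" and "sid". The helper names and the 10-second timeout are illustrative choices, and the endpoint may have changed since this answer was written.

```python
from datetime import datetime, timedelta, timezone

# UTC+08:00, the offset that appears in the "date_pubh" field
MYT = timezone(timedelta(hours=8))


def iter_stories(node):
    """Yield every dict in decoded JSON that has both "title" and "sid"."""
    if isinstance(node, dict):
        if "title" in node and "sid" in node:
            yield node
        for value in node.values():
            yield from iter_stories(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_stories(item)


def published_at(story):
    """Convert the Unix timestamp in "date_pub" to an aware datetime."""
    return datetime.fromtimestamp(story["date_pub"], tz=MYT)


def fetch_stories(url="https://m5.malaysiakini.com/en/tag/covid-19?alt=json"):
    """Download the JSON feed and return the story objects found in it."""
    import requests  # imported here so the helpers above work without it

    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return list(iter_stories(response.json()))
```

Calling fetch_stories() and then printing story["title"] and published_at(story) for each result would list the stories with their publication times, with no HTML parsing involved.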
