How to scrape a number from this web page (in Python)

Problem description

I would appreciate it if someone could show me how to extract the number "28,050".

I used to get that number with this code (Python 3):

import requests
import bs4

res_bonbast = requests.get('https://bonbast.com/')
soup_bonbast = bs4.BeautifulSoup(res_bonbast.text, "lxml")
# Select the element with id "usd1_top" and convert its text to an integer
int(float(soup_bonbast.select('#usd1_top')[0].getText()))

But it seems they have changed something recently.

Tags: python, web-scraping

Solution


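Your problem is that this value is not populated until after the page loads. The HTML for this element really is blank, just as your script shows you. When you load the site in a browser, you first receive HTML in which that element is empty; you can confirm this by opening the dev tools and watching the Network tab. Later, a call to https://bonbast.com/json returns the values used to populate the element.

What you need to do is make the request to bonbast.com/json yourself and pull the value you want out of the JSON, rather than parsing HTML. The key you are looking for is usd1.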
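As a minimal sketch of the extraction step only, assuming you already have the response body as text and that it is a flat JSON object whose prices are numeric strings, the lookup could look like this. The function name `extract_rate` and the shortened sample string are illustrative, not part of the site or of any library.

```python
import json


def extract_rate(json_text: str, key: str = "usd1") -> int:
    """Parse a JSON body and return the value stored under `key` as an int."""
    payload = json.loads(json_text)
    # Prices are assumed to arrive as strings such as "28420",
    # so convert explicitly, the same way the original script did.
    return int(float(payload[key]))


# Shortened, hand-written sample shaped like the endpoint's response.
sample = '{"usd1": "28420", "usd2": "28320"}'
print(extract_rate(sample))  # 28420
```

This only covers reading the value; obtaining the response from the endpoint is a separate problem.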

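The bonbast.com/json endpoint may require additional details in the headers. I captured the curl request below by opening my dev tools Network tab (in Chrome, Ctrl+Shift+I >> Network), visiting bonbast.com, and finding the request to bonbast.com/json. I then right-clicked it and selected "Copy as cURL".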

curl 'https://bonbast.com/json' \
   -H 'authority: bonbast.com' \
   -H 'sec-ch-ua: "Chromium";v="95", ";Not A Brand";v="99"' \
   -H 'accept: application/json, text/javascript, */*; q=0.01' \
   -H 'content-type: application/x-www-form-urlencoded; charset=UTF-8' \
   -H 'x-requested-with: XMLHttpRequest' \
   -H 'sec-ch-ua-mobile: ?0' \
   -H 'user-agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36' \
   -H 'sec-ch-ua-platform: "Linux"' \
   -H 'origin: https://bonbast.com' \
   -H 'sec-fetch-site: same-origin' \
   -H 'sec-fetch-mode: cors' \
   -H 'sec-fetch-dest: empty' \
   -H 'referer: https://bonbast.com/' \
   -H 'accept-language: en-US,en;q=0.9' \
   -H 'cookie: st_bb=0; _gid=GA1.2.587414378.1636538685; __gads=ID=2f6e05bb70db575d-2208cfa441cc00d3:T=1636538685:RT=1636538685:S=ALNI_MaKL18-XZaWbbhlmh2h3RGvYmVKRw; _ga_PZF6SDPF22=GS1.1.1636562265.2.0.1636562265.0; _ga=GA1.2.633937873.1636538685; _gat_gtag_UA_35412804_1=1' \
   --data-raw 'data=0d7e26d17fde20e86b760b00127132d4%2CfTtTZ%2C2021-11-10-16-38-37&webdriver=false' \
   --compressed

The result is:

{ "try1": "2890",
  "month": 8,
  "emami1": "12450000",
  "afn2": "309",
  "afn1": "311",
  "rub2": "397",
  "azadi1_22": "6250000",
  "bhd2": "74870",
  "azn1": "16730",
  "bhd1": "75370",
  "azadi1g": "2350000",
  "bourse": "1904324.2",
  "try2": "2870",
  "cny1": "4450",
  "cny2": "4430",
  "cad1": "22860",
  "cad2": "22760",
  "jpy1": "2495",
  "thb1": "865",
  "usd1": "28420",
  "usd2": "28320",
  "thb2": "860",
  "azn2": "16630",
  "dkk1": "4400",
  "amd2": "590",
  "day": 19,
  "minute": "41",
  "amd1": "595",
  "bitcoin": "68616.85",
  "hour": "20",
  "sar2": "7545",
  "rub1": "400",
  "azadi1g2": "2250000",
  "azadi12": "12000000",
  "eur1": "32725",
  "eur2": "32575",
  "emami12": "12250000",
  "second": "45",
  "omr1": "73825",
  "year": 1400,
  "chf2": "30855",
  "chf1": "31005",
  "azadi1_42": "3700000",
  "jpy2": "2485",
  "kwd2": "93795",
  "kwd1": "94195",
  "sek1": "3280",
  "gbp2": "38090",
  "gbp1": "38290",
  "sek2": "3265",
  "myr1": "6850",
  "myr2": "6820",
  "omr2": "73525",
  "azadi1": "12350000",
  "azadi1_2": "6400000",
  "aud2": "20805",
  "azadi1_4": "3800000",
  "aud1": "20905",
  "dkk2": "4380",
  "inr2": "380",
  "inr1": "382",
  "last_modified": "November 10, 2021 16:00",
  "aed2": "7715",
  "aed1": "7735",
  "iqd2": "1935",
  "qar1": "7805",
  "qar2": "7775",
  "iqd1": "1945",
  "hkd2": "3620",
  "hkd1": "3650",
  "sar1": "7575",
  "created": "November 10, 2021 00:01",
  "sgd2": "20930",
  "sgd1": "21030",
  "ounce": "1854.31",
  "weekday": "Wednesday",
  "mithqal": "5416000",
  "gol18": "1250288",
  "nok1": "3305",
  "nok2": "3290"
}
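In the response above, most values are integer strings, a few ("bitcoin", "ounce", "bourse") contain decimals, and some ("weekday", "last_modified") are plain text, so a blanket int() over every value would fail. A small sketch of a tolerant conversion follows; the helper name `to_number` is hypothetical, and the sample dictionary is a hand-copied subset of the response.

```python
def to_number(value):
    """Return int or float when the value looks numeric, else return it unchanged."""
    if isinstance(value, (int, float)):
        return value  # already numeric, e.g. "year": 1400
    try:
        return int(value)
    except (TypeError, ValueError):
        pass
    try:
        return float(value)
    except (TypeError, ValueError):
        return value  # non-numeric text such as "Wednesday"


# A few entries copied from the response shown above.
raw = {"usd1": "28420", "bitcoin": "68616.85", "weekday": "Wednesday", "year": 1400}
converted = {key: to_number(value) for key, value in raw.items()}
print(converted["usd1"], converted["bitcoin"], converted["weekday"])
```

Trying int before float keeps whole-number prices as integers instead of turning them into values like 28420.0.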

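There is bad news for you, though. The parameters in the curl request appear to expire after a while. I believe what is happening is that you receive a cookie when you visit the site. That cookie is your permission to make requests to the json endpoint, but it expires after a short time.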
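If the cookie theory is right, one way to approach this is to reuse a single session: load the home page first so the session picks up a fresh cookie, then POST to the JSON endpoint with that same session. The sketch below is untested against the live site and rests on several assumptions: `fetch_rates` is a hypothetical helper, `session` is expected to behave like a `requests.Session`, only two of the captured headers are sent, and the form fields (the `data` and `webdriver` values seen in the curl command) must be supplied by the caller, because the answer does not say how the `data` value is generated.

```python
def fetch_rates(session, form_data, base_url="https://bonbast.com"):
    """Visit the home page, then POST to /json using the same session."""
    # Step 1: load the home page so the session can store any cookie the site sets.
    session.get(base_url + "/", timeout=10)

    # Step 2: call the JSON endpoint; the session sends the stored cookies along.
    response = session.post(
        base_url + "/json",
        data=form_data,
        headers={
            "x-requested-with": "XMLHttpRequest",
            "referer": base_url + "/",
        },
        timeout=10,
    )
    response.raise_for_status()
    return response.json()


# Usage (performs real network requests, so it is left commented out):
# import requests
# with requests.Session() as session:
#     form = {"data": "<value captured from your own browser request>", "webdriver": "false"}
#     rates = fetch_rates(session, form)
#     print(rates["usd1"])
```

Even with a fresh cookie, the request may still be rejected if the `data` value has expired, so treat this as a starting point rather than a working scraper.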

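Scraping this page reliably would take a fair amount of work, more than fits in a StackOverflow question and answer. If you would like to discuss how to do it, feel free to email me (the address is in my profile).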

