首页 > 解决方案 > 从脚本中的var提取数据并使用python将pdf下载到文件夹

问题描述

我正在尝试从网站上的嵌入式地图中抓取变量中的信息,其中包括地理坐标以及我想下载到特定文件夹中的 pdf 文档的链接。

到目前为止,我尝试获取变量中的信息的尝试没有成功。有人能告诉我我错过了什么来获取我需要的数据吗?

这是我到目前为止整理的代码:

from bs4 import BeautifulSoup
import re
import requests

url = 'https://goldenagri.com.sg/sustainability-dashboard/supply-chain-map' 
page = requests.get(url).content
soup = BeautifulSoup(page, "html.parser")

p = re.compile(r"var locations = .")
data  = soup.find_all("script")
m = p.match(data)

我似乎无法在脚本中找到匹配的变量。它似乎正在捕获更多信息,而不仅仅是那个变量,这应该是这个

除了获取按设施下载 pdf 的链接外,我还想创建data frame设施、公司名称、位置和地理坐标,以便将其导出到电子表格中。

标签: javascriptpythonregexweb-scrapingbeautifulsoup

解决方案


代码:

from bs4 import BeautifulSoup
import re
import requests

url = 'https://goldenagri.com.sg/sustainability-dashboard/supply-chain-map'
page = requests.get(url).content
soup = BeautifulSoup(page, "html.parser")
data = soup.find('script', text=re.compile(r"var locations = ."))
print(re.search('var locations.*', str(data)).group())

输出:

var locations = [[new google.maps.LatLng(3.778669444,98.68998056), 'Belawan','<table><tr><td><b>Facility Name </b></td><td>&nbsp; : Belawan Refinery and Kernel Crushing Plant</td></tr><tr><td><b>Company Name</b></td><td>&nbsp; : PT SMART TBK</td></tr><tr><td><b>Location </b></td><td>&nbsp; : Belawan</td></tr><tr><td valign="top"><b>Summary Report </b></td><td valign="top">&nbsp; : <ul class="gpoplist"><li> - <a href="https://goldenagri.com.sg/sustainability-dashboard/download-file/getfile/14"  target="_blank">Belawan Summary Q1 2017</a></li><li> - <a href="https://goldenagri.com.sg/sustainability-dashboard/download-file/getfile/84"  target="_blank">Belawan Summary Q2 2017</a></li><li> - <a href="https://goldenagri.com.sg/sustainability-dashboard/download-file/getfile/154"  target="_blank">Belawan Summary Q3 2017</a></li><li> - <a href="https://goldenagri.com.sg/sustainability-dashboard/download-file/getfile/226"  target="_blank">Belawan Summary Q4 2017</a></li></ul></td></tr><tr><td valign="top"><b>Supplying Mills </b></td><td valign="top">&nbsp; : <ul class="gpoplist"><li> - <a href="https://goldenagri.com.sg/sustainability-dashboard/download-file/getfile/15"  target="_blank">Belawan Mill List Q1 2017</a></li><li> - <a href="https://goldenagri.com.sg/sustainability-dashboard/download-file/getfile/86"  target="_blank">Belawan Mill List Q2 2017</a></li><li> - <a href="https://goldenagri.com.sg/sustainability-dashboard/download-file/getfile/153"  target="_blank">Belawan Mill List Q3 2017</a></li><li> - <a href="https://goldenagri.com.sg/sustainability-dashboard/download-file/getfile/225"  target="_blank">Belawan Mill List Q4 2017</a></li></ul></td></tr></table>'],[new google.maps.LatLng(1.684938889,101.4462444), 'Dumai','<table><tr><td><b>Facility Name </b></td><td>&nbsp; : Dumai Bulking Station</td></tr><tr><td><b>Company Name</b></td><td>&nbsp; : PT Ivo Mas Tunggal</td></tr><tr><td><b>Location </b></td><td>&nbsp; : Dumai</td></tr><tr><td valign="top"><b>Summary Report </b></td><td valign="top">&nbsp; : <ul class="gpoplist"><li> - <a href="https://goldenagri.com.sg/sustainability-dashboard/download-file/getfile/16"  target="_blank">Dumai Summary Q1 2017</a></li><li> - <a href="https://goldenagri.com.sg/sustainability-dashboard/download-file/getfile/87"  target="_blank">Dumai Summary Q2 2017</a></li><li> - <a href="https://goldenagri.com.sg/sustainability-dashboard/download-file/getfile/156"  target="_blank">Dumai Summary Q3 2017</a></li><li> - <a href="https://goldenagri.com.sg/sustainability-dashboard/download-file/getfile/228"  target="_blank">Dumai Summary Q4 2017</a></li></ul></td></tr><tr><td valign="top"><b>Supplying Mills </b></td><td valign="top">&nbsp; : <ul class="gpoplist"><li> - <a href="https://goldenagri.com.sg/sustainability-dashboard/download-file/getfile/17"  target="_blank">Dumai Mill List Q1 2017</a></li><li> - <a href="https://goldenagri.com.sg/sustainability-dashboard/download-file/getfile/100"  target="_blank">Dumai Mill List Q2 2017</a></li><li> - <a href="https://goldenagri.com.sg/sustainability-dashboard/download-file/getfile/155"  target="_blank">Dumai Mill List Q3 2017</a></li><li> - <a href="https://goldenagri.com.sg/sustainability-dashboard/download-file/getfile/227"  target="_blank">Dumai Mill List Q4 2017</a></li></ul></td></tr></table>'],[new google.maps.LatLng(0.994694444,100.3734167), 'Padang','<table><tr><td><b>Facility Name </b></td><td>&nbsp; : Incasi Raya Padang Bulking Station</td></tr><tr><td><b>Company Name</b></td><td>&nbsp; : PT Leidong West Indonesia</td></tr><tr><td><b>Location </b></td><td>&nbsp; : Padang</td></tr><tr><td valign="top"><b>Summary Report </b></td><td valign="top">&nbsp; : <ul class="gpoplist"><li> - <a href="https://goldenagri.com.sg/sustainability-dashboard/download-file/getfile/18"  target="_blank">Padang Summary Q1 2017</a></li><li> - <a href="https://goldenagri.com.sg/sustainability-dashboard/download-file/getfile/92"  target="_blank">Padang Summary Q2 2017</a></li><li> - <a href="https://goldenagri.com.sg/sustainability-dashboard/download-file/getfile/162"  target="_blank">Padang Summary Q3 2017</a></li><li> - <a href="https://goldenagri.com.sg/sustainability-dashboard/download-file/getfile/234"  target="_blank">Padang Summary Q4 2017</a></li></ul></td></tr><tr><td valign="top"><b>Supplying Mills </b></td><td valign="top">&nbsp; : <ul class="gpoplist"><li> - <a href="https://goldenagri.com.sg/sustainability-dashboard/download-file/getfile/19"  target="_blank">Padang Mill List  Q1 2017</a></li><li> - <a href="https://goldenagri.com.sg/sustainability-dashboard/download-file/getfile/93"  target="_blank">Padang Mill List Q2 2017</a></li><li> - <a href="https://goldenagri.com.sg/sustainability-dashboard/download-file/getfile/161"  target="_blank">Padang Mill List Q3 2017</a></li><li> - <a href="https://goldenagri.com.sg/sustainability-dashboard/download-file/getfile/233"  target="_blank">Padang Mill List Q4 2017</a></li></ul></td></tr></table>'],[new google.maps.LatLng(1.777694444,101.3532778), 'Lubuk Gaung','<table><tr><td><b>Facility Name </b></td><td>&nbsp; : Lubuk Gaung Refinery and Kernel Crushing Plant</td></tr><tr><td><b>Company Name</b></td><td>&nbsp; : PT Ivo Mas Tunggal</td></tr><tr><td><b>Location </b></td><td>&nbsp; : Lubuk Gaung</td></tr><tr><td valign="top"><b>Summary Report </b></td><td valign="top">&nbsp; : <ul class="gpoplist"><li> - <a href="https://goldenagri.com.sg/sustainability-dashboard/download-file/getfile/78"  target="_blank">Lubuk Gaung Summary Q1 2017</a></li><li> - <a href="https://goldenagri.com.sg/sustainability-dashboard/download-file/getfile/88"  target="_blank">Lubuk Gaung Summary Q2 2017</a></li><li> - <a href="https://goldenagri.com.sg/sustainability-dashboard/download-file/getfile/158"  target="_blank">Lubuk Gaung Summary Q3 2017</a></li><li> - <a href="https://goldenagri.com.sg/sustainability-dashboard/download-file/getfile/230"  target="_blank">Lubuk Gaung Summary Q4 2017</a></li></ul></td></tr><tr><td valign="top"><b>Supplying Mills </b></td><td valign="top">&nbsp; : <ul class="gpoplist"><li> - <a href="https://goldenagri.com.sg/sustainability-dashboard/download-file/getfile/77"  target="_blank">Lubuk Gaung Mills List Q1 2017</a></li><li> - <a href="https://goldenagri.com.sg/sustainability-dashboard/download-file/getfile/89"  target="_blank">Lubuk Gaung Mills List Q2 2017</a></li><li> - <a href="https://goldenagri.com.sg/sustainability-dashboard/download-file/getfile/157"  target="_blank">Lubuk Gaung Mills List Q3 2017</a></li><li> - <a href="https://goldenagri.com.sg/sustainability-dashboard/download-file/getfile/229"  target="_blank">Lubuk Gaung Mills List Q4 2017</a></li></ul></td></tr></table>'],[new google.maps.LatLng(-6.091638889,106.9761111), 'Marunda','<table><tr><td><b>Facility Name </b></td><td>&nbsp; : Marunda Refinery</td></tr><tr><td><b>Company Name</b></td><td>&nbsp; : PT SMART TBK</td></tr><tr><td><b>Location </b></td><td>&nbsp; : Marunda</td></tr><tr><td valign="top"><b>Summary Report </b></td><td valign="top">&nbsp; : <ul class="gpoplist"><li> - <a href="https://goldenagri.com.sg/sustainability-dashboard/download-file/getfile/79"  target="_blank">Marunda Summary Q1 2017</a></li><li> - <a href="https://goldenagri.com.sg/sustainability-dashboard/download-file/getfile/90"  target="_blank">Marunda Summary Q2 2017</a></li><li> - <a href="https://goldenagri.com.sg/sustainability-dashboard/download-file/getfile/160"  target="_blank">Marunda Summary Q3 2017</a></li><li> - <a href="https://goldenagri.com.sg/sustainability-dashboard/download-file/getfile/232"  target="_blank">Marunda Summary Q4 2017</a></li></ul></td></tr><tr><td valign="top"><b>Supplying Mills </b></td><td valign="top">&nbsp; : <ul class="gpoplist"><li> - <a href="https://goldenagri.com.sg/sustainability-dashboard/download-file/getfile/80"  target="_blank">Marunda Mill List Q1 2017</a></li><li> - <a href="https://goldenagri.com.sg/sustainability-dashboard/download-file/getfile/91"  target="_blank">Marunda Mill List Q2 2017</a></li><li> - <a href="https://goldenagri.com.sg/sustainability-dashboard/download-file/getfile/159"  target="_blank">Marunda Mill List Q3 2017</a></li><li> - <a href="https://goldenagri.com.sg/sustainability-dashboard/download-file/getfile/231"  target="_blank">Marunda Mill List Q4 2017</a></li></ul></td></tr></table>'],[new google.maps.LatLng(-7.329972222,112.7615556), 'Surabaya','<table><tr><td><b>Facility Name </b></td><td>&nbsp; : Surabaya Refinery</td></tr><tr><td><b>Company Name</b></td><td>&nbsp; : PT SMART TBK</td></tr><tr><td><b>Location </b></td><td>&nbsp; : Surabaya</td></tr><tr><td valign="top"><b>Summary Report </b></td><td valign="top">&nbsp; : <ul class="gpoplist"><li> - <a href="https://goldenagri.com.sg/sustainability-dashboard/download-file/getfile/24"  target="_blank">Surabaya Summary Q1 2017</a></li><li> - <a href="https://goldenagri.com.sg/sustainability-dashboard/download-file/getfile/94"  target="_blank">Surabaya Summary Q2 2017</a></li><li> - <a href="https://goldenagri.com.sg/sustainability-dashboard/download-file/getfile/164"  target="_blank">Surabaya Summary Q3 2017</a></li><li> - <a href="https://goldenagri.com.sg/sustainability-dashboard/download-file/getfile/236"  target="_blank">Surabaya Summary Q4 2017</a></li></ul></td></tr><tr><td valign="top"><b>Supplying Mills </b></td><td valign="top">&nbsp; : <ul class="gpoplist"><li> - <a href="https://goldenagri.com.sg/sustainability-dashboard/download-file/getfile/25"  target="_blank">Surabaya Mill List Q1 2017</a></li><li> - <a href="https://goldenagri.com.sg/sustainability-dashboard/download-file/getfile/95"  target="_blank">Surabaya Mill List Q2 2017</a></li><li> - <a href="https://goldenagri.com.sg/sustainability-dashboard/download-file/getfile/163"  target="_blank">Surabaya Mill List Q3 2017</a></li><li> - <a href="https://goldenagri.com.sg/sustainability-dashboard/download-file/getfile/235"  target="_blank">Surabaya Mill List Q4 2017</a></li></ul></td></tr></table>'],[new google.maps.LatLng(-5.525611111,105.3526111), 'Tarahan','<table><tr><td><b>Facility Name </b></td><td>&nbsp; : Tarahan Refinery and Kernel Crushing Plant</td></tr><tr><td><b>Company Name</b></td><td>&nbsp; : PT Sumber Indah Perkasa</td></tr><tr><td><b>Location </b></td><td>&nbsp; : Tarahan</td></tr><tr><td valign="top"><b>Summary Report </b></td><td valign="top">&nbsp; : <ul class="gpoplist"><li> - <a href="https://goldenagri.com.sg/sustainability-dashboard/download-file/getfile/26"  target="_blank">Tarahan Summary Q1 2017</a></li><li> - <a href="https://goldenagri.com.sg/sustainability-dashboard/download-file/getfile/96"  target="_blank">Tarahan Summary Q2 2017</a></li><li> - <a href="https://goldenagri.com.sg/sustainability-dashboard/download-file/getfile/166"  target="_blank">Tarahan Summary Q3 2017</a></li><li> - <a href="https://goldenagri.com.sg/sustainability-dashboard/download-file/getfile/238"  target="_blank">Tarahan Q4 2017 Summary</a></li></ul></td></tr><tr><td valign="top"><b>Supplying Mills </b></td><td valign="top">&nbsp; : <ul class="gpoplist"><li> - <a href="https://goldenagri.com.sg/sustainability-dashboard/download-file/getfile/27"  target="_blank">Tarahan Mill List Q1 2017</a></li><li> - <a href="https://goldenagri.com.sg/sustainability-dashboard/download-file/getfile/97"  target="_blank">Tarahan Mill List Q2 2017</a></li><li> - <a href="https://goldenagri.com.sg/sustainability-dashboard/download-file/getfile/165"  target="_blank">Tarahan Mill List Q3 2017</a></li><li> - <a href="https://goldenagri.com.sg/sustainability-dashboard/download-file/getfile/237"  target="_blank">Tarahan Mill List Q4 2017</a></li></ul></td></tr></table>'],[new google.maps.LatLng(-3.271758333,116.1177361), 'Tarjun','<table><tr><td><b>Facility Name </b></td><td>&nbsp; : Tarjun Refinery and Kernel Crushing Plant</td></tr><tr><td><b>Company Name</b></td><td>&nbsp; : PT SMART TBK</td></tr><tr><td><b>Location </b></td><td>&nbsp; : Tarjun</td></tr><tr><td valign="top"><b>Summary Report </b></td><td valign="top">&nbsp; : <ul class="gpoplist"><li> - <a href="https://goldenagri.com.sg/sustainability-dashboard/download-file/getfile/28"  target="_blank">Tarjun Summary Q1 2017</a></li><li> - <a href="https://goldenagri.com.sg/sustainability-dashboard/download-file/getfile/98"  target="_blank">Tarjun Summary Q2 2017</a></li><li> - <a href="https://goldenagri.com.sg/sustainability-dashboard/download-file/getfile/168"  target="_blank">Tarjun Summary Q3 2017</a></li><li> - <a href="https://goldenagri.com.sg/sustainability-dashboard/download-file/getfile/240"  target="_blank">Tarjun Summary Q4 2017</a></li></ul></td></tr><tr><td valign="top"><b>Supplying Mills </b></td><td valign="top">&nbsp; : <ul class="gpoplist"><li> - <a href="https://goldenagri.com.sg/sustainability-dashboard/download-file/getfile/29"  target="_blank">Tarjun Mill List Q1 2017</a></li><li> - <a href="https://goldenagri.com.sg/sustainability-dashboard/download-file/getfile/99"  target="_blank">Tarjun Mill List Q2 2017</a></li><li> - <a href="https://goldenagri.com.sg/sustainability-dashboard/download-file/getfile/167"  target="_blank">Tarjun Mill List Q3 2017</a></li><li> - <a href="https://goldenagri.com.sg/sustainability-dashboard/download-file/getfile/239"  target="_blank">Tarjun Mill List Q4 2017</a></li></ul></td></tr></table>']];

推荐阅读