I am trying to get table data from a website using Python urllib and Beautiful Soup, but it returns a script

Problem description

I tried BeautifulSoup, but what it scraped from the URL was a script rather than the table.

from urllib import request
from bs4 import BeautifulSoup

url = 'https://ekartlogistics.com/shipmenttrack/FMPP0944216480'
read = request.urlopen(url)
soup = BeautifulSoup(read, 'html.parser')
print(soup.prettify())

It returns that script along with the rest of the HTML.


I am trying to get the data from the table on this URL.


Tags: python-3.x, beautifulsoup, python-requests, urllib

Solution


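The data at this URL is loaded dynamically by JavaScript, so BeautifulSoup alone cannot retrieve it: urllib downloads only the initial HTML and never runs the page's scripts. You can use a browser automation tool such as Selenium instead. Here I use Selenium to load the page in a real browser so that the JavaScript runs, then use pandas to extract the table data, as follows: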

Code:

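import time
from io import StringIO

import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By

# Selenium 4.6+ locates a matching chromedriver by itself; on older
# versions, pass the driver path explicitly instead.
driver = webdriver.Chrome()
driver.maximize_window()

driver.get("https://ekartlogistics.com/shipmenttrack/FMPP0944216480")
time.sleep(3)  # crude wait for the JavaScript to render the table
table = driver.find_element(By.CSS_SELECTOR, 'table.table').get_attribute('outerHTML')
driver.quit()  # close the browser once the HTML has been captured

# pandas 2.1+ deprecates passing literal HTML, so wrap the string in StringIO
df = pd.read_html(StringIO(table))[0]
print(df)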

Output:

                  Date         Time       Place                           Status
0     Sunday 17 October  04:24:26 PM     Kolkata                 Shipment Created
1     Sunday 17 October  04:24:31 PM     Kolkata     Dispatched to CentralHub_BAG
2     Sunday 17 October  04:56:00 PM     Kolkata       Received at CentralHub_BAG
3     Sunday 17 October  04:56:03 PM     Kolkata       Received at CentralHub_BAG
4     Monday 18 October  03:10:35 AM       Patna     Dispatched to CentralHub_BHT
5    Tuesday 19 October  04:48:44 AM       Patna       Received at CentralHub_BHT
6    Tuesday 19 October  05:03:44 PM  Samastipur  Dispatched to SatelliteHub_SAMA
7  Wednesday 20 October  02:47:44 AM  Samastipur    Received at SatelliteHub_SAMA
8   Thursday 21 October  09:21:52 AM  Samastipur                 Out For Delivery
9     Friday 22 October  07:38:36 AM  Samastipur                        Delivered
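
Before reaching for Selenium, a quick check of the raw HTML can confirm whether the table is missing until JavaScript runs. The following is only a minimal sketch using the standard library: the `TagCounter` class, the `count_tags` helper, and the sample `static_html` string are made up for illustration and are not taken from the actual page.

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count how often each start tag appears in an HTML document."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def handle_starttag(self, tag, attrs):
        self.counts[tag] = self.counts.get(tag, 0) + 1


def count_tags(html):
    """Return a dict mapping tag name to number of occurrences (hypothetical helper)."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


# Hypothetical stand-in for a raw response: scripts only, no <table> yet.
static_html = """
<html><body>
  <div id="app"></div>
  <script src="bundle.js"></script>
  <script>window.render();</script>
</body></html>
"""

counts = count_tags(static_html)
print("tables:", counts.get("table", 0), "scripts:", counts.get("script", 0))
```

To try this on a real page, pass the decoded body returned by `request.urlopen(url).read()` to `count_tags`. If the table count is zero while scripts are present, the table is most likely built in the browser, which is the situation where a tool such as Selenium helps.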
