Unable to scrape class information using find_all

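Problem description

I am trying to extract the table rows, selected by their class attribute, from the website https://www.programmableweb.com/category/all/apis. My code works for every page except https://www.programmableweb.com/category/all/apis?page=2092.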

from bs4 import BeautifulSoup
import requests

url = 'https://www.programmableweb.com/category/all/apis?page=2092'
response = requests.get(url)
data = response.text
soup = BeautifulSoup(data, 'html.parser')
apis = soup.find_all('tr',{'class':['odd views-row-first', 'odd','even','even views-row-last']})
print(apis)

On page 2092, I get only the following single row:

[<tr class="odd views-row-first views-row-last"><td class="views-field views-field-pw-version-title"> <a href="/api/inkling">Inkling API</a><br/></td><td class="views-field views-field-search-api-excerpt views-field-field-api-description hidden-xs visible-md visible-sm col-md-8"> Our REST API allows you to replicate much of the functionality in our hosted marketplace solution to build custom widgets and stock tickers for your Intranet, create custom reports, add trading...</td><td class="views-field views-field-field-article-primary-category"> <a href="/category/financial">Financial</a></td><td class="views-field views-field-pw-version-links"> <a href="/api/inkling-rest-api">REST v0.0</a></td></tr>]

For any other page (for example https://www.programmableweb.com/category/all/apis?page=2091), I get every row. The HTML structure appears to be the same on all pages.

Tags: html, web-scraping, beautifulsoup

Solution


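This website keeps adding new APIs to its database, so there are three scenarios that could explain what you see:

  1. The selector you are using is not accurate.
  2. The website applies some kind of protection because you are sending too many requests.
  3. The page really did contain only one item at the moment you scraped it.

Scenario 3 is the most likely: as the last page of a growing listing, it simply holds whatever items are left over.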
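Scenario 1 can be checked offline before blaming the site. The sketch below is only an illustration: the HTML string is a hypothetical, trimmed-down stand-in for a last page holding a single row (modelled on the output shown in the question), not the real page. It compares the class-based find_all call from the question with a structural selector. BeautifulSoup treats class as a multi-valued attribute, so the lone row, whose class is "odd views-row-first views-row-last", should still match through the plain 'odd' entry in the list.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down stand-in for a last page with a single row.
html = """
<table class="views-table cols-4 table">
  <tbody>
    <tr class="odd views-row-first views-row-last">
      <td><a href="/api/inkling">Inkling API</a></td>
    </tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Selector from the question: match rows by their class values.
by_class = soup.find_all(
    "tr",
    {"class": ["odd views-row-first", "odd", "even", "even views-row-last"]},
)

# Structural selector: every row in the table body, whatever its classes.
by_structure = soup.select('table[class="views-table cols-4 table"] tbody tr')

print(len(by_class), len(by_structure))
```

If both counts agree on the real page as well, the selector is not dropping rows and the page genuinely contains that many items.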

from bs4 import BeautifulSoup
import requests
from time import sleep

# range(1, 2094) covers pages 1 through 2093, the last page at the time of writing
for page in range(1, 2094):
    url = f'https://www.programmableweb.com/category/all/apis?page={page}'
    response = requests.get(url)
    data = response.text
    soup = BeautifulSoup(data, 'html.parser')
    # Select every row of the table body instead of matching on row classes
    apis = soup.select('table[class="views-table cols-4 table"] tbody tr')
    print(apis)  # page 2093 currently has 6 items on it
    sleep(5)  # wait 5 seconds between requests to avoid flooding the site
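Because the site keeps adding APIs, the hard-coded last page in the loop above will go stale. One possible alternative, sketched here under assumptions, is to keep requesting pages until one comes back with no rows. The names scrape_until_empty, fetch_rows and fake_fetch are hypothetical helpers introduced for illustration, and it is an assumption that the site returns an empty table (rather than an error or a redirect) for a page past the end; verify that before relying on it.

```python
from time import sleep
from typing import Callable, Iterator, List, Tuple


def scrape_until_empty(
    fetch_rows: Callable[[int], List[str]],
    start_page: int = 1,
    delay_seconds: float = 5.0,
    max_pages: int = 10_000,
) -> Iterator[Tuple[int, List[str]]]:
    """Yield (page, rows) for consecutive pages until one has no rows.

    fetch_rows is a hypothetical callable: given a page number, it returns
    the rows found on that page (for example the result of soup.select).
    max_pages is a safety cap so the loop cannot run forever.
    """
    for page in range(start_page, start_page + max_pages):
        rows = fetch_rows(page)
        if not rows:
            break  # assumed to mean we have walked past the last page
        yield page, rows
        sleep(delay_seconds)  # stay polite between requests


# Stand-in fetcher so the sketch runs without network access.
fake_site = {1: ["row a", "row b"], 2: ["row c"]}


def fake_fetch(page: int) -> List[str]:
    return fake_site.get(page, [])


results = list(scrape_until_empty(fake_fetch, delay_seconds=0))
print(results)
```

In real use, fetch_rows would wrap the requests.get and soup.select calls from the loop above; passing it in as a parameter keeps the stopping logic testable without touching the network.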
