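python - How to reach the deeper levels of a page when scraping with BeautifulSoup
Problem description
I am trying to use web scraping to build a small dataset from a website. I use BeautifulSoup to fetch the page and want to collect some data about the products listed on the site. The trouble is that the "soup" does not contain the body content itself, which keeps me from getting at the main data.
My code: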
import requests
from bs4 import BeautifulSoup

def get_pages(max_pages):
    page = 1
    while page <= max_pages:
        url = 'https://www.kickstarter.com/discover/advanced?category_id=16&woe_id=0&sort=magic&seed=2569226&page=' + str(page)
        source_code = requests.get(url)
        text_page = source_code.text
        soup = BeautifulSoup(text_page, 'html.parser')
        # Look for the project links by their CSS classes
        for link in soup.find_all('a', {'class': 'soft-black mb3'}):
            href = link.get('href')
            print(href)
        page += 1

get_pages(1)
My question is: how can I get to the deeper levels of the page?
Solution
This seems to work for me. I ran it on 5 pages without any problems.
import re

import requests
from bs4 import BeautifulSoup

def get_pages(max_pages):
    # Send a browser-like User-Agent with every request
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
    page = 1
    while page <= max_pages:
        url = 'https://www.kickstarter.com/discover/advanced?category_id=16&woe_id=0&sort=magic&seed=2569226&page=' + str(page)
        source_code = requests.get(url, headers=headers)
        soup = BeautifulSoup(source_code.text, 'lxml')
        # Select the project card divs on the listing page
        classes = soup.find_all('div', class_='js-react-proj-card col-full col-sm-12-24 col-lg-8-24')
        # Pull the "project" URL entries out of the cards' serialized markup
        urls = re.findall('"project":"https://www.kickstarter.com/.+",', str(classes))
        for project_url in urls:
            # Strip the key, quotes, and commas so only the bare URL is requested
            each_page = requests.get(project_url.replace(',', '').replace('"', '').replace('project:', ''), headers=headers)
            # Parse the project page itself
            project_soup = BeautifulSoup(each_page.text, 'lxml')
            # I don't know what your end goal is, but this was just printing the url of the page.
            print(each_page.url)
        page += 1
Output:
https://www.kickstarter.com/projects/albertgajsak/makerphone-an-educational-diy-mobile-phone
https://www.kickstarter.com/projects/meadow/meadow-full-stack-net-standard-iot-platform
https://www.kickstarter.com/projects/simonegiertz/the-every-day-calendar
https://www.kickstarter.com/projects/keyboardio/model-01-travel-case-quickstarter
https://www.kickstarter.com/projects/44621210/qdee-robot-kit-a-whole-new-world-of-play-to-micro
https://www.kickstarter.com/projects/whambamsystems/wham-bam-the-best-flexible-bed-for-3d-printers-ava
https://www.kickstarter.com/projects/ludenso/magimask-immersive-high-definition-augmented-reali
https://www.kickstarter.com/projects/805332783/tinyjuice-the-smallest-self-adhesive-true-wireless
https://www.kickstarter.com/projects/2099924322/nebula-capsule-ii-worlds-first-android-tvtm-pocket
https://www.kickstarter.com/projects/767329947/dockcase-adapter-turn-your-macbook-pro-charger-int
https://www.kickstarter.com/projects/petato/footloose-next-gen-automatic-and-health-tracking-c
https://www.kickstarter.com/projects/1289187249/fingertip-microscope-bring-a-800x-microscope-on-yo
https://www.kickstarter.com/projects/bentristem/the-web-app-revolution-making-the-best-coding-cour
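The regular expression in the answer is greedy, so it needs the follow-up replace() calls to clean each match. As a rough sketch of an alternative, the snippet below captures only the URL with a non-greedy-style pattern and uses nothing beyond the standard library. The sample markup, the data-project attribute name, and the helper name extract_project_urls are assumptions made for illustration; the real page structure may differ.

```python
import html
import re

# Hypothetical sample of two project cards; the real markup may differ.
SAMPLE_HTML = '''
<div class="js-react-proj-card" data-project="{&quot;id&quot;:1,&quot;urls&quot;:{&quot;web&quot;:{&quot;project&quot;:&quot;https://www.kickstarter.com/projects/example/first&quot;,&quot;rewards&quot;:&quot;https://www.kickstarter.com/projects/example/first/rewards&quot;}}}"></div>
<div class="js-react-proj-card" data-project="{&quot;id&quot;:2,&quot;urls&quot;:{&quot;web&quot;:{&quot;project&quot;:&quot;https://www.kickstarter.com/projects/example/second&quot;,&quot;rewards&quot;:&quot;https://www.kickstarter.com/projects/example/second/rewards&quot;}}}"></div>
'''

# Capture everything after "project":" up to the next double quote,
# so the match is the bare URL and needs no further cleanup.
PROJECT_URL = re.compile(r'"project":"(https://www\.kickstarter\.com/[^"]+)"')


def extract_project_urls(page_text):
    """Return the project URLs found in page_text, in order, without duplicates."""
    # Attribute values in raw HTML escape quotes as &quot;, so undo that first.
    text = html.unescape(page_text)
    seen = []
    for url in PROJECT_URL.findall(text):
        if url not in seen:
            seen.append(url)
    return seen


for project_url in extract_project_urls(SAMPLE_HTML):
    print(project_url)
```

In the answer's function, the same helper could be fed source_code.text in place of the re.findall call, assuming the listing page embeds its project data in a comparable way.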