python - How to scrape only the links from a web page
Problem description
My goal is to get every link.
My code prints the href/link, but it also prints other junk that I don't want.
I only want the href.
from selenium import webdriver
from bs4 import BeautifulSoup
import pandas as pd
import time
import requests

driver = webdriver.Chrome()
productlink = []
for x in range(1, 3):
    driver.get(f'https://meetinglibrary.asco.org/browse-meetings/2021%20Gastrointestinal%20Cancers%20Symposium?page={x}')
    time.sleep(3)
    page_source = driver.page_source
    soup = BeautifulSoup(page_source, 'html.parser')
    productlist = soup.find_all('div', class_='session')
    for item in productlist:
        for link in item.find_all('a', class_='session__button ng-star-inserted', href=True):
            print(link)
Solution
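This happens because href=True only tells BeautifulSoup to match tags that have an href attribute. What you get back is still a Tag, so printing it shows the whole element. To get the href value itself, you also need to call .get("href") on the tag. Since each session element contains only one button, you can use find instead of find_all, and don't forget to prepend baseURL, because the href values are relative paths. Try the code below: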
from selenium import webdriver
from bs4 import BeautifulSoup
import pandas as pd
import time
import requests

driver = webdriver.Chrome()
productlink = []
baseURL = 'https://meetinglibrary.asco.org'
for x in range(1, 3):
    driver.get(f'https://meetinglibrary.asco.org/browse-meetings/2021%20Gastrointestinal%20Cancers%20Symposium?page={x}')
    # Give the JavaScript-rendered page time to load before reading it.
    time.sleep(3)
    page_source = driver.page_source
    soup = BeautifulSoup(page_source, 'html.parser')
    productlist = soup.find_all('div', class_='session')
    for item in productlist:
        # find returns the first matching Tag; .get("href") extracts the attribute value.
        print(baseURL + item.find('a', class_='session__button ng-star-inserted', href=True).get("href"))
Output:
https://meetinglibrary.asco.org/session/13455
https://meetinglibrary.asco.org/session/13458
https://meetinglibrary.asco.org/session/13445
https://meetinglibrary.asco.org/session/13450
https://meetinglibrary.asco.org/session/13460
https://meetinglibrary.asco.org/session/13462
https://meetinglibrary.asco.org/session/13464
https://meetinglibrary.asco.org/session/13459
https://meetinglibrary.asco.org/session/13446
https://meetinglibrary.asco.org/session/13451
https://meetinglibrary.asco.org/session/13461
https://meetinglibrary.asco.org/session/13463
https://meetinglibrary.asco.org/session/13465
https://meetinglibrary.asco.org/session/13399
https://meetinglibrary.asco.org/session/13443
https://meetinglibrary.asco.org/session/13444
https://meetinglibrary.asco.org/session/13352
https://meetinglibrary.asco.org/session/13381
https://meetinglibrary.asco.org/session/13383
https://meetinglibrary.asco.org/session/13372
https://meetinglibrary.asco.org/session/13382
https://meetinglibrary.asco.org/session/13447
https://meetinglibrary.asco.org/session/13849
https://meetinglibrary.asco.org/session/13384
https://meetinglibrary.asco.org/session/13389
https://meetinglibrary.asco.org/session/13453
https://meetinglibrary.asco.org/session/13859
https://meetinglibrary.asco.org/session/13391
https://meetinglibrary.asco.org/session/13392
https://meetinglibrary.asco.org/session/13394
....
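The difference between matching a tag and reading its attribute can be tried without a browser. The following is a minimal sketch, not the answer's own code: SAMPLE_HTML is made-up markup that only imitates the structure described above, and extract_links is a hypothetical helper name. It also makes two assumptions that go beyond the answer: it matches on the single class session__button (so it would still work if ng-star-inserted were absent), and it skips any session element where find returns None instead of raising AttributeError. urljoin from the standard library is used in place of string concatenation.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical markup imitating the page structure; not copied from the real site.
SAMPLE_HTML = """
<div class="session">
  <a class="session__button ng-star-inserted" href="/session/13455">View</a>
</div>
<div class="session">
  <a class="session__button ng-star-inserted" href="/session/13458">View</a>
</div>
<div class="session">
  <span>No button in this one</span>
</div>
"""

BASE_URL = "https://meetinglibrary.asco.org"


def extract_links(html, base_url):
    """Return absolute URLs for the first matching button in each session div."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for item in soup.find_all("div", class_="session"):
        # href=True filters to tags that have an href; the result is still a Tag.
        button = item.find("a", class_="session__button", href=True)
        if button is None:
            # Assumption: a session without a button is skipped rather than treated as an error.
            continue
        # .get("href") returns the attribute value as a plain string.
        links.append(urljoin(base_url, button.get("href")))
    return links


links = extract_links(SAMPLE_HTML, BASE_URL)
for url in links:
    print(url)
```

Collecting the URLs in a list, rather than printing them directly, would also give the productlink list from the question something to hold.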