How to scrape only the links from a web page - Python

Problem description

My goal is to get every link.

My code prints the href/link, but it also prints other junk that I don't want.

I only want the href.

from selenium import webdriver
from bs4 import BeautifulSoup
import pandas as pd
import time
import requests
driver = webdriver.Chrome()
productlink=[]
for x in range (1,3):
    driver.get(f'https://meetinglibrary.asco.org/browse-meetings/2021%20Gastrointestinal%20Cancers%20Symposium?page={x}')
    time.sleep(3)
    page_source = driver.page_source
    soup = BeautifulSoup(page_source,'html.parser')
    productlist=soup.find_all('div',class_='session')
    for item in productlist:
        for link in item.find_all('a',class_='session__button ng-star-inserted',href=True):
            print(link)

Tags: python, html, beautifulsoup

Solution


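This happens because href=True only restricts the search to tags that have an href attribute; each result is still a Tag, so printing it prints the whole element. To get the link itself, you also need to call .get("href") on the tag. Since there is only one button in each session tag, you can use find instead of find_all, and don't forget to prepend baseURL, because the href values are relative. Try the code below: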
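Before the full script, the difference between a Tag and its href value can be seen on a small static snippet, with no browser involved. This is only a minimal sketch: the HTML below is made-up markup that stands in for one session card, not the site's real page source.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for one session card (an assumption,
# not copied from the real site).
html = """
<div class="session">
  <a class="session__button ng-star-inserted" href="/session/13455">View</a>
  <a class="session__button ng-star-inserted">No link</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
item = soup.find("div", class_="session")

# href=True is only a filter: the result is still a Tag object.
tag = item.find("a", class_="session__button", href=True)
print(tag)              # prints the whole <a ...> element
print(tag.get("href"))  # prints only the attribute value

# find() returns None when nothing matches, so check before calling .get().
missing = item.find("a", class_="no-such-class", href=True)
href = missing.get("href") if missing is not None else None
```

The same idea is applied to the live page in the full Selenium script that follows.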

from selenium import webdriver
from bs4 import BeautifulSoup
import time

driver = webdriver.Chrome()
productlink = []
baseURL = 'https://meetinglibrary.asco.org'
for x in range(1, 3):
    driver.get(f'https://meetinglibrary.asco.org/browse-meetings/2021%20Gastrointestinal%20Cancers%20Symposium?page={x}')
    time.sleep(3)  # give the JavaScript-rendered page time to load
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    for item in soup.find_all('div', class_='session'):
        # find() returns the first matching Tag, or None if there is none
        button = item.find('a', class_='session__button ng-star-inserted', href=True)
        if button is not None:
            # .get("href") returns just the relative link, e.g. /session/13455
            productlink.append(baseURL + button.get("href"))
            print(productlink[-1])
driver.quit()  # close the browser when done

Output:

https://meetinglibrary.asco.org/session/13455
https://meetinglibrary.asco.org/session/13458
https://meetinglibrary.asco.org/session/13445
https://meetinglibrary.asco.org/session/13450
https://meetinglibrary.asco.org/session/13460
https://meetinglibrary.asco.org/session/13462
https://meetinglibrary.asco.org/session/13464
https://meetinglibrary.asco.org/session/13459
https://meetinglibrary.asco.org/session/13446
https://meetinglibrary.asco.org/session/13451
https://meetinglibrary.asco.org/session/13461
https://meetinglibrary.asco.org/session/13463
https://meetinglibrary.asco.org/session/13465
https://meetinglibrary.asco.org/session/13399
https://meetinglibrary.asco.org/session/13443
https://meetinglibrary.asco.org/session/13444
https://meetinglibrary.asco.org/session/13352
https://meetinglibrary.asco.org/session/13381
https://meetinglibrary.asco.org/session/13383
https://meetinglibrary.asco.org/session/13372
https://meetinglibrary.asco.org/session/13382
https://meetinglibrary.asco.org/session/13447
https://meetinglibrary.asco.org/session/13849
https://meetinglibrary.asco.org/session/13384
https://meetinglibrary.asco.org/session/13389
https://meetinglibrary.asco.org/session/13453
https://meetinglibrary.asco.org/session/13859
https://meetinglibrary.asco.org/session/13391
https://meetinglibrary.asco.org/session/13392
https://meetinglibrary.asco.org/session/13394
....
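Plain string concatenation with baseURL works here because, judging by the output, every href is root-relative (for example /session/13455). If some links were already absolute, concatenation would produce a broken URL. A possible alternative, sketched below, is urllib.parse.urljoin from the standard library. The helper name to_absolute and the sample hrefs list are hypothetical and used only for illustration.

```python
from urllib.parse import urljoin

baseURL = "https://meetinglibrary.asco.org"

def to_absolute(href, base=baseURL):
    """Resolve href against base (hypothetical helper, not part of the original answer)."""
    return urljoin(base, href)

# Sample values for illustration: one root-relative href, one already absolute.
hrefs = ["/session/13455", "https://meetinglibrary.asco.org/session/13458"]
productlink = [to_absolute(h) for h in hrefs]
for url in productlink:
    print(url)
```

In the scraping loop, this would mean calling to_absolute on the value returned by .get("href") instead of adding baseURL in front of it.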
