Scraping the href/URL from each presentation

Problem

My code opens a page that lists a number of sessions, collects their URLs, and stores them in a list.

It then visits each URL in that list and scrapes every presentation on the session page.

At the moment I scrape each presentation's title (you can see this if you run the code), but the title element also carries another URL/href that I want.

Is there a way to scrape that as well?

Thanks

from selenium import webdriver
import pandas as pd
from bs4 import BeautifulSoup
import requests
import time

val = []
driver = webdriver.Chrome()

# Collect the session URLs from the first two result pages.
for x in range(1, 3):
    driver.get(f'https://www.abstractsonline.com/pp8/#!/9325/sessions/@sessiontype=Advances%20in%20Diagnostics%20and%20Therapeutics/{x}')
    time.sleep(9)
    page_source = driver.page_source
    # Build a session URL from each result's data-id attribute.
    eachrow = ["https://www.abstractsonline.com/pp8/#!/9325/session/" + x.get_attribute('data-id') for x in driver.find_elements_by_xpath('//*[@id="results"]/li//h1[@class="name"]')]
    for row in eachrow:
        val.append(row)
        print(row)

# Visit each session URL and scrape the presentation titles.
for b in val:
    driver.get(b)
    time.sleep(3)
    page_source1 = driver.page_source
    soup = BeautifulSoup(page_source1, 'html.parser')
    productlist = soup.find_all('a', class_='title color-primary')
    for item in productlist:
        presentationTitle = item.text.strip()
        print(presentationTitle)

Tags: python, selenium, web-scraping, beautifulsoup

Solution

I think you need a wait condition, and you can then extract the href attribute for each presentation on the page.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
base = 'https://www.abstractsonline.com/pp8/#!/9325/session/'

for x in range(1, 3):
    driver.get(f'https://www.abstractsonline.com/pp8/#!/9325/sessions/@sessiontype=Advances%20in%20Diagnostics%20and%20Therapeutics/{x}')
    # Wait for the result list to render, then build a session URL from each data-id.
    links = [base + i.get_attribute('data-id') for i in WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "li .name")))]

    for link in links:
        driver.get(link)
        # Wait for the session page to load, then print the session title.
        print(WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, "spnSessionTitle"))).text)
        # Each presentation title element also carries the href you are after.
        for presentation in driver.find_elements_by_css_selector('.title'):
            print(presentation.text.strip())
            print('https://www.abstractsonline.com/pp8' + presentation.get_attribute('href'))
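
If you would rather stay with the BeautifulSoup parsing from your question, the same href can usually be read straight from the anchors you already select. A minimal sketch, assuming the a.title.color-primary anchors carry a (possibly relative) href attribute and that driver is already sitting on a session page; urljoin simply resolves a relative href against the current page URL:

from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Parse the rendered page and collect title + link for each presentation.
soup = BeautifulSoup(driver.page_source, 'html.parser')
results = []
for item in soup.select('a.title.color-primary'):
    title = item.get_text(strip=True)
    href = item.get('href')  # None if the anchor has no href attribute
    results.append({'title': title,
                    'url': urljoin(driver.current_url, href) if href else None})
print(results)

The resulting list of dicts can then be loaded into the pandas you already import in the question with pd.DataFrame(results), if you want tabular output.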
