Remove duplicate URLs in Python

Problem description

from selenium import webdriver
import time
from bs4 import BeautifulSoup as Bs

driver = webdriver.Chrome(executable_path=r'C:\Users\91901\PycharmProjects\kk\drivers\chromedriver.exe')

# page_no = input('Enter Page Number : ')
page_no = '2'
blog_page = driver.get('https://xmonks.com/blog/page/' + page_no + '/')
time.sleep(1)
driver.execute_script("window.scrollTo(0, 500)")
time.sleep(1)
driver.execute_script("window.scrollTo(0, 1000)")
driver.execute_script("window.scrollTo(0, 1500)")
time.sleep(1)
driver.execute_script("window.scrollTo(0, 2000)")
time.sleep(2)
soup = Bs(driver.page_source, 'html.parser')
time.sleep(3)
link = soup.find('div', {'class': 'exp-grid-wrap'})
lnk = link.find_all('a')
for links in lnk:
    ll = links.get('href')
    print(ll)

I am scraping the blog URLs from this website, but I am getting some duplicate URLs. Please help me figure out how to remove the duplicates. Thanks in advance.

Tags: python, selenium, beautifulsoup

Solution


You just need to target the right element to get what you need. I used the a tag that wraps each post thumbnail, and it worked.

This will give you the following output:

soup = Bs(driver.page_source, 'html.parser')
time.sleep(3)

# each post has exactly one thumbnail link, so selecting these avoids duplicate hrefs
a_tags = soup.find_all('a', {'class': 'exp-post-thumb-inner'})

links = [i['href'] for i in a_tags]
for i in links:
    print(i)

# output

https://xmonks.com/mindfulness-meditation-coaching/
https://xmonks.com/the-healing-aspects-of-mindfulness-coaching/
https://xmonks.com/techniques-to-be-mindful-a-word-by-and-for-the-coaches/
https://xmonks.com/understanding-mindfulness-coaching/
https://xmonks.com/neuroscience-of-emotions-and-values/
https://xmonks.com/neuroscience-of-beliefs/
https://xmonks.com/coaching-skills-for-leaders/
https://xmonks.com/neuroscience-of-goals/
https://xmonks.com/simplifying-coaching-coaching-matters/
https://xmonks.com/solution-focused-coaching/
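
If you would rather keep the original selector on the exp-grid-wrap div, you can also simply de-duplicate the collected hrefs in Python. Here is a minimal sketch, assuming lnk is the list of a tags from the question's code:

# order-preserving de-duplication of the hrefs collected in the question;
# assumes `lnk` is the list of <a> tags found inside the 'exp-grid-wrap' div
hrefs = [a.get('href') for a in lnk if a.get('href')]

# dict.fromkeys keeps only the first occurrence of each URL and preserves order
unique_links = list(dict.fromkeys(hrefs))

for url in unique_links:
    print(url)

A plain set() would also drop duplicates, but it does not preserve the original order of the links, which is why dict.fromkeys is used here.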
