How to add an "href contains" condition in BeautifulSoup

Problem description

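I am trying to extract links from a web page. When I run the code below, I get every link on the page. I need to extract only the links that contain watch?v=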

import ssl
from urllib.request import Request, urlopen

from bs4 import BeautifulSoup
# For ignoring SSL certificate errors

ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

# Input from user

#url = input('Enter Youtube Video Url- ')
#url = 'https://www.youtube.com/watch?v=MxnkDj8PIxQ'
url = 'https://www.youtube.com/feed/trending'
# Making the website believe that you are accessing it using a mozilla browser

req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req, context=ctx).read()

# Creating a BeautifulSoup object of the html page for easy extraction of data.

soup = BeautifulSoup(webpage, 'html.parser')
html = soup.prettify('utf-8')
for a in soup.find_all('a', href=True):
    print("Found the URL:", a['href'])

My output

Found the URL: /watch?v=EJe3xxkzj5Y
Found the URL: /watch?v=Thf60JU8E98
Found the URL: /watch?v=Thf60JU8E98
Found the URL: /user/adityamusic
Found the URL: /channel/Muzik

My expected output should contain only the links with watch?v=

Found the URL: /watch?v=EJe3xxkzj5Y
Found the URL: /watch?v=Thf60JU8E98

Tags: python, beautifulsoup, href

Solution


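You can pass a compiled regular expression as the href keyword argument of find_all. Only the <a> tags whose href matches the pattern are returned.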

soup.find_all('a', href=re.compile(r'^/watch\?v='))

Code

import re
# Rest of your code ...
for a in soup.find_all('a', href=re.compile(r'^/watch\?v=')):
    print("Found the URL:", a['href'])
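
To try the filter without fetching a live page, here is a minimal self-contained sketch. The SAMPLE_HTML markup and the two helper function names are hypothetical, modelled on the links printed in the question. The first helper uses the regular expression from the answer. The second is an alternative that is not part of the original answer: a CSS attribute selector, which assumes a BeautifulSoup version with select() backed by soupsieve (4.7 or later). Both drop the duplicate link, which matches the expected output in the question.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical sample markup, modelled on the links printed in the question.
SAMPLE_HTML = """
<a href="/watch?v=EJe3xxkzj5Y">Video 1</a>
<a href="/watch?v=Thf60JU8E98">Video 2</a>
<a href="/watch?v=Thf60JU8E98">Video 2 (thumbnail)</a>
<a href="/user/adityamusic">User</a>
<a href="/channel/Muzik">Channel</a>
"""


def watch_links_regex(soup):
    """Return hrefs that start with /watch?v=, in page order, without duplicates."""
    pattern = re.compile(r'^/watch\?v=')
    hrefs = [a['href'] for a in soup.find_all('a', href=pattern)]
    # dict.fromkeys keeps the first occurrence of each href and preserves order.
    return list(dict.fromkeys(hrefs))


def watch_links_css(soup):
    """Return hrefs that contain watch?v= anywhere, using a CSS attribute selector."""
    hrefs = [a['href'] for a in soup.select('a[href*="watch?v="]')]
    return list(dict.fromkeys(hrefs))


soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
for href in watch_links_regex(soup):
    print("Found the URL:", href)
```

The regular expression is anchored with ^, so it matches only relative links that begin with /watch?v=. The CSS selector a[href*="watch?v="] is a true "contains" test and would also match absolute URLs such as https://www.youtube.com/watch?v=..., so pick whichever behavior fits the page you are scraping.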
