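python - BeautifulSoup find_all('href') returns only part of the value
Problem description
I'm trying to scrape actor/actress IDs from an IMDB movie page. I only want the cast (I don't want any of the crew), and this question is specifically about getting each person's internal ID. I already have the people's names, so I don't need help getting those. I'm starting with this page (https://www.imdb.com/title/tt0084726/fullcredits?ref_=tt_cl_sm#cast) as a hard-coded URL to get the code right.
When inspecting the links, I found that the links for cast members look like this.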
<a href="/name/nm0000638/?ref_=ttfc_fc_cl_t1"> William Shatner</a>
<a href="/name/nm0000559/?ref_=ttfc_fc_cl_t2"> Leonard Nimoy</a>
<a href="/name/nm0346415/?ref_=ttfc_fc_cl_t17"> Nicholas Guest</a>
while the links for other contributors look like this
<a href="/name/nm0583292/?ref_=ttfc_fc_dr1"> Nicholas Meyer </a>
<a href="/name/nm0734472/?ref_=ttfc_fc_wr1"> Gene Roddenberry</a>
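This should let me tell actors/actresses apart from crew such as directors or writers by checking whether the href ends in "t[0-9]+$" rather than the same pattern with "dr" or "wr".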
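As a quick illustration of that check, here is a minimal sketch that applies the same pattern to the sample hrefs listed above. The helper name `is_cast_href` is hypothetical, and the sketch assumes the hrefs really do carry the `ref_` suffix, which is exactly what turns out not to hold for the fetched page.

```python
import re

# Sample hrefs copied from the links shown above.
SAMPLE_HREFS = [
    "/name/nm0000638/?ref_=ttfc_fc_cl_t1",   # cast
    "/name/nm0000559/?ref_=ttfc_fc_cl_t2",   # cast
    "/name/nm0346415/?ref_=ttfc_fc_cl_t17",  # cast
    "/name/nm0583292/?ref_=ttfc_fc_dr1",     # director
    "/name/nm0734472/?ref_=ttfc_fc_wr1",     # writer
]

# Same pattern as in the question: a "t" followed by digits at the very end.
CAST_SUFFIX = re.compile(r"t[0-9]+$")


def is_cast_href(href):
    """Hypothetical helper: True when the href is a /name/ link ending in t<digits>."""
    return href.startswith("/name/") and CAST_SUFFIX.search(href) is not None


cast_hrefs = [href for href in SAMPLE_HREFS if is_cast_href(href)]
print(cast_hrefs)
```

The three cast links pass and the director and writer links do not. A bare `/name/nm0000638/` with no query string fails the check as well, so the filter rejects everything when the suffix is missing.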
Here is the code I'm running.
import urllib.request
from bs4 import BeautifulSoup
import re

movieNumber = 'tt0084726'
url = 'https://www.imdb.com/title/' + movieNumber + '/fullcredits?ref_=tt_cl_sm#cast'

def clearLists(n):
    return [[] for _ in range(n)]

def getSoupObject(urlInput):
    page = urllib.request.urlopen(urlInput).read()
    soup = BeautifulSoup(page, features="html.parser")
    return(soup)

def getPeopleForMovie(soupObject):
    listOfPeopleNames, listOfPeopleIDs, listOfMovieIDs = clearLists(3)

    #get all the tags with links in them
    link_tags = soupObject.find_all('a')

    #get the ids of people
    for linkTag in link_tags:
        link = str(linkTag.get('href'))
        #print(link)
        p = re.compile('t[0-9]+$')
        q = p.search(link)
        if link.startswith('/name/') and q != None:
            id = link[6:15]
            #print(id)
            listOfPeopleIDs.append(id)

    #return the names and IDs
    return listOfPeopleNames, listOfPeopleIDs

newSoupObject = getSoupObject(url)
pNames, pIds = getPeopleForMovie(newSoupObject)
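The code above returns an empty list of IDs. If you uncomment the print statements, you can see that this is because the values that end up in the "link" variable look like the following (varying per person)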
/name/nm0583292/
/name/nm0000638/
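That won't do. I only want the IDs of the actors and actresses, so that I can use those IDs later. I've tried to find other answers on Stack Overflow, but I haven't been able to find this particular problem.
This question (Beautifulsoup: parsing html – get part of href) is close to what I want to do, but it takes the information from the text between the tags rather than from the href attribute of the tag.
How can I make sure I get only the name IDs I want from the page (just the IDs of the cast)? (Also, feel free to offer suggestions for tightening up the code.)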
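Solution
The links you're trying to match appear to have been modified by JavaScript after loading, or they may be served differently depending on something other than the URL alone (such as cookies or headers).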
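One way to probe the header hypothesis is to send the request with browser-like headers and compare the hrefs that come back. The sketch below is only a starting point: the header values are hypothetical, and nothing here confirms that any particular header changes what the server returns.

```python
import urllib.request

# Hypothetical header values, chosen only for illustration.
BROWSER_LIKE_HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64)",
    "Accept-Language": "en-US,en;q=0.9",
}


def build_request(url, headers=None):
    """Build a Request carrying the given headers (defaults to the ones above)."""
    return urllib.request.Request(url, headers=dict(headers or BROWSER_LIKE_HEADERS))


def fetch_html(url):
    """Fetch the page with the custom headers; needs network access."""
    with urllib.request.urlopen(build_request(url), timeout=30) as response:
        return response.read()


request = build_request("https://www.imdb.com/title/tt0084726/fullcredits")
print(request.header_items())
```

Passing the result of `fetch_html(url)` to `BeautifulSoup` in place of the plain `urlopen(...).read()` call would show whether the hrefs differ. If JavaScript adds the suffix, they will not.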
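However, since you're only after the links to people in the cast, an easier approach is to match the IDs of the people in the cast section. That's actually fairly straightforward, because they are all inside a single element, <table class="cast_list">
So: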
import urllib.request
from bs4 import BeautifulSoup
import re

# it's Python, so use Python conventions, no uppercase in function or variable names
movie_number = 'tt0084726'
# The f-string is often more readable than a + concatenation
url = f'https://www.imdb.com/title/{movie_number}/fullcredits?ref_=tt_cl_sm#cast'


# this is overly fancy for something as simple as initialising some variables
# how about:
# a, b, c = [], [], []
# def clearLists(n):
#     return [[] for _ in range(n)]


# in an object-oriented program, assuming something is an object is the norm
def get_soup(url_input):
    page = urllib.request.urlopen(url_input).read()
    soup = BeautifulSoup(page, features="html.parser")
    # removed needless parentheses - arguably, even `soup` is superfluous:
    # return BeautifulSoup(page, features="html.parser")
    return soup


# keep two empty lines between functions, it's standard and for good reason
# it's easier to spot where a function starts and stops
# try using an editor or IDE that highlights your PEP8 mistakes, like PyCharm
# (that's just my opinion there, other IDEs than PyCharm will do as well)
def get_people_for_movie(soup_object):
    # removed unused variables, also 'list_of_people_ids' is needlessly verbose
    # since they go together, why not return people as a list of tuples, or a dictionary?
    # I'd prefer a dictionary as it automatically gets rid of duplicates as well
    people = {}

    # (put a space at the start of your comment blocks!)
    # get all the anchor tags inside the `cast_list` table
    link_tags = soup_object.find('table', class_='cast_list').find_all('a')

    # the whole point of compiling the regex is to only have to do it once,
    # so outside the loop
    id_regex = re.compile(r'/name/nm(\d+)/')

    # get the ids and names of people
    for link_tag in link_tags:
        # the href attribute is a string, so casting with str() serves no purpose
        href = link_tag.get('href')
        # matching and extracting part of the match can all be done in one step:
        match = id_regex.search(href)
        if match:
            # don't shadow Python built-ins like `id` with variable names!
            identifier = match.group(1)
            name = link_tag.text.strip()
            # just ignore the ones with no text, they're the thumbs
            if name:
                people[identifier] = name

    # return the names and IDs
    return people


def main():
    # don't do stuff globally, it'll just cause problems when reusing names in functions
    soup = get_soup(url)
    people = get_people_for_movie(soup)
    print(people)


# not needed here, but a good habit, allows you to import stuff without running the main
if __name__ == '__main__':
    main()
Result:
{'0000638': 'William Shatner', '0000559': 'Leonard Nimoy', '0001420': 'DeForest Kelley', etc.
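To see why scoping the search to the cast table is enough, here is a small offline sketch that runs the same `find(...).find_all('a')` calls against a hand-written fragment. The markup is a hypothetical simplification (the `simpleTable` class for the crew table is an assumption); only the `cast_list` class comes from the answer above.

```python
from bs4 import BeautifulSoup

# Hypothetical, heavily simplified markup standing in for the real page.
SAMPLE_HTML = """
<div>
  <table class="simpleTable">
    <tr><td><a href="/name/nm0583292/"> Nicholas Meyer </a></td></tr>
  </table>
  <table class="cast_list">
    <tr>
      <td><a href="/name/nm0000638/"><img alt="William Shatner"></a></td>
      <td><a href="/name/nm0000638/"> William Shatner</a></td>
    </tr>
    <tr>
      <td><a href="/name/nm0000559/"><img alt="Leonard Nimoy"></a></td>
      <td><a href="/name/nm0000559/"> Leonard Nimoy</a></td>
    </tr>
  </table>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, features="html.parser")

# Only anchors inside the cast table are considered, so crew links never show up.
cast_table = soup.find("table", class_="cast_list")

# Thumbnail anchors have no text of their own and are skipped.
cast_names = [a.text.strip() for a in cast_table.find_all("a") if a.text.strip()]
print(cast_names)
```

Note that `find` returns `None` when no such table exists, so code running against the live page may want to check for that before calling `find_all`.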
And here is the code with a few more tweaks, without the commentary on your code:
import urllib.request
from bs4 import BeautifulSoup
import re


def get_soup(url_input):
    page = urllib.request.urlopen(url_input).read()
    return BeautifulSoup(page, features="html.parser")


def get_people_for_movie(soup_object):
    people = {}

    link_tags = soup_object.find('table', class_='cast_list').find_all('a')
    id_regex = re.compile(r'/name/nm(\d+)/')

    # get the ids and names of the cast
    for link_tag in link_tags:
        match = id_regex.search(link_tag.get('href'))
        if match:
            name = link_tag.text.strip()
            if name:
                people[match.group(1)] = name

    return people


def main():
    movie_number = 'tt0084726'
    url = f'https://www.imdb.com/title/{movie_number}/fullcredits?ref_=tt_cl_sm#cast'

    people = get_people_for_movie(get_soup(url))
    print(people)


if __name__ == '__main__':
    main()
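The dictionary above is keyed by the digits only. If the full identifier including the `nm` prefix is wanted, as the `link[6:15]` slice in the question produced, a small helper along these lines could be used instead. It is a sketch: the name `person_id_from_href` is hypothetical, and it relies only on the standard library.

```python
import re
from urllib.parse import urlsplit

# Matches a path such as /name/nm0000638/ and captures the full ID.
NAME_PATH = re.compile(r"^/name/(nm\d+)/")


def person_id_from_href(href):
    """Hypothetical helper: return e.g. 'nm0000638', or None if href is not a person link."""
    if not href:
        return None
    # urlsplit drops the query string, so the ref_ suffix (if any) is irrelevant.
    match = NAME_PATH.match(urlsplit(href).path)
    return match.group(1) if match else None


print(person_id_from_href("/name/nm0000638/?ref_=ttfc_fc_cl_t1"))
print(person_id_from_href("/name/nm0000638/"))
print(person_id_from_href("/title/tt0084726/"))
```

Unlike a fixed-width slice, the regex does not assume a particular number of digits, and the `None` check covers anchors that have no href attribute.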