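python - IMDb redirect hyperlinks do not work after scraping with Beautiful Soup
Problem description
I am trying to use Beautiful Soup to scrape the official-site data from title pages on IMDb. For example, to get the data for Interstellar, I have the following code: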
import requests
from bs4 import BeautifulSoup

url = 'https://www.imdb.com/title/tt0816692/'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
title_detail_soup = soup.find('div', {'id': 'titleDetails'})
details_soup = title_detail_soup.find_all('div', class_='txt-block')
detail_list = ['Official Sites:', 'Country:', 'Language:',
               'Release Date:', 'Also Known As:', 'Filming Locations:']
details = {}
for detail in details_soup:
    try:
        # Each block's heading (h4) holds the name of the detail
        head = detail.find('h4')
        if head.get_text() in detail_list:
            # The heading is one of the details we want
            if head.get_text() == 'Official Sites:':
                # This block lists the official sites
                official_site = {}
                detail.h4.decompose()  # remove the <h4> tag
                a_tags = detail.find_all('a')
                for a_tag in a_tags:
                    # exclude the "See more" link
                    if a_tag.get_text() != 'See more':
                        data = url+a_tag['href']  # final link is base URL + hyperlink
                        official_site[a_tag.get_text()] = data
                details['official-sites'] = official_site
    except Exception as e:
        print(e)
print(details)  # Print the detail dictionary
The page's HTML:
<div class="article" id="titleDetails">
<span class="rightcornerlink">
<a href="https://contribute.imdb.com/updates?edit=tt0816692/details&ref_=tt_dt_dt">Edit</a>
</span>
<h2>Details</h2>
<div class="txt-block">
<h4 class="inline">Official Sites:</h4>
<a href="/offsite/?page-action=offsite-facebook&token=BCYpckvEa_ZSPp2TC3Ztr1DNqde5ZCUHig7950CLYvsgSHOzBCfJSHpgg71IYRsZYP1DuUpTZb9H%0D%0AhK4BzY5AiKU5Vy2oFn7i91MVFT_TnR39yhU5V5NBAse2mY_ht5WdsmSBxQPGRBC6pIJJym7IXbao%0D%0ATz9SG3r8MjKfwIe9hBrJU5Y-vNdnR_uaDq_24s2NGj5ikJYWl_093YIHy_I2lnK-I6jK9OvOpwgw%0D%0AupABQOymuxA%0D%0A&ref_=tt_pdt_ofs_offsite_0" rel="nofollow">Official Facebook</a>
<span class="ghost">|</span>
<a href="/offsite/?page-action=offsite-interstellarmovie&token=BCYuB9Ouy5QXl_3W_k3RrnnXUdrfSLbBFfOcrJTX0yo5TtTDqsSLpry8x7drK8l0xpOJSEqt73Hz%0D%0A08qyki3_i83CrCym7SXSkevFQpT32TjuuJLgIlQ-W5CpRd-wZC9eD4R3SZOMdOfSjeoOtqiE5uU_%0D%0Az-YG1i5AImXY2xLmHSNwABh1hU7VHS-FnqKDW9G-4KOF78zpKdDIfrwlRs8px0yef9u51LojZz05%0D%0A0OBfTmRs_JI%0D%0A&ref_=tt_pdt_ofs_offsite_1" rel="nofollow">Official site</a>
<span class="ghost">|</span>
<span class="see-more inline">
<a href="externalsites?ref_=tt_dt_dt#official">See more</a> »
</span>
</div>
</div>
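With this approach I can extract the data into a dictionary, but the hyperlinks in the dictionary do not work: opening them gives a "requested URL was not found" error.
Output dictionary: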
{
'official-sites': {
'Official Facebook': 'https://www.imdb.com/title/tt0816692/offsite/?page-action=offsite-facebook&token=BCYqzjQrP9OA_yaYNwA9Q8hI5gt41EmHuu0_ePjZPHKui-hEmAEySo-0SHzZmSjpeeEVy3Art6SH%0D%0ATseW16b3uKMjIH8iOyO-ZVYR025mQ4YCbZIWUKEcEM-z0eOeUvud3KGbuQTCxrNhTGAx7xgFIB89%0D%0Al9jT6pvqSpSCdNYACnBhk_8MuNjCn8GIJZk-6PR1MZ1xQB5yDrqRNhNt9Dg8IDMXVpxTR8-LFu2I%0D%0Amf5KmXbmXos%0D%0A',
'Official site': 'https://www.imdb.com/title/tt0816692/offsite/?page-action=offsite-interstellarmovie&token=BCYsMb9WTKJLH9M9nmxvLDpn8ikQDnQmpVQZBurp9Trd1-XXbA_Bh4xoKx6yf3Qx4YNn3fT9UhFe%0D%0AnzcULcEY5SFJ7CW8kBj6dQvZA9GyvqfZMyIDS7daNe6rne6DkdL23CDPAkk1Xwr9rjiE6FF_m0vX%0D%0ASLH2NnzOf8BcKnaWILhGGdvHTYeZ_uRGm4QCIOzxw-CvLM2rag04ZbXM2ZUEvQm6OedW9XumtsnQ%0D%0AoP7ce67sytE%0D%0A'
}
}
Solution
Wherever the code calls get_text(), make sure the object it is called on is not None.
Try this:
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

url = 'https://www.imdb.com/title/tt0816692/'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
title_detail_soup = soup.find('div', {'id': 'titleDetails'})
details_soup = title_detail_soup.find_all('div', class_='txt-block')
detail_list = ['Official Sites:', 'Country:', 'Language:',
               'Release Date:', 'Also Known As:', 'Filming Locations:']
details = {}
for detail in details_soup:
    head = detail.find('h4')
    # find() returns None when a block has no <h4>; skip it instead of
    # letting get_text() raise and hiding the error in a bare except
    if head is None:
        continue
    if head.get_text() in detail_list:
        if head.get_text() == 'Official Sites:':
            official_site = {}
            head.decompose()  # remove the <h4> tag
            for a_tag in detail.find_all('a'):
                # exclude the "See more" link
                if a_tag.get_text() != 'See more':
                    # The href is root-relative ("/offsite/..."), so resolve it
                    # against the page URL instead of concatenating the strings
                    official_site[a_tag.get_text()] = urljoin(url, a_tag['href'])
            details['official-sites'] = official_site
print(details)
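A note on why the links in the question come back as "not found": the `href` values in the HTML start with a slash (`/offsite/?...`), so they are relative to the site root, not to the title page. Gluing them onto `https://www.imdb.com/title/tt0816692/` keeps the title path in front of `/offsite/`, which is probably the cause of the error. The sketch below shows the difference using only the standard library. `resolve_link` is a hypothetical helper name, and `token=EXAMPLE` is a placeholder, not a real IMDb token.

```python
from urllib.parse import urljoin

PAGE_URL = "https://www.imdb.com/title/tt0816692/"


def resolve_link(page_url: str, href: str) -> str:
    """Resolve an href found on page_url the way a browser would (hypothetical helper)."""
    return urljoin(page_url, href)


# Root-relative href (leading slash): the path of the page URL is discarded.
# "token=EXAMPLE" is a placeholder, not a real token.
offsite = resolve_link(PAGE_URL, "/offsite/?page-action=offsite-facebook&token=EXAMPLE")

# Document-relative href (no leading slash): resolved against the page's directory.
see_more = resolve_link(PAGE_URL, "externalsites?ref_=tt_dt_dt#official")

# Plain concatenation, as in the question, keeps the title path in front of /offsite/.
concatenated = PAGE_URL + "/offsite/?page-action=offsite-facebook&token=EXAMPLE"

print(offsite)
print(see_more)
print(concatenated)
```

Also note that the tokens in the HTML above and in the output dictionary are different, which suggests they are generated per request. If that is the case, a link stored from an earlier scrape may stop working even when it is resolved correctly.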