How to Scrape Data Dynamically with BeautifulSoup

Problem Description

I am learning how to scrape data from websites with BeautifulSoup, and I am trying to scrape movie links and some related data from the YTS website. I am stuck. I wrote a script that scrapes two movie qualities, but some movies list two or more qualities in the Tech Specs area, and selecting each one means writing separate code per quality. How can I write a for or while loop that scrapes all of them?

import requests
from bs4 import BeautifulSoup

m_r = requests.get('https://yts.mx/movies/suicide-squad-2016')
m_page = BeautifulSoup(m_r.content, 'html.parser')

#------------------ Name, Date, Category ----------------
m_det = m_page.find_all('div', class_='hidden-xs')

m_detail = m_det[4]
m_name = m_detail.contents[1].string
m_date = m_detail.contents[3].string
m_category = m_detail.contents[5].string
print(m_name)
print(m_date)
print(m_category)

#------------------ Download Links ----------------
m_li = m_page.find_all('p', {'class':'hidden-xs hidden-sm'})
m_link = m_li[0]
m_link_720 = m_link.contents[3].get('href')
print(m_link_720)
m_link_1080 = m_link.contents[5].get('href')
print(m_link_1080)

#-------------------- File Size & Language -------------------------
tech_spec = m_page.find_all('div', class_='row')
s_size = tech_spec[6].contents[1].contents[1]
#-----------Convert file size to MB-----------
if 'MB' in s_size:
    s_size = s_size.replace('MB', '').strip()
    print(s_size)
elif 'GB' in s_size:
    s_size = float(s_size.replace('GB', '').strip())
    s_size = s_size * 1024
    print(s_size)
#--------- Small file Language-----------
s_lan = tech_spec[6].contents[5].contents[2].strip()
print(s_lan)

b_size = tech_spec[8].contents[1].contents[1]
#-----------Convert file size to MB-----------
if 'MB' in b_size:
    b_size = b_size.replace('MB', '').strip()
    print(b_size)
elif 'GB' in b_size:
    b_size = float(b_size.replace('GB', '').strip())
    b_size = b_size * 1024
    print(b_size)
#--------- Big file Language-----------
b_lan = tech_spec[8].contents[5].contents[2].strip()
print(b_lan)
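The GB-to-MB conversion above is duplicated for each quality, and the two branches are inconsistent: the MB branch leaves the value as a string while the GB branch converts it to a float. A small helper could normalize both cases; this is a sketch assuming the size text always ends in "MB" or "GB" (the function name is illustrative):

    def size_to_mb(size_text):
        """Normalize a size string like '999.95 MB' or '1.88 GB' to a float in MB."""
        size_text = size_text.strip()
        if size_text.endswith('GB'):
            # 1 GB = 1024 MB, matching the original script's conversion
            return float(size_text[:-2].strip()) * 1024
        if size_text.endswith('MB'):
            return float(size_text[:-2].strip())
        raise ValueError('Unrecognized size: %r' % size_text)

    print(size_to_mb('999.95 MB'))  # 999.95
    print(size_to_mb('1.88 GB'))    # 1925.12

With this, both `s_size` and `b_size` can be handled by the same call instead of two copied if/elif blocks.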

Tags: python, python-3.x, web-scraping, beautifulsoup

Solution


This script gets all the information for every movie quality:

import requests
from bs4 import BeautifulSoup


url = 'https://yts.mx/movies/suicide-squad-2016'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

for tech_quality, tech_info in zip(soup.select('.tech-quality'), soup.select('.tech-spec-info')):
    print('Tech Quality:', tech_quality.get_text(strip=True))
    file_size, resolution, language, rating = [td.get_text(strip=True, separator=' ') for td in tech_info.select('div.row:nth-of-type(1) > div')]
    subtitles, fps, runtime, peers_seeds = [td.get_text(strip=True, separator=' ') for td in tech_info.select('div.row:nth-of-type(2) > div')]
    print('File size:', file_size)
    print('Resolution:', resolution)
    print('Language:', language)
    print('Rating:', rating)
    print('Subtitles:', tech_info.select_one('div.row:nth-of-type(2) > div:nth-of-type(1)').a['href'] if subtitles else '-')
    print('FPS:', fps)
    print('Runtime:', runtime)
    print('Peers/Seeds:', peers_seeds)
    print('-' * 80)

Prints:

Tech Quality: 3D.BLU
File size: 1.88 GB
Resolution: 1920*800
Language: English 2.0
Rating: PG - 13
Subtitles: -
FPS: 23.976 fps
Runtime: 2 hr 3 min
Peers/Seeds: P/S 8 / 35
--------------------------------------------------------------------------------
Tech Quality: 720p.BLU
File size: 999.95 MB
Resolution: 1280*720
Language: English 2.0
Rating: PG - 13
Subtitles: -
FPS: 23.976 fps
Runtime: 2 hr 3 min
Peers/Seeds: P/S 61 / 534
--------------------------------------------------------------------------------
Tech Quality: 1080p.BLU
File size: 2.06 GB
Resolution: 1920*1080
Language: English 2.0
Rating: PG - 13
Subtitles: -
FPS: 23.976 fps
Runtime: 2 hr 3 min
Peers/Seeds: P/S 80 / 640
--------------------------------------------------------------------------------
Tech Quality: 2160p.BLU
File size: 5.82 GB
Resolution: 3840*1600
Language: English 5.1
Rating: PG - 13
Subtitles: -
FPS: 23.976 fps
Runtime: 2 hr 2 min
Peers/Seeds: P/S 49 / 110
--------------------------------------------------------------------------------
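If the values need to go somewhere other than the console, the same per-quality fields can be collected into dictionaries instead of printed. A minimal sketch of the post-processing step, using one line of the output above (the helper name and dictionary keys are illustrative, not part of the answer):

    def parse_peers_seeds(text):
        """Split a 'P/S 49 / 110' string into (peers, seeds) integers."""
        # Drop the 'P/S' prefix, then split on the remaining '/'.
        peers, seeds = text.replace('P/S', '').split('/')
        return int(peers), int(seeds)

    record = {
        'quality': '2160p.BLU',
        'file_size': '5.82 GB',
        'peers_seeds': parse_peers_seeds('P/S 49 / 110'),
    }
    print(record['peers_seeds'])  # (49, 110)

Building one such record per loop iteration and appending it to a list gives structured data that can be written to JSON or CSV.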
