Only getting the last row when using Beautiful Soup

Problem description

I have the following code:

from bs4 import BeautifulSoup
import requests
import pandas as pd

def Get_Top_List_BR(url):
    response = requests.get(url)
    page = response.text
    soup = BeautifulSoup(page)
    table = soup.find(id='table')
    rows = [row for row in table.find_all('tr')]

    movies = {}
    for row in rows[1:]:
        items = row.find_all('td')
        link = items[1].find('a')
        title, url_string = link.text, link['href']
        # split url string into unique movie serial number
        url = url_string.split('?', 1)[0].split('t', 4)[-1].split('/', 1)[0]
        # set serial number as key to avoid duplication in any other category, especially title
        movies[url] = [url_string] + [i.text for i in items]

    movie_page = pd.DataFrame(movies).T  # transpose
    movie_page.columns = ['URL', 'Rank', 'Title', 'Genre', 'Budget', 'Running Time', 'Gross',
                          'Theaters', 'Total_Gross', 'Release_Date', 'Distributor', 'Estimated']

    return movie_page

df_test_BR = Get_Top_List_BR('https://www.boxofficemojo.com/year/2019/?grossesOption=calendarGrosses&area=BR/')

df_test_BR.head(10)

Problem: I only get the last row. Question: how can I fix this so that it returns all of the rows?

Tags: python, pandas, web-scraping, beautifulsoup

Solution

First of all, I'm not sure which version of Python you are using, but the way you instantiate BeautifulSoup is not quite right, at least on my version: BeautifulSoup strongly recommends passing an explicit parser here. This part of your code:

response = requests.get(url)
page = response.text
soup = BeautifulSoup(page)
table = soup.find(id='table')

should be:

response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
table = soup.find(id='table')
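
As an aside, html.parser is the parser that ships with Python's standard library. If you have the third-party lxml package installed, you can pass it instead for faster, more lenient parsing; a minimal sketch, assuming lxml is available in your environment:

# same call as above, but with the lxml parser (requires: pip install lxml)
soup = BeautifulSoup(response.content, 'lxml')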

Your real problem is how url is defined inside the for loop. I was able to iterate over all of the elements; how url gets defined there is what goes wrong. The way you derive url inside the for loop returns an empty string.

So, you say it only returns the last item. On every pass through the loop it computes url, but url is just that blank string, and the key already exists in movies, so each row overwrites the data that is already there; once the loop finishes, only the last row is left.
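
To see this concretely, here is a minimal sketch of your split chain applied to an href of the shape Box Office Mojo uses (the exact value below is a hypothetical example):

url_string = '/release/rl1182631425/?ref_=bo_yld_table_1'  # hypothetical example href
part = url_string.split('?', 1)[0]  # '/release/rl1182631425/' -- query string dropped
part = part.split('t', 4)[-1]       # no 't' anywhere in the string, so it is unchanged
url = part.split('/', 1)[0]         # the string starts with '/', so this is '' (empty)
print(repr(url))                    # ''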

I'm not sure exactly how you want url to be defined, but the code below does what you asked for: it fetches all of the movies with their names and href values, and returns every row (the top 10 are shown via head). The only thing you may still want to adjust is how the key for movies[...] is defined; just be careful not to reuse the name url again.

Also, since what you rebind to url inside the for loop really represents a unique ID, the name should reflect that: call it unique_id (or, as here, uid). I have also included a print statement to demonstrate that it runs through the whole loop and collects every row, not just the last one.
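
For contrast, the uid extraction in the code below works because the serial number is the second-to-last path segment of the href; with the same hypothetical href as above:

url_string = '/release/rl1182631425/?ref_=bo_yld_table_1'  # same hypothetical href
uid = url_string.split('/')[-2]  # 'rl1182631425'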

def Get_Top_List_GR(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    table = soup.find(id='table')
    rows = [row for row in table.find_all('tr')]

    movies = {}
    for row in rows[1:]:
        items = row.find_all('td')
        link = items[1].find('a')
        title, url_string = link.text, link['href']
        # split url string into unique movie serial number
        uid = url_string.split("/")[-2]
        print("{0} - {1} - {2}".format(url, title, uid))
        # set serial number as key to avoid duplication in any other category, especially title
        movies[uid] = [url_string] + [i.text for i in items]

    movie_page = pd.DataFrame(movies).T  # transpose
    return movie_page

df_test_ = Get_Top_List_GR('https://www.boxofficemojo.com/year/2019/?grossesOption=calendarGrosses&area=BR/')
print(df_test_.head(10))
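
One other difference from your original function: I dropped the column assignment to keep the example focused on the key bug. If you want the named columns back, you can set them on the result, reusing the list from your question:

df_test_.columns = ['URL', 'Rank', 'Title', 'Genre', 'Budget', 'Running Time', 'Gross',
                    'Theaters', 'Total_Gross', 'Release_Date', 'Distributor', 'Estimated']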
