Looping over pages with BeautifulSoup

Problem description

I want to scrape the player profile URLs from every page of this site, https://www.transfermarkt.it/detailsuche/spielerdetail/suche/27564780, but I only get results from the first page. Why? I wrote a loop with range():

import pandas as pd
import requests
from bs4 import BeautifulSoup


list_url=[]
def get_player_urls(page):
    headers = {
        "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:87.0) Gecko/20100101 Firefox/87.0"
    }
    link = 'https://www.transfermarkt.it/detailsuche/spielerdetail/suche/27564780/page/{page}'
    content = requests.get(link, headers=headers)
    soup = BeautifulSoup(content.text, 'html.parser')
    for urls in soup.find_all('a', class_='spielprofil_tooltip'):
        url = 'https://www.transfermarkt.it' + urls.get('href')
    
        print(url)
        list_url.append(url)
        
    return

for page in range(1,11,1):
    get_player_urls(page)

df_url = pd.DataFrame(list_url)
df_url.to_csv('df_url.csv', index=False, header=False)


Tags: python, web-scraping, beautifulsoup

Solution


You never actually insert the page number into the URL, so every request fetches the same first page. Also, the bare return at the end of the function is unnecessary, since you aren't returning anything:

import pandas as pd
import requests
from bs4 import BeautifulSoup


list_url=[]
def get_player_urls(page):
    headers = {
        "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:87.0) Gecko/20100101 Firefox/87.0"
    }
    link = 'https://www.transfermarkt.it/detailsuche/spielerdetail/suche/27564780/page/{page}'.format(page=page)   #<-- Add this
    content = requests.get(link, headers=headers)
    soup = BeautifulSoup(content.text, 'html.parser')
    for urls in soup.find_all('a', class_='spielprofil_tooltip'):
        url = 'https://www.transfermarkt.it' + urls.get('href')
    
        print(url)
        list_url.append(url)

for page in range(1,11,1):
    get_player_urls(page)

df_url = pd.DataFrame(list_url)
df_url.to_csv('df_url.csv', index=False, header=False)
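
As a side note, a slightly tidier variant (just a sketch, not tested against the live site) builds the URL with an f-string and has the function return its results instead of appending to a global list:

import pandas as pd
import requests
from bs4 import BeautifulSoup

HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:87.0) Gecko/20100101 Firefox/87.0"
}
BASE = 'https://www.transfermarkt.it'

def get_player_urls(page):
    # Build the paginated URL with an f-string (equivalent to .format above)
    link = f'{BASE}/detailsuche/spielerdetail/suche/27564780/page/{page}'
    response = requests.get(link, headers=HEADERS)
    soup = BeautifulSoup(response.text, 'html.parser')
    # Collect and return every player-profile link found on this page
    return [BASE + a.get('href')
            for a in soup.find_all('a', class_='spielprofil_tooltip')]

list_url = []
for page in range(1, 11):
    list_url.extend(get_player_urls(page))

pd.DataFrame(list_url).to_csv('df_url.csv', index=False, header=False)

Returning the list keeps the function self-contained, which makes it easier to test a single page in isolation.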
