Unable to scrape all data using Beautiful Soup

Problem description

import requests
from bs4 import BeautifulSoup

URL = r"https://www.vault.com/best-companies-to-work-for/law/top-100-law-firms-rankings/year/"
My_list = ['2007','2008','2009','2010']

Year = []
CompanyName = []
Rank = []
Score = []

for I, Page in enumerate(My_list, start=1):
    url = r'https://www.vault.com/best-companies-to-work-for/law/top-100-law-firms-rankings/year/{}'.format(Page)
    print(url)

    Res = requests.get(url)
    soup = BeautifulSoup(Res.content , 'html.parser')
    data = soup.find('div' ,{'id':'main-content'})
    for Data in data:
        Title = data.findAll('h3')
        for title in Title:
            CompanyName.append(title.text.strip())


        Rank = data.findAll('div' ,{'class':'rank RankNumber'})
        for rank in Rank:
            Rank.append(rank)


        Score = data.findAll('div' ,{'class':'rank RankNumber'})
        for score in Score:
            Score.append(score)

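I can't get all of the data for the title, rank, and score. I'm not sure whether I have identified the right tags. I also can't extract the values from the Rank list.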

Tags: python-3.x, beautifulsoup

Solution


This should get you started. First find all div.RankItem elements, then find the title, rank, and score inside each one.

from bs4 import BeautifulSoup
import requests

# Fetch the 2010 rankings page and parse it.
resp = requests.get('https://www.vault.com/best-companies-to-work-for/law/top-100-law-firms-rankings/year/2010')
soup = BeautifulSoup(resp.content, 'html.parser')

# Each ranked firm sits in its own div.RankItem; read the fields from inside it.
for i, item in enumerate(soup.find_all("div", {"class": "RankItem"}), start=1):
    title = item.find("h3", {"class": "MainLink"}).get_text().strip()
    rank = item.find("div", {"class": "RankNumber"}).get_text().strip()
    score = item.find("div", {"class": "score"}).get_text().strip()
    print(i, title, rank, score)
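
To cover every year from the question rather than only 2010, the same per-item lookup could be wrapped in a function and called once per page. The following is a minimal sketch, not a tested scraper for the live site: the helper names (text_or_none, parse_rankings, scrape_years) are made up for illustration, and it assumes each year's page uses the same RankItem, MainLink, RankNumber, and score classes as the snippet above. A missing element is stored as None instead of raising an error.

```python
from bs4 import BeautifulSoup
import requests

BASE_URL = "https://www.vault.com/best-companies-to-work-for/law/top-100-law-firms-rankings/year/{}"


def text_or_none(parent, name, css_class):
    """Return the stripped text of the first matching child, or None if absent."""
    node = parent.find(name, {"class": css_class})
    return node.get_text().strip() if node is not None else None


def parse_rankings(html, year):
    """Build one record per div.RankItem found in the page's HTML.

    Assumption: the class names below match the page markup.
    """
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for item in soup.find_all("div", {"class": "RankItem"}):
        records.append({
            "year": year,
            "company": text_or_none(item, "h3", "MainLink"),
            "rank": text_or_none(item, "div", "RankNumber"),
            "score": text_or_none(item, "div", "score"),
        })
    return records


def scrape_years(years):
    """Fetch each year's page and collect all records in a single list."""
    all_records = []
    for year in years:
        resp = requests.get(BASE_URL.format(year), timeout=30)
        resp.raise_for_status()  # stop on HTTP errors instead of parsing an error page
        all_records.extend(parse_rankings(resp.content, year))
    return all_records
```

Calling scrape_years(['2007', '2008', '2009', '2010']) would then return a list of dictionaries, one per firm per year. Keeping the fields of one firm together in one record avoids the four parallel lists in the question, which fall out of step as soon as one item is missing a field.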
