Scraping a website with a particular format using Python

Question

I am trying to use Python to scrape the US News Ranking for universities, and I'm struggling. I normally use Python "requests" and "BeautifulSoup".

The data is here:

https://www.usnews.com/education/best-global-universities/rankings

Using right click and inspect shows a bunch of links, and I don't even know which one to pick. I followed an example I found on the web, but it just gives me empty data:

import requests
import urllib.request
import time
from bs4 import BeautifulSoup
import pandas as pd
import math
from lxml.html import parse
from io import StringIO


url = 'https://www.usnews.com/education/best-global-universities/rankings'
urltmplt = 'https://www.usnews.com/education/best-global-universities/rankings?page=2'

css = '#resultsMain :nth-child(1)'
npage = 20

urlst = [url] + [urltmplt + str(r) for r in range(2,npage+1)]

def scrapevec(url, css):
    doc = parse(StringIO(url)).getroot()
    return([link.text_content() for link in doc.cssselect(css)])

usng = []
for u in urlst:
    print(u)
    ts = [re.sub("\n *"," ", t) for t in scrapevec(u,css) if t != ""]

This doesn't work: ts ends up as an empty list.

I'd really appreciate any help.

Tags: python, web-scraping

Solution


The MWE you posted does not work as-is: re is used but never imported, and parse(StringIO(url)) parses the URL string itself rather than fetching the page it points to. I strongly suggest you look at basic scraping tutorials (with Python, Java, etc.): there are plenty of them, and they are generally a good starting point.
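For reference, if you wanted to stay with lxml instead of switching to BeautifulSoup, the page has to be downloaded first and only then parsed. A minimal sketch of that split (the XPath expression here is illustrative, based on the resultsMain container used below, not a verified selector for the live site):

```python
from lxml.html import fromstring

def univ_names(html_text):
    """Pull the anchor texts out of the #resultsMain container of downloaded HTML."""
    doc = fromstring(html_text)
    return [a.text_content() for a in doc.xpath('//div[@id="resultsMain"]//a[@href]')]

# usage: the HTML must be downloaded first, e.g.
#   resp = requests.get(url, headers={'User-Agent': '...'})
#   names = univ_names(resp.text)
```

The key point is that the fetch (requests) and the parse (lxml) are separate steps; passing the URL string straight into a parser only parses those few characters.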

Below is a snippet of code that prints the universities' names listed on page 1; you'll be able to extend it to all 150 pages with a for loop.

import requests
from bs4 import BeautifulSoup

newheaders = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux i686 on x86_64)'
}

baseurl = 'https://www.usnews.com/education/best-global-universities/rankings'

page1 = requests.get(baseurl, headers = newheaders) # change headers or get blocked 
soup = BeautifulSoup(page1.text, 'lxml')
res_tab = soup.find('div', {'id' : 'resultsMain'}) # find the results' table

for a,univ in enumerate(res_tab.findAll('a', href = True)): # parse universities' names
    if a < 10: # there are 10 listed universities per page
        print(univ.text)

Edit: now the example works, but as you say in your question, your code only returns empty lists. Below is an edited version of the code that returns a list of all universities (pp. 1-150):

import requests 
from bs4 import BeautifulSoup

def parse_univ(url):
    newheaders = {
        'User-Agent': 'Mozilla/5.0 (X11; Linux i686 on x86_64)'
    }
    page1 = requests.get(url, headers = newheaders) # change headers or get blocked 
    soup = BeautifulSoup(page1.text, 'lxml')
    res_tab = soup.find('div', {'id' : 'resultsMain'}) # find the results' table
    res = []
    for a,univ in enumerate(res_tab.findAll('a', href = True)): # parse universities' names
        if a < 10: # there are 10 listed universities per page
            res.append(univ.text)
    return res

baseurl = 'https://www.usnews.com/education/best-global-universities/rankings?page='

ll = [parse_univ(baseurl + str(p)) for p in range(1, 151)] # this is a list of lists

univs = [item for sublist in ll for item in sublist] # unfold the list of lists
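As an aside, the flattening step can also be written with itertools.chain.from_iterable, which avoids the nested comprehension (the sample data here is a stand-in, not actual scraped output):

```python
from itertools import chain

ll = [['Harvard', 'MIT'], ['Stanford']]  # stand-in for the scraped list of lists
univs = list(chain.from_iterable(ll))    # same result as the nested comprehension
```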

Re-edit following QHarr's suggestion (thanks!): same output, but a shorter and more "pythonic" solution:

import requests 
from bs4 import BeautifulSoup

def parse_univ(url):
    newheaders = {
        'User-Agent': 'Mozilla/5.0 (X11; Linux i686 on x86_64)'
    }
    page1 = requests.get(url, headers = newheaders) # change headers or get blocked 
    soup = BeautifulSoup(page1.text, 'lxml')
    res_tab = soup.find('div', {'id' : 'resultsMain'}) # find the results' table
    return [univ.text for univ in res_tab.select('[href]', limit=10)]

baseurl = 'https://www.usnews.com/education/best-global-universities/rankings?page='

ll = [parse_univ(baseurl + str(p)) for p in range(1, 151)] # this is a list of lists

univs = [item for sublist in ll for item in sublist]
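One practical caveat for a 150-request loop like the one above: it is worth pausing between pages and tolerating pages that come back blocked or without the resultsMain container, instead of crashing mid-run. A hedged sketch of that idea (the resultsMain id comes from the answer above; the helper names, timeout, and pause value are illustrative choices, not part of the original solution):

```python
import time
import requests
from bs4 import BeautifulSoup

def extract_univs(html_text):
    # pure parsing step: safe to call on any downloaded HTML
    soup = BeautifulSoup(html_text, 'html.parser')
    res_tab = soup.find('div', {'id': 'resultsMain'})
    if res_tab is None:  # page was served without the results table
        return []
    return [a.get_text() for a in res_tab.select('a[href]', limit=10)]

def crawl(baseurl, pages, pause=1.0):
    session = requests.Session()  # reuse the connection across pages
    session.headers.update({'User-Agent': 'Mozilla/5.0 (X11; Linux i686 on x86_64)'})
    univs = []
    for p in range(1, pages + 1):
        resp = session.get(baseurl + str(p), timeout=10)
        if resp.ok:
            univs.extend(extract_univs(resp.text))
        time.sleep(pause)  # be polite between consecutive requests
    return univs
```

Splitting the fetch from the parse also makes the parsing logic easy to test against saved HTML.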
