
Problem description

I am trying to build a web scraper.

The target site is the job-search site Indeed.

The search results span about 16 pages.

I want to scrape all of the pages.

But my code only scrapes one page.

How can I fix this?


 import requests
 from bs4 import BeautifulSoup

 LIMIT = 50

 URL = f"https://kr.indeed.com/jobs?q=python&limit={LIMIT}&radius=25"

 def get_last_page():
   # Read the pagination bar on the first results page to find the last page number
   result = requests.get(URL)
   soup = BeautifulSoup(result.text, "html.parser")
   pagination = soup.find("div", {"class": "pagination"})
   links = pagination.find_all('a')
   pages = []
   for link in links[:-1]:  # the last anchor is the "Next" arrow, not a page number
     pages.append(int(link.string))
   max_page = pages[-1]
   return max_page

 def extract_job(html):
   # Pull the title, company, location and link out of a single job card
   title = html.find("h2", {"class": "title"}).find("a")["title"]
   company = html.find("span", {"class": "company"})
   company_anchor = company.find("a")
   if company_anchor is not None:
     company = str(company_anchor.string)
   else:
     company = str(company.string)
   company = company.strip()
   location = html.find("span", {"class": "location"}).string
   job_id = html.find("h2", {"class": "title"}).find("a")["href"]
   return {'title': title, 'company': company, 'location': location, "link": f"https://kr.indeed.com{job_id}"}

 def extract_jobs(last_page):
   jobs = []
   for page in range(last_page):
     print(f"Scraping page {page}")
     result = requests.get(f"{URL}&start={page*LIMIT}")
     soup = BeautifulSoup(result.text, "html.parser")
     results = soup.find_all("div", {"class": "jobsearch-SerpJobCard"})
     for result in results:
       job = extract_job(result)
       jobs.append(job)
     return jobs

 def get_jobs():
   last_page = get_last_page()
   jobs = extract_jobs(last_page)
   return jobs

Tags: python, web-scraping

Solution

Once you move from the first results page to any other page, "&start=" appears in the URL, so we can feed this parameter from a for loop and request every page, like this:

URL = "https://kr.indeed.com/jobs?q=python&limit=50&radius=25&start="
def get_page(page_number):
    res = requests.get(f'{URL}{(page_number-1)*50}')
    soup = BeautifulSoup(res.content, 'lxml')
    return soup.find('div', attrs={"id": 'searchCountPages'})  # the "Page X of Y" element, which shows which page was actually fetched
number_of_pages = 16
for i in range(1, number_of_pages + 1):
    print(get_page(i))

By printing each response, we can see that the loop walks through all of the pages one by one.
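Putting the pieces together, the same "&start=" arithmetic can drive the whole scrape, accumulating the job cards from every page before returning. Below is a minimal sketch; it assumes the selectors from the question's code (jobsearch-SerpJobCard and the title anchor) still match Indeed's markup, and scrape_all_pages is a hypothetical helper name, not part of the original answer:

import requests
from bs4 import BeautifulSoup

LIMIT = 50
URL = f"https://kr.indeed.com/jobs?q=python&limit={LIMIT}&radius=25&start="

def scrape_all_pages(number_of_pages=16):
    jobs = []
    for page_number in range(1, number_of_pages + 1):
        # "&start=" is 0 for page 1, 50 for page 2, and so on
        res = requests.get(f"{URL}{(page_number - 1) * LIMIT}")
        soup = BeautifulSoup(res.content, "lxml")
        # Selector taken from the question's code; it may break if Indeed changes its markup
        for card in soup.find_all("div", {"class": "jobsearch-SerpJobCard"}):
            anchor = card.find("h2", {"class": "title"}).find("a")
            jobs.append({
                "title": anchor["title"],
                "link": f"https://kr.indeed.com{anchor['href']}",
            })
    return jobs  # returned only after every page has been fetched

jobs = scrape_all_pages()
print(f"Collected {len(jobs)} jobs")

Because jobs is returned only after the loop finishes, the result contains listings from all 16 pages instead of just the first.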

