python-3.x - Scrape data from multiple pages, then append it to a CSV file
Question
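I'm using Beautiful Soup for web scraping to retrieve job listings from Indeed. My code works, but when it loops to the next page it overwrites the existing CSV file. I've seen in other posts that I need to use pandas concat, but I can't seem to get it working or fit it into my source code. Any suggestions for improving my code would also be greatly appreciated.
The code below scrapes pages 1-2.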
from bs4 import BeautifulSoup
import requests, pandas as pd
from urllib.parse import urljoin

print('Getting new jobs...')

main_url = 'https://www.indeed.com/jobs?q=web+developer&l=Sacramento,+CA&sort=date'
start_from = '&start='

for page in range(1, 3):
    page = (page - 1) * 10
    url = "%s%s%d" % (main_url, start_from, page)  # get full url
    indeed = requests.get(url)
    indeed.raise_for_status()
    soup = BeautifulSoup(indeed.text, 'html.parser')

    home = 'https://www.indeed.com/viewjob?'
    jobsTitle, companiesName, citiesName, jobsSummary, jobsLink = [], [], [], [], []
    target = soup.find_all('div', class_=' row result')

    for div in target:
        if div:
            title = div.find('a', class_='turnstileLink').text.strip()
            jobsTitle.append(title)
            company = div.find('span', class_='company').text.strip()
            companiesName.append(company)
            city = div.find('span', class_='location').text.strip()
            citiesName.append(city)
            summary = div.find('span', class_='summary').text.strip()
            jobsSummary.append(summary)
            job_link = urljoin(home, div.find('a').get('href'))
            jobsLink.append(job_link)

    target2 = soup.find_all('div', class_='lastRow row result')
    for i in target2:
        title2 = i.find('a', class_='turnstileLink').text.strip()
        jobsTitle.append(title2)
        company2 = i.find('span', class_='company').text.strip()
        companiesName.append(company2)
        city2 = i.find('span', class_='location').text.strip()
        citiesName.append(city2)
        summary2 = i.find('span', class_='summary').text.strip()
        jobsSummary.append(summary2)
        jobLink2 = urljoin(home, i.find('a').get('href'))
        jobsLink.append(jobLink2)

    data_record = []
    for title, company, city, summary, link in zip(jobsTitle, companiesName, citiesName, jobsSummary, jobsLink):
        data_record.append({'Job Title': title, 'Company': company, 'City': city, 'Summary': summary, 'Job Link': link})

    df = pd.DataFrame(data_record, columns=['Job Title', 'Company', 'City', 'Summary', 'Job Link'])
df
Answer
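You can create the list data_record outside the loops, so that every page appends to the same list, and call the DataFrame constructor once after all pages have been processed: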
data_record = []
for page in range(1, 3):
    page = (page - 1) * 10
    url = "%s%s%d" % (main_url, start_from, page)  # get full url
    indeed = requests.get(url)
    indeed.raise_for_status()
    soup = BeautifulSoup(indeed.text, 'html.parser')

    ...

    for title, company, city, summary, link in zip(jobsTitle, companiesName, citiesName, jobsSummary, jobsLink):
        data_record.append({'Job Title': title, 'Company': company, 'City': city, 'Summary': summary, 'Job Link': link})

df = pd.DataFrame(data_record, columns=['Job Title', 'Company', 'City', 'Summary', 'Job Link'])
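To try the pattern without hitting the network, here is a minimal sketch. The fake_pages list and the parse_page helper are hypothetical stand-ins for the requests/BeautifulSoup step, and the sample rows are made up. Only the placement of data_record and the single pd.DataFrame call mirror the snippet above.

```python
import pandas as pd

# Hypothetical stand-in for two scraped result pages (made-up data).
fake_pages = [
    [("Web Developer", "Acme", "Sacramento, CA")],
    [("Front End Developer", "Globex", "Sacramento, CA"),
     ("PHP Developer", "Initech", "Roseville, CA")],
]

def parse_page(rows):
    """Hypothetical helper: turn one page's tuples into record dicts."""
    return [{'Job Title': t, 'Company': c, 'City': city} for t, c, city in rows]

data_record = []                          # created once, before the page loop
for rows in fake_pages:
    data_record.extend(parse_page(rows))  # every page adds to the same list

# Build the DataFrame once, after all pages have been collected.
df = pd.DataFrame(data_record, columns=['Job Title', 'Company', 'City'])

# With no path argument, to_csv returns the CSV text instead of writing a file.
csv_text = df.to_csv(index=False)
```

To write a file instead, pass a path, for example df.to_csv('jobs.csv', index=False), where the file name is only an assumption. Because the call happens once after the loop, nothing is overwritten between pages.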
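A possible alternative with concat: build one DataFrame per page, collect them in a list, and concatenate them after the loop: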
dfs = []
for page in range(1, 3):
    page = (page - 1) * 10
    url = "%s%s%d" % (main_url, start_from, page)  # get full url
    indeed = requests.get(url)
    indeed.raise_for_status()
    soup = BeautifulSoup(indeed.text, 'html.parser')

    ...

    data_record = []
    for title, company, city, summary, link in zip(jobsTitle, companiesName, citiesName, jobsSummary, jobsLink):
        data_record.append({'Job Title': title, 'Company': company, 'City': city, 'Summary': summary, 'Job Link': link})

    df = pd.DataFrame(data_record, columns=['Job Title', 'Company', 'City', 'Summary', 'Job Link'])
    dfs.append(df)

df_fin = pd.concat(dfs, ignore_index=True)
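If the goal is literally to append each page to the CSV file as it is scraped, rather than building one DataFrame at the end, the file can be opened in append mode. The following is only a sketch using Python's standard csv module, not part of the answer above. The append_rows helper, the temporary file path, and the sample records are all hypothetical. The header row is written only when the file is new or empty, so repeated calls do not duplicate it.

```python
import csv
import os
import tempfile

FIELDS = ['Job Title', 'Company', 'City', 'Summary', 'Job Link']

def append_rows(path, rows):
    """Hypothetical helper: append record dicts to a CSV file.

    The header is written only if the file does not exist yet or is empty.
    """
    need_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, 'a', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if need_header:
            writer.writeheader()
        writer.writerows(rows)

# Made-up sample records standing in for two scraped pages.
page_1 = [{'Job Title': 'Web Developer', 'Company': 'Acme',
           'City': 'Sacramento, CA', 'Summary': 'Build sites',
           'Job Link': 'https://example.com/1'}]
page_2 = [{'Job Title': 'Front End Developer', 'Company': 'Globex',
           'City': 'Sacramento, CA', 'Summary': 'Build UIs',
           'Job Link': 'https://example.com/2'}]

# A temporary directory keeps the sketch from touching real files.
csv_path = os.path.join(tempfile.mkdtemp(), 'jobs.csv')
for rows in (page_1, page_2):
    append_rows(csv_path, rows)   # each page is added below the previous one
```

In the scraper, append_rows would be called once per page with that page's data_record. Note that rerunning the script against the same file keeps adding rows, so delete the file first if a fresh export is wanted.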