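python-3.x - Web scraping in Python with BeautifulSoup - how do I transpose the results?
Question
I built the code below and am stuck on how to transpose the results. This is the result I am actually looking for: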
# Column headers: 'company name', 'Work/Life Balance', 'Salary/Benefits', 'Job Security/Advancement', 'Management', 'Culture'
# Row 1: 3M, 3.8, 3.9, 3.5, 3.6, 3.8
# Row 2: Google, . . .
This is what currently happens:
# Column headers: 'Name', 'Rating', 'Category'
# Row 1: 3M, 3.8, Work/Life Balance
# Row 2: 3M, 3.9, Salary/Benefits
# and so on . . .
My code so far:
import requests
import pandas as pd
from bs4 import BeautifulSoup

number = []
category = []
name = []
company = ['3M', 'Google']
for company_name in company:
    try:
        url = 'https://ca.indeed.com/cmp/' + company_name
        page = requests.get(url)
        soup = BeautifulSoup(page.content, 'html.parser')
        rating = soup.find(class_='cmp-ReviewAndRatingsStory-rating')
        rating = rating.find('tbody')
        rows = rating.find_all('tr')
    except Exception:
        # Skip this company; otherwise rows would be undefined or left over
        # from the previous company
        continue
    for row in rows:
        cells = row.find_all('td')
        number.append(str(cells[0].text))    # first cell: rating value
        category.append(str(cells[2].text))  # third cell: category label
        name.append(company_name)

cols = {'Name': name, 'Rating': number, 'Category': category}
df = pd.DataFrame(cols)
print(df)
What the code produces:
Name Rating Category
0 3M 3.8 Work/Life Balance
1 3M 3.9 Salary/Benefits
2 3M 3.5 Job Security/Advancement
3 3M 3.6 Management
4 3M 3.8 Culture
5 Google 4.2 Work/Life Balance
6 Google 4.0 Salary/Benefits
7 Google 3.6 Job Security/Advancement
8 Google 3.9 Management
9 Google 4.2 Culture
10 Apple 3.8 Work/Life Balance
11 Apple 4.1 Salary/Benefits
12 Apple 3.7 Job Security/Advancement
13 Apple 3.7 Management
14 Apple 4.1 Culture
The result can be reproduced with the following code:
import pandas as pd
name = ['3M','3M','3M','3M','3M','Google','Google','Google','Google','Google','Apple','Apple','Apple','Apple','Apple']
number = ['3.8','3.9','3.5','3.6','3.8','4.2','4.0','3.6','3.9','4.2','3.8','4.1','3.7','3.7','4.1']
category = ['Work/Life Balance',' Salary/Benefits','Job Security/Advancement','Management','Culture','Work/Life Balance',' Salary/Benefits','Job Security/Advancement','Management','Culture','Work/Life Balance',' Salary/Benefits','Job Security/Advancement','Management','Culture']
cols = {'Name':name,'Rating':number,'Category':category}
df = pd.DataFrame(cols)
print(df)
Answer
Here is one possible approach.
import pandas as pd
name = ['3M','3M','3M','3M','3M','Google','Google','Google','Google','Google','Apple','Apple','Apple','Apple','Apple']
number = ['3.8','3.9','3.5','3.6','3.8','4.2','4.0','3.6','3.9','4.2','3.8','4.1','3.7','3.7','4.1']
category = ['Work/Life Balance',' Salary/Benefits','Job Security/Advancement','Management','Culture','Work/Life Balance',' Salary/Benefits','Job Security/Advancement','Management','Culture','Work/Life Balance',' Salary/Benefits','Job Security/Advancement','Management','Culture']
cols = {'Name':name,'Rating':number,'Category':category}
df = pd.DataFrame(cols)
print(df)
from collections import defaultdict
aggregated_data = defaultdict(dict)
for idx, row in df.iterrows():
    aggregated_data[row.Name][row.Category] = row.Rating
result = pd.DataFrame(aggregated_data).T
print(result)
Result:
        Salary/Benefits  Culture  Job Security/Advancement  Management  Work/Life Balance
3M                  3.9      3.8                       3.5         3.6                3.8
Google              4.0      4.2                       3.6         3.9                4.2
Apple               4.1      4.1                       3.7         3.7                3.8
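I don't think this is the "idiomatic" approach. Because it uses native Python data types and an explicit loop, it is likely to be considerably slower than a pure pandas solution. If your data is not very large, though, that may be fine.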
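For comparison, here is a minimal sketch of what a pure pandas version might look like, using DataFrame.pivot. It is not part of the approach above. The data is shortened to two companies and two categories for illustration, and the variable name wide, the whitespace stripping, and the reindex step are assumptions added here.

```python
import pandas as pd

# Shortened sample in the same long format as the question's DataFrame
name = ['3M', '3M', 'Google', 'Google']
number = ['3.8', '3.9', '4.2', '4.0']
category = ['Work/Life Balance', ' Salary/Benefits',
            'Work/Life Balance', ' Salary/Benefits']
df = pd.DataFrame({'Name': name, 'Rating': number, 'Category': category})

# Strip stray whitespace from the category labels
df['Category'] = df['Category'].str.strip()

# One row per company, one column per category
wide = df.pivot(index='Name', columns='Category', values='Rating')

# pivot sorts both axes; reindex restores the order of first appearance
wide = wide.reindex(index=df['Name'].unique(), columns=df['Category'].unique())

# Turn the index into a regular 'company name' column
wide = wide.rename_axis(index='company name', columns=None).reset_index()
print(wide)
```

Note that pivot raises an error if the same company and category pair appears more than once, so duplicates would need to be dropped or aggregated first.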
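Edit: I think the transpose in the last step is what leaves the columns in a surprising order, so here is a way to build the final DataFrame from a list of dicts instead.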
from collections import defaultdict
data_by_name = defaultdict(dict)
for idx, row in df.iterrows():
    data_by_name[row.Name][row.Category] = row.Rating
aggregated_rows = [{"company name": name, **ratings} for name, ratings in data_by_name.items()]
result = pd.DataFrame(aggregated_rows)
print(result)
Result:
   company name  Work/Life Balance  Salary/Benefits  Job Security/Advancement  Management  Culture
0            3M                3.8              3.9                       3.5         3.6      3.8
1        Google                4.2              4.0                       3.6         3.9      4.2
2         Apple                3.8              4.1                       3.7         3.7      4.1
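The same list-of-dicts idea could also be applied while scraping, so that the long-format DataFrame is never built. The sketch below assumes the (rating, category) pairs have already been extracted from each company's table. The helper ratings_to_row and the scraped dictionary are hypothetical stand-ins, and converting the ratings to float is an added assumption.

```python
import pandas as pd

def ratings_to_row(company_name, pairs):
    """Build one wide row from (rating, category) pairs for one company."""
    row = {'company name': company_name}
    for rating, category in pairs:
        # Strip stray whitespace from the label and store the rating as a number
        row[category.strip()] = float(rating)
    return row

# Hypothetical stand-in for the values scraped from each rating table
scraped = {
    '3M': [('3.8', 'Work/Life Balance'), ('3.9', ' Salary/Benefits')],
    'Google': [('4.2', 'Work/Life Balance'), ('4.0', ' Salary/Benefits')],
}

rows = [ratings_to_row(company, pairs) for company, pairs in scraped.items()]
result = pd.DataFrame(rows)
print(result)
```

Because each dict lists 'company name' first and the categories in scraped order, the columns come out in that order without a transpose.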