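python - Creating a new column in a df using 2 parameters
Question
I need to create a new column based on 2 conditions: countries with a population over 50,000 (in thousands), and the recovery rate sorted in descending order.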
df1['Recovery Rate'] = df1.apply(lambda x: (x['Total Recovered']/x['Total Infected']), axis = 1)
df1['Populated Country'] = df1.apply(if lambda row: row.Country == Country and (row: row.Population 2020 (in thousands) >= 50000), axis = 1)
df1.sort_values(['Recovery Rate'], ascending = [False])
print(df1[['Populated Country','Recovery Rate']].head(10))
But the code for the new column gives me the following error.
File "<ipython-input-25-ab35558abd61>", line 4
df1['Populated Country'] = df1.apply(if lambda row: row.Country == Country and (row: row.Population 2020 (in thousands) >= 50000), axis = 1)
^
SyntaxError: invalid syntax
>Country Daily Tests Daily Tests per 100000 people Pop density per sq. km Urban Population (%) Start Date of Quarantine/Lockdown Start Date of Schools Closure Start Date of Public Place Restrictions Hospital Beds per 1000 people M-to-F Gender Ratio at Birth ... Death rate from lung diseases per 100k people for male Median Age GDP 2018 Crime Index Population 2020 (in thousands) Smokers in Population (%) % of Females in Population Total Infected Total Deaths Total Recovered
>0 Albania NaN NaN 105 63 NaN NaN NaN 2.9 1.08 ... 17.04 32.9 1.510250e+10 40.02 2877.797 28.7 49.063095 949 31 742
>1 Algeria NaN NaN 18 73 NaN NaN NaN 1.9 1.05 ... 12.81 28.1 1.737580e+11 54.41 43851.044 15.6 49.484268 7377 561 3746
>2 Argentina NaN NaN 17 93 3/20/2020 NaN NaN 5.0 1.05 ... 42.59 31.7 5.198720e+11 62.96 45195.774 21.8 51.237348 8809 393 2872
>3 Armenia 694.0 2.342029 104 63 NaN NaN NaN 4.2 1.13 ... 35.99 35.1 1.243309e+10 20.78 2963.243 24.1 52.956577 5041 64 2164
>4 Australia 31635.0 12.405939 3 86 NaN NaN 3/23/2020 3.8 1.06 ... 22.16 38.7 1.433900e+12 42.70 25499.884 14.7 50.199623 7072 100 6431
Here is the data - https://raw.githubusercontent.com/ptw2/PRGA/main/covid19_by_country.csv
This is the result I should get
> Country Recovery Rate
>17 China 0.943459
>87 Thailand 0.941972
>47 South Korea 0.906031
>32 Germany 0.875705
>95 Vietnam 0.811728
Can anyone help?
Answer
In this case it is cleaner to define a function that does the calculation, then apply that function in a lambda statement:
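def compute_rr(row):
    # Compute the rate only for countries with at least 50,000 (thousand) people.
    # Other rows fall through and return None, which pandas stores as NaN.
    if row['Population 2020 (in thousands)'] >= 50000:
        return row['Total Recovered'] / row['Total Infected']

# axis=1 passes each row to the function
df1['Recovery Rate'] = df1.apply(lambda row: compute_rr(row), axis=1)
# sort_values returns a new DataFrame, so assign the result back
df1 = df1.sort_values(['Recovery Rate'], ascending=[False])
print(df1[['Country', 'Total Recovered', 'Total Infected', 'Recovery Rate']].head())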
#Output:
Country Total Recovered Total Infected Recovery Rate
17 China 79310 84063 0.943459
87 Thailand 2857 3033 0.941972
47 South Korea 10066 11110 0.906031
32 Germany 155681 177778 0.875705
95 Vietnam 263 324 0.811728
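If you actually want to change the DataFrame to drop the countries below the population threshold (under 50,000 in the 'Population 2020 (in thousands)' column), add the following line at the end of the code above. Those countries have NaN in the 'Recovery Rate' column because compute_rr returns nothing for them, so the line removes every row whose 'Recovery Rate' is NaN.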
df1 = df1[df1['Recovery Rate'].notna()]
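As an alternative to the row-wise apply, the same result can probably be reached with column-wise operations. The following is only a sketch, not part of the original answer: the sample rows and figures are made up for illustration, and only the column names are taken from the dataset in the question. It also builds the boolean 'Populated Country' column that the question was attempting.

```python
import pandas as pd

# Hypothetical sample rows; column names mirror the dataset in the question,
# but the country labels and figures are invented for illustration.
df = pd.DataFrame({
    'Country': ['A', 'B', 'C', 'D'],
    'Population 2020 (in thousands)': [60000.0, 3000.0, 80000.0, 51000.0],
    'Total Infected': [1000, 500, 2000, 400],
    'Total Recovered': [900, 450, 1000, 300],
})

# Boolean flag: True where the population is at least 50,000 (in thousands).
df['Populated Country'] = df['Population 2020 (in thousands)'] >= 50000

# Column-wise division; where() keeps the rate for flagged rows and
# leaves NaN everywhere else.
rate = df['Total Recovered'] / df['Total Infected']
df['Recovery Rate'] = rate.where(df['Populated Country'])

# Drop the unflagged rows and sort from highest to lowest recovery rate.
result = (
    df.dropna(subset=['Recovery Rate'])
      .sort_values('Recovery Rate', ascending=False)
)
print(result[['Country', 'Recovery Rate']])
```

Working on whole columns avoids calling a Python function once per row, which tends to matter more as the DataFrame grows; for a table of roughly a hundred countries either approach should be fine.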