Creating a new column in a df using 2 parameters

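Problem description

I need to create a new column based on 2 conditions: countries with a population above 50,000, and the recovery rate in descending order.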


df1['Recovery Rate'] = df1.apply(lambda x: (x['Total Recovered']/x['Total Infected']), axis = 1)

df1['Populated Country'] = df1.apply(if lambda row: row.Country == Country and (row: row.Population 2020 (in thousands) >= 50000), axis = 1) 

df1.sort_values(['Recovery Rate'], ascending = [False])

print(df1[['Populated Country','Recovery Rate']].head(10))

However, the line that creates the new column gives me the following error.


File "<ipython-input-25-ab35558abd61>", line 4
df1['Populated Country'] = df1.apply(if lambda row: row.Country == Country and (row: row.Population 2020 (in thousands) >= 50000), axis = 1)
                                         ^
SyntaxError: invalid syntax
>Country    Daily Tests Daily Tests per 100000 people   Pop density per sq. km  Urban Population (%)    Start Date of Quarantine/Lockdown   Start Date of Schools Closure   Start Date of Public Place Restrictions Hospital Beds per 1000 people   M-to-F Gender Ratio at Birth    ... Death rate from lung diseases per 100k people for male  Median Age  GDP 2018    Crime Index Population 2020 (in thousands)  Smokers in Population (%)   % of Females in Population  Total Infected  Total Deaths    Total Recovered
>0  Albania NaN NaN 105 63  NaN NaN NaN 2.9 1.08    ... 17.04   32.9    1.510250e+10    40.02   2877.797    28.7    49.063095   949 31  742
>1  Algeria NaN NaN 18  73  NaN NaN NaN 1.9 1.05    ... 12.81   28.1    1.737580e+11    54.41   43851.044   15.6    49.484268   7377    561 3746
>2  Argentina   NaN NaN 17  93  3/20/2020   NaN NaN 5.0 1.05    ... 42.59   31.7    5.198720e+11    62.96   45195.774   21.8    51.237348   8809    393 2872
>3  Armenia 694.0   2.342029    104 63  NaN NaN NaN 4.2 1.13    ... 35.99   35.1    1.243309e+10    20.78   2963.243    24.1    52.956577   5041    64  2164
>4  Australia   31635.0 12.405939   3   86  NaN NaN 3/23/2020   3.8 1.06    ... 22.16   38.7    1.433900e+12    42.70   25499.884   14.7    50.199623   7072    100 6431

Here is the data - https://raw.githubusercontent.com/ptw2/PRGA/main/covid19_by_country.csv

This is the result I should get:

>         Country  Recovery Rate
>17         China       0.943459
>87      Thailand       0.941972
>47   South Korea       0.906031
>32       Germany       0.875705
>95       Vietnam       0.811728

Can anyone help?

Tags: python, pandas, dataframe, sorting, lambda

Solution


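In this case it is cleaner to define a function that does the calculation and then apply that function to each row:

def compute_rr(row):
    # The population column is in thousands, so compare against 50000 directly
    if row['Population 2020 (in thousands)'] >= 50000:
        return row['Total Recovered'] / row['Total Infected']
    # Rows below the threshold fall through and return None, which ends up as NaN

# The function can be passed to apply directly; no lambda wrapper is needed
df1['Recovery Rate'] = df1.apply(compute_rr, axis=1)
# Assign the result back: sort_values returns a new DataFrame (NaN rows go last)
df1 = df1.sort_values('Recovery Rate', ascending=False)

print(df1[['Country', 'Total Recovered', 'Total Infected', 'Recovery Rate']].head())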

#Output:
        Country  Total Recovered  Total Infected  Recovery Rate
17        China            79310           84063       0.943459
87     Thailand             2857            3033       0.941972
47  South Korea            10066           11110       0.906031
32      Germany           155681          177778       0.875705
95      Vietnam              263             324       0.811728
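
As an alternative, here is a minimal sketch of the same calculation without apply, using a boolean mask and Series.where. The small DataFrame is a stand-in for df1, not the real file. The infection and recovery counts are copied from the rows shown in this thread. The population figures for China and Thailand are rough placeholders assumed for illustration only.

```python
import pandas as pd

# Toy stand-in for df1. Albania and Algeria rows come from the head() output
# in the question; the China/Thailand populations are assumed placeholders.
df1 = pd.DataFrame({
    "Country": ["Albania", "Algeria", "China", "Thailand"],
    "Population 2020 (in thousands)": [2877.797, 43851.044, 1_400_000.0, 70_000.0],
    "Total Infected": [949, 7377, 84063, 3033],
    "Total Recovered": [742, 3746, 79310, 2857],
})

# Boolean mask: True for countries at or above the threshold
is_populated = df1["Population 2020 (in thousands)"] >= 50000

# Compute the rate for every row, then keep it only where the mask is True;
# Series.where replaces the other rows with NaN
rate = df1["Total Recovered"] / df1["Total Infected"]
df1["Recovery Rate"] = rate.where(is_populated)

# NaN rows are placed last by default
df1 = df1.sort_values("Recovery Rate", ascending=False)
print(df1[["Country", "Recovery Rate"]])
```

On a large frame this usually runs faster than a row-wise apply, because the division and the comparison operate on whole columns at once.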

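If you actually want to change the DataFrame so that it drops the countries below the population threshold (a value under 50000 in 'Population 2020 (in thousands)'), add the following line to the end of the code above. It removes every row whose 'Recovery Rate' column contains NaN.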

df1 = df1[df1['Recovery Rate'].notna()]
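
If only the filtered table is needed, another option is to filter on population first and compute the rate afterwards, so that no NaN rows are created at all. This is a rough sketch on made-up sample data: the name `populated` is hypothetical, the counts are taken from the rows shown in this thread, and the China and Thailand population values are assumed placeholders.

```python
import pandas as pd

# Toy stand-in for df1; China/Thailand populations are assumed placeholders
df1 = pd.DataFrame({
    "Country": ["Albania", "Algeria", "China", "Thailand"],
    "Population 2020 (in thousands)": [2877.797, 43851.044, 1_400_000.0, 70_000.0],
    "Total Infected": [949, 7377, 84063, 3033],
    "Total Recovered": [742, 3746, 79310, 2857],
})

# Keep only rows at or above the threshold; .copy() avoids modifying a view of df1
populated = df1[df1["Population 2020 (in thousands)"] >= 50000].copy()

# Every remaining row qualifies, so the rate can be computed for all of them
populated["Recovery Rate"] = populated["Total Recovered"] / populated["Total Infected"]

# Highest recovery rate first
populated = populated.sort_values("Recovery Rate", ascending=False)
print(populated[["Country", "Recovery Rate"]])
```

This leaves df1 untouched, which can be convenient if the full list of countries is still needed later.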
