Filtering a DataFrame using a list of strings

Problem description

I have the following DataFrame, called pop_df:

pop_df
Out[33]: 
            2014       2015       2016       2017       2018       2019
Geo                                                                    
AL     2892394.0  2885796.0  2875592.0  2876591.0  2870324.0  2862427.0
AL0    2892394.0  2885796.0  2875592.0  2876591.0  2870324.0  2862427.0
AL01    844921.0   836448.0   830981.0   826904.0   819793.0   813758.0
AL011   134332.0   131054.0   129056.0   125579.0   120978.0   118948.0
AL012   276058.0   277989.0   280205.0   284823.0   289626.0   290126.0
         ...        ...        ...        ...        ...        ...
UKN12   142028.0   142756.0   143363.0   143746.0   144105.0   144367.0
UKN13   139774.0   140222.0   140752.0   141368.0   141994.0   142565.0
UKN14   137722.0   139426.0   140691.0   141917.0   143286.0   144771.0
UKN15   136332.0   136904.0   137492.0   138000.0   138441.0   138948.0
UKN16   114696.0   115171.0   115581.0   116057.0   116612.0   117051.0

[2034 rows x 6 columns]

I also have a list of European Union country codes:

EuropeanUnion=["BE","BG","CZ","DK","DE","EE","IE","EL","ES","FR","HR","IT","CY","LV","LT","LU","HU","MT","NL","AT","PL","PT","RO","SI","SK","FI","SE"]

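I want to filter the DataFrame so that it keeps only the rows whose index label starts with one of the country codes in the list. I tried pop_df.index.str.startswith, but I couldn't get it to work. I'd appreciate any help you can provide. Thanks in advance!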

Tags: python, pandas

Solution


It looks like Geo is the index, so you can do this:

result = pop_df[pop_df.index.str.match(rf'\b{"|".join(EuropeanUnion)}')]

Output (dummy data)

           2014      2015      2016      2017      2018      2019
Geo                                                              
BE011  134332.0  131054.0  129056.0  125579.0  120978.0  118948.0
DE13   139774.0  140222.0  140752.0  141368.0  141994.0  142565.0
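
To try this without the full dataset, here is a minimal sketch. The four-row pop_df is a hypothetical stand-in with made-up values, and the non-capturing group around the alternation and the isin variant are additions for illustration, not part of the original answer.

```python
import pandas as pd

EuropeanUnion = ["BE", "BG", "CZ", "DK", "DE", "EE", "IE", "EL", "ES", "FR",
                 "HR", "IT", "CY", "LV", "LT", "LU", "HU", "MT", "NL", "AT",
                 "PL", "PT", "RO", "SI", "SK", "FI", "SE"]

# Hypothetical miniature stand-in for pop_df; the values are made up.
pop_df = pd.DataFrame(
    {"2018": [2870324.0, 120978.0, 141994.0, 144105.0],
     "2019": [2862427.0, 118948.0, 142565.0, 144367.0]},
    index=pd.Index(["AL", "BE011", "DE13", "UKN12"], name="Geo"),
)

# Wrap the alternation in a non-capturing group so it is treated as one unit.
pattern = rf'(?:{"|".join(EuropeanUnion)})'

# str.match tests the pattern at the start of each index label
# and returns a boolean mask.
mask = pop_df.index.str.match(pattern)
result = pop_df[mask]

# Regex-free variant: every code in this list is two letters long, so the
# first two characters of each label can be compared against the list.
result_isin = pop_df[pop_df.index.str[:2].isin(EuropeanUnion)]

print(result)
```

The isin variant relies on all codes having the same length; if the list ever contained codes of different lengths, the regex approach would be the safer choice.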

From the documentation for str.match:

Determine if each string starts with a match of a regular expression.

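The expression rf'\b{"|".join(EuropeanUnion)}' builds a regular expression pattern that matches any of the country codes. Because str.match only tests the start of each string, only index labels that begin with one of the codes are kept.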
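
To see what that pattern looks like and how start-anchored matching behaves, the following sketch uses only the standard re module. It assumes that str.match behaves like re.match on each label; the shortened code list and the sample labels are hypothetical.

```python
import re

# Shortened, hypothetical code list to keep the pattern readable.
EuropeanUnion = ["BE", "BG", "CZ"]

# Same construction as in the answer: join the codes with "|".
pattern = rf'\b{"|".join(EuropeanUnion)}'
print(pattern)  # \bBE|BG|CZ

# re.match only tries the pattern at position 0, so a code that appears
# later in a label (as in "UKBG1") does not count as a match.
labels = ["BE011", "CZ02", "AL01", "UKBG1"]
kept = [label for label in labels if re.match(pattern, label)]
print(kept)  # ['BE011', 'CZ02']

# Plain-Python equivalent: str.startswith accepts a tuple of prefixes.
kept_plain = [label for label in labels
              if label.startswith(tuple(EuropeanUnion))]
```

Note that in this pattern the \b binds only to the first alternative; the filter still works because matching starts at the beginning of each label for every alternative.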

