pandas: aggregating the values of a Series using the keys of a dictionary

Problem description

I am working with a DataFrame like this one:

DF_draft = pd.DataFrame(data={"subdivision_name" : ["01","02","03","04","05"], 
            "day": ["2020-03-18","2020-03-18","2020-03-18","2020-03-18","2020-03-18"],
            "data":[0,11,12,13,2]})

    subdivision_name    day         data
0   01                  2020-03-18  0
1   02                  2020-03-18  11
2   03                  2020-03-18  12
3   04                  2020-03-18  13
4   05                  2020-03-18  2

I am trying to keep only some of the subdivision_name rows, based on a dictionary like this:

reference_dict = {"Area Alpha": ["01","03","04","15","26","38","42","43","63","69","73","74"], 
                   "Area Beta" : ["21","25","39","58","70","71","89","90"],
                  "Area Gaga" : ["02","01","07","57","88","67","68","54"]}

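My goal is to put one (or several) of the keys of reference_dict in a list and pass that list as an argument to a function. For example, ["Area Alpha"] would produce a DataFrame like: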

Area Alpha   day
25           2020-03-18

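The key of reference_dict becomes the name of the Series, and its value is the sum of the data for subdivisions "01", "03" and "04", the ones listed in reference_dict["Area Alpha"] that are present in the DataFrame (0 + 12 + 13 = 25).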

I started testing with this function:

def area_cluster(DF, one_list):
    for area in one_list:
        if area in reference_dict:
            condition = DF["subdivision_name"].any() in reference_dict[area]
            new_DF = DF[condition]
        else:
            print("Wrong orthograph")     
    return new_DF

I tested it with

DF = area_cluster(DF_draft, ["Area Alpha"])

and got this error

KeyError                                  Traceback (most recent call last)
<ipython-input-131-46075ea2a017> in <module>
----> 1 DF = area_cluster(DF_draft, ["Area Alpha"])

<ipython-input-129-a86847d436dd> in area_cluster(DF, one_list)
      3         if area in subdivision_dict:
      4             condition = DF["subdivision_name"].any() in subdivision_dict[area]
----> 5             new_DF = DF[condition]
      6         else:
      7             print("Wrong orthograph")

c:\users\raphael\appdata\local\programs\python\python39\lib\site-packages\pandas\core\frame.py in __getitem__(self, key)
   2904             if self.columns.nlevels > 1:
   2905                 return self._getitem_multilevel(key)
-> 2906             indexer = self.columns.get_loc(key)
   2907             if is_integer(indexer):
   2908                 indexer = [indexer]

c:\users\raphael\appdata\local\programs\python\python39\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   2895                 return self._engine.get_loc(casted_key)
   2896             except KeyError as err:
-> 2897                 raise KeyError(key) from err
   2898 
   2899         if tolerance is not None:

KeyError: True

Maybe it is an indexing problem, but I don't know how to solve it.

Tags: python, pandas, dictionary

Solution
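
On the KeyError itself: `DF["subdivision_name"].any()` collapses the whole column into a single value, so `... in reference_dict[area]` evaluates to one Python bool rather than a row-by-row mask. `DF[True]` is then treated as a lookup for a column named `True`, which does not exist. A minimal sketch of a row-wise filter, assuming `Series.isin` is what was intended (the dictionary is shortened here for illustration):

```python
import pandas as pd

DF_draft = pd.DataFrame({
    "subdivision_name": ["01", "02", "03", "04", "05"],
    "day": ["2020-03-18"] * 5,
    "data": [0, 11, 12, 13, 2],
})
# Shortened version of the question's dictionary, for illustration only.
reference_dict = {"Area Alpha": ["01", "03", "04", "15"]}

# Row-wise boolean mask: True where the subdivision belongs to the area.
mask = DF_draft["subdivision_name"].isin(reference_dict["Area Alpha"])

# Keep the matching rows, sum "data" per day, and name the column after the area.
alpha = (
    DF_draft[mask]
    .groupby("day", as_index=False)["data"]
    .sum()
    .rename(columns={"data": "Area Alpha"})
)
print(alpha)
```

This gives a frame with a `day` column and an `Area Alpha` column, which is the shape asked for in the question.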


IIUC, you could do it like this:

import pandas as pd

DF_draft = pd.DataFrame(data={"subdivision_name": ["01", "02", "03", "04", "05"],
                              "day": ["2020-03-18", "2020-03-18", "2020-03-18", "2020-03-18", "2020-03-18"],
                              "data": [0, 11, 12, 13, 2]})

reference_dict = {"Area Alpha": ["01", "03", "04", "15", "26", "38", "42", "43", "63", "69", "73", "74"],
                  "Area Beta": ["21", "25", "39", "58", "70", "71", "89", "90"],
                  "Area Gaga": ["02", "01", "07", "57", "88", "67", "68", "54"]}


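def area_cluster(df, one_list):
    ss = []  # one aggregated frame per requested area
    for area in one_list:
        if area in reference_dict:
            # Label each row with the area if its subdivision belongs to it, NaN otherwise.
            r = df.assign(area=df["subdivision_name"].map(dict.fromkeys(reference_dict[area], area)))
            # Drop the unlabeled rows, then sum "data" per area and day.
            r = r.dropna().groupby(['area', 'day'])['data'].sum().reset_index()
            ss.append(r)

    # Stack the per-area results into a single frame.
    return pd.concat(ss, ignore_index=True)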


res = area_cluster(DF_draft, ['Area Alpha', "Area Beta", "Area Gaga"])
print(res)

Output

         area         day  data
0  Area Alpha  2020-03-18    25
1   Area Gaga  2020-03-18    11
