Pandas - Extract unique column combinations and count them in another table

Problem description

Task 1:

I have a table like this:

+----------+------------+----------+------------+----------+------------+-------+
| a_name_0 | id_qname_0 | a_name_1 | id_qname_1 | a_name_2 | id_qname_2 | count |
+----------+------------+----------+------------+----------+------------+-------+
| country  | 1          | NAN      | NAN        | NAN      | NAN        | 100   |
+----------+------------+----------+------------+----------+------------+-------+
| region   | 2          | city     | 8          | NAN      | NAN        | 20    |
+----------+------------+----------+------------+----------+------------+-------+
| region   | 2          | city     | 9          | NAN      | NAN        | 80    |
+----------+------------+----------+------------+----------+------------+-------+
| region   | 3          | age      | 4          | sex      | 6          | 40    |
+----------+------------+----------+------------+----------+------------+-------+
| region   | 3          | age      | 5          | sex      | 7          | 60    |
+----------+------------+----------+------------+----------+------------+-------+

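I need to go through the rows one by one, drop the NAN values, and convert each row (a Series) into a dictionary. The dictionaries will vary in size; for example, the first three would look like this: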

{'a_name_0':'country','id_qname_0':1}
{'a_name_0':'region','id_qname_0':2, 'a_name_1':'city','id_qname_1':8}
{'a_name_0':'region','id_qname_0':2, 'a_name_1':'city','id_qname_1':9}

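Each dictionary should then be stored in a list.

Task 2:

Using the table below, I have to count how often the column combinations from the dicts of the previous step occur: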

+----------+------------+----------+------------+----------+
| id       | country    | city     | age        | sex      | 
+----------+------------+----------+------------+----------+
| 1        | 1          | NAN      | NAN        | NAN      | 
+----------+------------+----------+------------+----------+
| 2        | 1          | 8        | NAN        | NAN      | 
+----------+------------+----------+------------+----------+

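If there is a faster mapping-based solution, please let me know, because what I am about to do could get messy. This answer did not help me, because I need an iterator both to extract the parameters and to count their occurrences.

Tags: python, pandas, dictionary

Solution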


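You can drop the count column and convert all rows to a list of dictionaries with DataFrame.to_dict (orient='records'), then filter out the missing values of each row in a dict comprehension. The column is dropped with the columns= keyword and the orient is spelled out in full, because the positional axis argument of drop and the 'r' abbreviation are no longer accepted by recent pandas versions:

# keep only the non-missing key/value pairs of every row
L = [{k:v for k, v in x.items() if pd.notna(v)} for x in df.drop(columns='count').to_dict('records')]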
print (L)
[{'a_name_0': 'country', 'id_qname_0': 1},
 {'a_name_0': 'region', 'id_qname_0': 2, 'a_name_1': 'city', 'id_qname_1': 8.0}, 
 {'a_name_0': 'region', 'id_qname_0': 2, 'a_name_1': 'city', 'id_qname_1': 9.0}, 
 {'a_name_0': 'region', 'id_qname_0': 3, 'a_name_1': 'age', 
 'id_qname_1': 4.0, 'a_name_2': 'sex', 'id_qname_2': 6.0},
 {'a_name_0': 'region', 'id_qname_0': 3, 'a_name_1': 'age',
 'id_qname_1': 5.0, 'a_name_2': 'sex', 'id_qname_2': 7.0}]
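To make the step above reproducible, here is a minimal, self-contained sketch that rebuilds the sample table from the question and applies the same comprehension. The frame construction and the `as_int` helper are assumptions added for illustration: the helper casts whole-number floats such as 8.0 back to 8, since columns that contain missing values are stored as floats while the question shows integer ids.

```python
import numpy as np
import pandas as pd

# Sample data from the question; NaN marks the empty cells.
df = pd.DataFrame({
    'a_name_0': ['country', 'region', 'region', 'region', 'region'],
    'id_qname_0': [1, 2, 2, 3, 3],
    'a_name_1': [np.nan, 'city', 'city', 'age', 'age'],
    'id_qname_1': [np.nan, 8, 9, 4, 5],
    'a_name_2': [np.nan, np.nan, np.nan, 'sex', 'sex'],
    'id_qname_2': [np.nan, np.nan, np.nan, 6, 7],
    'count': [100, 20, 80, 40, 60],
})


def as_int(value):
    """Hypothetical helper: turn whole-number floats (8.0) into ints (8)."""
    if isinstance(value, float) and value.is_integer():
        return int(value)
    return value


# One dict per row, without the count column and without missing values.
records = df.drop(columns='count').to_dict('records')
L = [{k: as_int(v) for k, v in row.items() if pd.notna(v)} for row in records]

for row in L:
    print(row)
```

If the float ids are acceptable, the `as_int` call can simply be dropped, which gives back the comprehension used in the answer.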

I am not 100% sure about the second DataFrame:

L1 = [dict(zip(list(x.values())[::2], list(x.values())[1::2])) for x in L]
df = pd.DataFrame(L1)
print (df)
   country  region  city  age  sex
0      1.0     NaN   NaN  NaN  NaN
1      NaN     2.0   8.0  NaN  NaN
2      NaN     2.0   9.0  NaN  NaN
3      NaN     3.0   NaN  4.0  6.0
4      NaN     3.0   NaN  5.0  7.0
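For the counting part of Task 2, one possible reading is: for every name-to-id dictionary (the shape of the entries in L1), count the rows of the second table in which each named column holds the given id. The sketch below works under that assumption; `count_matches`, `combos` and `table` are hypothetical names, and columns not mentioned in a combination are ignored when matching. If a row should match only when all other columns are empty, the mask would need an extra condition.

```python
import numpy as np
import pandas as pd

# Name -> id combinations, in the shape produced by the L1 step.
combos = [
    {'country': 1},
    {'region': 2, 'city': 8},
    {'region': 2, 'city': 9},
    {'region': 3, 'age': 4, 'sex': 6},
    {'region': 3, 'age': 5, 'sex': 7},
]

# Second table from the question; NaN marks the empty cells.
table = pd.DataFrame({
    'id': [1, 2],
    'country': [1, 1],
    'city': [np.nan, 8],
    'age': [np.nan, np.nan],
    'sex': [np.nan, np.nan],
})


def count_matches(table, combo):
    """Count rows where every column named in combo equals its id."""
    # A combination naming a column the table lacks cannot match any row.
    if not set(combo).issubset(table.columns):
        return 0
    mask = pd.Series(True, index=table.index)
    for col, val in combo.items():
        mask &= table[col].eq(val)  # NaN never equals an id
    return int(mask.sum())


counts = [count_matches(table, combo) for combo in combos]
print(counts)
```

With the two sample rows, only the country combination finds matches, because the sample table has no region column. On a large table, a vectorised route such as a merge or a groupby on the relevant columns is likely to be faster than looping over the combinations.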
