How do I use a dictionary in a DataFrame column to access its values and take the average of another column?

Problem description

I have a DataFrame in which one column holds dictionaries of words and their counts, and another column holds labels.

|dict                     |label   |
|-------------------------|--------|
|{'word1':1, 'word2':2}   |1       |
|{'word2':4, 'word3':1}   |0       |
|{'word1':3, 'word4':2}   |0       |

I need to output every word, its total count, and its average label (weighted by count):

|word   |count  |average|
|-------|-------|-------|
|word1  |4      |0.25   |
|word2  |6      |0.33   |
|word3  |1      |0.0    |
|word4  |2      |0.0    |

To clarify the average: there is one instance of word1 with label 1 (row 1) and three instances of word1 with label 0 (row 3), so the average is 1/4 = 0.25.
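
The same arithmetic for word2 can be sketched as follows. This is only an illustration of the weighting described above; the variable names are made up for the example and the (count, label) pairs are read off the input table by hand.

```python
# (count, label) pairs for word2, taken from rows 1 and 2 of the input table
occurrences = [(2, 1), (4, 0)]

# total number of times word2 occurs: 2 + 4 = 6
total_count = sum(count for count, _ in occurrences)
# each occurrence contributes its label once: 2*1 + 4*0 = 2
weighted_sum = sum(count * label for count, label in occurrences)

average = weighted_sum / total_count
print(total_count, round(average, 2))  # prints: 6 0.33
```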

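I am having trouble accessing two different columns inside a loop. The dictionaries are also throwing me off. I'm a bit of a Python noob, so any help is much appreciated.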

Tags: python, dataframe, dictionary

Solution


Here you go:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import pandas as pd

# sample data
df = pd.DataFrame([
    {'dict': {'word1': 1, 'word2': 2}, 'label': 1},
    {'dict': {'word2': 4, 'word3': 1}, 'label': 0},
    {'dict': {'word1': 3, 'word4': 2}, 'label': 0}])


new_rows = []
count = {}
# iterate over the rows, tracking count * label per row and the total count per word
for _, row in df.iterrows():
    new = {}
    current_dict = row['dict']
    current_label = row['label']
    for x, y in current_dict.items():
        # this row's contribution to the weighted label sum for word x
        new[x] = current_label * y
        # running total of how often word x occurs across all rows
        count[x] = count.get(x, 0) + y
    new_rows.append(new)

# weighted label sum per word, divided by the total count per word
new_df = pd.DataFrame(new_rows).sum(axis=0, skipna=True).divide(pd.Series(count))
# append the count column to the right
new_df = pd.concat([new_df, pd.Series(count)], axis=1)
# name the two columns (concat labels them 0 and 1)
new_df = new_df.rename(columns={0: 'average', 1: 'count'})

I first restructured the data, then used the pandas sum and divide functions to get the average.
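
If it helps to see the idea on its own, here is a minimal sketch of the same accumulate-then-divide approach in plain Python, without pandas. The function name `weighted_label_average` and the `rows` list of (dict, label) pairs are hypothetical and only for illustration; the data is the sample from the question.

```python
from collections import defaultdict


def weighted_label_average(rows):
    """rows: iterable of (word_counts, label) pairs (hypothetical helper)."""
    total = defaultdict(int)     # word -> summed count
    weighted = defaultdict(int)  # word -> summed count * label
    for word_counts, label in rows:
        for word, n in word_counts.items():
            total[word] += n
            weighted[word] += n * label
    # divide only after every row has been seen, so the counts are complete
    return [
        {'word': w, 'count': total[w], 'average': weighted[w] / total[w]}
        for w in sorted(total)
    ]


# sample data from the question as (dict, label) pairs
rows = [
    ({'word1': 1, 'word2': 2}, 1),
    ({'word2': 4, 'word3': 1}, 0),
    ({'word1': 3, 'word4': 2}, 0),
]

result = weighted_label_average(rows)
for r in result:
    print(r)
```

With the DataFrame from the code above, `zip(df['dict'], df['label'])` should yield the same pairs, and `pd.DataFrame(result)` turns the list back into a table with the columns word, count and average.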

