python - How can I use a dictionary in a DataFrame column to access its values and take the average of another column?
Problem
I have a DataFrame in which one column holds dictionaries of words and their counts, and another column holds labels.
| dict                     | label |
|--------------------------|-------|
| {'word1': 1, 'word2': 2} | 1     |
| {'word2': 4, 'word3': 1} | 0     |
| {'word1': 3, 'word4': 2} | 0     |
I need to output every word, its total count, and its average label (weighted by count):
| word  | count | average |
|-------|-------|---------|
| word1 | 4     | 0.25    |
| word2 | 6     | 0.33    |
| word3 | 1     | 0.0     |
| word4 | 2     | 0.0     |
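To clarify the average: there is one instance of word1 with label 1 (row 1) and three instances of word1 with label 0 (row 3), so the average is 1/4 = 0.25.
I'm having trouble accessing two different columns inside a loop. The dictionaries are also tripping me up, and I'm a bit of a Python noob, so any help is much appreciated.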
Solution
Here you go:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import pandas as pd

# sample data
df = pd.DataFrame([
    {'dict': {'word1': 1, 'word2': 2}, 'label': 1},
    {'dict': {'word2': 4, 'word3': 1}, 'label': 0},
    {'dict': {'word1': 3, 'word4': 2}, 'label': 0}])

new_rows = []
count = {}
# iterate over the rows, recording label * count per word
# and accumulating the total count per word
for _, row in df.iterrows():
    new = {}
    current_dict = row['dict']
    current_label = row['label']
    for x, y in current_dict.items():
        new[x] = current_label * y
        count[x] = count.get(x, 0) + y
    new_rows.append(new)

# calculate the average only once we have the full counts
new_df = pd.DataFrame(new_rows).sum(axis=0, skipna=True).divide(pd.Series(count))
# append the count column to the right
new_df = pd.concat([new_df, pd.Series(count)], axis=1)
# rename the columns (concat leaves them named 0 and 1)
new_df = new_df.rename(columns={0: 'average', 1: 'count'})
I first restructured the data so that each word holds its label multiplied by its count, then used the pandas sum and divide functions to get the count-weighted average.
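If you would rather avoid `iterrows`, the same restructure-then-aggregate idea can be sketched with a long-format frame and `groupby`. This is only one possible variant, not the only way to do it: the names `long_df`, `weighted`, and `result` are made up for illustration, and it assumes every entry in the `dict` column is a plain dictionary of word to count.

```python
import pandas as pd

# same sample data as in the question
df = pd.DataFrame([
    {'dict': {'word1': 1, 'word2': 2}, 'label': 1},
    {'dict': {'word2': 4, 'word3': 1}, 'label': 0},
    {'dict': {'word1': 3, 'word4': 2}, 'label': 0}])

# one record per (word, count) pair, carrying the row's label along;
# 'weighted' (hypothetical name) is the count multiplied by the label
long_df = pd.DataFrame(
    [{'word': word, 'count': n, 'weighted': n * label}
     for d, label in zip(df['dict'], df['label'])
     for word, n in d.items()])

# sum the counts and the label-weighted counts per word, then divide
result = long_df.groupby('word', as_index=False)[['count', 'weighted']].sum()
result['average'] = result['weighted'] / result['count']
result = result[['word', 'count', 'average']]
print(result)
```

On the sample data this should reproduce the numbers in the question's table, with the columns in the requested order (word, count, average).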