How do I sum values from one column dependent on items in other columns?

Problem description

I have the following dataframe:

    Course  Orders Ingredient 1 Ingredient 2  Ingredient 3
    starter 3      Fish         Bread         Mayonnaise
    starter 1      Olives       Bread   
    starter 5      Hummus       Pita    
    main    1      Pizza        
    main    6      Beef         Potato        Peas
    main    9      Fish         Peas    
    main    11     Bread        Mayonnaise    Beef
    main    4      Pasta        Bolognese     Peas
    desert  10     Cheese       Olives        Crackers
    desert  7      Cookies      Cream   
    desert  8      Cheesecake   Cream   

I would like to sum the number of orders for each ingredient per course. It is not important which column the ingredient is in.

The following dataframe is what I would like my output to be:

    Course  Ord Ing1       IngOrd1 Ing2       IngOrd2 Ing3       IngOrd3
    starter 3   Fish       3       Bread      4       Mayonnaise 3
    starter 1   Olives     1       Bread      4
    starter 5   Hummus     5       Pita       5
    main    1   Pizza      1
    main    6   Beef       17      Potato     6       Peas       19
    main    9   Fish       9       Peas       19
    main    11  Bread      11      Mayonnaise 11      Beef       17
    main    4   Pasta      4       Bolognese  4       Peas       19
    desert  10  Cheese     10      Olives     10      Crackers   10
    desert  7   Cookies    7       Cream      15
    desert  8   Cheesecake 8       Cream      15

I have tried using groupby().sum() but this does not work with the ingredients in 3 columns.

I also cannot use lookup because there are instances in the full dataframe where I do not know what ingredient I am looking for.

Tags: python, pandas

Solution


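I don't believe there is a really clever way to do this with groupby or other similar pandas methods, although I'd be happy to be proven wrong. In any case, the following isn't especially pretty, but it will give you what you're after.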

import pandas as pd
from collections import defaultdict

# The data you provided
df = pd.read_csv('orders.csv')

# Group these labels for convenience
ingredients = ['Ingredient 1', 'Ingredient 2', 'Ingredient 3']
orders = ['IngOrd1', 'IngOrd2', 'IngOrd3']

# Interleave the two lists for final data frame
combined = [y for x in zip(ingredients, orders) for y in x]

# Restructure the data frame so we can group on ingredients
melted = pd.melt(df, id_vars=['Course', 'Orders'], value_vars=ingredients, value_name='Ingredient')

# This is a map that we can apply to each ingredient column to
# look up the correct order count
maps = defaultdict(lambda: defaultdict(int))

# Build the map. Each course maps to a dict of per-ingredient totals,
# e.g. maps['main']['Beef'] == 17
for index, group in melted.groupby(['Course', 'Ingredient']):
    course, ingredient = index
    maps[course][ingredient] += group.Orders.sum()

# Now apply the map to each ingredient column of the data frame
# to create the new count columns
for i, o in zip(ingredients, orders):
    df[o] = df.apply(lambda x: maps[x.Course][x[i]], axis=1)

# Adjust the columns labels
df = df[['Course', 'Orders'] + combined]

print(df)

     Course  Orders Ingredient 1  IngOrd1 Ingredient 2  IngOrd2 Ingredient 3  IngOrd3
0   starter       3         Fish        3        Bread        4   Mayonnaise        3
1   starter       1       Olives        1        Bread        4          NaN        0
2   starter       5       Hummus        5         Pita        5          NaN        0
3      main       1        Pizza        1          NaN        0          NaN        0
4      main       6         Beef       17       Potato        6         Peas       19
5      main       9         Fish        9         Peas       19          NaN        0
6      main      11        Bread       11   Mayonnaise       11         Beef       17
7      main       4        Pasta        4    Bolognese        4         Peas       19
8    desert      10       Cheese       10       Olives       10     Crackers       10
9    desert       7      Cookies        7        Cream       15          NaN        0
10   desert       8   Cheesecake        8        Cream       15          NaN        0
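
For readers who do not have an `orders.csv` file, the following is a self-contained sketch that builds the question's data inline and computes the same counts. It swaps the nested `defaultdict` for a plain dict built from a `groupby(...).sum()` result; that substitution and the helper name `ingredient_totals` are my own assumptions for illustration, not part of the answer above.

```python
import pandas as pd

ingredients = ['Ingredient 1', 'Ingredient 2', 'Ingredient 3']
orders = ['IngOrd1', 'IngOrd2', 'IngOrd3']

# The data from the question, typed in by hand (None marks a blank cell)
rows = [
    ('starter', 3, 'Fish', 'Bread', 'Mayonnaise'),
    ('starter', 1, 'Olives', 'Bread', None),
    ('starter', 5, 'Hummus', 'Pita', None),
    ('main', 1, 'Pizza', None, None),
    ('main', 6, 'Beef', 'Potato', 'Peas'),
    ('main', 9, 'Fish', 'Peas', None),
    ('main', 11, 'Bread', 'Mayonnaise', 'Beef'),
    ('main', 4, 'Pasta', 'Bolognese', 'Peas'),
    ('desert', 10, 'Cheese', 'Olives', 'Crackers'),
    ('desert', 7, 'Cookies', 'Cream', None),
    ('desert', 8, 'Cheesecake', 'Cream', None),
]
df = pd.DataFrame(rows, columns=['Course', 'Orders'] + ingredients)


def ingredient_totals(frame):
    """Hypothetical helper: total Orders per (Course, Ingredient) pair."""
    melted = frame.melt(id_vars=['Course', 'Orders'],
                        value_vars=ingredients,
                        value_name='Ingredient')
    # groupby drops rows whose Ingredient is missing by default
    return melted.groupby(['Course', 'Ingredient'])['Orders'].sum()


# Plain dict keyed by (course, ingredient), e.g. ('main', 'Beef') -> 17
lookup = ingredient_totals(df).to_dict()

for ing, out in zip(ingredients, orders):
    keys = zip(df['Course'], df[ing])
    # Pairs with a missing ingredient are not in the dict; count them as 0
    df[out] = [lookup.get(key, 0) for key in keys]

print(df)
```

The list comprehension plays the same role as the row-wise `apply` in the original code; on the eleven rows here either choice is fine.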

If that is a problem, you will need to deal with the NaN values and the 0 counts. But that is a trivial task.
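
One possible way to do that cleanup, as a minimal sketch: blank out each count whose ingredient is missing, then show missing ingredients as empty strings. The two-row frame below is copied from the output above so the example runs on its own; the use of the nullable `Int64` dtype is my assumption about what "handling" should look like, and other choices (for example leaving the zeros in place) are equally valid.

```python
import pandas as pd

# Two rows shaped like the result above (values copied from that output)
df = pd.DataFrame({
    'Course': ['starter', 'main'],
    'Orders': [1, 1],
    'Ingredient 1': ['Olives', 'Pizza'],
    'IngOrd1': [1, 1],
    'Ingredient 2': ['Bread', None],
    'IngOrd2': [4, 0],
    'Ingredient 3': [None, None],
    'IngOrd3': [0, 0],
})

ingredients = ['Ingredient 1', 'Ingredient 2', 'Ingredient 3']
orders = ['IngOrd1', 'IngOrd2', 'IngOrd3']

for ing, out in zip(ingredients, orders):
    # Blank out the count wherever the ingredient itself is missing.
    # The nullable 'Int64' dtype keeps the remaining counts as integers.
    df[out] = df[out].astype('Int64').mask(df[ing].isna())
    # Show an empty string instead of NaN/None for missing ingredients
    df[ing] = df[ing].fillna('')

print(df)
```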
