Array elements to DataFrame columns

Problem description

recipe_id   cuisine     ingredients
0   10259   greek       [romaine lettuce, black olives, grape tomatoes]
1   25693   southern_us [plain flour, ground pepper, salt, tomatoes]
2   20130   filipino    [eggs, pepper, salt, mayonaise, cooking oil]
3   22213   indian      [water, vegetable oil, wheat, salt]

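The DataFrame has a column that holds, for each row, an array of the ingredients used in that recipe. My goal is to create one column per ingredient and mark it 1 if the recipe in that row uses the ingredient, and 0 otherwise.

My solution is: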

for index,item in enumerate(df.ingredients):
  for ingredient in item:
    if (ingredient not in df.columns): df[ingredient]=0
    df[ingredient].iloc[index]=1

But the exercise answer suggests:

def find_item(cell):
    if i in cell:
        return 1
    return 0

for item in df.ingredients:
  for i in item:
    df[i] = df['ingredients'].apply(find_item)

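The result is the same. My solution seems more readable to me, so I would like to understand the reason for using apply.

PS: My solution also raises a warning, and I could not find out how to fix it. How should I resolve it?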

SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_with_indexer(indexer, value)

The data is available here:

import json
import urllib.request
import pandas as pd

# Download the recipes JSON and load it into a DataFrame
with urllib.request.urlopen("https://raw.githubusercontent.com/konst54/datasets/master/recipes.json") as url:
    recepies = json.loads(url.read().decode())
df = pd.DataFrame(recepies)

Tags: python, arrays, pandas, dataframe

Solution


To avoid the warning you are getting, as @Quang Hoang pointed out:

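df[ingredient].iloc[index]=1 is what triggers the warning. It is chained indexing: df[ingredient] first returns an intermediate object, and .iloc then assigns into that object, which may be a copy rather than df itself. This pattern should be avoided.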
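As a minimal sketch of the same loop without chained indexing, the assignment can go through a single df.loc[row, column] call. The small frame below is a stand-in built from two sample rows of the question, not the full dataset, and iterating with .items() (so that the row label rather than the position is used) is a choice made here, not something from the original post.

```python
import pandas as pd

# Stand-in data: two sample rows from the question, not the full dataset
df = pd.DataFrame({
    "recipe_id": [10259, 25693],
    "cuisine": ["greek", "southern_us"],
    "ingredients": [
        ["romaine lettuce", "black olives", "grape tomatoes"],
        ["plain flour", "ground pepper", "salt", "tomatoes"],
    ],
})

# .items() yields (row label, list of ingredients) pairs
for label, item in df["ingredients"].items():
    for ingredient in item:
        if ingredient not in df.columns:
            df[ingredient] = 0  # new column, 0 for every recipe
        # Single indexing step on df itself, so no intermediate copy is involved
        df.loc[label, ingredient] = 1

print(df.drop(columns="ingredients"))
```

Because df.loc selects the row and the column in one operation, pandas writes directly into df and the warning should not appear.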

As an alternative to your solution, you can also try combining explode and pivot:

import ast
import io
import pandas as pd

# Recreate the sample DataFrame from the question
s_e='''
recipe_id      cuisine        ingredients
0   10259     greek         ['romaine', 'lettuce', 'black olives', 'grape tomatoes']
1   25693     southern_us     ['plain flour', 'ground pepper', 'salt', 'tomatoes']
2   20130     filipino      ['eggs', 'pepper', 'salt', 'mayonaise', 'cooking oil']
3   22213     indian        ['water', 'vegetable oil', 'wheat', 'salt']
'''
# Columns are separated by two or more whitespace characters
df = pd.read_csv(io.StringIO(s_e), sep=r'\s\s+', engine='python')
# Parse each string such as "['a', 'b']" into a real Python list
df.ingredients = df.ingredients.apply(ast.literal_eval)
print(df)


# Approach: one row per (recipe, ingredient) pair, then pivot to wide form
df = df.explode('ingredients')
df['val'] = 1
newdf = df.pivot(index='recipe_id', columns='ingredients', values='val').fillna(0)
print(newdf)

Output:

df
   recipe_id      cuisine                                       ingredients
0      10259        greek  [romaine, lettuce, black olives, grape tomatoes]
1      25693  southern_us      [plain flour, ground pepper, salt, tomatoes]
2      20130     filipino      [eggs, pepper, salt, mayonaise, cooking oil]
3      22213       indian               [water, vegetable oil, wheat, salt]



newdf
ingredients  black olives  cooking oil  eggs  grape tomatoes  ground pepper  lettuce  mayonaise  pepper  plain flour  romaine  salt  tomatoes  vegetable oil  water  wheat
recipe_id
10259                 1.0          0.0   0.0             1.0            0.0      1.0        0.0     0.0          0.0      1.0   0.0       0.0            0.0    0.0    0.0
20130                 0.0          1.0   1.0             0.0            0.0      0.0        1.0     1.0          0.0      0.0   1.0       0.0            0.0    0.0    0.0
22213                 0.0          0.0   0.0             0.0            0.0      0.0        0.0     0.0          0.0      0.0   1.0       0.0            1.0    1.0    1.0
25693                 0.0          0.0   0.0             0.0            1.0      0.0        0.0     0.0          1.0      0.0   1.0       1.0            0.0    0.0    0.0
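The pivot result above holds floats (the gaps are filled after reshaping), loses the cuisine column, and pivot raises an error if a recipe lists the same ingredient twice. One possible variation, sketched here as an assumption rather than as part of the original answer, is to count occurrences with pd.crosstab, clip the counts to 1, and join the flags back onto the original columns. The names long, flags and result are made up for this sketch.

```python
import pandas as pd

# Same sample data as in the answer, built directly as lists
df = pd.DataFrame({
    "recipe_id": [10259, 25693, 20130, 22213],
    "cuisine": ["greek", "southern_us", "filipino", "indian"],
    "ingredients": [
        ["romaine", "lettuce", "black olives", "grape tomatoes"],
        ["plain flour", "ground pepper", "salt", "tomatoes"],
        ["eggs", "pepper", "salt", "mayonaise", "cooking oil"],
        ["water", "vegetable oil", "wheat", "salt"],
    ],
})

# One row per (recipe, ingredient) pair
long = df.explode("ingredients")

# crosstab counts how often each ingredient occurs per recipe;
# clip(upper=1) turns any count above 1 into a plain 1 flag
flags = pd.crosstab(long["recipe_id"], long["ingredients"]).clip(upper=1)

# Attach the flag columns to recipe_id and cuisine
result = df[["recipe_id", "cuisine"]].join(flags, on="recipe_id")
print(result)
```

Since crosstab aggregates by counting, a repeated ingredient within one recipe does not cause an error here; it is simply clipped to 1.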
