GroupBy followed by Mean/Median/Mode using str.contains criteria

Question

I have the following sample dataset:

Name | Year | Score | 2nd Score | % of People | Country | Fruit | Export Countries | Language | Transit Duration | Quality | Taste | Freshness | Packaging
Andes, The | 2021 | 8 | 8.8 | 87% | The Netherlands | The Apple | United States,United Kingdom | English,Japanese,French | 148.0 | 1.0 | 0.0 | 0.0 | 0.0
Phil | 2021 | 8 | 8.4 | 87% | Spain | The Banana | United Kingdom, Germany | English,German,French,Italian | 165.0 | 1.0 | 0.0 | 0.0 | 0.0
Sarah | 2021 | 9 | 8.3 | 89% | Greece | The Plum | Germany,United States | English,German,French,Italian | 153.0 | 1.0 | 0.0 | 0.0 | 0.0

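I would like to use the Language column and, for every entry that contains a given language (for example "German"), compute the mean/median/mode of the Score column. The expected output for the mean would look something like this: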

English  8.33  
German  8.5
French  8.33
Italian  8.5
Japanese  8

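I already have syntax that splits the languages into their own columns, and I know I could apply the logic that way, but I would like to know whether the dataset can be analyzed the way I describe above.

Edit: the data as a list of dicts, as requested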

[{'Name': 'Andes, The',
  'Year': 2021,
  'Score': '8',
  '2nd Score': 8.8,
  '% of People': '87%',
  'Country': 'The Netherlands',
  'Fruit': 'The Apple',
  'Export Countries': 'United States,United Kingdom',
  'Language': 'English,Japanese,French',
  'Transit Duration': 148.0,
  'Quality': 1.0,
  'Taste': 0.0,
  'Freshness': 0.0,
  'Packaging': 0.0},
 {'Name': 'Phil',
  'Year': 2021,
  'Score': '8',
  '2nd Score': 8.8,
  '% of People': '87%',
  'Country': 'Spain',
  'Fruit': 'The Banana',
  'Export Countries': 'United Kingdom, Germany',
  'Language': 'English,German,French,Italian',
  'Transit Duration': 118.0,
  'Quality': 1.0,
  'Taste': 0.0,
  'Freshness': 0.0,
  'Packaging': 0.0},
 {'Name': 'Sarah',
  'Year': 2021,
  'Score': '9',
  '2nd Score': 8.8,
  '% of People': '89%',
  'Country': 'Greece',
  'Fruit': 'The Plum',
  'Export Countries': 'Germany,United States',
  'Language': 'English,German,French,Italian',
  'Transit Duration': 165.0,
  'Quality': 1.0,
  'Taste': 0.0,
  'Freshness': 0.0,
  'Packaging': 0.0},
 {'Name': 'William',
  'Year': 2021,
  'Score': '6',
  '2nd Score': 8.8,
  '% of People': '65%',
  'Country': 'Brazil',
  'Fruit': 'Strawberries',
  'Export Countries': 'Spain,Greece',
  'Language': 'Spanish, French',
  'Transit Duration': 153.0,
  'Quality': 1.0,
  'Taste': 0.0,
  'Freshness': 0.0,
  'Packaging': 0.0}]

Tags: python, pandas, dataframe, numpy

Answer


You can use split and explode to flatten the Language column, then aggregate by each language:

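# Assumes df holds the question's data and that Score is a numeric column.
(df[['Score', 'Language']]
  .assign(Language=lambda x: x.Language.str.split(','))  # 'A,B' -> ['A', 'B']
  .explode('Language')                                   # one row per language
  .groupby('Language')
  .Score.mean()                                          # or .median() for the median
  .reset_index())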

   Language     Score
0   English  8.333333
1    French  8.333333
2    German  8.500000
3   Italian  8.500000
4  Japanese  8.000000
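
The question also asks for the median and mode, and the records posted in the edit store Score as text and contain one entry with a space after the comma ('Spanish, French'). The following is a minimal sketch of how the same split/explode idea could cover those cases. It is not part of the original answer. The records are trimmed to the columns used, first_mode is a hypothetical helper, and picking the smallest value when the mode is tied is an assumption.

```python
import pandas as pd

# Records from the question, trimmed to the columns used here.
records = [
    {"Name": "Andes, The", "Score": "8", "Language": "English,Japanese,French"},
    {"Name": "Phil", "Score": "8", "Language": "English,German,French,Italian"},
    {"Name": "Sarah", "Score": "9", "Language": "English,German,French,Italian"},
    {"Name": "William", "Score": "6", "Language": "Spanish, French"},
]
df = pd.DataFrame(records)

# Score is stored as text in the posted records, so convert it first.
df["Score"] = pd.to_numeric(df["Score"])


def first_mode(s):
    # Hypothetical helper: Series.mode() may return several values on a tie;
    # this keeps the smallest one (an assumption, adjust as needed).
    return s.mode().iloc[0]


per_language = (
    df[["Score", "Language"]]
    .assign(Language=lambda x: x["Language"].str.split(","))
    .explode("Language")
    # Strip stray spaces so " French" and "French" land in the same group.
    .assign(Language=lambda x: x["Language"].str.strip())
    .groupby("Language")["Score"]
    .agg(mean="mean", median="median", mode=first_mode)
    .reset_index()
)
print(per_language)

# str.contains variant: statistics for a single language without exploding.
german_mask = df["Language"].str.contains("German", regex=False)
german_stats = df.loc[german_mask, "Score"].agg(["mean", "median"])
print(german_stats)
```

With all four posted records, William's score of 6 is counted under French, so the French mean comes out as 7.75 rather than the 8.33 shown for the three-row sample, and Spanish appears as an extra group. Note that str.contains matches substrings, so it would also match any language name that merely contains the search text.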
