Summing the values of a specific column across duplicate rows in a DataFrame

Problem description

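I have a DataFrame of books from which I removed and modified some information. However, some rows have duplicate values in the "bookISBN" column, and I want to merge each group of such rows into a single row.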

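I plan to create a new DataFrame that keeps the first value of the URL, ISBN, title, and genre, but sums the values of the "genreVotes" column to produce the merged row. How can I do this?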

Original DataFrame:

In [23]: network = data[["bookTitle", "bookISBN", "highestVotedGenre", "genreVotes"]]
         network.head().to_dict("list")
Out [23]: 
{'bookTitle': ['The Hunger Games',
  'Twilight',
  'The Book Thief',
  'Animal Farm',
  'The Chronicles of Narnia'],
 'bookISBN': ['9780439023481',
  '9780316015844',
  '9780375831003',
  '9780452284241',
  '9780066238500'],
 'highestVotedGenre': ['Young Adult',
  'Young Adult',
  'Historical-Historical Fiction',
  'Classics',
  'Fantasy'],
 'genreVotes': [103407, 80856, 59070, 73590, 26376]}

Duplicates:

In [24]: duplicates = network[network.duplicated(subset=["bookISBN"], keep=False)]
         duplicates.loc[(duplicates["bookISBN"] == "9780439023481") | (duplicates["bookISBN"] == "9780375831003")].to_dict("list")
Out [24]:
{'bookTitle': ['The Hunger Games',
  'The Book Thief',
  'The Hunger Games',
  'The Book Thief',
  'The Book Thief'],
 'bookISBN': ['9780439023481',
  '9780375831003',
  '9780439023481',
  '9780375831003',
  '9780375831003'],
 'highestVotedGenre': ['Young Adult',
  'Historical-Historical Fiction',
  'Young Adult',
  'Historical-Historical Fiction',
  'Historical-Historical Fiction'],
 'genreVotes': [103407, 59070, 103407, 59070, 59070]}

(In this example the vote counts are all identical, but in some cases the values differ.)

Expected output:

{'bookTitle': ['The Hunger Games',
  'Twilight',
  'The Book Thief',
  'Animal Farm',
  'The Chronicles of Narnia'],
 'bookISBN': ['9780439023481',
  '9780316015844',
  '9780375831003',
  '9780452284241',
  '9780066238500'],
 'highestVotedGenre': ['Young Adult',
  'Young Adult',
  'Historical-Historical Fiction',
  'Classics',
  'Fantasy'],
 'genreVotes': [206814, 80856, 177210, 73590, 26376]}

Tags: python, pandas, jupyter-notebook

Solution
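One possible approach, sketched here rather than taken from an accepted answer, is to group by "bookISBN" and aggregate each column with its own function: "first" for the descriptive columns and "sum" for "genreVotes". The small `network` frame below is a stand-in built from the values shown in the question, and the name `merged` is only illustrative.

```python
import pandas as pd

# Stand-in for the question's `network` DataFrame, using values from the post.
network = pd.DataFrame({
    "bookTitle": ["The Hunger Games", "Twilight", "The Book Thief",
                  "The Hunger Games", "The Book Thief", "The Book Thief"],
    "bookISBN": ["9780439023481", "9780316015844", "9780375831003",
                 "9780439023481", "9780375831003", "9780375831003"],
    "highestVotedGenre": ["Young Adult", "Young Adult",
                          "Historical-Historical Fiction", "Young Adult",
                          "Historical-Historical Fiction",
                          "Historical-Historical Fiction"],
    "genreVotes": [103407, 80856, 59070, 103407, 59070, 59070],
})

# Group rows sharing an ISBN; sort=False keeps groups in order of first appearance.
merged = (
    network
    .groupby("bookISBN", as_index=False, sort=False)
    .agg(
        bookTitle=("bookTitle", "first"),                  # keep the first title
        highestVotedGenre=("highestVotedGenre", "first"),  # keep the first genre
        genreVotes=("genreVotes", "sum"),                  # add up the votes
    )
)

# Restore the original column order.
merged = merged[network.columns.tolist()]

print(merged.to_dict("list"))
```

With this sample, the two "The Hunger Games" rows collapse into one row with 206814 votes and the three "The Book Thief" rows into one with 177210. Note that `groupby` drops rows whose key is missing by default; if some books have no ISBN and should be kept, passing `dropna=False` (available in pandas 1.1 and later) may be worth considering.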

