python - Sum the values of a specific column in a DataFrame across duplicate rows

Problem description

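I have a DataFrame of books from which I have removed and modified some information. However, some rows share the same value in the "bookISBN" column, and I want to merge each group of such rows into a single row.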

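I plan to create a new DataFrame that keeps the first value of the url, ISBN, title, and genre for each book, but sums the values of the "genreVotes" column across the merged rows. How can I do this?

Original DataFrame: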

In [23]: network = data[["bookTitle", "bookISBN", "highestVotedGenre", "genreVotes"]]
         network.head().to_dict("list")
Out [23]: 
{'bookTitle': ['The Hunger Games',
  'Twilight',
  'The Book Thief',
  'Animal Farm',
  'The Chronicles of Narnia'],
 'bookISBN': ['9780439023481',
  '9780316015844',
  '9780375831003',
  '9780452284241',
  '9780066238500'],
 'highestVotedGenre': ['Young Adult',
  'Young Adult',
  'Historical-Historical Fiction',
  'Classics',
  'Fantasy'],
 'genreVotes': [103407, 80856, 59070, 73590, 26376]}

Duplicates:

In [24]: duplicates = network[network.duplicated(subset=["bookISBN"], keep=False)]
         duplicates.loc[(duplicates["bookISBN"] == "9780439023481") | (duplicates["bookISBN"] == "9780375831003")]
Out [24]:
{'bookTitle': ['The Hunger Games',
  'The Book Thief',
  'The Hunger Games',
  'The Book Thief',
  'The Book Thief'],
 'bookISBN': ['9780439023481',
  '9780375831003',
  '9780439023481',
  '9780375831003',
  '9780375831003'],
 'highestVotedGenre': ['Young Adult',
  'Historical-Historical Fiction',
  'Young Adult',
  'Historical-Historical Fiction',
  'Historical-Historical Fiction'],
 'genreVotes': [103407, 59070, 103407, 59070, 59070]}

(In this example the vote counts are identical across the duplicates, but in some cases the values differ.)

Expected output:

{'bookTitle': ['The Hunger Games',
  'Twilight',
  'The Book Thief',
  'Animal Farm',
  'The Chronicles of Narnia'],
 'bookISBN': ['9780439023481',
  '9780316015844',
  '9780375831003',
  '9780452284241',
  '9780066238500'],
 'highestVotedGenre': ['Young Adult',
  'Young Adult',
  'Historical-Historical Fiction',
  'Classics',
  'Fantasy'],
 'genreVotes': [206814, 80856, 177210, 73590, 26376]}

Tags: python, pandas, jupyter-notebook

Solution
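
One possible approach, sketched here rather than taken from an accepted answer, is to group by "bookISBN" and aggregate each column with its own rule: "first" for the descriptive columns and "sum" for "genreVotes". The `network` DataFrame below is a small stand-in rebuilt from the values shown in the question; the position of the duplicate rows (appended at the end) is an assumption, and `merged` is a name chosen only for illustration.

```python
import pandas as pd

# Stand-in for the question's `network` DataFrame. Values are copied from the
# question; appending the duplicate rows at the end is an assumption.
network = pd.DataFrame({
    "bookTitle": ["The Hunger Games", "Twilight", "The Book Thief",
                  "Animal Farm", "The Chronicles of Narnia",
                  "The Hunger Games", "The Book Thief", "The Book Thief"],
    "bookISBN": ["9780439023481", "9780316015844", "9780375831003",
                 "9780452284241", "9780066238500",
                 "9780439023481", "9780375831003", "9780375831003"],
    "highestVotedGenre": ["Young Adult", "Young Adult",
                          "Historical-Historical Fiction", "Classics", "Fantasy",
                          "Young Adult", "Historical-Historical Fiction",
                          "Historical-Historical Fiction"],
    "genreVotes": [103407, 80856, 59070, 73590, 26376, 103407, 59070, 59070],
})

# One output row per ISBN: keep the first title and genre, sum the votes.
# sort=False keeps the ISBNs in order of first appearance.
merged = (
    network
    .groupby("bookISBN", as_index=False, sort=False)
    .agg({"bookTitle": "first",
          "highestVotedGenre": "first",
          "genreVotes": "sum"})
)

# Restore the original column order.
merged = merged[network.columns.tolist()]

print(merged.to_dict("list"))
```

With this stand-in data, the two "The Hunger Games" rows sum to 206814 votes and the three "The Book Thief" rows to 177210. Note that the "first" aggregation returns the first non-null value in each group, so a missing title or genre in the first duplicate is filled from a later one. If the real DataFrame also has a url column, it can be added to the dictionary with "first" in the same way.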
