python - How to get word frequency counts grouped by a second variable (Python)

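Problem description

I'm new to Python, so quite possibly I just haven't phrased my search well enough to find the answer.

Using Pandas, I was able to find the N most common words across the description field of all my records. However, I have two columns: a category column and a description field. How do I find the most common words for each category?

Example data: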

 - Property|Description
 - House| Blue, Two stories, pool
 - Car| Green, Dented, Manual, New
 - Car| Blue, Automatic, Heated Seat
 - House|New, Furnished, HOA
 - Car|Blue, Old, Multiple Owners

My current code returns Blue=3, New=2, and so on. What I need to know is that Blue appears twice for Car and once for House.

Current relevant code

import pandas
from collections import Counter
words = (data.Description.str.lower().str.cat(sep=' ').split())
keywords = pandas.DataFrame(Counter(words).most_common(10), columns=['Words', 'Frequency'])

Tags: python, pandas

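Try this: split each row's value on the delimiter, then apply explode to turn every element of the resulting list into its own row, and finally group by both columns.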

# strip surrounding whitespace, drop spaces after commas, then split on the delimiter
df['Description'] = df['Description'].str.strip()\
    .str.replace(r",\s+", ",", regex=True)\
    .str.split(',')

# apply group by to get count of each word.
print(df.explode(column='Description').
      groupby(["Property","Description"]).size().reset_index(name='count'))

Output:

   Property      Description  count
0       Car        Automatic      1
1       Car             Blue      2
2       Car           Dented      1
3       Car            Green      1
4       Car      Heated Seat      1
...
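
The question asks for the most common words per category, so the grouped counts still need to be ranked within each group. Below is a minimal, self-contained sketch of one way to do that. The sample rows are copied from the question; the helper name `top_words_per_group`, its parameters, and the choice to lowercase terms and break ties alphabetically are assumptions made for illustration, not part of the original answer.

```python
import pandas as pd

# Sample data copied from the question; the variable name `df` is an assumption.
df = pd.DataFrame({
    "Property": ["House", "Car", "Car", "House", "Car"],
    "Description": [
        " Blue, Two stories, pool",
        " Green, Dented, Manual, New",
        " Blue, Automatic, Heated Seat",
        "New, Furnished, HOA",
        "Blue, Old, Multiple Owners",
    ],
})


def top_words_per_group(frame, group_col, text_col, n=3):
    """Hypothetical helper: count comma-separated terms per group, keep the top n."""
    # Lowercase, split on commas, and give every term its own row.
    terms = frame.assign(
        **{text_col: frame[text_col].str.lower().str.split(",")}
    ).explode(text_col, ignore_index=True)
    # Remove the whitespace left around each term by the split.
    terms[text_col] = terms[text_col].str.strip()

    counts = (
        terms.groupby([group_col, text_col])
        .size()
        .reset_index(name="count")
        # Highest count first within each group; ties broken alphabetically.
        .sort_values([group_col, "count", text_col], ascending=[True, False, True])
    )
    # Keep the first n rows of every group.
    return counts.groupby(group_col).head(n).reset_index(drop=True)


top = top_words_per_group(df, "Property", "Description", n=3)
print(top)
```

Lowercasing is optional here. It follows the `str.lower()` call in the question's own code, so that "Blue" and "blue" would be counted as the same term.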
