Pandas - dataframe containing comments (rows) and words as column headers: how to get a frequency count?

Problem description

I am trying to perform a word frequency count on a relatively large dataframe and don't know what approach would be the best.

Currently my dataframe looks like this -

 Comment        'I'    'it'    'is'    'up'

'I was here'    NaN    NaN     NaN     NaN
'I like soup'   NaN    NaN     NaN     NaN
'whats up'      NaN    NaN     NaN     NaN
'This is it'    NaN    NaN     NaN     NaN

My goal is to perform a frequency count for each of the words in the column headers ('I', 'it', 'is', 'up') for each comment. E.g. after the counting process the result should look something like this -

 Comment        'I'    'it'    'is'    'up'

'I was here'     1      0        0      0
'I like soup'    1      0        0      0
'whats up'       0      0        0      1
'This is it'     0      1        1      0

What would be the best approach to this? The real dataset contains about 50k comments and over 10k columns with different words.

Tags: python, pandas, numpy, nlp, nltk

Solution


I don't think there is a better approach than the following:

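import re
for column in df.columns[1:]:  # All but the Comment column.
    # Test each comment for the header word (whole word, case-sensitive).
    df[column] = df['Comment'].str.contains(rf'\b{re.escape(column)}\b')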

This gives you a Boolean matrix, which you can map to bits (0/1) if you really need to, for example with astype(int).
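
A Boolean matrix only records whether a word is present, not how often it occurs. If actual per-comment counts are wanted, one possible sketch is to tokenise each comment once and look up the header words. It assumes whitespace tokenisation and case-sensitive, whole-word matching, neither of which is stated in the question; the names words, token_counts, counts and result are illustrative only.

```python
from collections import Counter

import pandas as pd

# Sample data taken from the question.
df = pd.DataFrame(
    {"Comment": ["I was here", "I like soup", "whats up", "This is it"]}
)
words = ["I", "it", "is", "up"]  # the word column headers

# Count every whitespace-separated token once per comment.
token_counts = [Counter(comment.split()) for comment in df["Comment"]]

# Look up each header word; a Counter returns 0 for missing tokens.
counts = pd.DataFrame(
    [[tc[word] for word in words] for tc in token_counts],
    columns=words,
    index=df.index,
)

result = pd.concat([df[["Comment"]], counts], axis=1)
print(result)
```

Each comment is split only once, so the work per comment does not depend on running a separate string search for every column. With about 50k comments and 10k words the dense result has 500 million cells, so memory use may need attention at that scale.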

