python - One-hot encoding of the strings in a pandas DataFrame column

Problem description

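I have a DataFrame with a "description" column, and I want to build a one-hot-style encoding that also records how many times each word occurs in each description.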

    description
0   test words that describe things
1   more and more words here
2   things test

Desired output

    test   words  that describe things more  here  and
0   1.0    1.0    1.0    1.0    1.0    0.0   0.0   0.0
1   0.0    1.0    0.0    0.0    0.0    2.0   1.0   1.0
2   1.0    0.0    0.0    0.0    1.0    0.0   0.0   0.0

My current solution is:

one_hot = df.apply(lambda x: pd.Series(x.description).str.split(expand=True).stack().value_counts(), axis=1)

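On a large dataset (130K rows) this becomes very slow (2.6 ms per row), and I'm wondering whether there is a better solution. I would also like to drop words that appear in only one entry: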

    test   words  things
0   1.0    1.0    1.0
1   0.0    1.0    0.0
2   1.0    0.0    1.0

Tags: python, pandas

IIUC, for the counts you can call get_dummies and then do a groupby + sum along axis=1:

final = (pd.get_dummies(df['description'].str.split(expand=True))
         .groupby(lambda x: x.split('_')[-1],axis=1).sum())
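
In recent pandas releases (2.1 and later, as far as I know) `groupby(..., axis=1)` is deprecated, and `get_dummies` returns boolean columns by default. The sketch below is one possible adaptation rather than the original answer: it transposes before grouping and passes `dtype=int`. Splitting the column name only at the first underscore is my own tweak so that words containing `_` are not truncated.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({"description": [
    "test words that describe things",
    "more and more words here",
    "things test",
]})

# One indicator column per (position, word) pair, named like "0_test"
dummies = pd.get_dummies(df["description"].str.split(expand=True), dtype=int)

# Group the columns by word (the part after the position prefix) and sum.
# Transposing first avoids the deprecated axis=1 argument of groupby.
final = dummies.T.groupby(lambda col: col.split("_", 1)[1]).sum().T

print(final)
```

The result has one row per description and one column per distinct word, with columns in alphabetical order.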

Or with apply (slower):

df['description'].str.split(expand=True).apply(pd.value_counts,axis=1).fillna(0)

   and  describe  here  more  test  that  things  words
0    0         1     0     0     1     1       1      1
1    1         0     1     2     0     0       0      1
2    0         0     0     0     1     0       1      0
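
Neither snippet above handles the second part of the question, dropping words that occur in only one entry. A minimal sketch of one way to do both steps follows; it counts words with `collections.Counter` from the standard library instead of the answer's pandas string methods, and the names `counts`, `doc_freq` and `filtered` are purely illustrative. I have not benchmarked it against the 130K-row data set, so treat any speed benefit as an assumption to verify.

```python
from collections import Counter

import pandas as pd

# Sample data from the question
df = pd.DataFrame({"description": [
    "test words that describe things",
    "more and more words here",
    "things test",
]})

# One Counter (word -> count) per row; pandas aligns the keys into columns
counts = (
    pd.DataFrame([Counter(text.split()) for text in df["description"]],
                 index=df.index)
    .fillna(0)
    .astype(int)
)

# Document frequency: in how many rows does each word appear at least once?
doc_freq = (counts > 0).sum(axis=0)

# Keep only the words that appear in more than one row
filtered = counts.loc[:, doc_freq > 1]

print(filtered)
```

Note that "more" is dropped even though it occurs twice, because both occurrences are in the same row; the filter looks at the number of rows containing a word, not at the total count.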
