python - How to find the words unique to a category - Python
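Problem description
I have a DataFrame in which column1 contains text data and column2 contains the category of the text in column1. I want to find the words that appear in the text of one category (e.g. Informal) but do not appear in any other category. Multiple rows in the DataFrame can share the same category.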
Textual                        Category
Hi johnny how are you today    Informal
Dear Johnny                    Formal
Hey Johnny                     Informal
To Johnny                      Formal
Example output:
Informal: [Hi, how, are, you, today, Hey]
Formal: [Dear, To]
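For anyone who wants to reproduce this, the sample data can be built roughly as follows. This is only a sketch: the column names `Textual` and `Category` are taken from the table above, and the variable name `df` is an assumption.

```python
import pandas as pd

# Sample data copied from the table above
df = pd.DataFrame({
    "Textual": [
        "Hi johnny how are you today",
        "Dear Johnny",
        "Hey Johnny",
        "To Johnny",
    ],
    "Category": ["Informal", "Formal", "Informal", "Formal"],
})
print(df)
```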
Solution
import pandas as pd

# df is the DataFrame from the question (columns Textual and Category)
# Remove punctuation (regex=False so that '.' and '?' are matched literally)
for mark in ['.', ',', '?']:
    df.Textual = df.Textual.str.replace(mark, '', regex=False)
# Get the list of distinct words per Category; upper-case first so that
# "johnny" and "Johnny" count as the same word
df1 = df.groupby(['Category'])['Textual'].apply(' '.join).reset_index()
df1['Textual'] = df1.Textual.str.upper().str.split().apply(lambda x: list(set(x)))
print(df1)
# Split the lists into separate columns
df = pd.DataFrame(df1.Textual.values.tolist(), index=df1.index)
print(df)
# Reshape the df to have one row per word (dropna removes the padding
# added where one category has fewer words than another)
df['Category'] = df1.Category
df = df.set_index("Category")
df = df.stack().dropna()
print(df)
# Drop words that are present in several Categories
df = df.drop_duplicates(keep=False)
print(df)
# Reshape the df to the expected output (words are upper-cased, order is arbitrary)
df = df.groupby('Category').apply(list)
print(df)
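The same idea can also be sketched without reshaping the DataFrame, by collecting the words of each category and then filtering out those that also occur elsewhere. This is an alternative sketch, not the method used in the answer above: the helper name `unique_words_per_category` is hypothetical, words are compared case-insensitively, and only the punctuation marks `.`, `,` and `?` are stripped, as in the answer.

```python
import pandas as pd


def unique_words_per_category(df, text_col="Textual", cat_col="Category"):
    """Return {category: [words found in that category only]}.

    Words are compared case-insensitively; the first spelling seen is kept.
    """
    # Map each category to {lower-cased word: first original spelling}.
    # Dicts preserve insertion order, so word order follows the text.
    words = {}
    for text, cat in zip(df[text_col], df[cat_col]):
        seen = words.setdefault(cat, {})
        for word in text.split():
            word = word.strip(".,?")
            if word:
                seen.setdefault(word.lower(), word)

    result = {}
    for cat, seen in words.items():
        # Collect the lower-cased words used by every other category
        others = set()
        for other_cat, other_seen in words.items():
            if other_cat != cat:
                others.update(other_seen)
        result[cat] = [w for key, w in seen.items() if key not in others]
    return result


# Sample data from the question
df = pd.DataFrame({
    "Textual": [
        "Hi johnny how are you today",
        "Dear Johnny",
        "Hey Johnny",
        "To Johnny",
    ],
    "Category": ["Informal", "Formal", "Informal", "Formal"],
})
print(unique_words_per_category(df))
# {'Informal': ['Hi', 'how', 'are', 'you', 'today', 'Hey'], 'Formal': ['Dear', 'To']}
```

Unlike the reshaping approach, this version keeps the original spelling and the order in which the words first appear, which matches the example output in the question.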