python - How to use the words from a specific column as column labels in a new DataFrame
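Problem description
Hi friends, I'm new here.
I want to build a matrix from the most frequently repeated words in a specific column, A,
and add the selected words to my DataFrame as column labels.
What I have: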
raw_data={"A":["This is yellow","That is green","These are orange","This is a pen","This is an Orange"]}
df=pd.DataFrame(raw_data)
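My goal:
I want to:
1- Split the strings and count the words in a specific column
2- Make a zero matrix
3- Label the columns of the new matrix with the words found in step 1 (this is my problem)
4- Search each row: 1 if the word is present, else 0
The result is the new DataFrame I currently get: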
A word_count char_count 0 1 2 3 4 5 6 7 8 9 10 11
0 This is yellow 3 14 1.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 That is green 3 13 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
2 These are orange 3 16 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0
3 This is a pen 4 13 1.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0
4 This is an Orange 4 17 1.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0
What I did:
import pandas as pd
import numpy as np
# 1- Data frame
raw_data={"A":["This is yellow","That is green","These are orange","This is a pen","This is an Orange"]}
df=pd.DataFrame(raw_data)
df
# 2- Count the words and characters in every row of column "A"
df['word_count'] = df['A'].agg(lambda x: len(x.split(" ")))
df['char_count'] = df['A'].agg(lambda x:len(x))
df
# 3- Count the separated words and how often each one is repeated
df_word_count=pd.DataFrame(df.A.str.split(' ').explode().value_counts()).reset_index().rename({'index':"A","A":"Count"},axis=1)
display(df_word_count)
df_word_count=list(df_word_count["A"])
len(df_word_count)
A Count
0 is 4
1 This 3
2 orange 1
3 That 1
4 yellow 1
5 Orange 1
6 are 1
7 a 1
8 an 1
9 These 1
10 green 1
11 pen 1
# 4- Make a ZERO-Matrix
allfeatures=np.zeros((df.shape[0],len(df_word_count)))
allfeatures.shape
# 5- Loop over the words and count each word's occurrences in every row
for i in range(len(df_word_count)):
    allfeatures[:,i]=df['A'].agg(lambda x:x.split().count(df_word_count[i]))
# 6- Concat the data
Complete_data=pd.concat([df,pd.DataFrame(allfeatures)],axis=1)
display(Complete_data)
What I want:
The words in column "A" from step 3
should be the labels of the new matrix, instead of 0 1 2 ...
A word_count char_count is This orange etc.
0 This is yellow 3 14 1.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 That is green 3 13 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
2 These are orange 3 16 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0
3 This is a pen 4 13 1.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0
4 This is an Orange 4 17 1.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0
Solution
I changed your code slightly. Your step 3 now looks like this:
# 3- Count the separated words and how often each one is repeated
df_word_count=pd.DataFrame(df.A.str.split(' ').explode().value_counts()).reset_index().rename({'index':"A","A":"Count"},axis=1)
display(df_word_count)
list_word_count=list(df_word_count["A"])
len(list_word_count)
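The main change is the variable name: list_word_count=list(df_word_count["A"]). This keeps df_word_count as a DataFrame instead of overwriting it with a list, so it can still be used later.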
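As a side note, here is a minimal sketch of another way to get the same word list. As far as I know, newer pandas versions name the columns produced by value_counts().reset_index() differently, so the rename in step 3 may not yield a column called "A" everywhere. Reading the words from the index of the counted Series avoids that. The name word_counts is my own and is not part of the original code.

```python
import pandas as pd

raw_data = {"A": ["This is yellow", "That is green", "These are orange",
                  "This is a pen", "This is an Orange"]}
df = pd.DataFrame(raw_data)

# value_counts() returns a Series indexed by word, most frequent first.
# word_counts is a hypothetical name used only for this sketch.
word_counts = df["A"].str.split().explode().value_counts()

# Take the words from the index, so nothing depends on how
# reset_index() names its columns in a given pandas version.
list_word_count = word_counts.index.tolist()

print(word_counts.head(2))
print(len(list_word_count))
```

The list can then be used in steps 4 to 6 exactly like the one built from df_word_count["A"].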
With the new variable in place, the rest of the code looks like this:
# 4- Make a ZERO-Matrix
allfeatures=np.zeros((df.shape[0],len(list_word_count)))
allfeatures.shape
# 5- Loop over the words and count each word's occurrences in every row
for i in range(len(list_word_count)):
    allfeatures[:,i]=df['A'].agg(lambda x:x.split().count(list_word_count[i]))
# 6- Concat the data
Complete_data=pd.concat([df,pd.DataFrame(allfeatures)],axis=1)
display(Complete_data)
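Steps 4 to 6 build the matrix by hand. As a hedged aside, pandas also has Series.str.get_dummies, which splits each string on a separator and returns one 0/1 column per distinct word, already labeled with the word. It is not the same as the loop above: it marks presence rather than counting repeats, and the column order may not follow word frequency. The names indicators and labeled are my own.

```python
import pandas as pd

raw_data = {"A": ["This is yellow", "That is green", "These are orange",
                  "This is a pen", "This is an Orange"]}
df = pd.DataFrame(raw_data)

# One indicator column per distinct word, split on single spaces.
# The match is case-sensitive, so "orange" and "Orange" stay separate.
indicators = df["A"].str.get_dummies(sep=" ")

# Put the labeled indicator columns next to the original column.
labeled = pd.concat([df, indicators], axis=1)
print(labeled)
```

If the exact counts or the frequency order matter, the manual loop is still the safer route.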
So far the only change is the variable name. What I added is step 7:
# 7- Rename the columns from a list
# This creates a list of the words you wanted
l = list(df_word_count["A"])
# This list contains only the words found in column A,
# but the result you showed also has columns such as word_count and char_count,
# so we insert those names at the beginning of the list
l.insert(0,"char_count")
l.insert(0,"word_count")
l.insert(0,"A")
# Finally, I rename all the columns with the names that I have in the list l
Complete_data.columns = l
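Renaming by position works as long as the list l has exactly one entry per column, in the same order. A sketch of an alternative, assuming the same data as in the question: give the matrix its labels when it is wrapped in a DataFrame, so the existing columns never need renaming. The name features is my own, and I build the word list from the value_counts() index here to keep the snippet self-contained.

```python
import numpy as np
import pandas as pd

raw_data = {"A": ["This is yellow", "That is green", "These are orange",
                  "This is a pen", "This is an Orange"]}
df = pd.DataFrame(raw_data)
df["word_count"] = df["A"].str.split().str.len()
df["char_count"] = df["A"].str.len()

# Word list, most frequent first (taken from the value_counts() index).
list_word_count = df["A"].str.split().explode().value_counts().index.tolist()

# Same zero matrix and counting loop as in steps 4 and 5.
allfeatures = np.zeros((df.shape[0], len(list_word_count)))
for i, word in enumerate(list_word_count):
    allfeatures[:, i] = df["A"].apply(lambda x: x.split().count(word))

# Label the matrix columns with the words at construction time.
# features is a hypothetical name used only for this sketch.
features = pd.DataFrame(allfeatures, columns=list_word_count, index=df.index)
Complete_data = pd.concat([df, features], axis=1)
print(Complete_data.columns.tolist())
```

With this variant the first three column names come straight from df, so there is no need to insert "A", "word_count" and "char_count" into a list by hand.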