How to assign the words in a specific column as labels to a new data frame

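Problem description

Hi friends, I'm new here.

I want to build a matrix from the most frequent words in a specific column, A, and add it to my data frame with those words as the column labels.

What I have: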

raw_data={"A":["This is yellow","That is green","These are orange","This is a pen","This is an Orange"]}
df=pd.DataFrame(raw_data)

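My goal:

I want to:

1- Split the strings and count the words in a specific column

2- Make a zero matrix

3- Label the new matrix with the words found in step 1 (this is my question)

4- Search each row: if the word occurs, set 1, else 0

The new data frame I get as a result: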

    A                   word_count  char_count  0   1   2   3   4   5   6   7   8   9   10  11
0   This is yellow      3           14          1.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1   That is green       3           13          1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
2   These are orange    3           16          0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0
3   This is a pen       4           13          1.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0
4   This is an Orange   4           17          1.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0

What I did:

import pandas as pd
import numpy as np

# 1- Data frame
raw_data={"A":["This is yellow","That is green","These are orange","This is a pen","This is an Orange"]}
df=pd.DataFrame(raw_data)
df

## 2- Count the words and characters in every row of column "A"
df['word_count'] = df['A'].agg(lambda x: len(x.split(" ")))
df['char_count'] = df['A'].agg(lambda x:len(x))
df

# 3- Count the separated words and how often each one is repeated
df_word_count=pd.DataFrame(df.A.str.split(' ').explode().value_counts()).reset_index().rename({'index':"A","A":"Count"},axis=1)
display(df_word_count)
df_word_count=list(df_word_count["A"])
len(df_word_count)

    A       Count
0   is      4
1   This    3
2   orange  1
3   That    1
4   yellow  1
5   Orange  1
6   are     1
7   a       1
8   an      1
9   These   1
10  green   1
11  pen     1


# 4- Make a ZERO-Matrix 
allfeatures=np.zeros((df.shape[0],len(df_word_count)))
allfeatures.shape

# 5- Make a For-Loop 
for i in range(len(df_word_count)):
  allfeatures[:,i]=df['A'].agg(lambda x:x.split().count(df_word_count[i]))

# 6- Concat the data
Complete_data=pd.concat([df,pd.DataFrame(allfeatures)],axis=1)
display(Complete_data)

What I want:

The words in column "A" from step 3 should be the labels of the new matrix, instead of 0 1 2 ...

    A                   word_count  char_count  is   This  orange  etc.
0   This is yellow      3           14          1.0  1.0   0.0     0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1   That is green       3           13          1.0  0.0   0.0     1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
2   These are orange    3           16          0.0  0.0   1.0     0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0
3   This is a pen       4           13          1.0  1.0   0.0     0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0
4   This is an Orange   4           17          1.0  1.0   0.0     0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0

Tags: python, pandas, dataframe

Solution


I changed your code slightly, so your step 3 now looks like this:

# 3- Count the separated words and how often each one is repeated
df_word_count=pd.DataFrame(df.A.str.split(' ').explode().value_counts()).reset_index().rename({'index':"A","A":"Count"},axis=1)
display(df_word_count)
list_word_count=list(df_word_count["A"])
len(list_word_count)

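The main change is the name of the variable: list_word_count=list(df_word_count["A"]). The list now gets its own name instead of overwriting the data frame df_word_count, which is still needed later.

With the new variable, the rest of the code looks like this: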
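A side note on step 3: the rename call depends on the column names that reset_index() produces after value_counts(), and as far as I know those names differ between pandas versions. A minimal sketch that sidesteps this, assuming only the sample data from the question, takes the words straight from the index of the value_counts() result (word_counts is a name introduced here for illustration):

```python
import pandas as pd

df = pd.DataFrame({"A": ["This is yellow", "That is green", "These are orange",
                         "This is a pen", "This is an Orange"]})

# Count every word in column "A"; the words become the index of the result,
# sorted from most to least frequent.
word_counts = df["A"].str.split().explode().value_counts()

# Reading the words from the index does not rely on the column names
# that reset_index() would generate.
list_word_count = list(word_counts.index)
print(len(list_word_count))
```

The resulting list_word_count can be used in the later steps exactly like the list built from df_word_count["A"].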

# 4- Make a ZERO-Matrix 
allfeatures=np.zeros((df.shape[0],len(list_word_count)))
allfeatures.shape

# 5- Make a For-Loop 
for i in range(len(list_word_count)):
  allfeatures[:,i]=df['A'].agg(lambda x:x.split().count(list_word_count[i]))

# 6- Concat the data
Complete_data=pd.concat([df,pd.DataFrame(allfeatures)],axis=1)
display(Complete_data)

Up to here, the only change is the variable name. What I added is a seventh step:

# 7- Change the column names from a list
# This creates a list of the words you wanted
l = list(df_word_count["A"])
# At this point the list holds only the words found in column A.
# The result you showed also has the columns A, word_count and char_count,
# so we insert those names at the beginning of the list.
l.insert(0,"char_count")
l.insert(0,"word_count")
l.insert(0,"A")

# Finally, rename all the columns with the names in the list l
Complete_data.columns = l
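As a possible alternative to renaming afterwards, the labels can be attached at the moment the matrix is wrapped in a data frame, by passing the word list as columns. The sketch below is self-contained and rebuilds the data from the question. It uses apply instead of agg for the element-wise steps, which is my own substitution on the assumption that it behaves the same across pandas versions, and features is a name introduced here.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": ["This is yellow", "That is green", "These are orange",
                         "This is a pen", "This is an Orange"]})
df["word_count"] = df["A"].str.split().str.len()
df["char_count"] = df["A"].str.len()

# Distinct words, most frequent first
list_word_count = list(df["A"].str.split().explode().value_counts().index)

# Zero matrix: one row per sentence, one column per word
allfeatures = np.zeros((df.shape[0], len(list_word_count)))
for i, word in enumerate(list_word_count):
    # How many times the word occurs in each row (case-sensitive)
    allfeatures[:, i] = df["A"].apply(lambda s: s.split().count(word))

# Passing columns= labels the matrix with the words instead of 0, 1, 2, ...
features = pd.DataFrame(allfeatures, columns=list_word_count, index=df.index)
Complete_data = pd.concat([df, features], axis=1)
print(Complete_data.columns.tolist())
```

With this variant the insert calls of step 7 are not needed, because df keeps its own column names and the new columns already carry the words.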

I get: [image: the resulting data frame]
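If only presence or absence of each word matters (1 or 0), a shorter route may be Series.str.get_dummies, which builds the labelled indicator columns in one call. This is a sketch of a different technique, not the method used in the answer above: it returns integer 0/1 flags rather than float counts, and a word repeated within one row would still give 1. The names word_flags, order and result are introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({"A": ["This is yellow", "That is green", "These are orange",
                         "This is a pen", "This is an Orange"]})

# One indicator column per distinct word, labelled with the word itself.
# The columns come out in sorted order, not ordered by frequency.
word_flags = df["A"].str.get_dummies(sep=" ")

# Optional: reorder the columns so the most frequent words come first
order = df["A"].str.split().explode().value_counts().index
word_flags = word_flags[order]

result = pd.concat([df, word_flags], axis=1)
print(result)
```

Like the loop version, this is case-sensitive, so "orange" and "Orange" stay separate columns.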

