pandas explode produces unexpected results

Problem description

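I am trying to explode one column of a DataFrame into multiple rows. The column to explode is called keywords; it holds the list of emotions that the FlashText package returns as keywords. That is, if a keyword occurs in the text column (the column with the sentences), FlashText returns the emotion or emotions corresponding to that sentence.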

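With a sample DataFrame that I created, this works perfectly and gives the expected output, but when applied to the real DataFrame it returns seemingly random combinations of rows.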

I think the unexpected result occurs because the DataFrame has duplicate index values; however, removing them gives the same wrong result.
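One thing that may be worth checking, offered as a sketch rather than a confirmed diagnosis: if the index is reset only after `explode`, any duplicate labels are still present while the column is being exploded. The example below uses two small hypothetical frames (`part_a`, `part_b`) to stand in for concatenated data, checks whether the index is unique, and resets it before calling `explode`. Whether duplicate labels actually cause wrong rows may depend on the pandas version in use, so treat that as an assumption to verify.

```python
import pandas as pd

# Hypothetical frames standing in for the pieces that were concatenated.
part_a = pd.DataFrame({"text": ["a b", "c"], "keywords": [["x", "y"], []]})
part_b = pd.DataFrame({"text": ["d", "e f"], "keywords": [["z"], ["x", "z"]]})

# pd.concat keeps the original labels, so the index here is 0, 1, 0, 1.
df = pd.concat([part_a, part_b])
has_duplicates = not df.index.is_unique

# Make the index unique BEFORE exploding, then renumber the exploded rows.
df = df.reset_index(drop=True)
exploded = df.explode("keywords").reset_index(drop=True)

print(has_duplicates)
print(exploded)
```

Rows whose keyword list is empty come out of `explode` as NaN, which is why the question's code fills them afterwards.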

Expected output

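import pandas as pd
from flashtext import KeywordProcessor

# keywords_dict maps each emotion label to its list of trigger words (defined elsewhere)
kp = KeywordProcessor()
kp.add_keywords_from_dict(keyword_dict=keywords_dict)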


test_df = pd.DataFrame({'text': ['I really hate and love love everyone best confident shy', 'i should be sleeping i have a stressed out week coming to me',
                                 'late night snack glass of oj bc im quotdown with the sicknessquot then back to sleepugh i hate getting sick', 
                                 
                                 # NaN results to empty list
                                 'whatever', 
                                 '[]', 
                                 'body of missing northern calif girl found poli', 
                                 'i miss kenny powers',

                                 'sorry  tell them mea culpa from me and that i really am sorry'
                        ]
                        })

# Extracting keywords
test_df['keywords'] = test_df['text'].apply(lambda x: kp.extract_keywords(x, span_info=False))

# Exploding keywords column into rows
test_df = test_df.explode('keywords').reset_index(drop=True)#.drop('index', 1) # drop duplicate indexes

# Transforming NaN into empty list
test_df['keywords'] = test_df['keywords'].fillna({i: [] for i in test_df.index})


test_df
    text                                                keywords
0   I really hate and love love everyone best conf...   unfriendly
1   I really hate and love love everyone best conf...   friendly
2   I really hate and love love everyone best conf...   friendly
3   I really hate and love love everyone best conf...   confident
4   I really hate and love love everyone best conf...   insecure
5   i should be sleeping i have a stressed out wee...   neg_hp
6   late night snack glass of oj bc im quotdown wi...   unfriendly
7   whatever                                            []
8   []                                                  []
9   body of missing northern calif girl found poli      []
10  i miss kenny powers                                 []
11  sorry tell them mea culpa from me and that i ...    sadness
12  sorry tell them mea culpa from me and that i ...    sadness

Current output without explode

Here the sentence i miss kenny powers returns an empty list.


Current output with explode

Here the sentence i miss kenny powers returns the emotion confident, which is wrong.


DataFrame: DataFrame sample 40k

Tags: python, pandas, dataframe, explode

Solution


The solution that currently works for me uses the csv module:

# New solution : exploding with csv
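import ast
import csv

CSV_PATH = 'temp_data.csv'
data = []

df_concat.to_csv(CSV_PATH)

# newline='' is what the csv module recommends when opening files
with open(file=CSV_PATH, mode='r', newline='') as f:
    reader = csv.DictReader(f)
    columns = reader.fieldnames

    print(columns)

    for record in reader:
        # Parse the stringified list safely instead of using eval
        keywords = ast.literal_eval(record['keywords'])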

        if not keywords:
            data.append((record['text'], '[]')) #record['category'], record['Valence'], record['Arousal'], record['Dominance']

        for keyword in keywords:
            data.append((record['text'], keyword)) #record['category'], record['Valence'], record['Arousal'], record['Dominance']

df_concat = pd.DataFrame(data, columns=['text', 'keywords'])
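The same row-by-row expansion can probably be done in memory, without the temporary CSV file and without parsing stringified lists. The sketch below is an illustration only: `explode_rows` and its `empty_marker` parameter are hypothetical names, and it assumes each input row is a `(text, keywords)` pair where `keywords` is already a Python list.

```python
def explode_rows(rows, empty_marker="[]"):
    """Return one (text, keyword) pair per keyword.

    Rows with an empty keyword list are kept once, paired with empty_marker,
    mirroring what the CSV-based loop does.
    """
    out = []
    for text, keywords in rows:
        if not keywords:
            out.append((text, empty_marker))
            continue
        for keyword in keywords:
            out.append((text, keyword))
    return out


# Small hypothetical input modeled on the question's sample data.
sample = [
    ("I really hate and love", ["unfriendly", "friendly"]),
    ("i miss kenny powers", []),
]
exploded = explode_rows(sample)
print(exploded)
```

With the real data, the input pairs could come from `zip(df_concat['text'], df_concat['keywords'])`, and the result can be passed to `pd.DataFrame(exploded, columns=['text', 'keywords'])`. Because the loop never looks at index labels, duplicate index values cannot affect it.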
