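python - pandas explode produces unexpected results
Problem description
I am trying to explode one column of a dataframe into multiple rows. The column being exploded is called keywords; it holds the list of sentiments that the FlashText package returns as keywords. If a keyword occurs in the text column (the column containing the sentences), FlashText returns the sentiment, or sentiments, that correspond to that sentence.
On the sample dataframe I created, this produces exactly the expected output, but when applied to the real dataframe it returns seemingly random combinations of rows.
I suspected the unexpected result was caused by duplicate index values in the dataframe; however, removing the duplicates gives the same wrong result.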
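One way to test the duplicate-index suspicion is to reset the index before calling explode rather than after it. explode repeats the source row's index label for every list element, so resetting afterwards cannot undo a non-unique index that already existed. The sketch below uses a small made-up frame (the texts, keywords and index labels are illustrative, not taken from the real data) and is only a check to try, not a confirmed diagnosis.

```python
import pandas as pd

# Illustrative frame with a deliberately duplicated index label (0 appears twice).
df = pd.DataFrame(
    {"text": ["a b", "c", "d"], "keywords": [["x", "y"], [], ["z"]]},
    index=[0, 0, 1],
)
has_duplicates = not df.index.is_unique

# Make the index unique first, then explode.
clean = df.reset_index(drop=True)
exploded = clean.explode("keywords")

# Each list element becomes its own row and keeps the label of its source row;
# an empty list becomes a single NaN row.
print(has_duplicates)
print(exploded)
```

If the real dataframe was built with pd.concat or filtered from a larger frame, checking df.index.is_unique before exploding is a cheap first step.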
Expected output
import pandas as pd
from flashtext import KeywordProcessor

# keywords_dict (sentiment -> list of trigger words) is defined elsewhere; it is not shown in the question
kp = KeywordProcessor()
kp.add_keywords_from_dict(keyword_dict=keywords_dict)

test_df = pd.DataFrame({'text': [
    'I really hate and love love everyone best confident shy',
    'i should be sleeping i have a stressed out week coming to me',
    'late night snack glass of oj bc im quotdown with the sicknessquot then back to sleepugh i hate getting sick',
    # NaN results to empty list
    'whatever',
    '[]',
    'body of missing northern calif girl found poli',
    'i miss kenny powers',
    'sorry tell them mea culpa from me and that i really am sorry',
]})

# Extracting keywords
test_df['keywords'] = test_df['text'].apply(lambda x: kp.extract_keywords(x, span_info=False))
# Exploding keywords column into rows, then dropping the duplicated index labels
test_df = test_df.explode('keywords').reset_index(drop=True)
# Transforming NaN into empty list
test_df['keywords'] = test_df['keywords'].fillna({i: [] for i in test_df.index})
test_df
    text                                               keywords
0   I really hate and love love everyone best conf...  unfriendly
1   I really hate and love love everyone best conf...  friendly
2   I really hate and love love everyone best conf...  friendly
3   I really hate and love love everyone best conf...  confident
4   I really hate and love love everyone best conf...  insecure
5   i should be sleeping i have a stressed out wee...  neg_hp
6   late night snack glass of oj bc im quotdown wi...  unfriendly
7   whatever                                           []
8   []                                                 []
9   body of missing northern calif girl found poli     []
10  i miss kenny powers                                []
11  sorry tell them mea culpa from me and that i ...   sadness
12  sorry tell them mea culpa from me and that i ...   sadness
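Current output without explode
Here the sentence i miss kenny powers returns an empty list.
Current output with explode
Here the sentence i miss kenny powers returns the sentiment confident, which is wrong.
Dataframe: dataframe sample, 40k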
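To see which rows are affected on the full dataframe, the exploded result can be compared against the per-row keyword lists from before the explode. The helper below is a hypothetical diagnostic (find_mismatched_rows is not part of pandas or FlashText); it assumes keywords are plain strings and that the unexploded frame is still available.

```python
import pandas as pd

def find_mismatched_rows(original, exploded, text_col="text", kw_col="keywords"):
    """Return exploded rows whose keyword never appeared in the list for that text."""
    # Collect, per sentence, every keyword that was extracted for it.
    allowed = {}
    for text, keywords in zip(original[text_col], original[kw_col]):
        allowed.setdefault(text, set()).update(keywords)

    def is_consistent(row):
        keyword = row[kw_col]
        # NaN or [] placeholders stand for "no keyword" and are not checked.
        if not isinstance(keyword, str):
            return True
        return keyword in allowed.get(row[text_col], set())

    mask = exploded.apply(is_consistent, axis=1)
    return exploded[~mask.astype(bool)]

# Illustrative data: the second sentence has no keywords at all.
original = pd.DataFrame(
    {"text": ["i love it", "i miss kenny powers"], "keywords": [["friendly"], []]}
)
good = original.explode("keywords").reset_index(drop=True)
# A hand-made wrong result, mimicking the symptom described above.
bad = pd.DataFrame(
    {"text": ["i love it", "i miss kenny powers"], "keywords": ["friendly", "confident"]}
)

print(find_mismatched_rows(original, good))  # no rows expected
print(find_mismatched_rows(original, bad))   # the "i miss kenny powers" row
```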
Solution
The workaround that currently works for me uses the csv package:
# New solution: exploding with csv
import ast
import csv
import pandas as pd

CSV_PATH = 'temp_data.csv'
data = []
df_concat.to_csv(CSV_PATH)
with open(file=CSV_PATH, mode='r', newline='') as f:
    reader = csv.DictReader(f)
    columns = reader.fieldnames
    print(columns)
    for record in reader:
        # The list was written to the CSV as its string repr; ast.literal_eval
        # parses it without executing arbitrary code (the original used eval)
        keywords = ast.literal_eval(record['keywords'])
        if not keywords:
            # Keep sentences without keywords, marked with the string '[]'
            data.append((record['text'], '[]'))  # record['category'], record['Valence'], record['Arousal'], record['Dominance']
        for keyword in keywords:
            # One output row per (sentence, keyword) pair
            data.append((record['text'], keyword))  # record['category'], record['Valence'], record['Arousal'], record['Dominance']
df_concat = pd.DataFrame(data, columns=['text', 'keywords'])
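The same row-by-row expansion can probably be done in memory, without writing a temporary CSV and parsing the lists back from strings. This is a minimal sketch, assuming the keywords column still holds real Python lists; the sample frame is made up for illustration, and explode_rows is a hypothetical helper name.

```python
import pandas as pd

def explode_rows(df, text_col="text", kw_col="keywords"):
    """Build one (text, keyword) row per keyword, ignoring the index entirely."""
    rows = []
    for text, keywords in zip(df[text_col], df[kw_col]):
        if not keywords:
            # Same convention as the csv workaround: mark "no keywords" with '[]'.
            rows.append((text, "[]"))
        for keyword in keywords:
            rows.append((text, keyword))
    return pd.DataFrame(rows, columns=[text_col, kw_col])

# Illustrative input with a duplicated index, which the loop never looks at.
sample = pd.DataFrame(
    {
        "text": ["i hate and love it", "i miss kenny powers"],
        "keywords": [["unfriendly", "friendly"], []],
    },
    index=[0, 0],
)
result = explode_rows(sample)
print(result)
```

Because the loop pairs values positionally with zip, duplicate or unsorted index labels cannot affect which keyword ends up next to which sentence.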