
Problem description

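I have got as far as reading multiple text files from a folder, loading them into a single DataFrame, and producing the output below, but I am struggling to reshape it into the desired output (shown further down):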

Name    Col2    Col3    Freq    File_Path
b   h   e   43  xyz/fgghh/something_1.txt
g   j   k   432 xyz/fgghh/something_1.txt
n   q   e   6   xyz/fgghh/something_1.txt
p   p   t   3   xyz/fgghh/something_1.txt
uu  l   x   1   xyz/fgghh/something_1.txt
x   r   u   23  xyz/fgghh/something_1.txt
b   h   e   43  xyz/fgghh/something_2.txt
ll  e   e   1   xyz/fgghh/something_2.txt
n   e   e   6   xyz/fgghh/something_2.txt
p   e   e   3   xyz/fgghh/something_2.txt
x   y   z   23  xyz/fgghh/something_2.txt
zz  j   k   432 xyz/fgghh/something_2.txt
b   h   e   43  xyz/fgghh/something.txt
g   j   k   432 xyz/fgghh/something.txt
n   e   e   6   xyz/fgghh/something.txt
p   e   e   3   xyz/fgghh/something.txt
u   e   e   1   xyz/fgghh/something.txt
yyyy    y   z   23  xyz/fgghh/something.txt


import pandas as pd
import os
import glob

dirpath= "......"
filenames = glob.glob("...../*.tsv")
list_of_dfs = [pd.read_csv(filename,sep='\t') for filename in filenames]
for dataframe, filename in zip(list_of_dfs, filenames):
  dataframe['File_Path'] = filename
combined_df = pd.concat(list_of_dfs, ignore_index=True,sort=False)
out_df=combined_df.pivot_table(index='Name', columns='File_Path')
out_df.to_csv(os.path.join(dirpath,'myMerged_file_2.txt'), sep='\t', encoding='utf-8',quoting=0,index=False,index_label=None)

out_df=combined_df.pivot_table(index='Name', columns='File_Path') still does not work. I only want the Name column and the Freq values in the output.

I am not sure how to use merge or concat on this data so that the output looks like this (desired output):

Name    something.txt   something_1.txt something_2.txt
yyyy    23
b       43              43              43
g       432             432
p       3               3               3
u       1
n       6               6               6
x                       23              23
uu                      1
zz                                      432
ll                                      1

Tags: python, pandas, join

Solution


First, use os.path.basename to extract the file name from each file path. Then you can use groupby, first, and unstack:

import os
# df is the combined DataFrame (combined_df in the question)
(df.groupby([df.Name, df.File_Path.map(os.path.basename)], sort=False)
   .Freq.first()
   .unstack(1, fill_value=''))

File_Path something_1.txt something_2.txt something.txt
Name                                                   
b                      43              43            43
g                     432                           432
n                       6               6             6
p                       3               3             3
uu                      1                              
x                      23              23              
ll                                      1              
zz                                    432              
u                                                     1
yyyy                                                 23

where

df.File_Path.map(os.path.basename)

0     something_1.txt
1     something_1.txt
2     something_1.txt
3     something_1.txt
4     something_1.txt
5     something_1.txt
6     something_2.txt
7     something_2.txt
8     something_2.txt
9     something_2.txt
10    something_2.txt
11    something_2.txt
12      something.txt
13      something.txt
14      something.txt
15      something.txt
16      something.txt
17      something.txt
Name: File_Path, dtype: object
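
To try the groupby / first / unstack approach without the original files, here is a minimal self-contained sketch. The sample rows are a small subset of the question's data, and the names `file_names`, `wide`, and `tsv` are illustrative rather than taken from the thread. As an assumption that differs from the answer above, it leaves the gaps as missing values and converts to pandas' nullable `Int64` dtype instead of filling with empty strings, so the frequencies stay integers when written out.

```python
import os
import pandas as pd

# A small subset of the question's rows; df stands in for combined_df.
df = pd.DataFrame({
    "Name": ["b", "g", "b", "ll", "b", "g"],
    "Freq": [43, 432, 43, 1, 43, 432],
    "File_Path": [
        "xyz/fgghh/something_1.txt",
        "xyz/fgghh/something_1.txt",
        "xyz/fgghh/something_2.txt",
        "xyz/fgghh/something_2.txt",
        "xyz/fgghh/something.txt",
        "xyz/fgghh/something.txt",
    ],
})

# Reduce each path to its bare file name; the Series keeps the name "File_Path".
file_names = df["File_Path"].map(os.path.basename)

# One row per Name, one column per file, first Freq seen for each pair.
wide = (
    df.groupby(["Name", file_names], sort=False)["Freq"]
      .first()
      .unstack("File_Path")
      .astype("Int64")  # nullable integers: gaps become <NA> instead of NaN floats
)

# With no path given, to_csv returns the text; missing cells are written empty.
tsv = wide.to_csv(sep="\t")
print(tsv)
```

Passing a file path as the first argument to `to_csv` writes the same tab-separated text to disk; the index is kept by default, so the Name column survives.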

Another option is to use crosstab:

(pd.crosstab(index=df.Name, 
             columns=df.File_Path.map(os.path.basename), 
             values=df.Freq, 
             aggfunc='sum')
   .fillna(''))

File_Path something.txt something_1.txt something_2.txt
Name                                                   
b                    43              43              43
g                   432             432                
ll                                                    1
n                     6               6               6
p                     3               3               3
u                     1                                
uu                                    1                
x                                    23              23
yyyy                 23                                
zz                                                  432
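
The pivot_table call from the question can also be made to produce this shape. The sketch below is not part of the original answer: it assumes the fix is to name the value column explicitly with `values='Freq'` and to pick an aggregation (`'first'` here, mirroring the groupby version). The sample rows and the `out_df` / `tsv` names are illustrative.

```python
import os
import pandas as pd

# A few rows in the same layout as the question's combined DataFrame.
combined_df = pd.DataFrame({
    "Name": ["b", "g", "b", "ll", "b", "g"],
    "Col2": ["h", "j", "h", "e", "h", "j"],
    "Col3": ["e", "k", "e", "e", "e", "k"],
    "Freq": [43, 432, 43, 1, 43, 432],
    "File_Path": [
        "xyz/fgghh/something_1.txt",
        "xyz/fgghh/something_1.txt",
        "xyz/fgghh/something_2.txt",
        "xyz/fgghh/something_2.txt",
        "xyz/fgghh/something.txt",
        "xyz/fgghh/something.txt",
    ],
})

out_df = (
    combined_df
    # Replace the full paths with bare file names before pivoting.
    .assign(File_Path=combined_df["File_Path"].map(os.path.basename))
    # values='Freq' restricts the result to the frequency column only.
    .pivot_table(index="Name", columns="File_Path", values="Freq", aggfunc="first")
)

# Keep the index when exporting; index=False would drop the Name column.
tsv = out_df.to_csv(sep="\t")
print(tsv)
```

Unlike the groupby version with `sort=False`, pivot_table sorts both the row labels and the column labels, so the file columns come out in alphabetical order.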
