How to select and order columns in a DataFrame using an array in Python

Problem description

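I have a fairly large DataFrame df2 (about 50,000 rows x 2,000 columns). The column headers are sample names. I also have a DataFrame df1 whose index lists the samples I want to include in the analysis. I want to use the sample list from the df1 index to select only the columns of df2 for those samples and discard the rest. I also want to preserve the sample order from the df1 index.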

Example data:

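import pandas as pd

# df1
data1 = {'Sample': ['Sample_A', 'Sample_D', 'Sample_E'],
        'Location': ['Bangladesh', 'Myanmar', 'Thailand'],
        'Year': [2012, 2014, 2015]}
df1 = pd.DataFrame(data1)
# Note: set_index returns a new DataFrame; without assignment, 'Sample' stays a regular column of df1
df1.set_index('Sample')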

# df2
data2 = {'Num': ['Value_1','Value_2','Value_3','Value_4','Value_5'], 
        'Sample_A': [0,1,0,0,1],
        'Sample_B':[0,0,1,0,0],
        'Sample_C':[1,0,0,0,1],
        'Sample_D':[0,0,1,1,0]}
df2 = pd.DataFrame(data2)
df2 = df2.set_index('Num')  # assign the result; set_index does not modify df2 in place

First, I build the list of samples I want from df1, for example:

samples = df1['Sample'].tolist()

'samples' is then:

['Sample_A', 'Sample_D', 'Sample_E']

Using 'samples', the output DataFrame df3 I want should look like this:

index  Sample_A  Sample_D
Value_1  0  0
Value_2  1  0
Value_3  0  1
Value_4  0  1
Value_5  1  0

But if I use

df3 = df2[samples]

I get the error message:

"['Sample_E'] not in index"

So how can I ignore the samples that are not found in df2 and avoid this error?

Update: a solution that works -

# 1. Define samples to use from df1
samples = df1['Sample'].tolist()
# Only include samples that are found in df2 as well
final_samples = list(set(df2.columns) & set(samples))
# Make new df with columns corresponding to final_samples
df3 = df2.loc[:, final_samples]
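
One caveat with the set-based update above: a Python set has no defined order, so final_samples may not follow the sample order in df1, which the question asks to preserve. A minimal sketch that keeps the order, assuming the example data from the question (the reversed sample order below is hypothetical and only there to make the effect visible):

```python
import pandas as pd

# Example data from the question, with 'Num' as the index of df2.
df2 = pd.DataFrame({
    'Num': ['Value_1', 'Value_2', 'Value_3', 'Value_4', 'Value_5'],
    'Sample_A': [0, 1, 0, 0, 1],
    'Sample_B': [0, 0, 1, 0, 0],
    'Sample_C': [1, 0, 0, 0, 1],
    'Sample_D': [0, 0, 1, 1, 0],
}).set_index('Num')

# Hypothetical sample order, chosen to differ from the column order in df2.
samples = ['Sample_D', 'Sample_A', 'Sample_E']

# Build the set once so each membership test stays cheap with ~2,000 columns.
available = set(df2.columns)

# Keep only the samples present in df2, in the order given by `samples`.
final_samples = [s for s in samples if s in available]

df3 = df2[final_samples]
print(df3)
```

The list comprehension walks `samples` in order, so the resulting columns follow df1 rather than df2.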

Tags: python, arrays, pandas, dataframe

Solution


Try something like this:

df = pd.read_csv("data.csv", usecols=['Sample_A','Sample_D']).fillna('')
print(df)
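
If the data is read from a file, a list passed to usecols that names a column missing from the file (such as Sample_E) is rejected by read_csv. A possible workaround is to pass a callable instead, which is evaluated against each column name. This is only a sketch: the CSV text below is a stand-in for data.csv with contents assumed from the question's df2, and `wanted` is an illustrative name.

```python
import io
import pandas as pd

# Stand-in for data.csv; the contents are assumed from the question's df2.
csv_text = """Num,Sample_A,Sample_B,Sample_C,Sample_D
Value_1,0,0,1,0
Value_2,1,0,0,0
Value_3,0,1,0,1
Value_4,0,0,0,1
Value_5,1,0,1,0
"""

samples = ['Sample_A', 'Sample_D', 'Sample_E']

# Columns to load: the requested samples plus the label column.
wanted = set(samples) | {'Num'}

# A callable usecols keeps a column only when it returns True for its name,
# so names that are absent from the file are simply never matched.
df = pd.read_csv(io.StringIO(csv_text), usecols=lambda name: name in wanted)
df = df.set_index('Num')

# read_csv returns columns in file order, so reorder to follow `samples`.
df = df[[s for s in samples if s in df.columns]]
print(df)
```

With a 2,000-column file this also avoids parsing the columns that would be discarded anyway.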

To select all rows and only certain columns, use a single colon for the rows.

>>> df.loc[:, ['Sample_A','Sample_D']]

The answer using the dataset you provided:

>>> data2 = {'Num': ['Value_1','Value_2','Value_3','Value_4','Value_5'],
...         'Sample_A': [0,1,0,0,1],
...         'Sample_B':[0,0,1,0,0],
...         'Sample_C':[1,0,0,0,1],
...         'Sample_D':[0,0,1,1,0]}
>>> df2 = pd.DataFrame(data2)
>>> df2.set_index('Num').loc[:, ['Sample_A','Sample_D']]
         Sample_A  Sample_D
Num
Value_1         0         0
Value_2         1         0
Value_3         0         1
Value_4         0         1
Value_5         1         0

======================================

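Note: the output below comes from an older pandas version. Since pandas 1.0, .loc raises a KeyError when any requested label is missing.

>>> df3 = df2.loc[:, samples]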
>>> df3
   Sample_A  Sample_D  Sample_E
0         0         0       NaN
1         1         0       NaN
2         0         1       NaN
3         0         1       NaN
4         1         0       NaN

Or

>>> df3 = df2.reindex(columns=samples)
>>> df3
   Sample_A  Sample_D  Sample_E
0         0         0       NaN
1         1         0       NaN
2         0         1       NaN
3         0         1       NaN
4         1         0       NaN
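
The reindex call above keeps Sample_E as an all-NaN column, whereas the question asks for missing samples to be dropped. One way to get there, sketched under the assumption that the example data from the question is used (the name `missing` is illustrative):

```python
import pandas as pd

# Example data from the question, with 'Num' as the index of df2.
df2 = pd.DataFrame({
    'Num': ['Value_1', 'Value_2', 'Value_3', 'Value_4', 'Value_5'],
    'Sample_A': [0, 1, 0, 0, 1],
    'Sample_B': [0, 0, 1, 0, 0],
    'Sample_C': [1, 0, 0, 0, 1],
    'Sample_D': [0, 0, 1, 1, 0],
}).set_index('Num')

samples = ['Sample_A', 'Sample_D', 'Sample_E']

# reindex tolerates unknown labels and fills them with NaN.
df3 = df2.reindex(columns=samples)

# Samples requested but absent from df2; useful to log before dropping them.
missing = [s for s in samples if s not in df2.columns]

# Drop exactly those placeholder columns, leaving real data untouched.
df3 = df3.drop(columns=missing)
print(df3)
```

Dropping by name rather than with dropna avoids removing a real sample whose values happen to be all NaN.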
