Getting a subset of a text file with pandas

Problem description

I have a large text file that looks like this example:

Example:

    CodeClass   Name    Accession   CF33500_02.txt  CF33503_07.txt  CF33505_06.txt
dd  Endogenous  dd  hh  101.238776  8.084376    1.187888
bb  Endogenous  bb  jj  562.853249  2013.886134 1288.568388
gg  Endogenous  gg  ll  218.148969  184.816378  176.705670
kk  Endogenous  kk  tt  23.499524   155.006161  593.654190

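The first line is the header and the first column holds the row names. I want a subset of this file that keeps every row but contains only these columns in the new file: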

Name,CF33500_02.txt,CF33503_07.txt,CF33505_06.txt

To do that, I am trying to use pandas with the following code:

df = pd.read_table("myfile.txt", index_col=0)
df2 = df.iloc[:, [1, 3, 4, 5]]

But it does not work. Do you know how to fix it? It raises this error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/John/.local/lib/python3.6/site-packages/pandas/core/indexing.py", line 1418, in __getitem__
    return self._getitem_tuple(key)
  File "/home/John/.local/lib/python3.6/site-packages/pandas/core/indexing.py", line 2092, in _getitem_tuple
    self._has_valid_tuple(tup)
  File "/home/John/.local/lib/python3.6/site-packages/pandas/core/indexing.py", line 235, in _has_valid_tuple
    self._validate_key(k, i)
  File "/home/John/.local/lib/python3.6/site-packages/pandas/core/indexing.py", line 2031, in _validate_key
    raise IndexError("positional indexers are out-of-bounds")
IndexError: positional indexers are out-of-bounds

Tags: pandas, file

Solution


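I suggest calling read_table with the separator set to \s+, which splits each line into columns wherever one or more whitespace characters appear. By default read_table splits on tabs, so if the file is actually space-delimited, each line is read as a single column, which would explain the out-of-bounds error. Because the header has one field fewer than the data rows, pandas uses the first column as the row index automatically, so index_col=0 is not needed.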

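import pandas as pd

# A raw string keeps the regex \s+ intact and avoids an invalid-escape warning.
df = pd.read_table("myfile.txt", sep=r"\s+")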

df
    CodeClass   Name  Accession  CF33500_02.txt  CF33503_07.txt CF33505_06.txt
dd  Endogenous  dd      hh       101.238776      8.084376       1.187888
bb  Endogenous  bb      jj       562.853249      2013.886134    1288.568388
gg  Endogenous  gg      ll       218.148969      184.816378     176.705670
kk  Endogenous  kk      tt       23.499524       155.006161     593.654190
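
For anyone who wants to try this without the original file, here is a minimal self-contained sketch. It assumes the file is whitespace-delimited and feeds the sample rows from the question through io.StringIO in place of myfile.txt; the SAMPLE name is only for illustration.

```python
import io

import pandas as pd

# Sample rows copied from the question; stands in for "myfile.txt".
SAMPLE = """\
CodeClass Name Accession CF33500_02.txt CF33503_07.txt CF33505_06.txt
dd Endogenous dd hh 101.238776 8.084376 1.187888
bb Endogenous bb jj 562.853249 2013.886134 1288.568388
gg Endogenous gg ll 218.148969 184.816378 176.705670
kk Endogenous kk tt 23.499524 155.006161 593.654190
"""

# Split on runs of whitespace. The header has 6 names but each data row has
# 7 fields, so pandas takes the first field of each row as the index.
df = pd.read_table(io.StringIO(SAMPLE), sep=r"\s+")

print(df.shape)          # (4, 6)
print(list(df.columns))
print(list(df.index))

# With six parsed columns, the positional selection from the question
# no longer goes out of bounds.
by_position = df.iloc[:, [1, 3, 4, 5]]
print(list(by_position.columns))
```

Checking df.shape and df.columns right after reading is a quick way to confirm the separator was guessed correctly before indexing by position.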

Then subset the DataFrame:

cols_to_keep = ["Name", "CF33500_02.txt", 
                "CF33503_07.txt", "CF33505_06.txt"]

df2 = df[cols_to_keep]
df2
    Name    CF33500_02.txt  CF33503_07.txt  CF33505_06.txt
dd  dd      101.238776      8.084376        1.187888
bb  bb      562.853249      2013.886134     1288.568388
gg  gg      218.148969      184.816378      176.705670
kk  kk      23.499524       155.006161      593.654190
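
Since the question asks for the subset in a new file, the last step could look like the sketch below. It is only an illustration under a few assumptions: the sample rows are read from a string instead of myfile.txt, the output is tab-delimited (the question does not say which delimiter is wanted), and an in-memory buffer stands in for the output path.

```python
import io

import pandas as pd

# Sample rows copied from the question; stands in for "myfile.txt".
SAMPLE = """\
CodeClass Name Accession CF33500_02.txt CF33503_07.txt CF33505_06.txt
dd Endogenous dd hh 101.238776 8.084376 1.187888
bb Endogenous bb jj 562.853249 2013.886134 1288.568388
gg Endogenous gg ll 218.148969 184.816378 176.705670
kk Endogenous kk tt 23.499524 155.006161 593.654190
"""

cols_to_keep = ["Name", "CF33500_02.txt", "CF33503_07.txt", "CF33505_06.txt"]

df = pd.read_table(io.StringIO(SAMPLE), sep=r"\s+")

# Label-based selection: all rows, only the wanted columns.
df2 = df.loc[:, cols_to_keep]

# Write the subset, including the row names, as tab-separated text.
# Passing a file path instead of the buffer would write to disk.
buffer = io.StringIO()
df2.to_csv(buffer, sep="\t")
written = buffer.getvalue()
print(written)
```

Selecting by column name rather than by position keeps the code working even if the column order in the input file changes.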
