How do I load this particular data format into a DataFrame?

Problem description

I have a text file in this format:

====================
Something Something Something
====================
Something Something Something
====================
Something Something Something
====================
Something Something Something
Something Something Something
Something Something Something
====================
Something Something Something

Something Something Something
Something Something Something
====================
Something Something Something
====================
Something Something Something
====================
Something Something Something
====================

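As I have tried to illustrate, there are some line breaks and some blank lines, but the defining feature is that the content I am trying to capture always sits between lines of equals signs.

I tried .read_csv, but that does not work, because each cell in the DataFrame should hold the full text of a block, including its line breaks.

Specifically,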

df = pd.read_csv(x + "/" + file, sep="====================", names=["Content"], engine="python", index_col=False)

The DataFrame I want looks like


   Content
0     Something Something Something
1     Something Something Something\n                 \nSomething Something Something\nSomething Something Something

for example.

Does anyone know how I can do this?

Tags: python, pandas

Solution


Start by defining a custom file reader class:

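class InFile:
    """Reads a file one block at a time; blocks are delimited by '====' lines."""
    def __init__(self, infile):
        self.infile = open(infile)

    def read(self, *args, **kwargs):
        """Return the next block as a string, or '' once the file is exhausted."""
        res = ''
        while not self.infile.closed:
            line = self.infile.readline()
            if not line:
                # End of file: close it, then return whatever was collected,
                # so a final block without a closing separator is not lost.
                self.infile.close()
                break
            if line[:4] == '====':
                # A separator ends the current block, if one has been started.
                if len(res) > 0:
                    break
            else:
                res += line
        return res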

Then convert your input file into a list of strings (some of which are multi-line strings):

ff = InFile('Input.txt')
tbl = []
while True:
    tt = ff.read()
    if not tt:
        break
    tbl.append(tt.strip())

The last step is to create a DataFrame from this list:

import pandas as pd
df = pd.DataFrame({'Content': tbl})
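
As an alternative to the reader class, and not part of the original answer, the whole file can be read into one string and split with a regular expression. This is a minimal sketch under two assumptions: the file fits in memory, and a separator is a line made up only of four or more '=' characters (the class above only checks that a line starts with '===='). The name blocks_to_frame and the sample string are made up for illustration.

```python
import re
import pandas as pd

def blocks_to_frame(text):
    """Split text on separator lines and return a one-column DataFrame."""
    # Assumption: a separator is a whole line of four or more '=' characters.
    parts = re.split(r'^={4,}[ \t]*$', text, flags=re.MULTILINE)
    # Strip surrounding whitespace and drop pieces that end up empty.
    blocks = [p.strip() for p in parts if p.strip()]
    return pd.DataFrame({'Content': blocks})

# Illustrative input: two blocks, the second with an inner blank line.
sample = "====\nA\n====\nB\n\nC\nD\n====\n"
df = blocks_to_frame(sample)
```

For a real file, the text could come from open('Input.txt').read(). Unlike the loop above, this sketch skips blocks that contain only whitespace.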

Unfortunately, if you just run print(df), Pandas prints this DataFrame using the text representation of each \n, so every string (even a multi-line one) still occupies a single line.
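
The stored value itself is not affected by how the frame is displayed. A small sketch with a made-up two-row frame (the data is illustrative, not taken from the question) shows that a cell still holds real line breaks:

```python
import pandas as pd

# Hypothetical data: the second cell contains an inner blank line.
df = pd.DataFrame({'Content': ['one line', 'first\n\nsecond']})

print(df)            # tabular view of the whole frame

cell = df.loc[1, 'Content']
print(repr(cell))    # repr() shows the newline escapes explicitly
print(cell)          # printing the string itself renders the line breaks
```

Only the tabular display is compact; the data in the cell is what was read from the file.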

So a better way to check what has been read is to run a custom loop that prints the index and the Content field of each row:

for idx, row in df.iterrows():
    print(f'  Idx: {idx}')
    print(row.Content)

For your data sample, with a consecutive number appended to the first Something on each line, the result is:

  Idx: 0
Something1 Something Something
  Idx: 1
Something2 Something Something
  Idx: 2
Something3 Something Something
  Idx: 3
Something4 Something Something
Something5 Something Something
Something6 Something Something
  Idx: 4
Something7 Something Something

Something8 Something Something
Something9 Something Something
  Idx: 5
Something10 Something Something
  Idx: 6
Something11 Something Something
  Idx: 7
Something12 Something Something

Note that the output after Something7 contains an empty line, just as in your input file.

