Pandas read_csv on file without space?

Question

Given a set of data that looks like the following, each line is 10 characters in length. The lines are links of a network, each made up of a combination of 4- or 5-character node numbers. Below is an example of the situations I would face:

|10637 4652|

| 1038 1037|

|7061219637|

|82004 2082|

The dataset doesn't care much about spacing. While lines 1, 2 and 4 can be read in Pandas easily with either sep=' ' or delim_whitespace=True, I'm afraid I can't do the same for line 3. There is very little I can do to the input data file, as it's generated by third-party software (apart from doing some formatting in Excel, which seemed counterintuitive). Is there something in Pandas that lets me specify a number of characters (in my case, 5) as a delimiter?

Advice much appreciated.

Tags: python, pandas, dataframe

Answer


I think what you are looking for is pd.read_fwf, which reads fixed-width files. In this case, you would specify the column specifications:

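import io
import pandas as pd
# Characters 1-5 and 6-10 hold the two node numbers; characters 0 and 11 are the pipes.
pd.read_fwf(io.StringIO('''|10637 4652|
| 1038 1037|
|7061219637|
|82004 2082|'''), colspecs=[(1, 6), (6, 11)], header=None)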

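Column specifications are 0-indexed and end-exclusive. You could also use the widths parameter, but I would avoid it until the | characters have been stripped, to ensure your variables are read correctly as numbers rather than as strings that start or end with a pipe.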

In this case, this would yield:

       0      1
0  10637   4652
1   1038   1037
2  70612  19637
3  82004   2082
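For the widths alternative mentioned above, a minimal sketch might look like the following. It assumes the pipes really are part of each line and strips them first, so that only the 10 data characters remain; the names raw, stripped and df_widths are made up for illustration.

```python
import io
import pandas as pd

raw = """|10637 4652|
| 1038 1037|
|7061219637|
|82004 2082|"""

# Assumption: every line is wrapped in pipes. Remove them so each line
# is exactly the 10 data characters (leading spaces are preserved).
stripped = "\n".join(line.strip("|") for line in raw.splitlines())

# Two adjacent fields of 5 characters each.
df_widths = pd.read_fwf(io.StringIO(stripped), widths=[5, 5], header=None)
print(df_widths)
```

With the pipes gone, widths=[5, 5] should describe the same two fields as colspecs=[(0, 5), (5, 10)].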

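I passed header=None because your example data has no header. You may need to adjust that as necessary. I also removed all blank lines from your input. If the input does in fact contain blank lines, then I would first run: '\n'.join((s for s in input_string.split('\n') if len(s.strip()) != 0)) before passing it on for parsing. In that case, you would also need to first load the file as a string, clean it, and then pass it through io.StringIO to read_fwf.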
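The load, clean and parse sequence described above could be sketched roughly as follows. The temporary file is only a stand-in for the real third-party output (its path and contents are assumptions made so the sketch runs on its own); in practice you would open your actual file instead.

```python
import io
import os
import tempfile
import pandas as pd

# Hypothetical input file, written here only so the sketch is runnable.
# It mimics the blank lines between records shown in the question.
sample = "|10637 4652|\n\n| 1038 1037|\n\n|7061219637|\n\n|82004 2082|\n"
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write(sample)
    path = tmp.name

# Load the whole file as a single string, then discard the stand-in file.
with open(path) as fh:
    input_string = fh.read()
os.remove(path)

# Drop lines that are empty or contain only whitespace.
cleaned = "\n".join(s for s in input_string.split("\n") if len(s.strip()) != 0)

# Parse the cleaned text with the same column specifications as before.
df_clean = pd.read_fwf(io.StringIO(cleaned), colspecs=[(1, 6), (6, 11)], header=None)
print(df_clean)
```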

