Filter a text file by the timestamp at the start of each line

Question

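I have a huge text file, and I want to grab the lines whose data falls at the top of each minute. Below are a few lines from that file, which is a snippet of more than 36 hours of data. By "data" I mean the 8 data points that follow the timestamp.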

2020-08-03 22:17:12,0,0,4803,4800,91,28.05,24.05,58.8917
2020-08-03 22:17:13,0,0,4802,4800,91,28.05,24.05,58.8925
2020-08-03 22:17:14,0,0,4805,4800,91,28.05,24.05,58.9341
2020-08-03 22:17:15,0,0,4802,4800,91,28.05,24.05,58.9683
2020-08-03 22:17:18,0,0,4802,4800,91,28.05,23.05,58.978
...

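I can't find a way to have Python look at the seconds part of the timestamp and then build a new list that contains only the data associated with ":00" seconds.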

for line in fh:
    line = line.rstrip("\n")
    line = line.split(",")
    masterlist.extend(line) #this is putting the information into one list
    altmasterlist.append(line) #this is putting the lines of information into a list

for line in altmasterlist:
    if ":00" in line:
        finalmasterlist.extend(line) #Nothing is entering this if statement

print(finalmasterlist)

Am I using the two for loops in the right place?

Tags: python

Solution


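  • Use pandas
    • This can be done with a one-line vectorized operation.
    • As the %%timeit tests show, for 1M rows of data, pandas is 106 ms slower (774 ms vs. 668 ms) than reading the file with "with open" and using a str method to find :00.
      • The main difference is that pandas has converted all the data to the correct dtype (e.g. datetime, int, float), and the code is more concise.
      • In addition, the data is now in a useful format for time series analysis and plotting, though I recommend adding column names.
        • df.columns = ['datetime', ..., 'price']
  • Read the file with pandas.read_csv and parse the dates in column 0.
    • Use header=None, because the test data has no header row.
  • Use Boolean indexing to select the rows where the seconds value is 0.
    • Use the .dt accessor to get .second.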
import pandas as pd

# read the file which apparently has no header and parse the date column
df = pd.read_csv('test.csv', header=None, parse_dates=[0])

# using Boolean indexing to select data when seconds = 00
top_of_the_minute = df[df[0].dt.second == 0]

# save the data
top_of_the_minute.to_csv('clean.csv', header=False, index=False)

# display(top_of_the_minute)
                    0  1  2     3     4   5      6      7        8
5 2020-08-03 22:17:00  0  0  4803  4800  91  28.05  24.05  58.8917
6 2020-08-03 22:17:00  0  0  4802  4800  91  28.05  24.05  58.8925
7 2020-08-03 22:17:00  0  0  4805  4800  91  28.05  24.05  58.9341
8 2020-08-03 22:17:00  0  0  4802  4800  91  28.05  24.05  58.9683
9 2020-08-03 22:17:00  0  0  4802  4800  91  28.05  23.05  58.9780

# example: rename columns
top_of_the_minute.columns = ['datetime', 'v1', 'v2', 'v3', 'v4', 'v5', 'p1', 'p2', 'p3']

# example: plot the data
p = top_of_the_minute.plot('datetime', 'p3')
p.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
p.set_xlim('2020-08', '2020-09')
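
As a variation on the renaming step above, the column names can also be supplied when the file is read, so the filter can refer to the timestamp column by name rather than by position. The following is a minimal sketch, not part of the original answer: the helper name read_top_of_minute is hypothetical, the column names are the placeholder names from the renaming example, and an in-memory buffer holding three rows from test.csv stands in for the real file.

```python
import io

import pandas as pd

# Placeholder names taken from the renaming example; the real meaning of
# each column is not given in the question.
COLUMNS = ['datetime', 'v1', 'v2', 'v3', 'v4', 'v5', 'p1', 'p2', 'p3']


def read_top_of_minute(source):
    """Read a headerless CSV and keep the rows whose timestamp has seconds == 0.

    `source` may be a file path or any file-like object.
    """
    df = pd.read_csv(source, header=None, names=COLUMNS, parse_dates=['datetime'])
    return df[df['datetime'].dt.second == 0]


# Three rows copied from test.csv, used here instead of a file on disk.
sample = io.StringIO(
    "2020-08-03 22:17:12,0,0,4803,4800,91,28.05,24.05,58.8917\n"
    "2020-08-03 22:17:00,0,0,4803,4800,91,28.05,24.05,58.8917\n"
    "2020-08-03 22:17:00,0,0,4802,4800,91,28.05,24.05,58.8925\n"
)

top = read_top_of_minute(sample)
print(top[['datetime', 'v3', 'p3']])
```

Naming the columns at read time also means that plotting and saving code works with labels such as 'datetime' and 'p3' from the start.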


test.csv

2020-08-03 22:17:12,0,0,4803,4800,91,28.05,24.05,58.8917
2020-08-03 22:17:13,0,0,4802,4800,91,28.05,24.05,58.8925
2020-08-03 22:17:14,0,0,4805,4800,91,28.05,24.05,58.9341
2020-08-03 22:17:15,0,0,4802,4800,91,28.05,24.05,58.9683
2020-08-03 22:17:18,0,0,4802,4800,91,28.05,23.05,58.978
2020-08-03 22:17:00,0,0,4803,4800,91,28.05,24.05,58.8917
2020-08-03 22:17:00,0,0,4802,4800,91,28.05,24.05,58.8925
2020-08-03 22:17:00,0,0,4805,4800,91,28.05,24.05,58.9341
2020-08-03 22:17:00,0,0,4802,4800,91,28.05,24.05,58.9683
2020-08-03 22:17:00,0,0,4802,4800,91,28.05,23.05,58.978

%%timeit tests

Create the test data

# read test.csv
df = pd.read_csv('test.csv', header=None, parse_dates=[0])

# create a dataframe with 1M rows 
df = pd.concat([df] * 100000)

# save the new test data
df.to_csv('test.csv', index=False, header=False)

test_sk

def test_sk(path: str):
    zero_entries = []

    with open(path, "r") as file:
        for line in file:
            semi_index = line.index(',')
            if line[:semi_index].endswith(':00'):
                zero_entries.append(line)
    return zero_entries


%%timeit
result_sk = test_sk('test.csv')
[out]:
668 ms ± 5.69 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

test_tm

def test_tm(path: str):
    df = pd.read_csv(path, header=None, parse_dates=[0])
    return df[df[0].dt.second == 0]


%%timeit
result_tm = test_tm('test.csv')
[out]:
774 ms ± 7.27 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
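
For completeness, the loop in the question can also be repaired without pandas. After line.split(","), each line is a list, so ":00" in line asks whether some element is exactly equal to ":00", which is never true; it does not search inside the timestamp string. A plain substring test on the timestamp would also be too loose, because ":00" can appear in the minutes (for example 22:00:15). The sketch below checks the end of the first field instead. The function name top_of_minute_rows is hypothetical, and the sample lines are copied from test.csv.

```python
def top_of_minute_rows(lines):
    """Return the split fields of every line whose timestamp ends in ':00'."""
    rows = []
    for line in lines:
        fields = line.rstrip("\n").split(",")
        # fields[0] is the timestamp, e.g. "2020-08-03 22:17:00".
        # endswith() checks only the seconds, unlike a substring search.
        if fields[0].endswith(":00"):
            rows.append(fields)
    return rows


# Any iterable of lines works, including an open file handle.
sample_lines = [
    "2020-08-03 22:17:12,0,0,4803,4800,91,28.05,24.05,58.8917\n",
    "2020-08-03 22:17:00,0,0,4802,4800,91,28.05,24.05,58.8925\n",
]

result = top_of_minute_rows(sample_lines)
print(result)
```

With a real file, the same function can be called as top_of_minute_rows(fh) inside a with open(...) block, which replaces both loops in the question with a single pass over the file.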
