How to read a small percentage of lines of a very large CSV. Pandas - time series - Large dataset

Question

I have a time series in a big text file. That file is more than 4 GB.

As it is a time series, I would like to read only 1% of lines.

Desired minimalist example:

df = pandas.read_csv('super_size_file.log',
                      load_line_percentage = 1)
print(df)

desired output:

>line_number, value
 0,           654564
 100,         54654654
 200,         54
 300,         46546
 ...

I can't resample after loading, because it takes too much memory to load it in the first place.

I may want to load it chunk by chunk and resample every chunk, but that seems inefficient to me.
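For reference, the chunked approach can be sketched with pandas alone. The in-memory CSV below is a stand-in for the large file, and the `chunksize` and `nth_row` values are purely illustrative:

```python
import io
import pandas as pd

nth_row = 100  # keep every nth row
# In-memory stand-in for 'super_size_file.log'; illustrative data only
csv_text = "value\n" + "\n".join(str(i) for i in range(1000))

pieces = []
# read_csv with chunksize yields DataFrames whose default RangeIndex
# continues across chunks, so the index reflects the global row position
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=250):
    pieces.append(chunk[chunk.index % nth_row == 0])

df = pd.concat(pieces)
print(df)
```

Only every hundredth row is ever kept, so peak memory is one chunk plus the sampled rows, not the whole file.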

Any ideas are welcome. ;)

Tags: python, pandas, time-series, bigdata

Solution


Whenever I have to process a very large file, I ask myself, "What would Dask do?"

Load the large file as a dask.DataFrame, convert the index to a column (a workaround, since full index control isn't available), and then filter on that new column.

import dask.dataframe as dd
import pandas as pd

nth_row = 100  # grab every nth row from the larger DataFrame
dask_df = dd.read_csv('super_size_file.log')  # assuming this file can be read by pd.read_csv
dask_df['df_index'] = dask_df.index
dask_df_smaller = dask_df[dask_df['df_index'] % nth_row == 0]

df_smaller = dask_df_smaller.compute()  # to execute the operations and return a pandas DataFrame

This gives you rows 0, 100, 200, and so on from the larger file. If you want to reduce the DataFrame to specific columns, do so before calling compute, i.e. dask_df_smaller = dask_df_smaller[['Signal_1', 'Signal_2']]. You can also call compute with scheduler='processes' to use all of the cores on your CPU.
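A pandas-only alternative worth noting: read_csv's skiprows parameter accepts a callable that receives the 0-based file line number, so every nth line can be dropped at parse time without ever loading the rest. A minimal sketch, with an in-memory CSV standing in for the large file:

```python
import io
import pandas as pd

nth_row = 100  # keep every nth data row
# In-memory stand-in for 'super_size_file.log'; illustrative data only
csv_text = "value\n" + "\n".join(str(i) for i in range(1000))

# skiprows receives the 0-based file line number; line 0 is the header.
# Skip every data line whose position is not a multiple of nth_row.
df = pd.read_csv(io.StringIO(csv_text),
                 skiprows=lambda i: i != 0 and (i - 1) % nth_row != 0)
print(df)
```

Because skipped lines are discarded by the parser itself, this avoids materializing the full DataFrame at any point, at the cost of still scanning the whole file once.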

