python - How to read a small percentage of lines of a very large CSV. Pandas - time series - Large dataset
Question
I have a time series in a big text file. That file is more than 4 GB.
As it is a time series, I would like to read only 1% of lines.
Desired minimalist example:

df = pandas.read_csv('super_size_file.log',
                     load_line_percentage = 1)
print(df)
desired output:
>line_number, value
0, 654564
100, 54654654
200, 54
300, 46546
...
I can't resample after loading, because loading the whole file in the first place takes too much memory.
I may want to load it chunk by chunk and resample every chunk, but that seems inefficient to me.
Any ideas are welcome. ;)
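For reference, the chunk-by-chunk idea can be sketched with `pandas.read_csv`'s `chunksize` parameter, which yields DataFrame pieces with a continuing row index. This is a minimal sketch using an in-memory stand-in for the large file (the column names and data are made up for illustration):

```python
import io
import pandas as pd

# Hypothetical stand-in for 'super_size_file.log'
csv_data = "line_number,value\n" + "\n".join(f"{i},{i * 7}" for i in range(1000))

nth_row = 100  # keep every nth row
pieces = []
# Read the file in manageable chunks instead of all at once;
# the default RangeIndex continues across chunks, so it reflects
# each row's position in the whole file.
for chunk in pd.read_csv(io.StringIO(csv_data), chunksize=250):
    pieces.append(chunk[chunk.index % nth_row == 0])

df = pd.concat(pieces)
```

Only every nth row of each chunk is kept, so peak memory is bounded by the chunk size rather than the file size.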
Answer
Whenever I have to deal with a very large file, I ask myself, "What would Dask do?"
Load the big file as a dask.DataFrame, convert the index to a column (a workaround because full index control is not available), then filter on that new column.
import dask.dataframe as dd
import pandas as pd
nth_row = 100 # grab every nth row from the larger DataFrame
dask_df = dd.read_csv('super_size_file.log') # assuming this file can be read by pd.read_csv
dask_df['df_index'] = dask_df.index
dask_df_smaller = dask_df[dask_df['df_index'] % nth_row == 0]
df_smaller = dask_df_smaller.compute() # to execute the operations and return a pandas DataFrame
This gives you rows 0, 100, 200, etc. of the larger file. If you want to reduce the DataFrame to specific columns, do so before calling compute, i.e. dask_df_smaller = dask_df_smaller[['Signal_1', 'Signal_2']]. You can also call compute with scheduler='processes' to use all the cores on your CPU.
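As a pandas-only alternative (not the Dask approach above, just another option): `read_csv`'s `skiprows` parameter accepts a callable that receives the 0-based file line number and returns True for lines to skip, so the filtering happens during parsing and the skipped rows never occupy memory. A minimal sketch, again with made-up in-memory data standing in for the large file:

```python
import io
import pandas as pd

# Hypothetical stand-in for 'super_size_file.log'
csv_data = "line_number,value\n" + "\n".join(f"{i},{i * 3}" for i in range(500))

nth_row = 100
# Keep the header (line 0) and every nth data line; data line for
# row k sits at file line k + 1 because of the header.
df = pd.read_csv(
    io.StringIO(csv_data),
    skiprows=lambda i: i != 0 and (i - 1) % nth_row != 0,
)
```

This reads rows 0, 100, 200, ... directly, at the cost of calling the lambda once per line, which can be slower than Dask's parallel parsing on very large files.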