How to optimize the code while exporting to csv?

Problem description

I am trying to filter rows by certain rules and export the result to a csv file. The export is taking a long time. Can anyone suggest how to optimize the code?

Code snippet:

readCsv = pd.read_csv(inputFile)

readCsv.head()

readCsv.columns

readCsv[readCsv[attributeKey.title()].str.casefold().str.contains(Key.lower()) == True] \
    .to_excel(r"C:\User\Desktop\resultSet.xlsx", index=None, header=True)

Tags: pandas

Solution


Something like this?

import pandas as pd
import time

# 10 million rows of random 10-character strings as test data
df = pd.DataFrame({'Column1': pd.util.testing.rands_array(10, 10000000)})

attributeKey = 'column1'
Key = 'abc'  # the string you are checking for

start = time.time()
result = df[df[attributeKey.title()].str.lower().str.contains(Key.lower())]
end = time.time()

print(end - start)

result.to_excel('output.xlsx')

Edit 1: Answer 2

Running the snippet above, 10,000,000 rows take about 6.47 seconds. Anything larger and I run into memory errors; if you need to go bigger than that, you may want to look into Dask.
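For reference, a minimal out-of-core sketch with Dask could look like the following; the input file name data.csv and the filter values are placeholders, not from the original question:

import dask.dataframe as dd

# read the csv lazily, in partitions, instead of loading it all into memory
ddf = dd.read_csv('data.csv')

# the same case-insensitive substring filter, evaluated partition by partition
filtered = ddf[ddf['Column1'].str.lower().str.contains('abc')]

# writes one csv per partition; the actual computation happens here
filtered.to_csv('filtered-*.csv', index=False)

Dask only materializes one partition at a time, which is what sidesteps the memory errors mentioned above.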

Edit 2: Answer 3

Switching to apply cuts the time roughly in half.

import pandas as pd
import time

def check_key(s):
    # case-insensitive substring test, evaluated once per row
    return KEY.lower() in s.lower()

df = pd.DataFrame({'Column1': pd.util.testing.rands_array(10, 10000000)})

KEY = 'abc'  # the string you are checking for
ATTRIBUTE_KEY = 'column1'

start = time.time()
df[df[ATTRIBUTE_KEY.title()].apply(check_key)]
end = time.time()

print(end - start)

Output: 3.3952105045318604
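For comparison, the vectorized version can also be tightened without apply: Series.str.contains takes case and regex flags, so a variant of the timing block above (same df, KEY and ATTRIBUTE_KEY) would be:

start = time.time()
# case=False handles the case-insensitivity; regex=False skips the regex engine
df[df[ATTRIBUTE_KEY.title()].str.contains(KEY, case=False, regex=False)]
end = time.time()

print(end - start)

Whether this beats apply depends on the pandas version and the data, so it is worth timing both.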

Edit 3: Answer 4

Just for fun, trying multiprocessing:

from multiprocessing import Pool
from functools import partial
import numpy as np
import pandas as pd
import time

def parallelize(data, func, num_of_processes=8):
    # split the data into one chunk per process, map, then stitch the results back together
    data_split = np.array_split(data, num_of_processes)
    pool = Pool(num_of_processes)
    data = pd.concat(pool.map(func, data_split))
    pool.close()
    pool.join()
    return data

def run_on_subset(func, data_subset):
    return data_subset.apply(func)

def parallelize_on_rows(data, func, num_of_processes=8):
    return parallelize(data, partial(run_on_subset, func), num_of_processes)

def check_key(s):
    # case-insensitive substring test, evaluated once per row
    return KEY.lower() in s.lower()

df = pd.DataFrame({'Column1': pd.util.testing.rands_array(10, 10000000)})

KEY = 'abc'  # the string you are checking for
ATTRIBUTE_KEY = 'column1'

start = time.time()
parallelize_on_rows(df[ATTRIBUTE_KEY.title()], check_key)
end = time.time()

print(end - start)

Output: 6.306780815124512, so Answer 3 seems to be the most efficient for data of this size, most likely because splitting the frame and pickling the chunks across worker processes costs more than the parallelism saves.
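Finally, since the question title asks about csv rather than xlsx: to_csv is generally much cheaper than to_excel, which has to drive a spreadsheet writer. If Excel output is not actually required, the export step of Answer 3 could be sketched as:

# export the filtered rows; to_csv avoids the spreadsheet-writer overhead of to_excel
result = df[df[ATTRIBUTE_KEY.title()].apply(check_key)]
result.to_csv('output.csv', index=False)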

