string - Optimizing Python code by using Cython
Problem description
I would like to optimize some Python code to speed up string processing, and I was wondering if someone could help.
I ran some tests in Cython, but I am not particularly happy with the results. I am wondering whether string processing can be optimized any further, or whether I simply have not written proper Cython code. In the first case, iterating over a list of strings with a simple assignment to a variable was about 5 times faster on average. In the second case, where I added a substring search, there was virtually no speed increase.
[TL;DR]
I have a large corpus that I need to process. Currently, in pure Python, I iterate over a list of strings, search each one for a substring, and filter out the elements (main strings) that do not contain the given substring. I wrote a couple of for loops and was hoping for a significant speed increase after converting the code to Cython.
First, I tested how fast a simple iteration over two lists of strings is when each string is assigned to a variable:
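For reference, the pure-Python filtering step described above might look roughly like the following. This is only a minimal sketch, not the actual production code: the name `filter_by_tokens` and the sample data are hypothetical, and it assumes a string is kept when it contains at least one of the tokens.

```python
def filter_by_tokens(strings, tokens):
    """Return the strings that contain at least one of the tokens."""
    # `t in s` performs the substring search; any() stops at the first match.
    return [s for s in strings if any(t in s for t in tokens)]


# Hypothetical sample data, for illustration only.
corpus = ["the quick brown fox", "lazy dog", "hello world"]
wanted = ["fox", "world"]
print(filter_by_tokens(corpus, wanted))
```

Note that the substring test itself (`t in s`) is already executed by the interpreter's built-in string search, so only the looping around it is plain Python.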
Results:
Cython: 0.45067 sec
Python: 2.09907 sec
Cython is approximately 4.5 times faster than Python
Code in Cython:
def start(itsstr, tokens):
    cdef size_t s
    cdef size_t t
    cdef size_t ns = len(itsstr)
    cdef size_t nt = len(tokens)
    cdef unicode x
    for s in xrange(ns):
        for t in xrange(nt):
            x = itsstr[s]
Code in Python:
def start(itsstr, tokens):
    for s in itsstr:
        for t in tokens:
            x = s
The actual code in Python searches for substrings, so I then measured the speed-up from doing the same substring search in Cython. The results are as follows:
Results:
Cython: 7.13278 sec
Python: 8.7094 sec
Cython is approximately 1.2 times faster than Python
Code in Cython:
def start(itsstr, tokens):
    cdef size_t s
    cdef size_t t
    cdef size_t ns = len(itsstr)
    cdef size_t nt = len(tokens)
    cdef unicode x
    for s in xrange(ns):
        for t in xrange(nt):
            if tokens[t] in itsstr[s]:
                x = itsstr[s]
Code in Python:
def start(itsstr, tokens):
    for s in itsstr:
        for t in tokens:
            if t in s:
                x = s
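The question does not show how the timings were taken. One way to reproduce this kind of measurement is with the standard `timeit` module, as sketched below. The helper `best_time` and the synthetic data are assumptions made for illustration; the real corpus and benchmark script are not part of the question.

```python
import timeit


def start(itsstr, tokens):
    # Same pure-Python substring loop as in the question.
    for s in itsstr:
        for t in tokens:
            if t in s:
                x = s


def best_time(func, itsstr, tokens, repeat=3):
    """Return the fastest of `repeat` single runs of func, in seconds."""
    timer = timeit.Timer(lambda: func(itsstr, tokens))
    return min(timer.repeat(repeat=repeat, number=1))


# Hypothetical synthetic data; the real corpus is not shown in the question.
itsstr = ["string number %d with some text" % i for i in range(1000)]
tokens = ["number %d" % i for i in range(0, 1000, 10)]
print("Python: %.5f sec" % best_time(start, itsstr, tokens))
```

To compare against the compiled version, the Cython `start` could be imported from its extension module and passed to the same helper, so that both variants run on identical data.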
As you can see, there is hardly any speed increase.
What could I do at this point? I am not experienced in Cython, although I have good experience in C/C++.
Thanks
Solution