Optimizing Python code by using Cython

Problem description

I would like to optimize some Python code to speed up string processing, and I was wondering if someone could help.

I ran some tests in Cython, but I am not particularly happy with the results. I am wondering whether string processing can be optimized any further, or whether I have simply not written proper Cython code. In the first test, iterating over a list of strings and assigning each one to a variable was about 5 times faster on average. In the second test, where I added a substring search, there was hardly any speed increase.

[TL;DR]

I have a large corpus which I need to process. Currently, in pure Python, I iterate over a list of strings, search each one for a substring, and filter out those elements (main strings) which do not contain the given substring. I have written a couple of for loops and was hoping for a significant speed increase after converting the code to Cython.

First, I tested how fast a simple iteration over two lists of strings is when each string is assigned to a variable:
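For reference, the filtering step described above can be sketched roughly as follows. This is a minimal sketch and not the actual production code: the function name `filter_by_tokens` and the sample data are hypothetical, and it assumes a string is kept when it contains at least one of the tokens.

```python
def filter_by_tokens(strings, tokens):
    """Keep only the strings that contain at least one of the tokens.

    Assumption: "contains a given substring" means "contains any token".
    """
    return [s for s in strings if any(t in s for t in tokens)]


# Hypothetical sample data, only for illustration.
corpus = ["the quick brown fox", "lorem ipsum", "a lazy dog"]
tokens = ["fox", "dog"]

kept = filter_by_tokens(corpus, tokens)
print(kept)  # ['the quick brown fox', 'a lazy dog']
```

Unlike the benchmark loops, `any(...)` stops at the first matching token, so a string is not scanned again once it has matched.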

Results:

Cython: 0.45067 sec
Python: 2.09907 sec
Cython is approximately 4.7 times faster than Python

Code in Cython:

def start(itsstr, tokens):
    cdef size_t s
    cdef size_t t
    cdef size_t ns = len(itsstr)
    cdef size_t nt = len(tokens)
    cdef unicode x
    for s in xrange(ns):
        for t in xrange(nt):
            x = itsstr[s]

Code in Python:

def start(itsstr, tokens):
    for s in itsstr:
        for t in tokens:
            x = s

The current main code in Python searches for substrings, so I then compared it against the same substring search in Cython. The results are as follows:

Results:

Cython: 7.13278 sec
Python: 8.7094 sec
Cython is 1.2 times faster than Python

Code in Cython:

def start(itsstr, tokens):
    cdef size_t s
    cdef size_t t
    cdef size_t ns = len(itsstr)
    cdef size_t nt = len(tokens)
    cdef unicode x
    for s in xrange(ns):
        for t in xrange(nt):
            if tokens[t] in itsstr[s]:
                x = itsstr[s]

Code in Python:

def start(itsstr, tokens):
    for s in itsstr:
        for t in tokens:
            if t in s:
                x = s

As you can see, there is not really any speed increase.

What could I do at this point? I am not experienced in Cython, though I have good experience in C/C++.
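A small pure-Python harness along the following lines can show how much of the run time the substring test itself accounts for. It is only a sketch under stated assumptions: the function names `loop_only` and `loop_with_search`, the synthetic data, and the list sizes are hypothetical and are not the corpus or the setup that produced the timings above. In CPython the `in` operator on strings is already implemented in C, so typing the loop counters in Cython mainly removes loop overhead rather than search cost, which may be why the second comparison shows so little gain.

```python
import timeit


def loop_only(itsstr, tokens):
    # Mirrors the first pure-Python benchmark: nested loops, assignment only.
    for s in itsstr:
        for t in tokens:
            x = s


def loop_with_search(itsstr, tokens):
    # Mirrors the second pure-Python benchmark: adds the substring test.
    for s in itsstr:
        for t in tokens:
            if t in s:
                x = s


# Hypothetical synthetic data; the real corpus is not shown in the question.
itsstr = ["some example sentence number %d" % i for i in range(2000)]
tokens = ["token%d" % i for i in range(200)]

# Take the best of three runs to reduce noise from other processes.
t_loop = min(timeit.repeat(lambda: loop_only(itsstr, tokens), number=1, repeat=3))
t_search = min(timeit.repeat(lambda: loop_with_search(itsstr, tokens), number=1, repeat=3))

print("loop only:        %.5f sec" % t_loop)
print("loop with search: %.5f sec" % t_search)
```

The difference between the two figures gives a rough estimate of the time spent inside the substring test, which is the part that typed loop counters do not change.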

Thanks

Tags: string, cython
