首页 > 解决方案 > 使用字符串匹配优化列表理解

问题描述

我有一个看起来像这样的函数:

def my_fun:
    return [match.start() for match in re.finditer(f'(?=({some_pattern}))', string)]

它只返回一个整数列表,是的,我尝试了一个生成器,它非常快,但不幸的是它必须是一个列表。模式可能看起来像这样word1|word2|word3...- 很多词 :) 这是我优化工作的效果,但它仍然太慢。我可以做些什么来调整性能?我应该考虑使用 Cython 吗?

示例字符串:https ://www.codepile.net/pile/bD9R0qZR

示例模式:https ://www.codepile.net/pile/eRPK025Y

标签: regexpython-3.xoptimizationstring-matchingmicro-optimization

解决方案


我刚刚使用 Python 原语实现了这一点,并与您的解决方案进行了比较。以下代码也适用于多次(并且可能重叠)事件(使用match_all()辅助函数),并且比使用正则表达式要快得多:

import re


def find_all(text, pattern):
    len_text = len(text)
    i = 0
    while i < len_text:
        i = text.find(pattern, i)
        if i >= 0:
            yield i
            i += 1
        else:
            break


def my_fun_all(text, pattern):
    words = pattern.split('|')
    return sorted([i for word in words for i in find_all(text, word)])


def my_fun(text, pattern):
    return [match.start() for match in re.finditer(f'(?=({pattern}))', text)]


print(my_fun(text, pattern))
# [501, 523, 865, 878, 1407, 1728, 1956]
print(my_fun_all(text, pattern))
# [501, 523, 865, 878, 1407, 1728, 1956]

print(my_fun(text * 2, pattern) == my_fun_all(text * 2, pattern))
# True


%timeit my_fun(text, pattern)
# 1.48 ms ± 6.83 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit my_fun_all(text, pattern)
# 512 µs ± 2.12 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit my_fun(text * 2, pattern)
# 2.97 ms ± 27.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit my_fun_all(text * 2, pattern)
# 843 µs ± 8.04 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit my_fun(text * 20, pattern)
# 29.4 ms ± 423 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit my_fun_all(text * 20, pattern)
# 3.36 ms ± 31.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

编辑

这是我在 OP 指定需要支持多次出现之前的解决方案:

def my_fun_new(text, pattern):
    words = pattern.split('|')
    result = []
    for word in words:
        i = text.find(word)
        if i >= 0:
            result.append(i)
    return sorted(result)

编辑

为 Aho-Corasick 的 @Finomnis 实现添加时间:

import ahocorasick


def my_fun_ac(string, raw_pattern):
    A = ahocorasick.Automaton()
    for pattern in raw_pattern.split('|'):
        A.add_word(pattern, len(pattern))
    A.make_automaton()

    result = []
    for end_index, original_len in A.iter(string):
        result.append(end_index - original_len + 1)
    return result


print(my_fun(text * 2, pattern) == my_fun_ac(text * 2, pattern))
# True

%timeit my_fun_ac(text, pattern)
# 221 µs ± 2.46 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit my_fun_ac(text * 2, pattern)
# 268 µs ± 972 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit my_fun_ac(text * 20, pattern)
# 1.09 ms ± 6.44 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

这比我使用 Python 内置函数的实现快了 2 到 4 倍。


为了完整起见,我报告了我使用的内容textpattern

text = """\
Intro:
[Snoop]
Da da da da daaaa
It's tha motha fuckin' D O double G (Snoop dogg)
Da da da da daaaa
You know im mobbin' with the D.R.E

[Kurupt]
Yeah yeah yeah
You know who's back up in this mothafucker *echoes*

[Snoop]
What what what what

[Kurupt]
So blaze the weed out there

[Snoop]
Blaze it up

[Kurupt]
Blaze that shit up nigga
Yeah waz up snoop

Verse one: [snoop]
Top dogg buy them all nigga burn this shit up
D-P-G-C my nigga turn that shit up
CPT, LBC yeah we hookin' back up
N' when they bang this in the club baby you gotta get up
Thug niggas, drug dealers yeah they givin' it up
Low life, your life boy we livin' it up
Take a chance that's why we dancin'
In the party fo' sho'
Slip my ho a fourty four n' she crept in it back do'
Bitches lookin' at me strange but you know i don't care
Step up in this mothafucker just to swingin' my hair
Bitch quit talkin' Crip walk
If you down with the set
Take a Bullet with some dick
Take this dope from this jet
Outta town put it down for father of rap
N' if your ass get crack bitch shut your trap
Come back get back that's the part of success
N' if you believe the X then you'll relievin' your stress

Music in between

[snoop]
Da da da da daaaa

[Dre]
It's the mothafuckin' D-R-E

[Kurupt]
Dr. Dre mothafucker [Snoop] what what what what

[Snoop]
Da da da da daaaa

Verse two: [Dre]
You know im mobbing with the D O double G
Straight off the fuckin' street's of CPT
King of the beats n' you ride to em' in your fleet (Fleetwood) wood
Or Coupe DeVille rollin on dubs
How you feel?
Whoopty whoop nigga what?
Dre n' snoop chronic'ed out
In the 'llac with doc in the back
Sippin' 'gnac, clip in the strap
Dippin' through hoods
What hoods? Compton, longbeach, ingelwood
South central out to the westside (westside)
It's california love this california bud
Got a nigga gang of pub
I'm on one, I might bail up in the Century Club
With my jeans on n' my team's strong
Get my drink on n' my smoke on
Then go home with somethin' to poke on (waz up bitch)
Loc' it's on for the two tripple oh
Comin' real it's the next episode *Echoes*

Music in between

[Nate Dogg]
Hold up.. heeeeey
For my niggas who be thinkin' we soft
We don't.. plaaaay
We gonna rockin' til the weels fall of
Hold up.. heeeeey
For my niggas who be acting to bold
Take a.. seeeeat
Hope you ready for the next episode heeeeeey
(Music stops and pauses)
Smoke weed everyday
"""

pattern = "melodic|eyes|wander|rambunctious|whirl|provide|ruddy|quaint|sea|snatch|tangy|women|mammoth|peel|home|hope|sense|measure|lake|gleaming|vagabond|phobic|support|boring|discreet|overrated|spoil|load|lick|early|envious|alleged|heady|queen|seed|quiver|squalid|jelly|standing|wreck|slope|farflung|press|run|tender|shrill|duck|actor|foregoing|tickle|haunt|minor|seal|breakable|wren|trick|bitesized|stage|listen|depend|string|abounding|explode|cows|low|creature|compare|habitual|pipe|hand|occur|eye|drunk|furniture|insect|worthless|fang|found|connection|quarter|celery|stretch|rifle|glistening|baby|flesh|invite|motionless|watch|letter|current|succinct|panicky|snail|ear|prevent|nebulous|knife|jolly|dirt|scrawny|zephyr|giraffe|unique|jumbled|curly|fry|blushing|delirious|open|muddled|year|earsplitting|zipper|grandiose|answer|puncture|chase|gentle|middle|nine|overflow|rod|troubled|pop|handy|chalk|thunder|acceptable|bumpy|filthy|majestic|paste|snotty|quack|illated|tired|present|credit|substance|launch|equable|bashful|callous|appreciate|dead|soothe|amused|handsome|nappy|amazing|unbiased|bushes|yellow|way|cakes|cent|tested|letters|colour|apparel|rhythm|shoes|homeless|open|cub|alarm|soak|voracious|chin|flame|skillful|form|possessive|retire|camera|street|condition|gate|scintillating|follow|imported|approval|humdrum|tasty|murky|sprout|morning|picture|happen|observe|bang|slap|cough|five|tie|plan|punish|sneeze|perfect|shake|average|clear|caring|accurate|telephone|skate|daughter|wild|spurious|nutritious|sneaky|breezy|separate|sore|fade|blind|aware|cheat|heat|cowardly|unused|sheet|puny|pump|property|obscene|immense|soggy|shiny|spot|racial|lace|dapper|cheap|fluttering|husky|fire|hammer|aquatic|stick|lamentable|yell|chilly|zoom|animated|living|spell|separate|shade|delicious|deer|suck|squeamish|call|labored|shirt|numerous|push|sleet|price|earthy|ambiguous|porter|chickens|mailbox|omniscient|mourn|descriptive|kiss|polite|changeable|children|cheese|assorted|illustrious|action|serious|dislike|rhetorical|scandalous|nasty|steady|safe|walk|different|example|ring|talk|print|dinosaurs|switch|behave|murder|brown|cooperative|past|proud|disastrous|observant"

编辑:固定输入偏移,更新代码和时间)


推荐阅读