regex - 使用字符串匹配优化列表理解
问题描述
我有一个看起来像这样的函数:
def my_fun:
return [match.start() for match in re.finditer(f'(?=({some_pattern}))', string)]
它只返回一个整数列表,是的,我尝试了一个生成器,它非常快,但不幸的是它必须是一个列表。模式可能看起来像这样word1|word2|word3...
- 很多词 :) 这是我优化工作的效果,但它仍然太慢。我可以做些什么来调整性能?我应该考虑使用 Cython 吗?
解决方案
我刚刚使用 Python 原语实现了这一点,并与您的解决方案进行了比较。以下代码也适用于多次(并且可能重叠)事件(使用match_all()
辅助函数),并且比使用正则表达式要快得多:
import re
def find_all(text, pattern):
len_text = len(text)
i = 0
while i < len_text:
i = text.find(pattern, i)
if i >= 0:
yield i
i += 1
else:
break
def my_fun_all(text, pattern):
words = pattern.split('|')
return sorted([i for word in words for i in find_all(text, word)])
def my_fun(text, pattern):
return [match.start() for match in re.finditer(f'(?=({pattern}))', text)]
print(my_fun(text, pattern))
# [501, 523, 865, 878, 1407, 1728, 1956]
print(my_fun_all(text, pattern))
# [501, 523, 865, 878, 1407, 1728, 1956]
print(my_fun(text * 2, pattern) == my_fun_all(text * 2, pattern))
# True
%timeit my_fun(text, pattern)
# 1.48 ms ± 6.83 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit my_fun_all(text, pattern)
# 512 µs ± 2.12 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit my_fun(text * 2, pattern)
# 2.97 ms ± 27.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit my_fun_all(text * 2, pattern)
# 843 µs ± 8.04 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit my_fun(text * 20, pattern)
# 29.4 ms ± 423 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit my_fun_all(text * 20, pattern)
# 3.36 ms ± 31.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
编辑
这是我在 OP 指定需要支持多次出现之前的解决方案:
def my_fun_new(text, pattern):
words = pattern.split('|')
result = []
for word in words:
i = text.find(word)
if i >= 0:
result.append(i)
return sorted(result)
编辑
为 Aho-Corasick 的 @Finomnis 实现添加时间:
import ahocorasick
def my_fun_ac(string, raw_pattern):
A = ahocorasick.Automaton()
for pattern in raw_pattern.split('|'):
A.add_word(pattern, len(pattern))
A.make_automaton()
result = []
for end_index, original_len in A.iter(string):
result.append(end_index - original_len + 1)
return result
print(my_fun(text * 2, pattern) == my_fun_ac(text * 2, pattern))
# True
%timeit my_fun_ac(text, pattern)
# 221 µs ± 2.46 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit my_fun_ac(text * 2, pattern)
# 268 µs ± 972 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit my_fun_ac(text * 20, pattern)
# 1.09 ms ± 6.44 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
这比我使用 Python 内置函数的实现快了 2 到 4 倍。
为了完整起见,我报告了我使用的内容text
和pattern
:
text = """\
Intro:
[Snoop]
Da da da da daaaa
It's tha motha fuckin' D O double G (Snoop dogg)
Da da da da daaaa
You know im mobbin' with the D.R.E
[Kurupt]
Yeah yeah yeah
You know who's back up in this mothafucker *echoes*
[Snoop]
What what what what
[Kurupt]
So blaze the weed out there
[Snoop]
Blaze it up
[Kurupt]
Blaze that shit up nigga
Yeah waz up snoop
Verse one: [snoop]
Top dogg buy them all nigga burn this shit up
D-P-G-C my nigga turn that shit up
CPT, LBC yeah we hookin' back up
N' when they bang this in the club baby you gotta get up
Thug niggas, drug dealers yeah they givin' it up
Low life, your life boy we livin' it up
Take a chance that's why we dancin'
In the party fo' sho'
Slip my ho a fourty four n' she crept in it back do'
Bitches lookin' at me strange but you know i don't care
Step up in this mothafucker just to swingin' my hair
Bitch quit talkin' Crip walk
If you down with the set
Take a Bullet with some dick
Take this dope from this jet
Outta town put it down for father of rap
N' if your ass get crack bitch shut your trap
Come back get back that's the part of success
N' if you believe the X then you'll relievin' your stress
Music in between
[snoop]
Da da da da daaaa
[Dre]
It's the mothafuckin' D-R-E
[Kurupt]
Dr. Dre mothafucker [Snoop] what what what what
[Snoop]
Da da da da daaaa
Verse two: [Dre]
You know im mobbing with the D O double G
Straight off the fuckin' street's of CPT
King of the beats n' you ride to em' in your fleet (Fleetwood) wood
Or Coupe DeVille rollin on dubs
How you feel?
Whoopty whoop nigga what?
Dre n' snoop chronic'ed out
In the 'llac with doc in the back
Sippin' 'gnac, clip in the strap
Dippin' through hoods
What hoods? Compton, longbeach, ingelwood
South central out to the westside (westside)
It's california love this california bud
Got a nigga gang of pub
I'm on one, I might bail up in the Century Club
With my jeans on n' my team's strong
Get my drink on n' my smoke on
Then go home with somethin' to poke on (waz up bitch)
Loc' it's on for the two tripple oh
Comin' real it's the next episode *Echoes*
Music in between
[Nate Dogg]
Hold up.. heeeeey
For my niggas who be thinkin' we soft
We don't.. plaaaay
We gonna rockin' til the weels fall of
Hold up.. heeeeey
For my niggas who be acting to bold
Take a.. seeeeat
Hope you ready for the next episode heeeeeey
(Music stops and pauses)
Smoke weed everyday
"""
pattern = "melodic|eyes|wander|rambunctious|whirl|provide|ruddy|quaint|sea|snatch|tangy|women|mammoth|peel|home|hope|sense|measure|lake|gleaming|vagabond|phobic|support|boring|discreet|overrated|spoil|load|lick|early|envious|alleged|heady|queen|seed|quiver|squalid|jelly|standing|wreck|slope|farflung|press|run|tender|shrill|duck|actor|foregoing|tickle|haunt|minor|seal|breakable|wren|trick|bitesized|stage|listen|depend|string|abounding|explode|cows|low|creature|compare|habitual|pipe|hand|occur|eye|drunk|furniture|insect|worthless|fang|found|connection|quarter|celery|stretch|rifle|glistening|baby|flesh|invite|motionless|watch|letter|current|succinct|panicky|snail|ear|prevent|nebulous|knife|jolly|dirt|scrawny|zephyr|giraffe|unique|jumbled|curly|fry|blushing|delirious|open|muddled|year|earsplitting|zipper|grandiose|answer|puncture|chase|gentle|middle|nine|overflow|rod|troubled|pop|handy|chalk|thunder|acceptable|bumpy|filthy|majestic|paste|snotty|quack|illated|tired|present|credit|substance|launch|equable|bashful|callous|appreciate|dead|soothe|amused|handsome|nappy|amazing|unbiased|bushes|yellow|way|cakes|cent|tested|letters|colour|apparel|rhythm|shoes|homeless|open|cub|alarm|soak|voracious|chin|flame|skillful|form|possessive|retire|camera|street|condition|gate|scintillating|follow|imported|approval|humdrum|tasty|murky|sprout|morning|picture|happen|observe|bang|slap|cough|five|tie|plan|punish|sneeze|perfect|shake|average|clear|caring|accurate|telephone|skate|daughter|wild|spurious|nutritious|sneaky|breezy|separate|sore|fade|blind|aware|cheat|heat|cowardly|unused|sheet|puny|pump|property|obscene|immense|soggy|shiny|spot|racial|lace|dapper|cheap|fluttering|husky|fire|hammer|aquatic|stick|lamentable|yell|chilly|zoom|animated|living|spell|separate|shade|delicious|deer|suck|squeamish|call|labored|shirt|numerous|push|sleet|price|earthy|ambiguous|porter|chickens|mailbox|omniscient|mourn|descriptive|kiss|polite|changeable|children|cheese|assorted|illustrious|action|serious|dislike|rhetorical|scandalous|nasty|steady|safe|walk|different|example|ring|talk|print|dinosaurs|switch|behave|murder|brown|cooperative|past|proud|disastrous|observant"
(编辑:固定输入偏移,更新代码和时间)
推荐阅读
- android - 打开 PDF 文件 Delphi RIO
- django - 如何将 simplejwt 令牌存储到数据库中
- javascript - 如何在angularJS中验证具有相同名称的多个表单
- vue.js - 道具对象的属性在刷新函数中未定义
- javascript - 为什么 PWA 应用程序创建 ShourtCut 而不是安装 apk(Android 版本 9)?
- java - kafka 流跳跃窗口聚合导致时间戳为零的多个窗口
- python - Python:分配变量并忽略'IndexError'
- c# - EF6 创建一个额外的空实体(表)
- jsp - 无法在 JSP 中显示印地语字体
- reactjs - 显示 xml 而不是站点