Python regex built by joining strings returns a bunch of empty strings

Question

I am trying to pass a list of strings, joined into a single regular expression, to re.findall:

re.findall(regex, string)

But the result is a list of two tuples filled with empty strings:

re.findall("|".join(locations), 'Zika Outbreak Hits Miami'.lower())
# [('', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', ''), ('', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '')]

locations is a list like this:

['andorra', 'united arab emirates', 'afghanistan', 'antigua and barbuda', 'anguilla', 'albania', 'armenia', 'angola', 'antarctica', 'argentina', 'american samoa', 'austria', 'australia', 'aruba', 'aland islands', 'azerbaijan', 'bosnia and herzegovina', 'barbados', 'bangladesh', 'belgium', 'burkina faso', 'bulgaria', 'bahrain', 'burundi', 'benin', 'saint barthelemy', 'bermuda', 'brunei', 'bolivia', 'bonaire, saint eustatius and saba ', 'brazil', 'bahamas', 'bhutan', 'bouvet island', 'botswana', 'belarus', 'belize', 'canada', 'cocos islands', 'democratic republic of the congo', 'central african republic', 'republic of the congo', 'switzerland', 'ivory coast', 'cook islands', 'chile', 'cameroon', 'china', 'colombia', 'costa rica', 'cuba', ...]

A manual test like this works:

print(re.findall('miami|zika', 'Zika Outbreak Hits Miami'.lower()))
# ['zika', 'miami']

But I can't see what is wrong with joining locations into one big regex. Maybe the size is the problem? locations has 24588 elements.

I currently build the locations list from the cities and countries provided by geonamescache:

import geonamescache

gc = geonamescache.GeonamesCache()
countries = [country["name"].lower() for country in list(gc.get_countries().values())]
cities    = [city["name"].lower() for city in list(gc.get_cities().values())]
locations =  countries + cities

The text I am working with looks like this:

Zika Outbreak Hits Miami
Could Zika Reach New York City?
First Case of Zika in Miami Beach
Mystery Virus Spreads in Recife, Brazil
Dallas man comes down with case of Zika

Tags: python, regex, geonames

Answer


Look through your locations list for empty strings or unusual location names.
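One way to run that check is sketched below. The helper name find_suspicious and the sample list are made up for illustration; the metacharacter set is an assumption about which characters are worth flagging, not an exhaustive rule.

```python
# Characters with special meaning in Python regular expressions.
REGEX_METACHARS = set(r".^$*+?{}[]\|()")

def find_suspicious(names):
    """Return names that are blank or contain a regex metacharacter."""
    return [n for n in names if not n.strip() or REGEX_METACHARS & set(n)]

sample = ['cuba', '', 'halle (saale)', 'v.s.k.valasai (dindigul-dist.)', 'costa rica']
suspicious = find_suspicious(sample)
```

Anything this returns either needs to be dropped (blank names) or escaped before it goes into the pattern.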

For example, this works fine:

In [1]: locations = ['andorra', 'united arab emirates', 'afghanistan', 'antigua and barbuda', 'anguilla', 'albania', 'armenia', 'angola', 'antarctica', 'argentina', 'american samoa', 'austria', 'australia', 'aruba', 'aland islands', 'azerbaijan', 'bosnia and herzegovina', 'barbados', 'bangladesh', 'belgium', 'burkina faso', 'bulgaria', 'bahrain', 'burundi', 'benin', 'saint barthelemy', 'bermuda', 'brunei', 'bolivia', 'bonaire, saint eustatius and saba ', 'brazil', 'bahamas', 'bhutan', 'bouvet island', 'botswana', 'belarus', 'belize', 'canada', 'cocos islands', 'democratic republic of the congo', 'central african republic', 'republic of the congo', 'switzerland', 'ivory coast', 'cook islands', 'chile', 'cameroon', 'china', 'colombia', 'costa rica', 'cuba']

In [2]: import re

In [3]: re.findall("|".join(locations), 'Zika Outbreak Hits Miami'.lower())
Out[3]: []

In [4]: re.findall("|".join(locations), 'switzerland has lot of mountains'.lower())
Out[4]: ['switzerland']

This does not, because the list now contains an empty location:

In [5]: locations = ['andorra', 'united arab emirates', 'afghanistan', 'antigua and barbuda', 'anguilla', 'albania', 'armenia', 'angola', 'antarctica', 'argentina', 'american samoa', 'austria', 'australia', 'aruba', 'aland islands', 'azerbaijan', 'bosnia and herzegovina', 'barbados', 'bangladesh', 'belgium', 'burkina faso', 'bulgaria', 'bahrain', 'burundi', 'benin', 'saint barthelemy', 'bermuda', 'brunei', 'bolivia', 'bonaire, saint eustatius and saba ', 'brazil', 'bahamas', 'bhutan', 'bouvet island', 'botswana', 'belarus', 'belize', 'canada', 'cocos islands', 'democratic republic of the congo', 'central african republic', 'republic of the congo', 'switzerland', 'ivory coast', 'cook islands', 'chile', 'cameroon', 'china', 'colombia', 'costa rica', 'cuba', '']

In [6]: re.findall("|".join(locations), 'switzerland has lot of mountains'.lower())
Out[6]:
['switzerland',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '']
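The empty alternative matches the empty string at every position where no real name matches, which is where all the '' entries come from. A minimal sketch of filtering such entries out before joining follows; clean_locations is a hypothetical helper, not part of the original code.

```python
import re

def clean_locations(names):
    """Strip surrounding whitespace and drop names that end up empty."""
    stripped = (name.strip() for name in names)
    return [name for name in stripped if name]

sample = ['switzerland', 'cuba', '', '   ']
pattern = "|".join(clean_locations(sample))
matches = re.findall(pattern, 'switzerland has lot of mountains')
```

With the blank entries removed, only the real match is left in matches.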

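Edit

As expected, special characters in the location names are causing the problem. The following finds the offending names; it is mostly the parentheses that interfere with the regex. An unescaped ( ... ) is a capture group, and when a pattern contains more than one group, re.findall returns a tuple of group values for each match instead of the matched text, which is why the question shows tuples of empty strings: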

In [21]: [l for l in locations if l.find('(') >= 0]
Out[21]:
['zürich (kreis 11) / seebach',
 'zürich (kreis 11) / oerlikon',
 'zürich (kreis 10) / höngg',
 'zürich (kreis 4) / aussersihl',
 'zürich (kreis 10) / wipkingen',
 'zürich (kreis 11) / affoltern',
 'zürich (kreis 2) / wollishofen',
 'zürich (kreis 3) / sihlfeld',
 'zürich (kreis 6) / unterstrass',
 'zürich (kreis 9) / albisrieden',
 'zürich (kreis 9) / altstetten',
 'stadt winterthur (kreis 1)',
 'zürich (kreis 12)',
 'seen (kreis 3)',
 'zürich (kreis 3)',
 'zürich (kreis 11)',
 'zürich (kreis 9)',
 'oberwinterthur (kreis 2)',
 'zürich (kreis 10)',
 'zürich (kreis 2)',
 'zürich (kreis 8)',
 'zürich (kreis 7)',
 'zürich (kreis 6)',
 'wetter (ruhr)',
 'schwedt (oder)',
 'kempten (allgäu)',
 'kelkheim (taunus)',
 'halle (saale)',
 'frankfurt (oder)',
 'brake (unterweser)',
 'v.s.k.valasai (dindigul-dist.)',
 'dainava (kaunas)',
 'miguel alemán (la doce)',
 'jardines de la silla (jardines)',
 'licenciado benito juárez (campo gobierno)',
 'ampliación san mateo (colonia solidaridad)',
 'kalibo (poblacion)',
 'city of milford (balance)',
 'butte-silver bow (balance)']
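The effect of those parentheses can be reproduced on a small scale. The sketch below uses a made-up three-name list rather than the full geonamescache data, and compares the unescaped pattern with one built through re.escape.

```python
import re

names = ['frankfurt (oder)', 'halle (saale)', 'miami']
text = 'zika outbreak hits miami'

# Unescaped: each "(...)" becomes a capture group, so findall returns
# one tuple per match holding whatever each group captured.
raw = re.compile("|".join(names))
raw_matches = raw.findall(text)

# Escaped: the parentheses are literal, no groups remain, and findall
# returns the matched text itself.
escaped = re.compile("|".join(re.escape(n) for n in names))
escaped_matches = escaped.findall(text)
```

Here 'miami' does match in both cases, but in the unescaped pattern neither group takes part in the match, so findall reports a tuple of empty strings for it.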

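Use re.escape when building the regex so that special characters are treated literally. Sorting the names longest first makes the alternation prefer the longest name at a given position. You may also want to match whole words only; otherwise a partial word such as brea in break will match.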

In [21]: locations_regex = re.compile(r'|'.join([re.escape(l) for l in sorted(locations, key=lambda x:-len(x))]))
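Whole-word matching could be added along these lines. This is a sketch under assumptions: build_location_regex is a hypothetical helper, and the sample list stands in for the real locations. It uses lookarounds instead of \b because \b would fail after a name ending in ")" when that name is followed by a space or the end of the text.

```python
import re

def build_location_regex(names):
    """Compile one regex matching any name as a whole word, longest first."""
    unique = {n.strip() for n in names if n.strip()}
    # Longest names first so 'miami beach' is tried before 'miami'.
    ordered = sorted(unique, key=lambda n: (-len(n), n))
    alternation = "|".join(re.escape(n) for n in ordered)
    # (?<!\w) and (?!\w) require that no word character touches the match.
    return re.compile(r"(?<!\w)(?:" + alternation + r")(?!\w)")

sample = ['miami', 'miami beach', 'brazil', 'recife', 'new york city',
          'dallas', 'brea', 'halle (saale)', '']
locations_regex = build_location_regex(sample)

headlines = [
    'Zika Outbreak Hits Miami',
    'Could Zika Reach New York City?',
    'First Case of Zika in Miami Beach',
    'Mystery Virus Spreads in Recife, Brazil',
    'Dallas man comes down with case of Zika',
]
found = [locations_regex.findall(h.lower()) for h in headlines]
```

The non-capturing group (?:...) keeps findall returning plain strings, and the blank entry in sample is dropped before the pattern is built.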
