python - Python string regex union returns a bunch of empty strings
Question
I am trying to pass a list of strings, joined into a single pattern, as the regex to re.findall:
re.findall(regex, string)
But what I get back is a list of two tuples full of empty strings:
re.findall("|".join(locations), 'Zika Outbreak Hits Miami'.lower())
# [('', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', ''), ('', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '')]
locations is a list like this:
['andorra', 'united arab emirates', 'afghanistan', 'antigua and barbuda', 'anguilla', 'albania', 'armenia', 'angola', 'antarctica', 'argentina', 'american samoa', 'austria', 'australia', 'aruba', 'aland islands', 'azerbaijan', 'bosnia and herzegovina', 'barbados', 'bangladesh', 'belgium', 'burkina faso', 'bulgaria', 'bahrain', 'burundi', 'benin', 'saint barthelemy', 'bermuda', 'brunei', 'bolivia', 'bonaire, saint eustatius and saba ', 'brazil', 'bahamas', 'bhutan', 'bouvet island', 'botswana', 'belarus', 'belize', 'canada', 'cocos islands', 'democratic republic of the congo', 'central african republic', 'republic of the congo', 'switzerland', 'ivory coast', 'cook islands', 'chile', 'cameroon', 'china', 'colombia', 'costa rica', 'cuba', ...]
A manual test like this works:
print(re.findall('miami|zika', 'Zika Outbreak Hits Miami'.lower()))
# ['zika', 'miami']
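But I can't see what is wrong with joining locations to create one big regex. Maybe this is the cause: locations has 24588 elements.
I am currently building the locations list from the cities and countries provided by geonamescache: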
import geonamescache
gc = geonamescache.GeonamesCache()
countries = [country["name"].lower() for country in list(gc.get_countries().values())]
cities = [city["name"].lower() for city in list(gc.get_cities().values())]
locations = countries + cities
The text I am working with looks like this:
Zika Outbreak Hits Miami
Could Zika Reach New York City?
First Case of Zika in Miami Beach
Mystery Virus Spreads in Recife, Brazil
Dallas man comes down with case of Zika
Solution
Go through your locations list and look for empty strings or unusual location names.
For example, this works fine:
In [1]: locations = ['andorra', 'united arab emirates', 'afghanistan', 'antigua and barbuda', 'anguilla', 'albania', 'armenia', 'angola', 'antarctica', 'argentina', 'american samoa', 'austria', 'australia', 'aruba', 'aland islands', 'azerbaijan', 'bosnia and herzegovina', 'barbados', 'bangladesh', 'belgium', 'burkina faso', 'bulgaria', 'bahrain', 'burundi', 'benin', 'saint barthelemy', 'bermuda', 'brunei', 'bolivia', 'bonaire, saint eustatius and saba ', 'brazil', 'bahamas', 'bhutan', 'bouvet island', 'botswana', 'belarus', 'belize', 'canada', 'cocos islands', 'democratic republic of the congo', 'central african republic', 'republic of the congo', 'switzerland', 'ivory coast', 'cook islands', 'chile', 'cameroon', 'china', 'colombia', 'costa rica', 'cuba']
In [2]: import re
In [3]: re.findall("|".join(locations), 'Zika Outbreak Hits Miami'.lower())
Out[3]: []
In [4]: re.findall("|".join(locations), 'switzerland has lot of mountains'.lower())
Out[4]: ['switzerland']
This does not, because the list now contains an empty location:
In [5]: locations = ['andorra', 'united arab emirates', 'afghanistan', 'antigua and barbuda', 'anguilla', 'albania', 'armenia', 'angola', 'antarctica', 'argentina', 'american samoa', 'austria', 'australia', 'aruba', 'aland islands', 'azerbaijan', 'bosnia and herzegovina', 'barbados', 'bangladesh', 'belgium', 'burkina faso', 'bulgaria', 'bahrain', 'burundi', 'benin', 'saint barthelemy', 'bermuda', 'brunei', 'bolivia', 'bonaire, saint eustatius and saba ', 'brazil', 'bahamas', 'bhutan', 'bouvet island', 'botswana', 'belarus', 'belize', 'canada', 'cocos islands', 'democratic republic of the congo', 'central african republic', 'republic of the congo', 'switzerland', 'ivory coast', 'cook islands', 'chile', 'cameroon', 'china', 'colombia', 'costa rica', 'cuba', '']
In [6]: re.findall("|".join(locations), 'switzerland has lot of mountains'.lower())
Out[6]:
['switzerland',
'',
'',
'',
'',
'',
'',
'',
'',
'',
'',
'',
'',
'',
'',
'',
'',
'',
'',
'',
'',
'',
'']
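The extra empty strings come from the empty alternative: a pattern that ends in | followed by nothing can match the empty string at every position where no real name matches. Below is a minimal sketch of that behavior and of one way to filter blank entries out before joining. The short names list is made up for illustration and stands in for locations.

```python
import re

# Hypothetical short list standing in for `locations`; note the trailing ''.
names = ['switzerland', 'cuba', '']

# The '' entry adds an empty alternative, which matches at every
# position where none of the real names match.
with_empty = re.findall('|'.join(names), 'hi cuba')

# Dropping blank (empty or whitespace-only) entries before joining
# removes the empty alternative.
cleaned = [name for name in names if name.strip()]
without_empty = re.findall('|'.join(cleaned), 'hi cuba')

print(with_empty)     # 'cuba' surrounded by empty strings
print(without_empty)  # ['cuba']
```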
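Edit
As expected, special characters in the location names are causing the problem. An unescaped ( ... ) in a name becomes a capture group, and when a pattern contains more than one group, re.findall returns a tuple of group values for each match instead of the matched text, which is why you get tuples of empty strings. It is mainly the place names with parentheses that interfere with the regex; you can list them like this: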
In [21]: [l for l in locations if l.find('(') >= 0]
Out[21]:
['zürich (kreis 11) / seebach',
'zürich (kreis 11) / oerlikon',
'zürich (kreis 10) / höngg',
'zürich (kreis 4) / aussersihl',
'zürich (kreis 10) / wipkingen',
'zürich (kreis 11) / affoltern',
'zürich (kreis 2) / wollishofen',
'zürich (kreis 3) / sihlfeld',
'zürich (kreis 6) / unterstrass',
'zürich (kreis 9) / albisrieden',
'zürich (kreis 9) / altstetten',
'stadt winterthur (kreis 1)',
'zürich (kreis 12)',
'seen (kreis 3)',
'zürich (kreis 3)',
'zürich (kreis 11)',
'zürich (kreis 9)',
'oberwinterthur (kreis 2)',
'zürich (kreis 10)',
'zürich (kreis 2)',
'zürich (kreis 8)',
'zürich (kreis 7)',
'zürich (kreis 6)',
'wetter (ruhr)',
'schwedt (oder)',
'kempten (allgäu)',
'kelkheim (taunus)',
'halle (saale)',
'frankfurt (oder)',
'brake (unterweser)',
'v.s.k.valasai (dindigul-dist.)',
'dainava (kaunas)',
'miguel alemán (la doce)',
'jardines de la silla (jardines)',
'licenciado benito juárez (campo gobierno)',
'ampliación san mateo (colonia solidaridad)',
'kalibo (poblacion)',
'city of milford (balance)',
'butte-silver bow (balance)']
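To see how names like these change the result, here is a small sketch using two entries from the list above plus a plain name. It compares joining the names as they are with joining them after re.escape; the variable names are only illustrative.

```python
import re

# Two parenthesised names from the list above, plus one plain name.
names = ['halle (saale)', 'frankfurt (oder)', 'miami']
text = 'zika outbreak hits miami'

# Unescaped: each "(...)" is a capture group, so findall returns one
# tuple of group values per match; groups that took no part are ''.
unescaped = re.findall('|'.join(names), text)

# Escaped: parentheses are literal characters, no groups are created,
# and findall returns the matched text itself.
escaped = re.findall('|'.join(re.escape(name) for name in names), text)

print(unescaped)  # [('', '')]
print(escaped)    # ['miami']
```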
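Build the regex with re.escape so that special characters are treated literally. Sorting the names longest first makes the alternation prefer the longer name when one name is a prefix of another. You may also want to match whole words only; otherwise partial words will match, such as brea inside break.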
In [22]: locations_regex = re.compile('|'.join(re.escape(l) for l in sorted(locations, key=len, reverse=True)))
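The line above escapes the names but does not yet enforce whole-word matching or drop blank entries. One possible way to combine all three is sketched below. The helper name build_locations_regex and the sample list are hypothetical, and the lookarounds (?<!\w) and (?!\w) are used in place of \b as an assumption about what "whole word" should mean here, since \b does not hold after a name that ends in ")" at the end of the text.

```python
import re

def build_locations_regex(locations):
    """Hypothetical helper: one regex matching any location as a whole word."""
    # Strip stray whitespace and drop blank entries, which would otherwise
    # add an empty alternative to the pattern.
    cleaned = {loc.strip() for loc in locations if loc.strip()}
    # Longest names first, so 'miami beach' is tried before 'miami'.
    ordered = sorted(cleaned, key=len, reverse=True)
    alternation = '|'.join(re.escape(loc) for loc in ordered)
    # Require that no word character touches the match on either side.
    return re.compile(r'(?<!\w)(?:' + alternation + r')(?!\w)')

# Made-up sample standing in for the full `locations` list.
sample = ['miami', 'miami beach', 'brea', 'halle (saale)', '']
locations_regex = build_locations_regex(sample)

print(locations_regex.findall('first case of zika in miami beach'))  # ['miami beach']
print(locations_regex.findall('taking a break in halle (saale)'))    # ['halle (saale)']
```

The non-capturing group (?:...) keeps findall returning the matched text rather than group values.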