Python: regular expression for extracting Chinese text

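Problem description

I am trying to extract the province and city names from the text below (it is HTML, but I removed some escape characters). However, the regular expression I wrote returns a list of empty strings.

When I test the pattern on a regex site (for example https://regex101.com/), it appears to work, but it does not work when I run it in my script.

This is a shortened version of my code (the HTML dump is much longer).

Any help would be appreciated.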

import re
text = 'try  window.getAreaStat = [provinceName:湖北省,provinceShortName:湖北,confirmedCount:3554,suspectedCount:0,curedCount:80,deadCount:125,comment:待明确地区:治愈 30,cities:[cityName:武汉,confirmedCount:1905,suspectedCount:0,curedCount:47,deadCount:104,cityName:黄冈,confirmedCount:324,suspectedCount:0,curedCount:2,deadCount:5,cityName:孝感,confirmedCount:274,suspectedCount:0,curedCount:0,deadCount:3,cityName:荆门,confirmedCount:142,suspectedCount:0,curedCount:0,deadCount:4,cityName:襄阳,confirmedCount:131,suspectedCount:0,curedCount:0,deadCount:0,cityName:随州,confirmedCount:116,suspectedCount:0,curedCount:0,deadCount:0,cityName:咸宁,confirmedCount:112,suspectedCount:0,curedCount:0,deadCount:0,cityName:荆州,confirmedCount:101,suspectedCount:0,curedCount:1,deadCount:2,cityName:十堰,confirmedCount:88,suspectedCount:0,curedCount:0,deadCount:0,cityName:黄石,confirmedCount:86,suspectedCount:0,curedCount:0,deadCount:1,cityName:鄂州,confirmedCount:84,suspectedCount:0,curedCount:0,deadCount:1,cityName:宜昌,confirmedCount:63,suspectedCount:0,curedCount:0,deadCount:1,cityName:恩施州,confirmedCount:51,suspectedCount:0,curedCount:0,deadCount:0,cityName:天门,confirmedCount:34,suspectedCount:0,curedCount:0,deadCount:3,cityName:仙桃,confirmedCount:32,suspectedCount:0,curedCount:0,deadCount:0,cityName:潜江,confirmedCount:8,suspectedCount:0,curedCount:0,deadCount:1,cityName:神农架林区,confirmedCount:3,suspectedCount:0,curedCount:0,deadCount:0],provinceName:浙江省,provinceShortName:浙江,confirmedCount:296,suspectedCount:0,curedCount:3,deadCount:0,comment:,cities:[cityName:温州,confirmedCount:114,suspectedCount:0,curedCount:3,deadCount:0,cityName:杭州,confirmedCount:51,suspectedCount:0,curedCount:0,deadCount:0,cityName:台州,confirmedCount:40,suspectedCount:0,curedCount:0,deadCount:0,cityName:宁波,confirmedCount:20,suspectedCount:0,curedCount:0,deadCount:0,cityName:绍兴,confirmedCount:19,suspectedCount:0,curedCount:0,deadCount:0,cityName:嘉兴,confirmedCount:14,suspectedCount:0,curedCount:0,deadCount:0,cityName:金华,confirmedCount:13,suspectedCount:0,curedCount:0,deadCount:0,cityName:衢州,confirmedCount:8,suspectedCount:0,curedCount:0,deadCount:0,cityName:舟山,confirmedCount:6,suspectedCount:0,curedCount:0,deadCount:0,cityName:丽水,confirmedCount:6,suspectedCount:0,curedCount:0,deadCount:0,cityName:湖州,confirmedCount:5,suspectedCount:0,curedCount:0,deadCount:0],provinceName:广东省,provinceShortName:广东,confirmedCount:241,suspectedCount:0,curedCount:5,deadCount:0,comment:,cities:[cityName:广州,confirmedCount:63,suspectedCount:0,curedCount:0,deadCount:0,cityName:深圳,confirmedCount:63,suspectedCount:0,curedCount:4,deadCount:0,cityName:佛山,confirmedCount:18,suspectedCount:0,curedCount:0,deadCount:0,cityName:珠海,confirmedCount:14,suspectedCount:0,curedCount:0,deadCount:0,cityName:惠州,confirmedCount:12,suspectedCount:0,curedCount:1,deadCount:0,cityName:中山,confirmedCount:12,suspectedCount:0,curedCount:0,deadCount:0,cityName:阳江,confirmedCount:10,suspectedCount:0,curedCount:0,deadCount:0,cityName:湛江,confirmedCount:7,suspectedCount:0,curedCount:0,deadCount:0,cityName:东莞,confirmedCount:7,suspectedCount:0,curedCount:0,deadCount:0,cityName:清远,confirmedCount:6,suspectedCount:0,curedCount:0,deadCount:0,cityName:汕头,confirmedCount:6,suspectedCount:0,curedCount:0,deadCount:0,cityName:揭阳,confirmedCount:6,suspectedCount:0,curedCount:0,deadCount:0,cityName:肇庆,confirmedCount:5,suspectedCount:0,curedCount:0,deadCount:0,cityName:韶关,confirmedCount:4,suspectedCount:0,curedCount:0,deadCount:0,cityName:梅州,confirmedCount:4,suspectedCount:0,curedCount:0,deadCount:0,cityName:茂名,confirmedCount:2,suspectedCount:0,curedCount:0,deadCount:0,cityName:汕尾,confirmedCount:1,suspectedCount:0,curedCount:0,deadCount:0,cityName:河源'

regex = "((?<=provinceName:)|(?<=cityName:)).*?(?=,)"
province = re.findall(regex, text)

print(province)
['', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '']

Tags: python, regex, cjk

Solution


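According to this answer, re.findall returns the contents of the capturing groups, rather than the full match, whenever the pattern contains any. Your only capturing group wraps the two lookbehinds, which are zero-width, so it always captures an empty string. I tried your regex at https://regexr101.com, and it likewise returned only empty capture groups.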
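The behaviour described above can be reproduced on a much smaller input. The sketch below uses a shortened, hypothetical sample string (not the full text from the question) and compares re.findall with re.finditer, whose group(0) is the whole match:

```python
import re

# Shortened, hypothetical sample in the same shape as the question's text.
sample = "provinceName:湖北省,provinceShortName:湖北,cityName:武汉,confirmedCount:1905,"

# The only capturing group wraps two zero-width lookbehinds,
# so the group itself always captures an empty string.
capturing = "((?<=provinceName:)|(?<=cityName:)).*?(?=,)"

group_results = re.findall(capturing, sample)
print(group_results)  # ['', '']

# group(0) is the whole match, which does contain the names.
full_matches = [m.group(0) for m in re.finditer(capturing, sample)]
print(full_matches)  # ['湖北省', '武汉']
```

So the pattern does match the names; it is only what re.findall chooses to return that makes the result look empty.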

You can fix this by using a non-capturing group (?:...) instead:

regex = "(?:(?<=provinceName:)|(?<=cityName:)).*?(?=,)"
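As a quick check, here is a minimal sketch that runs the non-capturing pattern on a shortened, hypothetical sample. It also shows one possible alternative, which is my own assumption and not part of the original answer: match the key literally and capture only the value, so that a final name with no trailing comma is still found.

```python
import re

# Shortened, hypothetical sample; note the last city has no trailing comma.
sample = "provinceName:浙江省,provinceShortName:浙江,cities:[cityName:温州,confirmedCount:114,cityName:杭州"

# Non-capturing group: re.findall now returns the full matches.
fixed = "(?:(?<=provinceName:)|(?<=cityName:)).*?(?=,)"
names = re.findall(fixed, sample)
print(names)  # ['浙江省', '温州']

# Alternative (an assumption, not from the answer): capture the value itself
# and stop at a comma or closing bracket instead of requiring a comma.
alt = r"(?:provinceName|cityName):([^,\]]+)"
alt_names = re.findall(alt, sample)
print(alt_names)  # ['浙江省', '温州', '杭州']
```

With the lookahead (?=,), a name at the very end of the text (such as 河源 in the question's sample) is skipped because no comma follows it.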
