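python - Python: regular expression to extract Chinese text
Question
I am trying to extract province and city names from the text below (it is HTML, but I removed some escape characters). However, the regular expression I wrote returns a list of empty strings.
When I test the pattern on a regex website (for example https://regex101.com/), it seems to work, but it does not work when I run it in my script.
Here is a shortened version of my code (the HTML dump is much longer).
Any help would be appreciated.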
import re
text = 'try window.getAreaStat = [provinceName:湖北省,provinceShortName:湖北,confirmedCount:3554,suspectedCount:0,curedCount:80,deadCount:125,comment:待明确地区:治愈 30,cities:[cityName:武汉,confirmedCount:1905,suspectedCount:0,curedCount:47,deadCount:104,cityName:黄冈,confirmedCount:324,suspectedCount:0,curedCount:2,deadCount:5,cityName:孝感,confirmedCount:274,suspectedCount:0,curedCount:0,deadCount:3,cityName:荆门,confirmedCount:142,suspectedCount:0,curedCount:0,deadCount:4,cityName:襄阳,confirmedCount:131,suspectedCount:0,curedCount:0,deadCount:0,cityName:随州,confirmedCount:116,suspectedCount:0,curedCount:0,deadCount:0,cityName:咸宁,confirmedCount:112,suspectedCount:0,curedCount:0,deadCount:0,cityName:荆州,confirmedCount:101,suspectedCount:0,curedCount:1,deadCount:2,cityName:十堰,confirmedCount:88,suspectedCount:0,curedCount:0,deadCount:0,cityName:黄石,confirmedCount:86,suspectedCount:0,curedCount:0,deadCount:1,cityName:鄂州,confirmedCount:84,suspectedCount:0,curedCount:0,deadCount:1,cityName:宜昌,confirmedCount:63,suspectedCount:0,curedCount:0,deadCount:1,cityName:恩施州,confirmedCount:51,suspectedCount:0,curedCount:0,deadCount:0,cityName:天门,confirmedCount:34,suspectedCount:0,curedCount:0,deadCount:3,cityName:仙桃,confirmedCount:32,suspectedCount:0,curedCount:0,deadCount:0,cityName:潜江,confirmedCount:8,suspectedCount:0,curedCount:0,deadCount:1,cityName:神农架林区,confirmedCount:3,suspectedCount:0,curedCount:0,deadCount:0],provinceName:浙江省,provinceShortName:浙江,confirmedCount:296,suspectedCount:0,curedCount:3,deadCount:0,comment:,cities:[cityName:温州,confirmedCount:114,suspectedCount:0,curedCount:3,deadCount:0,cityName:杭州,confirmedCount:51,suspectedCount:0,curedCount:0,deadCount:0,cityName:台州,confirmedCount:40,suspectedCount:0,curedCount:0,deadCount:0,cityName:宁波,confirmedCount:20,suspectedCount:0,curedCount:0,deadCount:0,cityName:绍兴,confirmedCount:19,suspectedCount:0,curedCount:0,deadCount:0,cityName:嘉兴,confirmedCount:14,suspectedCount:0,curedCount:0,deadCount:0,cityName:金华,confirmedCount:13,suspectedCount:0,
curedCount:0,deadCount:0,cityName:衢州,confirmedCount:8,suspectedCount:0,curedCount:0,deadCount:0,cityName:舟山,confirmedCount:6,suspectedCount:0,curedCount:0,deadCount:0,cityName:丽水,confirmedCount:6,suspectedCount:0,curedCount:0,deadCount:0,cityName:湖州,confirmedCount:5,suspectedCount:0,curedCount:0,deadCount:0],provinceName:广东省,provinceShortName:广东,confirmedCount:241,suspectedCount:0,curedCount:5,deadCount:0,comment:,cities:[cityName:广州,confirmedCount:63,suspectedCount:0,curedCount:0,deadCount:0,cityName:深圳,confirmedCount:63,suspectedCount:0,curedCount:4,deadCount:0,cityName:佛山,confirmedCount:18,suspectedCount:0,curedCount:0,deadCount:0,cityName:珠海,confirmedCount:14,suspectedCount:0,curedCount:0,deadCount:0,cityName:惠州,confirmedCount:12,suspectedCount:0,curedCount:1,deadCount:0,cityName:中山,confirmedCount:12,suspectedCount:0,curedCount:0,deadCount:0,cityName:阳江,confirmedCount:10,suspectedCount:0,curedCount:0,deadCount:0,cityName:湛江,confirmedCount:7,suspectedCount:0,curedCount:0,deadCount:0,cityName:东莞,confirmedCount:7,suspectedCount:0,curedCount:0,deadCount:0,cityName:清远,confirmedCount:6,suspectedCount:0,curedCount:0,deadCount:0,cityName:汕头,confirmedCount:6,suspectedCount:0,curedCount:0,deadCount:0,cityName:揭阳,confirmedCount:6,suspectedCount:0,curedCount:0,deadCount:0,cityName:肇庆,confirmedCount:5,suspectedCount:0,curedCount:0,deadCount:0,cityName:韶关,confirmedCount:4,suspectedCount:0,curedCount:0,deadCount:0,cityName:梅州,confirmedCount:4,suspectedCount:0,curedCount:0,deadCount:0,cityName:茂名,confirmedCount:2,suspectedCount:0,curedCount:0,deadCount:0,cityName:汕尾,confirmedCount:1,suspectedCount:0,curedCount:0,deadCount:0,cityName:河源'
regex = "((?<=provinceName:)|(?<=cityName:)).*?(?=,)"
province = re.findall(regex, text)
print(province)
['', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '']
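Solution
According to this answer, re.findall returns the captured groups whenever the pattern contains any. I tried your regular expression at https://regexr101.com, and every capture group it returned was empty. The only group in your pattern wraps the two lookbehinds, which are zero-width, so each match contributes an empty string to the result.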
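The behavior can be reproduced on a much smaller input. The following is a minimal sketch; the `sample` string and the variable names are made up for illustration and are not part of the original post. It compares what `re.findall` returns with the full match text exposed by `re.finditer` and `m.group(0)`.

```python
import re

# Hypothetical shortened sample in the same shape as the question's text.
sample = "provinceName:湖北省,provinceShortName:湖北,cities:[cityName:武汉,confirmedCount:1905,"

# The pattern from the question: one capturing group around the lookbehinds.
pattern = "((?<=provinceName:)|(?<=cityName:)).*?(?=,)"

# With one capturing group, findall returns what that group captured.
group_only = re.findall(pattern, sample)

# group(0) is always the whole match, whatever groups the pattern has.
whole_matches = [m.group(0) for m in re.finditer(pattern, sample)]

print(group_only)
print(whole_matches)
```

So the pattern itself does match the names; only the value reported by `re.findall` is the empty group.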
You can fix this by using a non-capturing group, (?:...), instead:
regex = "(?:(?<=provinceName:)|(?<=cityName:)).*?(?=,)"
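As a quick check, here is a sketch on a shortened sample (the `sample` string is hypothetical, cut down from the question's data). It also shows a variant that is not from the original answer: the capturing group is placed around the name itself and stops at a comma or `]`, which may be useful because the lookahead version skips a final entry that has no trailing comma.

```python
import re

# Hypothetical shortened sample; the last city has no trailing comma.
sample = "provinceName:广东省,provinceShortName:广东,cities:[cityName:广州,confirmedCount:63,cityName:河源"

# Fix from the answer: the lookbehind alternation sits in a non-capturing group,
# so findall returns the whole match.
fixed = "(?:(?<=provinceName:)|(?<=cityName:)).*?(?=,)"
names = re.findall(fixed, sample)

# Alternative (an assumption, not part of the original answer):
# capture the value directly, up to the next comma or closing bracket.
alt = r"(?:provinceName|cityName):([^,\]]+)"
names_alt = re.findall(alt, sample)

print(names)
print(names_alt)
```

Neither pattern picks up `provinceShortName`, since neither `provinceName:` nor `cityName:` appears inside that key.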