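python - Regex to extract the street address from a string
Problem description
Given the sample texts below, I want to extract the street address (the text between the asterisks). With the regex below I can extract the street address from most of the samples, but it fails mainly on text4 and text5.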
regex = r"(^[0-9]+[\s\-0-9,A-Za-z]+)"
text1 = *9635 E COUNTY ROAD, 1000 N*.
text2 = *8032 LIBERTY RD S*.
text3 = *2915 PENNSYLVANIA AVENUE* 40 Other income (loss) 15 Alternative minimum tax (AMT) ilems
A 2,321
text4 = *2241 Western Ave*. 10 Other income loss 15 — Altemative minimum tax AMT itams
text5 = *450 7TH STREET, APT 2-M*
text6 = *9635 East County Road 1000 North*
My code:
    import re

    for k, v in val.items():
        if k == "Shareholder Address Street":
            # Join the text fragments into one string before matching
            text = " ".join(v)
            pattern1 = r"(^[0-9]+[\s\-0-9,A-Za-z]+)"
            addressRegex = re.compile(pattern1)
            match = addressRegex.search(text)
            if match is not None:
                # Keep only the matched street, wrapped in a list
                delta = []
                delta.append("".join(match.group(0)))
                val[k] = delta
Can anyone suggest a change to the regex above? It already works for most files.
Solution
Use
^\d+(?:[ \t][\w,-]+)*
See proof.
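As a quick check, here is a minimal sketch that runs the suggested pattern over some of the sample strings from the question. It assumes the asterisks are only markers in the question and are not present in the real input; the helper name `extract_street` and the `samples` dict are made up for illustration, and text4 is shortened.

```python
import re

# Pattern from the answer: leading digits, then tokens separated by a space or tab
STREET_RE = re.compile(r"^\d+(?:[ \t][\w,-]+)*")

# Sample strings from the question, with the marker asterisks removed (assumption)
samples = {
    "text1": "9635 E COUNTY ROAD, 1000 N.",
    "text2": "8032 LIBERTY RD S.",
    "text4": "2241 Western Ave. 10 Other income loss 15",
    "text5": "450 7TH STREET, APT 2-M",
    "text6": "9635 East County Road 1000 North",
}


def extract_street(text):
    """Return the leading street address, or None if the text does not start with digits."""
    match = STREET_RE.search(text)
    return match.group(0) if match else None


results = {name: extract_street(text) for name, text in samples.items()}
for name, street in results.items():
    print(name, "->", street)
```

In these samples the match stops at the period after the street, because `.` is not in the character class. For text3, where no period follows the street, the pattern keeps consuming the following words up to the first character outside the class (the opening parenthesis), so that case would still need extra handling.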
Explanation
--------------------------------------------------------------------------------
  ^                        the beginning of the string
--------------------------------------------------------------------------------
  \d+                      digits (0-9) (1 or more times (matching
                           the most amount possible))
--------------------------------------------------------------------------------
  (?:                      group, but do not capture (0 or more times
                           (matching the most amount possible)):
--------------------------------------------------------------------------------
    [ \t]                    any character of: ' ', '\t' (tab)
--------------------------------------------------------------------------------
    [\w,-]+                  any character of: word characters (a-z,
                             A-Z, 0-9, _), ',', '-' (1 or more times
                             (matching the most amount possible))
--------------------------------------------------------------------------------
  )*                       end of grouping
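The breakdown above can also be kept next to the pattern itself by compiling it with `re.VERBOSE`. The sketch below does that and plugs the pattern into a loop shaped like the one in the question. The function name `clean_street` and the `record` dict are hypothetical; the real `val` structure is only assumed to map the key to a list of text fragments.

```python
import re

# The answer's pattern, written with re.VERBOSE so each part carries its own comment.
# Whitespace inside a character class such as [ \t] is still significant in this mode.
STREET_RE = re.compile(
    r"""
    ^               # beginning of the string
    \d+             # one or more digits (the house number)
    (?:             # non-capturing group, repeated zero or more times
        [ \t]       #   a single space or tab
        [\w,-]+     #   one or more word characters, commas, or hyphens
    )*
    """,
    re.VERBOSE,
)


def clean_street(val, key="Shareholder Address Street"):
    """Replace val[key] (a list of text fragments) with a one-item list holding the street."""
    if key in val:
        text = " ".join(val[key])
        match = STREET_RE.search(text)
        if match is not None:
            val[key] = [match.group(0)]
    return val


# Hypothetical record shaped like the `val` dict in the question
record = {"Shareholder Address Street": ["2241 Western Ave.", "10 Other income loss 15"]}
clean_street(record)
print(record)
```

Compiling the pattern once at module level avoids recompiling it on every loop iteration, and looking the key up directly replaces the scan over all items.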