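javascript - How do I extract information from the following text?
Problem description
I am trying to extract the title, description, and address from text taken from different websites. I am currently doing some web scraping to collect this information, but I cannot come up with a regular expression that produces the expected text output shown below.
How can I improve my regular expression and build the rule set below into it so that it extracts this information?
My regular expression: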
(^.+\n)(^.+\n)?(^\d+.*\d{6})
Rule set to embed:
First line (title)
- can contain any letters and numbers
- should not contain a dot (.)
Second line (description or additional information)
- can contain any letters and numbers
- should contain a dot (.)
- second line can be empty
- if it is empty, extract only the first line, which is the title
Third line (address)
- address extraction
Input text:
View store information
TAMPINES MART
11559.33Km Away,
5 TAMPINES ST 32, #01-07/16 TAMPINESS MART, 529284
67817232
Open Now
Full Menu
View store information
THE SIGNATURE
The SIGNATURE is a wonderful destination for shopping text.
51, CHANGI BUSINESS PARK CENTRAL 2, #01-15, THE SIGNATURE, 486066
65883667
Open Now
Full Menu
Jewel Changi Airport
Jewel Changi Airport is a breath-taking place for families text.
78 Airport Boulevard, #B2-275-277 Jewel Changi Airport, Singapore, 819666
Expected text output (ideally):
TAMPINES MART
11559.33Km Away,
5 TAMPINES ST 32, #01-07/16 TAMPINESS MART, 529284
THE SIGNATURE
11559.97Km Away,
51, CHANGI BUSINESS PARK CENTRAL 2, #01-15, THE SIGNATURE, 486066
Jewel Changi Airport
78 Airport Boulevard, #B2-275-277 Jewel Changi Airport, Singapore, 819666
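Solution
One option is to use \w and repeat the first capturing group over consecutive lines of plain words. A repeated capturing group keeps only the value of its last iteration, so group 1 ends up holding the title. The optional group 2 captures the description (a line containing a dot), and group 3 captures the address: the line that ends in a space followed by six digits, plus any following lines that do not start with an uppercase letter. With the sample input, that means the phone number line is captured as part of the address.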
^(\w+(?: \w+)*\r?\n)*(?:(?![^.\r\n]*\.|.*\d{6}).*\r?\n)*(?:([^\r\n.]*\..*(?:\r?\n(?!.* \d{6}).*)*)\r?\n)?(.* \d{6}(?:\r?\n(?![A-Z]).*)*)$
const regex = /^(\w+(?: \w+)*\r?\n)*(?:(?![^.\r\n]*\.|.*\d{6}).*\r?\n)*(?:([^\r\n.]*\..*(?:\r?\n(?!.* \d{6}).*)*)\r?\n)?(.* \d{6}(?:\r?\n(?![A-Z]).*)*)$/mg;
const str = `View store information
TAMPINES MART
11559.33Km Away,
5 TAMPINES ST 32, #01-07/16 TAMPINESS MART, 529284
67817232
Open Now
Full Menu
View store information
THE SIGNATURE
The SIGNATURE is a wonderful destination for shopping text.
51, CHANGI BUSINESS PARK CENTRAL 2, #01-15, THE SIGNATURE, 486066
65883667
Open Now
Full Menu
Jewel Changi Airport
Jewel Changi Airport is a breath-taking place for families text.
78 Airport Boulevard, #B2-275-277 Jewel Changi Airport, Singapore, 819666`;
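let m;
while ((m = regex.exec(str)) !== null) {
  // This is necessary to avoid infinite loops with zero-width matches
  if (m.index === regex.lastIndex) {
    regex.lastIndex++;
  }
  // Group 1 includes the trailing line break of the title line, so trim it.
  // It is undefined when no plain-word line precedes the match.
  console.log("Title: " + (m[1] || "").trim());
  // Group 2 is optional and is undefined when there is no description line
  if (undefined !== m[2]) {
    console.log("Description: " + m[2]);
  }
  console.log("Address: " + m[3]);
  console.log("\n");
}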
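To make the pattern easier to read, here is a minimal sketch that assembles the same regular expression from named pieces and returns the fields as objects. The names `extractStores`, `storePattern`, and the piece constants are hypothetical and not part of the original answer. Keeping only the first address line is also an assumption about the desired output, since the pattern itself captures continuation lines such as the phone number.

```javascript
// Hypothetical helper: the same pattern as above, split into named pieces.
const NL = String.raw`\r?\n`;
// Consecutive lines made of plain words; the repeated group keeps only the
// last line it matched, which serves as the title.
const titleLines = String.raw`(\w+(?: \w+)*${NL})*`;
// Skip lines that contain neither a dot nor a six-digit number.
const skipped = String.raw`(?:(?![^.\r\n]*\.|.*\d{6}).*${NL})*`;
// Optional description: a line containing a dot, plus any following lines
// up to (but not including) the address line.
const description = String.raw`(?:([^\r\n.]*\..*(?:${NL}(?!.* \d{6}).*)*)${NL})?`;
// Address: a line ending in a space and six digits, plus following lines
// that do not start with an uppercase letter.
const address = String.raw`(.* \d{6}(?:${NL}(?![A-Z]).*)*)$`;

const storePattern = new RegExp("^" + titleLines + skipped + description + address, "gm");

function extractStores(text) {
  const stores = [];
  for (const m of text.matchAll(storePattern)) {
    stores.push({
      // Group 1 carries the title line's trailing line break; it is
      // undefined when no plain-word line precedes the match.
      title: (m[1] ?? "").trim(),
      // Group 2 is undefined when there is no description line.
      description: m[2] ?? null,
      // Assumption: keep only the first captured address line.
      address: m[3].split(/\r?\n/)[0],
    });
  }
  return stores;
}

const sample = `View store information
TAMPINES MART
11559.33Km Away,
5 TAMPINES ST 32, #01-07/16 TAMPINESS MART, 529284
67817232
Open Now
Full Menu`;

console.log(extractStores(sample));
```

Because `matchAll` works on a copy of the regular expression, the shared `storePattern` keeps no `lastIndex` state between calls, and the zero-width-match guard is not needed here since every match must contain an address line.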