How do I extract information from the following text?

Problem description

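I am trying to extract the title, description, and address from text taken from different websites. I am currently doing some web scraping to get this information, but I cannot come up with a regular expression that produces the expected text output shown below.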

How can I improve my regular expression and embed the suggested rule set so that it extracts this information?

My regular expression:

(^.+\n)(^.+\n)?(^\d+.*\d{6})

Rule set to embed:

First line (title)
    - can contain any letters and numbers
    - should not contain a dot (.)
Second line (description or additional information)
    - can contain any letters and numbers
    - should contain a dot (.)
    - can be empty
    - if it is empty, extract only the first line, which is the title
Third line (address)
    - address extraction

Input text:

View store information
TAMPINES MART
11559.33Km Away,
5 TAMPINES ST 32, #01-07/16 TAMPINESS MART, 529284
67817232
Open Now
Full Menu
View store information
THE SIGNATURE
The SIGNATURE is a wonderful destination for shopping text.
51, CHANGI BUSINESS PARK CENTRAL 2, #01-15, THE SIGNATURE, 486066
65883667
Open Now
Full Menu
Jewel Changi Airport
Jewel Changi Airport is a breath-taking place for families text.
78 Airport Boulevard, #B2-275-277 Jewel Changi Airport, Singapore, 819666

Expected text output (ideally):

TAMPINES MART
11559.33Km Away,
5 TAMPINES ST 32, #01-07/16 TAMPINESS MART, 529284

THE SIGNATURE
11559.97Km Away,
51, CHANGI BUSINESS PARK CENTRAL 2, #01-15, THE SIGNATURE, 486066

Jewel Changi Airport
78 Airport Boulevard, #B2-275-277 Jewel Changi Airport, Singapore, 819666

Tags: javascript, regex

Solution


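One option is to match words with \w and repeat the first capture group, so that the group ends up holding the value of its last iteration, which is the title.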

^(\w+(?: \w+)*\r?\n)*(?:(?![^.\r\n]*\.|.*\d{6}).*\r?\n)*(?:([^\r\n.]*\..*(?:\r?\n(?!.* \d{6}).*)*)\r?\n)?(.* \d{6}(?:\r?\n(?![A-Z]).*)*)$
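The pattern above is easier to follow when split into its four parts. The sketch below rebuilds it from named pieces and shows that a repeated capture group keeps only its last iteration, which is why group 1 ends up holding the title. The names `titleLines`, `skipLines`, `description`, `address`, and `assembled` are labels chosen here for illustration and are not part of the original answer.

```javascript
// String.raw keeps every backslash as written, so each piece can be
// copied from the pattern unchanged.

// 1. Zero or more lines made only of space-separated words. The group is
//    repeated, so after matching it holds the LAST such line (the title).
const titleLines = String.raw`^(\w+(?: \w+)*\r?\n)*`;

// 2. Skip lines that contain neither a dot nor six consecutive digits.
const skipLines = String.raw`(?:(?![^.\r\n]*\.|.*\d{6}).*\r?\n)*`;

// 3. Optional description: a line containing a dot, plus any following
//    lines that do not contain a space followed by six digits.
const description = String.raw`(?:([^\r\n.]*\..*(?:\r?\n(?!.* \d{6}).*)*)\r?\n)?`;

// 4. Address: a line containing a space and six digits, plus any following
//    lines that do not start with an uppercase letter A-Z.
const address = String.raw`(.* \d{6}(?:\r?\n(?![A-Z]).*)*)$`;

const assembled = new RegExp(titleLines + skipLines + description + address, "mg");

// A repeated capture group reports only its final iteration.
const lastIteration = /^(\w+\n)*/.exec("first\nsecond\n")[1];
console.log(JSON.stringify(lastIteration)); // "second\n"
```

Because group 1 includes the line break of the title line, the captured title usually needs a `trim()` before use.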

Regex demo:

const regex = /^(\w+(?: \w+)*\r?\n)*(?:(?![^.\r\n]*\.|.*\d{6}).*\r?\n)*(?:([^\r\n.]*\..*(?:\r?\n(?!.* \d{6}).*)*)\r?\n)?(.* \d{6}(?:\r?\n(?![A-Z]).*)*)$/mg;
const str = `View store information
TAMPINES MART
11559.33Km Away,
5 TAMPINES ST 32, #01-07/16 TAMPINESS MART, 529284
67817232
Open Now
Full Menu
View store information
THE SIGNATURE
The SIGNATURE is a wonderful destination for shopping text.
51, CHANGI BUSINESS PARK CENTRAL 2, #01-15, THE SIGNATURE, 486066
65883667
Open Now
Full Menu
Jewel Changi Airport
Jewel Changi Airport is a breath-taking place for families text.
78 Airport Boulevard, #B2-275-277 Jewel Changi Airport, Singapore, 819666`;
let m;

while ((m = regex.exec(str)) !== null) {
  // This is necessary to avoid infinite loops with zero-width matches
  if (m.index === regex.lastIndex) {
    regex.lastIndex++;
  }

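  // Group 1 is a repeated group, so it holds the last title-like line
  // together with its line break; trim() removes the line break.
  console.log("Title: " + (m[1] || "").trim());
  if (undefined !== m[2]) {
    console.log("Description: " + m[2]);
  }
  // Group 3 also takes in following lines that do not start with A-Z,
  // such as the phone number line in this sample.
  console.log("Address: " + m[3]);
  console.log("\n");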
}
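As an alternative to a single large pattern, the rule set from the question can be sketched as a line-by-line pass. This is a minimal sketch rather than the answer's method. It assumes that an address line always ends with a space followed by a six-digit postal code (as in the sample data), that a description is any line containing a dot directly above the address, and that the title is the line above that. The function name `extractStores` and the shape of the returned objects are hypothetical.

```javascript
// Hypothetical helper: walks the lines, finds address lines, then looks
// back one or two lines for the description and the title.
function extractStores(text) {
  const lines = text
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter(Boolean); // drop empty lines so "empty description" means "absent"

  const stores = [];
  lines.forEach((line, i) => {
    // Assumption: an address ends with a space and a six-digit postal code.
    if (!/ \d{6}$/.test(line)) return;

    const previous = lines[i - 1] || "";
    // Rule set: the description contains a dot, the title does not.
    const hasDescription = previous.includes(".");
    const title = hasDescription ? (lines[i - 2] || "") : previous;

    const store = { title, address: line };
    if (hasDescription) store.description = previous;
    stores.push(store);
  });
  return stores;
}

// Usage with a shortened version of the input text from the question.
const sampleText = `View store information
TAMPINES MART
11559.33Km Away,
5 TAMPINES ST 32, #01-07/16 TAMPINESS MART, 529284
67817232
Open Now`;

console.log(extractStores(sampleText));
```

Working backwards from the address keeps each rule in its own small check, which can be easier to adjust than the combined pattern when a site formats its listings differently.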

