Entity extraction based on custom lists in R

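Question

I have a list of texts and a list of entities.

The list of texts is usually a character vector.

The entity lists are a bit more complicated. Some entities can be listed exhaustively, for example a list of major world cities. Others cannot be listed exhaustively, but can be captured with a regular expression pattern.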


list_of_text <- c('Lorem ipsum 12-01-2021 eat, Copenhagen 133.001.00.00 ...', 'Lorem ipsum 12-01-2021, Copenhagen www.stackoverflow.com swimming', ...)

entity_city <- c('Copenhagen', 'Paris', 'New York', ...)

entity_IP_address <- c('regex code for IP address')

entity_URL <- c('regex code for URL')

entity_verb <- c('verbs')

Given list_of_text and a list of entities, I want to find the matching entities for each text.

For example, c('Lorem ipsum 12-01-2021 eat drink sleep, Copenhagen 133.001.00.00 ...') has c(eat, drink, sleep) for entity_verb, c(133.001.00.00) for entity_IP, and so on.


res <- extract_entity(text = c('Lorem ipsum 12-01-2021 eat drink sleep, Copenhagen 133.001.00.00 ...'),
                      entities = list(verb = entity_verb, IP = entity_IP_address, city = entity_city))

res[['verb']]
c('eat', 'drink', 'sleep')

res[['IP']]
c('133.001.00.00')

res[['city']]
c('Copenhagen')

Is there an R package I can use for this?

Tags: r, nlp, text-mining, r-package, named-entity-recognition

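Answer


Take a look at the maps and qdapDictionaries packages. For world cities, I subset to cities with a population of more than 1 million. Otherwise, you get a "regular expression is too large" error.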
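
If dropping the smaller cities is not acceptable, one possible workaround for the size limit is to match the dictionary in batches, so that no single pattern grows too large. The following is only a sketch: `match_in_batches` is a hypothetical helper (not from maps or qdapDictionaries), the batch size of 500 is an arbitrary assumption, and the terms are assumed to contain no regex metacharacters.

```r
# Hypothetical helper: match a long list of terms in several smaller regexes.
# Assumes the terms contain no regex metacharacters.
match_in_batches <- function(text, terms, batch_size = 500) {
  # Split the terms into consecutive groups of at most `batch_size` items.
  batches <- split(terms, ceiling(seq_along(terms) / batch_size))
  hits <- lapply(batches, function(batch) {
    pattern <- paste(batch, collapse = "|")
    unlist(regmatches(text, gregexpr(pattern, text, perl = TRUE)))
  })
  # Combine the hits from all batches and drop duplicates.
  unique(unlist(hits, use.names = FALSE))
}

terms <- c("Copenhagen", "Paris", "New York", "Tokyo", "Lima")
found <- match_in_batches("Lorem ipsum, Copenhagen and Paris", terms, batch_size = 2)
found
```

Each batch is compiled as its own pattern, so the trade-off is one pass over the text per batch.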

library(maps)
library(qdapDictionaries)

list_of_text  <- c('Lorem ipsum 12-01-2021 eat, Copenhagen 192.41.196.888','192.41.199.888','Lorem ipsum 12-01-2021, Copenhagen www.stackoverflow.com swimming')
# Match four dot-separated groups of 1-3 digits (octet ranges are not validated)
ipRegex <- "\\b\\d{1,3}(?:\\.\\d{1,3}){3}\\b"

# gregexpr() finds every match in each string; texts without a match contribute nothing
unlist(regmatches(x = list_of_text, m = gregexpr(ipRegex, list_of_text, perl = TRUE)))

# Join all action verbs into one alternation: "verb1|verb2|..."
# (no word boundaries, so a verb can also match inside a longer word)
verbRegex <- paste(unlist(action.verbs), collapse = "|")

unlist(regmatches(x = list_of_text, m = gregexpr(verbRegex, list_of_text, perl = TRUE)))

# Keep only cities with more than 1 million inhabitants to stay under the regex size limit
bigCities <- world.cities$name[world.cities$pop > 1000000]
citiesRegex <- paste(bigCities, collapse = "|")

unlist(regmatches(x = list_of_text, m = gregexpr(citiesRegex, list_of_text, perl = TRUE)))
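
The pieces above can be wrapped into something close to the interface sketched in the question. This is a minimal base-R sketch, not an existing package function: `extract_entity` and `dict_to_regex` are hypothetical names, and the small verb and city lists are stand-ins for real dictionaries.

```r
# Hypothetical helper: turn an exhaustive list of terms into one regex.
dict_to_regex <- function(terms) {
  # Escape regex metacharacters so each term is matched literally.
  escaped <- gsub("([.|()\\^{}+$*?]|\\[|\\]|\\\\)", "\\\\\\1", terms, perl = TRUE)
  # Put longer terms first so they win over shorter terms they contain.
  escaped <- escaped[order(nchar(escaped), decreasing = TRUE)]
  paste0("\\b(?:", paste(escaped, collapse = "|"), ")\\b")
}

# Hypothetical helper named after the call in the question.
# `patterns` is a named list mapping an entity name to a regex.
# Returns one named list of matches per input text.
extract_entity <- function(text, patterns) {
  lapply(text, function(one_text) {
    lapply(patterns, function(p) {
      regmatches(one_text, gregexpr(p, one_text, perl = TRUE))[[1]]
    })
  })
}

patterns <- list(
  verb = dict_to_regex(c("eat", "drink", "sleep")),          # stand-in dictionary
  IP   = "\\b\\d{1,3}(?:\\.\\d{1,3}){3}\\b",                 # pattern-based entity
  city = dict_to_regex(c("Copenhagen", "Paris", "New York")) # stand-in dictionary
)

res <- extract_entity(
  "Lorem ipsum 12-01-2021 eat drink sleep, Copenhagen 133.001.00.00",
  patterns
)
res[[1]][["verb"]]
res[[1]][["IP"]]
res[[1]][["city"]]
```

Unlike the plain alternations above, this version adds word boundaries, so "eat" would not match inside a longer word. Entities with no match come back as an empty character vector.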
