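r - Entity extraction based on custom lists in R
Problem description
I have a list of texts and a list of entities.
The texts are usually stored as a character vector.
The entity list is a bit more complicated. Some entities can be listed exhaustively, for example a list of the world's major cities. Others cannot be listed exhaustively but can be captured with a regular-expression pattern.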
list_of_text <- c('Lorem ipsum 12-01-2021 eat, Copenhagen 133.001.00.00 ...', 'Lorem ipsum 12-01-2021, Copenhagen www.stackoverflow.com swimming', ...)
entity_city <- c('Copenhagen', 'Paris', 'New York', ...)
entity_IP_address <- c('regex code for IP address')
entity_URL <- c('regex code for URL')
entity_verb <- c('verbs')
Given list_of_text and a list of entities, I want to find the matching entities for each text.
For example, c('Lorem ipsum 12-01-2021 eat drink sleep, Copenhagen 133.001.00.00 ...') has c(eat, drink, sleep) for entity_verb, c(133.001.00.00) for entity_IP_address, and so on.
res <- extract_entity(text = c('Lorem ipsum 12-01-2021 eat drink sleep, Copenhagen 133.001.00.00 ...'),
entities = c(entity_verb, entity_IP_address, entity_city))
res[['verb']]
c('eat', 'drink', 'sleep')
res[['IP']]
c('133.001.00.00')
res[['city']]
c('Copenhagen')
Is there an R package I can use for this?
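Solution
Take a look at the maps and qdapDictionaries packages. For world cities, I subset to cities with a population above 1 million. Otherwise I get a "regular expression is too large" error.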
library(maps)
library(qdapDictionaries)
list_of_text <- c('Lorem ipsum 12-01-2021 eat, Copenhagen 192.41.196.888','192.41.199.888','Lorem ipsum 12-01-2021, Copenhagen www.stackoverflow.com swimming')
# Match dotted-quad tokens. gregexpr() returns every match in every string,
# so the address in the first text is no longer skipped.
# Note: this pattern does not check that each octet is in the range 0-255.
ipRegex <- "\\d+\\.\\d+\\.\\d+\\.\\d+"
unlist(regmatches(x = list_of_text, m = gregexpr(ipRegex, list_of_text, perl = TRUE)))
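The sample addresses above end in 888, which is not a valid IPv4 octet, so a loose digits-and-dots pattern will accept them. If only well-formed addresses should count, a stricter pattern is one option. The following is a minimal sketch in base R; the names octet, strictIpRegex and candidates are made up for illustration, and it assumes octets written with leading zeros (such as 001) should be rejected.

```r
# One octet in the range 0-255, without leading zeros
octet <- "(?:25[0-5]|2[0-4]\\d|1\\d{2}|[1-9]?\\d)"

# Four octets separated by dots; the lookarounds stop the match from
# starting or ending in the middle of a longer run of digits and dots
strictIpRegex <- paste0("(?<![\\d.])", octet, "(?:\\.", octet, "){3}(?![\\d.])")

# Illustrative inputs (assumed, not taken from the question)
candidates <- c("host 192.41.196.88 up", "host 192.41.196.888 up", "10.0.0.1")

found <- unlist(regmatches(candidates,
                           gregexpr(strictIpRegex, candidates, perl = TRUE)))
print(found)
```

With this pattern the second string yields no match, because 888 cannot be read as a complete octet.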
# Join all action verbs into one alternation. The \\b anchors restrict matches
# to whole words, so a verb is not picked up inside a longer word.
verbRegex <- paste0("\\b(?:", paste(unlist(action.verbs), collapse = "|"), ")\\b")
unlist(regmatches(x = list_of_text, m = gregexpr(verbRegex, list_of_text, perl = TRUE)))
# Keep only cities above 1 million inhabitants so the alternation stays small enough to compile
bigCities <- world.cities[world.cities$pop > 1000000, "name"]
citiesRegex <- paste0("\\b(?:", paste(bigCities, collapse = "|"), ")\\b")
unlist(regmatches(x = list_of_text, m = gregexpr(citiesRegex, list_of_text, perl = TRUE)))
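To get closer to the extract_entity() interface sketched in the question, the per-entity steps could be wrapped in one function. This is only a sketch in base R, not an existing package function: terms_to_regex() and extract_entity() are hypothetical helpers, the entity lists are shortened stand-ins, and the result holds one character vector per text under each entity name.

```r
# Hypothetical helper: turn an exhaustive list of terms into one alternation.
# \\Q...\\E quotes each term literally (PCRE), \\b keeps matches on word boundaries.
terms_to_regex <- function(terms) {
  # Longer terms first, so a longer name wins over a shorter one it starts with
  terms <- terms[order(nchar(terms), decreasing = TRUE)]
  paste0("\\b(?:", paste0("\\Q", terms, "\\E", collapse = "|"), ")\\b")
}

# Hypothetical helper: 'entities' is a named list of regex patterns.
# Returns a named list; each element has one character vector per text.
extract_entity <- function(text, entities) {
  lapply(entities, function(pattern) {
    regmatches(text, gregexpr(pattern, text, perl = TRUE))
  })
}

# Shortened, assumed entity definitions for illustration
entities <- list(
  verb = terms_to_regex(c("eat", "drink", "sleep")),
  IP   = "\\b\\d{1,3}(?:\\.\\d{1,3}){3}\\b",
  city = terms_to_regex(c("Copenhagen", "Paris", "New York"))
)

texts <- c("Lorem ipsum 12-01-2021 eat drink sleep, Copenhagen 133.001.00.00",
           "Paris and New York, no address here")

res <- extract_entity(texts, entities)
print(res[["verb"]][[1]])
print(res[["IP"]][[1]])
print(res[["city"]][[2]])
```

Keeping enumerated lists and hand-written patterns in the same named list means both kinds of entity go through the same matching code. For very long term lists the single alternation can still hit the "regular expression is too large" limit, so the list would need to be subset or split in that case.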