How to vectorize longest common substring across data.table columns in R

Problem description

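How can I write a function that quickly computes the number of characters in the longest common substring, or returns the longest common substring itself, across two or more columns of a large data.table in R?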

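I adapted the answer to this question: Find the length of overlap in strings, but ran into three problems: 1.) applying it across vectors, because using sapply to create a new result column fails on whitespace and other string features; 2.) applying it to more than 2 columns; and 3.) the given answer does not, I think, include spaces in potential matches. That function is also slow, and I want to apply it to big data.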

Create sample data:

sampdata <- data.frame(
  str1=c("Doug Olivas", "GRANT MANAGEMENT LLC", "LUNA VAN DERESH", "wendy t marzardo", "AMIN NYGUEN COMPANY LLC", "GERARDO CONTRARAS", "miguel martinez","albert marks porter"),
  str2=c("doug olivas", "miguel grant", "LUNA VAN DERESH MANAGEMENT LLC", "marzardo", "amin nyguen llc", "gerardo contraras", "miggy martinez","albert"),
  str3=c("Martin Olivas", "GRANT PROPERTIES", "luna company", "wendy marzardo", "the company of amin nyguen llc", "gerardo c", "miguel t martinez","")
  )

Desired behavior 1 of the made-up function "lcsfoo":

#option type="nchar" to return number of characters INCLUDING SPACES, IGNORING CASE in max common substring
sampdata$desired_LCSnchar <- lcsfoo(sampdata$str1,sampdata$str2,sampdata$str3,type="nchar")

#option type="str" to return the string INCLUDING SPACES, IGNORING CASE of the longest common substring between the columns
sampdata$desired_LCSstr <- lcsfoo(sampdata$str1,sampdata$str2,sampdata$str3,type="str")

#DESIRED RESULTS 1: the above would return the following for the sample data

sampdata$desired_LCSnchar <- c(7,5,5,8,12,9,9,0)
sampdata$desired_LCSstr<- c(" olivas","grant","luna ","marzardo","amin nyguen ","gerardo c"," martinez","")

Ideally, lcsfoo would also accept a variable number of input columns (i.e., 2 columns here instead of the 3 above):

sampdata$str1str2_LCSnchar <- lcsfoo(sampdata$str1,sampdata$str2,type="nchar")
sampdata$str1str2_LCSstr <- lcsfoo(sampdata$str1,sampdata$str2,type="str")

#DESIRED RESULTS 2: the above would return the following for the sample data

sampdata$str1str2_LCSstr<- c("doug olivas","grant","luna van deresh","marzardo","amin nyguen ","gerardo contraras"," martinez","albert")
sampdata$str1str2_LCSnchar <- c(11,5,15,8,12,17,9,6)

I also need the function to work on big data:

library(data.table)
###Create sample big data from previous sampledata and apply on huge DT
samplist <- lapply(c(1:1000),FUN=function(x){sampdata})
bigsampdata <- rbindlist(samplist)

#DESIRED FUNCTION APPLIED ON BIG DATA:
bigsampdata$desired_LCSnchar <- lcsfoo(bigsampdata$str1,bigsampdata$str2,bigsampdata$str3,type="nchar")
bigsampdata$desired_LCSstr <- lcsfoo(bigsampdata$str1,bigsampdata$str2,bigsampdata$str3,type="str")

Tags: r, string, data.table, substring, lcs

Solution
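
The page records no answer, so what follows is only a minimal base-R sketch of the `lcsfoo` interface requested in the question, not a confirmed or tuned solution. It uses a brute-force approach: for each row it enumerates substrings from longest to shortest and keeps the first one present in every input. The helper name `lcs_one`, the `"\x1f"` key separator, and the choice to return `NA` when any input is missing are assumptions made for illustration. Because each distinct combination of inputs is solved only once, heavily repeated rows (as in the 1000-fold sample data) cost little extra.

```r
# Longest common substring of one character vector (one element per column).
# Case is ignored and spaces count as ordinary characters.
lcs_one <- function(strs) {
  strs <- tolower(strs)
  n <- min(nchar(strs))               # the answer cannot exceed the shortest input
  if (n == 0L) return("")
  for (len in n:1) {                  # try the longest candidates first
    subs <- lapply(strs, function(s) {
      starts <- seq_len(nchar(s) - len + 1L)
      substring(s, starts, starts + len - 1L)
    })
    common <- Reduce(intersect, subs) # substrings of this length found in every input
    if (length(common) > 0L) return(common[1L])
  }
  ""                                  # no character is shared by all inputs
}

# Hypothetical implementation of the `lcsfoo` interface from the question.
lcsfoo <- function(..., type = c("str", "nchar")) {
  type <- match.arg(type)
  cols <- lapply(unname(list(...)), as.character)
  rows <- do.call(cbind, cols)        # character matrix, one row per record
  # Solve each distinct combination of inputs once, then map results back.
  # Assumption: the separator "\x1f" never occurs in the data.
  key <- do.call(paste, c(cols, sep = "\x1f"))
  key[rowSums(is.na(rows)) > 0] <- NA # assumption: a missing input yields NA
  first <- which(!duplicated(key) & !is.na(key))
  solved <- vapply(first, function(i) lcs_one(rows[i, ]), character(1))
  out <- solved[match(key, key[first])]
  if (type == "nchar") nchar(out) else out
}
```

With data.table loaded, the result can be added by reference, for example `bigsampdata[, desired_LCSnchar := lcsfoo(str1, str2, str3, type = "nchar")]`. The returned strings are lower-cased because case is ignored during matching. For long strings the quadratic number of candidate substrings per row may become the bottleneck, in which case a suffix-based or compiled implementation would likely be a better fit.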

