regex - Remove regex patterns from text strings in a data frame in R
Problem description
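I have a dataset with 1000 rows in which the text contains order descriptions for lamps. The data is full of inconsistent patterns. I consulted a couple of existing solutions ("R remove multiple text strings in data frame" and "Remove multiple patterns from text vector r"); they helped somewhat but did not solve the problem.
I want to remove all delimiters and keep only the words that are present in the wordstoreplace vector.
I tried using lapply to remove the delimiters, and after that I created two vectors, "wordstoremove" and "wordstoreplace".
I am trying to apply str_remove_all() and str_replace_all(). The first function works, but the second does not.
Initially I tried a very naive approach, but it was too clumsy.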
mydata_sample=data.frame(x=c("LAMP, FLUORESCENT;TYPE TUBE LIGHT, POWER 8 W, POTENTIAL 230 V, COLORWHITE, BASE G5, LENGTH 302.5 MM; P/N: 37755,Mnfr:SuryaREF: MODEL: FW/T5/33 GE 1/25,",
"LAMP, INCANDESCENT;TYPE HALOGEN, POWER 1 KW, POTENTIAL 230 V, COLORWHITE, BASE R7S; Make: Surya",
"BALLAST, LAMP; TYPE: ELECTROMAGNETIC, LAMP TYPE: TUBELIGHT/FLUORESCENT, POWER: 36/40 W, POTENTIAL: 240VAC 50HZ; LEGACY NO:22038 Make :Havells , Cat Ref No : LHB7904025",
"SWITCH,ELECTRICAL,TYPE:1 WCR WAY,VOLTAGE:230V,CURRENT RATED:10A,NUMBEROFPOLES:1P,ADDITIONAL INFORMATION:FOR SNAPMODULESWITCH",
"Brief Desc:HIGH PRES. SODIUM VAPOUR LAMP 250W/400WDetailed Desc:Purchase order text :Short Description :HIGH PRES. SODIUM VAPOURLAMP 250W/400W===============================Part No :SON-T 250W/400W===============================Additional Specification :HIGH PRESSURE SODIUM VAPOUR LAMPSON-T 250W/400W USED IN SURFACE INS SYSTEM TOP LIGHT"))
delimiters1=c('"',"\r\n",'-','=',';')
delimiters2=c('*',',',':')
library(dplyr)
library(stringr)
# First pass: create x1 from x with the first set of delimiters removed
dat <- mydata_sample %>%
  mutate(x1 = str_remove_all(x, regex(str_c("\\b", delimiters1, "\\b", collapse = '|'), ignore_case = TRUE)))
# Second pass: this call fails with the error shown below
dat <- dat %>%
  mutate(x1 = str_remove_all(x1, regex(str_c("\\b", delimiters2, "\\b", collapse = '|'), ignore_case = TRUE)))
#### Error in stri_replace_all_regex(string, pattern, fix_replacement(replacement), :
####   Syntax error in regexp pattern. (U_REGEX_RULE_SYNTAX)
wordstoremove=c('Mnfr','MNFR',"VAPOURTYPEHIGH",'LHZZ07133099MNFR',"BJHF","BJOS",
"BGEMF","BJIR","LIGHTING","FFT","FOR","ACCOMMODATIONQUANTITY","Cat",
"Ref","No","Type","TYPE","QUANTITY","P/N")
wordstoreplace=c('HAVELLS','Havells','Bajaj','BAJAJGrade A','PHILIPS',
'Philips',"MAKEBAJAJ/CG","philips","Philips/Grade A/Grade A/CG/GEPurchase","CG","Bajaj",
"BAJAJ")
dat1 <- dat %>%
  mutate(x1 = str_remove_all(x1, regex(str_c("\\b", wordstoremove, "\\b", collapse = '|'), ignore_case = T)))
dat1 = dat1 %>%
  mutate(x1 = str_replace_all(x1, wordstoreplace, 'Grade A'), ignore_case = T)
### Warning message:
### In stri_replace_all_regex(string, pattern, fix_replacement(replacement), :
###   longer object length is not a multiple of shorter object length
Solution
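The regex fails because special characters have to be escaped. In these vectors the culprit is the unescaped * in delimiters2, which the regex engine reads as a quantifier rather than a literal asterisk; escaping the other punctuation is harmless. See the difference here: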
# orig delimiters1=c('"', "\r\n", '-', '=', ';')
delimiters1=c('\\"', "\r\n", '-', '\\=', ';')
# orig delimiters2=c('*', ',', ':')
delimiters2=c('\\*', ',', '\\:')
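As an alternative to escaping each delimiter by hand, the delimiters can be placed inside a single character class, where *, =, : and the other punctuation are treated literally. The following is a minimal sketch, not the code from the question: the sample string and the names sample_text, delims_class and cleaned are made up for illustration, and the \\b anchors are left out on the assumption that every occurrence of a delimiter should be removed.

```r
library(stringr)

# Hypothetical sample string containing each single-character delimiter
sample_text <- "A,B;C:D*E=F-G\"H"

# One character class covering the delimiters; the hyphen is escaped so it
# is not read as a range, and \r\n is matched as a separate alternative
delims_class <- "[\"=;*,:\\-]|\r\n"

cleaned <- str_remove_all(sample_text, delims_class)
print(cleaned)
```

Inside the brackets no per-character escaping is needed apart from the hyphen, so adding another delimiter later only means adding one character to the class.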
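For your str_replace_all() call, the words need to be a single string separated by |, rather than a vector of 12 patterns. When pattern is a vector, stringr pairs its elements with the elements of the string vector, which is what produces the "longer object length is not a multiple of shorter object length" warning.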
wordstoreplace <-
c('HAVELLS','Havells','Bajaj','BAJAJGrade A','PHILIPS',
'Philips',"MAKEBAJAJ/CG","philips","Philips/Grade A/Grade A/CG/GEPurchase","CG","Bajaj",
"BAJAJ") %>%
paste0(collapse = "|")
# "HAVELLS|Havells|Bajaj|BAJAJGrade A|PHILIPS|Philips|MAKEBAJAJ/CG|philips|Philips/Grade A/Grade A/CG/GEPurchase|CG|Bajaj|BAJAJ"
The following then runs without raising an error:
dat1 <-
  dat %>%
  mutate(
    x1 = str_remove_all(x1, regex(str_c("\\b", wordstoremove, "\\b", collapse = "|"), ignore_case = TRUE)),
    x1 = str_replace_all(x1, wordstoreplace, "Grade A")
  )
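To see how a single alternation behaves, here is a small sketch on made-up data (the vectors descriptions and brands are illustrative and not taken from the dataset). It also wraps the pattern in regex(..., ignore_case = TRUE), because in the question ignore_case was passed to mutate() rather than to the pattern, so it had no effect on matching.

```r
library(stringr)

# Hypothetical descriptions and brand names, for illustration only
descriptions <- c("Make: Havells", "Make :PHILIPS lamp", "Make: Surya")
brands <- c("Havells", "Philips")

# Collapse the brand names into one alternation: "Havells|Philips"
brand_pattern <- str_c(brands, collapse = "|")

# ignore_case belongs inside regex(), next to the pattern it modifies
graded <- str_replace_all(
  descriptions,
  regex(brand_pattern, ignore_case = TRUE),
  "Grade A"
)
print(graded)
```

Every row is matched against the same pattern, so the number of brand names no longer has to line up with the number of rows.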