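r - Fast way to match and replace strings using another data frame in R
Problem description
I have two data frames that look like the ones below, although the first is more than 90 million rows long and the second has slightly more than 14 million rows. The second data frame is also in random order.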
df1 <- data.frame(
datalist = c("wiki/anarchist_schools_of_thought can differ fundamentally supporting anything from extreme wiki/individualism to complete wiki/collectivism",
"strains of anarchism have often been divided into the categories of wiki/social_anarchism and wiki/individualist_anarchism or similar dual classifications",
"the word is composed from the word wiki/anarchy and the suffix wiki/-ism themselves derived respectively from the greek i.e",
"anarchy from anarchos meaning one without rulers from the wiki/privative prefix wiki/privative_alpha an- i.e",
"authority sovereignty realm magistracy and the suffix or -ismos -isma from the verbal wiki/infinitive suffix -izein",
"the first known use of this word was in 1539"),
words = c("anarchist_schools_of_thought individualism collectivism", "social_anarchism individualist_anarchism",
"anarchy -ism", "privative privative_alpha", "infinitive", ""),
stringsAsFactors=FALSE)
df2 <- data.frame(
vocabword = c("anarchist_schools_of_thought", "individualism","collectivism" , "1965-66_nhl_season_by_team","social_anarchism","individualist_anarchism",
"anarchy","-ism","privative","privative_alpha", "1310_the_ticket", "infinitive"),
token = c("Anarchist_schools_of_thought" ,"Individualism", "Collectivism", "1965-66_NHL_season_by_team", "Social_anarchism", "Individualist_anarchism" ,"Anarchy",
"-ism", "Privative" ,"Alpha_privative", "KTCK_(AM)" ,"Infinitive"),
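stringsAsFactors = FALSE)
I was able to extract all the words that follow "wiki/" into another column (words). Each of these words needs to be replaced with the value of the token column whose vocabword matches it in the second data frame. For example, I would take the word "anarchist_schools_of_thought" after wiki/ in the first row of the first data frame, find "anarchist_schools_of_thought" under vocabword in the second data frame, and replace it with the corresponding token, "Anarchist_schools_of_thought".
So the end result should look like this: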
1 wiki/Anarchist_schools_of_thought can differ fundamentally supporting anything from extreme wiki/Individualism to complete wiki/Collectivism
2 strains of anarchism have often been divided into the categories of wiki/Social_anarchism and wiki/Individualist_anarchism or similar dual classifications
3 the word is composed from the word wiki/Anarchy and the suffix wiki/-ism themselves derived respectively from the greek i.e
4 anarchy from anarchos meaning one without rulers from the wiki/Privative prefix wiki/Alpha_privative an- i.e
5 authority sovereignty realm magistracy and the suffix or -ismos -isma from the verbal wiki/Infinitive suffix -izein
6 the first known use of this word was in 1539
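I realize that many of them only capitalize the first letter of the word, but some are substantially different. I could write a for loop, but I think that would take too long; I would prefer a data.table approach, or possibly stringi or stringr. Normally I would just do a merge, but having to replace several words within a single row complicates things.
Thanks in advance for your help.
Solution
You can do this with str_replace_all from stringr: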
library(stringr)
# names = strings to find (vocabword), values = replacements (token)
str_replace_all(df1$datalist, setNames(df2$token, df2$vocabword))
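Essentially, str_replace_all lets you supply a named vector in which the names are the original strings and the elements are their replacements. You have already done the hard work by building the "dictionary" of strings and replacements; str_replace_all simply takes it and performs the replacements automatically.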
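To make the named-vector form concrete, here is a minimal sketch on two toy strings. The variable names (dict, seq_dict, res1, res2) are made up for illustration. Note that the names are interpreted as regular expressions and the pairs are applied one after another, so a short key can rewrite part of a longer one before the longer key is tried.

```r
library(stringr)

# Names are the patterns to find; values are what they are replaced with.
dict <- c("anarchy" = "Anarchy", "infinitive" = "Infinitive")
res1 <- str_replace_all("wiki/anarchy and wiki/infinitive", dict)
res1
# "wiki/Anarchy and wiki/Infinitive"

# Pairs are applied in order: "privative" is replaced first, after which
# the pattern "privative_alpha" no longer matches the modified string.
seq_dict <- c("privative" = "Privative", "privative_alpha" = "Alpha_privative")
res2 <- str_replace_all("wiki/privative_alpha", seq_dict)
res2
# "wiki/Privative_alpha"
```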
Result:
[1] "wiki/Anarchist_schools_of_thought can differ fundamentally supporting anything from extreme wiki/Individualism to complete wiki/Collectivism"
[2] "strains of anarchism have often been divided into the categories of wiki/Social_anarchism and wiki/Individualist_anarchism or similar dual classifications"
[3] "the word is composed from the word wiki/Anarchy and the suffix wiki/-ism themselves derived respectively from the greek i.e"
[4] "Anarchy from anarchos meaning one without rulers from the wiki/Privative prefix wiki/Privative_alpha an- i.e"
[5] "authority sovereignty realm magistracy and the suffix or -ismos -isma from the verbal wiki/Infinitive suffix -izein"
[6] "the first known use of this word was in 1539"
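Row [4] of this result differs from the desired output in the question: the leading "anarchy" was capitalized even though it is not preceded by wiki/, and wiki/privative_alpha became wiki/Privative_alpha rather than wiki/Alpha_privative. One possible way around this, sketched here in base R rather than stringr, is to match only the word that follows wiki/ and look it up exactly. The data below is a small stand-in for df1 and df2, and lookup_token is a hypothetical helper, not part of the original answer. How this scales to 90 million rows has not been measured.

```r
# Small stand-ins for df1$datalist and the two columns of df2.
datalist <- c(
  "anarchy from anarchos meaning one without rulers from the wiki/privative prefix wiki/privative_alpha an- i.e",
  "the first known use of this word was in 1539"
)
vocabword <- c("anarchy", "privative", "privative_alpha")
token     <- c("Anarchy", "Privative", "Alpha_privative")

# Hypothetical helper: exact, vectorised lookup of words in vocabword.
# Words with no entry are returned unchanged.
lookup_token <- function(words) {
  hit <- match(words, vocabword)
  found <- !is.na(hit)
  words[found] <- token[hit[found]]
  words
}

# Match only the run of non-space characters directly after "wiki/".
m <- gregexpr("(?<=wiki/)\\S+", datalist, perl = TRUE)

# Replace each matched word with its looked-up token.
fixed <- datalist
regmatches(fixed, m) <- lapply(regmatches(datalist, m), lookup_token)
fixed
```

Because each word is looked up as a whole via match(), keys such as privative and privative_alpha cannot interfere with each other, and text outside a wiki/ link is left alone.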