r - Combining tidy text with synonyms to create a data frame
Problem description
I have the following example data frame:
quoteiD <- c("q1","q2","q3","q4", "q5")
quote <- c("Unthinking respect for authority is the greatest enemy of truth.",
"In the middle of difficulty lies opportunity.",
"Intelligence is the ability to adapt to change.",
"Science is not only a disciple of reason but, also, one of romance and passion.",
"If I have seen further it is by standing on the shoulders of Giants.")
library(dplyr)
quotes <- tibble(quoteiD = quoteiD, quote= quote)
quotes
I created some tidy text as follows:
library(tidytext)
data(stop_words)
tidy_words <- quotes %>%
unnest_tokens(word, quote) %>%
anti_join(stop_words) %>%
count( word, sort = TRUE)
tidy_words
In addition, I looked up synonyms using the qdap package, as shown below:
library(qdap)
syns <- synonyms(tidy_words$word)
The qdap output is a list. For each word in the tidy data frame, I would like to select the first 5 synonyms and create a column named synonyms, as shown below:
word n synonyms
ability 1 adeptness, aptitude, capability, capacity, competence
adapt 1 acclimatize, accommodate, adjust, alter, apply
authority 1 ascendancy, charge, command, control, direction
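What is an elegant way to combine the first 5 words of each list returned by the qdap synonyms function into a single comma-separated string?
Solution
One way to do this with a tidyverse approach is: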
library(plyr)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:plyr':
#>
#> arrange, count, desc, failwith, id, mutate, rename, summarise,
#> summarize
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(tidytext)
library(qdap)
#> Loading required package: qdapDictionaries
#> Loading required package: qdapRegex
#>
#> Attaching package: 'qdapRegex'
#> The following object is masked from 'package:dplyr':
#>
#> explain
#> Loading required package: qdapTools
#>
#> Attaching package: 'qdapTools'
#> The following object is masked from 'package:dplyr':
#>
#> id
#> The following object is masked from 'package:plyr':
#>
#> id
#> Loading required package: RColorBrewer
#>
#> Attaching package: 'qdap'
#> The following object is masked from 'package:dplyr':
#>
#> %>%
#> The following object is masked from 'package:base':
#>
#> Filter
library(tibble)
library(tidyr)
#>
#> Attaching package: 'tidyr'
#> The following object is masked from 'package:qdap':
#>
#> %>%
quotes <- tibble(quoteiD = paste0("q", 1:5),
quote= c(".\n\nthe ebodac consortium consists of partners: janssen (efpia), london school of hygiene and tropical medicine (lshtm),",
"world vision) mobile health software development and deployment in resource limited settings grameen\n\nas such, the ebodac consortium is well placed to tackle.",
"Intelligence is the ability to adapt to change.",
"Science is a of reason of romance and passion.",
"If I have seen further it is by standing on ."))
quotes
#> # A tibble: 5 x 2
#> quoteiD quote
#> <chr> <chr>
#> 1 q1 ".\n\nthe ebodac consortium consists of partners: janssen (efpia~
#> 2 q2 "world vision) mobile health software development and deployment~
#> 3 q3 Intelligence is the ability to adapt to change.
#> 4 q4 Science is a of reason of romance and passion.
#> 5 q5 If I have seen further it is by standing on .
data(stop_words)
tidy_words <- quotes %>%
unnest_tokens(word, quote) %>%
anti_join(stop_words) %>%
count( word, sort = TRUE)
#> Joining, by = "word"
tidy_words
#> # A tibble: 33 x 2
#> word n
#> <chr> <int>
#> 1 consortium 2
#> 2 ebodac 2
#> 3 ability 1
#> 4 adapt 1
#> 5 change 1
#> 6 consists 1
#> 7 deployment 1
#> 8 development 1
#> 9 efpia 1
#> 10 grameen 1
#> # ... with 23 more rows
syns <- synonyms(tidy_words$word)
#> no match for the following:
#> consortium, ebodac, consists, deployment, efpia, grameen, janssen, london, lshtm, partners, settings, software, tropical
#> ========================
syns %>%
plyr::ldply(data.frame) %>% # Change the list to a dataframe (See https://stackoverflow.com/questions/4227223/r-list-to-data-frame)
rename("Word_DefNumber" = 1, "Syn" = 2) %>% # Rename the columns with a name that is more intuitive
separate(Word_DefNumber, c("Word", "DefNumber"), sep = "\\.") %>% # Find the word part of the word and definition number
group_by(Word) %>% # Group by words, so that when we select rows it is done for each word
slice(1:5) %>% # Keep the first 5 rows for each word
summarise(synonyms = paste(Syn, collapse = ", ")) %>% # Combine the synonyms together comma separated using paste
ungroup() # So there are not unintended effects of having the data grouped when using the data later
#> # A tibble: 20 x 2
#> Word synonyms
#> <chr> <chr>
#> 1 ability adeptness, aptitude, capability, capacity, competence
#> 2 adapt acclimatize, accommodate, adjust, alter, apply
#> 3 change alter, convert, diversify, fluctuate, metamorphose
#> 4 development advance, advancement, evolution, expansion, growth
#> 5 health fitness, good condition, haleness, healthiness, robustness
#> 6 hygiene cleanliness, hygienics, sanitary measures, sanitation
#> 7 intelligence acumen, alertness, aptitude, brain power, brains
#> 8 limited bounded, checked, circumscribed, confined, constrained
#> 9 medicine cure, drug, medicament, medication, nostrum
#> 10 mobile ambulatory, itinerant, locomotive, migrant, motile
#> 11 passion animation, ardour, eagerness, emotion, excitement
#> 12 reason apprehension, brains, comprehension, intellect, judgment
#> 13 resource ability, capability, cleverness, ingenuity, initiative
#> 14 romance affair, affaire (du coeur), affair of the heart, amour, at~
#> 15 school academy, alma mater, college, department, discipline
#> 16 science body of knowledge, branch of knowledge, discipline, art, s~
#> 17 standing condition, credit, eminence, estimation, footing
#> 18 tackle accoutrements, apparatus, equipment, gear, implements
#> 19 vision eyes, eyesight, perception, seeing, sight
#> 20 world earth, earthly sphere, globe, everybody, everyone
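Created on 2019-04-05 by the reprex package (v0.2.1)
Note that plyr should be loaded before dplyr, so that dplyr's versions of functions such as rename and summarise mask plyr's rather than the other way round.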
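The pipeline above produces only Word and synonyms, while the question asks for word, n and synonyms in one table. The following is a minimal sketch of one way to get there without attaching plyr at all, which also sidesteps the load-order issue. It runs on small hand-made stand-ins: `syns_mock` and `tidy_words_mock` are hypothetical objects, and the list names are assumed to follow the `word.def_N` pattern that the `separate()` step above relies on. The sixth synonym for "ability" is invented purely to show the truncation to five. `slice_head()` and the `.groups` argument assume dplyr 1.0.0 or later.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical stand-in for the list returned by qdap::synonyms();
# names are ASSUMED to look like "word.def_N".
syns_mock <- list(
  ability.def_1 = c("adeptness", "aptitude", "capability",
                    "capacity", "competence", "competency"),
  adapt.def_1   = c("acclimatize", "accommodate", "adjust"),
  adapt.def_2   = c("alter", "apply", "change")
)

# Hypothetical stand-in for tidy_words; "ebodac" has no synonyms.
tidy_words_mock <- tibble(word = c("ability", "adapt", "ebodac"),
                          n    = c(1L, 1L, 2L))

syn_tbl <- enframe(syns_mock, name = "key", value = "syn") %>%
  unnest(syn) %>%                                  # one row per synonym
  mutate(word = sub("\\.def_.*$", "", key)) %>%    # drop the ".def_N" suffix
  group_by(word) %>%
  slice_head(n = 5) %>%                            # first 5 synonyms per word
  summarise(synonyms = paste(syn, collapse = ", "), .groups = "drop")

# Attach the synonyms to the word counts; unmatched words get NA.
result <- tidy_words_mock %>%
  left_join(syn_tbl, by = "word")

result
```

With the real data, `syns` and `tidy_words` would take the place of the two mock objects. Using `left_join()` keeps words that qdap could not match, with NA in the synonyms column; an `inner_join()` would drop them instead.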