Extracting character-level n-grams from text in R

Problem description

I have a data frame with text, and I want to extract character-level bigrams (n = 2) for each text in R, e.g. "st", "ac", "ck".

I also want to count the frequency of each character-level bigram in each text.

Data:

df$text

[1] "hy my name is"
[2] "stackover flow is great"
[3] "how are you"

Tags: r, nlp, character, n-gram

Solution


I'm not quite sure of your expected output here. I would have thought that the bigrams for "stack" would be "st", "ta", "ac", and "ck", since this captures each consecutive pair.

For example, if you wanted to know how many instances of the bigram "th" the word "brothers" had in it, and you split it into the bigrams "br", "ot", "he" and "rs", then you would get the answer 0, which is wrong.
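To see the difference on that example, compare the two splitting strategies directly (a small illustration; the index vectors are just hard-coded for the 8-letter word "brothers"):

```r
chars <- strsplit("brothers", "")[[1]]

# Non-overlapping pairs miss "th"
paste0(chars[c(1, 3, 5, 7)], chars[c(2, 4, 6, 8)])
#> [1] "br" "ot" "he" "rs"

# Consecutive pairs capture it
paste0(chars[-length(chars)], chars[-1])
#> [1] "br" "ro" "ot" "th" "he" "er" "rs"
```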

You can build up a single function to get all bigrams like this:

# This function takes a vector of single characters and creates all the bigrams
# within that vector. For example "s", "t", "a", "c", "k" becomes 
# "st", "ta", "ac", and "ck"

pair_chars <- function(char_vec) {
  all_pairs <- paste0(char_vec[-length(char_vec)], char_vec[-1])
  return(as.vector(all_pairs[nchar(all_pairs) == 2]))
}

# This function splits a single word into a character vector and gets its bigrams

word_bigrams <- function(words){
  unlist(lapply(strsplit(words, ""), pair_chars))
}

# This function splits a string or vector of strings into words and gets their bigrams

string_bigrams <- function(strings){
  unlist(lapply(strsplit(strings, " "), word_bigrams))
}

So now we can test this on your example:

df <- data.frame(text = c("hy my name is", "stackover flow is great", 
                          "how are you"), stringsAsFactors = FALSE)

string_bigrams(df$text)
#>  [1] "hy" "my" "na" "am" "me" "is" "st" "ta" "ac" "ck" "ko" "ov" "ve" "er" "fl"
#> [16] "lo" "ow" "is" "gr" "re" "ea" "at" "ho" "ow" "ar" "re" "yo" "ou"

If you want to count occurrences, you can just use table:

table(string_bigrams(df$text))

#> ac am ar at ck ea er fl gr ho hy is ko lo me my na ou ov ow re st ta ve yo 
#>  1  1  1  1  1  1  1  1  1  1  1  2  1  1  1  1  1  1  1  2  2  1  1  1  1 
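Since the question asks for the frequency of each bigram in each text, you can also apply table() row by row instead of to the whole column. A minimal sketch (string_bigrams is re-defined inline in condensed form so the block runs standalone; per_text_counts is just an illustrative name):

```r
# Condensed version of the helper chain above: split each string into
# words, each word into characters, then paste consecutive pairs
string_bigrams <- function(strings) {
  unlist(lapply(strsplit(unlist(strsplit(strings, " ")), ""), function(ch)
    paste0(ch[-length(ch)], ch[-1])))
}

df <- data.frame(text = c("hy my name is", "stackover flow is great",
                          "how are you"), stringsAsFactors = FALSE)

# One frequency table per row of df
per_text_counts <- lapply(df$text, function(x) table(string_bigrams(x)))
per_text_counts[[1]]
#> am hy is me my na 
#>  1  1  1  1  1  1
```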

However, if you are going to be doing a fair bit of text mining, you should look into specific R packages like stringi, stringr, tm and quanteda that help with the basic tasks.

For example, all of the base R functions I wrote above can be replaced using the quanteda package like this (note that tokenizing by character drops the spaces, so these bigrams also span word boundaries, which is where pairs like "ym" and "ei" come from):

library(quanteda)
char_ngrams(unlist(tokens(df$text, "character")), concatenator = "")
#>  [1] "hy" "ym" "my" "yn" "na" "am" "me" "ei" "is" "ss" "st" "ta" "ac" "ck" 
#> [15] "ko" "ov" "ve" "er" "rf" "fl" "lo" "ow" "wi" "is" "sg" "gr" "re" "ea"
#> [29] "at" "th" "ho" "ow" "wa" "ar" "re" "ey" "yo" "ou"

Created on 2020-06-13 by the reprex package (v0.3.0)
