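r - Need an efficient way in R to convert colored UTF-8 emoji characters to their default skin
Question
Is there an efficient way to take the colored (skin-toned) emojis in a vector and turn them into their standard forms? For example, see the two outputs below; I may not be using the proper terminology. Currently I am doing this: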
library(rjson)
library(stringi)
library(stringr)
# this function gets name from emojis one at a time
emoji_json_file <- "https://raw.githubusercontent.com/ToadHanks/emojisLib_json/master/emojis.json"
json_data <- rjson::fromJSON(paste(readLines(emoji_json_file), collapse = "")) # read the file line by line and join it into one string
# gets the name, i.e. get_name_from_emoji("😋") output should be "yum"
get_name_from_emoji <- function(emoji_unicode, emoji_data = json_data) {
  emoji_evaluated <- stringi::stri_unescape_unicode(emoji_unicode)
  # named vector: the names are emoji names, the values are emoji characters
  vector_of_emoji_names_and_characters <- unlist(
    lapply(emoji_data, function(x) {
      x$char
    })
  )
  name_of_emoji <- attr(
    which(vector_of_emoji_names_and_characters == emoji_evaluated)[1],
    "names"
  )
  return(name_of_emoji)
}
# Fill an empty vector with names
# The emojis are written as \U escapes; the last two are the bare dark (U+1F3FF) and light (U+1F3FB) skin tone modifiers, which may not render on their own
emoji_pouch_copy <- c("\U0001F92B", "\U0001F447\U0001F3FE", "\U0001F449\U0001F3FF", "\U0001F595\U0001F3FB", "\U0001F3FF", "\U0001F3FB")
emoji_keywords_pouch <- c()
for (i in seq_along(emoji_pouch_copy)) {
  emoji_keywords_pouch <- c(emoji_keywords_pouch, get_name_from_emoji(emoji_pouch_copy[i]))
}
emoji_keywords_pouch # output: "shushing","point_down_fairly_dark","point_right_dark","fu_light","dark_skin_tone","light_skin_tone"
# Function to replace a skin tone pattern with the placeholder "000"
remove_all_skins <- function(string, pattern) {
  str_replace_all(string, pattern, "000")
}
# remove these and their native renders at their positions
skin_tones <- c("medium_skin_tone", "fairly_dark_skin_tone", "dark_skin_tone", "fairly_light_skin_tone", "light_skin_tone", "_light","_dark","_medium","_fairly")
emoji_keywords_pouch <- remove_all_skins(emoji_keywords_pouch, skin_tones[1])
emoji_keywords_pouch <- remove_all_skins(emoji_keywords_pouch, skin_tones[2])
emoji_keywords_pouch <- remove_all_skins(emoji_keywords_pouch, skin_tones[3])
emoji_keywords_pouch <- remove_all_skins(emoji_keywords_pouch, skin_tones[4])
emoji_keywords_pouch <- remove_all_skins(emoji_keywords_pouch, skin_tones[5])
emoji_keywords_pouch <- emoji_keywords_pouch[emoji_keywords_pouch != "000"] # drop the entries that were bare skin tones
#It has to be this order, otherwise good strings will go bad in the variable containing keywords
emoji_keywords_pouch <- stringr::str_remove_all(emoji_keywords_pouch, skin_tones[6])
emoji_keywords_pouch <- stringr::str_remove_all(emoji_keywords_pouch, skin_tones[7])
emoji_keywords_pouch <- stringr::str_remove_all(emoji_keywords_pouch, skin_tones[8])
emoji_keywords_pouch <- stringr::str_remove_all(emoji_keywords_pouch, skin_tones[9])
# Reverse of get_name_from_emoji: get_emoji_from_name, used to rebuild the emoji_pouch
# i.e. get_emoji_from_name("yum") output should be "😋"
get_emoji_from_name <- function(emoji_name, emoji_data = json_data) {
  vector_of_emoji_names_and_characters <- unlist(
    lapply(emoji_data, function(x) {
      x$char
    })
  )
  emoji_character <- unname(
    vector_of_emoji_names_and_characters[
      names(vector_of_emoji_names_and_characters) == emoji_name
    ]
  )
  return(emoji_character)
}
# reset the original emoji_pouch_copy so it holds the standard tones
emoji_pouch_copy <- c()
for (i in seq_along(emoji_keywords_pouch)) {
  # Sys.sleep(1)
  emoji_pouch_copy <- c(emoji_pouch_copy, get_emoji_from_name(emoji_keywords_pouch[i]))
}
# The bare skin tones are removed entirely, because there are no standard skin tones
emoji_pouch_copy # output: "🤫" "👇" "👉" "🖕"
#Finished
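In short, I go from the emojis to their names, clean the names by removing the skin tone parts, and then convert them back to emoji form. I have nearly 1000 emojis, and the for loops cause a delay of about 5 seconds. Is there a package that can do this better than I can?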
Solution
I'm not entirely sure I understood your question, but you can get rid of the different colors like this.
Start with the data:
library(rjson)
# load the emoji data from the JSON file
emoji_json_file <- "https://raw.githubusercontent.com/ToadHanks/emojisLib_json/master/emojis.json"
json_data <- rjson::fromJSON(paste(readLines(emoji_json_file), collapse = "")) # read the file line by line and join it into one string
Extract only the emojis:
emojis <- sapply(json_data, function(x) x$char)
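Once the emojis are in a named character vector like this, the one-at-a-time lookups from the question can be replaced by a single vectorized `match()`. The following is a minimal sketch rather than the original code: `emoji_lookup` is a small hand-made stand-in for `emojis` (so the block runs without downloading the JSON file), and the helper names `names_from_emojis` and `emojis_from_names` are hypothetical.

```r
# Hypothetical stand-in for the named vector built from the JSON file.
emoji_lookup <- c(
  yum                = "\U0001F60B",
  raised_hands       = "\U0001F64C",
  raised_hands_light = "\U0001F64C\U0001F3FB"
)

# Emoji -> name for a whole vector at once; unknown emojis give NA.
names_from_emojis <- function(x, lookup = emoji_lookup) {
  names(lookup)[match(x, lookup)]
}

# Name -> emoji for a whole vector at once; unknown names give NA.
emojis_from_names <- function(nm, lookup = emoji_lookup) {
  unname(lookup[match(nm, names(lookup))])
}

# Round trip: look up the names, strip the "_light" suffix, look up the emojis.
found <- names_from_emojis(c("\U0001F64C\U0001F3FB", "\U0001F60B"))
back  <- emojis_from_names(sub("_light$", "", found))
```

Because `match()` works on the whole vector, the lookup table is built once instead of once per emoji, which is where most of the time in the loop version is likely to go.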
The way these colors work is that two Unicode characters are stuck together. For example:
emojis[114]
#> raised_hands_light
#> "<U+0001F64C><U+0001F3FB>"
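To see the two code points directly, base R's `utf8ToInt()` can be used. This is a small illustrative sketch; the variable names are made up for the example.

```r
# A raised-hands emoji followed by the light skin tone modifier.
toned <- "\U0001F64C\U0001F3FB"

# utf8ToInt() returns one integer per Unicode code point.
code_points <- utf8ToInt(toned)

# Format the code points in the familiar U+XXXX notation.
labels <- sprintf("U+%04X", code_points)
print(labels)

# The five skin tone modifiers occupy U+1F3FB through U+1F3FF.
is_modifier <- code_points >= 0x1F3FB & code_points <= 0x1F3FF
print(is_modifier)
```

The second code point falls in the modifier range, which is what marks the emoji as toned.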
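We can split these apart with strsplit(emojis, ""). This yields a list in which each element is a vector of length 1 if the emoji is not colored, or of length 2 or more if the emoji is colored or otherwise modified (for example, male/female variants). We keep only the first element of each vector in the list: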
emojis_clean <- sapply(strsplit(emojis, ""), "[[", 1)
Now emoji 114 looks like this:
emojis_clean[114]
#> raised_hands_light
#> "<U+0001F64C>"
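An alternative that avoids splitting altogether is to delete only the skin tone modifiers (U+1F3FB to U+1F3FF) with a regular expression. The sketch below uses `stringi::stri_replace_all_regex()` from the stringi package that the question already loads. The function name `strip_skin_tones` is hypothetical, and this is a different technique from the `strsplit()` approach above, not a description of it.

```r
library(stringi)

# Remove the skin tone modifiers and leave every other character untouched.
strip_skin_tones <- function(x) {
  stri_replace_all_regex(x, "[\\x{1F3FB}-\\x{1F3FF}]", "")
}

x <- c(
  "\U0001F64C\U0001F3FB",  # raised hands with the light skin tone modifier
  "\U0001F92B",            # shushing face, no modifier
  "\U0001F1E9\U0001F1EA"   # a flag built from two regional indicator symbols
)
cleaned <- strip_skin_tones(x)
```

Since only the modifier code points are removed, multi-character emojis such as flags should pass through unchanged. A string that consists of a bare modifier becomes an empty string, so such entries may need to be filtered out afterwards.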
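Extra: the flag problem
The approach above is fast but dumb. It cannot tell when a combined emoji is legitimately combined. For example, flags are made up of two Unicode characters. There may be other examples.
We can restore these from the original vector by looking for some keywords in the names of the emoji vector: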
# Look for flags
flags <- grep("flag", names(emojis))
# replace flags with original values
emojis_clean[flags] <- emojis[flags]
The same approach can be used for other types of emojis.
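Instead of relying on names, country flags can also be recognised by their code points: they are pairs of regional indicator symbols in the range U+1F1E6 to U+1F1FF. Below is a rough sketch, assuming that only this kind of flag needs protecting (some flags, such as the rainbow flag, are built differently and are not covered). `is_regional_flag` is a hypothetical helper.

```r
# TRUE for strings made up entirely of regional indicator symbols.
is_regional_flag <- function(x) {
  vapply(x, function(s) {
    cp <- utf8ToInt(s)
    length(cp) > 0 && all(cp >= 0x1F1E6 & cp <= 0x1F1FF)
  }, logical(1), USE.NAMES = FALSE)
}

x <- c(
  "\U0001F1E9\U0001F1EA",  # flag: two regional indicator symbols
  "\U0001F64C\U0001F3FB"   # raised hands with the light skin tone modifier
)
flag_idx <- is_regional_flag(x)

# Keep flags whole and take the first code point of everything else.
first_chars <- vapply(strsplit(x, ""), `[[`, character(1), 1)
x_clean <- ifelse(flag_idx, x, first_chars)
```

This avoids depending on how the emojis happen to be named in the JSON file, at the cost of handling only one class of combined emoji.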