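r - Convert a conversation tibble to .txt, then back again
Problem description
I want to take a tibble that represents a conversation and convert it into a .txt file that can be edited by hand in a text editor, then read it back into a tibble for processing.
The main challenge is separating the blocks of text in a way that lets them be re-imported in a similar format after editing, while preserving the "speaker" designation.
Speed matters, because both the number of files and the length of each text segment are large.
Here is the input tibble: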
df <- tibble::tribble(
~word, ~speakerTag,
"been", 1L,
"going", 1L,
"on", 1L,
"and", 1L,
"what", 1L,
"your", 1L,
"goals", 1L,
"are.", 1L,
"Yeah,", 2L,
"so", 2L,
"so", 2L,
"John", 2L,
"has", 2L,
"15", 2L
)
This is the desired output in the .txt file:
###Speaker 1###
been going on and what your goals are.
###Speaker 2###
Yeah, so so John has 15
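This is the desired result after reading the file back in, once errors have been corrected by hand:
tibble::tribble(
~word, ~speakerTag,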
"been", 1L,
"going", 1L,
"on", 1L,
"and", 1L,
"what", 1L,
"your", 1L,
"goals", 1L,
"in", 1L,
"r", 1L,
"Yeah,", 2L,
"so", 2L,
"so", 2L,
"John", 2L,
"hates", 2L,
"50", 2L
)
Solution
One approach is to add the speaker name, surrounded by "\n", at the start of each run of speakerTag:
library(data.table)
library(dplyr)
library(tidyr)
# Prepend a "Speaker<tag>" header to the first word of each consecutive run of a speaker
setDT(df)[, word := replace(word, 1, paste0("\n\nSpeaker",
    first(speakerTag), '\n\n', first(word))), rleid(speakerTag)]
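To illustrate why the grouping is done by rleid(speakerTag) rather than by speakerTag itself, here is a small sketch. The tag vector is made up for illustration: rleid() numbers each consecutive run, so a speaker who comes back later in the conversation starts a new group and gets a fresh header.

```r
library(data.table)

# Hypothetical speaker tags: speaker 1 talks, then speaker 2, then speaker 1 again
tags <- c(1L, 1L, 2L, 2L, 2L, 1L)

# rleid() assigns a new id every time the value changes
run_id <- rleid(tags)
print(run_id)
# 1 1 2 2 2 3
```

Grouping by speakerTag alone would merge the first and last runs of speaker 1 into a single block, which would lose the order of the conversation.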
We can write it to a text file using
writeLines(paste(df$word, collapse = " "), 'Downloads/temp.txt')
It looks like this:
cat(paste(df$word, collapse = " "))
#Speaker1
#
#been going on and what your goals are.
#
#Speaker2
#
#Yeah, so so John has 15
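The layout above differs slightly from the `###Speaker N###` headers requested in the question. The following is a minimal base-R sketch of a writer for that exact layout, not part of the original answer. The helper name conversation_to_lines and the temporary output path are assumptions made for illustration.

```r
# Hypothetical helper: turn word/speakerTag vectors into header and text lines
conversation_to_lines <- function(word, speakerTag) {
  runs <- rle(speakerTag)                              # consecutive runs of one speaker
  run_id <- rep(seq_along(runs$values), runs$lengths)  # run index for every word
  # One text line per run: the words of that run joined by single spaces
  text <- vapply(split(word, run_id), paste, character(1), collapse = " ")
  headers <- paste0("###Speaker ", runs$values, "###")
  # Interleave header, text, header, text, ...
  as.vector(rbind(headers, unname(text)))
}

txt_lines <- conversation_to_lines(
  word = c("been", "going", "on", "Yeah,", "so"),
  speakerTag = c(1L, 1L, 1L, 2L, 2L)
)
print(txt_lines)

# Write to a temporary file (path chosen only for this example)
writeLines(txt_lines, file.path(tempdir(), "temp.txt"))
```

Because each run ends up on a single line, the file stays easy to scan in a text editor, and everything is done with vectorized base functions rather than a row-by-row loop.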
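To read it back into R, we can do:
# quote = "" and comment.char = "" keep apostrophes and "#" in the text from being misread
read.table('Downloads/temp.txt', sep = "\t", col.names = 'word',
           quote = "", comment.char = "", stringsAsFactors = FALSE) %>%
  mutate(SpeakerTag = replace(word, c(FALSE, TRUE), NA)) %>%  # blank out text rows, keep headers
  fill(SpeakerTag) %>%                                         # carry each header down to its text row
  slice(seq(2, n(), 2)) %>%                                    # keep only the text rows
  separate_rows(word, sep = "\\s") %>%                         # one word per row
  filter(word != '')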
# word SpeakerTag
#1 been Speaker1
#2 going Speaker1
#3 on Speaker1
#4 and Speaker1
#5 what Speaker1
#6 your Speaker1
#7 goals Speaker1
#8 are. Speaker1
#9 Yeah, Speaker2
#10 so Speaker2
#11 so Speaker2
#12 John Speaker2
#13 has Speaker2
#14 15 Speaker2
Obviously, we can remove the "Speaker" part of the SpeakerTag column if it is not needed.
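As a sketch of reading the `###Speaker N###` layout requested in the question straight back into integer tags (so no "Speaker" prefix needs removing afterward), something like the following could work. The helper name lines_to_conversation is hypothetical, and the sketch assumes the file starts with a header line and that headers were left intact during editing.

```r
# Hypothetical helper: parse header and text lines back into word/speakerTag rows
lines_to_conversation <- function(txt_lines) {
  txt_lines <- trimws(txt_lines)
  txt_lines <- txt_lines[nzchar(txt_lines)]            # drop blank lines
  is_header <- grepl("^###Speaker [0-9]+###$", txt_lines)
  stopifnot(is_header[1])                              # assumption: a header comes first
  # Integer tag of every header, then the tag of the latest header above each line
  tags <- as.integer(sub("^###Speaker ([0-9]+)###$", "\\1", txt_lines[is_header]))
  line_tag <- tags[cumsum(is_header)]
  # Split every text line into words on runs of whitespace
  words <- strsplit(txt_lines[!is_header], "\\s+")
  data.frame(
    word = unlist(words),
    speakerTag = rep(line_tag[!is_header], lengths(words)),
    stringsAsFactors = FALSE
  )
}

# Text as it might look after manual editing
edited <- c("###Speaker 1###", "been going on in r", "",
            "###Speaker 2###", "Yeah, so so John hates 50")
res <- lines_to_conversation(edited)
print(res)
```

For a real file, the input would come from readLines() on the edited .txt. A text block that was wrapped over several lines during editing is still assigned to the header above it.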