Replacing tabs and line breaks in R

Question

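I'm cleaning a large text file to read into R. Almost every line is tab-separated, but some long quotes also contain line breaks. I'm using the tabs to split the document into a data frame with a speaker column and a comment column. The line breaks ruin that layout: R treats each line as a new speaker, then reports the speaker as NA when it can't find a tab. Here's an example of what I have: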

Interviewer: How are you?

Subject: I'm just incredibly frustrated.
*NA* Really, R is frustrating me.
*NA* But maybe someone has a solution for me?

Interviewer: Fortunately, I have an answer for you.

This is what I want:

Interviewer: How are you?

Subject: I'm just incredibly frustrated. Really, R is frustrating me. But maybe someone has a solution for me?

Interviewer: Fortunately, I have an answer for you.

I'm reading the document in like this:

atas <- stri_read_lines("ATAS2.txt") %>% str_replace_all("\t", "TABS_TO_BE_DELETED")

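(FYI, I use that placeholder string because R kept erasing the tabs when I turned the text document into a data frame.)

Now, to remove the line breaks, I've tried: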

atas2 <- gsub("\r?\n|\r", " ", atas) 

atas2 <- str_replace_all(atas, "\n" , " ")

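I also can't just strip all special characters or formatting to fix this. If I did have to delete all non-alphanumeric characters, I would still need to keep the tabs (at least long enough to put some obscure string in their place to split on later), as well as ? , . [ ] ( ) :

I'd like R to ignore those line breaks or somehow merge the lines together. The one caveat to simply merging every line that doesn't match is that some lines stand on their own with no speaker, and those need to have no attribution in the speaker column, for example (but not limited to):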

(Laughter)

Interview 41

[Inaudible cross-talk]

Thanks for any help you can provide!

Tags: r, string, text, line-breaks, stringr

Answer


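If the output is as shown by Andrew Gustar, you can do the following:

# `text` is assumed to hold the whole file as a single string
read.csv(text = gsub("\\n(?!\\w+:)", "", text, perl = TRUE), sep = ":", header = FALSE)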
           V1                                                                                                     V2
1 Interviewer                                                                                           How are you?
2     Subject  I'm just incredibly frustrated. Really, R is frustrating me. But maybe someone has a solution for me?
3 Interviewer                                                                 Fortunately, I have an answer for you.
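
The call above works on one long string with colon-separated speakers. Since the question describes a vector of tab-separated lines, including standalone lines such as (Laughter), here is a minimal sketch of the same merging idea applied line by line. It rests on assumptions not stated in the question: a line containing a tab starts a new speaker turn, and a line without a tab is a continuation unless it matches one of a few standalone patterns. The helper `is_standalone` and its patterns are hypothetical, guessed from the three examples given, and the sample `lines` vector is made up from the question's example.

```r
# Sample input: one element per line, speaker and comment separated by a tab.
lines <- c(
  "Interviewer\tHow are you?",
  "",
  "Subject\tI'm just incredibly frustrated.",
  "Really, R is frustrating me.",
  "But maybe someone has a solution for me?",
  "(Laughter)",
  "Interviewer\tFortunately, I have an answer for you."
)

# Hypothetical helper: guesses which tab-less lines stand on their own.
# The patterns are assumptions based on "(Laughter)", "Interview 41"
# and "[Inaudible cross-talk]".
is_standalone <- function(x) {
  grepl("^\\(.*\\)$|^\\[.*\\]$|^Interview [0-9]+$", x)
}

# Drop blank lines, then mark every line that begins a new row.
lines <- lines[nzchar(trimws(lines))]
starts_row <- grepl("\t", lines, fixed = TRUE) | is_standalone(lines)

# Continuation lines share the group number of the row-starting line above.
group <- cumsum(starts_row)
merged <- vapply(split(lines, group), paste, character(1),
                 collapse = " ", USE.NAMES = FALSE)

# Split each merged line at its first tab; tab-less rows get an NA speaker.
has_speaker <- grepl("\t", merged, fixed = TRUE)
speaker <- ifelse(has_speaker, sub("\t.*$", "", merged), NA_character_)
comment <- sub("^[^\t]*\t", "", merged)
result <- data.frame(speaker, comment, stringsAsFactors = FALSE)
```

With real data the standalone patterns would need adjusting: a continuation line that happens to be wrapped in parentheses would be treated as its own row.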
