How to conditionally merge 2 or more lines of text into 1 line in R

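Problem description

I want to read the Sent file of Mozilla Thunderbird into R. Sometimes 2 or more lines have to be merged into 1 line. These are the lines that end with a comma (","), for example: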

From: Frans  <frans@zeenit.nl>
Subject: volledig overzicht beschikbaar
To: aldjan@gmail.com, clen@zeenit.nl, pinge1@zeenit.nl,
 griepje@zeenit.nl, Jowialj@live.com, pelicaan@hotmail.com,
 pico11@zeenit.nl
Date: Mon, 21 Mar 2016 14:17:09 +0100

After merging:

From: Frans  <frans@zeenit.nl>
Subject: volledig overzicht beschikbaar
To: aldjan@gmail.com, clen@zeenit.nl, pinge1@zeenit.nl, griepje@zeenit.nl, Jowialj@live.com, pelicaan@hotmail.com,  pico11@zeenit.nl
Message-ID: <56EFF455.5000006@zeenit.nl>
Date: Mon, 21 Mar 2016 14:17:09 +0100

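Tags: r, regex

Solution

I would not rely on the comma alone. With grep you can identify the line that carries the To: tag and paste it together with every line that follows it, up to (but not including) the line with the next tag, Message-ID: or Date: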

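cleanHeader <- function(header) {
  # index of the line that starts with "To"
  line.to <- grep("^To", header)
  # index of the first Date: or Message-ID: line, where the To: field ends
  line.next <- grep("^Date|^Mess", header)[1]
  # join the To: line and its continuation lines into a single string
  new.to <- paste(header[line.to:(line.next - 1)], collapse="")
  # rebuild the header: lines before To:, the merged To: line, the rest
  c(header[1:(line.to - 1)], new.to, header[line.next:length(header)])
}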

Result

cleanHeader(header1)
[1] "From: Frans  <frans@zeenit.nl>"
[2] "Subject: volledig overzicht beschikbaar"
[3] "To: aldjan@gmail.com, clen@zeenit.nl, pinge1@zeenit.nl, griepje@zeenit.nl, Jowialj@live.com, pelicaan@hotmail.com, pico11@zeenit.nl"
[4] "Date: Mon, 21 Mar 2016 14:17:09 +0100"

cleanHeader(header2)
[1] "From: Frans  <frans@zeenit.nl>"
[2] "Subject: volledig overzicht beschikbaar"
[3] "To: aldjan@gmail.com, clen@zeenit.nl, pinge1@zeenit.nl, griepje@zeenit.nl, Jowialj@live.com, pelicaan@hotmail.com,  pico11@zeenit.nl"
[4] "Message-ID: <56EFF455.5000006@zeenit.nl>"
[5] "Date: Mon, 21 Mar 2016 14:17:09 +0100"

Data:

tmp <- tempfile()

cat("From: Frans  <frans@zeenit.nl>
Subject: volledig overzicht beschikbaar
To: aldjan@gmail.com, clen@zeenit.nl, pinge1@zeenit.nl,
 griepje@zeenit.nl, Jowialj@live.com, pelicaan@hotmail.com,
 pico11@zeenit.nl
Date: Mon, 21 Mar 2016 14:17:09 +0100", file=tmp, sep="\n")

header1 <- readLines(tmp)

cat("From: Frans  <frans@zeenit.nl>
Subject: volledig overzicht beschikbaar
To: aldjan@gmail.com, clen@zeenit.nl, pinge1@zeenit.nl, griepje@zeenit.nl, Jowialj@live.com, pelicaan@hotmail.com,  pico11@zeenit.nl
Message-ID: <56EFF455.5000006@zeenit.nl>
Date: Mon, 21 Mar 2016 14:17:09 +0100", file=tmp, sep="\n")

header2 <- readLines(tmp)
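
A possible generalisation, offered only as a sketch: instead of looking for specific tags such as To: or Date:, one could treat every line that begins with a space or tab as a continuation of the line before it. This assumes the Sent file folds long header fields that way, as the To: lines in the example do. The name unfoldHeader is hypothetical and not part of the answer above.

```r
# Hypothetical helper (not from the original answer): merge every
# continuation line into the header line it belongs to.
# Assumption: a continuation line starts with a space or a tab.
unfoldHeader <- function(header) {
  if (length(header) == 0) return(character(0))
  # TRUE for lines that begin a new header field
  is.start <- !grepl("^[ \t]", header)
  # the first line always opens a group, even if it is indented
  is.start[1] <- TRUE
  # running group id: goes up by one at every field start
  group <- cumsum(is.start)
  # paste the lines of each group together, keeping the original order
  unname(vapply(split(header, group), paste, character(1), collapse = ""))
}

# Same content as header1 above, written out so the sketch runs on its own
header <- c(
  "From: Frans  <frans@zeenit.nl>",
  "Subject: volledig overzicht beschikbaar",
  "To: aldjan@gmail.com, clen@zeenit.nl, pinge1@zeenit.nl,",
  " griepje@zeenit.nl, Jowialj@live.com, pelicaan@hotmail.com,",
  " pico11@zeenit.nl",
  "Date: Mon, 21 Mar 2016 14:17:09 +0100"
)

unfolded <- unfoldHeader(header)
print(unfolded)
```

Unlike cleanHeader, this sketch does not need to know which tag follows To:, and it would also merge a folded Cc: or Subject: field. It would, however, wrongly merge indented lines in the message body, so it should only be applied to the header part of each message.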
