R: Parsing a text file of quotations / splitting into paragraphs

Problem description

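I am looking for an R solution for parsing a text file of quotations (like the one shown below) that produces a data.frame with one observation per quotation and the variables text and source, as described below.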

DIAGRAMS are of great utility for illustrating certain questions of vital statistics by
conveying ideas on the subject through the eye, which cannot be so readily grasped when
contained in figures.
--- Florence Nightingale, Mortality of the British Army, 1857

To give insight to statistical information it occurred to me, that making an
appeal to the eye when proportion and magnitude are concerned, is the best and
readiest method of conveying a distinct idea. 
--- William Playfair, The Statistical Breviary (1801), p. 2


Regarding numbers and proportions, the best way to catch the imagination is to speak to the eyes.
--- William Playfair, Elemens de statistique, Paris, 1802, p. XX.

The aim of my carte figurative is to convey promptly to the eye the relation not given quickly by numbers requiring mental calculation.
--- Charles Joseph Minard

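Here, each quotation is a paragraph, separated from the next by "\n\n". Within a paragraph, all lines up to the one starting with --- make up the text, and whatever follows the --- is the source.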

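I think I could solve this if I could first split the text lines into paragraphs (separated by '\\n\\n+', that is, two or more consecutive newlines), but I am having trouble doing that.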

Tags: r, parsing, paragraph, quotations

Solution


Assuming the initial text has been loaded into the variable rawText as a single string:

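library(stringr)  # also provides the %>% pipe

# Split into paragraphs on runs of two or more newlines, then split each
# paragraph into text and source at the line that starts with "--- ".
strsplit(str_trim(rawText), "\n\n+")[[1]] %>%
  str_split_fixed("\n--- ", 2) %>%
  as.data.frame() %>%
  setNames(c("text", "source"))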
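
The pipeline above works on one long string. If the file was read with readLines(), which returns one element per line as the question describes, a base R sketch along the following lines may be easier to adapt. It is an alternative technique, not the answer's method: blank lines are counted with cumsum() to give every line a paragraph id, and each paragraph is then divided into text and source at the lines starting with ---. The function name parse_quotes and the sample vector lines are made up for illustration. The sample stands in for the result of readLines() on the quotations file.

```r
# Hypothetical sample standing in for readLines() output on the quotations file.
lines <- c(
  "DIAGRAMS are of great utility for illustrating certain questions of vital statistics by",
  "conveying ideas on the subject through the eye, which cannot be so readily grasped when",
  "contained in figures.",
  "--- Florence Nightingale, Mortality of the British Army, 1857",
  "",
  "",
  "The aim of my carte figurative is to convey promptly to the eye the relation not given quickly by numbers requiring mental calculation.",
  "--- Charles Joseph Minard"
)

# parse_quotes is an illustrative helper, not part of any package.
parse_quotes <- function(lines) {
  lines <- trimws(lines)
  blank <- lines == ""
  # The running count of blank lines is constant within a paragraph,
  # so it serves as a paragraph id for the non-blank lines.
  para <- cumsum(blank)[!blank]
  chunks <- split(lines[!blank], para)
  rows <- lapply(chunks, function(p) {
    is_src <- grepl("^---", p)
    data.frame(
      # Join the wrapped text lines with single spaces.
      text = paste(p[!is_src], collapse = " "),
      # Drop the leading dashes and following whitespace from the source line.
      source = paste(sub("^---\\s*", "", p[is_src]), collapse = " "),
      stringsAsFactors = FALSE
    )
  })
  quotes <- do.call(rbind, rows)
  rownames(quotes) <- NULL
  quotes
}

quotes <- parse_quotes(lines)
```

Because paragraphs are identified by counting blank lines, it does not matter whether one or several blank lines separate two quotations. A quotation without a --- line ends up with an empty string as its source.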
