How to copy text from a PDF file to a text file without preserving the original formatting

Problem description

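I have a PDF file that I want to extract text from. However, I do not want to keep the PDF's spacing. I want the text to look as if I had manually copied and pasted the lines out of the PDF. That would strip the tabs and extra spacing that look nice but only complicate my text file.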

For example, if I extract the text with R in the usual way, I get a layout like this:

                             This is the title
                             of this document
1.0 Hello my name is John and blah balh blah blah blah.
        1.1 blah blah blah blah

If I simply copy and paste by hand, I get something like this:

This is the title of this document
1.0 Hello my name is John and blah balh blah blah blah.
1.1 blah blah blah blah blah

I would like to know whether there is a way to do this with code in R instead of copying and pasting manually.

A real example is this PDF: https://www.researchgate.net/profile/James_Hamilton11/publication/24108242_Oil_and_the_Macroeconomy_since_World_War_II/links/0c9605252c0916e709000000.pdf

If I manually copy and paste part of page 228 (the third page of the PDF), I get:

Oil and the Macroeconomy since World War 11
James D. Hamilton
University (f/' Virgiiwa
All but one of the U.S. recessions since World War II have been
preceded, typically with a lag of around three-fourths of a year, by a
dramatic increase in the price of crude petroleum. This does not
mean that oil shocks caused these recessions. Evidence is presented,
however, that even over the period 1948-72 this correlation is statistically
significant and nonspurlious, supporting the proposition that
oil shocks were a contributing factor in at least some of the U.S.
recessions prior to 1972. By extension, energy price increases may
account for much of post-OPEC macroeconomic performance.
I. Introduction
The poor performance of the U.S. economy since 1973 is well documented:

1. The rate of growth of real GNP has fallen from an average of
4.0 percent during 1960-72 to 2.4 percent for 1973-81.
2. The 7.6 percent average inflation rate during 1973-81 was
more than double the 3.1 percent realized for 1960-72.
3. The average unemployment rate over 1973-81 of 6.7 percent
was higher than in any year between 1948 and 1972 with the single
exception of the recession of 1958.
This paper is drawn from chap. 2 of my Ph.D. dissertation at the University of
California, Berkeley. Earlier versions of this paper were presented at the NBER/NSF

This is completely different from the formatting in the PDF.

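Bounty: The example I posted was wrong. The output above is what I get when I copy and paste from the PDF in Google Chrome. If I copy and paste from Microsoft Edge, I get the following: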

Oil and the Macroeconomy since World War 11 
James D. Hamilton 
University (f/' Virgiiwa 
All but one of the U.S. recessions since World War II have been preceded, typically with a lag of around three-fourths of a year, by a dramatic increase in the price of crude petroleum. This does not mean that oil shocks caused these recessions. Evidence is presented, however, that even over the period 1948-72 this correlation is statis- tically significant and nonspurlious, supporting the proposition that oil shocks were a contributing factor in at least some of the U.S. recessions prior to 1972. By extension, energy price increases may account for much of post-OPEC macroeconomic performance. 
I. Introduction 
The poor performance of the U.S. economy since 1973 is well docu- mented: 1. The rate of growth of real GNP has fallen from an average of 4.0 percent during 1960-72 to 2.4 percent for 1973-81. 2. The 7.6 percent average inflation rate during 1973-81 was more than double the 3.1 percent realized for 1960-72. 3. The average unemployment rate over 1973-81 of 6.7 percent was higher than in any year between 1948 and 1972 with the single exception of the recession of 1958. 
This paper is drawn from chap. 2 of my Ph.D. dissertation at the University of California, Berkeley. Earlier versions of this paper were presented at the NBER/NSF 

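Sorry for the mistake. The earlier answer works for the question as I asked it at the time, but this is the kind of output I am trying to get.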

Tags: r

Solution


As far as I can tell, the only difference is the whitespace at the start of each line. You can remove it in R with gsub. For example:

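library(pdftools)
# URL of the PDF from the question
doc <- "https://www.researchgate.net/profile/James_Hamilton11/publication/24108242_Oil_and_the_Macroeconomy_since_World_War_II/links/0c9605252c0916e709000000.pdf"
# pdf_text() returns one string per page; take the third page
text <- pdf_text(doc)[[3]]
# Drop the spaces at the start of the string and after every newline
text_no_ws <- gsub("(^|\n) +", "\\1", text)
cat(text_no_ws)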
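
The idea can be tried without downloading anything, on a short string modelled on the toy example in the question. The sketch below is only an illustration: `strip_leading_ws` and `unwrap_paragraphs` are hypothetical helper names, not part of pdftools. The second helper goes one step toward the Edge-style output by joining wrapped lines, and it assumes that paragraphs are separated by a blank line, which the text returned for a real PDF may not satisfy.

```r
# Hypothetical helpers for illustration; neither is part of pdftools.

# Remove spaces and tabs at the start of the string and after every newline.
strip_leading_ws <- function(x) {
  gsub("(^|\n)[ \t]+", "\\1", x)
}

# Join hard-wrapped lines so that each paragraph becomes a single line.
# Assumption: a blank line marks a paragraph break.
unwrap_paragraphs <- function(x) {
  paragraphs <- strsplit(x, "\n[ \t]*\n+")[[1]]          # split on blank lines
  paragraphs <- gsub("[ \t]*\n[ \t]*", " ", paragraphs)  # line break -> space
  paste(trimws(paragraphs), collapse = "\n")
}

# A small stand-in for one page of pdf_text() output.
sample_page <- paste0(
  "        This is the title\n",
  "        of this document\n",
  "\n",
  "1.0 Hello my name is John.\n",
  "        1.1 blah blah"
)

stripped <- strip_leading_ws(sample_page)
cat(stripped, "\n")
# This is the title
# of this document
#
# 1.0 Hello my name is John.
# 1.1 blah blah

cat(unwrap_paragraphs(stripped), "\n")
# This is the title of this document
# 1.0 Hello my name is John. 1.1 blah blah
```

Joining on blank lines merges the numbered items into the preceding paragraph, much as the Edge paste does. Keeping each numbered item on its own line would need an extra rule, for example treating a line that starts with a number as the start of a new paragraph.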
