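r - How to copy text from a PDF file to a text file without keeping the original formatting
Problem description
I have a PDF file that I want to extract text from. However, I don't want to keep the PDF's spacing. I want the text to look as if I had manually copied and pasted the lines from the PDF. That removes some of the attractive but unnecessary tabs and spacing from my text file.
For example, if I extract the text normally with R, I get a format like this: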
This is the title
of this document
1.0 Hello my name is John and blah balh blah blah blah.
1.1 blah blah blah blah
If I just copy and paste manually, I get something like this:
This is the title of this document
1.0 Hello my name is John and blah balh blah blah blah.
1.1 blah blah blah blah blah
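I would like to know whether there is a way to do this with code in R rather than by copying and pasting manually.
If I manually copy and paste part of page 228 (page 3 of the PDF), I get: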
Oil and the Macroeconomy since World War 11
James D. Hamilton
University (f/' Virgiiwa
All but one of the U.S. recessions since World War II have been
preceded, typically with a lag of around three-fourths of a year, by a
dramatic increase in the price of crude petroleum. This does not
mean that oil shocks caused these recessions. Evidence is presented,
however, that even over the period 1948-72 this correlation is statistically
significant and nonspurlious, supporting the proposition that
oil shocks were a contributing factor in at least some of the U.S.
recessions prior to 1972. By extension, energy price increases may
account for much of post-OPEC macroeconomic performance.
I. Introduction
The poor performance of the U.S. economy since 1973 is well documented:
1. The rate of growth of real GNP has fallen from an average of
4.0 percent during 1960-72 to 2.4 percent for 1973-81.
2. The 7.6 percent average inflation rate during 1973-81 was
more than double the 3.1 percent realized for 1960-72.
3. The average unemployment rate over 1973-81 of 6.7 percent
was higher than in any year between 1948 and 1972 with the single
exception of the recession of 1958.
This paper is drawn from chap. 2 of my Ph.D. dissertation at the University of
California, Berkeley. Earlier versions of this paper were presented at the NBER/NSF
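This is formatted quite differently from the PDF.
Bounty: The example I posted was wrong. The output above is what I get when I copy and paste from the PDF document in Google Chrome. If I copy and paste from Microsoft Edge, I get the following: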
Oil and the Macroeconomy since World War 11
James D. Hamilton
University (f/' Virgiiwa
All but one of the U.S. recessions since World War II have been preceded, typically with a lag of around three-fourths of a year, by a dramatic increase in the price of crude petroleum. This does not mean that oil shocks caused these recessions. Evidence is presented, however, that even over the period 1948-72 this correlation is statis- tically significant and nonspurlious, supporting the proposition that oil shocks were a contributing factor in at least some of the U.S. recessions prior to 1972. By extension, energy price increases may account for much of post-OPEC macroeconomic performance.
I. Introduction
The poor performance of the U.S. economy since 1973 is well docu- mented: 1. The rate of growth of real GNP has fallen from an average of 4.0 percent during 1960-72 to 2.4 percent for 1973-81. 2. The 7.6 percent average inflation rate during 1973-81 was more than double the 3.1 percent realized for 1960-72. 3. The average unemployment rate over 1973-81 of 6.7 percent was higher than in any year between 1948 and 1972 with the single exception of the recession of 1958.
This paper is drawn from chap. 2 of my Ph.D. dissertation at the University of California, Berkeley. Earlier versions of this paper were presented at the NBER/NSF
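Sorry for the mistake. The earlier answer worked for the question as I originally asked it, but this is the kind of output I am trying to get.
Solution
As far as I can tell, the only difference is whether there is whitespace at the start of each line. You can remove it in R with gsub. For example: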
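library(pdftools)
doc <- "https://www.researchgate.net/profile/James_Hamilton11/publication/24108242_Oil_and_the_Macroeconomy_since_World_War_II/links/0c9605252c0916e709000000.pdf"
# pdf_text() returns one string per page; take page 3
text <- pdf_text(doc)[[3]]
# (?m) makes ^ match at the start of every line, so leading spaces are
# removed on each line, including the first, and line breaks are kept
text_no_ws <- gsub("(?m)^ +", "", text, perl = TRUE)
cat(text_no_ws)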
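The gsub call removes indentation but leaves every line break from the PDF layout in place, so it will not by itself give the one-line-per-paragraph text shown in the Microsoft Edge example. The following is a minimal sketch of one way to get closer to that, using base R only. The helper names strip_indent and unwrap_paragraphs, the sample text, and the output file are hypothetical, and the sketch assumes that paragraphs in the extracted text are separated by at least one blank line, which may not hold for every PDF.

```r
# Hypothetical helpers for illustration; base R only, no pdftools required.

# Remove spaces and tabs at the start of every line of a single string.
strip_indent <- function(x) {
  gsub("(?m)^[ \t]+", "", x, perl = TRUE)
}

# Join hard-wrapped lines so that each paragraph becomes one line.
# Assumption: one or more blank lines separate paragraphs.
unwrap_paragraphs <- function(x) {
  x <- strip_indent(x)
  # Split the page into paragraphs at blank lines
  paragraphs <- strsplit(x, "\n[ \t]*\n+", perl = TRUE)[[1]]
  # Inside a paragraph, replace each remaining line break with one space
  paragraphs <- gsub("[ \t]*\n[ \t]*", " ", paragraphs)
  paragraphs <- trimws(paragraphs)
  # Drop empty pieces left by leading or trailing blank lines
  paragraphs[nzchar(paragraphs)]
}

# Made-up sample standing in for one page returned by pdf_text()
sample_page <- paste(
  "      This is the title",
  "      of this document",
  "",
  "1.0 Hello my name is John",
  "    and blah blah blah.",
  "",
  "1.1 blah blah blah blah",
  sep = "\n"
)

cleaned <- unwrap_paragraphs(sample_page)

# Write one paragraph per line; the temporary path is only a placeholder
out_file <- tempfile(fileext = ".txt")
writeLines(cleaned, out_file)
```

Because lines are joined with a single space, a word hyphenated across a line break stays split (as in "statis- tically" in the Edge output); rejoining such words would need an extra step. With pdf_text, each page is a separate string, so a helper like this would be applied to one page at a time.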