Extracting specific lines of text in R

Problem description

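I have a .txt file with several thousand lines. The file contains metadata about research articles. Each paper has information on publication year (PY), title (TI), DOI (DI), publication type (PT), and abstract (AB). In total, the text file holds information on nearly 300 papers. The records for the first two articles are formatted as follows.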

PT J
AU Filieri, Raffaele
   Acikgoz, Fulya
   Ndou, Valentina
   Dwivedi, Yogesh
TI Is TripAdvisor still relevant? The influence of review credibility,
   review usefulness, and ease of use on consumers' continuance intention
SO INTERNATIONAL JOURNAL OF CONTEMPORARY HOSPITALITY MANAGEMENT
DI 10.1108/IJCHM-05-2020-0402
EA NOV 2020
PY 2020
AB Purpose - Recent figures show that users are discontinuing their usage
   of TripAdvisor, the leading user-generated content (UGC) platform in the
   tourism sector. Hence, it is relevant to study the factors that
   influence travelers' continued use of TripAdvisor.
   Design/methodology/approach - The authors have integrated constructs
   from the technology acceptance model, information systems (IS)
   continuance model and electronic word of mouth literature. They used
   PLS-SEM (smartPLS V.3.2.8) to test the hypotheses using data from 297
   users of TripAdvisor recruited through Prolific.
   Findings - Findings reveal that perceived ease of use, online consumer
   review (OCR) credibility and OCR usefulness have a positive impact on
   customer satisfaction, which ultimately leads to continuance intention
   of UGC platforms. Customer satisfaction mediates the effect of the
   independent variables on continuance intention.
   Practical implications - Managers of UGC platforms (i.e. TripAdvisor)
   can benefit from the findings of this study. Specifically, they should
   improve the ease of use of their platforms by facilitating travelers'
   information searches. Moreover, they should use signals to make credible
   and helpful content stand out from the crowd of reviews.
   Originality/value - This is the first study that adopts the IS
   continuance model in the travel and tourism literature to research the
   factors influencing consumers' continued use of travel-based UGC
   platforms. Moreover, the authors have extended this model by including
   new constructs that are particularly relevant to UGC platforms, such as
   performance heuristics and OCR credibility.
ZR 0
ZA 0
Z8 0
ZS 0
TC 0
ZB 0
Z9 0
SN 0959-6119
EI 1757-1049
UT WOS:000592516500001
ER

PT J
AU Li, Yelin
   Bu, Hui
   Li, Jiahong
   Wu, Junjie
TI The role of text-extracted investor sentiment in Chinese stock price
   prediction with the enhancement of deep learning
SO INTERNATIONAL JOURNAL OF FORECASTING
VL 36
IS 4
BP 1541
EP 1562
DI 10.1016/j.ijforecast.2020.05.001
PD OCT-DEC 2020
PY 2020
AB Whether investor sentiment affects stock prices is an issue of
   long-standing interest for economists. We conduct a comprehensive study
   of the predictability of investor sentiment, which is measured directly
   by extracting expectations from online user-generated content (UGC) on
   the stock message board of Eastmoney.com in the Chinese stock market. We
   consider the influential factors in prediction, including the selections
   of different text classification algorithms, price forecasting models,
   time horizons, and information update schemes. Using comparisons of the
   long short-term memory (LSTM) model, logistic regression, support vector
   machine, and Naive Bayes model, the results show that daily investor
   sentiment contains predictive information only for open prices, while
   the hourly sentiment has two hours of leading predictability for closing
   prices. Investors do update their expectations during trading hours.
   Moreover, our results reveal that advanced models, such as LSTM, can
   provide more predictive power with investor sentiment only if the inputs
   of a model contain predictive information. (C) 2020 International
   Institute of Forecasters. Published by Elsevier B.V. All rights
   reserved.
CT 14th International Conference on Services Systems and Services
   Management (ICSSSM)
CY JUN 16-18, 2017
CL Dongbei Univ Finance & Econ, Sch Management Sci & Engn, Dalian, PEOPLES
   R CHINA
HO Dongbei Univ Finance & Econ, Sch Management Sci & Engn
SP Tsinghua Univ; Chinese Univ Hong Kong; IEEE Syst Man & Cybernet Soc
ZA 0
TC 0
ZB 0
ZS 0
Z8 0
ZR 0
Z9 0
SN 0169-2070
EI 1872-8200
UT WOS:000570797300025
ER

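Now I want to extract the abstract of each article and store it in a data frame. To extract the abstracts I have the following code, which gives me only the first match.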

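library(NLP)  # provides as.String()
f <- readLines("sample.txt")
# extract the first match only
txt <- as.String(f)  # collapse all lines into one newline-separated string
pattern <- "AB\\s*(.*?)\\s*ZR"
result <- regmatches(txt, regexec(pattern, txt))
result[[1]][2]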
[1] "Purpose - Recent figures show that users are discontinuing their usage\n   of TripAdvisor, the leading user-generated content (UGC) platform in the\n   tourism sector. Hence, it is relevant to study the factors that\n   influence travelers' continued use of TripAdvisor.\n   Design/methodology/approach - The authors have integrated constructs\n   from the technology acceptance model, information systems (IS)\n   continuance model and electronic word of mouth literature. They used\n   PLS-SEM (smartPLS V.3.2.8) to test the hypotheses using data from 297\n   users of TripAdvisor recruited through Prolific.\n   Findings - Findings reveal that perceived ease of use, online consumer\n   review (OCR) credibility and OCR usefulness have a positive impact on\n   customer satisfaction, which ultimately leads to continuance intention\n   of UGC platforms. Customer satisfaction mediates the effect of the\n   independent variables on continuance intention.\n   Practical implications - Managers of UGC platforms (i.e. TripAdvisor)\n   can benefit from the findings of this study. Specifically, they should\n   improve the ease of use of their platforms by facilitating travelers'\n   information searches. Moreover, they should use signals to make credible\n   and helpful content stand out from the crowd of reviews.\n   Originality/value - This is the first study that adopts the IS\n   continuance model in the travel and tourism literature to research the\n   factors influencing consumers' continued use of travel-based UGC\n   platforms. Moreover, the authors have extended this model by including\n   new constructs that are particularly relevant to UGC platforms, such as\n   performance heuristics and OCR credibility."

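The problem is that I want to extract all the abstracts, but the pattern differs for most of them (for example, the second record's abstract is followed by CT, not ZR). What all abstracts have in common is that the text starts at AB and every following line of the abstract begins with whitespace. Can anybody help me with this?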

Tags: r, regex, text, nlp

Solution


You can first group the lines: every time a line does not start with a space character, the group counter goes up by one.
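As a small illustration of that counter, here is a sketch on a made-up five-line vector. The names `toy` and `toy_group` are hypothetical and are not part of the question's data.

```r
# Toy input: two tagged fields with indented continuation lines, then "ER".
toy <- c("TI First title,", "   continued", "AB Some abstract", "   line two", "ER")

# grepl("^ ", toy) is TRUE for continuation lines. Negating it marks the
# lines that start a new field, and cumsum() turns those marks into group ids.
toy_group <- cumsum(!grepl("^ ", toy))
toy_group
# [1] 1 1 2 2 3
```

Each tagged line and the indented lines below it share one id, which is what makes the later aggregation possible.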

Then you can aggregate `f` by group and select the abstracts from the aggregated vector:

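# Start a new group at every line that does not begin with a space
group <- cumsum(!grepl("^ ", f))
# Paste the lines of each group into a single string
f2 <- aggregate(f, list(group), function(x) paste(x, collapse = " "))[, 2]

# Keep only the fields tagged AB (the abstracts)
f2[grepl("^AB ", f2)]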
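Since the question asks for the abstracts in a data frame, the same grouping idea could be carried one step further. The following is only a sketch under stated assumptions: `lines` is a shortened, made-up stand-in for `readLines("sample.txt")`, and `fields`, `abstracts`, and `abstracts_df` are hypothetical names. It uses `split()` and `vapply()` instead of `aggregate()`, trims the indentation before joining, and drops the leading `AB ` tag.

```r
# Assumption: `lines` stands in for readLines("sample.txt"); two shortened records.
lines <- c(
  "PT J",
  "TI First paper",
  "PY 2020",
  "AB First abstract, line one",
  "   line two.",
  "ZR 0",
  "ER",
  "",
  "PT J",
  "TI Second paper",
  "PY 2020",
  "AB Second abstract.",
  "ER"
)

# Group each tagged line with its indented continuation lines.
group <- cumsum(!grepl("^ ", lines))

# Trim the indentation, then join each group into one string.
fields <- vapply(split(trimws(lines), group), paste, character(1), collapse = " ")

# Keep the abstract fields and drop the leading "AB " tag.
abstracts <- sub("^AB ", "", fields[grepl("^AB ", fields)])

# One row per abstract.
abstracts_df <- data.frame(abstract = unname(abstracts), stringsAsFactors = FALSE)
abstracts_df
```

This keeps only the abstracts. If some records have no AB field, the rows will not line up one-to-one with the papers, so matching abstracts to titles or DOIs would need an extra per-record step.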
