r - 在r中提取特定的文本行
问题描述
我有一个包含数千行的.txt文件。在这个文件中,我有一个关于研究文章的元信息。每篇论文都有关于出版年份 (PY)、标题 (TI)、DOI 号 (DI)、出版类型 (PT) 和摘要 (AB) 的信息。因此,文本文件中存在近 300 篇论文的信息。前两篇文章的信息格式如下。
PT J
AU Filieri, Raffaele
Acikgoz, Fulya
Ndou, Valentina
Dwivedi, Yogesh
TI Is TripAdvisor still relevant? The influence of review credibility,
review usefulness, and ease of use on consumers' continuance intention
SO INTERNATIONAL JOURNAL OF CONTEMPORARY HOSPITALITY MANAGEMENT
DI 10.1108/IJCHM-05-2020-0402
EA NOV 2020
PY 2020
AB Purpose - Recent figures show that users are discontinuing their usage
of TripAdvisor, the leading user-generated content (UGC) platform in the
tourism sector. Hence, it is relevant to study the factors that
influence travelers' continued use of TripAdvisor.
Design/methodology/approach - The authors have integrated constructs
from the technology acceptance model, information systems (IS)
continuance model and electronic word of mouth literature. They used
PLS-SEM (smartPLS V.3.2.8) to test the hypotheses using data from 297
users of TripAdvisor recruited through Prolific.
Findings - Findings reveal that perceived ease of use, online consumer
review (OCR) credibility and OCR usefulness have a positive impact on
customer satisfaction, which ultimately leads to continuance intention
of UGC platforms. Customer satisfaction mediates the effect of the
independent variables on continuance intention.
Practical implications - Managers of UGC platforms (i.e. TripAdvisor)
can benefit from the findings of this study. Specifically, they should
improve the ease of use of their platforms by facilitating travelers'
information searches. Moreover, they should use signals to make credible
and helpful content stand out from the crowd of reviews.
Originality/value - This is the first study that adopts the IS
continuance model in the travel and tourism literature to research the
factors influencing consumers' continued use of travel-based UGC
platforms. Moreover, the authors have extended this model by including
new constructs that are particularly relevant to UGC platforms, such as
performance heuristics and OCR credibility.
ZR 0
ZA 0
Z8 0
ZS 0
TC 0
ZB 0
Z9 0
SN 0959-6119
EI 1757-1049
UT WOS:000592516500001
ER
PT J
AU Li, Yelin
Bu, Hui
Li, Jiahong
Wu, Junjie
TI The role of text-extracted investor sentiment in Chinese stock price
prediction with the enhancement of deep learning
SO INTERNATIONAL JOURNAL OF FORECASTING
VL 36
IS 4
BP 1541
EP 1562
DI 10.1016/j.ijforecast.2020.05.001
PD OCT-DEC 2020
PY 2020
AB Whether investor sentiment affects stock prices is an issue of
long-standing interest for economists. We conduct a comprehensive study
of the predictability of investor sentiment, which is measured directly
by extracting expectations from online user-generated content (UGC) on
the stock message board of Eastmoney.com in the Chinese stock market. We
consider the influential factors in prediction, including the selections
of different text classification algorithms, price forecasting models,
time horizons, and information update schemes. Using comparisons of the
long short-term memory (LSTM) model, logistic regression, support vector
machine, and Naive Bayes model, the results show that daily investor
sentiment contains predictive information only for open prices, while
the hourly sentiment has two hours of leading predictability for closing
prices. Investors do update their expectations during trading hours.
Moreover, our results reveal that advanced models, such as LSTM, can
provide more predictive power with investor sentiment only if the inputs
of a model contain predictive information. (C) 2020 International
Institute of Forecasters. Published by Elsevier B.V. All rights
reserved.
CT 14th International Conference on Services Systems and Services
Management (ICSSSM)
CY JUN 16-18, 2017
CL Dongbei Univ Finance & Econ, Sch Management Sci & Engn, Dalian, PEOPLES
R CHINA
HO Dongbei Univ Finance & Econ, Sch Management Sci & Engn
SP Tsinghua Univ; Chinese Univ Hong Kong; IEEE Syst Man & Cybernet Soc
ZA 0
TC 0
ZB 0
ZS 0
Z8 0
ZR 0
Z9 0
SN 0169-2070
EI 1872-8200
UT WOS:000570797300025
ER
现在,我想提取每篇文章的摘要并将其存储在数据框中。为了提取摘要,我有以下代码,这给了我摘要的第一个匹配项。
f = readLines("sample.txt")
#extract first match....
pattern <- "AB\\s*(.*?)\\s*ZR"
result <- regmatches(as.String(f), regexec(pattern, as.String(f)))
result[[1]][2]
[1] "Purpose - Recent figures show that users are discontinuing their usage\n of TripAdvisor, the leading user-generated content (UGC) platform in the\n tourism sector. Hence, it is relevant to study the factors that\n influence travelers' continued use of TripAdvisor.\n Design/methodology/approach - The authors have integrated constructs\n from the technology acceptance model, information systems (IS)\n continuance model and electronic word of mouth literature. They used\n PLS-SEM (smartPLS V.3.2.8) to test the hypotheses using data from 297\n users of TripAdvisor recruited through Prolific.\n Findings - Findings reveal that perceived ease of use, online consumer\n review (OCR) credibility and OCR usefulness have a positive impact on\n customer satisfaction, which ultimately leads to continuance intention\n of UGC platforms. Customer satisfaction mediates the effect of the\n independent variables on continuance intention.\n Practical implications - Managers of UGC platforms (i.e. TripAdvisor)\n can benefit from the findings of this study. Specifically, they should\n improve the ease of use of their platforms by facilitating travelers'\n information searches. Moreover, they should use signals to make credible\n and helpful content stand out from the crowd of reviews.\n Originality/value - This is the first study that adopts the IS\n continuance model in the travel and tourism literature to research the\n factors influencing consumers' continued use of travel-based UGC\n platforms. Moreover, the authors have extended this model by including\n new constructs that are particularly relevant to UGC platforms, such as\n performance heuristics and OCR credibility."
问题是,我想提取所有摘要,但大多数摘要的模式会有所不同。所以所有摘要的特定模式是我应该从AB开始提取文本,并且每下一行在前面都有空格。任何机构都可以在这方面帮助我吗?
解决方案
您可以首先对行进行分组:只要一行不以空格字符开头,组计数器就会向上移动一个。
然后您可以f
按组聚合并从聚合向量中选择摘要:
group <- cumsum(!grepl("^ ", f))
f2 <- aggregate(f, list(group), function(x) paste(x, collapse = " "))[, 2]
f2[grepl("^AB ", f2)]
推荐阅读
- mongodb - MongoDB,查找属性 id 等于记录 id 的所有文档
- python - 如何使 Python 的功能可以查看导入的库
- java - 无法提取响应:没有找到适合响应类型类和内容类型 [text/html] 的 HttpMessageConverter
- mongodb - 将 MongoDB $max 结果转换为 golang 数据
- codeceptjs - 无法连接到 WebDriver。错误:连接 ECONNREFUSED 127.0.0.1:4444 (codeceptjs)
- reactjs - 如何开玩笑地模拟 useHistory 钩子?
- nativescript - NativeView 在返回应用后捕获 Exception: TypeError
- python - 如何组合两列的浮点值并将其放在数据框的另一列中?
- python - 烧瓶根据其他表的ID添加一个新行
- javascript - CSP:拒绝执行内联脚本