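r - R code to extract field values from a JSON log file
Question
I have a file containing 50,000 records from a log collection. For each record I need to extract the values that follow "State": and "Code":. I tried regular expressions but could not get them to work. Instead, I tried the command below to see whether I could get just one of the values, but it simply times out.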
# this never completes
sub(".*?Code(.*?);.*", "\\1", logfile)
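I have no experience with this kind of work, so I would appreciate any help! This is exactly the format of the log file (supposedly JSON). My goal is to return the values below (it is fine if the State and Code labels cannot be included):
(State: Red, Code: null) (State: Blue, Code: no receipt)
Here is the exact syntax of the log file, with 2 records: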
"
2020-05-12 00:07:00.9681200, z123-asddfas,"
========== mode for SKU ==========
========== Records found ==========
No records found
========== DRecords found ==========
No drecords found
"
2020-05-12 00:08:46.5076411,qwer98-asdha,"
========== mode for SKU ==========
========== records found ==========
{
"State": "Red",
"Code": null
}
========== DRecords found ==========
No drecords found
"
2020-05-12 00:10:02.6607640,qweaso-34324-asda,"
========== mode for SKU ==========
========== records found ==========
{
"State": "Blue",
"Code": "no receipt"
}
Answer
Read in your text
library(readr)  # read_lines() comes from readr (part of the tidyverse)
logIn <- read_lines('"
2020-05-12 00:07:00.9681200, z123-asddfas,"
========== mode for SKU ==========
========== Records found ==========
No records found
========== DRecords found ==========
No drecords found
"
2020-05-12 00:08:46.5076411,qwer98-asdha,"
========== mode for SKU ==========
========== records found ==========
{
"State": "Red",
"Code": null
}
========== DRecords found ==========
No drecords found
"
2020-05-12 00:10:02.6607640,qweaso-34324-asda,"
========== mode for SKU ==========
========== records found ==========
{
"State": "Blue",
"Code": "no receipt"
}')
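In practice the log would be read from disk rather than pasted in as a string. The following is a minimal base R sketch of that step, assuming a hypothetical file path (a temporary file stands in for the real log here) and using `readLines()` so that it runs without any packages:

```r
# Hypothetical file: a temporary file stands in for the real log on disk.
log_path <- tempfile(fileext = ".log")
writeLines(c(
  '2020-05-12 00:08:46.5076411,qwer98-asdha,"',
  '========== records found ==========',
  '{',
  '"State": "Red",',
  '"Code": null',
  '}'
), log_path)

# One character element per line of the file.
log_lines <- readLines(log_path)

# Quick sanity check: flag the lines that mention State or Code.
hits <- grepl("state|code", log_lines, ignore.case = TRUE)
print(log_lines[hits])
```

With readr loaded, `read_lines(log_path)` should return the same character vector, so the rest of the answer would work unchanged on a real file.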
Put it into a tibble, filter it, and clean it up
library(tidyverse)
tibble(lines = logIn) %>%
  # Keep only the lines with 'state' or 'code'
  filter(str_detect(lines, "(?ix) ( state | code )")) %>%
  # Clean out all the whitespace and punct, except the ':'
  mutate(lines = str_replace_all(lines, '["\\s,]', '')) %>%
  # Use separate to divide into two new columns
  separate(lines, c("ATTR", "VALUE"), sep = ":")
What do we get?
# A tibble: 4 x 2
  ATTR  VALUE
  <chr> <chr>
1 State Red
2 Code  null
3 State Blue
4 Code  noreceipt
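Note that the cleaning step removes every whitespace character, which is why `no receipt` comes out as `noreceipt`. If the spaces inside a value matter, one option is to capture the key and the value with a regular expression instead of stripping characters. This is only a base R sketch, assuming the lines look exactly like the sample in the question; the `sample_lines` vector and the `pattern` are illustrative and not part of the original answer:

```r
# Sample lines standing in for the filtered State/Code lines
# (assumption: same layout as the log shown in the question).
sample_lines <- c(
  '"State": "Red",',
  '"Code": null',
  '"State": "Blue",',
  '"Code": "no receipt"'
)

# Group 1 captures the key; group 2 captures the value without the
# surrounding quotes or the trailing comma. Inner spaces are kept.
pattern <- '^\\s*"(State|Code)"\\s*:\\s*"?([^",]*)"?\\s*,?\\s*$'
m <- regmatches(sample_lines, regexec(pattern, sample_lines))

# Each match is c(full match, key, value); pull out elements 2 and 3.
parsed <- data.frame(
  ATTR  = vapply(m, `[`, character(1), 2),
  VALUE = vapply(m, `[`, character(1), 3),
  stringsAsFactors = FALSE
)
print(parsed)
```

Lines that do not match the pattern would give `NA` in both columns, so they are easy to spot and drop afterwards.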
##################### As requested
tibble(lines = logIn) %>%
  # Keep only the lines with 'state' or 'code'
  filter(str_detect(lines, "(?ix) ( state | code )")) %>%
  # This ID will come in useful
  rowid_to_column("ID") %>%
  # Clean out all the whitespace and punct, except the ':'
  mutate(lines = str_replace_all(lines, '["\\s,]', ''),
         # Give each State and Code the same ID.
         ID = floor((ID + 1) / 2)) %>%
  # Use separate to divide into two new columns
  separate(lines, c("ATTR", "VALUE"), sep = ":") %>%
  # spread() takes it from long form to wide form
  spread(key = ATTR, value = VALUE) %>%
  select(ID, State, Code)
# A tibble: 2 x 3
     ID State Code
  <dbl> <chr> <chr>
1     1 Red   null
2     2 Blue  noreceipt
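The `floor((ID + 1) / 2)` step assumes the filtered lines strictly alternate State, Code, State, Code. If a record could ever be missing one of the two lines (an assumption; the sample log does not show this), starting a new ID at every State line is a little more forgiving. Below is a base R sketch of that idea, with the long table from above typed in by hand as `long`; the helper `pick()` is hypothetical and exists only for this illustration:

```r
# The long-format result from above, entered by hand for illustration.
long <- data.frame(
  ATTR  = c("State", "Code", "State", "Code"),
  VALUE = c("Red", "null", "Blue", "noreceipt"),
  stringsAsFactors = FALSE
)

# Start a new record ID each time a State line appears.
long$ID <- cumsum(long$ATTR == "State")
ids <- unique(long$ID)

# Hypothetical helper: the value of one attribute per record ID,
# or NA when a record has no line for that attribute.
pick <- function(attr) {
  rows <- long[long$ATTR == attr, ]
  rows$VALUE[match(ids, rows$ID)]
}

wide <- data.frame(
  ID    = ids,
  State = pick("State"),
  Code  = pick("Code"),
  stringsAsFactors = FALSE
)
print(wide)
```

The same `cumsum(ATTR == "State")` expression could replace the `floor()` line inside the `mutate()` call, provided it is computed after `separate()` has created the `ATTR` column.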