How to remove JSON text (including CRLFs) from delimited records in PowerShell

Question

I have an odd problem: I need to remove JSON text from a tilde-delimited file (the CRLF at the end of each JSON line breaks the import). Example record:

Test Plan Work~Response Status: BadRequest Bad Request,Response Content: {
  "trace": "0HM5285F2",
  "errors": [
    {
      "code": "server_error",
      "message": "Couldn't access service ",
      "moreInfoUrl": null,
      "target": {
        "type": null,
        "name": null
      }
    }
  ]
},Request: https://www.test.com Headers: Accept: application/json
SubscriberId: 
~87c5de00-5906-4d2d-b65f-4asdfsdfsdfa29~3/17/2020 1:54:08 PM

Or records like this one, which have no JSON but still follow the same pattern I need:

Test Plan Pay Work~Response Status: InternalServerError Internal Server Error,Response Content: Error,Request: https://api.test.com Headers: Accept: application/json
Authorization: Bearer eyJhbGciOiJSUzI1NiIsInR5c
SubscriberId: eb7aee
~9d05b16e-e57b-44be-b028-b6ddsdfsdf62a5~1/20/2021 7:07:53 PM

For both types, I need the CSV text in this format:

Test Plan Work~Response Status: BadRequest Bad Request~87c5de00-5906-4d2d-b65f-4asdfsdfsdfa29~3/17/2020 1:54:08 PM

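The JSON (including the CRLF at the end of each JSON line) is breaking the import of the data into PowerShell. Any help or insight would be greatly appreciated!

Tags: powershell

Solution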


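PowerShell (or rather, .NET) has two special features in its regex engine that may suit this use case well: balancing groups and conditionals.

Balancing groups are a complex feature to explain in full, but they essentially let us "keep count" of how many times specific named subexpressions have matched within a regex pattern. Applied, it looks like this: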

PS ~> $string = 'Here is text { but wait { it has } nested { blocks }} here is more text'
PS ~> $string -replace '\{(?>\{(?<depth>)|[^{}]+|\}(?<-depth>))*(?(depth)(?!))\}'
Here is text  here is more text

Let's break down the regex pattern:

\{                    # match literal '{'
(?>                   # begin atomic group* 
     \{(?<depth>)     #     match literal '{' and increment counter
  |  [^{}]+           #  OR match any sequence of characters that are NOT '{' or '}'
  |  \}(?<-depth>)    #  OR match literal '}' and decrement counter
)*                    # end atomic group, whole group should match 0 or more times
(?                    # begin conditional group*
    (depth)(?!)       # if the 'depth' counter > 0, then FAIL!
)                     # end conditional group
\}                    # match literal '}' (corresponding to the initial '{')

*) The (?>...) atomic group prevents backtracking, a safeguard against accidentally counting anything more than once.
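
As a minimal sketch of the breakdown above, the same pattern can be written with the (?x) inline option, which lets the regex engine ignore pattern whitespace and # comments. The variable names $balanced, $text and $unbalanced are only illustrative.

```powershell
# The pattern from the breakdown, stored as a commented (?x) regex.
$balanced = @'
(?x)                  # ignore pattern whitespace, allow comments
\{                    # opening brace
(?>
     \{(?<depth>)     # nested '{': push onto the 'depth' stack
  |  [^{}]+           # anything that is not a brace
  |  \}(?<-depth>)    # nested '}': pop from the 'depth' stack
)*
(?(depth)(?!))        # fail if any '{' is still unclosed
\}                    # closing brace for the opening one
'@

$text = 'Here is text { but wait { it has } nested { blocks }} here is more text'

# Extract the balanced block instead of removing it
[regex]::Match($text, $balanced).Value

# With an unclosed outer brace, only the inner balanced block can match
$unbalanced = 'a { b { c } d'
[regex]::Match($unbalanced, $balanced).Value
```

The first call should return the whole outer block, nested braces included. The second shows that an unclosed outer brace is skipped and only the inner balanced pair is matched.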

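For the CRLF characters in the remaining fields, we can prefix the pattern with (?s). This makes the regex engine include newlines when matching the . ("any") metacharacter, so the match can run up to the position just before ~87c5...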

(?s),Response Content:\s*\{(?>\{(?<depth>)|[^{}]+|\}(?<-depth>))*(?(depth)(?!))\}.*?(?=~)
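
To see the effect of (?s) in isolation, here is a small sketch on a made-up two-line value ($sample is hypothetical, not taken from the file):

```powershell
# Hypothetical value that spans a CRLF and is followed by the '~' delimiter
$sample = "Headers: Accept`r`nSubscriberId: ~next"

# Without (?s), '.' stops at the line feed, so the lookahead for '~' is never reached
$withoutS = $sample -match 'Headers:.*?(?=~)'

# With (?s), '.' also matches newlines and the lazy match runs up to just before '~'
$withS = $sample -match '(?s)Headers:.*?(?=~)'

"Without (?s): $withoutS"
"With (?s):    $withS"
```

The first match should fail and the second should succeed, which is why the CRLFs in the trailing header fields need either (?s) or a character class such as [^,] that already includes newlines.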

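Or, perhaps more precisely, we can describe the fields after the JSON as repeated pairs of "," followed by a run of characters that are not ",":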

,Response Content:\s*(?:\{(?>\{(?<depth>)|[^{}]+|\}(?<-depth>))*(?(depth)(?!))\})?\s*(?:,[^,]+?)*(?=~)

Let's try it against your multi-line input string:

$string = @'
Test Plan Work~Response Status: BadRequest Bad Request,Response Content: {
  "trace": "0HM5285F2",
  "errors": [
    {
      "code": "server_error",
      "message": "Couldn't access service ",
      "moreInfoUrl": null,
      "target": {
        "type": null,
        "name": null
      }
    }
  ]
},Request: https://www.test.com Headers: Accept: application/json
SubscriberId: 
~87c5de00-5906-4d2d-b65f-4asdfsdfsdfa29~3/17/2020 1:54:08 PM
'@
$string -replace ',Response Content:\s*(?:\{(?>\{(?<depth>)|[^{}]+|\}(?<-depth>))*(?(depth)(?!))\})?\s*(?:,[^,]+?)*(?=~)'

Output:

Test Plan Work~Response Status: BadRequest Bad Request~87c5de00-5906-4d2d-b65f-4asdfsdfsdfa29~3/17/2020 1:54:08 PM
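
The second record type in the question has the plain value Error after Response Content: rather than JSON. The pattern above expects either a JSON block or a comma at that point, so that record would likely be left unchanged. The following is a sketch of one possible variant, not part of the original answer. It assumes that a non-JSON value contains no "," or "~", and the column names passed to -Header are made up for illustration. The JSON in the sample is abbreviated.

```powershell
# Balanced-brace subpattern from the answer, kept in a variable for reuse
$json = '\{(?>\{(?<depth>)|[^{}]+|\}(?<-depth>))*(?(depth)(?!))\}'

# Either a balanced JSON block or (assumption) a plain value without ',' or '~'
$pattern = ',Response Content:\s*(?:' + $json + '|[^,~]*)\s*(?:,[^,]+?)*(?=~)'

$records = @'
Test Plan Work~Response Status: BadRequest Bad Request,Response Content: {
  "trace": "0HM5285F2",
  "target": { "type": null }
},Request: https://www.test.com Headers: Accept: application/json
SubscriberId: 
~87c5de00-5906-4d2d-b65f-4asdfsdfsdfa29~3/17/2020 1:54:08 PM
Test Plan Pay Work~Response Status: InternalServerError Internal Server Error,Response Content: Error,Request: https://api.test.com Headers: Accept: application/json
Authorization: Bearer eyJhbGciOiJSUzI1NiIsInR5c
SubscriberId: eb7aee
~9d05b16e-e57b-44be-b028-b6ddsdfsdf62a5~1/20/2021 7:07:53 PM
'@

# Strip everything from ',Response Content:' up to the next '~' in each record
$clean = $records -replace $pattern

# Each record is now a single line; the header names below are hypothetical
$rows = $clean -split '\r?\n' |
    ConvertFrom-Csv -Delimiter '~' -Header Name, Status, Id, Timestamp

$rows | Format-Table
```

For a real file, reading it with Get-Content -Raw gives the whole content as one string, so the same -replace can span line breaks before the result is split into lines.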
