Regex/method to comment out Japanese text

Problem description

I have a large text file in the following format.

{
    "glossary": {
        "title": "example glossary",
        cm="私は今プログラミングーをしています"; 
        "text2": "example glossary",
        cm="私はABあああをしています"
}

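I need to comment out the lines that contain Japanese characters. Each of these lines begins with four or more tabs, and the number of tabs differs from line to line. I need to change the file above as follows: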

{
    "glossary": {
        "title": "example glossary",
        */cm="私は今プログラミングーをしています";*/
        "text2": "example glossary",
        */cm="私はABあああをしています";*/
}

Environment:

★ I can run a batch file.

★ I can run a VBScript.

★ I can use Sakura Editor. (preferred)

★ I cannot use or download third-party software.

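Things I have tried.

■ Using a regular expression ➞ I tried the regular expressions \p{Hiragana} and \p{Katakana}, and then \p{Han}, to replace the Japanese text with "", but symbols are still left behind.

■ Using VBA ➞ I tried using VBA to read each line of the text file and replace the matching lines with "*/". I don't know why, but it replaced the whole file. The code I used is as follows: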

Set objFSO = CreateObject("Scripting.FileSystemObject")
If objFSO.FileExists("C:\Users\s162138\Desktop\test.txt") then
    Set objFile = objFSO.OpenTextFile("C:\Users\s162138\Desktop\test.txt", 1)

    Do Until objFile.AtEndOfStream
        strLine = objFile.Readline
        If strNextLine = "cm=*" then
            strLine = "text"+ strLine + "text"
        End If

        strNewText = strLine + vbcrlf
    Loop
    Set objFile = Nothing

    Set objFile = objFSO.OpenTextFile("C:\Users\s162138\Desktop\test.txt", 2)
    objFile.Write strNewText
    Set objFile = Nothing
End If

I would be grateful if anyone could help.

Tags: regex, vba

Solution

Use the Japanese regular expression provided at https://gist.github.com/ryanmcgrath/982242, as follows:

^([ \t]*)(.*?(?:[\u3000-\u303F]|[\u3040-\u309F]|[\u30A0-\u30FF]|[\uFF00-\uFFEF]|[\u4E00-\u9FAF]|[\u2605-\u2606]|[\u2190-\u2195]|\u203B).*?)([ \t]*)$

Replace with $1/*$2*/$3.
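To see what the pattern and replacement do together, here is a minimal Python sketch. It is only an illustration, not part of the original answer: the names `JAPANESE_LINE` and `comment_japanese_lines` are made up for this example, the text is assumed to use `\n` line endings, and `re.MULTILINE` is assumed so that `^` and `$` anchor at every line rather than only at the ends of the whole text. Python writes the replacement groups as `\1`, `\2`, `\3` instead of `$1`, `$2`, `$3`.

```python
import re

# The answer's pattern, split over two literals for readability.
# re.MULTILINE makes ^ and $ match at the start and end of each line.
JAPANESE_LINE = re.compile(
    r"^([ \t]*)(.*?(?:[\u3000-\u303F]|[\u3040-\u309F]|[\u30A0-\u30FF]"
    r"|[\uFF00-\uFFEF]|[\u4E00-\u9FAF]|[\u2605-\u2606]|[\u2190-\u2195]"
    r"|\u203B).*?)([ \t]*)$",
    re.MULTILINE,
)


def comment_japanese_lines(text: str) -> str:
    """Wrap every line containing a character from the listed ranges in /* ... */.

    Leading indentation (group 1) and trailing spaces or tabs (group 3)
    stay outside the comment markers.
    """
    return JAPANESE_LINE.sub(r"\1/*\2*/\3", text)


if __name__ == "__main__":
    sample = (
        "{\n"
        '\t\t\t\t"title": "example glossary",\n'
        '\t\t\t\tcm="私はABあああをしています"\n'
        "}"
    )
    print(comment_japanese_lines(sample))
```

Lines without any character from the listed ranges are left untouched, because `.` cannot cross a line break and so the alternation has nothing to match on those lines.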

Explanation

                         EXPLANATION
--------------------------------------------------------------------------------
  ^                        the beginning of the string
--------------------------------------------------------------------------------
  (                        group and capture to \1:
--------------------------------------------------------------------------------
    [ \t]*                   any character of: ' ', '\t' (tab) (0 or
                             more times (matching the most amount
                             possible))
--------------------------------------------------------------------------------
  )                        end of \1
--------------------------------------------------------------------------------
  (                        group and capture to \2:
--------------------------------------------------------------------------------
    .*?                      any character except \n (0 or more times
                             (matching the least amount possible))
--------------------------------------------------------------------------------
    (?:                      group, but do not capture:
--------------------------------------------------------------------------------
      [\u3000-\u303F]          punctuation
--------------------------------------------------------------------------------
     |                        OR
--------------------------------------------------------------------------------
      [\u3040-\u309F]          hiragana
--------------------------------------------------------------------------------
     |                        OR
--------------------------------------------------------------------------------
      [\u30A0-\u30FF]          katakana
--------------------------------------------------------------------------------
     |                        OR
--------------------------------------------------------------------------------
      [\uFF00-\uFFEF]          Full-width roman + half-width katakana
                               
--------------------------------------------------------------------------------
     |                        OR
--------------------------------------------------------------------------------
      [\u4E00-\u9FAF]          Common and uncommon kanji
--------------------------------------------------------------------------------
     |                        OR
--------------------------------------------------------------------------------
      [\u2605-\u2606]          Stars
--------------------------------------------------------------------------------
     |                        OR
--------------------------------------------------------------------------------
      [\u2190-\u2195]          arrows
--------------------------------------------------------------------------------
     |                        OR
--------------------------------------------------------------------------------
      \u203B                    reference mark (※)
--------------------------------------------------------------------------------
    )                        end of grouping
--------------------------------------------------------------------------------
    .*?                      any character except \n (0 or more times
                             (matching the least amount possible))
--------------------------------------------------------------------------------
  )                        end of \2
--------------------------------------------------------------------------------
  (                        group and capture to \3:
--------------------------------------------------------------------------------
    [ \t]*                   any character of: ' ', '\t' (tab) (0 or
                             more times (matching the most amount
                             possible))
--------------------------------------------------------------------------------
  )                        end of \3
--------------------------------------------------------------------------------
  $                        before an optional \n, and the end of the
                           string
