Regex not working to find within a range of a Word document

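Problem description

The regex is not working when I try to extract the content between two sections (the function itself works fine, but I may not have specified the correct regex to search with).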

ExtractFromWordDoc "D:\Scan.doc" '(?:\d{2}\.\d).*(?:Non-Payment)'  '(?:\d{2}\.\d).*(?:Financial covenants and other obligation)'

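Word document content (I need to extract the information between 29.1 and 29.2)

29.1 Non-payment

An Obligor does not pay on the due date any amount payable pursuant to a Finance Document at the place at and in the currency in which it is expressed to be payable unless:

(a) its failure to pay is caused by: (i) administrative or technical error; or (b) [payment is made within: (i) (in the case of paragraph (a)(i) above), [ ] Business Days of its due date;

29.2 Financial covenants and other obligations

(a) Any requirement of Clause 27 (Financial covenants) is not satisfied [or an Obligor does not comply with the provisions of Clause 26 (Information undertakings)] [and/or Clause 28 (General undertakings)].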

function ExtractFromWordDoc{
Param([string]$SourceFile, [string]$SearchKeyword1, [string]$SearchKeyword2)

$word = New-Object -ComObject Word.Application
$word.Visible = $false
$doc = $word.Documents.Open($SourceFile,$false,$true)
$sel = $word.Selection 
$paras = $doc.Paragraphs 
foreach ($para in $paras) 
{ 
    if ($para.Range.Text -match $SearchKeyword1)
    {
        $startPosition = $para.Range.Start
    }
    if ($para.Range.Text -match $SearchKeyword2)
    {
        $endPosition = $para.Range.Start
        break
    }
} 

[array]$content=New-Object System.Collections.ArrayList
$doc.Range($startPosition, $endPosition).Copy()
$content=Get-Clipboard -Raw
$content = $content -replace "'", ""

# cleanup com objects
$doc.Close()
$word.Quit()
[System.Runtime.Interopservices.Marshal]::ReleaseComObject($doc) | Out-Null
[System.Runtime.Interopservices.Marshal]::ReleaseComObject($word) | Out-Null
[System.GC]::Collect()
[System.GC]::WaitForPendingFinalizers()
}

Tags: regex, powershell

Solution


You have just one small error in your regex.

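The sample text reads Non-payment, but the regex looks for Non-Payment. A case-sensitive comparison treats these as different strings. Note that PowerShell's -match operator ignores case by default, while -cmatch is case-sensitive.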

If you change '(?:\d{2}\.\d).*(?:Non-Payment)' to '(?:\d{2}\.\d).*(?:Non-payment)', it should work.

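Another note: you are missing the s from obligations in '(?:\d{2}\.\d).*(?:Financial covenants and other obligation)', but I don't foresee that causing a problem, because the pattern is not anchored at the end and still matches the start of the heading.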
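Both patterns can be checked against the sample headings directly in PowerShell, without opening Word. The sketch below is only a quick check under stated assumptions: the variable names are made up for illustration, and the heading strings are typed in by hand rather than read from the document.

```powershell
# Headings as they appear in the sample document (typed in by hand for this check).
$heading1 = '29.1 Non-payment'
$heading2 = '29.2 Financial covenants and other obligations'

# Patterns from the original call.
$pattern1 = '(?:\d{2}\.\d).*(?:Non-Payment)'
$pattern2 = '(?:\d{2}\.\d).*(?:Financial covenants and other obligation)'

$caseInsensitive = $heading1 -match  $pattern1   # -match ignores case by default
$caseSensitive   = $heading1 -cmatch $pattern1   # -cmatch compares case exactly
$prefixOnly      = $heading2 -match  $pattern2   # unanchored, so the missing "s" is harmless

"{0} {1} {2}" -f $caseInsensitive, $caseSensitive, $prefixOnly   # True False True
```

If the first value is True on your machine as well, the capital P alone would not stop -match from finding the paragraph, and it is worth checking how the function's result is being read back.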

Disclaimer: I did not test your code, only your regex.

Edit:

I tested the following:

function ExtractFromWordDoc{
Param([string]$SourceFile, [string]$SearchKeyword1, [string]$SearchKeyword2)

$word = New-Object -ComObject Word.Application
$word.Visible = $false
$doc = $word.Documents.Open($SourceFile,$false,$true)
$sel = $word.Selection 
$paras = $doc.Paragraphs 
foreach ($para in $paras) 
{ 
    if ($para.Range.Text -match $SearchKeyword1)
    {
        #"Point 1"
        $startPosition = $para.Range.Start
    }
    if ($para.Range.Text -match $SearchKeyword2)
    {
        #"Point 2"
        $endPosition = $para.Range.Start
        break
    }
} 

[array]$content=New-Object System.Collections.ArrayList
$doc.Range($startPosition, $endPosition).Copy()
$content=Get-Clipboard -Raw
$content = $content -replace "'", ""

# cleanup com objects
$doc.Close()
$word.Quit()
[System.Runtime.Interopservices.Marshal]::ReleaseComObject($doc) | Out-Null
[System.Runtime.Interopservices.Marshal]::ReleaseComObject($word) | Out-Null
[System.GC]::Collect()
[System.GC]::WaitForPendingFinalizers()
}

ExtractFromWordDoc "C:\testing\test.doc" '(?:\d{2}\.\d).*(?:Non-payment)'  '(?:\d{2}\.\d).*(?:Financial covenants and other obligation)'

The output in the clipboard is:

29.1 Non-payment
An Obligor does not pay on the due date any amount payable pursuant to a Finance Document at the place at and in the currency in which it is expressed to be payable unless:
(a) its failure to pay is caused by: (i) administrative or technical error; or (b) [payment is made within: (i) (in the case of paragraph (a)(i) above), [ ] Business Days of its due date; 

If I add $content as the last line of the function, it outputs this text to the console.
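The paragraph scan and the effect of a trailing $content line can be tried without Word. The following is a minimal sketch, not the function above: Get-TextBetweenHeadings is a hypothetical helper that runs the same start/end matching over a plain array of strings, and the sample paragraphs are abbreviated stand-ins for the document text.

```powershell
# Hypothetical helper: mirrors the paragraph scan in ExtractFromWordDoc,
# but works on an array of strings so it runs without Word installed.
function Get-TextBetweenHeadings {
    Param([string[]]$Paragraphs, [string]$StartPattern, [string]$EndPattern)

    $startIndex = -1
    $endIndex = -1
    for ($i = 0; $i -lt $Paragraphs.Count; $i++) {
        if ($Paragraphs[$i] -match $StartPattern) {
            $startIndex = $i
        }
        if ($Paragraphs[$i] -match $EndPattern) {
            $endIndex = $i
            break
        }
    }

    # Nothing to return unless both headings were found, in order.
    if ($startIndex -lt 0 -or $endIndex -le $startIndex) { return }

    $content = $Paragraphs[$startIndex..($endIndex - 1)] -join "`n"

    # A bare expression on its own line is written to the output stream,
    # so the text reaches the console or the caller's variable.
    $content
}

# Abbreviated stand-ins for the paragraphs of the sample document.
$sample = @(
    '29.1 Non-payment',
    'An Obligor does not pay on the due date.',
    '29.2 Financial covenants and other obligations',
    '(a) Any requirement of Clause 27 is not satisfied.'
)

$result = Get-TextBetweenHeadings $sample '(?:\d{2}\.\d).*(?:Non-payment)' '(?:\d{2}\.\d).*(?:Financial covenants and other obligation)'
$result
```

Without the final $content line the function would compute the text but emit nothing, so $result would be empty even though the matching succeeded.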

