Regular expression exclude matches surrounded by quotation marks and lines starting with %

Problem description

I want to make a regular expression that is able to do the following:

So I came up with the following regex (with flags g, m):

^[^%]*?(?<=[^\'\"])\b(addpaths|addpath|test)\b(?=[^\'\"]).*?$?

And this gives me the following result (see regex101):

function addpaths()                         --> match, correct
  % function addpaths to add paths to path  --> no match, correct
  fprintf('running addpaths')               --> no match, correct
  fprintf('addpaths running')               --> no match, correct
  fprintf('running addpaths.')              --> match, wrong
  fprintf('running addpaths function')      --> match, wrong

  % fprintf('running addpaths')             --> no match, correct
  % fprintf('addpaths running')             --> no match, correct
  % fprintf('running addpaths function')    --> no match, correct

  % test what happens to 'test'     --> no match, correct
  run('test')                       --> no match, correct
  'this is a test.'                 --> match, wrong
  test                              --> match, correct

So the regex works when one of the exact matching words is next to a ', but not when there is another word, whitespace or . next to it. Why?

import re

text = '''function addpaths()
  % function addpaths to add paths to path
  fprintf('running addpaths')
  fprintf('addpaths running')
  fprintf('running addpaths function')

  % fprintf('running addpaths')
  % fprintf('addpaths running')
  % fprintf('running addpaths function')

  % test what happens to 'test'
  run('test')
  'this is a test.'
  test
'''

pattern = '^[^%]*?(?<=[^\'\"])\\b(addpaths|addpath|test)\\b(?=[^\'\"]).*?$'
regex = re.compile(pattern, re.M)

matches = regex.findall(text)
for m in matches:
    print(m)

Tags: python, regex

Solution


Try this:

import re


text = '''function addpaths()
  % function addpaths to add paths to path
  fprintf('running addpaths')
  fprintf('addpaths running')
  fprintf('running addpaths function')

  % fprintf('running addpaths')
  % fprintf('addpaths running')
  % fprintf('running addpaths function')

  % test what happens to 'test'
  run('test')
  'this is a test.'
  test'''

pattern = r"""^(?!\s*%)[^'\"]+?\b(addpaths|addpath|test)\b(?!.*?['\"]).*?$"""
regex = re.compile(pattern, re.M)

for line in text.split('\n'):
    print(line.ljust(50, ' '), 'OK' if regex.match(line) else 'NO MATCH')

Output:

function addpaths()                                OK
  % function addpaths to add paths to path         NO MATCH
  fprintf('running addpaths')                      NO MATCH
  fprintf('addpaths running')                      NO MATCH
  fprintf('running addpaths function')             NO MATCH
                                                   NO MATCH
  % fprintf('running addpaths')                    NO MATCH
  % fprintf('addpaths running')                    NO MATCH
  % fprintf('running addpaths function')           NO MATCH
                                                   NO MATCH
  % test what happens to 'test'                    NO MATCH
  run('test')                                      NO MATCH
  'this is a test.'                                NO MATCH
  test                                             OK

I used the negative lookahead (?!.*?['\"]) because in 'this is a test.' the word test is followed by a ., not by a quote, so what has to be rejected is a quote anywhere later on the line. In your regex, (addpaths|addpath|test)\b(?=[^\'\"]) only excludes a keyword that is directly followed by a quote. That is why run('test') didn't match while 'this is a test.' did.
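As a rough illustration of the point above, the sketch below isolates the two lookarounds from the question's pattern and runs them against three of the sample lines. The name adjacent_only and the samples list are made up for this illustration; only the lookarounds themselves come from the question.

```python
import re

# The two lookarounds from the question, isolated from the rest of its pattern.
# (?<=[^'"]) inspects only the single character directly before the keyword;
# (?=[^'"]) inspects only the single character directly after it.
# adjacent_only is a hypothetical name used for this illustration.
adjacent_only = re.compile(r"""(?<=[^'"])\b(addpaths|addpath|test)\b(?=[^'"])""")

samples = [
    "run('test')",                           # quotes touch the keyword on both sides
    "'this is a test.'",                     # space before, '.' after
    "fprintf('running addpaths function')",  # spaces on both sides
]

for sample in samples:
    found = adjacent_only.search(sample) is not None
    print(sample.ljust(40), 'match' if found else 'no match')
```

Only the first sample is rejected, because it is the only one where a quote sits immediately next to the keyword. In the other two the quotes are further away, so a one-character lookaround never sees them.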

