Python - How to search for terms in a text file using regular expressions

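Problem description

I'm very new to coding, so any help would be greatly appreciated.

I have a regex function here that finds certain terms in a .txt file.

The function that returns the regex match: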


import re


def find_regex(start_regex, stop_regex, page_words_raw):
    # need to initialize because of bad return function
    start_char = None
    end_char_0 = None
    # searches the raw text for the start regex phrase
    for match in re.finditer(start_regex, page_words_raw):
        # just care about where the first character of the matched text starts ([0])
        start_char = match.span()[0]

    for match in re.finditer(stop_regex, page_words_raw[start_char:]):
        # but we need to know the start and end of the stop character so we can subtract it from the return
        # since we want to look for stop word after our start word we need to add the indexes lost
        # at the page_words_raw[start_char:] bit
        end_char_0 = match.span()[0] + start_char
        end_char_1 = match.span()[1] + start_char

    # if found return string minus the stop regex stuff
    if type(start_char) == int and type(end_char_0) == int:
        return page_words_raw[start_char : (end_char_1 - (end_char_1 - end_char_0))]
    else:
        print("Regex Not Found")
        return "Regex Not Found"

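All the .txt files contain different numbers (e.g. 4410, 4408, 4405, etc.), and the next string will always be a sequence of one letter followed by 7 digits (e.g. C90253453, D0004323, N1235423).

The call that finds the four-digit sequence is: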

    # this call finds 44xx, it's meant to stop at Mxxxxxxx
    found_stuff = find_regex(r'44\d{2}', r'\s\d{7}', page_words_raw)

When I run it, it returns 4407 but does not stop at C0243543. Is there a way to fix this?

Tags: python, regex, function

Solution


If you can tolerate reading the entire file into Python, your requirement is easy to handle with re.findall:

import re

text = """4410 C90253453 4408 D0004323 4405 N1235423"""
nums = re.findall(r'\b(\d{4})\s+[A-Z]\d+\b', text)
print(nums)

This prints:

['4410', '4408', '4405']
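
As for why the original call does not stop at the ID: the stop pattern `\s\d{7}` requires whitespace immediately followed by seven digits, but in the text the whitespace is followed by a letter, so the pattern cannot match at C0243543. In addition, both `for` loops in `find_regex` keep overwriting their variables, so they end up with the last match rather than the first. A minimal sketch of an alternative is below. The name `find_between` and the sample string are hypothetical, and the stop pattern `\s[A-Z]\d+` assumes the ID always starts with a single uppercase letter, as in the examples from the question.

```python
import re


def find_between(start_regex, stop_regex, text):
    """Return the text from the first start match up to, but not including,
    the first stop match that follows it. Returns None if either is missing."""
    start = re.search(start_regex, text)
    if start is None:
        return None
    # A compiled pattern's search() accepts a start position, so the stop
    # pattern is only looked for after the end of the start match.
    stop = re.compile(stop_regex).search(text, start.end())
    if stop is None:
        return None
    return text[start.start():stop.start()]


# Hypothetical sample text standing in for the contents of one .txt file
page_words_raw = "Header 4407 C0243543 more text"
found_stuff = find_between(r"44\d{2}", r"\s[A-Z]\d+", page_words_raw)
print(found_stuff)  # 4407
```

Returning None instead of an error string lets the caller tell "nothing found" apart from a real match; whether that suits the rest of the program is a design choice.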
