Python: Algorithm for renumbering footnotes in text pages (files)

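Problem description

Suppose you have text files, each containing the text of one page of a book. Each page has 0 to 10 footnotes, and across all the pages of a chapter they are numbered 1 to N. Now suppose the last page of a chapter also overlaps with the first page of the next chapter, so a single page can hold the end of one chapter and the start of the next.

Footnotes are declared in the page text with the following syntax: (1).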
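
As a small sketch of that syntax, a footnote marker can be matched with the regular expression `\((\d+)\)`. The pattern name `FOOTNOTE` and the sample line are assumptions for illustration, not a general footnote grammar; the pattern would also match any other parenthesized number, such as a year.

```python
import re

# Assumed marker syntax: digits wrapped in parentheses, e.g. "(12)".
FOOTNOTE = re.compile(r"\((\d+)\)")

line = "A single line(1) of text might have multiple footnotes(2) in it."

# Each entry records where a marker starts and which number it carries.
markers = [(m.start(), int(m.group(1))) for m in FOOTNOTE.finditer(line)]
print(markers)
```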

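It is the overlapping pages that are giving me fits when renumbering footnotes per page. I want every page to have its footnotes numbered from 1 to N.

Here is an example of a special case that trips up every loop I have come up with:

Example original page text: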

A footnote from the last part of a chapter might begin with any number footnote(2).  
This might be in the last paragraph of some chapter that is ending.

Some Next Chapter DD

A single line(1) of text might have multiple footnotes(2) in it on the same line.
Then a new line of text has another footnote(3) in it.

I want to renumber the footnotes of the example page above to produce the example page below:

Desired page with renumbered footnotes:
----- Start of example page with footnotes -----

A footnote from the last part of a chapter might begin with any number footnote(1).  
This might be in the last paragraph of some chapter that is ending. 

Some Next Chapter DD

A single line(2) of text might have multiple footnotes(3) in it on the same line.
Then a new line of text has another footnote(4) in it.

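Using Python, I have not found any looping algorithm that works. Whether the corrections are applied to the file immediately or buffered, the next pass of the loop may renumber the right footnote, or it may clobber a footnote that the previous pass already corrected. Do I need to use file seek operations, or can some kind of regular-expression loop handle this?

Tags: python, algorithm, string-matching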

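Solution


I now have a solution to this problem. It turns out that an in-line change sometimes leaves two identical footnotes on the same line, where the second of the two is the next one to be changed. A regular-expression replacement will hit the first one, which was changed previously. Handling this case takes a little care.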
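
The collision can be reproduced in a few lines. This is an illustrative sketch that uses a line from the example page and naive `str.replace` calls; it is not the code from the original attempts.

```python
line = "A single line(1) of text might have multiple footnotes(2) in it."

# Step 1: renumber the first footnote, (1) -> (2).
line = line.replace("(1)", "(2)", 1)
# The line now holds two "(2)" markers: the corrected one and the pending one.
assert line.count("(2)") == 2

# Step 2: renumber the second footnote, (2) -> (3), with a plain replace.
# The replacement hits the FIRST "(2)", which was already corrected.
broken = line.replace("(2)", "(3)", 1)
print(broken)
# A single line(3) of text might have multiple footnotes(2) in it.
```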

For the code below, page is the list of text lines from file_handle.readlines().



import re
import sys


def replace_nth_substring_in_string(string, old, new, nth):
    # Locate the nth (1-based) literal occurrence of `old`; re.escape keeps
    # the parentheses from being read as a regex group.
    split_location = [m.start() for m in re.finditer(re.escape(old), string)][nth - 1]
    head, tail = string[:split_location], string[split_location:]
    tail = tail.replace(old, new, 1)
    return head + tail


new_num = 1
for i in range(len(page)):
    footnote_matches = re.findall(r'\(\d+\)', page[i])
    for match in footnote_matches:
        old = match
        new = '({})'.format(new_num)

        # grabbing this piece of info is key!
        num_old_foots_on_line = page[i].count(old)

        # normal case; simple replace
        if num_old_foots_on_line == 1:
            page[i] = page[i].replace(old, new, 1)

        # a previous correction has produced two identical footnotes;
        # the first is already corrected, so replace only the second one
        elif num_old_foots_on_line == 2:
            page[i] = replace_nth_substring_in_string(page[i], old, new, 2)

        # for my case, there should never be three or more identical
        # footnotes on a line, but others may have to handle this case
        else:
            print("There are three (or more) identical footnotes on this line")
            sys.exit()
        new_num += 1
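
A different technique, not the one used above, sidesteps the collision: build each output line in a single pass with `re.sub` and a replacement function, so text that has already been renumbered is never scanned again. The following is a minimal sketch under the same assumption that every `(digits)` group is a footnote; the names `renumber_footnotes` and `FOOTNOTE` are hypothetical.

```python
import itertools
import re

FOOTNOTE = re.compile(r"\(\d+\)")


def renumber_footnotes(lines, start=1):
    """Return a copy of `lines` with footnote markers numbered sequentially."""
    counter = itertools.count(start)

    def next_marker(_match):
        # Called once per marker, left to right; the old number is ignored.
        return "({})".format(next(counter))

    # re.sub writes replacements into a new string, so a marker that was
    # already renumbered can never be matched a second time.
    return [FOOTNOTE.sub(next_marker, line) for line in lines]


page = [
    "A footnote from the last part of a chapter might begin with any number footnote(2).\n",
    "A single line(1) of text might have multiple footnotes(2) in it on the same line.\n",
    "Then a new line of text has another footnote(3) in it.\n",
]
for line in renumber_footnotes(page):
    print(line, end="")
```

Because the old numbers are ignored, this sketch should also cope with a line on which the same number appears more than once in the original text, a case the in-place approach exits on.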


