Preprocessing text and removing footnotes, extra spaces and ...

Problem description

I need to clean my corpus, which contains the following problems.

For example, here is a sample paragraph:

On 1580 November 12 at 10h 50m,1 they set Mars down at 8° 36’ 50” Gemini2 without mentioning the horizontal variations, by which term I wish the diurnal parallaxes and the refractions to be understood in what follows. Now this observation is distant and isolated. It was reduced to the moment of opposition using the diurnal motion from the Prutenic Tables  . 

I have handled this with the following functions:

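import re

def fix4token(x):
    # Normalize the curly closing quote to a straight double quote.
    x = re.sub('”', '"', x)
    # Only touch tokens that do not start with a digit or that contain a letter.
    if not x[0].isdigit() or re.search('[a-zA-Z]', x):
        # Strip all trailing digits (this is what turns "DC2" into "DC").
        res = x.rstrip('0123456789')
        # Keep only the part before the first comma between word characters.
        output = re.split(r"\b,\b", res, maxsplit=1)[0]
        return output
    else:
        return x

def removespaces(x):
    # Replace each double space with a single space (one pass only).
    res = x.replace("  ", " ")
    return res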

It works well on this sample, and the result is:

On 1580 November 12 at 10h 50m, they set Mars down at 8° 36’ 50" Gemini without mentioning the horizontal variations,  by which term I wish the diurnal parallaxes and the refractions to be understood in what follows. Now this observation is distant and isolated. It was reduced to the moment of opposition using the diurnal motion from the Prutenic Tables.

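But the problem is that it damages other paragraphs, so it does not work well in general.

I think it is because this part breaks other things: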

x = re.sub('”', '"', x)
if not x[0].isdigit() or re.search('[a-zA-Z]', x):
    res = x.rstrip('0123456789')
    output = re.split(r"\b,\b", res, maxsplit=1)[0]
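
The side effect can be reproduced in isolation. Below is a minimal sketch, assuming tokens are passed one at a time; `strip_trailing_digits` is a hypothetical helper name that only mirrors the `rstrip` step of `fix4token`:

```python
def strip_trailing_digits(token):
    # Same core step as fix4token: drop every trailing digit unconditionally.
    return token.rstrip('0123456789')

# The footnote marker is removed, as intended.
print(strip_trailing_digits("Gemini2"))  # Gemini

# A legitimate digit is removed as well, which is the unwanted behavior.
print(strip_trailing_digits("DC2"))      # DC
```

Because `rstrip` has no notion of context, it cannot tell a footnote marker from a digit that belongs to the token.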

What is the safest way to do the following?

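1- Remove the footnote markers from phrases like these

without changing other parts of the text (for example, my approach breaks "DC2" into "DC", which is not wanted)

2- Remove multiple spaces before a period, so that "Tables  ." has no space before the period, and collapse multiple spaces into one, e.g. ",  by which term" should become ", by which term" (only one space)

3- Replace the unknown ” character with " ... done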

Thank you

Tags: python, regex

Solution
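
One possible approach is to replace the blanket `rstrip` with context-aware regular expressions applied to the whole paragraph. The sketch below is a heuristic, not a definitive fix. It assumes a footnote marker is a run of digits glued either to a word ending in a lowercase ASCII letter ("Gemini2") or to punctuation that follows a lowercase letter ("50m,1"). The names `clean_text` and `protected` are hypothetical and introduced only for illustration.

```python
import re

# Digits glued to the end of a word that ends in a lowercase letter: "Gemini2".
# "DC2" and "H2O" are untouched because the digit does not follow a lowercase letter.
_WORD_MARKER = re.compile(r"\b([A-Za-z]*[a-z])\d+\b")

# Digits glued to punctuation that itself follows a lowercase letter: "50m,1".
# "1,000" and "3.14" are untouched because a digit precedes the punctuation.
_PUNCT_MARKER = re.compile(r"(?<=[a-z][,.;:])\d+(?=\s|$)")

# Spaces or tabs directly before punctuation, and runs of two or more spaces or tabs.
_SPACE_BEFORE_PUNCT = re.compile(r"[ \t]+([.,;:!?])")
_MULTI_SPACE = re.compile(r"[ \t]{2,}")


def clean_text(text, protected=frozenset()):
    """Heuristic cleanup; `protected` lists tokens such as "mp3" to keep as-is."""
    # Step 3 of the question: normalize the curly closing quote.
    text = text.replace("”", '"')

    def drop_marker(match):
        # Keep explicitly protected tokens, otherwise keep only the word part.
        token = match.group(0)
        return token if token in protected else match.group(1)

    # Step 1: remove footnote markers.
    text = _WORD_MARKER.sub(drop_marker, text)
    text = _PUNCT_MARKER.sub("", text)

    # Step 2: tidy whitespace.
    text = _SPACE_BEFORE_PUNCT.sub(r"\1", text)
    text = _MULTI_SPACE.sub(" ", text)
    return text.strip()


sample = ("On 1580 November 12 at 10h 50m,1 they set Mars down at 8° 36’ 50” "
          "Gemini2 without mentioning the variations,  by which term. "
          "Prutenic Tables  . ")
print(clean_text(sample))
# On 1580 November 12 at 10h 50m, they set Mars down at 8° 36’ 50" Gemini without mentioning the variations, by which term. Prutenic Tables.
print(clean_text("DC2 and 1,000 men"))
# DC2 and 1,000 men
```

The heuristic still removes digits from lowercase tokens such as "mp3" or "utf8", and from patterns like "fig.2", so any such tokens present in the corpus would need to be passed in `protected` or handled by a stricter pattern. Checking the output on a sample of real paragraphs before running it over the whole corpus is advisable.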


