Why are the BM25 scores negative for all documents?

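Problem description

I have been trying to compute BM25 scores to find the relevant and irrelevant documents in a collection for a given query. It turns out that my BM25 scores are negative for every document, possibly because of the document lengths or the number of documents. The code is in Python. The document collection consists of XML files.

For example:

The number before the colon is the document ID, and the number after the colon is the BM25 score for the query.

The output in Python is as follows:

Query: "Documents researching the pros and cons of government-funded vouchers for private or religious schools"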

bm25 --> {'47424': -15.148270578009287, '59301': -4.5324888278916955, '51441': -6.39047920340723, '35782': -12.439912866021055, '6409': -14.577866844394313, '70619': -17.00481043343906, '44597': -8.697203569753626, '73731': -5.178938184315641, '56686': -15.217341163205859, '75558': -13.864972462927318, '6386': -15.063892406359518, '50516': -15.224423885623839, '53259': -15.251308584218336, '53364': -15.66416744135934, '7914': -5.66450911939087, '32528': -16.092842615677846, '16723': -10.120903791415753, '86068': -15.109324552334709, '67169': -16.25963854937583, '41521': -14.689271200861244, '25029': -10.542716008819404, '32963': -15.995234645822308, '49023': -5.128845432659929, '46632': -4.522709302815306, '76481': -15.331654599460377, '19526': -16.5397158773958, '68829': -6.237632920251847, '49731': -16.64556902599432, '61487': -15.777496075315927, '16841': -4.970159012101008, '6399': -16.14724012989677, '55974': -16.679111714509364, '76556': -14.37904634631273, '61644': -9.167501264772618, '8085': -15.048433817371734, '55891': -8.227333733937748, '3648': -16.5003155647673, '70606': -16.840470957025445, '64336': -4.2650402909943, '31281': -15.991922110559575, '2800': -14.793472384723657, '67135': -14.008735870771446, '41355': -14.200897078737842, '70854': -16.398911831821696}

Tags: algorithm, sorting, nlp, search-engine

Solution


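This is probably caused by your IDF function, which is usually computed as IDF = log((N - n + 0.5) / (n + 0.5)), where N is the number of documents in the corpus and n is the number of documents that contain the term. This formula yields a negative value for any term that appears in more than half of the documents in the corpus. Because this IDF value is multiplied by the rest of the BM25 formula, your overall BM25 score becomes negative as well.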
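To make the sign flip concrete, here is a minimal sketch in Python. It is not the asker's code, which was not posted. The function names, the corpus size N = 44, and the parameters k1 = 1.2 and b = 0.75 are assumptions chosen for illustration. The second IDF function shows one common workaround, adding 1 inside the logarithm (the form Lucene's BM25 uses), which keeps the value positive. Whether that variant suits your ranking is a judgment call.

```python
import math


def idf_classic(N, n):
    """IDF as quoted above; negative whenever n > N / 2."""
    return math.log((N - n + 0.5) / (n + 0.5))


def idf_nonnegative(N, n):
    """Variant with 1 added inside the log, so the result is always > 0."""
    return math.log(1.0 + (N - n + 0.5) / (n + 0.5))


def bm25_term(idf, tf, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Score contribution of one query term in one document.

    The term-frequency factor is never negative, so the sign of the
    result is decided entirely by the IDF passed in.
    k1 and b are commonly used defaults, assumed here.
    """
    norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1.0) / (tf + norm)


if __name__ == "__main__":
    N = 44  # hypothetical corpus size
    for n in (2, 22, 40):  # rare term, term in half the docs, very common term
        print(n, idf_classic(N, n), idf_nonnegative(N, n))

    # A very common term drags the classic score below zero.
    print(bm25_term(idf_classic(N, 40), tf=3, doc_len=120, avg_doc_len=100))
    print(bm25_term(idf_nonnegative(N, 40), tf=3, doc_len=120, avg_doc_len=100))
```

If every query term appears in most of the documents, for example because stop words were not removed or because the collection was pre-filtered to a single topic, every term contributes a negative value and all scores end up negative, as in the output above. Another option sometimes used is to clamp negative IDF values to zero or a small positive constant.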

