Python: NLTK - regular expression tokenizer produces empty output

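Problem description

I am trying to tokenize the sample text from the NLTK book (using Python 2.7), but the output is not what I expected. Is there something I am missing?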

import nltk

text = 'That U.S.A. poster-print costs $12.40...'

pattern = r'''(?x)     # set flag to allow verbose regexps
   ([A-Z]\.)+          # abbreviations, e.g. U.S.A.
   | \w+(-\w+)*        # words with optional internal hyphens
   | \$?\d+(\.\d+)?%?  # currency and percentages, e.g. $12.40, 82%
   | \.\.\.            # ellipsis
   | [][.,;"'?():-_`]  # these are separate tokens; includes ], [
   '''

nltk.regexp_tokenize(text, pattern)


Output: 
 [('', '', ''),
 ('A.', '', ''),
 ('', '-print', ''),
 ('', '', ''),
 ('', '', '.40'),
 ('', '', '')]

Expected:
['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']

Tags: python, nlp, nltk

Solution
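
A likely cause, assuming an NLTK 3.x release in which regexp_tokenize passes the pattern to re.findall: when a pattern contains capturing groups, findall returns a tuple of the groups' contents for each match instead of the whole match. The sketch below reproduces that behaviour with the standard re module alone (it does not call NLTK), then rewrites each (...) as a non-capturing (?:...) group. The variable names are illustrative and not part of the original question.

```python
import re

text = 'That U.S.A. poster-print costs $12.40...'

# Pattern as posted in the question: every (...) is a capturing group.
capturing = r'''(?x)     # set flag to allow verbose regexps
    ([A-Z]\.)+          # abbreviations, e.g. U.S.A.
  | \w+(-\w+)*          # words with optional internal hyphens
  | \$?\d+(\.\d+)?%?    # currency and percentages, e.g. $12.40, 82%
  | \.\.\.              # ellipsis
  | [][.,;"'?():-_`]    # these are separate tokens; includes ], [
'''

# Same alternatives, but each group is non-capturing: (?:...)
non_capturing = r'''(?x)
    (?:[A-Z]\.)+
  | \w+(?:-\w+)*
  | \$?\d+(?:\.\d+)?%?
  | \.\.\.
  | [][.,;"'?():-_`]
'''

# With capturing groups, findall yields one tuple of group contents per match.
grouped = re.findall(capturing, text)
print(grouped)

# Without capturing groups, findall yields the full matched strings.
tokens = re.findall(non_capturing, text)
print(tokens)
```

Under the same assumption, passing the non-capturing pattern to nltk.regexp_tokenize(text, pattern) should give the expected token list as well.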

