Building a corpus from a Wikipedia database dump

Problem description

First, the file I obtained and used for this task is enwiki-latest-pages-articles.xml.bz2 (downloaded from here).

I then placed this unzipped file in the same directory as my .py file, which I named WikiCorpus.py. Below is the Python code, taken from Building a Wikipedia Text Corpus for Natural Language Processing:

# -*- coding: utf-8 -*-
"""
Created on Sun May  3 11:33:46 2020

@author: Standard
"""

"""
Creates a corpus from Wikipedia dump file.
Inspired by:
https://github.com/panyang/Wikipedia_Word2vec/blob/master/v1/process_wiki.py
"""

import sys
from gensim.corpora import WikiCorpus

def make_corpus(in_f, out_f):

    """Convert Wikipedia xml dump file to text corpus"""

    output = open(out_f, 'w')
    wiki = WikiCorpus(in_f)

    i = 0
    for text in wiki.get_texts():
        output.write(bytes(' '.join(text), 'utf-8').decode('utf-8') + '\n')
        i = i + 1
        if (i % 10000 == 0):
            print('Processed ' + str(i) + ' articles')
    output.close()
    print('Processing complete!')


if __name__ == '__main__':

    if len(sys.argv) != 3:
        print('Usage: python make_wiki_corpus.py <wikipedia_dump_file> <processed_text_file>')
        sys.exit(1)
    in_f = sys.argv[1]
    out_f = sys.argv[2]
    make_corpus(in_f, out_f)

But when I run it, I get the following error:

Usage: python make_wiki_corpus.py <wikipedia_dump_file> <processed_text_file>
An exception has occurred, use %tb to see the full traceback.

SystemExit: 1

Tags: python, wikipedia

Solution
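
The script is behaving exactly as written. When it is started without the two required command-line arguments, len(sys.argv) != 3 is true, so it prints the usage message and calls sys.exit(1). The line "An exception has occurred, use %tb to see the full traceback." is simply how an IPython console (such as the one built into Spyder) reports that SystemExit. Nothing is wrong with the corpus-building code itself; it was run without arguments.

The direct fix is to run the script from a regular terminal and pass both file names, for example (wiki_corpus.txt is just a placeholder output name):

python WikiCorpus.py enwiki-latest-pages-articles.xml.bz2 wiki_corpus.txt

If you would rather keep running it inside Spyder, you can bypass sys.argv and call make_corpus directly. A minimal sketch, assuming the dump file sits in the same directory as the script; both file names below are placeholders, substitute your own paths:

# run_corpus.py -- hypothetical driver for use inside Spyder/IPython,
# where sys.argv does not carry your command-line arguments.
from WikiCorpus import make_corpus  # the script above, saved as WikiCorpus.py

in_f = 'enwiki-latest-pages-articles.xml.bz2'  # assumed: dump next to this script
out_f = 'wiki_corpus.txt'                      # placeholder output file name

make_corpus(in_f, out_f)

Two practical notes: consider opening the output file with an explicit encoding, open(out_f, 'w', encoding='utf-8'), since on Windows the default codec can raise a UnicodeEncodeError partway through the dump; and expect a long run, as the script only prints a progress line every 10,000 articles and the full English dump contains millions of them.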

