Training a Word2Vec model from sourced data - issue tokenizing the data

Problem description

I recently pulled and cleaned up a large amount of Reddit data from Google BigQuery.

The dataset looks like this:

(screenshot: data preview)

Before passing this data to word2vec to build a vocabulary and train, I need to properly tokenize the 'body_cleaned' column.

I have tried tokenizing with both a hand-written function and NLTK's word_tokenize, but for now I am focusing on word_tokenize.

Because my dataset is fairly large (close to 12 million rows), I cannot open it all at once and apply functions to it. Pandas tries to load everything into RAM and, as you can imagine, it crashes, even on a system with 24 GB of memory.

This is the issue I am facing:

To work around it, I created a very small subset of the data and tried tokenizing it in two different ways:

reddit_subset = reddit_data[:50]

reddit_subset['tokens'] = reddit_subset['body_cleaned'].apply(lambda x: word_tokenize(x))

This produces the following result: (screenshot: tokenized data preview)

This actually works with word2vec and produces a model I can use. Great so far.

Since I cannot operate on such a huge dataset in one go, I had to get creative with how I process it. My solution was to batch the dataset and process it in small iterations using pandas' own chunksize parameter.

I wrote the following function to achieve this:

# imports used below
import time
import pandas as pd
from nltk.tokenize import word_tokenize

def reddit_data_cleaning(filepath, batchsize=20000):
    if batchsize:
        df = pd.read_csv(filepath, encoding='utf-8', error_bad_lines=False, chunksize=batchsize, iterator=True, lineterminator='\n')
    print("Beginning the data cleaning process!")
    start_time = time.time()
    flag = 1
    chunk_num = 1
    for chunk in df:
        chunk[u'tokens'] = chunk[u'body_cleaned'].apply(lambda x: word_tokenize(x))
        chunk_num += 1
        if flag == 1:
            chunk.dropna(how='any')
            chunk = chunk[chunk['body_cleaned'] != 'deleted']
            chunk = chunk[chunk['body_cleaned'] != 'removed']
            print("Beginning writing a new file")
            chunk.to_csv(str(filepath[:-4] + '_tokenized.csv'), mode='w+', index=None, header=True)
            flag = 0
        else:
            chunk.dropna(how='any')
            chunk = chunk[chunk['body_cleaned'] != 'deleted']
            chunk = chunk[chunk['body_cleaned'] != 'removed']
            print("Adding a chunk into an already existing file")
            chunk.to_csv(str(filepath[:-4] + '_tokenized.csv'), mode='a', index=None, header=None)
    end_time = time.time()
    print("Processing has been completed in: ", (end_time - start_time), " seconds.")

Even though this code lets me process this huge dataset in chunks and produce results where I would otherwise crash with memory failures, the results I get do not meet my word2vec requirements, and I am confused as to why.

I ran the function above on the same small data subset, to compare how the results of the two approaches differ, and got the following:

(screenshot: comparison of the two approaches)

The desired result is in the 'new_tokens' column, while the function that chunks the dataframe produces its result in the 'tokens' column.

Could someone smarter than me help me understand why the same tokenization function produces completely different results depending on how I iterate over the dataframe?

I would really appreciate it if you read through this whole question and stuck with it!

Tags: python, pandas, tokenize, word2vec

Solution


First & foremost, beyond a certain size of data, & especially when working with raw text or tokenized text, you probably don't want to be using Pandas dataframes for every interim result.

They add extra overhead & complication that isn't fully 'Pythonic'. This is particularly the case for:

  • Python list objects where each word is a separate string: once you've tokenized raw strings into this format, as for example to feed such texts to Gensim's Word2Vec model, trying to put those into Pandas just leads to confusing list-representation issues (as with your columns where the same text might be shown as either ['yessir', 'shit', 'is', 'real'] – which is a true Python list literal – or [yessir, shit, is, real] – which is some other mess likely to break if any tokens have challenging characters). A short sketch of this round-trip problem follows this list.
  • the raw word-vectors (or later, text-vectors): these are more compact & natural/efficient to work with in raw Numpy arrays than Dataframes
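
To make the first bullet's point concrete, here is a minimal sketch (not code from the question) of how a genuine Python list of tokens becomes a plain string once it is round-tripped through a CSV file - exactly the kind of representation mismatch described above. The 'body_cleaned' and 'tokens' column names follow the question; the sample text and the use of str.split in place of a real tokenizer are just for illustration.

import pandas as pd
from io import StringIO

# a tiny frame holding a real Python list of tokens per row
df = pd.DataFrame({'body_cleaned': ['yessir shit is real']})
df['tokens'] = df['body_cleaned'].apply(str.split)
print(type(df['tokens'][0]))        # <class 'list'>

# write to CSV and read back - this is where the list-ness is lost
buf = StringIO()
df.to_csv(buf, index=False)
buf.seek(0)
df_back = pd.read_csv(buf)
print(type(df_back['tokens'][0]))   # <class 'str'> - it only looks like a list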

So, by all means, if Pandas helps for loading or other non-text fields, use it there. But then use more fundamental Python or Numpy datatypes for tokenized text & vectors - perhaps using some field (like a unique ID) in your Dataframe to correlate the two.
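
As a sketch of that split (the 'id' column and the toy rows below are hypothetical, not from the question), you could keep the tabular fields in Pandas and hold the tokenized texts in an ordinary dict keyed by that ID:

import pandas as pd
from nltk.tokenize import word_tokenize   # assumes NLTK's tokenizer data (punkt) is downloaded

# tabular metadata stays in a DataFrame
reddit_data = pd.DataFrame({
    'id': ['a1', 'a2'],                                    # hypothetical unique ID
    'body_cleaned': ['yessir shit is real', 'another comment entirely'],
})

# tokenized texts live in plain Python structures, correlated by 'id'
tokens_by_id = {
    row.id: word_tokenize(row.body_cleaned)
    for row in reddit_data.itertuples(index=False)
}

print(tokens_by_id['a1'])   # ['yessir', 'shit', 'is', 'real']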

Especially for large text corpuses, it's more typical to get away from CSV and instead use large text files, with one text per newline-separated line, and each line pre-tokenized so that spaces can be fully trusted as token separators.

That is: even if your initial text data has more complicated punctuation-sensitive tokenization, or other preprocessing that combines/changes/splits other tokens, try to do that just once (especially if it involves costly regexes), writing the results to a single simple text file which then fits the simple rules: read one text per line, split each line only by spaces.
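
A rough sketch of that one-time conversion, assuming the CSV and the 'body_cleaned' column from the question (the file names, chunk size and the 'deleted'/'removed' filtering below are illustrative, not a definitive implementation):

import pandas as pd
from nltk.tokenize import word_tokenize   # assumes NLTK's tokenizer data is downloaded

def csv_to_corpus_file(csv_path, corpus_path, chunksize=20000):
    """Tokenize once, writing one space-separated text per line."""
    with open(corpus_path, 'w', encoding='utf-8') as out:
        for chunk in pd.read_csv(csv_path, encoding='utf-8', chunksize=chunksize):
            chunk = chunk.dropna(subset=['body_cleaned'])
            chunk = chunk[~chunk['body_cleaned'].isin(['deleted', 'removed'])]
            for text in chunk['body_cleaned']:
                out.write(' '.join(word_tokenize(text)) + '\n')

csv_to_corpus_file('reddit_data.csv', 'reddit_corpus.txt')

After this step the costly tokenization never has to be repeated: everything downstream just reads the file line by line and splits on spaces.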

Lots of algorithms, like Gensim's Word2Vec or FastText, can stream such files either directly or via very low-overhead iterable-wrappers - so the text is never completely in memory, only read as needed, repeatedly, for multiple training iterations.
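
For example, Gensim's LineSentence is a small iterable wrapper for exactly this "one space-separated text per line" format; a minimal training sketch (Gensim 4.x argument names, illustrative parameters and file names) might look like:

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# LineSentence lazily re-reads the file on every training pass,
# so the full corpus never needs to fit in RAM
sentences = LineSentence('reddit_corpus.txt')
model = Word2Vec(sentences=sentences, vector_size=100, window=5, min_count=5, workers=4)
model.save('reddit_word2vec.model')

Word2Vec can also be pointed directly at such a file via its corpus_file argument, which avoids even the Python iteration overhead.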

For more details on this efficient way to work with large bodies of text, see this article: https://rare-technologies.com/data-streaming-in-python-generators-iterators-iterables/

