Imbalanced dataset, size limit of 60mb, email classification

Problem Description

I have a highly imbalanced dataset (roughly 1:100) of 1gb of raw emails that I must classify into 15 categories.

The problem I'm running into is that the file used for training the model cannot exceed 40mb.

So I want to filter out, for each category, the emails that best represent that whole category.

For example: category A has 100 emails in the dataset; because of the size limit I only want to keep 10 of them, chosen so that they capture as much as possible of the features of all 100.

I've read that tfidf can be used to do this: for every category, build a corpus out of all of that category's emails, and then try to find the most representative ones, but I'm not sure how to go about it. A code snippet would help a lot.

On top of that, the dataset contains a lot of junk words and hash values. Should I clean all of these out? I've already tried a lot of cleaning, and doing it by hand is difficult.

Tags: machine-learning, nlp, tfidfvectorizer

Solution


TF-IDF stands for term frequency–inverse document frequency. The idea is to work out which words are the most representative, based on how frequent they are in a document and how specific they are to it across the corpus.
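For intuition, here is the textbook formulation as a tiny sketch (my addition; note that sklearn's TfidfVectorizer uses a smoothed IDF and l2-normalizes each row, so its exact numbers differ slightly):

import math

## Textbook TF-IDF weight of one term in one document
def tfidf_weight(count_in_doc, doc_len, n_docs, n_docs_with_term):
    tf = count_in_doc / doc_len                 ## how frequent the term is in this document
    idf = math.log(n_docs / n_docs_with_term)   ## how rare the term is across the whole corpus
    return tf * idf                             ## large only if frequent here AND rare elsewhere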

The idea you propose is not bad at all, and it can work as a first, shallow approach. Here is a snippet to help you see how to do it:

from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

## Suppose docs1 and docs2 are the groups of e-mails. Notice that docs1 has more documents than docs2
docs1 = ['In digital imaging, a pixel, pel,[1] or picture element[2] is a physical point in a raster image, or the smallest addressable element in an all points addressable display device; so it is the smallest controllable element of a picture represented on the screen',
       'Each pixel is a sample of an original image; more samples typically provide more accurate representations of the original. The intensity of each pixel is variable. In color imaging systems, a color is typically represented by three or four component intensities such as red, green, and blue, or cyan, magenta, yellow, and black.',
        'In some contexts (such as descriptions of camera sensors), pixel refers to a single scalar element of a multi-component representation (called a photosite in the camera sensor context, although sensel is sometimes used),[3] while in yet other contexts it may refer to the set of component intensities for a spatial position.',
        'The word pixel is a portmanteau of pix (from "pictures", shortened to "pics") and el (for "element"); similar formations with \'el\' include the words voxel[4] and texel.[4]',
        'The word "pixel" was first published in 1965 by Frederic C. Billingsley of JPL, to describe the picture elements of video images from space probes to the Moon and Mars.[5] Billingsley had learned the word from Keith E. McFarland, at the Link Division of General Precision in Palo Alto, who in turn said he did not know where it originated. McFarland said simply it was "in use at the time" (circa 1963).[6]'
       ]

docs2 = ['In applied mathematics, discretization is the process of transferring continuous functions, models, variables, and equations into discrete counterparts. This process is usually carried out as a first step toward making them suitable for numerical evaluation and implementation on digital computers. Dichotomization is the special case of discretization in which the number of discrete classes is 2, which can approximate a continuous variable as a binary variable (creating a dichotomy for modeling purposes, as in binary classification).',
         'Discretization is also related to discrete mathematics, and is an important component of granular computing. In this context, discretization may also refer to modification of variable or category granularity, as when multiple discrete variables are aggregated or multiple discrete categories fused.',
         'Whenever continuous data is discretized, there is always some amount of discretization error. The goal is to reduce the amount to a level considered negligible for the modeling purposes at hand.',
         'The terms discretization and quantization often have the same denotation but not always identical connotations. (Specifically, the two terms share a semantic field.) The same is true of discretization error and quantization error.'
         ]

## We concatenate them to build one universal TF-IDF vocabulary, so that we can 'compare oranges to oranges'
docs3 = docs1+docs2

## Using sklearn's TfidfVectorizer - it is easy and straightforward!
vectorizer = TfidfVectorizer()

## Now we build the universal TF-IDF vocabulary. MAKE SURE TO USE THE MERGED LIST AND fit() [not fit_transform()]
## fit() returns the fitted vectorizer, so X below is the vectorizer itself
X = vectorizer.fit(docs3)

## Checking the array shapes after using transform (projecting each group onto the shared vocabulary)
## Notice that both have the same number of columns (vocabulary size) but a different number of rows
print(X.transform(docs1).toarray().shape, X.transform(docs2).toarray().shape)

(5, 221) (4, 221)

## Now, to "merge" each group into a single vector, there are many ways to do it - here I used a simple "mean" method.
transformed_docs1 = np.mean(X.transform(docs1).toarray(), axis=0)
transformed_docs2 = np.mean(X.transform(docs2).toarray(), axis=0)
print(transformed_docs1)
print(transformed_docs2)
[0.02284796 0.02284796 0.02805426 0.06425141 0.         0.03212571
 0.         0.03061173 0.02284796 0.         0.         0.04419432
 0.08623564 0.         0.         0.         0.03806573 0.0385955
 0.04569592 0.         0.02805426 0.02805426 0.         0.04299283
...
 0.         0.02284796 0.         0.05610853 0.02284796 0.03061173
 0.         0.02060219 0.         0.02284796 0.04345487 0.04569592
 0.         0.         0.02284796 0.         0.03061173 0.02284796
 0.04345487 0.07529817 0.04345487 0.02805426 0.03061173]
## These are the final shapes.
print(transformed_docs1.shape, transformed_docs2.shape)

(221,) (221,)
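This gives you one "mean" TF-IDF vector per class, but it does not yet pick the emails themselves. A simple extension (my own sketch, not part of the original answer) is to rank each email by cosine similarity to its class centroid and keep the top k. The helper below is hypothetical and assumes the fitted vectorizer X and the document lists from the snippet above:

from sklearn.metrics.pairwise import cosine_similarity

def top_k_representative(docs, fitted_vectorizer, k=10):
    tfidf = fitted_vectorizer.transform(docs)          ## (n_docs, n_terms), sparse
    centroid = np.asarray(tfidf.mean(axis=0))          ## same "mean" vector as above, shape (1, n_terms)
    sims = cosine_similarity(tfidf, centroid).ravel()  ## closeness of every email to the centroid
    top = np.argsort(sims)[::-1][:k]                   ## indices of the k most central emails
    return [docs[i] for i in top]

## e.g. keep the 2 most central documents of the first toy group
print(top_k_representative(docs1, X, k=2))

For your real use case you would call this once per category, with k sized so that the selected emails fit inside the 40mb budget. Note that centroid-based selection favors typical emails; if you also want to cover a class's variety, clustering each class first and sampling per cluster is a common alternative.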

Regarding the removal of junk words: TF-IDF averages rare words (numbers, hashes and so on) out - if a token is too rare, it simply doesn't matter much. But such tokens can greatly inflate the size of your input vectors, so I recommend finding a way to clean them out. Also consider NLP preprocessing steps such as lemmatization to reduce the dimensionality.
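As a concrete starting point for that cleaning, here is a minimal sketch (my addition, assuming English emails and that NLTK is installed) that discards hash-like tokens and stop words and lemmatizes the rest before vectorizing:

import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
## One-time setup: nltk.download('stopwords'); nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

def clean_email(text):
    ## Keep lowercase alphabetic runs only - this already discards numbers and hex/base64-style hashes
    tokens = re.findall(r'[a-z]+', text.lower())
    ## Drop stop words and implausibly long tokens (likely encoded junk), then lemmatize
    return ' '.join(lemmatizer.lemmatize(t) for t in tokens
                    if t not in stop_words and 2 < len(t) < 20)

If a separate cleaning pass is too heavy, TfidfVectorizer(stop_words='english', min_df=5) gets you part of the way: it removes English stop words and any token that appears in fewer than 5 documents, which catches most one-off hashes.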

