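python - Implementing n-grams in Python code for multi-class text classification
Problem description
I am new to Python and am working on multi-class text classification of contract documents from the construction industry. I am having trouble implementing n-grams in my code, which I put together with help from various online resources. I want to use unigrams, bigrams, and trigrams in my code. Any help in this regard would be greatly appreciated.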
I tried bigrams and trigrams in the Tfidf part of my code, but it is not working.
import pandas as pd
from nltk import tokenize
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

df = pd.read_csv('projectdataayes.csv')
df = df[pd.notnull(df['types'])]
my_types = ['Requirement','Non-Requirement']
# converting to lower case
df['description'] = df.description.map(lambda x: x.lower())
# removing the punctuation (regex=True so the pattern is treated as a regular expression)
df['description'] = df.description.str.replace(r'[^\w\s]', '', regex=True)
# splitting the text into tokens
df['description'] = df['description'].apply(tokenize.word_tokenize)
# stemming
stemmer = PorterStemmer()
df['description'] = df['description'].apply(lambda x: [stemmer.stem(y) for y in x])
print(df[:10])
# this converts the list of words into space-separated strings
df['description'] = df['description'].apply(lambda x: ' '.join(x))
count_vect = CountVectorizer()
counts = count_vect.fit_transform(df['description'])
X_train, X_test, y_train, y_test = train_test_split(counts, df['types'], test_size=0.3, random_state=39)
tfidf_vect_ngram = TfidfVectorizer(analyzer='word',
                                   token_pattern=r'\w{1,}', ngram_range=(2,3), max_features=5000)
tfidf_vect_ngram.fit(df['description'])
X_train_Tfidf = tfidf_vect_ngram.transform(X_train)
X_test_Tfidf = tfidf_vect_ngram.transform(X_test)
model = MultinomialNB().fit(X_train, y_train)
  File "C:\Users\fhassan\anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 328
    tokenize(preprocess(self.decode(doc))), stop_words)
  File "C:\Users\fhassan\anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 256
    return lambda x: strip_accents(x.lower())
  File "C:\Users\fhassan\anaconda3\lib\site-packages\scipy\sparse\base.py", line 686, in __getattr__
    raise AttributeError(attr + " not found")
AttributeError: lower not found
Solution
First, you fit the vectorizer on the text:
tfidf_vect_ngram.fit(df['description'])
and then you try to apply it to the counts:
counts = count_vect.fit_transform(df['description'])
X_train, X_test, y_train, y_test = train_test_split(counts, df['types'], test_size=0.3, random_state=39)
tfidf_vect_ngram.transform(X_train)
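To see why this fails, it can help to check what `counts` actually is. The sketch below uses two made-up sentences in place of `df['description']` (an assumption, not the asker's data). It shows that the output of `CountVectorizer` is a sparse matrix whose rows have no `.lower()` method, which is the method the traceback shows the vectorizer calling on each document.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-in for df['description']
texts = [
    "the contractor shall submit the schedule",
    "this section is for information only",
]

counts = CountVectorizer().fit_transform(texts)

# counts is a sparse document-term matrix of integers, not a collection of strings
print(issparse(counts))
print(counts.shape)  # (number of documents, number of distinct tokens)

# A text vectorizer lowercases each document it receives.
# Strings have a .lower() method; rows of a sparse matrix do not.
print(hasattr(texts[0], "lower"))
print(hasattr(counts[0], "lower"))
```

Splitting `counts` with `train_test_split` therefore yields sparse matrices, and passing those to a second text vectorizer triggers the `AttributeError`.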
You need to apply the vectorizer to the text, not to the counts:
X_train, X_test, y_train, y_test = train_test_split(df['description'], df['types'], test_size=0.3, random_state=39)
tfidf_vect_ngram.transform(X_train)
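Putting it together, a minimal end-to-end sketch might look like the following. The sentences and labels are invented toy data standing in for `df['description']` and `df['types']`. Two choices here are assumptions rather than part of the original code: `ngram_range=(1, 3)` is used so that unigrams, bigrams, and trigrams are all included (the question's `(2, 3)` would skip unigrams), and the vectorizer is fitted on the training split only. The classifier is also trained on the TF-IDF features rather than on the raw split.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data standing in for df['description'] and df['types']
texts = [
    "the contractor shall submit the schedule",
    "the contractor shall provide insurance",
    "the owner shall pay the contractor",
    "the engineer shall approve all drawings",
    "this section is for information only",
    "the project is located downtown",
    "the table lists the contact persons",
    "this clause gives background information",
]
labels = ["Requirement"] * 4 + ["Non-Requirement"] * 4

# Split the raw text, not a count matrix
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=39, stratify=labels
)

# ngram_range=(1, 3) covers unigrams, bigrams and trigrams
tfidf_vect_ngram = TfidfVectorizer(
    analyzer="word", token_pattern=r"\w{1,}", ngram_range=(1, 3), max_features=5000
)

# Learn the n-gram vocabulary from the training text, then reuse it for the test text
X_train_tfidf = tfidf_vect_ngram.fit_transform(X_train)
X_test_tfidf = tfidf_vect_ngram.transform(X_test)

# Train and predict on the TF-IDF features
model = MultinomialNB().fit(X_train_tfidf, y_train)
predictions = model.predict(X_test_tfidf)
print(predictions)

# Inspect which n-grams the vectorizer extracts from a piece of text
analyzer = tfidf_vect_ngram.build_analyzer()
print(analyzer("the contractor shall"))
```

The last line prints the unigrams first, then the bigrams, then the trigram, which is a quick way to confirm that `ngram_range` is doing what you expect.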