
Problem description

Hey, I'm an absolute beginner in Python (a linguist by training) and I can't figure out how to get the Twitter data I scraped with Twint (stored in a csv file) into a pandas DataFrame so that I can code nltk frequency distributions. Actually, I'm not even sure whether it matters that I created a test file and a training file, as I did (see the code below). I know this is a very basic question, but some help would be great! Thank you.

Here is what I have so far:

import pandas as pd
data = pd.read_csv("test_newtest90.csv") 
data = pd.read_csv("train_newtest90.csv")

import re
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 
import seaborn as sns
import string
import nltk
import warnings 
warnings.filterwarnings("ignore", category=DeprecationWarning)

%matplotlib inline

train  = pd.read_csv("train_newtest90.csv")
test = pd.read_csv("test_newtest90.csv")
combi = train.append(test, ignore_index=True)

If I check:

combi["tidy_tweet"].dtypes

I get:

dtype('O')

Tags: python, pandas, nltk

Solution


You don't need to split the csv into a training set and a test set. That is only necessary when you want to train a model, which is not the case here. So just load the original, unsplit csv file:

import pandas as pd
df = pd.read_csv("filename.csv")

The next step is to clean the tweets to remove hashtags, urls, and so on:

import re

# use regular expressions to clean tweets
def cleaningTweets(twt):
    twt = re.sub(r'@[A-Za-z0-9_]+', '', twt)  # remove @mentions
    twt = re.sub(r'#', '', twt)               # strip the # symbol, keep the tag text
    twt = re.sub(r'https?://\S+', '', twt)    # remove urls
    return twt

# apply previous function to the current df, assuming the relevant column name is "tweets"
df.tweets = df.tweets.apply(cleaningTweets)
# make all words lowercase
df.tweets = df.tweets.str.lower()
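As a quick sanity check on a made-up sample tweet (the helper is repeated here so the snippet runs on its own; the mention pattern also allows underscores, which are common in Twitter handles):

```python
import re

def cleaningTweets(twt):
    twt = re.sub(r'@[A-Za-z0-9_]+', '', twt)  # remove @mentions
    twt = re.sub(r'#', '', twt)               # strip the # symbol, keep the tag text
    twt = re.sub(r'https?://\S+', '', twt)    # remove urls
    return twt

sample = "Loving #NLTK! Thanks @nltk_org https://www.nltk.org"
print(cleaningTweets(sample).lower().strip())
# → loving nltk! thanks
```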

Now you can start doing the fun stuff, such as word frequency counts. But before that, it is recommended to first remove stopwords and punctuation:

import nltk
import string 

# load a list of stopwords from nltk to clean the tweets from stop words
nltk.download('stopwords')
stopwords = nltk.corpus.stopwords.words('english')
#remove stopwords and punctuation
df['no_stopwords'] = df.tweets.apply(lambda x: ' '.join(w for w in x.split() if w not in stopwords and w not in string.punctuation))
# count word frequency
df_word_freq = df.no_stopwords.str.split(expand=True).stack().value_counts()
# save the top 50 to csv
df_word_freq.head(50).to_csv('word_count.csv')
#df_word_freq.to_csv('word_count.csv') for saving the entire df
# create bar chart of top 50
df_word_freq.head(50).plot.bar()
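Since the question specifically asks about nltk frequency distributions: the same word counts can also be produced with nltk.FreqDist, which behaves like a Counter. A minimal sketch on a made-up list of cleaned tweets (in practice you would tokenize df.no_stopwords instead):

```python
import nltk

# made-up cleaned tweets; in practice tokenize df.no_stopwords instead
tweets = ["great day for linguistics", "linguistics twitter data", "great data"]
tokens = " ".join(tweets).split()

# build the frequency distribution and inspect the most common words
freq = nltk.FreqDist(tokens)
print(freq.most_common(3))
# → [('great', 2), ('linguistics', 2), ('data', 2)]
```

freq.plot(20) would additionally draw the classic nltk cumulative frequency plot (requires matplotlib).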

Or you can make a word cloud:

from wordcloud import WordCloud

# first merge all tweets into one string
whole_words = " ".join(df.tweets)
# feed the string to WordCloud
word_cloud = WordCloud(width = 700, height = 500, random_state = 1, min_font_size = 10, stopwords = stopwords).generate(whole_words)
# save wordcloud as png file
word_cloud.to_file('wordcloud.png')

I hope this gets you started!

