Cleaning and organizing Twitter data in Python

Problem description

I extracted the Twitter data like this:

import pandas as pd
import tweepy

# Authentication
consumerKey = ''
consumerSecret = ""
accessToken = ""
accessTokenSecret =''
auth = tweepy.OAuthHandler(consumerKey, consumerSecret)
auth.set_access_token(accessToken, accessTokenSecret)
api = tweepy.API(auth, wait_on_rate_limit=True, timeout=1000)

# Search for tweets matching the keyword (input for the sentiment analysis)
keyword = "small businesses kenya OR msme OR sme"
noOfTweet = 2000
tweets = tweepy.Cursor(api.search_tweets, q=keyword).items(noOfTweet)

tweet_list = []

for tweet in tweets:
    # print(tweet.text)
    tweet_list.append(tweet.text)

# Number of tweets collected; name the column so it can be referenced as 'tweet' later
tweet_list = pd.DataFrame(tweet_list, columns=['tweet'])
print("total number: ", len(tweet_list))

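The Twitter data comes back fine, but here is my main challenge: I want to clean this data and save it to a CSV for further analysis. I would like the CSV to have the following columns: 'tweeter_handle','timestamp','orig_tweet','likes','retweets','hashtags','mentions','location','tweet_text'. I tried to split the RT @ part into a separate column like this, but it does not work: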

import re

import emoji
import nltk

nltk.download('words')
words = set(nltk.corpus.words.words())

def cleaner(tweet):
    tweet = re.sub(r"@\w+", "", tweet)  # Remove @mentions (handles may contain underscores)
    tweet = re.sub(r"(?:https?://|www\.)\S+", "", tweet)  # Remove links
    tweet = " ".join(tweet.split())  # Collapse repeated whitespace
    tweet = emoji.replace_emoji(tweet, replace="")  # Remove emojis (requires emoji >= 2.0)
    tweet = tweet.replace("#", "").replace("_", " ")  # Remove hashtag sign but keep the text
    # Keep English dictionary words and non-alphabetic tokens
    tweet = " ".join(w for w in nltk.wordpunct_tokenize(tweet)
                     if w.lower() in words or not w.isalpha())
    return tweet

tweet_list['text'] = tweet_list['tweet'].map(cleaner)
tweet_list.to_csv('business_tweets.csv', index=False)

Any help with cleaning this data and giving it a basic structure would be greatly appreciated.

Tags: python, pandas, twitter, tweepy

Solution
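
One possible way to get the requested columns is to keep the whole tweet object in the loop instead of only tweet.text, and turn each one into a row. The sketch below is a minimal example under stated assumptions: tweet_to_row and COLUMNS are hypothetical names, the attribute names follow the Twitter API v1.1 tweet object that api.search_tweets returns, and mapping 'orig_tweet' to the raw text is a guess, because the question does not define that column. A SimpleNamespace with made-up values stands in for a real tweet so the example runs without credentials.

```python
from types import SimpleNamespace

# Column names requested in the question; 'tweet_text' holds the cleaned text.
COLUMNS = ['tweeter_handle', 'timestamp', 'orig_tweet', 'likes', 'retweets',
           'hashtags', 'mentions', 'location', 'tweet_text']


def tweet_to_row(tweet, clean=lambda text: text):
    """Map one tweet object to a dict keyed by COLUMNS (hypothetical helper)."""
    entities = getattr(tweet, 'entities', None) or {}
    return {
        'tweeter_handle': tweet.user.screen_name,
        'timestamp': tweet.created_at,
        'orig_tweet': tweet.text,  # assumption: the raw, uncleaned text
        'likes': tweet.favorite_count,
        'retweets': tweet.retweet_count,
        'hashtags': [h['text'] for h in entities.get('hashtags', [])],
        'mentions': [m['screen_name'] for m in entities.get('user_mentions', [])],
        'location': tweet.user.location,
        'tweet_text': clean(tweet.text),  # pass a cleaning function to change this
    }


# Stand-in for a tweepy tweet object; all values are invented for illustration.
fake_tweet = SimpleNamespace(
    user=SimpleNamespace(screen_name='example_user', location='Nairobi'),
    created_at='2022-01-01 10:00:00',
    text='RT @someone: Supporting #SME growth',
    favorite_count=3,
    retweet_count=5,
    entities={'hashtags': [{'text': 'SME'}],
              'user_mentions': [{'screen_name': 'someone'}]},
)

row = tweet_to_row(fake_tweet)
print(row['tweeter_handle'], row['hashtags'], row['mentions'])
```

With real data, the rows could be collected with rows = [tweet_to_row(t, cleaner) for t in tweets] and written out with pd.DataFrame(rows, columns=COLUMNS).to_csv('business_tweets.csv', index=False). Taking hashtags and mentions from the entities field avoids having to parse them out of the text.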

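For the RT @ split specifically, a regular expression can pull the retweeted handle off the front of the text. This is a sketch that assumes retweets start with the literal prefix "RT @handle:"; split_retweet and RT_PREFIX are hypothetical names, not part of tweepy.

```python
import re

# Matches a leading "RT @handle:" prefix; \w covers letters, digits, and underscores.
RT_PREFIX = re.compile(r'^RT @(\w+):\s*')


def split_retweet(text):
    """Return (retweeted_handle, remaining_text); the handle is None if there is no RT prefix."""
    match = RT_PREFIX.match(text)
    if match is None:
        return None, text
    return match.group(1), text[match.end():]


print(split_retweet('RT @someone: Supporting #SME growth'))
print(split_retweet('Plain tweet'))
```

On the DataFrame from the question, the handle could go into its own column with something like tweet_list['retweet_of'] = tweet_list['tweet'].map(lambda t: split_retweet(t)[0]), where 'retweet_of' is an illustrative column name.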
