python - Tweepy returns the same tweets when scraping data repeatedly
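Problem description
I am scraping tweet data from Twitter. Because Twitter rate-limits this, I scrape 2,500 tweets every 15 minutes. However, I have noticed that each run after the 15-minute wait returns the same tweets. Is there any way to skip the previously scraped tweets, for example with some kind of offset? Thank you!
Here is my code: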
# Import libraries
from tweepy import OAuthHandler
#from tweepy.streaming import StreamListener
import tweepy
import csv
import pandas as pd
#import re
#from textblob import TextBlob
#import string
#import preprocessor as p
#import os
import time
# Twitter credentials
consumer_key = ''
consumer_secret = ''
access_key = ''
access_secret = ''
# Pass your twitter credentials to tweepy via its OAuthHandler
auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_key, access_secret)
api = tweepy.API(auth)
def extract_tweets(search_words, date_since, numTweets):
    # Build a cursor over the search results, limited to numTweets items
    return tweepy.Cursor(api.search, q=search_words, lang="en", since=date_since, tweet_mode='extended').items(numTweets)

def scrapetweets(search_words, date_since, numTweets, numRuns):
    # Define a pandas dataframe to store the data:
    db_tweets = pd.DataFrame(columns = ['username', 'acctdesc', 'location', 'following', 'followers', 'totaltweets', 'usercreatedts', 'tweetcreatedts', 'retweetcount', 'text', 'hashtags'])
    #db_tweets = pd.DataFrame()
    for i in range(numRuns):
        tweets = extract_tweets(search_words, date_since, numTweets)
        # Store these tweets into a python list
        tweet_list = [tweet for tweet in tweets]
        print(len(tweet_list))
        noTweets = 0
        for tweet in tweet_list:
            username = tweet.user.screen_name
            acctdesc = tweet.user.description
            location = tweet.user.location
            following = tweet.user.friends_count
            followers = tweet.user.followers_count
            totaltweets = tweet.user.statuses_count
            usercreatedts = tweet.user.created_at
            tweetcreatedts = tweet.created_at
            retweetcount = tweet.retweet_count
            hashtags = tweet.entities['hashtags']
            # Collect the hashtag texts of this tweet
            lst = []
            for h in hashtags:
                lst.append(h['text'])
            try:
                text = tweet.retweeted_status.full_text
            except AttributeError:  # Not a Retweet
                text = tweet.full_text
            itweet = [username, acctdesc, location, following, followers, totaltweets, usercreatedts, tweetcreatedts, retweetcount, text, lst]
            db_tweets.loc[len(db_tweets)] = itweet
            noTweets += 1
            print(noTweets, itweet)
            #filename = "tweets.csv"
            #with open(filename, "a", newline='') as fp:
            #    wr = csv.writer(fp, dialect='excel')
            #    wr.writerow(itweet)
        print('no. of tweets scraped for run {} is {}'.format(i + 1, noTweets))
        # Wait out the rate-limit window before every run except the last
        if i + 1 != numRuns:
            time.sleep(920)
    filename = "tweets.csv"
    # Append the collected tweets to the csv file
    db_tweets.to_csv(filename, mode='a', index = False)
# Initialise these variables:
search_words = "#India OR #COVID-19"
date_since = "2020-04-29"
#date_until = "2020-05-01"
numTweets = 2500
numRuns = 10
# Call the function scrapetweets
program_start = time.time()
scrapetweets(search_words, date_since, numTweets, numRuns)
program_end = time.time()
print('Scraping has completed!')
print('Total time taken to scrape is {} minutes.'.format(round((program_end - program_start) / 60, 2)))
I followed a blog post on Medium for this.
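Solution
You can add a variable that acts as a validator and store it in a file, for example tweetid.txt.
Each time the script runs, open tweetid.txt.
If a tweet's ID matches an ID already in the file, skip it.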
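The steps above could be sketched roughly as follows. This is only a minimal illustration, not the answerer's actual code: the helper names load_seen_ids, keep_new_tweets and save_seen_ids are hypothetical, and the sketch assumes only that each tweet object exposes a numeric id attribute (as Tweepy status objects do) and that tweetid.txt holds one ID per line.

```python
import os


def load_seen_ids(path):
    """Return the set of tweet IDs stored in `path` (one per line).

    Hypothetical helper: returns an empty set if the file does not exist yet.
    """
    if not os.path.exists(path):
        return set()
    with open(path, "r", encoding="utf-8") as fp:
        return {int(line) for line in fp if line.strip()}


def keep_new_tweets(tweets, seen_ids):
    """Return only the tweets whose id is not in `seen_ids`.

    Each newly seen id is added to `seen_ids`, so duplicates within the
    same batch are skipped as well.
    """
    fresh = []
    for tweet in tweets:
        if tweet.id in seen_ids:
            continue  # already scraped in an earlier run: skip it
        seen_ids.add(tweet.id)
        fresh.append(tweet)
    return fresh


def save_seen_ids(path, seen_ids):
    """Write all known tweet IDs back to `path`, one per line."""
    with open(path, "w", encoding="utf-8") as fp:
        for tweet_id in sorted(seen_ids):
            fp.write(f"{tweet_id}\n")
```

In the question's scrapetweets loop, this would presumably mean calling load_seen_ids("tweetid.txt") once at the start, passing tweet_list through keep_new_tweets before the rows are built, and calling save_seen_ids after each run. Note that this only filters out duplicates after they have been downloaded; the repeated tweets still count against the rate limit.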