python - Scraping the Twitter API for specific tweets
Problem description
I am trying to search Twitter for specific keywords, which I have put in an array:
keywords = ["art", "railway", "neck"]
I am trying to search for these words around a specific location, which I have written as:
PLACE_LAT = 29.7604
PLACE_LON = -95.3698
PLACE_RAD = 200
I then tried to apply a function that finds at least 200 tweets, although I know each query can return at most 100. My code so far is below; however, it does not work.
def retrieve_tweets(api, keyword, batch_count, total_count, latitude, longitude, radius):
    """
    collects tweets using the Twitter search API
    api: Twitter API instance
    keyword: search keyword
    batch_count: maximum number of tweets to collect per each request
    total_count: maximum number of tweets in total
    """
    # the collection of tweets to be returned
    tweets_unfiltered = []
    tweets = []
    # the number of tweets within a single query
    batch_count = str(batch_count)
    '''
    You are required to insert your own code where instructed to perform the first query to Twitter API.
    Hint: revise the practical session on Twitter API on how to perform query to Twitter API.
    '''
    # per the first query, to obtain max_id_str which will be used later to query sub
    resp = api.request('search/tweets', {'q': keywords,
                                         'count': '100',
                                         'lang':'en',
                                         'result_type':'recent',
                                         'geocode':'{PLACE_LAT},{PLACE_LONG},{PLACE_RAD}mi'.format(latitude, longitude, radius)})
    # store the tweets in a list
    # check first if there was an error
    if ('errors' in resp.json()):
        errors = resp.json()['errors']
        if (errors[0]['code'] == 88):
            print('Too many attempts to load tweets.')
            print('You need to wait for a few minutes before accessing Twitter API again.')
    if ('statuses' in resp.json()):
        tweets_unfiltered += resp.json()['statuses']
        tweets = [tweet for tweet in tweets_unfiltered if ((tweet['retweeted'] != True) and ('RT @' not in tweet['text']))]
        # find the max_id_str for the next batch
        ids = [tweet['id'] for tweet in tweets_unfiltered]
        max_id_str = str(min(ids))
    # loop until as many tweets as total_count is collected
    number_of_tweets = len(tweets)
    while number_of_tweets < total_count:
        resp = api.request('search/tweets', {'q': keywords,
                                             'count': '50',
                                             'lang':'en',
                                             'result_type': 'recent',
                                             'max_id': max_id_str,
                                             'geocode':'{PLACE_LAT},{PLACE_LONG},{PLACE_RAD}mi'.format(latitude, longitude, radius)}
                           )
        if ('statuses' in resp.json()):
            tweets_unfiltered += resp.json()['statuses']
            tweets = [tweet for tweet in tweets_unfiltered if ((tweet['retweeted'] != True) and ('RT @' not in tweet['text']))]
            ids = [tweet['id'] for tweet in tweets_unfiltered]
            max_id_str = str(min(ids))
            number_of_tweets = len(tweets)
    print("{} tweets are collected for keyword {}. Last tweet created at {}".format(number_of_tweets,
                                                                                    keyword,
                                                                                    tweets[number_of_tweets-1]['created_at']))
    return tweets
I only need to write code where it says #INSERT YOUR CODE. What changes do I need to make to get it working? This is the template I was given:
def retrieve_tweets(api, keyword, batch_count, total_count, latitude, longitude, radius):
    """
    collects tweets using the Twitter search API
    api: Twitter API instance
    keyword: search keyword
    batch_count: maximum number of tweets to collect per each request
    total_count: maximum number of tweets in total
    """
    # the collection of tweets to be returned
    tweets_unfiltered = []
    tweets = []
    # the number of tweets within a single query
    batch_count = str(batch_count)
    '''
    You are required to insert your own code where instructed to perform the first query to Twitter API.
    Hint: revise the practical session on Twitter API on how to perform query to Twitter API.
    '''
    # per the first query, to obtain max_id_str which will be used later to query sub
    resp = api.request('search/tweets', {'q': #INSERT YOUR CODE
                                         'count': #INSERT YOUR CODE
                                         'lang':'en',
                                         'result_type':'recent',
                                         'geocode':'{},{},{}mi'.format(latitude, longitude, radius)})
    # store the tweets in a list
    # check first if there was an error
    if ('errors' in resp.json()):
        errors = resp.json()['errors']
        if (errors[0]['code'] == 88):
            print('Too many attempts to load tweets.')
            print('You need to wait for a few minutes before accessing Twitter API again.')
    if ('statuses' in resp.json()):
        tweets_unfiltered += resp.json()['statuses']
        tweets = [tweet for tweet in tweets_unfiltered if ((tweet['retweeted'] != True) and ('RT @' not in tweet['text']))]
        # find the max_id_str for the next batch
        ids = [tweet['id'] for tweet in tweets_unfiltered]
        max_id_str = str(min(ids))
    # loop until as many tweets as total_count is collected
    number_of_tweets = len(tweets)
    while number_of_tweets < total_count:
        resp = api.request('search/tweets', {'q': #INSERT YOUR CODE
                                             'count': #INSERT YOUR CODE
                                             'lang':'en',
                                             'result_type': #INSERT YOUR CODE
                                             'max_id': max_id_str,
                                             'geocode': #INSERT YOUR CODE
                           )
        if ('statuses' in resp.json()):
            tweets_unfiltered += resp.json()['statuses']
            tweets = [tweet for tweet in tweets_unfiltered if ((tweet['retweeted'] != True) and ('RT @' not in tweet['text']))]
            ids = [tweet['id'] for tweet in tweets_unfiltered]
            max_id_str = str(min(ids))
            number_of_tweets = len(tweets)
    print("{} tweets are collected for keyword {}. Last tweet created at {}".format(number_of_tweets,
                                                                                    keyword,
                                                                                    tweets[number_of_tweets-1]['created_at']))
    return tweets
Solution
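What is your question or problem? I don't see it in your post.
Some suggestions... Remove the lang and result_type parameters from your request. Because you are using geocode, you should not expect many results, since hardly anyone turns location on when they tweet.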
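To make the suggestion concrete, here is a minimal sketch of the request parameters with lang and result_type dropped. The helper name build_search_params is hypothetical, not part of the TwitterAPI library. It also shows one likely reason the code in the question fails: a format string with named fields such as {PLACE_LAT} raises KeyError when it is given positional arguments, so the sketch uses empty {} fields instead. Note too that the question passes the whole keywords list as q, where the function's own keyword argument is probably what was intended.

```python
def build_search_params(keyword, count, latitude, longitude, radius_miles):
    """Build the parameter dict for one 'search/tweets' request.

    Hypothetical helper for illustration; lang and result_type are omitted
    as suggested above.
    """
    return {
        'q': keyword,  # a single keyword string, not the whole list
        'count': str(count),
        # Empty {} fields are filled positionally: "lat,lon,radiusmi"
        'geocode': '{},{},{}mi'.format(latitude, longitude, radius_miles),
    }


# Named fields need keyword arguments; passing positional ones raises KeyError.
try:
    '{PLACE_LAT},{PLACE_LON}'.format(29.7604, -95.3698)
    named_fields_failed = False
except KeyError:
    named_fields_failed = True

params = build_search_params('art', 100, 29.7604, -95.3698, 200)
print(params['geocode'])  # 29.7604,-95.3698,200mi
```

The resulting dict can be passed as the second argument to api.request('search/tweets', ...) in place of the inline dict in the question.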
Also, you may want to look at the TwitterPager class, which handles this for you, rather than using the max_id parameter. Here is an example: https://github.com/geduldig/TwitterAPI/blob/master/examples/page_tweets.py.
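As a rough sketch of how that could look for this question, the code below lets TwitterPager do the paging and keeps only non-retweets until enough are collected. The function names is_original, take_originals and collect_with_pager are hypothetical; only TwitterPager and get_iterator come from the TwitterAPI library, used the same way as in the linked example. The filtering is kept separate from the network call so it can be tried on plain dicts.

```python
def is_original(tweet):
    """Same retweet filter as in the question's list comprehension."""
    return tweet.get('retweeted') is not True and 'RT @' not in tweet.get('text', '')


def take_originals(items, total_count):
    """Collect up to total_count non-retweets from any iterable of items."""
    tweets = []
    for item in items:
        # The pager can also yield non-tweet messages; skip anything without text.
        if 'text' in item and is_original(item):
            tweets.append(item)
            if len(tweets) >= total_count:
                break
    return tweets


def collect_with_pager(api, keyword, total_count, latitude, longitude, radius_miles):
    """Page through 'search/tweets' with TwitterPager instead of tracking max_id."""
    # Imported here so the helpers above load without the TwitterAPI package.
    from TwitterAPI import TwitterPager

    params = {
        'q': keyword,
        'count': '100',
        'geocode': '{},{},{}mi'.format(latitude, longitude, radius_miles),
    }
    pager = TwitterPager(api, 'search/tweets', params)
    return take_originals(pager.get_iterator(), total_count)
```

With an authenticated api instance, something like collect_with_pager(api, 'art', 200, 29.7604, -95.3698, 200) would be called once per keyword. Given the geocode restriction, it may well return fewer than 200 tweets.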