Scraping Zhihu user info with Scrapy and storing it in MongoDB

francischeng 2018-09-08 10:45

The rough idea and workflow are as follows:

Starting from a seed user, we fetch that user's followers and followees, then issue the same requests for every user found. Through this crawl flow we can recurse continuously and collect a large amount of user data.

We pick 轮子哥 (url_token: excited-vczh) as the seed user, and use Google Chrome's developer tools to watch the page's Network tab.

From this we can see that when the browser fetches the follower and followee data, it is actually calling a Zhihu API. Looking at each user's data, the url_token parameter is the important one: it uniquely identifies a user, so we can use url_token to locate that user and build their profile URL.
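For illustration, the user-info endpoint follows the user_url template used in the spider code below (the include list here is trimmed to two fields just for the example):

# the user-info endpoint, as templated in the spider code further down
user_url = 'https://www.zhihu.com/api/v4/members/{user}?include={include}'
# for the seed user this expands to:
# https://www.zhihu.com/api/v4/members/excited-vczh?include=follower_count,answer_count
print(user_url.format(user='excited-vczh', include='follower_count,answer_count'))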

To get more detailed information about a single user, hover the mouse over the user's avatar.

A corresponding Ajax request then shows up in the Network tab:

Open cmd and create the project with the command scrapy startproject zhihuuser, then start writing the code.

  1. First, set ROBOTSTXT_OBEY to False in settings.py so the crawler is not constrained by the robots.txt protocol; otherwise some pages cannot be reached (see the settings sketch after step 4).
  2. Enter the zhihuuser folder and create the Zhihu spider with scrapy genspider zhihu www.zhihu.com.

  3. First test the spider by running it directly from cmd. The run stalls and the server responds with a 400 error. This is because Zhihu's server checks the user-agent, so we modify the DEFAULT_REQUEST_HEADERS setting in settings.py and add a browser's default user-agent (also shown in the sketch after step 4).
  4. Rewrite the initial requests:

    In the first step we define the URL templates user_url, follows_url and followers_url, and attach the include parameter to each (i.e. follows_query / followers_query).

    In the second step we override start_requests. The initial user is 轮子哥 (start_user); format fills in the complete URLs, and parse_user, parse_follows and parse_followers are chosen as the respective callbacks.
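A minimal sketch of the settings.py changes from steps 1 and 3; the User-Agent value is only an example, use whatever your own browser sends:

# settings.py (excerpt)
ROBOTSTXT_OBEY = False  # do not let robots.txt block the API requests

DEFAULT_REQUEST_HEADERS = {
    # example browser User-Agent; copy the one your own browser sends
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',
}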

  5. Create the item container (items) that stores the data.

Because there are so many fields, I have not written them all out here.
6. Store each user's information.

In the spider we write a parse_user function. In it we create a UserItem() object; its item.fields attribute gives all the fields defined on UserItem, which we use to fill the container.
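Excerpted from parse_user (the full spider code is further below), the filling loop looks like this:

result = json.loads(response.text)   # the user-info API returns a JSON object
item = UserItem()                    # container declared in items.py
for field in item.fields:            # item.fields lists every Field declared on UserItem
    if field in result.keys():       # copy only the keys the API actually returned
        item[field] = result.get(field)
yield item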

7. Handling the Ajax requests for followers and followees.

The Ajax response contains two blocks of data: data and paging.

① For data, the information we mainly need is url_token: it is the unique identifier of a user, and from it we can construct that user's profile URL.

② For paging:

paging tells us whether the followers/followees Ajax request, i.e. the follower or followee list, has reached its last page. is_end indicates whether this is the last page, and next is the URL of the next page; these two fields are what we need.

First check whether paging exists; while is_end is false, take the URL of the next page of the followees/followers list and yield a Request for it.
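Slightly condensed, the logic looks like this (the full parse_follows / parse_followers callbacks appear in the spider code below):

# condensed excerpt from parse_follows; parse_followers is identical
results = json.loads(response.text)
for result in results.get('data', []):
    # every entry's url_token lets us request that user's detailed info
    yield Request(self.user_url.format(user=result.get('url_token'), include=self.user_query),
                  callback=self.parse_user)
if 'paging' in results and results['paging'].get('is_end') is False:
    # not the last page yet: follow the 'next' URL to the next page of the list
    yield Request(results['paging'].get('next'), callback=self.parse_follows)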

8. Store the user information in MongoDB, i.e. rewrite the pipeline.

The official Scrapy documentation has MongoDB pipeline code; copy it directly.

    To deduplicate, one line needs to be changed:
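Concretely, instead of a plain insert, process_item upserts keyed on url_token, so a re-crawled user overwrites its existing document rather than creating a duplicate (full pipeline code below):

self.db['user'].update_one(
    {'url_token': item['url_token']},   # match the user by its unique url_token
    {'$set': dict(item)},               # update all fields from the item
    upsert=True                         # insert if this user is not stored yet
)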

Of course, the corresponding settings in settings.py also need to be changed.
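A sketch of those settings, assuming the project and pipeline class used in this article (zhihuuser.pipelines.MongoPipeline); the database name is only an example:

# settings.py (excerpt)
ITEM_PIPELINES = {
    'zhihuuser.pipelines.MongoPipeline': 300,
}
MONGO_URI = 'localhost'      # MongoDB connection URI, read by from_crawler
MONGO_DATABASE = 'zhihu'     # example database name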

 

Finally, run the spider with scrapy crawl zhihu.

 

Full code:

Spider code (zhihu.py):

# -*- coding: utf-8 -*-
import json

import scrapy
from scrapy import Request

from zhihuuser.items import UserItem


class ZhihuSpider(scrapy.Spider):
    name = 'zhihu'
    allowed_domains = ['www.zhihu.com']
    start_user = 'excited-vczh'
    user_url = 'https://www.zhihu.com/api/v4/members/{user}?include={include}'
    user_query = 'allow_message,is_followed,is_following,is_org,is_blocking,employments,answer_count,follower_count,articles_count,gender,badge[?(type=best_answerer)].topics'

    follows_url = 'https://www.zhihu.com/api/v4/members/{user}/followees?include={include}&offset={offset}&limit={limit}'
    follows_query = 'data[*].answer_count,articles_count,gender,follower_count,is_followed,is_following,badge[?(type=best_answerer)].topics'

    followers_url = 'https://www.zhihu.com/api/v4/members/{user}/followers?include={include}&offset={offset}&limit={limit}'
    followers_query = 'data[*].answer_count,articles_count,gender,follower_count,is_followed,is_following,badge[?(type=best_answerer)].topics'

    def start_requests(self):
        yield Request(self.user_url.format(user=self.start_user, include=self.user_query), self.parse_user)
        yield Request(self.follows_url.format(user=self.start_user, include=self.follows_query, offset=0, limit=20),
                      self.parse_follows)
        yield Request(self.followers_url.format(user=self.start_user, include=self.followers_query, offset=0, limit=20),
                      self.parse_followers)

    def parse_user(self, response):
        result = json.loads(response.text)
        item = UserItem()  # create a UserItem instance to hold this user's data
        for field in item.fields:
            if field in result.keys():
                item[field] = result.get(field)
        yield item
        # request this user's followees list (parsed by parse_follows)
        yield Request(
            self.follows_url.format(user=result.get('url_token'), include=self.follows_query, limit=20, offset=0),
            callback=self.parse_follows)
        # request this user's followers list (parsed by parse_followers)
        yield Request(
            self.followers_url.format(user=result.get('url_token'), include=self.followers_query, limit=20, offset=0),
            callback=self.parse_followers)

    def parse_follows(self, response):
        results = json.loads(response.text)
        if 'data' in results.keys():
            for result in results.get('data'):
                yield Request(self.user_url.format(user=result.get('url_token'), include=self.user_query),
                              callback=self.parse_user)
        if 'paging' in results.keys() and results.get('paging').get('is_end') is False:
            next_page = results.get('paging').get('next')
            yield Request(next_page, callback=self.parse_follows)

    def parse_followers(self, response):
        results = json.loads(response.text)
        if 'data' in results.keys():
            for result in results.get('data'):
                yield Request(self.user_url.format(user=result.get('url_token'), include=self.user_query),
                              callback=self.parse_user)
        if 'paging' in results.keys() and results.get('paging').get('is_end') is False:
            next_page = results.get('paging').get('next')
            yield Request(next_page, callback=self.parse_followers)

Pipeline code (pipelines.py):

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html


import pymongo


class MongoPipeline(object):

    collection_name = 'user'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # upsert keyed on url_token: update the user's document if it already exists,
        # otherwise insert it, so re-crawled users are not duplicated
        self.db[self.collection_name].update_one(
            {'url_token': item['url_token']},
            {'$set': dict(item)},
            upsert=True
        )
        return item

Items code (items.py):

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy
from scrapy import Field


class UserItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    id = Field()
    name = Field()
    employments = Field()
    url_token = Field()
    follower_count = Field()
    url = Field()
    answer_count = Field()
    headline = Field()

 
