python+scrapy crawler (scraping Lianjia second-hand housing listings)

z-ww 2018-09-10 14:33

 

 

I had previously scraped data with selenium and requests, but it felt slow, so after going through the Scrapy tutorial I decided to give this framework a try.

1. Goal: scrape the second-hand housing listings on Chengdu Lianjia, mainly the community name, surroundings, floor and price, and write the data into MySQL.

2. Environment: Scrapy 1.5.1 + Python 3.6
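
If the environment still needs to be set up (an assumption, since the original post does not cover installation), both Scrapy and pymysql, which step 8 relies on, can be installed with pip:

pip install scrapy==1.5.1 pymysql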

3. Create the project: in the directory where the project should live, run the command: scrapy startproject LianJiaScrapy

4. Project layout (run.py was added by me; it launches the Scrapy project from inside Eclipse, which makes debugging easier):
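
The original screenshot is not reproduced here; roughly, the generated project plus the hand-added run.py looks like the sketch below (middlewares.py and the __init__.py files are standard scrapy startproject output, and run.py sitting next to scrapy.cfg is an assumption):

LianJiaScrapy/
├── scrapy.cfg
├── run.py
└── LianJiaScrapy/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        └── __init__.py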

These files are:

  • scrapy.cfg: the project's configuration file
  • LianJiaScrapy: the project's Python module; the code goes in here.
  • LianJiaScrapy/items.py: the project's item file; it defines the field names, and the scraped data is stored in the corresponding fields (it behaves like a dict and hands the data on to pipelines.py for further processing).
  • LianJiaScrapy/pipelines.py: the project's pipelines file; the scraped data is post-processed here (writing the data into the database, for example, happens in this file).
  • LianJiaScrapy/spiders/: the directory holding the spider code (the actual scraping, mapping the scraped data onto the item fields one to one).

5. Create the main spider file: open a cmd window in the project directory and run: scrapy genspider lianjia_spider cd.lianjia.com (genspider needs both a name and a domain). A new lianjia_spider.py appears under the spiders directory.
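
For reference, the generated file is only an empty skeleton, roughly like the following (the exact template varies slightly between Scrapy versions); the later steps rename the spider to "Lianjia" and fill in the parsing logic:

# -*- coding: utf-8 -*-
import scrapy


class LianjiaSpiderSpider(scrapy.Spider):
    name = 'lianjia_spider'
    allowed_domains = ['cd.lianjia.com']
    start_urls = ['http://cd.lianjia.com/']

    def parse(self, response):
        pass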

6. Writing items.py:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

from scrapy import Field, Item


class ScrapylianjiaItem(Item):
  '''
  houseName: name of the residential community
  description: description of the house
  floor: floor information
  positionIcon: district the house belongs to
  followInfo: follower count and posting date of this listing
  subway: whether it is close to a subway line
  taxfree: whether the listing is tax-exempt
  haskey: whether the flat can be viewed at any time
  totalPrice: total price
  unitPrice: price per square metre
  '''
  houseName = Field()
  description = Field()
  floor = Field()
  positionIcon = Field()
  followInfo = Field()
  subway = Field()
  taxfree = Field()
  haskey = Field()
  totalPrice = Field()
  unitPrice = Field()

7. Writing the spider file lianjia_spider.py

# -*- coding: utf-8 -*-
'''
Created on 2018-08-23

@author: zww
'''
import scrapy
import random
import time
from LianJiaScrapy.items import ScrapylianjiaItem


class LianJiaSpider(scrapy.Spider):
    name = "Lianjia"
    start_urls = [
        "https://cd.lianjia.com/ershoufang/pg1/",
    ]

    def parse(self, response):
        # base URL used to assemble the address of the next page to crawl
        init_url = 'https://cd.lianjia.com/ershoufang/pg'
        # each listing sits under //li[@class="clear LOGCLICKDATA"]; there are 30 per page
        sels = response.xpath('//li[@class="clear LOGCLICKDATA"]')
        # grab all 30 entries of each field at once
        houseName_list = sels.xpath(
            '//div[@class="houseInfo"]/a/text()').extract()
        description_list = sels.xpath(
            '//div[@class="houseInfo"]/text()').extract()
        floor_list = sels.xpath(
            '//div[@class="positionInfo"]/text()').extract()
        positionIcon_list = sels.xpath(
            '//div[@class="positionInfo"]/a/text()').extract()
        followInfo_list = sels.xpath(
            '//div[@class="followInfo"]/text()').extract()
        subway_list = sels.xpath('//span[@class="subway"]/text()').extract()
        taxfree_list = sels.xpath('//span[@class="taxfree"]/text()').extract()
        haskey_list = sels.xpath('//span[@class="haskey"]/text()').extract()
        totalPrice_list = sels.xpath(
            '//div[@class="totalPrice"]/span/text()').extract()
        unitPrice_list = sels.xpath(
            '//div[@class="unitPrice"]/span/text()').extract()
        # map the scraped data onto the fields defined in items.py
        i = 0
        for sel in sels:
            item = ScrapylianjiaItem()

            item['houseName'] = houseName_list[i].strip()
            item['description'] = description_list[i].strip()
            item['floor'] = floor_list[i].strip()
            item['positionIcon'] = positionIcon_list[i].strip()
            item['followInfo'] = followInfo_list[i].strip()
            item['subway'] = subway_list[i].strip()
            item['taxfree'] = taxfree_list[i].strip()
            item['haskey'] = haskey_list[i].strip()
            item['totalPrice'] = totalPrice_list[i].strip()
            item['unitPrice'] = unitPrice_list[i].strip()
            i += 1
            yield item
        # read the current page number; the attribute looks like {"totalPage":100,"curPage":98}
        has_next_page = sels.xpath(
            '//div[@class="page-box fr"]/div[1]/@page-data').extract()[0]
        # the value comes back as a str; turn it into a dict and read the curPage field
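        # (adding `import json` and using json.loads(has_next_page) would be a safer alternative to eval below)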
        to_dict = eval(has_next_page)
        current_page = to_dict['curPage']
        # Lianjia only exposes 100 pages, so stop the spider after page 100
        if current_page != 100:
            next_page = current_page + 1
            url = ''.join([init_url, str(next_page), '/'])
            print('starting scrapy url:', url)
            # random delay between requests to reduce the chance of an IP ban
            time.sleep(round(random.uniform(1, 2), 2))
            yield scrapy.Request(url, callback=self.parse)
        else:
            print('scrapy done!')
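
One thing worth noting: the XPaths above start with //, so they search the whole page rather than each sel, and tags such as subway, taxfree and haskey do not appear on every listing, which can push the parallel lists out of step. Below is a hedged per-item sketch of the extraction loop (same fields, assuming the same page structure); the next-page handling from the original parse() would still follow the loop:

    def parse(self, response):
        for sel in response.xpath('//li[@class="clear LOGCLICKDATA"]'):
            item = ScrapylianjiaItem()
            # the leading ".//" keeps each XPath relative to the current listing,
            # and extract_first('') falls back to an empty string when a tag is missing
            item['houseName'] = sel.xpath('.//div[@class="houseInfo"]/a/text()').extract_first('').strip()
            item['description'] = sel.xpath('.//div[@class="houseInfo"]/text()').extract_first('').strip()
            item['floor'] = sel.xpath('.//div[@class="positionInfo"]/text()').extract_first('').strip()
            item['positionIcon'] = sel.xpath('.//div[@class="positionInfo"]/a/text()').extract_first('').strip()
            item['followInfo'] = sel.xpath('.//div[@class="followInfo"]/text()').extract_first('').strip()
            item['subway'] = sel.xpath('.//span[@class="subway"]/text()').extract_first('').strip()
            item['taxfree'] = sel.xpath('.//span[@class="taxfree"]/text()').extract_first('').strip()
            item['haskey'] = sel.xpath('.//span[@class="haskey"]/text()').extract_first('').strip()
            item['totalPrice'] = sel.xpath('.//div[@class="totalPrice"]/span/text()').extract_first('').strip()
            item['unitPrice'] = sel.xpath('.//div[@class="unitPrice"]/span/text()').extract_first('').strip()
            yield item
        # ...the page-counting / next-page request logic from the original parse() follows here, unchanged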

  

8. Writing the data-processing file pipelines.py:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

import pymysql
from scrapy.utils.project import get_project_settings


class LianjiascrapyPipeline(object):
    InsertSql = '''insert into scrapy_lianjia
        (houseName,description,floor,followInfo,haskey,
        positionIcon,subway,taxfree,totalPrice,unitPrice)
        values('{houseName}','{description}','{floor}','{followInfo}',
        '{haskey}','{positionIcon}','{subway}','{taxfree}','{totalPrice}','{unitPrice}')'''

    def __init__(self):
        self.settings = get_project_settings()
        # connect to the database using the values from settings.py
        self.connect = pymysql.connect(
            host=self.settings.get('MYSQL_HOST'),
            port=self.settings.get('MYSQL_PORT'),
            db=self.settings.get('MYSQL_DBNAME'),
            user=self.settings.get('MYSQL_USER'),
            passwd=self.settings.get('MYSQL_PASSWD'),
            charset='utf8',
            use_unicode=True)
        # all inserts/updates/queries go through this cursor
        self.cursor = self.connect.cursor()

    def process_item(self, item, spider):
        sqltext = self.InsertSql.format(
            houseName=item['houseName'], description=item['description'], floor=item['floor'], followInfo=item['followInfo'],
            haskey=item['haskey'], positionIcon=item['positionIcon'], subway=item['subway'], taxfree=item['taxfree'],
            totalPrice=item['totalPrice'], unitPrice=item['unitPrice'])
        try:
            self.cursor.execute(sqltext)
            self.connect.commit()
        except Exception as e:
            print('Failed to insert data:', e)
        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.connect.close()
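
Building the SQL with str.format breaks as soon as a field value contains a single quote, and it leaves the insert open to SQL injection. A hedged alternative sketch that passes the values as pymysql query parameters instead (only InsertSql and process_item change; __init__ and close_spider stay exactly as above, and the columns match the table from step 10):

class LianjiascrapyPipeline(object):
    # pymysql fills the %(name)s placeholders and handles all quoting/escaping itself
    InsertSql = '''insert into scrapy_lianjia
        (houseName, description, floor, followInfo, haskey,
        positionIcon, subway, taxfree, totalPrice, unitPrice)
        values (%(houseName)s, %(description)s, %(floor)s, %(followInfo)s,
        %(haskey)s, %(positionIcon)s, %(subway)s, %(taxfree)s, %(totalPrice)s, %(unitPrice)s)'''

    def process_item(self, item, spider):
        try:
            # Scrapy items behave like dicts, so dict(item) supplies the named parameters
            self.cursor.execute(self.InsertSql, dict(item))
            self.connect.commit()
        except Exception as e:
            self.connect.rollback()
            print('Failed to insert data:', e)
        return item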

9. To use the pipeline, it has to be enabled in settings.py:

ITEM_PIPELINES = {
    'LianJiaScrapy.pipelines.LianjiascrapyPipeline': 300,
}

# MySQL connection settings:

MYSQL_HOST = 'localhost'
MYSQL_DBNAME = 'test_scrapy'
MYSQL_USER = 'your MySQL user here'
MYSQL_PASSWD = 'your MySQL password here'
MYSQL_PORT = 3306

# Default request headers for the crawler

DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'zh-CN,zh;q=0.9',
    'Cache-Control': 'max-age=0',
    'Connection': 'keep-alive',
    'Cookie': 'your cookie here',
    'Host': 'cd.lianjia.com',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'}
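
As an optional, hedged addition (not part of the original settings), Scrapy's own throttling can take over the job of the time.sleep() call in the spider:

# Optional: let Scrapy pause between requests instead of sleeping inside parse()
DOWNLOAD_DELAY = 1.5
RANDOMIZE_DOWNLOAD_DELAY = True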

10. Create the table in the MySQL database test_scrapy:

CREATE TABLE `scrapy_lianjia` (
  `ID` int(11) NOT NULL AUTO_INCREMENT,
  `houseName` varchar(255) DEFAULT NULL COMMENT 'community name',
  `description` varchar(255) DEFAULT NULL COMMENT 'description of the house',
  `floor` varchar(255) DEFAULT NULL COMMENT 'floor information',
  `followInfo` varchar(255) DEFAULT NULL COMMENT 'follower count and posting date of the listing',
  `haskey` varchar(255) DEFAULT NULL COMMENT 'viewing conditions',
  `positionIcon` varchar(255) DEFAULT NULL COMMENT 'district the house belongs to',
  `subway` varchar(255) DEFAULT NULL COMMENT 'whether it is close to a subway line',
  `taxfree` varchar(255) DEFAULT NULL COMMENT 'tax status',
  `totalPrice` varchar(11) DEFAULT NULL COMMENT 'total price',
  `unitPrice` varchar(255) DEFAULT NULL COMMENT 'price per square metre',
  PRIMARY KEY (`ID`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

11. Running the crawler project:

You can simply run the command scrapy crawl Lianjia from a cmd window.

Since I needed to debug while writing the script, I also added run.py, which can be run directly or under the debugger.

My run.py file:

 

# -*- coding: utf-8 -*-
'''
Created on 2018-08-23

@author: zww
'''
from scrapy import cmdline
name = 'Lianjia'
cmd = 'scrapy crawl {0}'.format(name)

# Either of the two calls below works; Python 2.7 and 3.6 seem to behave slightly differently here:
# with 2.7, the second form needs the extra spaces added.
cmdline.execute(cmd.split())
# cmdline.execute(['scrapy', 'crawl', name])

  

 


12. The crawl in progress:

13. The scraped results:

 
