Python Scrapy Environment Setup Tutorial + Scraping 李毅吧 (Li Yi Tieba) with Scrapy

Transkai 2019-03-15 01:46

Scrapy, a Python Crawling Framework

The Scrapy Framework

1. Installing the Scrapy Framework

Installing Scrapy directly on Windows fails with errors such as:

error: Microsoft Visual C++ 14.0 is required <Unable to find vcvarsall.bat>

building 'twisted.test.raiser' extension error: Unable to find vcvarsall.bat

Failed building wheel for lxml

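Solution:

http://www.lfd.uci.edu/~gohlke/pythonlibs/ hosts many precompiled third-party Python libraries for Windows. Download the builds that match your own Python version.

  1. Run python in cmd to check your Python version.

My Python version is Python 3.7.1, 64-bit.

  2. Open http://www.lfd.uci.edu/~gohlke/pythonlibs/, press Ctrl+F to search for lxml, Twisted, and Scrapy, and download the matching builds. For example, lxml-3.7.3-cp35-cp35m-win_amd64.whl means lxml version 3.7.3 built for 64-bit Python 3.5. The files I downloaded were lxml-4.2.5-cp37-cp37m-win_amd64.whl, Twisted-18.9.0-cp37-cp37m-win_amd64.whl, and Scrapy-1.5.1-py2.py3-none-any.whl.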
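
The wheel filename convention described in step 2 can be checked programmatically. The following is a minimal sketch, not part of the original tutorial: the helper names parse_wheel_filename, interpreter_tag, and interpreter_bits are hypothetical, the parser assumes the common five-field filename form without an optional build tag, and interpreter_tag assumes a CPython interpreter.

```python
import struct
import sys
from typing import NamedTuple


class WheelInfo(NamedTuple):
    name: str
    version: str
    python_tag: str
    abi_tag: str
    platform_tag: str


def parse_wheel_filename(filename: str) -> WheelInfo:
    """Split 'name-version-python-abi-platform.whl' into its five fields."""
    if not filename.endswith(".whl"):
        raise ValueError("not a wheel file: %s" % filename)
    parts = filename[:-len(".whl")].split("-")
    if len(parts) != 5:
        # Wheels with an optional build tag have six fields; not handled here.
        raise ValueError("unexpected wheel filename: %s" % filename)
    return WheelInfo(*parts)


def interpreter_tag() -> str:
    """Return the CPython tag of the running interpreter, e.g. 'cp37'."""
    return "cp%d%d" % sys.version_info[:2]


def interpreter_bits() -> int:
    """Return 32 or 64 depending on the pointer size of this interpreter."""
    return struct.calcsize("P") * 8


info = parse_wheel_filename("lxml-4.2.5-cp37-cp37m-win_amd64.whl")
print(info.name, info.version, info.python_tag, info.platform_tag)
print("this interpreter:", interpreter_tag(), "%d-bit" % interpreter_bits())
```

A wheel is a candidate for download when its python_tag equals interpreter_tag() and its platform_tag matches the interpreter's bitness (win_amd64 for 64-bit, win32 for 32-bit). Pure-Python wheels such as Scrapy-1.5.1-py2.py3-none-any.whl carry the tags py2.py3, none, and any, and are not tied to one interpreter build.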


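  3. In cmd, change into the folder that holds the downloaded .whl files. In my case, the three .whl files are in a folder named scrapy.

  4. Run the following commands in order.

    pip must be installed first.

    pip download page: https://pypi.org/project/pip/#files

After downloading, extract the archive to any folder (preferably inside your Python folder), open a command prompt in the extracted directory (cd <path>), and run: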

PS D:\work\setup\script_code_language\Python\others\pip-18.1> python setup.py install
creating 'dist\pip-18.1-py3.7.egg' and adding 'build\bdist.win-amd64\egg' to it
removing 'build\bdist.win-amd64\egg' (and everything under it)
Processing pip-18.1-py3.7.egg
creating d:\work\setup\script_code_language\python\python37\lib\site-packages\pip-18.1-py3.7.egg
Extracting pip-18.1-py3.7.egg to d:\work\setup\script_code_language\python\python37\lib\site-packages
Adding pip 18.1 to easy-install.pth file
Installing pip-script.py script to D:\work\setup\script_code_language\Python\Python37\Scripts
Installing pip.exe script to D:\work\setup\script_code_language\Python\Python37\Scripts
Installing pip3-script.py script to D:\work\setup\script_code_language\Python\Python37\Scripts
Installing pip3.exe script to D:\work\setup\script_code_language\Python\Python37\Scripts
Installing pip3.7-script.py script to D:\work\setup\script_code_language\Python\Python37\Scripts
Installing pip3.7.exe script to D:\work\setup\script_code_language\Python\Python37\Scripts

Installed d:\work\setup\script_code_language\python\python37\lib\site-packages\pip-18.1-py3.7.egg
Processing dependencies for pip==18.1
Finished processing dependencies for pip==18.1
PS D:\work\setup\script_code_language\Python\others\pip-18.1>

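After installation, typing pip at the command line may still report that 'pip' is not recognized as an internal or external command. That happens when the Scripts directory shown in the install log above has not been added to the PATH environment variable. Once it is on PATH, running pip prints its usage: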

PS D:\work\setup\script_code_language\Python\others\pip-18.1> pip

Usage:
  pip <command> [options]

Commands:
  install                     Install packages.
  download                    Download packages.
  uninstall                   Uninstall packages.
  freeze                      Output installed packages in requirements format.
  list                        List installed packages.
  show                        Show information about installed packages.
  check                       Verify installed packages have compatible dependencies.
  config                      Manage local and global configuration.
  search                      Search PyPI for packages.
  wheel                       Build wheels from your requirements.
  hash                        Compute hashes of package archives.
  completion                  A helper command used for command completion.
  help                        Show help for commands.

General Options:
  -h, --help                  Show help.
  --isolated                  Run pip in an isolated mode, ignoring environment variables and user configuration.
  -v, --verbose               Give more output. Option is additive, and can be used up to 3 times.
  -V, --version               Show version and exit.
  -q, --quiet                 Give less output. Option is additive, and can be used up to 3 times (corresponding to WARNING, ERROR, and CRITICAL logging levels).
  --log <path>                Path to a verbose appending log.
  --proxy <proxy>             Specify a proxy in the form [user:passwd@]proxy.server:port.
  --retries <retries>         Maximum number of retries each connection should attempt (default 5 times).
  --timeout <sec>             Set the socket timeout (default 15 seconds).
  --exists-action <action>    Default action when a path already exists: (s)witch, (i)gnore, (w)ipe, (b)ackup, (a)bort).
  --trusted-host <hostname>   Mark this host as trusted, even though it does not have valid or any HTTPS.
  --cert <path>               Path to alternate CA bundle.
  --client-cert <path>        Path to SSL client certificate, a single file containing the private key and the certificate in PEM format.
  --cache-dir <dir>           Store the cache data in <dir>.
  --no-cache-dir              Disable the cache.
  --disable-pip-version-check
                              Don't periodically check PyPI to determine whether a new version of pip is available for download. Implied with --no-index.
  --no-color                  Suppress colored output
PS D:\work\setup\script_code_language\Python\others\pip-18.1>
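
If pip is still reported as not recognized, one option is to run it through the interpreter with python -m pip instead of relying on PATH. The sketch below is an illustration only (the helper name pip_command is hypothetical); it picks whichever form is available on the current machine.

```python
import shutil
import sys


def pip_command():
    """Return an argv prefix that runs pip even when pip is not on PATH."""
    if shutil.which("pip") is not None:
        # A pip executable was found in one of the PATH directories.
        return ["pip"]
    # Fall back to running pip as a module of the current interpreter.
    return [sys.executable, "-m", "pip"]


print(" ".join(pip_command()))
```

The printed prefix can be used in place of pip in the install commands that follow.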
   pip install wheel   # 1) install wheel first

   PS D:\work\setup\script_code_language\Python\others\scrapy> pip install .\lxml-4.2.5-cp37-cp37m-win_amd64.whl   # 2) install lxml
   Processing d:\work\setup\script_code_language\python\others\scrapy\lxml-4.2.5-cp37-cp37m-win_amd64.whl
   Installing collected packages: lxml
   Successfully installed lxml-4.2.5
   PS D:\work\setup\script_code_language\Python\others\scrapy> pip install .\Twisted-18.9.0-cp37-cp37m-win_amd64.whl  # 3) install Twisted
   Processing d:\work\setup\script_code_language\python\others\scrapy\twisted-18.9.0-cp37-cp37m-win_amd64.whl
   Collecting Automat>=0.3.0 (from Twisted==18.9.0)
     Using cached https://files.pythonhosted.org/packages/a3/86/14c16bb98a5a3542ed8fed5d74fb064a902de3bdd98d6584b34553353c45/Automat-0.7.0-py2.py3-none-any.whl
   Collecting zope.interface>=4.4.2 (from Twisted==18.9.0)
     Using cached https://files.pythonhosted.org/packages/a8/d2/e2fb1052cdf1c1d05a23c5f7a192a8dc104d5afda0539f86b9839264e1cc/zope.interface-4.6.0-cp37-cp37m-win_amd64.whl
   Collecting incremental>=16.10.1 (from Twisted==18.9.0)
     Using cached https://files.pythonhosted.org/packages/f5/1d/c98a587dc06e107115cf4a58b49de20b19222c83d75335a192052af4c4b7/incremental-17.5.0-py2.py3-none-any.whl
   Collecting PyHamcrest>=1.9.0 (from Twisted==18.9.0)
     Using cached https://files.pythonhosted.org/packages/9a/d5/d37fd731b7d0e91afcc84577edeccf4638b4f9b82f5ffe2f8b62e2ddc609/PyHamcrest-1.9.0-py2.py3-none-any.whl
   Collecting constantly>=15.1 (from Twisted==18.9.0)
     Using cached https://files.pythonhosted.org/packages/b9/65/48c1909d0c0aeae6c10213340ce682db01b48ea900a7d9fce7a7910ff318/constantly-15.1.0-py2.py3-none-any.whl
   Collecting attrs>=17.4.0 (from Twisted==18.9.0)
     Using cached https://files.pythonhosted.org/packages/3a/e1/5f9023cc983f1a628a8c2fd051ad19e76ff7b142a0faf329336f9a62a514/attrs-18.2.0-py2.py3-none-any.whl
   Collecting hyperlink>=17.1.1 (from Twisted==18.9.0)
     Using cached https://files.pythonhosted.org/packages/a7/b6/84d0c863ff81e8e7de87cff3bd8fd8f1054c227ce09af1b679a8b17a9274/hyperlink-18.0.0-py2.py3-none-any.whl
   Collecting six (from Automat>=0.3.0->Twisted==18.9.0)
     Using cached https://files.pythonhosted.org/packages/73/fb/00a976f728d0d1fecfe898238ce23f502a721c0ac0ecfedb80e0d88c64e9/six-1.12.0-py2.py3-none-any.whl
   Requirement already satisfied: setuptools in d:\work\setup\script_code_language\python\python37\lib\site-packages (from zope.interface>=4.4.2->Twisted==18.9.0) (39.0.1)
   Collecting idna>=2.5 (from hyperlink>=17.1.1->Twisted==18.9.0)
     Using cached https://files.pythonhosted.org/packages/14/2c/cd551d81dbe15200be1cf41cd03869a46fe7226e7450af7a6545bfc474c9/idna-2.8-py2.py3-none-any.whl
   Installing collected packages: attrs, six, Automat, zope.interface, incremental, PyHamcrest, constantly, idna, hyperlink, Twisted
   Successfully installed Automat-0.7.0 PyHamcrest-1.9.0 Twisted-18.9.0 attrs-18.2.0 constantly-15.1.0 hyperlink-18.0.0 idna-2.8 incremental-17.5.0 six-1.12.0 zope.interface-4.6.0
   PS D:\work\setup\script_code_language\Python\others\scrapy>
   PS D:\work\setup\script_code_language\Python\others\scrapy> pip install .\Scrapy-1.5.1-py2.py3-none-any.whl    # 4) install Scrapy
   Processing d:\work\setup\script_code_language\python\others\scrapy\scrapy-1.5.1-py2.py3-none-any.whl
   Requirement already satisfied: lxml in d:\work\setup\script_code_language\python\python37\lib\site-packages (from Scrapy==1.5.1) (4.2.5)
   Collecting parsel>=1.1 (from Scrapy==1.5.1)
     Using cached https://files.pythonhosted.org/packages/96/69/d1d5dba5e4fecd41ffd71345863ed36a45975812c06ba77798fc15db6a64/parsel-1.5.1-py2.py3-none-any.whl
   Collecting pyOpenSSL (from Scrapy==1.5.1)
     Using cached https://files.pythonhosted.org/packages/96/af/9d29e6bd40823061aea2e0574ccb2fcf72bfd6130ce53d32773ec375458c/pyOpenSSL-18.0.0-py2.py3-none-any.whl
   Requirement already satisfied: Twisted>=13.1.0 in d:\work\setup\script_code_language\python\python37\lib\site-packages (from Scrapy==1.5.1) (18.9.0)
   Requirement already satisfied: six>=1.5.2 in d:\work\setup\script_code_language\python\python37\lib\site-packages (from Scrapy==1.5.1) (1.12.0)
   Collecting w3lib>=1.17.0 (from Scrapy==1.5.1)
     Using cached https://files.pythonhosted.org/packages/37/94/40c93ad0cadac0f8cb729e1668823c71532fd4a7361b141aec535acb68e3/w3lib-1.19.0-py2.py3-none-any.whl
   Collecting cssselect>=0.9 (from Scrapy==1.5.1)
     Using cached https://files.pythonhosted.org/packages/7b/44/25b7283e50585f0b4156960691d951b05d061abf4a714078393e51929b30/cssselect-1.0.3-py2.py3-none-any.whl
   Collecting PyDispatcher>=2.0.5 (from Scrapy==1.5.1)
     Using cached https://files.pythonhosted.org/packages/cd/37/39aca520918ce1935bea9c356bcbb7ed7e52ad4e31bff9b943dfc8e7115b/PyDispatcher-2.0.5.tar.gz
   Collecting queuelib (from Scrapy==1.5.1)
     Using cached https://files.pythonhosted.org/packages/4c/85/ae64e9145f39dd6d14f8af3fa809a270ef3729f3b90b3c0cf5aa242ab0d4/queuelib-1.5.0-py2.py3-none-any.whl
   Collecting service-identity (from Scrapy==1.5.1)
     Using cached https://files.pythonhosted.org/packages/e9/7c/2195b890023e098f9618d43ebc337d83c8b38d414326685339eb024db2f6/service_identity-18.1.0-py2.py3-none-any.whl
   Collecting cryptography>=2.2.1 (from pyOpenSSL->Scrapy==1.5.1)
     Using cached https://files.pythonhosted.org/packages/c7/c7/2b97b9af3b16ea45aeabd5fe7eb07902c0c24c5d596b68f677c1f50ae55e/cryptography-2.4.2-cp37-cp37m-win_amd64.whl
   Requirement already satisfied: zope.interface>=4.4.2 in d:\work\setup\script_code_language\python\python37\lib\site-packages (from Twisted>=13.1.0->Scrapy==1.5.1) (4.6.0)
   Requirement already satisfied: attrs>=17.4.0 in d:\work\setup\script_code_language\python\python37\lib\site-packages (from Twisted>=13.1.0->Scrapy==1.5.1) (18.2.0)
   Requirement already satisfied: Automat>=0.3.0 in d:\work\setup\script_code_language\python\python37\lib\site-packages (from Twisted>=13.1.0->Scrapy==1.5.1) (0.7.0)
   Requirement already satisfied: constantly>=15.1 in d:\work\setup\script_code_language\python\python37\lib\site-packages (from Twisted>=13.1.0->Scrapy==1.5.1) (15.1.0)
   Requirement already satisfied: hyperlink>=17.1.1 in d:\work\setup\script_code_language\python\python37\lib\site-packages (from Twisted>=13.1.0->Scrapy==1.5.1) (18.0.0)
   Requirement already satisfied: PyHamcrest>=1.9.0 in d:\work\setup\script_code_language\python\python37\lib\site-packages (from Twisted>=13.1.0->Scrapy==1.5.1) (1.9.0)
   Requirement already satisfied: incremental>=16.10.1 in d:\work\setup\script_code_language\python\python37\lib\site-packages (from Twisted>=13.1.0->Scrapy==1.5.1) (17.5.0)
   Collecting pyasn1-modules (from service-identity->Scrapy==1.5.1)
     Using cached https://files.pythonhosted.org/packages/19/02/fa63f7ba30a0d7b925ca29d034510fc1ffde53264b71b4155022ddf3ab5d/pyasn1_modules-0.2.2-py2.py3-none-any.whl
   Collecting pyasn1 (from service-identity->Scrapy==1.5.1)
     Using cached https://files.pythonhosted.org/packages/d1/a1/7790cc85db38daa874f6a2e6308131b9953feb1367f2ae2d1123bb93a9f5/pyasn1-0.4.4-py2.py3-none-any.whl
   Requirement already satisfied: idna>=2.1 in d:\work\setup\script_code_language\python\python37\lib\site-packages (from cryptography>=2.2.1->pyOpenSSL->Scrapy==1.5.1) (2.8)
   Collecting cffi!=1.11.3,>=1.7 (from cryptography>=2.2.1->pyOpenSSL->Scrapy==1.5.1)
     Using cached https://files.pythonhosted.org/packages/ca/f2/e375b7469a2dfe9d1feac81a10df97f18cd771b9a10ac62ca9864b760f7c/cffi-1.11.5-cp37-cp37m-win_amd64.whl
   Collecting asn1crypto>=0.21.0 (from cryptography>=2.2.1->pyOpenSSL->Scrapy==1.5.1)
     Using cached https://files.pythonhosted.org/packages/ea/cd/35485615f45f30a510576f1a56d1e0a7ad7bd8ab5ed7cdc600ef7cd06222/asn1crypto-0.24.0-py2.py3-none-any.whl
   Requirement already satisfied: setuptools in d:\work\setup\script_code_language\python\python37\lib\site-packages (from zope.interface>=4.4.2->Twisted>=13.1.0->Scrapy==1.5.1) (39.0.1)
   Collecting pycparser (from cffi!=1.11.3,>=1.7->cryptography>=2.2.1->pyOpenSSL->Scrapy==1.5.1)
     Using cached https://files.pythonhosted.org/packages/68/9e/49196946aee219aead1290e00d1e7fdeab8567783e83e1b9ab5585e6206a/pycparser-2.19.tar.gz
   Building wheels for collected packages: PyDispatcher, pycparser
     Running setup.py bdist_wheel for PyDispatcher ... done
     Stored in directory: C:\Users\ASUS\AppData\Local\pip\Cache\wheels\88\99\96\cfef6665f9cb1522ee6757ae5955feedf2fe25f1737f91fa7f
     Running setup.py bdist_wheel for pycparser ... done
     Stored in directory: C:\Users\ASUS\AppData\Local\pip\Cache\wheels\f2\9a\90\de94f8556265ddc9d9c8b271b0f63e57b26fb1d67a45564511
   Successfully built PyDispatcher pycparser
   Installing collected packages: cssselect, w3lib, parsel, pycparser, cffi, asn1crypto, cryptography, pyOpenSSL, PyDispatcher, queuelib, pyasn1, pyasn1-modules, service-identity, Scrapy
   Successfully installed PyDispatcher-2.0.5 Scrapy-1.5.1 asn1crypto-0.24.0 cffi-1.11.5 cryptography-2.4.2 cssselect-1.0.3 parsel-1.5.1 pyOpenSSL-18.0.0 pyasn1-0.4.4 pyasn1-modules-0.2.2 pycparser-2.19 queuelib-1.5.0 service-identity-18.1.0 w3lib-1.19.0
   PS D:\work\setup\script_code_language\Python\others\scrapy>
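That completes the Scrapy installation. Ignore any trailing lines that prompt you to upgrade pip.

  5. With Scrapy installed, you also need pywin32. Download the build that matches your Python version and click through the installer.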

    https://sourceforge.net/projects/pywin32/files/pywin32/Build 221/


    That is everything. Scrapy is now ready to use.
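
A quick way to confirm the installation is to check that each package can be located by the interpreter. This is a small sketch under stated assumptions: the helper name missing_modules is hypothetical, and win32api is used as the import name that pywin32 is expected to provide.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# lxml, twisted and scrapy come from the wheels above; win32api from pywin32.
required = ["lxml", "twisted", "scrapy", "win32api"]
missing = missing_modules(required)
if missing:
    print("not installed:", ", ".join(missing))
else:
    print("all required modules were found")
```

The check only locates the modules and does not import them, so it is cheap to run after each install step.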

Scrapy Notes

Using Scrapy

1. Create a Scrapy project

scrapy startproject mySpider

2. Write your first Scrapy spider

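# Useful for debugging XPath or CSS selectors interactively
scrapy shell "https://tieba.baidu.com/f?kw=%E6%9D%8E%E6%AF%85%E5%90%A7"

Run the spider

scrapy crawl first_spider  # first_spider is the name value defined in your spider class

Scrape the thread links and titles from the first page of 李毅吧: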

import scrapy


class firstSpider(scrapy.Spider):
    # # Constructor
    # def __init__(self, name, age):
    #     self.name = name
    #     self.age = age

    name = 'first_spider'

    def start_requests(self):  # Yields the initial requests for the pages to crawl
        # URLs of the pages to crawl
        urls = [
            "https://tieba.baidu.com/f?kw=%E6%9D%8E%E6%AF%85%E5%90%A7"
        ]

        yield scrapy.Request(urls[0], callback=self.parse)  # self.parse handles the downloaded page

    # Callback that processes the response
    def parse(self, response):
        # All <a> elements that match the rule below
        xpath = '//ul[@id="thread_list"]/li[@class=" j_thread_list clearfix"]//div[@class="threadlist_lz clearfix"]/div/a'
        link_list = response.xpath(xpath)

        for link in link_list:
            href = link.xpath('@href').extract_first()
            text = link.xpath('text()').extract_first()
            print('text: %s, href: %s' % (text, href))


if __name__ == '__main__':
    pass
    # Create an instance of the class
    # a = firstSpider('张三', 18)
    # print('name: %s, age: %d' % (a.name, a.age))
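
The start URL used by the spider carries the forum name percent-encoded in the kw query parameter. As a brief illustration (the helper name forum_url is hypothetical), the standard library can build and decode such a URL, which makes it easier to point the spider at a different forum:

```python
from urllib.parse import parse_qs, urlencode, urlsplit

BASE = "https://tieba.baidu.com/f"


def forum_url(keyword):
    """Build the forum listing URL with the keyword percent-encoded as UTF-8."""
    return BASE + "?" + urlencode({"kw": keyword})


url = forum_url("李毅吧")
print(url)

# Decoding the query string recovers the original keyword.
print(parse_qs(urlsplit(url).query)["kw"][0])
```

The first line printed is the same URL that appears in start_requests, and the second is the decoded keyword 李毅吧.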

Scrape the thread links and titles, then follow the next-page link:

import scrapy

class firstSpider(scrapy.Spider):
    # # Constructor
    # def __init__(self, name, age):
    #     self.name = name
    #     self.age = age

    name = 'second_spider'

    def start_requests(self):  # Yields the initial requests for the pages to crawl
        # URLs of the pages to crawl
        urls = [
            "https://tieba.baidu.com/f?kw=%E6%9D%8E%E6%AF%85%E5%90%A7"
        ]

        yield scrapy.Request(urls[0], callback=self.parse)  # self.parse handles the downloaded page

    # Callback that processes the response
    def parse(self, response):
        # All <a> elements that match the rule below
        xpath = '//ul[@id="thread_list"]/li[@class=" j_thread_list clearfix"]//div[@class="threadlist_lz clearfix"]/div/a'
        link_list = response.xpath(xpath)

        filename = '李毅吧贴子内容.txt'
        # Open the file once per page in append mode ('a') so each page adds to it
        with open(filename, 'a', encoding='utf-8') as f:
            for link in link_list:
                href = link.xpath('@href').extract_first()
                text = link.xpath('text()').extract_first()
                line = 'text: %s, href: %s' % (text, href)
                print(line)
                f.write(line + '\n')

        # Link to the next page, selected with a CSS selector
        next_page = response.css('#frs_list_pager a:nth-last-child(2)::attr(href)').extract_first()
        # Follow the link only if one was found
        if next_page is not None:
            # The href may be protocol-relative; urljoin resolves it against the response URL
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)


if __name__ == '__main__':
    pass
    # Create an instance of the class
    # a = firstSpider('张三', 18)
    # print('name: %s, age: %d' % (a.name, a.age))
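
The pager link followed by the spider is expected to be protocol-relative (it starts with // and has no scheme), so it needs a scheme before it can be requested. The sketch below shows how urljoin from the standard library resolves such links against the page URL. The next_href value, including its pn=50 parameter, is a made-up example rather than a value taken from the site.

```python
from urllib.parse import urljoin

page_url = "https://tieba.baidu.com/f?kw=%E6%9D%8E%E6%AF%85%E5%90%A7"

# Hypothetical href as it might appear in the pager (assumption, not scraped).
next_href = "//tieba.baidu.com/f?kw=%E6%9D%8E%E6%AF%85%E5%90%A7&pn=50"

# A protocol-relative link inherits the scheme of the page it was found on.
absolute = urljoin(page_url, next_href)
print(absolute)

# An already absolute link is returned unchanged, so the same call is safe for both.
print(urljoin(page_url, "https://www.baidu.com/"))
```

Scrapy's response.urljoin performs this kind of resolution relative to the response, which avoids building the URL by string concatenation.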

Shorthand for the URL list: start_urls

import scrapy


class firstSpider(scrapy.Spider):
    # # Constructor
    # def __init__(self, name, age):
    #     self.name = name
    #     self.age = age

    name = 'three_spider'

    # Shorthand: Scrapy requests each URL in start_urls and passes the response to parse
    start_urls = [
        "https://tieba.baidu.com/f?kw=%E6%9D%8E%E6%AF%85%E5%90%A7",
        "https://www.baidu.com"
    ]

    # def start_requests(self):  # Yields the initial requests for the pages to crawl
    #     # URLs of the pages to crawl
    #     urls = [
    #         "https://tieba.baidu.com/f?kw=%E6%9D%8E%E6%AF%85%E5%90%A7"
    #     ]
    #
    #     yield scrapy.Request(urls[0], callback=self.parse)  # self.parse handles the downloaded page

    # Callback that processes the response
    def parse(self, response):
        # All <a> elements that match the rule below
        xpath = '//ul[@id="thread_list"]/li[@class=" j_thread_list clearfix"]//div[@class="threadlist_lz clearfix"]/div/a'
        link_list = response.xpath(xpath)

        for link in link_list:
            href = link.xpath('@href').extract_first()
            text = link.xpath('text()').extract_first()
            print('text: %s, href: %s' % (text, href))


if __name__ == '__main__':
    pass
    # Create an instance of the class
    # a = firstSpider('张三', 18)
    # print('name: %s, age: %d' % (a.name, a.age))
