A Treat for Geeks -- Simple Image Scraping with Python

hooligen 2014-04-17 21:11

Beta version; the code is rough.

These instructions use Windows as the example platform; the Python version is 2.7.6.

  1. Make sure Python is installed; the default Windows install path is C:\Python27. If it is not installed, download and install it from https://www.python.org/download/releases/2.7.6
  2. Download mechanize (mechanize-0.2.5.zip) and BeautifulSoup (beautifulsoup4-4.3.2.tar.gz)
  3. Extract mechanize-0.2.5.zip to C:\mechanize-0.2.5, open a command prompt (Windows key + R, type cmd, press Enter), and run the following two commands:
    cd C:\mechanize-0.2.5
    C:\Python27\python setup.py install
  4. Extract beautifulsoup4-4.3.2.tar.gz to C:\beautifulsoup4-4.3.2, open a command prompt, and run the following two commands (a quick way to confirm both installs is shown right after this list):
    cd C:\beautifulsoup4-4.3.2
    C:\Python27\python setup.py install
  5. Copy the code below and save it to any directory (e.g. C:\picture\meizitu_spider.py)
  6. Open a command prompt and run:
    cd C:\picture
    C:\Python27\python meizitu_spider.py
  7. Check the folder C:\picture\MeiZiTu
  8. Enjoy :-)
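
To verify that both packages installed correctly (a quick check of my own, not part of the original steps), ask Python to import them and print their versions:

    C:\Python27\python -c "import mechanize, bs4; print mechanize.__version__, bs4.__version__"

If this prints a version tuple for mechanize and 4.3.2 for bs4 with no ImportError, the spider's dependencies are in place.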

 

Code:

#!/usr/local/bin/python
# -*- coding: utf-8 -*-
# Filename: meizitu_spider.py

import os
import mechanize
from bs4 import BeautifulSoup

br = mechanize.Browser()
br.set_handle_robots(False)  # ignore robots.txt so pages are not refused
br.addheaders = [("User-Agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)")]
#br.set_proxies({"http": "proxy.host.com:port"})

def parse_url(url):
    # Fetch a page and parse it; the site serves GB-encoded pages,
    # so decode with gb18030.
    br.open(url)
    response = br.response()
    soup = BeautifulSoup(response.read(), from_encoding='gb18030')

    return soup

def find_next_page(soup):
    # The second-to-last <li> in the pagination bar holds the "next page"
    # link; on the last page it contains no <a>, so this returns None.
    page_nums = soup.find('div', id='wp_page_numbers').find_all('li')
    next_page_wrapper = page_nums[-2]

    return next_page_wrapper.find('a')


host = "http://www.meizitu.com/"
next_page_uri = ''

page_count = 1
parent_folder = 'MeiZiTu'
if not os.path.exists(parent_folder):
    os.mkdir(parent_folder)

while True:
    print 'Start to parse PAGE %d' % page_count

    soup = parse_url(host + 'a/' + next_page_uri)

    # Each album on the list page sits in a div.metaRight; its first
    # link leads to the album's detail page.
    for pic_link_wrapper in soup.find_all('div', attrs={'class': 'metaRight'}):
        pic_link = pic_link_wrapper.find('a')
        album_soup = parse_url(pic_link.get('href'))
        album_name = os.path.join(parent_folder, pic_link.get_text())
        # Skip albums already downloaded on a previous run.
        if os.path.exists(album_name):
            continue

        os.mkdir(album_name)

        # Download every image on the album page.
        for img in album_soup.find('div', id='picture').find_all('img'):
            img_src = img.get('src')
            img_name = img_src[img_src.rindex('/') + 1:]
            picture_data = mechanize.urlopen(img_src)

            with open(os.path.join(album_name, img_name), 'wb') as picture:
                picture.write(picture_data.read())

    # Look for the "next page" link only after scraping the current page,
    # so the last page of albums is not skipped.
    next_page = find_next_page(soup)
    if next_page is None:
        break

    next_page_uri = next_page.get('href')
    page_count += 1
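
One caveat: the script stops at the first network hiccup, because mechanize.urlopen raises an exception on a failed request. Below is a minimal retry wrapper, a sketch of my own rather than part of the original script (the helper name, retry count, and delay are all made up):

import time
import mechanize

def fetch_with_retry(url, retries=3, delay=2):
    # Try the URL a few times before giving up. Network errors from
    # mechanize (URLError/HTTPError) subclass IOError, as in urllib2,
    # so a single except clause covers them.
    for attempt in range(retries):
        try:
            return mechanize.urlopen(url).read()
        except IOError as e:
            print 'Attempt %d failed for %s: %s' % (attempt + 1, url, e)
            time.sleep(delay)
    return None

With this in place, the image-download loop could write fetch_with_retry(img_src) to the file instead of calling mechanize.urlopen directly, skipping images that stay unreachable (the helper returns None for those).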

 
