Scraping Maoyan Top 100 Movie Information

minorblog 2017-10-07 16:52

1. Open maoyan.com in Chrome and click the TOP100 board.

2. Inspect the pagination routing and construct the page URL as url = 'http://maoyan.com/board/4?offset=' + str(offset), as sketched below.
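For reference, a quick sketch of the URLs this produces, assuming 10 movies per page so that offsets 0 through 90 cover the whole Top 100:

base = 'http://maoyan.com/board/4?offset='
urls = [base + str(offset) for offset in range(0, 100, 10)]
# ['http://maoyan.com/board/4?offset=0', ..., 'http://maoyan.com/board/4?offset=90']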

3. Open the browser's developer tools and inspect a ranked movie entry. For each movie we want the rank (index), poster image URL, title, actors, release time, and score; each scraped record will look roughly like the sketch below.
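This is the shape of the dict that the parser in step 5 will yield; the values here are placeholders, not real data:

record = {
    'index': '1',                      # rank on the board
    'image': 'http://.../poster.jpg',  # poster URL taken from the img data-src attribute
    'title': '...',                    # movie title
    'actor': '...',                    # actor list with the "主演：" prefix stripped
    'time': '...',                     # release date with the "上映时间：" prefix stripped
    'score': '9.5',                    # integer and fraction parts joined
}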

4. Fetch the HTML of a page:

import requests
from requests.exceptions import RequestException

user_agent = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 ' \
             '(KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'
headers = {'User-Agent': user_agent}

def get_one_page(url):
    """Return the HTML of a board page, or None on any failure."""
    try:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        return None
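For a quick check, you can fetch the first page and peek at the returned HTML (the offset=0 URL here is simply page one of the board):

html = get_one_page('http://maoyan.com/board/4?offset=0')
if html:
    print(html[:200])   # show the first 200 characters of the page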

5. Parse the page and extract the data:

from bs4 import BeautifulSoup

def parse_one_page(html):
    """Yield one dict per movie entry (each <dd> on the board page)."""
    soup = BeautifulSoup(html, 'lxml')
    items = soup.select('dd')
    if items:
        for item in items:
            yield {
                'index': item.find('i').text,                                    # rank on the board
                'image': item.find('img', class_="board-img").get('data-src'),  # lazy-loaded poster URL
                'title': item.find('p').text,                                    # first <p> in the entry holds the title
                'actor': item.find('p', class_="star").text.strip()[3:],         # drop the "主演：" prefix
                'time': item.find('p', class_="releasetime").text.strip()[5:],   # drop the "上映时间：" prefix
                'score': item.find('i', class_="integer").text + item.find('i', class_="fraction").text
            }

6. Main crawler function:

def main(offset):
    # Build the page URL from the offset (see step 2) and crawl it.
    url = 'http://maoyan.com/board/4?offset=' + str(offset)
    html = get_one_page(url)
    for item in parse_one_page(html):
        print(item)
        write_to_file(item)
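write_to_file is only shown in the full repo linked at the end; here is a minimal sketch, assuming each record is appended as one JSON line to a result.txt file (both the file name and the JSON-lines format are my assumption, not necessarily what the repo uses):

import json

def write_to_file(item):
    # Append one movie record per line; ensure_ascii=False keeps Chinese text readable.
    with open('result.txt', 'a', encoding='utf-8') as f:
        f.write(json.dumps(item, ensure_ascii=False) + '\n')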

7. Run it with multiple processes:

from multiprocessing import Pool

if __name__ == '__main__':
    pool = Pool()
    pool.map(main, [i * 10 for i in range(10)])   # offsets 0, 10, ..., 90
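For debugging, or if you want the records printed in rank order, a plain sequential loop does the same work without multiprocessing:

if __name__ == '__main__':
    for offset in range(0, 100, 10):
        main(offset)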

 

Full code: https://github.com/huazhicai/Spider/tree/master/maoyantop
