Need help crawling multiple web pages with Scrapy in Python in a loop and following the next page from each

Problem description

I am currently crawling several websites at once, and for each crawled site I need to extract the link to its next page and continue crawling from there, page after page. Note that the second page of every site uses the same div markup.

spider.py

import scrapy

from ..items import MynewsItem  # item defined in items.py of the same project


class UstodaySpider(scrapy.Spider):
    name = 'usatoday'

    start_urls = ['https://en.wikipedia.org/wiki/India',
                  'https://en.wikipedia.org/wiki/USA',
                  ]

    def parse(self, response):
        items = MynewsItem()
        print("**********************************")
        print(type(response))
        print(response.url)

        all_section = response.css('a.gnt_m_flm_a')

        for quote in all_section:
            news_provider_id = '14'
            news_title = quote.css('a.gnt_m_flm_a').xpath("text()").extract()
            news_details = quote.css('a.gnt_m_flm_a').xpath("@data-c-br").extract()
            news_image = quote.css("img.gnt_m_flm_i").xpath("@data-gl-srcset").extract()
            news_page_url = quote.css('a.gnt_m_flm_a').xpath("@href").extract()

            items['news_provider_id'] = news_provider_id
            items['news_title'] = news_title
            items['news_details'] = news_details
            items['news_image'] = news_image
            items['news_page_url'] = news_page_url

        yield items
        next_page = 'https://en.wikipedia.org/wiki/India' + str(news_page_url)
        print(next_page)

pipelines.py

import mysql.connector


class MynewsPipeline(object):

    def __init__(self):
        self.create_connection()
        self.create_table()

    def create_connection(self):
        self.conn = mysql.connector.connect(
            host='localhost',
            user='root',
            password='',
            database='mydb',
            port='3306'
        )
        self.curr = self.conn.cursor()

    def create_table(self):
        self.curr.execute("""DROP TABLE IF EXISTS news_crawl_newsdetails""")
        self.curr.execute("""create table news_crawl_newsdetails(
                        news_provider_id text,
                        news_title text,
                        news_details text,
                        news_image text,
                        news_page_url text
                        )""")

    def process_item(self, item, spider):
        self.store_db(item)
        return item

    def store_db(self, item):
        # print(item['news_title'][0])
        self.curr.execute("""insert into news_crawl_newsdetails (news_provider_id, news_title, news_details, news_image, news_page_url) values (%s, %s, %s, %s, %s)""", (
            item['news_provider_id'],
            item['news_title'][0],
            item['news_details'][0],
            item['news_image'][0],
            item['news_page_url'][0]
        ))
        self.conn.commit()

items.py

import scrapy


class MynewsItem(scrapy.Item):
    news_provider_id = scrapy.Field()
    news_title = scrapy.Field()
    news_details = scrapy.Field()
    news_image = scrapy.Field()
    news_page_url = scrapy.Field()
    news_des = scrapy.Field()

Tags: python, django-rest-framework, scrapy, web-crawler

Solution


You can try the following approach.

First, find the XPath of the next-page element. It can be a link or a button that points to the next page:

next_page = response.selector.xpath(--xpath expression--).extract_first()

if next_page is not None:
    next_page_link = response.urljoin(next_page)
    yield scrapy.Request(url = next_page_link, callback=self.parse)
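
For example, if the target site marks its pagination link with rel="next" (a common convention; this selector is only an illustration and must be adapted to the page's actual markup), the lookup could look like this:

# hypothetical XPath -- adjust it to the site's real pagination markup
next_page = response.xpath("//a[@rel='next']/@href").extract_first()

if next_page is not None:
    next_page_link = response.urljoin(next_page)
    yield scrapy.Request(url=next_page_link, callback=self.parse)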

This is what your parse function should look like:

def parse(self, response):
    print("**********************************")
    print(type(response))
    print(response.url)

    all_section = response.css('a.gnt_m_flm_a')

    for quote in all_section:
        # build one item per link found on the page
        items = MynewsItem()
        news_provider_id = '14'
        news_title = quote.css('a.gnt_m_flm_a').xpath("text()").extract()
        news_details = quote.css('a.gnt_m_flm_a').xpath("@data-c-br").extract()
        news_image = quote.css("img.gnt_m_flm_i").xpath("@data-gl-srcset").extract()
        news_page_url = quote.css('a.gnt_m_flm_a').xpath("@href").extract()

        items['news_provider_id'] = news_provider_id
        items['news_title'] = news_title
        items['news_details'] = news_details
        items['news_image'] = news_image
        items['news_page_url'] = news_page_url

        # yield each item so it is passed on to the pipeline
        yield items

    # fill in the XPath expression that selects the next-page link
    next_page = response.selector.xpath("").extract_first()

    if next_page is not None:
        next_page_link = response.urljoin(next_page)
        yield scrapy.Request(url=next_page_link, callback=self.parse)
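
Two things are worth noting about the version above: the item is created and yielded inside the loop, so every link on the page is sent through MynewsPipeline, and response.urljoin() turns a relative @href into an absolute URL before the next request is scheduled. Newer versions of Scrapy also provide response.follow(), which accepts relative URLs directly; a minimal sketch of the same pagination step with it would be:

if next_page is not None:
    # response.follow resolves the relative URL and builds the Request for you
    yield response.follow(next_page, callback=self.parse)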
