python - Need help using Scrapy to crawl multiple web pages in a loop and follow each page's link to the next page
Problem description
I am currently crawling several sites at once, and I need to crawl each site's next page by taking the next-page link from the page just crawled, so the spider has to keep following every page's next-page link. Note that the second page of every site has the same div content.
spider.py
import scrapy
from mynews.items import MynewsItem  # adjust the package name to your project

class UstodaySpider(scrapy.Spider):
    name = 'usatoday'
    start_urls = ['https://en.wikipedia.org/wiki/India',
                  'https://en.wikipedia.org/wiki/USA',
                  ]

    def parse(self, response):
        items = MynewsItem()
        print("**********************************")
        print(type(response))
        print(response.url)
        all_section = response.css('a.gnt_m_flm_a')
        for quote in all_section:
            news_provider_id = '14'
            news_title = quote.css('a.gnt_m_flm_a').xpath("text()").extract()
            news_details = quote.css('a.gnt_m_flm_a').xpath("@data-c-br").extract()
            news_image = quote.css("img.gnt_m_flm_i").xpath("@data-gl-srcset").extract()
            news_page_url = quote.css('a.gnt_m_flm_a').xpath("@href").extract()
            items['news_provider_id'] = news_provider_id
            items['news_title'] = news_title
            items['news_details'] = news_details
            items['news_image'] = news_image
            items['news_page_url'] = news_page_url
            yield items
        next_page = 'https://en.wikipedia.org/wiki/India' + str(news_page_url)
        print(next_page)
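Note that `str(news_page_url)` in the spider above stringifies the whole list that `extract()` returns, so the concatenated URL is malformed. A quick plain-Python illustration (hypothetical href value):

```python
# extract() always returns a list of strings, so calling str() on it
# produces "['...']" rather than the href itself.
news_page_url = ['/news/world/']  # hypothetical extract() result
base = 'https://en.wikipedia.org/wiki/India'

bad = base + str(news_page_url)   # base + "['/news/world/']" - malformed
good = base + news_page_url[0]    # base + '/news/world/'

print(bad)
print(good)
```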
pipelines.py
import mysql.connector

class MynewsPipeline(object):
    def __init__(self):
        self.create_connection()
        self.create_table()

    def create_connection(self):
        self.conn = mysql.connector.connect(
            host='localhost',
            user='root',
            password='',
            database='mydb',
            port='3306'
        )
        self.curr = self.conn.cursor()

    def create_table(self):
        self.curr.execute("""DROP TABLE IF EXISTS news_crawl_newsdetails""")
        self.curr.execute("""CREATE TABLE news_crawl_newsdetails(
            news_provider_id text,
            news_title text,
            news_details text,
            news_image text,
            news_page_url text
        )""")

    def process_item(self, item, spider):
        self.store_db(item)
        return item

    def store_db(self, item):
        # print(item['news_title'][0])
        self.curr.execute("""INSERT INTO news_crawl_newsdetails (news_provider_id, news_title, news_details, news_image, news_page_url) VALUES (%s, %s, %s, %s, %s)""", (
            item['news_provider_id'],
            item['news_title'][0],
            item['news_details'][0],
            item['news_image'][0],
            item['news_page_url'][0]
        ))
        self.conn.commit()
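The pipeline's parameterized-insert pattern can be tried without a MySQL server. A minimal sketch using the stdlib `sqlite3` module as a stand-in for `mysql.connector` (note that `sqlite3` uses `?` placeholders where `mysql.connector` uses `%s`; the item values are hypothetical):

```python
import sqlite3

# In-memory database standing in for the MySQL connection.
conn = sqlite3.connect(':memory:')
curr = conn.cursor()
curr.execute("""CREATE TABLE news_crawl_newsdetails(
    news_provider_id text, news_title text, news_details text,
    news_image text, news_page_url text)""")

# extract() returns lists, which is why the pipeline indexes with [0].
item = {
    'news_provider_id': '14',
    'news_title': ['Sample headline'],
    'news_details': ['World'],
    'news_image': ['img.jpg'],
    'news_page_url': ['/story/1'],
}

# Parameterized insert: values are passed separately, never
# interpolated into the SQL string.
curr.execute(
    "INSERT INTO news_crawl_newsdetails VALUES (?, ?, ?, ?, ?)",
    (item['news_provider_id'], item['news_title'][0],
     item['news_details'][0], item['news_image'][0],
     item['news_page_url'][0]))
conn.commit()

curr.execute("SELECT news_title FROM news_crawl_newsdetails")
print(curr.fetchone()[0])  # Sample headline
```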
items.py
import scrapy

class MynewsItem(scrapy.Item):
    news_provider_id = scrapy.Field()
    news_title = scrapy.Field()
    news_details = scrapy.Field()
    news_image = scrapy.Field()
    news_page_url = scrapy.Field()
    news_des = scrapy.Field()
Solution
You can try the following approach.
First, find the XPath of the next-page element; it can be a link or a button that points to the next page:
next_page = response.selector.xpath(--xpath expression--).extract_first()
if next_page is not None:
    next_page_link = response.urljoin(next_page)
    yield scrapy.Request(url=next_page_link, callback=self.parse)
This is what your parse function should look like:
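Scrapy's `response.urljoin()` resolves the extracted href against the URL of the response being parsed, following the same rules as the stdlib `urllib.parse.urljoin`. A quick sketch of how relative and absolute hrefs resolve (example URLs are hypothetical):

```python
from urllib.parse import urljoin

# The page being crawled acts as the base URL.
base = 'https://en.wikipedia.org/wiki/India'

# A root-relative href is resolved against the base's host.
print(urljoin(base, '/wiki/USA'))             # https://en.wikipedia.org/wiki/USA

# An already-absolute href is returned unchanged.
print(urljoin(base, 'https://example.com/next'))  # https://example.com/next
```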
def parse(self, response):
    items = MynewsItem()
    print("**********************************")
    print(type(response))
    print(response.url)
    all_section = response.css('a.gnt_m_flm_a')
    for quote in all_section:
        news_provider_id = '14'
        news_title = quote.css('a.gnt_m_flm_a').xpath("text()").extract()
        news_details = quote.css('a.gnt_m_flm_a').xpath("@data-c-br").extract()
        news_image = quote.css("img.gnt_m_flm_i").xpath("@data-gl-srcset").extract()
        news_page_url = quote.css('a.gnt_m_flm_a').xpath("@href").extract()
        items['news_provider_id'] = news_provider_id
        items['news_title'] = news_title
        items['news_details'] = news_details
        items['news_image'] = news_image
        items['news_page_url'] = news_page_url
        yield items
    next_page = response.selector.xpath("").extract_first()  # fill in your next-page XPath here
    if next_page is not None:
        next_page_link = response.urljoin(next_page)
        yield scrapy.Request(url=next_page_link, callback=self.parse)
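The control flow above can be sketched in plain Python: follow each page's "next" link until none is found. Scrapy also deduplicates requests by default, which the `visited` list mimics here (the site and its links are hypothetical):

```python
from urllib.parse import urljoin

# Hypothetical site: each URL maps to its next-page href (None = last page).
pages = {
    'https://site.example/p1': '/p2',
    'https://site.example/p2': '/p3',
    'https://site.example/p3': None,
}

visited = []
url = 'https://site.example/p1'
while url is not None and url not in visited:
    visited.append(url)                      # "parse" the current page
    next_href = pages.get(url)               # extract the next-page link
    url = urljoin(url, next_href) if next_href else None

print(visited)  # each of the three pages, crawled exactly once
```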