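python - Scraping the internal links of a home page after logging in
Problem description
I'll keep this simple. I have a login page. I log in and see the home page. The home page has two links, and I want to open both of them. Each link contains two pieces of data. All I need are the four pieces of data behind the two links on the home page after logging in. I can scrape the links themselves, but not the data inside them. How do I do that? Thanks.
My scrapy code (PS: I wrote this purely on intuition; I don't know whether it is even possible):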
import scrapy


class ClassroomSpider(scrapy.Spider):
    name = 'classroom'
    start_urls = ['http://classroom.dwit.edu.np/login/index.php']
    login_url = 'http://classroom.dwit.edu.np/login/index.php'

    def parse(self, response):
        # Log in to the website
        data = {
            'username': 'mynameisaj',
            'password': 'somerandomvalue',
        }
        yield scrapy.FormRequest(url=self.login_url, formdata=data, callback=self.parse_quotes)

    def parse_quotes(self, response):
        # Links on the home page
        Link = response.xpath('//*[@class="event"]/a/@href').extract()
        for item in zip(Link):
            scraped_info = {
                'Link': item[0],
            }
            yield scraped_info
        # The same home-page links, this time to follow them
        next_page_url = response.xpath('//*[@class="event"]/a/@href').extract()
        if next_page_url:
            yield scrapy.Request(url=next_page_url, callback=self.parse_data)

    def parse_data(self, response):
        # Data inside the page behind each home-page link
        Data = response.xpath('//*[@class="no-overflow"]/p/text()').extract()
        for item in zip(Data):
            scraped_info1 = {
                'Data': item[0],
            }
            yield scraped_info1
Update
The HTML element is:
<div id="intro" class="box generalbox boxaligncenter"><div class="no-overflow"><p>1) Write a program to print the area and perimeter of a triangle having sides of 3, 4 and 5 units by creating a class named 'Triangle' without any parameter in its constructor.</p>
<p><br>2) Write a program that would print the information (name, year of joining, salary, address) of three employees by creating a class named 'Employee'. <br>Create properties as needed for Employee class and set values to those properties using constructor with arguments.</p>
<p>The output should be as follows:</p>
<table border="0" style="width: 348px; height: 43px;">
<tbody>
<tr>
<td><strong><span data-mce-mark="1">Name</span></strong></td>
<td><strong><span data-mce-mark="1">Year of joining</span></strong></td>
<td><strong><span data-mce-mark="1">Address</span></strong></td>
</tr>
<tr>
<td><span data-mce-mark="1">Robert</span></td>
<td><span data-mce-mark="1">1994</span></td>
<td><span data-mce-mark="1">64C- WallsStreet</span></td>
</tr>
<tr>
<td><span data-mce-mark="1">Sam</span></td>
<td><span data-mce-mark="1">2000</span></td>
<td>Kathmandu</td>
</tr>
</tbody>
</table>
<p></p>
<p>3) Create a class 'Degree' having a method 'getDegree' that prints "I got a degree". It has two subclasses namely 'Undergraduate' and 'Postgraduate' each having a method with the same name that prints "I am an Undergraduate" and "I am a Postgraduate" respectively. Call the method by creating an object of each of the three classes.</p>
<p>Note: Use separate class with main method</p></div></div>
It only scrapes the last p element.
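A likely cause, assuming the extracted paragraphs are assigned to a single dict inside a loop, is that every pass rebinds the same name, so only the final paragraph remains. The following is a minimal sketch of that effect next to a version that collects everything; the `paragraphs` list is hypothetical sample data standing in for what `.extract()` might return.

```python
# Hypothetical sample of what .extract() might return for the <p> elements.
paragraphs = [
    "1) Write a program ...",
    "3) Create a class ...",
    "Note: Use separate class with main method",
]

# Rebinding the same name on every pass keeps only the final value.
for text in paragraphs:
    last_only = {"Data": text}

# Collecting into a list keeps every paragraph.
collected = {"Data": list(paragraphs)}

print(last_only)
print(len(collected["Data"]))
```

If this matches the spider, storing the paragraphs as a list (or yielding one item per paragraph) should keep all of them.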
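Solution
If you want to combine the data from both links into one item, you need to pass it along with request.meta. Note that .extract() returns a list of URLs, so each link has to be requested individually rather than handed to scrapy.Request as a whole.
Code (untested):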
def parse_quotes(self, response):
    # First collect ALL the links you want to process
    your_links = response.xpath('//*[@class="event"]/a/@href').extract()
    if not your_links:
        return
    first_link = your_links.pop(0)
    # Start with the very first link and carry the remaining ones in meta
    yield scrapy.Request(url=response.urljoin(first_link), callback=self.parse_data, meta={"links": your_links})

def parse_data(self, response):
    # If an earlier page already produced item data, continue with it
    item = response.meta.get("item", {})
    # Parse the HTML and append every paragraph to the item;
    # reassigning item inside a loop would keep only the last <p>
    data = response.xpath('//*[@class="no-overflow"]/p/text()').extract()
    item.setdefault("Data", []).extend(data)
    # Check whether there are more links to process
    links = response.meta["links"]
    if links:
        next_link_url = links.pop(0)
        yield scrapy.Request(url=response.urljoin(next_link_url), callback=self.parse_data, meta={"links": links, "item": item})
    else:
        # No more links to process, so emit the combined item
        yield item
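To see how the item grows while the remaining links travel along in meta, here is a minimal sketch that imitates the hand-off without Scrapy. The `FAKE_PAGES` dict, the `crawl_chain` function and the example.com URLs are hypothetical stand-ins for real responses, not part of the spider.

```python
# Hypothetical stand-in for the site: URL -> paragraph texts on that page.
FAKE_PAGES = {
    "http://example.com/a": ["a1", "a2"],
    "http://example.com/b": ["b1", "b2"],
}


def crawl_chain(links, pages):
    """Visit links one at a time, carrying the partial item along.

    Imitates the meta={"links": ..., "item": ...} hand-off: each step
    pops one link, appends that page's data, then passes the rest on.
    """
    links = list(links)      # copy so the caller's list is left untouched
    item = {"Data": []}
    while links:
        url = links.pop(0)   # take the next pending link
        item["Data"].extend(pages[url])
    return item              # emitted only once, after the last link


result = crawl_chain(
    ["http://example.com/a", "http://example.com/b"], FAKE_PAGES
)
print(result)
```

The point of the design is that the item is yielded exactly once, after the last link has been processed, so all four values end up in a single record.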