python - Error while scraping: "ValueError: Missing scheme in request url"
Problem description
I am trying to scrape Glassdoor with the Scrapy library. I already have all the links from which to extract information stored in a MongoDB database.
The error I get is:
2019-09-02 13:54:56 [scrapy.core.engine] ERROR: Error while obtaining start requests
Traceback (most recent call last):
File "/home/ubuntu/miniconda3/lib/python3.7/site-packages/scrapy/core/engine.py", line 127, in _next_request
request = next(slot.start_requests)
File "/home/ubuntu/miniconda3/lib/python3.7/site-packages/scrapy/spiders/__init__.py", line 73, in start_requests
yield Request(url, dont_filter=True)
File "/home/ubuntu/miniconda3/lib/python3.7/site-packages/scrapy/http/request/__init__.py", line 25, in __init__
self._set_url(url)
File "/home/ubuntu/miniconda3/lib/python3.7/site-packages/scrapy/http/request/__init__.py", line 69, in _set_url
raise ValueError('Missing scheme in request url: %s' % self._url)
ValueError: Missing scheme in request url:
2019-09-02 13:54:56 [scrapy.core.engine] INFO: Closing spider (finished)
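Note that the URL after the final colon in the ValueError is blank, which suggests Scrapy received an empty string rather than a URL that merely lacked http://. A quick way to confirm (a sketch reusing the helpers shown later in this question, not code from the post itself):

import mongo_db as db

# Print every stored link that Request() would reject: blanks and links
# without an http(s) scheme.
connection = db.connect_to_database()
for link in db.read_crawled_urls(client=connection):
    if not link or not link.strip().startswith(('http://', 'https://')):
        print(repr(link))
db.close_connection(connection)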
My code is:
# Importing libraries.
import scrapy
from scrapy.http import Request
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.exceptions import CloseSpider
import json
import re

# Importing files.
import mongo_db as db


# Glassdoor Scraper
class GlassdoorScrapySpider(CrawlSpider):
    # Spider name, domain, and headers.
    name = 'glassdoor_scraper'
    allowed_domain = ['https://www.glassdoor.com']

    # The first method called by GlassdoorScrapySpider.
    def start_requests(self):
        # Connecting to MongoDB.
        connection = db.connect_to_database()
        # Reading all links in the database.
        db_links = db.read_crawled_urls(client=connection)
        # Calling the parse function to scrape all data.
        for link in db_links:
            yield Request(url=link, callback=self.parse, headers=self.headers)
        # Closing the connection with MongoDB.
        db.close_connection(connection)

    # This method gets all the job_posting JSON data inside the urls.
    def parse(self, response):
        text = response.xpath('//*[@id="JobContent"]/script/text()')  # Extracting the tag with the JSON.
        text = text.extract()[0].strip()  # Extracting the text and removing the leading/trailing spaces.
        text = re.sub(r'<.*?>', '', text)  # Deleting the HTML inside the description.
        text = text.replace('\r', '')  # Removing unnecessary end lines.
        text = text.replace('\n', '')  # Removing unnecessary end lines.
        text = text.replace('\t', '')  # Removing unnecessary tabs.
        text = text.replace('\\', '')  # Removing unnecessary characters.
        try:
            loaded_json = json.loads(text)
            db.save_scraped(client=connection, new_data=loaded_json, task_number=self.task_number, broken=False)
        except:
            print('\nReturned JSON is broken.\n')
            if loaded_json:
                db.save_scraped(client=connection, new_data=loaded_json, task_number=self.task_number, broken=True)
I have already tried using self.start_urls = [] and using self.start_urls = db_links (since db_links is the list I get from Mongo). Of course, I put that inside a method called __init__.
Neither of those works, and I don't know what else to try.
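One change that would address the traceback directly is to harden start_requests itself. A minimal sketch (an assumption that a blank or scheme-less entry in db_links is the culprit, not the poster's actual fix):

from urllib.parse import urlparse  # add at the top of the file

    # Drop-in variant of the method above: skip blank entries and add a
    # scheme where one is missing before building the Request.
    def start_requests(self):
        connection = db.connect_to_database()
        for link in db.read_crawled_urls(client=connection):
            link = (link or '').strip()
            if not link:                   # empty strings raise "Missing scheme in request url:"
                continue
            if not urlparse(link).scheme:  # e.g. "www.glassdoor.com/..." has no scheme
                link = 'https://' + link
            yield Request(url=link, callback=self.parse, headers=self.headers)
        db.close_connection(connection)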
Edit:
I am trying to change the code to see if I can find a solution, but it still fails.
I checked the db_links variable and it is fine: a list containing all the links. I have also moved the connection, db_close, and the rest into the __init__ method.
# Importing libraries.
import scrapy
from scrapy.http import Request
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.exceptions import CloseSpider
import json
import re

# Importing files.
import mongo_db as db


# Glassdoor Scraper
class GlassdoorScrapySpider(CrawlSpider):
    # Spider name, domain, and headers.
    name = 'glassdoor_scraper'
    allowed_domain = ['https://www.glassdoor.com']
    headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) '
                             'AppleWebKit/537.36 (KHTML, like Gecko) '
                             'Chrome/32.0.1700.102 Safari/537.36',
               'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
               'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
               'Accept-Encoding': 'none',
               'Accept-Language': 'en-US,en;q=0.8',
               'Connection': 'keep-alive'}

    # Connecting to MongoDB.
    connection = db.connect_to_database()
    # Reading all links in the database.
    db_links = db.read_crawled_urls(client=connection)
    # Closing the connection with MongoDB.
    db.close_connection(connection)

    # The first method called by GlassdoorScrapySpider.
    def __init__(self, **kwargs):
        super(GlassdoorScrapySpider, self).__init__(**kwargs)
        self.start_urls = [db_links]

    # The second method called by GlassdoorScrapySpider.
    def start_requests():
        for link in db_links:
            # Calling the parse function with the requested html to scrape all data.
            yield Request(url=link, callback=self.parse, headers=self.headers)

    # This method gets all the job_posting JSON data inside the urls.
    def parse(self, response):
        text = response.xpath('//*[@id="JobContent"]/script/text()')  # Extracting the tag with the JSON.
        text = text.extract()[0].strip()  # Extracting the text and removing the leading/trailing spaces.
        text = re.sub(r'<.*?>', '', text)  # Deleting the HTML inside the description.
        text = text.replace('\r', '')  # Removing unnecessary end lines.
        text = text.replace('\n', '')  # Removing unnecessary end lines.
        text = text.replace('\t', '')  # Removing unnecessary tabs.
        text = text.replace('\\', '')  # Removing unnecessary characters.
        try:
            loaded_json = json.loads(text)
            db.save_scraped(client=connection, new_data=loaded_json, task_number=self.task_number, broken=False)
        except:
            print('\nReturned JSON is broken.\n')
            if loaded_json:
                db.save_scraped(client=connection, new_data=loaded_json, task_number=self.task_number, broken=True)
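Independently of the URL data, this edit has some plumbing problems: start_requests is missing self; db_links and headers are class attributes, so the method body cannot reach them as bare names; and self.start_urls = [db_links] would raise a NameError for the same reason (and, if it ran, would nest the list inside another list). A sketch of the generator with those fixed:

    # Sketch: the same generator with the instance-method plumbing corrected.
    def start_requests(self):              # instance methods take self
        for link in self.db_links:         # class attribute, reached through self
            yield Request(url=link, callback=self.parse, headers=self.headers)

Note also that once start_requests is overridden, Scrapy never consults start_urls, so the assignment in __init__ can simply be dropped.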
Edit 2:
In case you want to see the implementation of read_crawled_urls, here it is:
def read_crawled_urls(client):
    # The actual database.
    db = client.TaskExecution
    # Selecting the collection of the database.
    collection = db.Urls
    url_list = []
    for entry in collection.find():
        url_list.append(entry['link'])
    return url_list
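If the Urls collection really does contain blank or scheme-less link values, the filter could also live here. A sketch of that variant (keeping only well-formed http(s) links, under the assumption above):

def read_crawled_urls(client):
    # Same query as above, but drop entries that Request() would reject.
    collection = client.TaskExecution.Urls
    url_list = []
    for entry in collection.find():
        link = (entry.get('link') or '').strip()
        if link.startswith(('http://', 'https://')):
            url_list.append(link)
    return url_list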
When I run this spider from the main.py file with
os.system('scrapy runspider gs_scraper.py')
the code throws the error. However, if I run it from the terminal, it works fine.
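One plausible explanation for that difference (an assumption, since main.py is not shown): os.system inherits whatever working directory main.py was launched from, so scrapy runspider gs_scraper.py may resolve the spider path, or the local mongo_db module, against the wrong directory. A sketch that pins the working directory, assuming gs_scraper.py sits next to main.py:

import os
import subprocess

# Run the spider from the directory containing gs_scraper.py so that the
# relative spider path and the local mongo_db import both resolve.
spider_dir = os.path.dirname(os.path.abspath(__file__))
subprocess.run(['scrapy', 'runspider', 'gs_scraper.py'], cwd=spider_dir, check=True)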
Solution
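Judging from the traceback alone (the value after the final colon is blank), the most likely cause is that at least one document in the Urls collection holds an empty or scheme-less link, so Request() rejects it before any crawling starts. Cleaning those documents in Mongo, or guarding against them in start_requests or read_crawled_urls as sketched above, should clear the ValueError; the main.py-only failure is a separate working-directory issue, as discussed under Edit 2.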