python - Implementing a web crawler in Python
Problem description
When I try to run a simple web-crawler script in Colab, I get the syntax error shown below. Please tell me how to fix it so the code runs:
import requests
from bs4 import BeautifulSoup

def trade_spider(max_pages):
    page = 1
    while page <= max_pages:
        url = 'https://www.ebay.com/sch/i.html?_from=R40&_nkw=2%22+Butterfly+Valve&_sacat=0&_pgn=' + str(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text)
            for link in soup.findALL('a', {'class':'s-item__title s-item__title--has-tags'})
        href = link.get('href')
        print(href)
        page += 1

trade_spider(1)
Error:
File "<ipython-input-4-5d567ac26fb5>", line 11
for link in soup.findALL('a', {'class':'s-item__title s-item__title--has-tags'})
^
IndentationError: unexpected indent
Solution
There is quite a lot wrong with this code, but I can help. The for loop has an extra level of indentation, so remove one indent from the start of the for line, and add a colon at the end of it. Also, it looks like you just copied this from somewhere on the internet, but anyway. Here is the corrected code:
import requests
from bs4 import BeautifulSoup

def trade_spider(max_pages):
    page = 1
    while page <= max_pages:
        url = 'https://www.ebay.com/sch/i.html?_from=R40&_nkw=2%22+Butterfly+Valve&_sacat=0&_pgn=' + str(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text)
        for link in soup.findALL('a', {'class': 's-item__title s-item__title--has-tags'}):
            href = link.get('href')
            print(href)
        page += 1

trade_spider(1)
Edit: running that code produced the following warning:
main.py:10: GuessedAtParserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html5lib"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.
The code that caused this warning is on line 10 of the file main.py. To get rid of this warning, pass the additional argument 'features="html5lib"' to the BeautifulSoup constructor.
So here is the correct code:
import requests
from bs4 import BeautifulSoup

def trade_spider(max_pages):
    page = 1
    while page <= max_pages:
        url = 'https://www.ebay.com/sch/i.html?_from=R40&_nkw=2%22+Butterfly+Valve&_sacat=0&_pgn=' + str(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, features="html5lib")
        for link in soup.find_all('a', {'class': 's-item__title s-item__title--has-tags'}):
            href = link.get('href')
            print(href)
        page += 1

trade_spider(1)
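As a side note, the warning only means that no parser was chosen explicitly; Python also ships a built-in "html.parser" backend, so the html5lib dependency is not strictly required. The link-extraction step itself can even be sketched with nothing but the standard library, which makes it easy to test offline. The HTML below is a made-up stand-in for one eBay results page, not real eBay markup:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect href values from <a> tags carrying a given CSS class."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Match anchors whose class attribute contains the wanted class.
        if tag == "a" and self.wanted_class in attrs.get("class", "").split():
            if "href" in attrs:
                self.links.append(attrs["href"])

# Hypothetical snippet standing in for a search-results page.
sample_html = """
<ul>
  <li><a class="s-item__title" href="https://example.com/item/1">Valve A</a></li>
  <li><a class="s-item__title other" href="https://example.com/item/2">Valve B</a></li>
  <li><a class="unrelated" href="https://example.com/item/3">Ignore me</a></li>
</ul>
"""

parser = LinkExtractor("s-item__title")
parser.feed(sample_html)
print(parser.links)
# -> ['https://example.com/item/1', 'https://example.com/item/2']
```

One caveat either way: eBay's class names (such as s-item__title) are not a stable API and change over time, so if the scraper suddenly prints nothing, check the page's current markup first.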