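python - How do I skip duplicate lines?
Question
How can I make the test section below skip links that have already been tested? I tried comparing the links myself but could not get it to work, and the script is very slow.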
import re
from bs4 import BeautifulSoup
import requests

r = requests.get( 'http://www.google.com' )
html = r.text
soup = BeautifulSoup( html , 'lxml' )
# keep only anchors whose href is an absolute http(s) URL
links = soup.find_all( 'a' , attrs={'href' : re.compile( r'^https?://' )} )
for i in links :
    href = i['href']
    # Test Section
    req = requests.get( href )
    resp = req.status_code
    if resp is None or resp in [400 , 404 , 403 , 408 , 409 , 501 , 502 , 503] :
        # status_code is an int; the reason text lives on the response object
        print( '{}={}===>{}'.format( resp , req.reason , href ) )
        with open( 'Document_ERROR.txt' , 'a' ) as arq :
            arq.write( href )
            arq.write( '\n' )
            arq.write( req.reason )
    else :
        print( 'Response is {} ===> {}'.format( resp , href ) )
        with open( 'Document_OK.txt' , 'a' ) as arq :
            arq.write( href )
            arq.write( '\n' )
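Answer
If I understand you correctly, you want to skip the test code for any link you have already tested.
You can keep a set named seen_links that holds every link tested so far: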
import re
from bs4 import BeautifulSoup
import requests

r = requests.get('http://www.google.com')
soup = BeautifulSoup(r.content, 'lxml')
links = soup.find_all('a', attrs={'href': re.compile(r'^https?://')})
seen_links = set()  # <-- set that will hold all seen links so far
for i in links:
    href = i['href']
    # have we seen the link before?
    if href in seen_links:
        continue  # yes, continue the loop
    # no, add it to seen_links
    seen_links.add(href)
    req = requests.get(href)
    resp = req.status_code
    if resp is None or resp in [400, 404, 403, 408, 409, 501, 502, 503]:
        # status_code is an int; the reason text lives on the response object
        print('{}={}===>{}'.format(resp, req.reason, href))
        with open('Document_ERROR.txt', 'a') as arq:
            print(href, file=arq)
            print(req.reason, file=arq)
    else:
        print('Response is {} ===> {}'.format(resp, href))
        with open('Document_OK.txt', 'a') as arq:
            print(href, file=arq)
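The same seen-set idea can also be pulled out into a small helper, so the de-duplication can be checked without making any network requests. The sketch below is only an illustration: the function name unique_links and the example.com URLs are assumptions, not part of the original question or answer.

```python
def unique_links(hrefs):
    """Return hrefs with duplicates removed, keeping first-seen order.

    Hypothetical helper: it applies the same seen_links set check
    as the loop above, but to a plain list of URL strings.
    """
    seen_links = set()
    result = []
    for href in hrefs:
        if href in seen_links:
            continue  # already seen, skip it
        seen_links.add(href)
        result.append(href)
    return result


# Placeholder URLs, used only to demonstrate the helper.
sample = [
    'https://example.com/a',
    'https://example.com/b',
    'https://example.com/a',
]
print(unique_links(sample))
```

With a helper like this, the main loop could iterate over unique_links(i['href'] for i in links) and send each request only once. On Python 3.7 and later, list(dict.fromkeys(hrefs)) gives the same order-preserving result in one line.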