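How do I skip duplicate lines?

Problem description

How can I make the test section below skip links it has already tested? I tried comparing the links myself but could not get it to work, and the script is very slow.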

import re
from bs4 import BeautifulSoup
import requests

r = requests.get( 'http://www.google.com' )
html = r.text
soup = BeautifulSoup( html , 'lxml' )
links = soup.find_all( 'a' , attrs={'href' : re.compile( r'^https?://' )} )
for i in links :
    href = i['href']

    # Test Section
    req = requests.get( href )
    resp = req.status_code
    if resp is None or resp in [400 , 404 , 403 , 408 , 409 , 501 , 502 , 503] :
        # status_code is an int; the reason phrase lives on the response object
        print( '{}={}===>{}'.format( resp , req.reason , href ) )
        with open( 'Document_ERROR.txt' , 'a' ) as arq :
            arq.write( href )
            arq.write( '\n' )
            arq.write( req.reason )
            arq.write( '\n' )
    else :
        print( 'Response is {} ===> {}'.format( resp , href ) )
        with open( 'Document_OK.txt' , 'a' ) as arq :
            arq.write( href )
            arq.write( '\n' )

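Tags: python, python-3.x, regex, beautifulsoup

Solution

If I understand you correctly, you want to skip the test code when a link has already been tested.

You can keep a set named seen_links that holds every link tested so far: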

import re
from bs4 import BeautifulSoup
import requests


r = requests.get('http://www.google.com')
soup = BeautifulSoup(r.content, 'lxml')
links = soup.find_all('a', attrs={'href': re.compile(r'^https?://')})


seen_links = set()  # <-- set that will hold all seen links so far

for i in links:
    href = i['href']

    # have we seen the link before?
    if href in seen_links:
        continue    # yes, continue the loop

    # no, add it to seen_links
    seen_links.add(href)

    req = requests.get(href)
    resp = req.status_code
    if resp is None or resp in [400, 404, 403, 408, 409, 501, 502, 503]:
        # status_code is an int; the reason phrase lives on the response object
        print('{}={}===>{}'.format(resp, req.reason, href))
        with open('Document_ERROR.txt', 'a') as arq:
            print(href, file=arq)
            print(req.reason, file=arq)
    else:
        print('Response is {} ===> {}'.format(resp, href))
        with open('Document_OK.txt', 'a') as arq:
            print(href, file=arq)
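
The same idea can be pulled out of the loop so it is easy to try without any network requests. The following is only a minimal sketch: the helper name unique_links and the example.com URLs are made up for illustration and do not appear in the question.

```python
def unique_links(hrefs):
    """Yield each href the first time it appears, keeping the original order."""
    seen = set()  # hypothetical helper state: links yielded so far
    for href in hrefs:
        if href in seen:
            continue  # already yielded once, skip it
        seen.add(href)
        yield href


# Illustrative input only; in the script above this would be
# (i['href'] for i in links).
sample = [
    "https://example.com/a",
    "https://example.com/b",
    "https://example.com/a",  # duplicate, skipped
]
result = list(unique_links(sample))
print(result)
```

Membership tests on a set take constant time on average, so the check stays cheap even with many links. If a generator is not needed, list(dict.fromkeys(hrefs)) also drops duplicates while keeping order on Python 3.7 and later.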
