首页 > 解决方案 > BeautifulSoup 无法检索网页链接

问题描述

我正在尝试检测网站列表页面的 url,但 BeautifulSoup 无法做到这一点。我得到以下异常,即使我尝试使用标题,

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 384, in _make_request
    six.raise_from(e, None)
  File "<string>", line 2, in raise_from
  File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 380, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 1321, in getresponse
    response.begin()
  File "/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 296, in begin
    version, status, reason = self._read_status()
  File "/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 257, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/socket.py", line 589, in readinto
    return self._sock.recv_into(b)
TimeoutError: [Errno 60] Operation timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/requests/adapters.py", line 449, in send
    timeout=timeout
  File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 638, in urlopen
    _stacktrace=sys.exc_info()[2])
  File "/usr/local/lib/python3.7/site-packages/urllib3/util/retry.py", line 368, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/usr/local/lib/python3.7/site-packages/urllib3/packages/six.py", line 686, in reraise
    raise value
  File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 600, in urlopen
    chunked=chunked)
  File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 386, in _make_request
    self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
  File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 317, in _raise_timeout
    raise ReadTimeoutError(self, url, "Read timed out. (read timeout=%s)" % timeout_value)
urllib3.exceptions.ReadTimeoutError: HTTPConnectionPool(host='www.sahibinden.com', port=80): Read timed out. (read timeout=None)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/soner/PycharmProjects/bitirme2/main.py", line 8, in <module>
    r = requests.get(url)
  File "/usr/local/lib/python3.7/site-packages/requests/api.py", line 75, in get
    return request('get', url, params=params, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/requests/api.py", line 60, in request
    return session.request(method=method, url=url, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/requests/sessions.py", line 533, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python3.7/site-packages/requests/sessions.py", line 646, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/requests/adapters.py", line 529, in send
    raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: HTTPConnectionPool(host='www.sahibinden.com', port=80): Read timed out. (read timeout=None)

Process finished with exit code 1

但是,当我使用https://hackertarget.com/extract-links/尝试代码中的 url 时,它会带来 URL。

import requests
from bs4 import BeautifulSoup


url = 'http://www.sahibinden.com/satilik/istanbul-kartal?pagingOffset=50&pagingSize=50'
url2 = 'http://www.stackoverflow.com'

r = requests.get(url)
html_content = r.text
soup = BeautifulSoup(html_content, 'lxml')

for link in soup.find_all("a", {"class": "classifiedTitle"}):
    print(link.get('href'))


'''
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
print(requests.get(url, headers=headers, timeout=5).text)
'''

请注意,如果您发现自己被网站阻止(sahbinden),这是可能的。我还没有研究过 BeautifulSoup 与代理列表的用法。

标签: pythonweb-scrapingbeautifulsouppython-requestsweb-crawler

解决方案


这是我运行的代码片段,它按预期工作:

import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'
}

url = 'http://www.sahibinden.com/satilik/istanbul-kartal?pagingOffset=50&pagingSize=50'

r = requests.get(url, headers=headers)
if r.ok:
    soup = BeautifulSoup(r.text, 'lxml')
    for a in soup('a', 'classifiedTitle'):
        print(a.get('href'))

这是上面代码的输出:

/ilan/emlak-konut-satilik-directten%2Ccift-wc-li%2Cgenis-m2de%2Ciskanli%2Culasimi-kolay-sik-3-plus1-671049902/detay
/ilan/emlak-konut-satilik-nesrin-den-kartal-ugurmumcuda-satilik-3-plus1-yunus-emre-caddesinde-692133846/detay
/ilan/emlak-konut-satilik-akelden-karliktepe-de-genis-m2-li-krediye-uygun-daire-659458837/detay
/ilan/emlak-konut-satilik-ikea-ve-metro-yani-teknik-yapi-uprise-elite_mukemmel-firsat-3-plus1-692131163/detay
/ilan/emlak-konut-satilik-kartal-atalar-da-iskanli-5-plus1-dubleks-satilik-daire-692125302/detay
/ilan/emlak-konut-satilik-satilik-daire-kartal-atalar-da-2-plus1-lux-100-m2-671083034/detay
/ilan/emlak-konut-satilik-kartal-ugurmumcuda-3-plus1-genis-masrafsiz-satilik-daire-681180607/detay
/ilan/emlak-konut-satilik-soner-den-manzara-adalar-da-satilik-kacirilmayacak-kelepir-daire-653973723/detay
/ilan/emlak-konut-satilik-mertcan-dan-tarihi-ayazma-caddesinde-2-plus1-satilik-ters-dubleks-692122837/detay
/ilan/emlak-konut-satilik-cinar-emlak%2Ctan-hurriyet-mah-105-m2-toprak-tapulu-692117031/detay
/ilan/emlak-konut-satilik-kartal-cumhuriyet-te-arsa-hisseli-yuksek-giris-daire-692116930/detay
/ilan/emlak-konut-satilik-temiz-emlaktan-petroliste-2-plus1-satilik-sifir-deniz-manzarali-671086029/detay
/ilan/emlak-konut-satilik-cemal-yalcin-dan-ozel-mimarili-luks-satilik-dubleks-623158476/detay
/ilan/emlak-konut-satilik-la-marin-kartal-da-site-icerisinde-ozel-bahce-kati-sifir-daire-645480180/detay
/ilan/emlak-konut-satilik-sen-kardeslerden-merkezde-3-plus1%2Ccok-temiz-satilik-daire%2C350.000tl-692103788/detay
/ilan/emlak-konut-satilik-kartal-petrol-is-mah-de-3-plus1-deniz-manzarali-yatirimlik-daire-619762304/detay
/ilan/emlak-konut-satilik-remax-red-rukiye-korkmaz-dan-panorama-velpark-ta-esyali-1-plus1-616596826/detay
/ilan/emlak-konut-satilik-yakacik-demirli-twinstar-sitesi-ultra-luks-174-m2-3-plus1-daire-692104680/detay
/ilan/emlak-konut-satilik-kartal-soganlikta-yatirimlik-kiracili-firsat-2-plus1-daire-682793715/detay
/ilan/emlak-konut-satilik-istmarinada-devirli-taksitli-satilik-studyo-gulsen-yanmazdan-638548163/detay
/ilan/emlak-konut-satilik-sahibinden-satilik-kartal-merkezde-kaymakamligin-karsisinda-2-plus1-692054497/detay
/ilan/emlak-konut-satilik-petrolis______ara-kat-2-plus1-110-m2-lux-panjurlu_____carsiya-yakin-692100683/detay
/ilan/emlak-konut-satilik-ful-deniz-manzarali-3-plus1-ana-yola-cok-yakin-115m2-sifir-daire-585807696/detay
/ilan/emlak-konut-satilik-kartal-karlitepe-de-ters-dublek-2-plus2-satilik-daire-692085141/detay
/ilan/emlak-konut-satilik-kartal-dap-yapi-istmarina-full-deniz-manzarali-2-plus1-satilik-621795699/detay
/ilan/emlak-konut-satilik-aybars-dan-site-icinde-havuzlu-satilik-daire-671063936/detay
/ilan/emlak-konut-satilik-soganlik-yeni-mah-5-yillik-binada-adalar-manzarali-satilik-dair-679308838/detay
/ilan/emlak-konut-satilik-kartal-soganlik-orta-mah-e-5-yani-yeni-bina-kelepir-daire-573785719/detay
/ilan/emlak-konut-satilik-sahibinden-site-icerisinde-1-plus1-644746509/detay
/ilan/emlak-konut-satilik-3-plus1-luks-sitede-646420303/detay
/ilan/emlak-konut-satilik-mirac-dan-ayazma-koru-da-lux-yapili-3-plus1-135m2-masrafsiz-daire-535382195/detay
/ilan/emlak-konut-satilik-sahibinden-site-icerisinde-3-plus1-644729603/detay
/ilan/emlak-konut-satilik-cevizli-de-satilik-daire-2-plus1-lux-85-m2-671030197/detay
/ilan/emlak-konut-satilik-esentepe-de-bahceli-acik-otoparkli-125m2-ferah-kullansli-daire-670847710/detay
/ilan/emlak-konut-satilik-atalarda-ara-katta-sifir-binada-2-plus1-85-m2-otoparkli-510436215/detay
/ilan/emlak-konut-satilik-sahil-mesa-marmara-10.kat-122m2-deniz-manzarali-0-satilik-3-plus1-692085951/detay
/ilan/emlak-konut-satilik-kartal-da-sifir-ara-kat-3-plus1-satilik-daire-692090351/detay
/ilan/emlak-konut-satilik-pega-kartal-satis-ofisinden-2-plus1-kat-mulkiyetli-hemen-teslim-644626657/detay
/ilan/emlak-konut-satilik-adalilar-dan-kartal-hurriyet-mah-de-satilik-kelepir-3-plus1-dublex-682761629/detay
/ilan/emlak-konut-satilik-kartal-kordonboyunda-2-plus1-sifir-daire-647037679/detay
/ilan/emlak-konut-satilik-aklife-den_yakacik_carsi_mah_ultra_lux_katta_tek_sifir_2-plus1-654883140/detay
/ilan/emlak-konut-satilik-aklife-den_yakacik_da_mukanbel_yapi_kaliteli_3-plus1_arakat_sifir-657772595/detay
/ilan/emlak-konut-satilik-ciceksan-insaat-dan-3-plus1-daireler-hemen-tapu-hemen-teslim-682770303/detay
/ilan/emlak-konut-satilik-satilik-daire-ofis-2-1-85-mt-klepir-634724740/detay
/ilan/emlak-konut-satilik-ricar-dan%2C7-24-guvenlik%2Cyuzme-havuzu%2Ckapali-otopark%2Csifir%2Csitede-682744629/detay
/ilan/emlak-konut-satilik-ricar-dan%2Cana-cadde-uzeri%2Cgenis%2Cferah%2Csifir%2Clux%2Cara-kat-649504313/detay
/ilan/emlak-konut-satilik-mertcan-dan-e5-e-yurume-mesafesinde-iskanli-2-plus1-sifir-daire-692078490/detay
/ilan/emlak-konut-satilik-kartal-atalar-da-sahile-yurume-mesafesinde-iskanli-masrafsiz-3-plus1-454709956/detay
/ilan/emlak-konut-satilik-tugcan-pala-dan-mesa-kartall-da-satilik-2-kat-buyuk-tip-2-plus1-670434988/detay
/ilan/emlak-konut-satilik-satilik-sifir-daire-soganlik-yeni-mah-2-plus1-kat-mulkiyetli-682522237/detay

推荐阅读