Counting words on multiple web pages at the same domain

Problem description

I am writing a simple Python 3 web crawler intended to count the words on various sub-pages of a domain. The idea is to get all the sub-pages on the domain, then iterate through them and count the words in each.

My problem is that I'm getting various errors, such as urllib3.exceptions.NewConnectionError, and that word counts are inaccurate.

Once I've perfected the code, I'll make it recursive so that it counts the words on sub-pages of sub-pages as well.

I will be grateful for any suggestions to improve my code.
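For the planned recursive version, a breadth-first crawl with a visited set keeps each page from being fetched or counted twice and bounds the total work. A minimal sketch, where `fetch_links` is a hypothetical callable that returns the absolute same-domain links found on a page:

```python
from collections import deque

def crawl(start, fetch_links, max_pages=100):
    """Breadth-first crawl: visit each URL at most once, up to max_pages.
    fetch_links(url) -> iterable of absolute URLs (hypothetical helper)."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue and len(order) < max_pages:
        url = queue.popleft()
        order.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order
```

The `seen` set is what prevents infinite loops when pages link back to each other, which plain recursion over sub-pages would not handle on its own.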

import requests
from collections import Counter
from string import punctuation
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("https://www.paulelliottbooks.com/")
bsObj = BeautifulSoup(html.read(), features="html.parser")

urls=[]
for link in bsObj.find_all('a'):
    if link.get('href') not in urls:
        urls.append(link.get('href'))
print(urls)

words=0
for url in urls:
    specific_url="https://www.paulelliottbooks.com"+url
    r = requests.get(specific_url)
    soup = BeautifulSoup(r.content, features="html.parser")
    text_div = (''.join(s.findAll(text=True)) for s in soup.findAll('div'))
    c_div = Counter((x.rstrip(punctuation).lower() for y in text_div for x in y.split()))
    word_count=(sum(c_div.values()))
    print(specific_url + " " + str(word_count))
    words += word_count
print(words)
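The `getaddrinfo failed` traceback below comes from concatenating a stray `_blank` value onto the bare domain, producing the nonexistent host `www.paulelliottbooks.com_blank`. A sketch of a safer approach, assuming `urllib.parse.urljoin` for resolution and an explicit same-domain check (note that `_blank` would still resolve to a same-domain 404 path, so empty and fragment-only values are skipped as well):

```python
from urllib.parse import urljoin, urlparse

BASE = "https://www.paulelliottbooks.com/"

def resolve_links(hrefs):
    """Resolve hrefs against the base URL, dropping empty values,
    bare fragments, duplicates, and anything pointing off-domain."""
    resolved = []
    for href in hrefs:
        # Skip None/empty hrefs and fragment-only links like "#".
        if not href or href.startswith("#"):
            continue
        full = urljoin(BASE, href)
        # urljoin never glues text onto the hostname, so the
        # "www.paulelliottbooks.com_blank" DNS failure cannot occur.
        if urlparse(full).netloc == urlparse(BASE).netloc and full not in resolved:
            resolved.append(full)
    return resolved
```

Feeding the crawler only the output of such a filter avoids hand-building URLs with string concatenation.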

Output:

https://www.paulelliottbooks.com/zozergames.html 12152
https://www.paulelliottbooks.com/43-ad.html 9306
https://www.paulelliottbooks.com/warband.html 6142
https://www.paulelliottbooks.com/camp-cretaceous.html 2886
https://www.paulelliottbooks.com/free-rpgs.html 5217
https://www.paulelliottbooks.com/grunt.html 7927
https://www.paulelliottbooks.com/hostile.html 7232
https://www.paulelliottbooks.com/alien-breeds.html 4946
https://www.paulelliottbooks.com/crew-expendable.html 2786
https://www.paulelliottbooks.com/dirtside.html 4682
https://www.paulelliottbooks.com/hot-zone.html 2546
https://www.paulelliottbooks.com/marine-handbook.html 4700
https://www.paulelliottbooks.com/pioneer-class-station.html 4394
https://www.paulelliottbooks.com/roughnecks.html 4406
https://www.paulelliottbooks.com/technical-manual.html 2933
https://www.paulelliottbooks.com/tool-kits.html 2180
https://www.paulelliottbooks.com/zaibatsu.html 8555
https://www.paulelliottbooks.com/hostile-resources.html 3768
https://www.paulelliottbooks.com/low-tech-supplements.html 7142
https://www.paulelliottbooks.com/modern-war.html 3206
https://www.paulelliottbooks.com/orbital.html 8991
https://www.paulelliottbooks.com/far-horizon.html 7113
https://www.paulelliottbooks.com/outpost-mars.html 4513
https://www.paulelliottbooks.com/horizon-survey-craft.html 4778
https://www.paulelliottbooks.com/planetary-tool-kits.html 7581
https://www.paulelliottbooks.com/solo.html 8451
https://www.paulelliottbooks.com/traveller-freebies.html 16155
https://www.paulelliottbooks.com/universal-world-profile.html 8213
https://www.paulelliottbooks.com/zenobia-rpg.html 7760
https://www.paulelliottbooks.com/history-books.html 13427
https://www.paulelliottbooks.com/gallery.html 971
https://www.paulelliottbooks.com/contact.html 914
https://www.paulelliottbooks.com# 556
Traceback (most recent call last):
  File "C:\Users\golan\PycharmProjects\crawl_counter\venv\lib\site-packages\urllib3\connection.py", line 157, in _new_conn
    (self._dns_host, self.port), self.timeout, **extra_kw
  File "C:\Users\golan\PycharmProjects\crawl_counter\venv\lib\site-packages\urllib3\util\connection.py", line 61, in create_connection
    for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.7_3.7.1776.0_x64__qbz5n2kfra8p0\lib\socket.py", line 752, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno 11001] getaddrinfo failed

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\golan\PycharmProjects\crawl_counter\venv\lib\site-packages\urllib3\connectionpool.py", line 672, in urlopen
    chunked=chunked,
  File "C:\Users\golan\PycharmProjects\crawl_counter\venv\lib\site-packages\urllib3\connectionpool.py", line 376, in _make_request
    self._validate_conn(conn)
  File "C:\Users\golan\PycharmProjects\crawl_counter\venv\lib\site-packages\urllib3\connectionpool.py", line 994, in _validate_conn
    conn.connect()
  File "C:\Users\golan\PycharmProjects\crawl_counter\venv\lib\site-packages\urllib3\connection.py", line 300, in connect
    conn = self._new_conn()
  File "C:\Users\golan\PycharmProjects\crawl_counter\venv\lib\site-packages\urllib3\connection.py", line 169, in _new_conn
    self, "Failed to establish a new connection: %s" % e
urllib3.exceptions.NewConnectionError: <urllib3.connection.VerifiedHTTPSConnection object at 0x00000214FE5B7708>: Failed to establish a new connection: [Errno 11001] getaddrinfo failed

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\golan\PycharmProjects\crawl_counter\venv\lib\site-packages\requests\adapters.py", line 449, in send
    timeout=timeout
  File "C:\Users\golan\PycharmProjects\crawl_counter\venv\lib\site-packages\urllib3\connectionpool.py", line 720, in urlopen
    method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
  File "C:\Users\golan\PycharmProjects\crawl_counter\venv\lib\site-packages\urllib3\util\retry.py", line 436, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='www.paulelliottbooks.com_blank', port=443): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x00000214FE5B7708>: Failed to establish a new connection: [Errno 11001] getaddrinfo failed'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:/Users/golan/PycharmProjects/crawl_counter/crawl_counter.py", line 21, in <module>
    r = requests.get(specific_url)
  File "C:\Users\golan\PycharmProjects\crawl_counter\venv\lib\site-packages\requests\api.py", line 76, in get
    return request('get', url, params=params, **kwargs)
  File "C:\Users\golan\PycharmProjects\crawl_counter\venv\lib\site-packages\requests\api.py", line 61, in request
    return session.request(method=method, url=url, **kwargs)
  File "C:\Users\golan\PycharmProjects\crawl_counter\venv\lib\site-packages\requests\sessions.py", line 530, in request
    resp = self.send(prep, **send_kwargs)
  File "C:\Users\golan\PycharmProjects\crawl_counter\venv\lib\site-packages\requests\sessions.py", line 643, in send
    r = adapter.send(request, **kwargs)
  File "C:\Users\golan\PycharmProjects\crawl_counter\venv\lib\site-packages\requests\adapters.py", line 516, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='www.paulelliottbooks.com_blank', port=443): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x00000214FE5B7708>: Failed to establish a new connection: [Errno 11001] getaddrinfo failed'))

Process finished with exit code 1

Tags: python

Solution


I think I fixed it myself!

First, I made sure the script ignores the null and `_blank` hrefs that were causing the error message.

Then I did some more research and simplified my word counter considerably; it now seems to count accurately.

Any further suggestions for improving my script?

import requests
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("https://www.paulelliottbooks.com/")
bsObj = BeautifulSoup(html.read(), features="html.parser")

urls=[]
for link in bsObj.find_all('a'):
    if link.get('href') not in urls:
        urls.append(link.get('href'))
    else:
        pass
print(urls)

words=0
for url in urls:
    if url not in ["NULL", "_blank"]:
        specific_url="https://www.paulelliottbooks.com/"+url
        r = requests.get(specific_url)
        soup = BeautifulSoup(r.text, features="html.parser")
        for script in soup(["script", "style"]):
            script.extract()
        text = soup.get_text()
        lines = (line.strip() for line in text.splitlines())
        chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
        text = '\n'.join(chunk for chunk in chunks if chunk)
        text_list = text.split()
        print(f"{specific_url}: {len(text_list)} words")
        words += len(text_list)
print(words)

Output:

['/zozergames.html', '/43-ad.html', '/warband.html', '/camp-cretaceous.html', '/free-rpgs.html', '/grunt.html', '/hostile.html', '/alien-breeds.html', '/crew-expendable.html', '/dirtside.html', '/hot-zone.html', '/marine-handbook.html', '/pioneer-class-station.html', '/roughnecks.html', '/technical-manual.html', '/tool-kits.html', '/zaibatsu.html', '/hostile-resources.html', '/low-tech-supplements.html', '/modern-war.html', '/orbital.html', '/far-horizon.html', '/outpost-mars.html', '/horizon-survey-craft.html', '/planetary-tool-kits.html', '/solo.html', '/traveller-freebies.html', '/universal-world-profile.html', '/zenobia-rpg.html', '/history-books.html', '/gallery.html', '/contact.html', '#', '_blank']
https://www.paulelliottbooks.com/zozergames.html: 1148 words
https://www.paulelliottbooks.com/43-ad.html: 933 words
https://www.paulelliottbooks.com/warband.html: 610 words
https://www.paulelliottbooks.com/camp-cretaceous.html: 328 words
https://www.paulelliottbooks.com/free-rpgs.html: 535 words
https://www.paulelliottbooks.com/grunt.html: 811 words
https://www.paulelliottbooks.com/hostile.html: 726 words
https://www.paulelliottbooks.com/alien-breeds.html: 491 words
https://www.paulelliottbooks.com/crew-expendable.html: 311 words
https://www.paulelliottbooks.com/dirtside.html: 468 words
https://www.paulelliottbooks.com/hot-zone.html: 291 words
https://www.paulelliottbooks.com/marine-handbook.html: 470 words
https://www.paulelliottbooks.com/pioneer-class-station.html: 446 words
https://www.paulelliottbooks.com/roughnecks.html: 445 words
https://www.paulelliottbooks.com/technical-manual.html: 324 words
https://www.paulelliottbooks.com/tool-kits.html: 260 words
https://www.paulelliottbooks.com/zaibatsu.html: 792 words
https://www.paulelliottbooks.com/hostile-resources.html: 408 words
https://www.paulelliottbooks.com/low-tech-supplements.html: 678 words
https://www.paulelliottbooks.com/modern-war.html: 346 words
https://www.paulelliottbooks.com/orbital.html: 943 words
https://www.paulelliottbooks.com/far-horizon.html: 716 words
https://www.paulelliottbooks.com/outpost-mars.html: 518 words
https://www.paulelliottbooks.com/horizon-survey-craft.html: 497 words
https://www.paulelliottbooks.com/planetary-tool-kits.html: 831 words
https://www.paulelliottbooks.com/solo.html: 784 words
https://www.paulelliottbooks.com/traveller-freebies.html: 1490 words
https://www.paulelliottbooks.com/universal-world-profile.html: 826 words
https://www.paulelliottbooks.com/zenobia-rpg.html: 726 words
https://www.paulelliottbooks.com/history-books.html: 1207 words
https://www.paulelliottbooks.com/gallery.html: 161 words
https://www.paulelliottbooks.com/contact.html: 157 words
https://www.paulelliottbooks.com#: 127 words
19804

Process finished with exit code 0
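On the request for further suggestions: the per-page work could be factored into small functions, with the network call wrapped in error handling so one dead link cannot abort the whole run. A sketch under those assumptions, reusing the same script/style-stripping idea as the answer's code (`count_page` and `visible_word_count` are names invented here, and a shared `requests.Session` reuses the connection across the site's ~30 pages):

```python
import requests
from bs4 import BeautifulSoup

def visible_word_count(html):
    """Count whitespace-separated words in a page's visible text,
    ignoring <script> and <style> contents."""
    soup = BeautifulSoup(html, features="html.parser")
    for tag in soup(["script", "style"]):
        tag.extract()
    return len(soup.get_text().split())

def count_page(url, session, timeout=10):
    """Fetch one page and return its word count, or 0 on any request
    error so a single bad link cannot crash the crawl."""
    try:
        r = session.get(url, timeout=timeout)
        r.raise_for_status()
    except requests.RequestException:
        return 0
    return visible_word_count(r.text)
```

With this split, the main loop reduces to summing `count_page(url, session)` over the filtered URL list, and the pure `visible_word_count` half can be unit-tested without any network access.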
