python - Counting words on multiple web pages at the same domain
Problem description
I am writing a simple Python 3 web crawler intended to count the words on various sub-pages of a domain. The idea is to get all the sub-pages on the domain, then iterate through them and count the words in each.
My problem is that I'm getting various errors, such as urllib3.exceptions.NewConnectionError, and that word counts are inaccurate.
Once the code works, I plan to make it recursive so that it also counts the words in sub-pages of sub-pages.
I will be grateful for any suggestions to improve my code.
import requests
from collections import Counter
from string import punctuation
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("https://www.paulelliottbooks.com/")
bsObj = BeautifulSoup(html.read(), features="html.parser");

urls=[]
for link in bsObj.find_all('a'):
    if link.get('href') not in urls:
        urls.append(link.get('href'))
    else:
        pass
print(urls)

words=0
for url in urls:
    specific_url="https://www.paulelliottbooks.com"+url
    r = requests.get(specific_url)
    soup = BeautifulSoup(r.content, features="html.parser")
    text_div = (''.join(s.findAll(text=True)) for s in soup.findAll('div'))
    c_div = Counter((x.rstrip(punctuation).lower() for y in text_div for x in y.split()))
    word_count=(sum(c_div.values()))
    print(specific_url + " " + str(word_count))
    words += word_count
print(words)
Output:
https://www.paulelliottbooks.com/zozergames.html 12152
https://www.paulelliottbooks.com/43-ad.html 9306
https://www.paulelliottbooks.com/warband.html 6142
https://www.paulelliottbooks.com/camp-cretaceous.html 2886
https://www.paulelliottbooks.com/free-rpgs.html 5217
https://www.paulelliottbooks.com/grunt.html 7927
https://www.paulelliottbooks.com/hostile.html 7232
https://www.paulelliottbooks.com/alien-breeds.html 4946
https://www.paulelliottbooks.com/crew-expendable.html 2786
https://www.paulelliottbooks.com/dirtside.html 4682
https://www.paulelliottbooks.com/hot-zone.html 2546
https://www.paulelliottbooks.com/marine-handbook.html 4700
https://www.paulelliottbooks.com/pioneer-class-station.html 4394
https://www.paulelliottbooks.com/roughnecks.html 4406
https://www.paulelliottbooks.com/technical-manual.html 2933
https://www.paulelliottbooks.com/tool-kits.html 2180
https://www.paulelliottbooks.com/zaibatsu.html 8555
https://www.paulelliottbooks.com/hostile-resources.html 3768
https://www.paulelliottbooks.com/low-tech-supplements.html 7142
https://www.paulelliottbooks.com/modern-war.html 3206
https://www.paulelliottbooks.com/orbital.html 8991
https://www.paulelliottbooks.com/far-horizon.html 7113
https://www.paulelliottbooks.com/outpost-mars.html 4513
https://www.paulelliottbooks.com/horizon-survey-craft.html 4778
https://www.paulelliottbooks.com/planetary-tool-kits.html 7581
https://www.paulelliottbooks.com/solo.html 8451
https://www.paulelliottbooks.com/traveller-freebies.html 16155
https://www.paulelliottbooks.com/universal-world-profile.html 8213
https://www.paulelliottbooks.com/zenobia-rpg.html 7760
https://www.paulelliottbooks.com/history-books.html 13427
https://www.paulelliottbooks.com/gallery.html 971
https://www.paulelliottbooks.com/contact.html 914
https://www.paulelliottbooks.com# 556
Traceback (most recent call last):
File "C:\Users\golan\PycharmProjects\crawl_counter\venv\lib\site-packages\urllib3\connection.py", line 157, in _new_conn
(self._dns_host, self.port), self.timeout, **extra_kw
File "C:\Users\golan\PycharmProjects\crawl_counter\venv\lib\site-packages\urllib3\util\connection.py", line 61, in create_connection
for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.7_3.7.1776.0_x64__qbz5n2kfra8p0\lib\socket.py", line 752, in getaddrinfo
for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno 11001] getaddrinfo failed
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Users\golan\PycharmProjects\crawl_counter\venv\lib\site-packages\urllib3\connectionpool.py", line 672, in urlopen
chunked=chunked,
File "C:\Users\golan\PycharmProjects\crawl_counter\venv\lib\site-packages\urllib3\connectionpool.py", line 376, in _make_request
self._validate_conn(conn)
File "C:\Users\golan\PycharmProjects\crawl_counter\venv\lib\site-packages\urllib3\connectionpool.py", line 994, in _validate_conn
conn.connect()
File "C:\Users\golan\PycharmProjects\crawl_counter\venv\lib\site-packages\urllib3\connection.py", line 300, in connect
conn = self._new_conn()
File "C:\Users\golan\PycharmProjects\crawl_counter\venv\lib\site-packages\urllib3\connection.py", line 169, in _new_conn
self, "Failed to establish a new connection: %s" % e
urllib3.exceptions.NewConnectionError: <urllib3.connection.VerifiedHTTPSConnection object at 0x00000214FE5B7708>: Failed to establish a new connection: [Errno 11001] getaddrinfo failed
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Users\golan\PycharmProjects\crawl_counter\venv\lib\site-packages\requests\adapters.py", line 449, in send
timeout=timeout
File "C:\Users\golan\PycharmProjects\crawl_counter\venv\lib\site-packages\urllib3\connectionpool.py", line 720, in urlopen
method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
File "C:\Users\golan\PycharmProjects\crawl_counter\venv\lib\site-packages\urllib3\util\retry.py", line 436, in increment
raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='www.paulelliottbooks.com_blank', port=443): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x00000214FE5B7708>: Failed to establish a new connection: [Errno 11001] getaddrinfo failed'))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:/Users/golan/PycharmProjects/crawl_counter/crawl_counter.py", line 21, in <module>
r = requests.get(specific_url)
File "C:\Users\golan\PycharmProjects\crawl_counter\venv\lib\site-packages\requests\api.py", line 76, in get
return request('get', url, params=params, **kwargs)
File "C:\Users\golan\PycharmProjects\crawl_counter\venv\lib\site-packages\requests\api.py", line 61, in request
return session.request(method=method, url=url, **kwargs)
File "C:\Users\golan\PycharmProjects\crawl_counter\venv\lib\site-packages\requests\sessions.py", line 530, in request
resp = self.send(prep, **send_kwargs)
File "C:\Users\golan\PycharmProjects\crawl_counter\venv\lib\site-packages\requests\sessions.py", line 643, in send
r = adapter.send(request, **kwargs)
File "C:\Users\golan\PycharmProjects\crawl_counter\venv\lib\site-packages\requests\adapters.py", line 516, in send
raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='www.paulelliottbooks.com_blank', port=443): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x00000214FE5B7708>: Failed to establish a new connection: [Errno 11001] getaddrinfo failed'))
Process finished with exit code 1
Solution
I think I fixed it myself!
First, I made sure the script ignores the NULL and _blank URLs that were causing the error messages.
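The traceback names the host `www.paulelliottbooks.com_blank`, which is what plain string concatenation produces when an `href` does not start with a slash. As an alternative to a hard-coded skip list, here is a minimal sketch that resolves each `href` against the base URL with the standard library's `urllib.parse` and keeps only pages on the same host. The helper name `same_site_links` is hypothetical, and it takes a plain list of `href` strings rather than BeautifulSoup tags.

```python
from urllib.parse import urldefrag, urljoin, urlparse

BASE_URL = "https://www.paulelliottbooks.com/"


def same_site_links(hrefs, base_url=BASE_URL):
    """Resolve hrefs against base_url; keep unique URLs on the same host."""
    base_host = urlparse(base_url).hostname
    seen = set()
    result = []
    for href in hrefs:
        if not href:  # an <a> tag without an href yields None
            continue
        # urljoin handles "/page.html", "page.html" and absolute URLs alike;
        # urldefrag drops any "#fragment", so "#" maps back to the base page.
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        if urlparse(absolute).hostname != base_host:
            continue  # external site, mailto: link, and so on
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


# Concatenation glues "_blank" onto the host name, which DNS cannot resolve.
print(urlparse("https://www.paulelliottbooks.com" + "_blank").hostname)

print(same_site_links(["/solo.html", "#", None, "https://example.com/x", "/solo.html"]))
```

Note that an `href` of `_blank` resolves to `https://www.paulelliottbooks.com/_blank`, which is a well-formed URL on the right host. Whether such a page exists is a question for the HTTP status code (`r.status_code` in the script), not for the URL filter.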
Then I did some more research and greatly simplified my word counter, which now seems to do its job accurately.
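For comparison, the same idea (count whitespace-separated tokens in the visible text, skipping `<script>` and `<style>`) can be sketched with only the standard library's `html.parser`. This is a swapped-in technique, not the BeautifulSoup approach used in the script, and the names `VisibleTextParser` and `count_words` are hypothetical.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect text nodes, ignoring anything inside <script> or <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def count_words(html_text):
    """Return the number of whitespace-separated tokens in the visible text."""
    parser = VisibleTextParser()
    parser.feed(html_text)
    parser.close()  # flush any text still buffered by the parser
    return len(" ".join(parser.parts).split())


sample = """<html><head><style>p { color: red; }</style></head>
<body><p>Hello, <b>brave</b> new world!</p>
<script>var ignored = "not counted";</script></body></html>"""
print(count_words(sample))
```

Joining the text nodes with a space means adjacent elements such as `<p>one</p><p>two</p>` count as two words, at the cost of splitting a word that has inline markup in the middle of it. Which behaviour is "accurate" depends on the pages being counted.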
Are there any further suggestions for improving my script?
import requests
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("https://www.paulelliottbooks.com/")
bsObj = BeautifulSoup(html.read(), features="html.parser")

# Collect each distinct href found on the home page.
urls = []
for link in bsObj.find_all('a'):
    if link.get('href') not in urls:
        urls.append(link.get('href'))
print(urls)

words = 0
for url in urls:
    # Skip the placeholder values that are not real paths.
    if url not in ["NULL", "_blank"]:
        # The hrefs already start with "/", so the base has no trailing slash.
        specific_url = "https://www.paulelliottbooks.com" + url
        r = requests.get(specific_url)
        soup = BeautifulSoup(r.text, features="html.parser")
        # Remove <script> and <style> elements so only visible text is counted.
        for script in soup(["script", "style"]):
            script.extract()
        text = soup.get_text()
        # Strip each line, split it on spaces, and drop the empty pieces.
        lines = (line.strip() for line in text.splitlines())
        chunks = (phrase.strip() for line in lines for phrase in line.split(" "))
        text = '\n'.join(chunk for chunk in chunks if chunk)
        text_list = text.split()
        print(f"{specific_url}: {len(text_list)} words")
        words += len(text_list)
print(words)
Output:
['/zozergames.html', '/43-ad.html', '/warband.html', '/camp-cretaceous.html', '/free-rpgs.html', '/grunt.html', '/hostile.html', '/alien-breeds.html', '/crew-expendable.html', '/dirtside.html', '/hot-zone.html', '/marine-handbook.html', '/pioneer-class-station.html', '/roughnecks.html', '/technical-manual.html', '/tool-kits.html', '/zaibatsu.html', '/hostile-resources.html', '/low-tech-supplements.html', '/modern-war.html', '/orbital.html', '/far-horizon.html', '/outpost-mars.html', '/horizon-survey-craft.html', '/planetary-tool-kits.html', '/solo.html', '/traveller-freebies.html', '/universal-world-profile.html', '/zenobia-rpg.html', '/history-books.html', '/gallery.html', '/contact.html', '#', '_blank']
https://www.paulelliottbooks.com/zozergames.html: 1148 words
https://www.paulelliottbooks.com/43-ad.html: 933 words
https://www.paulelliottbooks.com/warband.html: 610 words
https://www.paulelliottbooks.com/camp-cretaceous.html: 328 words
https://www.paulelliottbooks.com/free-rpgs.html: 535 words
https://www.paulelliottbooks.com/grunt.html: 811 words
https://www.paulelliottbooks.com/hostile.html: 726 words
https://www.paulelliottbooks.com/alien-breeds.html: 491 words
https://www.paulelliottbooks.com/crew-expendable.html: 311 words
https://www.paulelliottbooks.com/dirtside.html: 468 words
https://www.paulelliottbooks.com/hot-zone.html: 291 words
https://www.paulelliottbooks.com/marine-handbook.html: 470 words
https://www.paulelliottbooks.com/pioneer-class-station.html: 446 words
https://www.paulelliottbooks.com/roughnecks.html: 445 words
https://www.paulelliottbooks.com/technical-manual.html: 324 words
https://www.paulelliottbooks.com/tool-kits.html: 260 words
https://www.paulelliottbooks.com/zaibatsu.html: 792 words
https://www.paulelliottbooks.com/hostile-resources.html: 408 words
https://www.paulelliottbooks.com/low-tech-supplements.html: 678 words
https://www.paulelliottbooks.com/modern-war.html: 346 words
https://www.paulelliottbooks.com/orbital.html: 943 words
https://www.paulelliottbooks.com/far-horizon.html: 716 words
https://www.paulelliottbooks.com/outpost-mars.html: 518 words
https://www.paulelliottbooks.com/horizon-survey-craft.html: 497 words
https://www.paulelliottbooks.com/planetary-tool-kits.html: 831 words
https://www.paulelliottbooks.com/solo.html: 784 words
https://www.paulelliottbooks.com/traveller-freebies.html: 1490 words
https://www.paulelliottbooks.com/universal-world-profile.html: 826 words
https://www.paulelliottbooks.com/zenobia-rpg.html: 726 words
https://www.paulelliottbooks.com/history-books.html: 1207 words
https://www.paulelliottbooks.com/gallery.html: 161 words
https://www.paulelliottbooks.com/contact.html: 157 words
https://www.paulelliottbooks.com#: 127 words
19804
Process finished with exit code 0
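The question mentions eventually following sub-pages of sub-pages. One way to sketch that is a breadth-first walk with a visited set and a depth limit, so that no page is counted twice and the crawl cannot loop. This is only an outline under stated assumptions: `crawl` and `fetch_links` are hypothetical names, `fetch_links(url)` stands in for "download the page and return its `href` values", and the in-memory `FAKE_SITE` replaces real HTTP requests so the example runs offline.

```python
from collections import deque
from urllib.parse import urldefrag, urljoin, urlparse


def crawl(start_url, fetch_links, max_depth=2):
    """Breadth-first walk of same-host pages, visiting each URL once.

    fetch_links(url) is a caller-supplied function that returns the raw
    href strings found on that page.
    """
    host = urlparse(start_url).hostname
    visited = {start_url}
    order = []
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth == max_depth:
            continue  # do not follow links beyond the depth limit
        for href in fetch_links(url):
            if not href:
                continue
            absolute, _fragment = urldefrag(urljoin(url, href))
            if urlparse(absolute).hostname == host and absolute not in visited:
                visited.add(absolute)
                queue.append((absolute, depth + 1))
    return order


# Stand-in for real HTTP requests: a tiny in-memory "site".
FAKE_SITE = {
    "https://example.com/": ["/a.html", "/b.html", "#"],
    "https://example.com/a.html": ["/", "/c.html", "https://other.example/x"],
    "https://example.com/b.html": ["/a.html"],
    "https://example.com/c.html": [],
}

for page in crawl("https://example.com/", lambda u: FAKE_SITE.get(u, [])):
    print(page)
```

In a real run, `fetch_links` would wrap the `requests.get` and `find_all('a')` calls from the script, and the word count would be taken for each URL in the returned list.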