Inaccurate word count for a web page

Problem description

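Noob here, trying to build a word counter that counts the words displayed on a website. I found some code (for counting the words on a web page), modified it, tried it on Google, and found that the result was way off. Other code I tried returned all the HTML tags as well, which did not help either. If the visible page content is "Hello there world", I am looking for a count of 3. For now, I am not concerned with words inside image files (pictures). My modified code is below: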

import requests
from bs4 import BeautifulSoup
from collections import Counter
from string import punctuation

# Page you want to count words from
page = "https://google.com"

# Get the page
r = requests.get(page)
soup = BeautifulSoup(r.content, "html.parser")

# We get the words within paragraphs
text_p = (''.join(s.findAll(text=True))for s in soup.findAll('p'))
# creates a dictionary of words and frequency from paragraphs
content_paras = Counter((x.rstrip(punctuation).lower() for y in text_p for x in y.split()))
sum_of_paras = sum(content_paras.values())

# We get the words within divs
text_div = (''.join(s.findAll(text=True))for s in soup.findAll('div'))
content_div = Counter((x.rstrip(punctuation).lower() for y in text_div for x in y.split()))
sum_of_divs = sum(content_div.values())

words_on_page = sum_of_paras + sum_of_divs
print(words_on_page)

As always, a simple answer I can follow beats a complex/elegant answer I can't, because I'm a noob.

Tags: python-3.x, beautifulsoup

Solution
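
The code in the question most likely overcounts for two reasons. A <p> that sits inside a <div> has its text counted once for the paragraph and again for the div, and nested divs repeat the same text once per level. On top of that, text inside <script> and <style> tags is counted even though a browser never displays it. One way to avoid both problems is to walk the page once and count each piece of text a single time, skipping the hidden tags. What follows is a minimal sketch of that idea, not a definitive implementation. It uses only the standard library's html.parser rather than BeautifulSoup so that it runs without extra installs. The names HIDDEN_TAGS, VisibleTextParser, count_visible_words and count_words_at are made up for this example. It also assumes the text is present in the HTML the server sends: anything a page builds with JavaScript, or hides with CSS, will not be counted the way a browser shows it.

```python
from html.parser import HTMLParser
from string import punctuation
from urllib.request import urlopen

# Tags whose text a browser does not show in the page body (an assumed list).
HIDDEN_TAGS = {"script", "style", "noscript", "template", "title"}


class VisibleTextParser(HTMLParser):
    """Collects each piece of text once, skipping text inside HIDDEN_TAGS."""

    def __init__(self):
        super().__init__()
        self.hidden_depth = 0  # > 0 while we are inside a hidden tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in HIDDEN_TAGS:
            self.hidden_depth += 1

    def handle_endtag(self, tag):
        if tag in HIDDEN_TAGS and self.hidden_depth > 0:
            self.hidden_depth -= 1

    def handle_data(self, data):
        if self.hidden_depth == 0:
            self.chunks.append(data)


def count_visible_words(html):
    """Return the number of words in the visible text of an HTML string."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    # Join with spaces so words in neighbouring tags do not run together.
    text = " ".join(parser.chunks)
    # Strip punctuation from both ends and drop tokens that were only punctuation.
    words = [w.strip(punctuation) for w in text.split()]
    return len([w for w in words if w])


def count_words_at(url):
    """Download a page and count its visible words (needs network access)."""
    with urlopen(url) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        html = response.read().decode(charset, errors="replace")
    return count_visible_words(html)


sample = (
    "<html><head><title>Demo</title><style>p {color: red}</style></head>"
    "<body><div><p>Hello there world</p></div>"
    "<script>var x = 1;</script></body></html>"
)
print(count_visible_words(sample))  # 3
```

To try it on a live page, call count_words_at("https://google.com"). If you would rather stay with BeautifulSoup, the same idea applies: remove the script and style tags with decompose(), then split the result of soup.get_text() once for the whole page instead of adding up separate counts for p and div.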

