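python - Python - Using dictionaries and tuples to find unique counts of words and letters
Problem description
I am trying to write a script that goes through the text in a file and counts the total number of words and the number of distinct words, lists the 10 most common words with their counts, and sorts the character frequencies from most to least frequent.
This is what I have so far: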
import sys
import os
os.getcwd()
import string
path = ""
os.chdir(path)
#Prompt for user to input filename:
fname = input('Enter the filename: ')
try:
    fhand = open(fname)
except IOError:
    #Invalid filename error
    print('\n')
    print("Sorry, file can't be opened! Please check your spelling.")
    sys.exit()
#Initialize char counts and word counts dictionary
counts = {}
worddict = {}
#For character and word frequency count
for line in fhand:
    #Remove leading spaces
    line = line.strip()
    #Convert everything in the string to lowercase
    line = line.lower()
    #Take into account punctuation
    line = line.translate(line.maketrans('', '', string.punctuation))
    #Take into account white spaces
    line = line.translate(line.maketrans('', '', string.whitespace))
    #Take into account digits
    line = line.translate(line.maketrans('', '', string.digits))
    #Splitting line into words
    words = line.split(" ")
    for word in words:
        #Is the word already in the word dictionary?
        if word in worddict:
            #Increase by 1
            worddict[word] += 1
        else:
            #Add word to dictionary with count of 1 if not there already
            worddict[word] = 1
    #Character count
    for word in line:
        #Increase count by 1 if letter
        if word in counts:
            counts[word] += 1
        else:
            counts[word] = 1
#Initialize dictionaries
lst = []
countlst = []
freqlst = []
#Count up the number of letters
for ltrs, c in counts.items():
    lst.append((c,ltrs))
    countlst.append(c)
#Sum up the count
totalcount = sum(countlst)
#Calculate the frequency in each dictionary
for ec in countlst:
    efreq = (ec/totalcount) * 100
    freqlst.append(efreq)
#Sort lists by count and percentage frequency
freqlst.sort(reverse=True)
lst.sort(reverse=True)
#Print out word counts
for key in list(worddict.keys()):
    print(key, ":", worddict[key])
#Print out all letters and counts:
for ltrs, c, in lst:
    print(c, '-', ltrs, '-', round(ltrs/totalcount*100, 2), '%')
When I run the script on something like romeo.txt:
But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
Who is already sick and pale with grief
I get this output:
butsoftwhatlightthroughyonderwindowbreaks : 1
itistheeastandjulietisthesun : 1
arisefairsunandkilltheenviousmoon : 1
whoisalreadysickandpalewithgrief : 1
i - 14 - 10.45 %
t - 12 - 8.96 %
e - 12 - 8.96 %
s - 11 - 8.21 %
a - 11 - 8.21 %
n - 9 - 6.72 %
h - 9 - 6.72 %
o - 8 - 5.97 %
r - 7 - 5.22 %
u - 6 - 4.48 %
l - 6 - 4.48 %
d - 6 - 4.48 %
w - 5 - 3.73 %
k - 3 - 2.24 %
g - 3 - 2.24 %
f - 3 - 2.24 %
y - 2 - 1.49 %
b - 2 - 1.49 %
v - 1 - 0.75 %
p - 1 - 0.75 %
m - 1 - 0.75 %
j - 1 - 0.75 %
c - 1 - 0.75 %
When I run the script on frequency.txt:
I am you you you you you I I I I you you you you I am
I get this output:
iamyouyouyouyouyouiiiiyouyouyouyouiam : 1
y - 9 - 24.32 %
u - 9 - 24.32 %
o - 9 - 24.32 %
i - 6 - 16.22 %
m - 2 - 5.41 %
a - 2 - 5.41 %
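Could I get some guidance on how to keep the words on each line separate so that they are counted as distinct words, and on how to total the counts the way I want?
Solution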
line = line.translate(line.maketrans('', '', string.whitespace))
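This line removes all whitespace from the line. Because the spaces are gone before line.split(" ") runs, each line is treated as one long word. Remove it, and the script should work the way you intend.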
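Two side effects are worth watching for once that line is gone: the character loop iterates over the whole line, so it would start counting spaces, and split(" ") can return empty strings for blank lines or runs of consecutive spaces. The following is a minimal sketch of one way to handle both and produce the requested totals. It assumes that collections.Counter is an acceptable replacement for the hand-built dictionaries; the names analyze and sample are hypothetical and do not come from the original script, and the sample text is the frequency.txt line from the question.

```python
import string
from collections import Counter


def analyze(lines):
    """Hypothetical helper: return (word_counts, char_counts) for text lines."""
    # Strip punctuation and digits only; whitespace stays so words can be split.
    table = str.maketrans('', '', string.punctuation + string.digits)
    word_counts = Counter()
    char_counts = Counter()
    for line in lines:
        cleaned = line.lower().translate(table)
        # split() with no argument splits on any run of whitespace
        # and never returns empty strings.
        words = cleaned.split()
        word_counts.update(words)
        # Count characters word by word so that spaces are not counted.
        for word in words:
            char_counts.update(word)
    return word_counts, char_counts


# Sample input taken from frequency.txt in the question.
sample = ["I am you you you you you I I I I you you you you I am"]
word_counts, char_counts = analyze(sample)

print("Total words:", sum(word_counts.values()))
print("Distinct words:", len(word_counts))

# most_common(10) returns up to 10 (word, count) pairs, highest count first.
for word, n in word_counts.most_common(10):
    print(word, ":", n)

# Character frequencies, most frequent first, as a percentage of all letters.
total_chars = sum(char_counts.values())
for ch, n in char_counts.most_common():
    print(ch, '-', n, '-', round(n / total_chars * 100, 2), '%')
```

To run this against a file instead of the sample list, an open file object can be passed to analyze directly, since iterating over a file yields its lines.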