Python - Finding unique counts of words and letters using dictionaries and tuples

Question

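I am currently trying to write a script that runs through the text in a file and counts the total number of words and the number of distinct words, lists the 10 most common words with their counts, and sorts the characters by frequency from most to least frequent.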

This is what I have so far:

import sys
import os
os.getcwd()
import string

path = ""
os.chdir(path)

#Prompt for user to input filename:
fname = input('Enter the filename: ')

try:
    fhand = open(fname)
except IOError:
    #Invalid filename error
    print('\n')
    print("Sorry, file can't be opened! Please check your spelling.")
    sys.exit()

#Initialize char counts and word counts dictionary
counts = {}
worddict = {}

#For character and word frequency count
for line in fhand:
        #Remove leading spaces
        line = line.strip()
        #Convert everything in the string to lowercase
        line = line.lower()
        #Take into account punctuation        
        line = line.translate(line.maketrans('', '', string.punctuation))
        #Take into account white spaces
        line = line.translate(line.maketrans('', '', string.whitespace))
        #Take into account digits
        line = line.translate(line.maketrans('', '', string.digits))

        #Splitting line into words
        words = line.split(" ")

        for word in words:
            #Is the word already in the word dictionary?
            if word in worddict:
                #Increase by 1
                worddict[word] += 1
            else:
                #Add word to dictionary with count of 1 if not there already
                worddict[word] = 1

        #Character count
        for word in line:
            #Increase count by 1 if letter
            if word in counts:
                counts[word] += 1
            else:
                counts[word] = 1

#Initialize dictionaries
lst = []
countlst = []
freqlst = []

#Count up the number of letters
for ltrs, c in counts.items():
    lst.append((c,ltrs))
    countlst.append(c)

#Sum up the count
totalcount = sum(countlst)

#Calculate the frequency in each dictionary
for ec in countlst:
    efreq = (ec/totalcount) * 100
    freqlst.append(efreq)

#Sort lists by count and percentage frequency
freqlst.sort(reverse=True)
lst.sort(reverse=True)

#Print out word counts
for key in list(worddict.keys()):
    print(key, ":", worddict[key])

#Print out all letters and counts:
for ltrs, c, in lst:
    print(c, '-', ltrs, '-', round(ltrs/totalcount*100, 2), '%')

When I run the script on something like romeo.txt:

But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
Who is already sick and pale with grief

I get this output:

butsoftwhatlightthroughyonderwindowbreaks : 1
itistheeastandjulietisthesun : 1
arisefairsunandkilltheenviousmoon : 1
whoisalreadysickandpalewithgrief : 1
i - 14 - 10.45 %
t - 12 - 8.96 %
e - 12 - 8.96 %
s - 11 - 8.21 %
a - 11 - 8.21 %
n - 9 - 6.72 %
h - 9 - 6.72 %
o - 8 - 5.97 %
r - 7 - 5.22 %
u - 6 - 4.48 %
l - 6 - 4.48 %
d - 6 - 4.48 %
w - 5 - 3.73 %
k - 3 - 2.24 %
g - 3 - 2.24 %
f - 3 - 2.24 %
y - 2 - 1.49 %
b - 2 - 1.49 %
v - 1 - 0.75 %
p - 1 - 0.75 %
m - 1 - 0.75 %
j - 1 - 0.75 %
c - 1 - 0.75 %

When I run the script on frequency.txt:

I am you you you you you I I I I you you you you I am

I get this output:

iamyouyouyouyouyouiiiiyouyouyouyouiam : 1
y - 9 - 24.32 %
u - 9 - 24.32 %
o - 9 - 24.32 %
i - 6 - 16.22 %
m - 2 - 5.41 %
a - 2 - 5.41 %

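Could I get some guidance on how to keep the words on each line separate so that they are counted as distinct words, and how to total the counts in the desired way?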

Tags: python, dictionary, tuples

Answer


line = line.translate(line.maketrans('', '', string.whitespace))

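This line strips every whitespace character from the line, so the later line.split(" ") has no spaces left to split on and each line ends up counted as a single word. Remove it and the word counts should work as intended.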
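One side effect worth noting: with the whitespace left in, the character loop in the question would start counting spaces as characters too. The following is a minimal sketch of one way to handle both counts, not the asker's final script. The helper names analyze and by_count_desc are hypothetical, and the sample string is simply the frequency.txt line from the question. It keeps the whitespace, splits with split() (no argument), and counts characters word by word so that spaces never reach the character dictionary.

```python
import string


def analyze(text):
    """Return (total_words, word_counts, char_counts) for the given text.

    Hypothetical helper for illustration; not part of the original script.
    """
    # Remove punctuation and digits only. Whitespace is kept so that
    # the words on each line can still be separated.
    table = str.maketrans('', '', string.punctuation + string.digits)
    word_counts = {}
    char_counts = {}
    total_words = 0
    for line in text.splitlines():
        cleaned = line.lower().translate(table)
        # split() with no argument splits on any run of whitespace
        # and never produces empty strings.
        for word in cleaned.split():
            total_words += 1
            word_counts[word] = word_counts.get(word, 0) + 1
            # Count characters per word, so spaces are never counted.
            for ch in word:
                char_counts[ch] = char_counts.get(ch, 0) + 1
    return total_words, word_counts, char_counts


def by_count_desc(counts):
    """Return a list of (count, key) tuples, highest count first."""
    return sorted(((c, k) for k, c in counts.items()), reverse=True)


# Sample input: the single line of frequency.txt shown in the question.
sample = "I am you you you you you I I I I you you you you I am"

total, words, chars = analyze(sample)
print("Total words:", total)
print("Distinct words:", len(words))

# Top 10 most common words with their counts.
for count, word in by_count_desc(words)[:10]:
    print(word, ":", count)

# Characters from most to least frequent, with percentage of all letters.
total_chars = sum(chars.values())
for count, ch in by_count_desc(chars):
    print(ch, '-', count, '-', round(count / total_chars * 100, 2), '%')
```

To run this against a file instead of the sample string, the file's contents could be passed in, for example analyze(fhand.read()) with the fhand handle from the question. Sorting (count, key) tuples in reverse breaks ties by reverse alphabetical order of the key, which matches the ordering seen in the question's output.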

