python - How to sort a dictionary by relative word frequency in two txt files
Problem description
I'm trying to write some code that reads two separate text files, filters out common words, calculates the frequency of each word in each file, and finally outputs the words in order of their relative frequency between the two files. Ideally, words that are relatively more frequent in file 1 should appear at the top of the list, words that are relatively more frequent in file 2 should appear at the bottom, and words that appear in both should be in the middle. For example:
word, freq file 1, freq file 2
Cat, 5, 0
Dog, 4, 0
Mouse, 2, 2
Carrot, 1, 4
Lettuce, 0, 5
My code currently outputs the words in order of their frequency in file 1, but I can't figure out how to arrange the list so that the words more common in file 2 appear at the bottom. I get that I need to subtract the frequency of a word in file 1 from the frequency of the same word in file 2, but I can't figure out how to operate on the list of counts stored for each word in the dictionary...
Please help!
import re
f1=open('file1.txt','r', encoding="utf-8") #file 1
f2=open('file2.txt','r', encoding="utf-8") #file 2
file_list = [f1, f2] # This will hold all the files
num_files = len(file_list)
stopwords = ["a", "and", "the", "i", "of", "this", "it", "but", "is", "in", "im", "my", "to", "for", "as", "on", "helpful", "comment", "report", "stars", "reviewed", "united", "kingdom", "was", "with", "-", "it", "not", "about", "which", "so", "at", "out", "abuse", "than","any", "if", "be", "can", "its", "customer", "dont", "just", "other", "too", "only", "people", "found", "helpful", "have", "wasnt", "purchase", "do", "only", "bought", "etc", "verified", "", "wasnt", "thanks", "thanx", "could", "think", "your", "thing", "much", "ive", "you", "they", "vine", "had", "more", "that"]
frequencies = {} # One dictionary to hold the frequencies
for i, f in enumerate(file_list): # Loop over the files, keeping an index i
    for line in f: # Get the lines of that file
        for word in line.split(): # Get the words of that file
            word = re.sub(r'[^\w\s]','',word) # Strip punctuation
            word = word.lower() # Make lowercase
            if not word in stopwords: # Remove stopwords
                if not word.isdigit(): # Ignore digits
                    if not word in frequencies:
                        frequencies[word] = [0 for _ in range(num_files)] # make a list of 0's for any word not seen yet -- one 0 for each file
                    frequencies[word][i] += 1 # Increment the frequency count for that word and file
frequency_sorted = sorted(frequencies, key=frequencies.get, reverse=True)
for r in frequency_sorted:
    print(r, frequencies[r])
Solution
You're overcomplicating things. This should help you:
import string
from collections import Counter

def get_freqs(name):
    # Count how often each word occurs in the file `name`
    with open(name) as fin:
        text = fin.read().lower()
    # Replace every character that is not an ASCII letter with a space
    words = ''.join(i if i in string.ascii_letters else ' ' for i in text)
    words = [w for w in words.split() if len(w) > 0]
    return Counter(words)

freqs1 = get_freqs('file1.txt')
freqs2 = get_freqs('file2.txt')
all_words = set(freqs1.keys()) | set(freqs2.keys()) # - set(stop_words) ?
freqs_sorted = sorted((freqs1[w], freqs2[w], w) for w in all_words)
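Note that sorting the `(freqs1[w], freqs2[w], w)` tuples orders the words by their count in file 1, lowest first, rather than by how the two counts compare. One way to get the ordering asked for in the question is to sort on the difference between the two counts. The following is only a sketch under that assumption: `relative_order` is a hypothetical helper name, and the toy counts simply mirror the example table in the question.

```python
from collections import Counter

def relative_order(freqs1, freqs2):
    """Return (word, count1, count2) rows, words favouring file 1 first.

    Hypothetical helper, not part of the original answer.
    """
    all_words = set(freqs1) | set(freqs2)
    # A Counter returns 0 for missing words, so every word has two counts.
    rows = ((w, freqs1[w], freqs2[w]) for w in all_words)
    # Sort ascending on (count2 - count1): words that are more frequent in
    # file 1 get the most negative key and therefore come first.
    # Ties are broken alphabetically by the word itself.
    return sorted(rows, key=lambda row: (row[2] - row[1], row[0]))

# Toy counts (assumed data, chosen to match the example in the question).
freqs1 = Counter({"cat": 5, "dog": 4, "mouse": 2, "carrot": 1})
freqs2 = Counter({"mouse": 2, "carrot": 4, "lettuce": 5})

for word, count1, count2 in relative_order(freqs1, freqs2):
    print(word, count1, count2)
```

The same key idea should carry over to the `frequencies` dictionary of lists from the question, for example `sorted(frequencies, key=lambda w: frequencies[w][1] - frequencies[w][0])`.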
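If you are worried about stop words, you could change all_words = set(freqs1.keys()) | set(freqs2.keys())
to all_words = (set(freqs1.keys()) | set(freqs2.keys())) - set(stop_words), where stop_words is your list of stop words,
or something similar.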
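One detail to watch when subtracting the stop words: in Python, `-` binds more tightly than `|`, so the union needs its own parentheses or only the words from the second file get filtered. A small sketch of the difference, using a made-up stop-word list and toy counts that are assumptions for illustration only:

```python
from collections import Counter

# Hypothetical stop-word list and toy counts, for illustration only.
stop_words = ["the", "a", "and"]
freqs1 = Counter(["the", "cat", "the", "dog"])
freqs2 = Counter(["a", "carrot", "the"])

# Without parentheses this parses as
#   set(freqs1) | (set(freqs2) - set(stop_words))
# so stop words that occur in file 1 survive.
partly_filtered = set(freqs1) | set(freqs2) - set(stop_words)

# With parentheses the stop words are removed from the combined vocabulary.
all_words = (set(freqs1) | set(freqs2)) - set(stop_words)

print(sorted(partly_filtered))
print(sorted(all_words))
```

Filtering the combined set once keeps the counting step independent of the stop-word list, which makes it easy to try different lists without re-reading the files.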