How to sort a dictionary by relative word frequency in two txt files

Problem description

I'm trying to write some code that reads two separate text files, filters out common words, calculates the frequency of the words in each file, and finally outputs the words in order of their relative frequency between the two lists. The ideal output is that words relatively more frequent in file 1 appear at the top of the list, words relatively more frequent in file 2 appear at the bottom, and words that appear in both sit in the middle. For example:

word     freq file 1   freq file 2
Cat      5             0
Dog      4             0
Mouse    2             2
Carrot   1             4
Lettuce  0             5

My code currently outputs the words in order of their frequency in file 1, but I can't figure out how to arrange the list so that the words more common in file 2 appear at the bottom. I get that I need to subtract the frequency of words in file 1 from the frequency of the same words in file 2, but I can't figure out how to operate on the tuple in the dictionary...

Please help!

import re

f1=open('file1.txt','r', encoding="utf-8") #file 1
f2=open('file2.txt','r', encoding="utf-8") #file 2

file_list = [f1, f2] # This will hold all the files

num_files = len(file_list)

stopwords = ["a", "and", "the", "i", "of", "this", "it", "but", "is", "in", "im", "my", "to", "for", "as", "on", "helpful", "comment", "report", "stars", "reviewed", "united", "kingdom", "was", "with", "-", "it", "not", "about", "which", "so", "at", "out", "abuse", "than","any", "if", "be", "can", "its", "customer", "dont", "just", "other", "too", "only", "people", "found", "helpful", "have", "wasnt", "purchase", "do", "only", "bought", "etc", "verified", "", "wasnt", "thanks", "thanx", "could", "think", "your", "thing", "much", "ive", "you", "they", "vine", "had", "more", "that"]

frequencies = {} # One dictionary to hold the frequencies

for i, f in enumerate(file_list):                # Loop over the files, keeping an index i
    for line in f:                               # Get the lines of that file
        for word in line.split():                # Get the words of that line
            word = re.sub(r'[^\w\s]', '', word)  # Strip punctuation
            word = word.lower()                  # Make lowercase
            if word not in stopwords:            # Skip stopwords
                if not word.isdigit():           # Ignore digits
                    if word not in frequencies:
                        frequencies[word] = [0 for _ in range(num_files)]  # One 0 per file for any word not seen yet
                    frequencies[word][i] += 1    # Increment the count for that word and file

frequency_sorted = sorted(frequencies, key=frequencies.get, reverse=True)
for r in frequency_sorted:
    print (r, frequencies[r])
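
For reference, the subtraction idea described in the question can be applied directly to the per-file counts stored in frequencies. A minimal sketch (assuming the frequencies dict built by the code above, mapping each word to [count in file 1, count in file 2]) might look like this:

# Sort by (count in file 1) minus (count in file 2), descending, so words
# dominant in file 1 come first and words dominant in file 2 come last.
by_relative_freq = sorted(frequencies.items(),
                          key=lambda item: item[1][0] - item[1][1],
                          reverse=True)

for word, counts in by_relative_freq:
    print(word, counts)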

Tags: python

Solution

You are overcomplicating things. This should help:

import string
from collections import Counter

def get_freqs(name):
    """Return a Counter of the lowercase words in the given file."""
    with open(name, encoding="utf-8") as fin:
        text = fin.read().lower()

    # Replace every non-letter character with a space, then split into words
    words = ''.join(c if c in string.ascii_letters else ' ' for c in text)
    words = [w for w in words.split() if len(w) > 0]
    return Counter(words)

freqs1 = get_freqs('file1.txt')
freqs2 = get_freqs('file2.txt')

all_words = set(freqs1.keys()) | set(freqs2.keys())  # - set(stop_words) ?
freqs_sorted = sorted((freqs1[w], freqs2[w], w) for w in all_words)

If you are worried about stop words, you could change all_words = set(freqs1.keys()) | set(freqs2.keys()) to all_words = (set(freqs1.keys()) | set(freqs2.keys())) - set(stop_words), or something similar.
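
Note that freqs_sorted above is ordered by the raw count in file 1 (ascending), with ties broken by the count in file 2. To get exactly the ordering described in the question (file-1-dominant words first, file-2-dominant words last), you could sort by the difference between the two counts instead. A sketch along those lines, using a hypothetical stop_words set that is not part of the answer above, might be:

stop_words = {"a", "and", "the", "of"}   # hypothetical stop-word list; extend as needed

all_words = (set(freqs1) | set(freqs2)) - stop_words

# Counter returns 0 for missing words, so this works even if a word
# appears in only one of the files.
relative = sorted(all_words, key=lambda w: freqs1[w] - freqs2[w], reverse=True)

for w in relative:
    print(w, freqs1[w], freqs2[w])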

