python - Create a dictionary from a file

You can use a regular expression and simplify the whole process:

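import re

def dt_fr_file(file_name):
    # Read the whole file into one string
    with open(file_name) as f:
        txt = f.read()
    # Split the text on runs of non-word characters
    words = re.split(r'\W+', txt)
    # Map each distinct word to its number of occurrences
    words = {word: words.count(word) for word in set(words)}
    return words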

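From the documentation:

\W matches any character which is not a word character. This is the opposite of \w. If the ASCII flag is used, this becomes the equivalent of [^a-zA-Z0-9_]. If the LOCALE flag is used, it matches characters which are neither alphanumeric in the current locale nor the underscore.

+ causes the resulting RE to match 1 or more repetitions of the preceding RE. ab+ will match 'a' followed by any non-zero number of 'b's; it will not match just 'a'.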

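So, with the ASCII flag, \W+ splits on every run of characters other than a to z, A to Z, 0 to 9 and _. As suggested in the comments, this can be language-sensitive (for example, with accented letters outside the ASCII range). In that case, you can adapt the code to your language by setting

words = re.split('[^a-zA-Z0-9_àéèêùç]+', txt)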
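
To see how the choice of pattern changes the result, here is a small sketch on an in-memory string. The sample text and variable names are assumptions made for illustration, not part of the original question. It compares the default \W+, the same pattern with re.ASCII, and the custom character class above.

```python
import re

sample = "café au lait!"  # hypothetical sample text

# Default for str patterns in Python 3: \w is Unicode-aware,
# so accented letters stay inside words.
unicode_words = re.split(r'\W+', sample)

# With re.ASCII, \W is equivalent to [^a-zA-Z0-9_],
# so 'é' is treated as a separator.
ascii_words = re.split(r'\W+', sample, flags=re.ASCII)

# Explicit character class listing the extra letters to keep.
custom_words = re.split(r'[^a-zA-Z0-9_àéèêùç]+', sample)

print(unicode_words)
print(ascii_words)
print(custom_words)
```

Note that each result ends with an empty string, because the text ends with a separator ('!'). You may want to filter such empty strings out before counting.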

Edit: using Stef's suggestion is indeed faster:

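import re
from collections import Counter

def dt_fr_file(file_name):
    with open(file_name) as f:
        txt = f.read()
    # Split the text on runs of non-word characters
    words = re.split(r'\W+', txt)
    # Counter tallies all words in a single pass
    words = Counter(words)
    return words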
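
A Counter is a dict subclass, so the result can be used like the dictionary from the first version, with a few extras. A minimal sketch on an in-memory string follows. The sample text is an assumption, and the lowercasing and the filtering of empty strings are additions that the functions above do not perform.

```python
import re
from collections import Counter

text = "The cat and the hat. The end."  # hypothetical sample text

# Lowercase, split on non-word runs, and drop the empty strings
# that re.split leaves for leading or trailing separators.
words = [w for w in re.split(r'\W+', text.lower()) if w]
counts = Counter(words)

print(counts['the'])
print(counts.most_common(1))
print(counts['missing'])  # a missing key counts as 0 instead of raising KeyError
```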

Edit 2: without any regular expression or other library, but this is not efficient:

def dt_fr_file(file_name):
    with open(file_name) as f:
        txt = f.read()
    # Replace each separator with a space so str.split() can do the rest
    split_on = {"'", ","}
    for separator in split_on:
        txt = txt.replace(separator, ' ')
    words = txt.split()
    dict_words = dict()
    # Iterate over every word (not set(words)) so that repeats are counted
    for word in words:
        if word in dict_words:
            dict_words[word] += 1
        else:
            dict_words[word] = 1

    return dict_words
