Creating profiles and using counts for a dictionary

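Problem description

This is going to be hard to explain, but I will do my best.

So I have a text file that contains a single paragraph. I recently reduced that paragraph to its unique words only (no stop words). An example is shown here: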

['mississippi worth reading about', ' commonplace river contrary ways remarkable', ' considering missouri main branch longest river world--four miles', ' seems safe crookedest river world part journey uses cover ground crow fly six seventy-five', ' discharges water st', ' lawrence twenty-five rhine thirty-eight thames', ' river vast drainage-basin draws water supply twenty-eight states territories delaware atlantic seaboard country idaho pacific slope spread forty-five degrees longitude', ' mississippi receives carries gulf water fifty-four subordinate rivers navigable steamboats hundreds navigable flats keels', ' area drainage-basin combined areas england wales scotland ireland france spain portugal germany austria italy turkey almost wide region fertile mississippi valley proper exceptionally so']

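What I did here was split the paragraph into sentences and remove any punctuation. Then I put the result into a list.

So, for example, if the list is called temp and I run print(temp[0]), it outputs: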

'mississippi worth reading about'
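
For readers who want to reproduce a list shaped like temp, the preprocessing described above could be sketched roughly as follows. This is not the code used in the post: the to_sentences helper, the STOP_WORDS set and the regular expression are all assumptions made for illustration, and the real stop-word list is not shown.

```python
import re

# Hypothetical stop-word list; the one actually used in the post is not shown.
STOP_WORDS = {"the", "is", "a", "it", "in", "of", "and", "to", "on", "not", "well", "but"}

def to_sentences(paragraph):
    """Split on full stops, lowercase, drop punctuation and stop words."""
    sentences = []
    for raw in paragraph.lower().split("."):
        # Keep letters, digits, hyphens and whitespace; drop everything else.
        cleaned = re.sub(r"[^a-z0-9\-\s]", "", raw)
        words = [w for w in cleaned.split() if w not in STOP_WORDS]
        if words:  # skip fragments that end up empty
            sentences.append(" ".join(words))
    return sentences

temp = to_sentences("The Mississippi is well worth reading about. It is not a commonplace river.")
print(temp[0])
```

Splitting on every full stop also cuts abbreviations in two, which is consistent with 'st' and 'lawrence' landing in separate elements of the list above.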

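Great. However, the next step is where I am stuck: I am trying to create a mini thesaurus using the cosine similarity equation, which some of you may be familiar with.

But first I want to create some profiles. I will use 'river' as an example profile. In the temp list, each element is a sentence. What I want to achieve is this: for every sentence that contains the word river, count every other word in that sentence.

So if I have 'commonplace river contrary ways remarkable', which is temp[1], then the start of the dictionary built with the count method would be: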

{'commonplace': 1, 'river': 1, 'contrary': 1, 'ways': 1, 'remarkable': 1,}

At first, the output would look like this:

river 1 (profile word)
   commonplace: 1
   contrary: 1
   remarkable: 1
   ways: 1

So, taken over every sentence that contains river, the final output should be:

river 4 (profile)
    atlantic: 1
    branch: 1
    commonplace: 1
    considering: 1
    contrary: 1
    country: 1
    cover: 1
    crookedest: 1
    crow: 1
    degrees: 1
    delaware: 1
    drainage-basin: 1
    draws: 1
    fly: 1
    forty-five: 1
    ground: 1
    idaho: 1
    journey: 1
    longest: 1
    longitude: 1
    main: 1
    missouri: 1
    pacific: 1
    part: 1
    remarkable: 1
    safe: 1
    seaboard: 1
    seems: 1
    seventy-five: 1
    six: 1
    slope: 1
    spread: 1
    states: 1
    supply: 1
    territories: 1
    twenty-eight: 1
    uses: 1
    vast: 1
    water: 1
    ways: 1

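I am not sure whether it would be better to just use the full list of unique words instead of keeping the unique words split into sentences as elements. For example, here is a set of the words from the first list above.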

{'austria', 'fortyfive', 'fiftyfour', 'longest', 'vast', 'almost', 'states', 'region', 'commonplace', 'wide', 'flats', 'main', 'longitude', 'part', 'gulf', 'st', 'contrary', 'missouri', 'pacific', 'hundreds', 'area', 'areas', 'turkey', 'discharges', 'twentyeight', 'fly', 'worth', 'thirtyeight', 'valley', 'seaboard', 'wales', 'ireland', 'ways', 'uses', 'scotland', 'ground', 'river', 'steamboats', 'seventyfive', 'territories', 'safe', 'degrees', 'twentyfive', 'england', 'thames', 'subordinate', 'drainagebasin', 'water', 'considering', 'fertile', 'rivers', 'spread', 'reading', 'combined', 'seems', 'france', 'crookedest', 'drainagebasin:', 'supply', 'rhine', 'portugal', 'six', 'slopea', 'draws', 'exceptionally', 'mississippi', 'idaho', 'worldfour', 'atlantic', 'italy', 'spain', 'receives', 'cover', 'remarkable', 'germany', 'crow', 'delaware', 'country', 'branch', 'carries', 'proper', 'lawrence', 'journey', 'keels', 'navigable'}

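I am sorry if this is a bad explanation, but it is hard for me to explain. This is the obstacle that is stopping me from using the cosine similarity equation.

Thanks,

Edit:

Set of unique words only: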

{'remarkable', 'six', 'part', 'navigable', 'england', 'areas', 'worth', 'ways', 'longest', 'lawrence', 'journey', 'longitude', 'austria', 'rivers', 'st', 'crow', 'pacific', 'thirty-eight', 'gulf', 'ireland', 'drainage-basin', 'delaware', 'spread', 'proper', 'subordinate', 'territories', 'germany', 'cover', 'fifty-four', 'slope--a', 'fertile', 'degrees', 'wales', 'seems', 'exceptionally', 'water', 'italy', 'fly', 'missouri', 'turkey', 'atlantic', 'flats', 'hundreds', 'world--four', 'branch', 'twenty-eight', 'main', 'spain', 'receives', 'keels', 'states', 'portugal', 'draws', 'almost', 'contrary', 'seaboard', 'safe', 'mississippi', 'idaho', 'scotland', 'steamboats', 'france', 'valley', 'twenty-five', 'carries', 'wide', 'crookedest', 'area', 'reading', 'rhine', 'discharges', 'uses', 'commonplace', 'combined', 'considering', 'seventy-five', 'river', 'region', 'forty-five', 'ground', 'country', 'vast', 'thames', 'supply'}

My attempt:

for i in unique:
    kw = i
    count_word = [i for i in temp for j in i.split() if j == kw]
    count_dict = {j: i.count(j) for i in count_word for j in i.split() if j != kw}
    print(kw)
    for a, c in sorted(count_dict.items(), key=lambda x: x[0]):
        print('{}: {}'.format(a, c))
    print()

Tags: python, python-3.x

Solution


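For this we can set kw (the keyword) to river and then use a list comprehension to get every item in temp that contains kw as a whole word. Note that some sentences contain rivers, so a plain kw in i substring test will not work. From there we can build a dictionary with a dictionary comprehension: j is each word in i.split(), i.count(j) is the count of that word in each item, and we add if j != kw so that river itself is not included in the result. Finally, we can print with for k, v in dicta.items(), and if we like we can wrap it in sorted() to get the results in alphabetical order.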

kw = 'river'
lista = [i for i in temp for j in i.split() if j == kw]
dicta = {j: i.count(j) for i in lista for j in i.split() if j != kw}

for k, v in sorted(dicta.items(), key=lambda x: x[0]):
    print('{}: {}'.format(k, v))

Output:

atlantic: 1
branch: 1
commonplace: 1
considering: 1
contrary: 1
country: 1
...
twenty-eight: 1
uses: 1
vast: 1
water: 1
ways: 1
world: 1
world--four: 1

Expanded loops:

lista = []
for i in temp:
    for j in i.split():
        if j == kw:
            lista.append(i)

dicta = {}
for i in lista:
    for j in i.split():
        if j != kw:
            dicta[j] = i.count(j)
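
One caveat worth noting: i.count(j) is str.count, which counts substring occurrences, so a short word such as 'st' would also be counted inside 'states'. The dictionary also keeps only the count from the last sentence that contains a word instead of summing across sentences. Neither changes the output for this sample, but a variant that counts whole words and accumulates totals could look like the sketch below. It uses collections.Counter; build_profile is a hypothetical helper name, and the sample list is a shortened stand-in for temp.

```python
from collections import Counter

def build_profile(sentences, kw):
    """Return (number of sentences containing kw, Counter of the other words)."""
    hits = 0
    counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        if kw in words:  # whole-word test, so 'rivers' does not match 'river'
            hits += 1
            counts.update(w for w in words if w != kw)
    return hits, counts

# Shortened sample in the same shape as temp
sample = [
    'commonplace river contrary ways remarkable',
    'river vast drainage-basin draws water supply',
    'discharges water st',
    'fifty-four subordinate rivers navigable',
]

hits, profile = build_profile(sample, 'river')
print('river {} (profile)'.format(hits))
for word, n in sorted(profile.items()):
    print('    {}: {}'.format(word, n))
```

Returning the sentence count alongside the Counter makes it possible to print the header line in the format asked for in the question.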

Additional request:

# Read the entire file into one variable as a string
all_words = 'some string'
all_words = all_words.split()
unique = set(all_words)

# temp is the list of sentences to check against
profiles = []
for kw in unique:
    # rest of the existing code
    lista = [i for i in temp for j in i.split() if j == kw]
    dicta = {j: i.count(j) for i in lista for j in i.split() if j != kw}
    # instead of printing, append each dictionary created to a list
    profiles.append(dicta)
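
Since the stated goal is cosine similarity, here is a minimal sketch of how two of these profile dictionaries might be compared. It treats each dictionary as a sparse vector of word counts and computes the dot product divided by the product of the two vector lengths. The function name cosine_similarity and the two example profiles are assumptions for illustration, not part of the original answer.

```python
import math

def cosine_similarity(p, q):
    """Cosine similarity of two word-count dicts treated as sparse vectors."""
    # Only words present in both profiles contribute to the dot product.
    dot = sum(count * q.get(word, 0) for word, count in p.items())
    norm_p = math.sqrt(sum(c * c for c in p.values()))
    norm_q = math.sqrt(sum(c * c for c in q.values()))
    if norm_p == 0 or norm_q == 0:
        return 0.0  # assumption: an empty profile is similar to nothing
    return dot / (norm_p * norm_q)

# Hypothetical profiles in the same shape as dicta
river = {'water': 1, 'vast': 1, 'supply': 1}
mississippi = {'water': 1, 'gulf': 1, 'carries': 1}
print(cosine_similarity(river, mississippi))
```

Words that appear in only one profile add to that profile's length but not to the dot product, so profiles with little overlap score close to 0 and identical profiles score 1.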
