Creating a co-occurrence matrix of years and words

Problem description

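I have a dataset of conference abstracts, along with a dictionary of dictionaries that holds the word frequencies for each year. I want to turn this dict of dicts into a matrix that compares each year's frequencies against those of every other year, to see which years are most similar to each other.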

I turned the dictionary into a pandas DataFrame. Keep in mind: it holds words and years.

wordsdf = pd.DataFrame.from_dict(word_dfs, orient='index')

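I am trying to compare the years against each other, with years as both the rows and the columns of a co-occurrence matrix. So far, because the values are not all integers, I can't just use the dot product. Any suggestions?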

I tried this, to no avail:

# Create a co-occurrence matrix
from nltk.tokenize import word_tokenize
from itertools import combinations
from collections import Counter

# Note: iterating over a DataFrame yields its column labels, not its rows
sentences = wordsdf
vocab = set(word_tokenize(' '.join(map(str, sentences))))
token_sent_list = [word_tokenize(str(sen)) for sen in sentences]

co_occ = {ii:Counter({jj:0 for jj in vocab if jj!=ii}) for ii in vocab}
k=2

for sen in token_sent_list:
    for ii in range(len(sen)):
        if ii < k:
            c = Counter(sen[0:ii+k+1])
            del c[sen[ii]]
            co_occ[sen[ii]] = co_occ[sen[ii]] + c
        elif ii > len(sen)-(k+1):
            c = Counter(sen[ii-k::])
            del c[sen[ii]]
            co_occ[sen[ii]] = co_occ[sen[ii]] + c
        else:
            c = Counter(sen[ii-k:ii+k+1])
            del c[sen[ii]]
            co_occ[sen[ii]] = co_occ[sen[ii]] + c

# Having the final matrix in dict form lets you convert it to different Python data structures
co_occ = {ii:dict(co_occ[ii]) for ii in vocab}
co_occ
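
As a small illustration of that last comment, a nested dict of this shape can be loaded into a square pandas DataFrame. This is only a sketch: `toy_co_occ` and its counts are invented stand-ins for `co_occ`, not values from the dataset.

```python
import pandas as pd

# Toy nested dict with the same shape as co_occ (counts are invented).
toy_co_occ = {
    "data": {"model": 2, "graph": 1},
    "model": {"data": 2},
    "graph": {"data": 1},
}

# Outer keys become rows, inner keys become columns; missing pairs become NaN, then 0.
co_occ_df = pd.DataFrame.from_dict(toy_co_occ, orient="index").fillna(0).astype(int)

# Put rows and columns in the same order so the matrix is square and aligned.
labels = sorted(toy_co_occ)
co_occ_df = co_occ_df.reindex(index=labels, columns=labels, fill_value=0)
print(co_occ_df)
```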

Tags: python, matrix, network-analysis

Solution
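
One possible approach, offered as a sketch rather than a definitive answer: if the dot product fails because words missing from a given year show up as NaN (an assumption, since the question does not show the data), filling those gaps with 0 and computing cosine similarity between the year rows gives a year-by-year matrix. The `word_dfs` contents below are hypothetical toy data, and cosine similarity is a substituted technique, not something the question itself uses.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for word_dfs: {year: {word: count}}.
word_dfs = {
    2018: {"network": 4, "graph": 2},
    2019: {"network": 2, "graph": 1, "model": 0},
    2020: {"model": 5, "graph": 1},
}

# Rows are years, columns are words; a word absent from a year becomes NaN, so fill with 0.
wordsdf = pd.DataFrame.from_dict(word_dfs, orient="index").fillna(0.0)

# Scale each year's row to unit length so the dot product becomes cosine similarity.
counts = wordsdf.to_numpy(dtype=float)
norms = np.linalg.norm(counts, axis=1, keepdims=True)
norms[norms == 0] = 1.0  # avoid dividing by zero for a year with no words
unit = counts / norms

# Year-by-year similarity matrix: years are both the rows and the columns.
similarity = pd.DataFrame(unit @ unit.T, index=wordsdf.index, columns=wordsdf.index)

# For each year, find the most similar other year by masking out the diagonal.
off_diag = similarity.where(~np.eye(len(similarity), dtype=bool))
most_similar = off_diag.idxmax(axis=1)

print(similarity.round(3))
print(most_similar)
```

Normalizing the rows keeps years with more abstracts from dominating purely because of their larger raw counts. If raw overlap is what matters, `counts @ counts.T` gives the plain dot-product version instead.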

