Optimizing the process of finding word association strengths in input text

Problem description

I wrote the following (rough) code to find the association strength between words in a given text.

import re
import numpy as np
import pandas as pd

## The first paragraph of Wikipedia's article on itself - you can try this with other pieces of text, preferably ones with more words (to produce more meaningful word pairs)
text = "Wikipedia was launched on January 15, 2001, by Jimmy Wales and Larry Sanger.[10] Sanger coined its name,[11][12] as a portmanteau of wiki[notes 3] and 'encyclopedia'. Initially an English-language encyclopedia, versions in other languages were quickly developed. With 5,748,461 articles,[notes 4] the English Wikipedia is the largest of the more than 290 Wikipedia encyclopedias. Overall, Wikipedia comprises more than 40 million articles in 301 different languages[14] and by February 2014 it had reached 18 billion page views and nearly 500 million unique visitors per month.[15] In 2005, Nature published a peer review comparing 42 science articles from Encyclopadia Britannica and Wikipedia and found that Wikipedia's level of accuracy approached that of Britannica.[16] Time magazine stated that the open-door policy of allowing anyone to edit had made Wikipedia the biggest and possibly the best encyclopedia in the world and it was testament to the vision of Jimmy Wales.[17] Wikipedia has been criticized for exhibiting systemic bias, for presenting a mixture of 'truths, half truths, and some falsehoods',[18] and for being subject to manipulation and spin in controversial topics.[19] In 2017, Facebook announced that it would help readers detect fake news by suitable links to Wikipedia articles. YouTube announced a similar plan in 2018."
text = re.sub(r"\[.*?\]", "", text)          ## Remove square brackets and anything inside them
text = re.sub(r"[^a-zA-Z0-9.]+", ' ', text)  ## Replace special characters (everything except letters, digits and dots) with a space
text = text.lower()                          ## Convert everything to lowercase
## Can add other preprocessing steps, depending on the input text, if needed.







from nltk.corpus import stopwords
import nltk
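# nltk.download('stopwords')                    # uncomment on first run if the stopword list is missing
# nltk.download('averaged_perceptron_tagger')   # resource used by nltk.pos_tag below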

stop_words = stopwords.words('english')

desirable_tags = ['NN'] # We want only nouns - can also add 'NNP', 'NNS', 'NNPS' if needed, depending on the results

word_list = []

for sent in text.split('.'):
    for word in sent.split():
        '''
        Extract the unique, non-stopword nouns only
        '''
        if word not in word_list and word not in stop_words and nltk.pos_tag([word])[0][1] in desirable_tags:
            word_list.append(word)





'''
Construct the association matrix, where we count 2 words as being associated 
if they appear in the same sentence.

Later, I'm going to define associations more properly by introducing a 
window size (say, if 2 words are separated by at most 5 words in a sentence, 
then we consider them to be associated)
'''

table = np.zeros((len(word_list),len(word_list)), dtype=int)

for sent in text.split('.'):
    for i in range(len(word_list)):
        for j in range(len(word_list)):
            if word_list[i] in sent and word_list[j] in sent:
                table[i,j]+=1

df = pd.DataFrame(table, columns=word_list, index=word_list)
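
## Rough sketch of the window-based association mentioned above (not wired into the
## pipeline yet) - two words count as associated if at most WINDOW words separate
## them within a sentence. The names WINDOW and window_table are just placeholders.
WINDOW = 5

window_table = np.zeros((len(word_list), len(word_list)), dtype=int)

for sent in text.split('.'):
    tokens = sent.split()
    ## Position of each word of interest in this sentence (last occurrence if repeated)
    positions = {w: idx for idx, w in enumerate(tokens) if w in word_list}
    for w1 in positions:
        for w2 in positions:
            if w1 != w2 and abs(positions[w1] - positions[w2]) - 1 <= WINDOW:
                window_table[word_list.index(w1), word_list.index(w2)] += 1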







# Count the number of occurrences of each word from word_list in the text

all_words = pd.DataFrame(np.zeros((len(df), 2)), columns=['Word', 'Count'])
all_words.Word = df.index

for sent in text.split('.'):
    for word in sent.split():
        if word in word_list:
            all_words.loc[all_words.Word==word,'Count'] += 1







# Sort the word pairs in decreasing order of their association strengths

df.values[np.triu_indices_from(df, 0)] = 0 # Make the upper triangle values 0

assoc_df = pd.DataFrame(columns=['Word 1', 'Word 2', 'Association Strength (Word 1 -> Word 2)'])
for row_word in df:
    for col_word in df:
        '''
        If Word1 occurs 10 times in the text, and Word1 & Word2 occur in the same sentence 3 times,
        the association strength of Word1 and Word2 is 3/10 - Please correct me if this is wrong.
        '''
        assoc_df = assoc_df.append({'Word 1': row_word, 'Word 2': col_word, 
                                        'Association Strength (Word 1 -> Word 2)': df[row_word][col_word]/all_words[all_words.Word==row_word]['Count'].values[0]}, ignore_index=True)

assoc_df.sort_values(by='Association Strength (Word 1 -> Word 2)', ascending=False)

This produces word associations like these:

        Word 1          Word 2          Association Strength (Word 1 -> Word 2)
330     wiki            encyclopedia    3.0
895     encyclopadia    found           1.0
1317    anyone          edit            1.0
754     peer            science         1.0
755     peer            encyclopadia    1.0
756     peer            britannica      1.0
...
...
...

However, the code contains a lot of for loops that hamper its running time. In particular, the last part (sort the word pairs in decreasing order of their association strengths) takes a lot of time, since it has to generate n^2 word pairs, where n is the number of words in word_list.

So, here is what I would like help with:

  1. How do I vectorize the code, or otherwise make it more efficient?

  2. Instead of producing n^2 combinations/word pairs in the last step, is there a way to prune some of them before they are produced? I am going to prune some of the useless/meaningless pairs by inspection anyway.

  3. Also, and I know this falls outside the scope of a coding question, I would love to know if there is any flaw in my logic, especially when calculating the word association strengths.

Tags: python, performance, loops, nlp, analytics

Solution


Since you asked about your specific code, I will not go into alternative libraries here. I will mostly focus on points 1) and 2) of your question:

Instead of iterating over the whole word list twice (i and j), you can already cut the processing time roughly in half by letting j iterate only between i + 1 and the end of the list. This removes duplicate pairs (indices 24 and 42 as well as indices 42 and 24) and identical pairs (index 42 and 42).

for sent in text.split('.'):
    for i in range(len(word_list)):
        for j in range(i+1, len(word_list)):
            if word_list[i] in sent and word_list[j] in sent:
                table[i,j]+=1

Be careful, though: the in operator will also match partial words (like "and" inside "hand"); testing against sent.split() instead of the raw sentence string avoids that. Of course, you could also remove the j iteration completely by first filtering for all words from your word list that appear in the sentence and then pairing them afterwards:

word_list = set()    # Using set instead of list makes lookups faster since this is a hashed structure

for sent in text.split('.'):
    for word in sent.split():
        '''
        Extract the unique, non-stopword nouns only
        '''
        if word not in word_list and word not in stop_words and nltk.pos_tag([word])[0][1] in desirable_tags:
            word_list.add(word)

(...)
word_index = {word: idx for idx, word in enumerate(word_list)}   # map each word to its row/column in table

for sent in text.split('.'):
    found_words = [word for word in sent.split() if word in word_list]    # list comprehensions are usually faster than pure for loops
    # If you want to count duplicate words, then leave the whole line below out.
    found_words = tuple(frozenset(found_words))  # make every word unique using a set, then indexable again by converting it into a tuple
    for i in range(len(found_words)):
        for j in range(i+1, len(found_words)):
            table[word_index[found_words[i]], word_index[found_words[j]]] += 1
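
As for the last block in your question (the one that generates the n^2 pairs), the division and sorting can also be expressed as pandas operations instead of a Python loop with append. Here is a rough sketch, assuming df and all_words exactly as you build them; the names counts, assoc and pairs are just placeholders. Dropping zero strengths before sorting also prunes most of the pairs, which goes some way towards your point 2):

counts = all_words.set_index('Word')['Count']        # total occurrences per word
assoc = df.div(counts, axis=0)                       # divide each row of co-occurrence counts by the row word's total count
pairs = assoc.stack().reset_index()                  # long format: one row per (Word 1, Word 2) pair
pairs.columns = ['Word 1', 'Word 2', 'Association Strength (Word 1 -> Word 2)']
pairs = pairs[pairs['Association Strength (Word 1 -> Word 2)'] > 0]   # prune pairs that never co-occur
pairs = pairs.sort_values(by='Association Strength (Word 1 -> Word 2)', ascending=False)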

In general, though, you should really consider using external libraries for most of this. As some of the comments on your question already pointed out, splitting on '.' may give you wrong results, and the same goes for splitting on whitespace, for example with words separated by a '-' or words followed by a ','.
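
For instance, a minimal sketch with NLTK's tokenizers (assuming the punkt sentence tokenizer data is installed; sentences and tokenized are just example names), which handle abbreviations, hyphens and trailing punctuation much better than split('.') and split():

import nltk

# nltk.download('punkt')   # one-time download of the sentence tokenizer models, if missing

sentences = nltk.sent_tokenize(text)                          # proper sentence boundaries instead of split('.')
tokenized = [nltk.word_tokenize(sent) for sent in sentences]  # separates punctuation from the words themselves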

