python - Optimizing the process of finding word association strengths in input text
Problem description
I have written the following (crude) code to find the association strengths among the words in a given piece of text.
import re
import numpy as np
import pandas as pd
## The first paragraph of Wikipedia's article on itself - you can try with other pieces of text with preferably more words (to produce more meaningful word pairs)
text = "Wikipedia was launched on January 15, 2001, by Jimmy Wales and Larry Sanger.[10] Sanger coined its name,[11][12] as a portmanteau of wiki[notes 3] and 'encyclopedia'. Initially an English-language encyclopedia, versions in other languages were quickly developed. With 5,748,461 articles,[notes 4] the English Wikipedia is the largest of the more than 290 Wikipedia encyclopedias. Overall, Wikipedia comprises more than 40 million articles in 301 different languages[14] and by February 2014 it had reached 18 billion page views and nearly 500 million unique visitors per month.[15] In 2005, Nature published a peer review comparing 42 science articles from Encyclopadia Britannica and Wikipedia and found that Wikipedia's level of accuracy approached that of Britannica.[16] Time magazine stated that the open-door policy of allowing anyone to edit had made Wikipedia the biggest and possibly the best encyclopedia in the world and it was testament to the vision of Jimmy Wales.[17] Wikipedia has been criticized for exhibiting systemic bias, for presenting a mixture of 'truths, half truths, and some falsehoods',[18] and for being subject to manipulation and spin in controversial topics.[19] In 2017, Facebook announced that it would help readers detect fake news by suitable links to Wikipedia articles. YouTube announced a similar plan in 2018."
text = re.sub(r"[\[].*?[\]]", "", text)  ## Remove brackets and anything inside them
text = re.sub(r"[^a-zA-Z0-9.]+", " ", text)  ## Remove special characters except spaces and dots
text = text.lower()  ## Convert everything to lowercase
## Can add other preprocessing steps, depending on the input text, if needed.
from nltk.corpus import stopwords
import nltk
stop_words = stopwords.words('english')
desirable_tags = ['NN'] # We want only nouns - can also add 'NNP', 'NNS', 'NNPS' if needed, depending on the results
word_list = []
for sent in text.split('.'):
    for word in sent.split():
        '''
        Extract the unique, non-stopword nouns only
        '''
        if word not in word_list and word not in stop_words and nltk.pos_tag([word])[0][1] in desirable_tags:
            word_list.append(word)
'''
Construct the association matrix, where we count 2 words as being associated
if they appear in the same sentence.
Later, I'm going to define associations more properly by introducing a
window size (say, if 2 words are separated by at most 5 words in a sentence,
then we consider them to be associated).
'''
table = np.zeros((len(word_list),len(word_list)), dtype=int)
for sent in text.split('.'):
    for i in range(len(word_list)):
        for j in range(len(word_list)):
            if word_list[i] in sent and word_list[j] in sent:
                table[i,j] += 1
df = pd.DataFrame(table, columns=word_list, index=word_list)
# Count the number of occurrences of each word from word_list in the text
all_words = pd.DataFrame(np.zeros((len(df), 2)), columns=['Word', 'Count'])
all_words.Word = df.index
for sent in text.split('.'):
    for word in sent.split():
        if word in word_list:
            all_words.loc[all_words.Word == word, 'Count'] += 1
# Sort the word pairs in decreasing order of their association strengths
df.values[np.triu_indices_from(df, 0)] = 0 # Make the upper triangle values 0
assoc_df = pd.DataFrame(columns=['Word 1', 'Word 2', 'Association Strength (Word 1 -> Word 2)'])
for row_word in df:
    for col_word in df:
        '''
        If Word1 occurs 10 times in the text, and Word1 & Word2 occur in the same sentence 3 times,
        the association strength of Word1 and Word2 is 3/10 - Please correct me if this is wrong.
        '''
        assoc_df = assoc_df.append({'Word 1': row_word, 'Word 2': col_word,
                                    'Association Strength (Word 1 -> Word 2)': df[row_word][col_word]/all_words[all_words.Word==row_word]['Count'].values[0]}, ignore_index=True)
assoc_df.sort_values(by='Association Strength (Word 1 -> Word 2)', ascending=False)
This produces word associations like the following:
Word 1 Word 2 Association Strength (Word 1 -> Word 2)
330 wiki encyclopedia 3.0
895 encyclopadia found 1.0
1317 anyone edit 1.0
754 peer science 1.0
755 peer encyclopadia 1.0
756 peer britannica 1.0
...
...
...
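As an aside on the code above, the upper-triangle zeroing step with np.triu_indices_from can be seen on a small matrix (the 3x3 matrix here is made up purely for illustration):

```python
import numpy as np

m = np.arange(9).reshape(3, 3)
# Zero out the diagonal (k=0) and everything above it, keeping only
# the strictly lower triangle - exactly what the question's code does
# to avoid counting each word pair twice.
m[np.triu_indices_from(m, 0)] = 0
print(m)  # [[0 0 0], [3 0 0], [6 7 0]]
```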
However, the code contains many for loops that hamper its runtime. In particular, the final part (sort the word pairs in decreasing order of their association strengths) generates n^2 combinations (where n is the length of word_list).
So, here is what I would like help with:
1) How do I vectorize the code, or otherwise make it more efficient?
2) Instead of producing n^2 combinations/word pairs in the last step, is there any way to prune some of them before producing them? I am going to prune some of the useless/meaningless pairs by inspection anyway.
3) Also, I know this does not fall within the scope of a coding question, but I would love to know if there is any mistake in my logic, especially when calculating the word association strengths.
Solution
Since you asked about your specific code, I will not go into alternate libraries here. I will mostly focus on points 1) and 2) of your question:
Instead of iterating through the whole word list twice (i and j), you can already reduce the processing time by about half by iterating j only between i + 1 and the end of the list. This removes duplicate pairs (indexes 24 and 42 as well as indexes 42 and 24) and identical pairs (index 42 and 42).
for sent in text.split('.'):
    for i in range(len(word_list)):
        for j in range(i+1, len(word_list)):
            if word_list[i] in sent and word_list[j] in sent:
                table[i,j] += 1
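The halved iteration enumerates exactly the index pairs that the standard library's itertools.combinations produces, so you can also generate them directly (the three-word list here is made up for illustration):

```python
from itertools import combinations

# Hypothetical word list, just to show which index pairs are produced.
word_list = ["wiki", "encyclopedia", "articles"]

# combinations(..., 2) yields each unordered pair exactly once,
# with no (i, i) pairs - the same effect as the j = i + 1 loop.
pairs = list(combinations(range(len(word_list)), 2))
print(pairs)  # [(0, 1), (0, 2), (1, 2)]
```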
Be careful, though. The in operator also matches partial words (like and in hand).
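To see that pitfall concretely (the sentence and helper function below are made up for illustration):

```python
import re

sent = "hand me the band-aid"

# Substring check: matches inside "hand" and "band" - a false positive.
print("and" in sent)          # True
# Exact token check after splitting - no false positive.
print("and" in sent.split())  # False

# A word-boundary regex is another common workaround (hypothetical helper):
def contains_word(sentence, word):
    return re.search(r"\b" + re.escape(word) + r"\b", sentence) is not None

print(contains_word(sent, "and"))   # False
print(contains_word(sent, "hand"))  # True
```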
Of course, you could also remove the j iteration entirely by first filtering all the words in your word list and only pairing them afterwards:
word_list = set()  # Using set instead of list makes lookups faster since this is a hashed structure
for sent in text.split('.'):
    for word in sent.split():
        '''
        Extract the unique, non-stopword nouns only
        '''
        if word not in word_list and word not in stop_words and nltk.pos_tag([word])[0][1] in desirable_tags:
            word_list.add(word)
(...)
word_index = {word: idx for idx, word in enumerate(word_list)}  # map each word to its row/column in table
for sent in text.split('.'):
    found_words = [word for word in sent.split() if word in word_list]  # list comprehensions are usually faster than pure for loops
    # If you want to count duplicate words, then leave the whole line below out.
    found_words = tuple(frozenset(found_words))  # make every word unique using a set and then iterable by index again by converting it into a tuple
    for i in range(len(found_words)):
        for j in range(i + 1, len(found_words)):
            table[word_index[found_words[i]], word_index[found_words[j]]] += 1
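Going one step further, the inner double loop can be replaced by a single fancy-indexed numpy update per sentence. This is a sketch under the assumption that the table's rows and columns follow word_list order (the word list and sentences below are made up for illustration; within one sentence the deduplicated pairs are unique, so the += fancy indexing is safe):

```python
import numpy as np
from itertools import combinations

word_list = ["wiki", "encyclopedia", "articles", "languages"]
word_index = {w: k for k, w in enumerate(word_list)}
table = np.zeros((len(word_list), len(word_list)), dtype=int)

sentences = ["the wiki is an encyclopedia",
             "articles in many languages",
             "wiki articles"]

for sent in sentences:
    # Deduplicated, sorted indices of the listed words found in this sentence.
    found = sorted({word_index[w] for w in sent.split() if w in word_index})
    # All strictly-lower-index -> higher-index pairs, incremented at once.
    idx = np.array(list(combinations(found, 2)))
    if idx.size:
        table[idx[:, 0], idx[:, 1]] += 1

print(table)
```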
In general, though, you should really think about using external libraries for most of this. As some of the comments on your question already pointed out, splitting on '.' may get you wrong results, and the same applies to splitting on whitespace, for example with words separated by a '-' or words followed by a ','.
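To illustrate, even a small regex tokenizer already avoids some of those whitespace-splitting pitfalls (a minimal sketch on a made-up sentence; dedicated tokenizers such as NLTK's word_tokenize handle far more cases):

```python
import re

sent = "open-door policy, wiki[notes 3] and 'encyclopedia'."

# Naive splitting leaves punctuation glued to the tokens.
print(sent.split())  # ["open-door", "policy,", "wiki[notes", ...]

# A word regex extracts clean tokens, keeping internal hyphens intact.
tokens = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", sent.lower())
print(tokens)  # ['open-door', 'policy', 'wiki', 'notes', '3', 'and', 'encyclopedia']
```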