Maximum Doc2vec similarity between an observation and a subset at a given point in time

Problem description

I have a large dataframe called database_finale (roughly 30,000 observations). The columns relevant to this post are index1 (the observation identifier, also used as the Doc2vec tag), app_date2 (the observation date), and snow_pat (equal to 1 for "snow" observations).

I want to create a new column containing, for each "non-snow" observation, the maximum text similarity between that observation and all "snow" observations that exist up to the focal observation's date. As the date of the focal observation advances, the set of reference "snow" observations grows in number as well.

After defining the corpus and training the model, I tried this code on a subsample of 200 observations, and it seems to do the job:

import pandas as pd

prova = 0  # counter used only by the progress print further down
datab = []
#daytuple is a tuple of app_date2 values, index_tuple is a tuple of index1 values
for i, j in [(i,j) for i in daytuple for j in index_tuple]:
    #here I extract the row of the dataframe of the focal non-snow observation for which I want to find the max similarity to snow
    m=database_finale.loc[(database_finale.index1 ==j ) & (database_finale.app_date2 == i ) & (database_finale["snow_pat"] !=1)]
    result=m.empty
    #if the extracted line is non-empty
    if result== False:
        #here I extract all the "snow" observations up to day i
        l=database_finale.loc[(database_finale.app_date2 <= i) & (database_finale["snow_pat"] ==1)]
        result2=l.empty
        #if also this one is non-empty       
        if result2==False:
            #I create a list of the "snow" references
            reference_list1=l["index1"].tolist()
            #I create dictionaries where to store the similarity score
            most_similars_by_key = {}
            most_similars_by_key_2 = {}
            #model was trained before and corpus was already defined
            for doc in corpus_for_doc2vec:
                #select the tag of the focal patent
                if doc.tags[0]==j:
                    #extract from the list of snow up to day i the one that is most similar to the focal observation
                    most_similars_by_key[doc.tags[0]] = model.docvecs.most_similar_to_given(j, reference_list1)
                    #I have the tag of the most similar snow observation to the non-snow focal observation, but not the similarity score, thus I extract the similarity score
                    for key in most_similars_by_key:
                        maxim = most_similars_by_key[key]
                        sim_score = model.docvecs.similarity(key, maxim)
                        most_similars_by_key_2[key] = sim_score
                        print("prova"+str(prova))
                        prova=prova+1
                        #I merge the database with the most similar observation and the similarity score to the original one and append in a list
                        db1=pd.DataFrame.from_dict(most_similars_by_key, orient='index')
                        db1.reset_index(inplace=True)
                        db1=db1.rename(columns={"index": "index1", 0: "most_similar_of_snow"})
                        db2=pd.DataFrame.from_dict(most_similars_by_key_2, orient='index')
                        db2.reset_index(inplace=True)
                        db2=db2.rename(columns={"index": "index1", 0: "Similar_doc2vec_desc"})
                        db3=pd.merge(left=database_finale, right= db1, how="left", left_on=["index1"], right_on=["index1"])
                        db4=pd.merge(left=db3, right= db2, how="left", left_on=["index1"], right_on=["index1"])
                        # keep only the rows that actually received a similarity score, then collect
                        db4 = db4[db4['Similar_doc2vec_desc'].notna()]
                        datab.append(db4)
                else:
                    continue
        else:
            continue
    else:
        continue
#here I create the final DB.
datab = pd.concat(datab)

As mentioned, this code seems to work, but applied to the full 30,000 observations it is extremely slow. Can anyone help me optimize the code to speed up the computation?

I have tried looking into parallelizing the process, but I am not very familiar with that approach, and it seems to require rewriting this for loop as a function, which I am not sure I have the skills to do.
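From what I understand, parallelizing would mean wrapping the body of the loop in a function, roughly as in the sketch below (process_pair is just a placeholder name; its body would be the inside of the (i, j) loop above), but I am not confident about filling it in correctly:

from multiprocessing import Pool

def process_pair(pair):
    i, j = pair
    # placeholder: the body of the (i, j) loop above would go here and
    # return the merged frame (db4) for this combination, or None
    return None

if __name__ == "__main__":
    pairs = [(i, j) for i in daytuple for j in index_tuple]
    with Pool() as pool:
        partial_frames = pool.map(process_pair, pairs)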

Tags: pandas, function, parallel-processing, doc2vec

Solution


Your code is hard to follow, so these are hunches rather than certainties:

  • The implicit double loop [(i,j) for i in daytuple for j in index_tuple] may generate far more combinations than strictly necessary; have you reviewed its output to make sure it is minimal/sensible?

  • Keeping things "inside" Pandas structures may add extra indirection/complexity; with only 30,000 items, you may just want them all as plain Python dicts, in a list ordered by ascending date (sketched below).

  • The else: continue constructs seem superfluous, since they all appear in places where continuing would happen automatically anyway.

  • If you need bulk similarity calculations, you should generally avoid both .most_similar_to_given() and even .similarity(), because they each do one small thing per call inside a Python loop. Instead, try to use .most_similar() to retrieve a large batch of results in one call - it uses optimized bulk calculations (see the sketch right after this list).
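As a very rough sketch of the last two points (assuming, as the question's code suggests, that the index1 values double as the Doc2vec document tags; variable names here are illustrative only):

records = database_finale.sort_values("app_date2").to_dict("records")  # plain dicts, ascending by date

focal_tag = records[0]["index1"]  # illustrative: any non-snow tag of interest
# one bulk call returns every document ranked by similarity to this tag,
# instead of many one-pair-at-a-time .similarity() calls
ranked = model.dv.most_similar(focal_tag, topn=len(records))  # model.docvecs.most_similar(...) in gensim < 4.0
# ranked is a list of (tag, similarity) pairs in descending similarity order,
# ready to be filtered against whatever subset of tags is relevant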

In very high-level pseudocode, working only with Python dicts, a more focused approach might look, very roughly, like:

earlier_snow_observations = set()
for observation in all_observations_earliest_to_latest:
    if observation['snow_pat']:
        earlier_snow_observations.add(observation['index1'])
        continue  # no need to find nearest-preceding
    all_ranked_similar = d2v_model.dv.most_similar(observation['index1'], topn=len(all_observations_earliest_to_latest))
    for id, sim in all_ranked_similar:
        if id in earlier_snow_observations:
            observation['earlier_snow_closest'] = (id, sim)
            break  # exit early
    else:
        # no earlier snow observations
        observation['earlier_snow_closest'] = None  # or maybe just pass?

At the end, each observation dict will have, in its earlier_snow_closest value, an (id, similarity) tuple for the earlier snow item whose doc-vector is closest.
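If the end goal is still a new column on the DataFrame, the enriched dicts can be folded back into database_finale; a minimal sketch, assuming the dicts were produced from database_finale via to_dict("records") and reusing the column names from the question's own code:

import pandas as pd

# build a small frame holding only the new columns, one row per observation
col_frame = pd.DataFrame(
    [
        {
            "index1": obs["index1"],
            "most_similar_of_snow": obs["earlier_snow_closest"][0] if obs.get("earlier_snow_closest") else None,
            "Similar_doc2vec_desc": obs["earlier_snow_closest"][1] if obs.get("earlier_snow_closest") else None,
        }
        for obs in all_observations_earliest_to_latest
    ]
)
# attach the new columns to the original DataFrame by the shared identifier
database_finale = database_finale.merge(col_frame, on="index1", how="left")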

