I need to recommend a movie from text using spaCy

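Problem description

Thank you for taking the time to read this.

I am not asking anyone to do my homework for me. I only need guidance.

I have a homework problem that I cannot solve. I need to do the following with the spaCy library in Python.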

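Homework question

Read in the movies.txt file. Each line is a description of a different movie.

Your task is to create a function that returns which movie a user should watch next if they have watched Planet Hulk, which has the description "Will he save their world or destroy it? When the Hulk becomes too dangerous for the Earth, the Illuminati trick Hulk into a shuttle and launch him into space to a planet where the Hulk can live in peace. Unfortunately, Hulk land on the planet Sakaar where he is sold into slavery and trained as a gladiator."

The function should take the description as a parameter and return the title of the most similar movie.

The movies.txt file contains the following: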

Movie A :When Hiccup discovers Toothless isn't the only Night Fury, he must seek "The Hidden World", a secret Dragon Utopia before a hired tyrant named Grimmel finds it first.
Movie B :After the death of Superman, several new people present themselves as possible successors.
Movie C :A darkness swirls at the center of a world-renowned dance company, one that will engulf the artistic director, an ambitious young dancer, and a grieving psychotherapist. Some will succumb to the nightmare. Others will finally wake up.
Movie D :A humorous take on Sir Arthur Conan Doyle's classic mysteries featuring Sherlock Holmes and Doctor Watson.
Movie E :A 16-year-old girl and her extended family are left reeling after her calculating grandmother unveils an array of secrets on her deathbed.
Movie F :In the last moments of World War II, a young German soldier fighting for survival finds a Nazi captain's uniform. Impersonating an officer, the man quickly takes on the monstrous identity of the perpetrators he is trying to escape from.
Movie G :The world at an end, a dying mother sends her young son on a quest to find the place that grants wishes.
Movie H :A musician helps a young singer and actress find fame, even as age and alcoholism send his own career into a downward spiral.
Movie I :Corporate analyst and single mom, Jen, tackles Christmas with a business-like approach until her uncle arrives with a handsome stranger in tow.
Movie J :Adapted from the bestselling novel by Madeleine St John, Ladies in Black is an alluring and tender-hearted comedy drama about the lives of a group of department store employees in 1959 Sydney.

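What I have tried:

I tried to find a function in spaCy that does something like this, but the only one I came across is the similarity function, and it only checks whether sentences have similar values...

Yes, I am new to spaCy.

My code so far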

from __future__ import unicode_literals
import spacy
nlp = spacy.load("en_core_web_md")

myfile = open("movies.txt").read()
NlpRead = nlp(myfile)

sentence_to_compare = "Will he save their world or destroy it? When the Hulk becomes too dangerous for the Earth, the Illuminati trick Hulk into a shuttle and launch him into space to a planet where the Hulk can live in peace. Unfortunately, Hulk land on the planet Sakaar where he is sold into slavery and trained as a gladiator"

model_sentences = nlp(sentence_to_compare)

for sentence in myfile:
    similarity = nlp(sentence).similarity(model_sentences)
    print(sentence + "-" + str(similarity))

Tags: python-3.x, spacy

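Solution


spaCy has several pretrained models available. You are using "en_core_web_md", which includes word vectors. According to the documentation, the included word vectors are "GloVe vectors trained on Common Crawl".

As the code and heatmap below show, these word vectors capture semantic similarity and can help you cluster topics.

Of course, this does not solve your homework problem, but it points you to techniques you may find useful.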

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import spacy

# In a Jupyter notebook, uncomment the next line to render plots inline.
# %matplotlib inline

nlp = spacy.load("en_core_web_md")
tokens = nlp("Hulk Superman Batman dragon elf dance musical handsome romance war soldier")

# Show whether each token has a vector, its vector norm, and whether it is out of vocabulary.
for token in tokens:
    print(token.text, token.has_vector, token.vector_norm, token.is_oov)

labels = [token.text for token in tokens]
print(labels)

# Build the pairwise similarity matrix between all tokens.
M = np.zeros((len(tokens), len(tokens)))
for idx, token1 in enumerate(tokens):
    for idy, token2 in enumerate(tokens):
        M[idx, idy] = token1.similarity(token2)

ax = sns.heatmap(M, cmap="RdBu_r", xticklabels=labels, yticklabels=labels)
plt.show()

[Figure: similarity heatmap]
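The `similarity` calls above come down to cosine similarity between vectors. For a whole `Doc`, spaCy by default compares the average of the token vectors. Below is a minimal pure-Python sketch of that computation. The 3-dimensional vectors are invented for illustration and stand in for the model's real vectors, and the names `cosine` and `mean_vector` are hypothetical helpers, not part of the spaCy API.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_u * norm_v)


def mean_vector(vectors):
    """Element-wise average of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


# Toy vectors, made up for this example only.
toy = {
    "hulk": [0.9, 0.1, 0.0],
    "superman": [0.8, 0.2, 0.1],
    "dance": [0.0, 0.1, 0.9],
}

# Word-to-word comparison, as in the heatmap.
print(cosine(toy["hulk"], toy["superman"]))
print(cosine(toy["hulk"], toy["dance"]))

# Text-to-word comparison: average the word vectors first.
print(cosine(mean_vector([toy["hulk"], toy["superman"]]), toy["dance"]))
```

Because a text's vector is an average, long descriptions full of common words tend to look alike, which is one reason to filter tokens before comparing.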

In addition, spaCy provides part-of-speech tagging, which you can use to extract proper nouns and common nouns from a sentence:

doc = nlp("Will he save their world or destroy it? When the Hulk becomes too dangerous for the Earth, the Illuminati trick Hulk into a shuttle and launch him into space to a planet where the Hulk can live in peace. Unfortunately, Hulk land on the planet Sakaar where he is sold into slavery and trained as a gladiator")

properNouns = [token.text for token in doc if token.pos_ =='PROPN']
commonNouns = [token.text for token in doc if token.pos_ =='NOUN']
print(properNouns)
# ['Hulk', 'Earth', 'Illuminati', 'Hulk', 'Hulk', 'Hulk', 'Sakaar']
print(commonNouns)
# ['world', 'shuttle', 'space', 'planet', 'peace', 'land', 'planet', 'slavery', 'gladiator']
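One possible way to combine the two hints is sketched below, not as the answer to the assignment but as a starting point. Note that the loop in the question iterates over the string returned by `read()`, which yields single characters. Iterating over the file object gives one description per line instead. All function names here (`parse_movies`, `most_similar_title`, `spacy_noun_similarity`, `recommend_from_file`) are hypothetical, and restricting the comparison to nouns and proper nouns is an assumption that may or may not help on your data.

```python
def parse_movies(lines):
    """Split lines like 'Movie A :Some description' into (title, description) pairs."""
    movies = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        title, _, description = line.partition(":")
        movies.append((title.strip(), description.strip()))
    return movies


def most_similar_title(description, movies, similarity):
    """Return the title whose description scores highest against `description`."""
    best_title, _ = max(movies, key=lambda movie: similarity(description, movie[1]))
    return best_title


def spacy_noun_similarity(nlp):
    """Build a scorer that compares only nouns and proper nouns (an assumption)."""
    def keep_nouns(text):
        doc = nlp(text)
        return nlp(" ".join(t.text for t in doc if t.pos_ in ("NOUN", "PROPN")))

    def score(text_a, text_b):
        return keep_nouns(text_a).similarity(keep_nouns(text_b))

    return score


def recommend_from_file(path, description):
    """Load the model, read the file line by line, and return the best title."""
    import spacy  # imported here so the helpers above work without spaCy installed

    nlp = spacy.load("en_core_web_md")
    with open(path, encoding="utf-8") as handle:
        movies = parse_movies(handle)  # a file object yields lines, not characters
    return most_similar_title(description, movies, spacy_noun_similarity(nlp))
```

Calling `recommend_from_file("movies.txt", sentence_to_compare)` with the description from the question would then return one of the titles. Passing the scorer in as a parameter makes it easy to compare the noun-only variant against plain `nlp(a).similarity(nlp(b))`.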
