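python-3.x - I need to recommend a movie from text using spaCy
Problem description
Thank you for taking the time to read this.
I'm not asking anyone to do my homework for me... I just need guidance.
I have a homework problem that I can't solve. I need to do the following with the spaCy library in Python.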
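Homework problem
Read in the movies.txt file. Each separate line is a description of a different movie.
Your task is to create a function that returns which movie a user would watch next if they have watched Planet Hulk, which has the description "Will he save their world or destroy it? When the Hulk becomes too dangerous for the Earth, the Illuminati trick Hulk into a shuttle and launch him into space to a planet where the Hulk can live in peace. Unfortunately, Hulk land on the planet Sakaar where he is sold into slavery and trained as a gladiator."
The function should take the description as a parameter and return the title of the most similar movie.
The movies.txt file contains the following: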
Movie A :When Hiccup discovers Toothless isn't the only Night Fury, he must seek "The Hidden World", a secret Dragon Utopia before a hired tyrant named Grimmel finds it first.
Movie B :After the death of Superman, several new people present themselves as possible successors.
Movie C :A darkness swirls at the center of a world-renowned dance company, one that will engulf the artistic director, an ambitious young dancer, and a grieving psychotherapist. Some will succumb to the nightmare. Others will finally wake up.
Movie D :A humorous take on Sir Arthur Conan Doyle's classic mysteries featuring Sherlock Holmes and Doctor Watson.
Movie E :A 16-year-old girl and her extended family are left reeling after her calculating grandmother unveils an array of secrets on her deathbed.
Movie F :In the last moments of World War II, a young German soldier fighting for survival finds a Nazi captain's uniform. Impersonating an officer, the man quickly takes on the monstrous identity of the perpetrators he is trying to escape from.
Movie G :The world at an end, a dying mother sends her young son on a quest to find the place that grants wishes.
Movie H :A musician helps a young singer and actress find fame, even as age and alcoholism send his own career into a downward spiral.
Movie I :Corporate analyst and single mom, Jen, tackles Christmas with a business-like approach until her uncle arrives with a handsome stranger in tow.
Movie J :Adapted from the bestselling novel by Madeleine St John, Ladies in Black is an alluring and tender-hearted comedy drama about the lives of a group of department store employees in 1959 Sydney.
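What I have tried:
I tried looking for a function in spaCy that does something like this, but the only thing I came across is the similarity function, and it only checks whether sentences have similar values...
And yes, I'm new to spaCy.
My code so far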
import spacy

nlp = spacy.load("en_core_web_md")

# Read one movie description per line, skipping blank lines
with open("movies.txt") as f:
    movies = [line.strip() for line in f if line.strip()]

sentence_to_compare = "Will he save their world or destroy it? When the Hulk becomes too dangerous for the Earth, the Illuminati trick Hulk into a shuttle and launch him into space to a planet where the Hulk can live in peace. Unfortunately, Hulk land on the planet Sakaar where he is sold into slavery and trained as a gladiator"
model_sentences = nlp(sentence_to_compare)

# Iterate over whole lines; looping over the raw file string would yield single characters
for sentence in movies:
    similarity = nlp(sentence).similarity(model_sentences)
    print(sentence + " - " + str(similarity))
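Solution
spaCy has several pretrained models available. You are using "en_core_web_md", which includes word vectors. According to the documentation, these included word vectors are "GloVe vectors trained on Common Crawl".
As the code and heatmap below show, these word vectors capture semantic similarity and can help you cluster topics.
Of course, this does not solve your homework problem; it is a hint at a technique you may find useful.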
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import spacy

# In a Jupyter notebook, run `%matplotlib inline` first to show the plot inline.
nlp = spacy.load("en_core_web_md")
tokens = nlp("Hulk Superman Batman dragon elf dance musical handsome romance war soldier")

# Check whether each token has a vector in this model
for token in tokens:
    print(token.text, token.has_vector, token.vector_norm, token.is_oov)

labels = [token.text for token in tokens]
print(labels)

# Pairwise similarity matrix between all tokens
M = np.zeros((len(tokens), len(tokens)))
for idx, token1 in enumerate(tokens):
    for idy, token2 in enumerate(tokens):
        M[idx, idy] = token1.similarity(token2)

ax = sns.heatmap(M, cmap="RdBu_r", xticklabels=labels, yticklabels=labels)
plt.show()
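For context, the `similarity` calls above compute the cosine similarity between vectors, and by default a spaCy `Doc` vector is the average of its token vectors. The sketch below reproduces that arithmetic in plain Python. The three-dimensional `toy_vectors` are made-up values for illustration only, not real GloVe vectors, and the helper names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # this sketch's choice for an all-zero vector
    return dot / (norm_u * norm_v)


def average_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


# Made-up toy vectors (assumption): two "superhero" words and one unrelated word
toy_vectors = {
    "hulk": [0.9, 0.1, 0.0],
    "superman": [0.8, 0.2, 0.1],
    "dance": [0.0, 0.1, 0.9],
}

hero_pair = cosine_similarity(toy_vectors["hulk"], toy_vectors["superman"])
mixed_pair = cosine_similarity(toy_vectors["hulk"], toy_vectors["dance"])
print(hero_pair > mixed_pair)
```

Because a whole description is reduced to one averaged vector, long texts with many common words tend to look alike, which is one reason the raw scores can be hard to tell apart.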
In addition, spaCy provides part-of-speech tagging, which you can use to extract proper nouns and common nouns from a sentence:
doc = nlp("Will he save their world or destroy it? When the Hulk becomes too dangerous for the Earth, the Illuminati trick Hulk into a shuttle and launch him into space to a planet where the Hulk can live in peace. Unfortunately, Hulk land on the planet Sakaar where he is sold into slavery and trained as a gladiator")
properNouns = [token.text for token in doc if token.pos_ == "PROPN"]
commonNouns = [token.text for token in doc if token.pos_ == "NOUN"]
print(properNouns)
# ['Hulk', 'Earth', 'Illuminati', 'Hulk', 'Hulk', 'Hulk', 'Sakaar']
print(commonNouns)
# ['world', 'shuttle', 'space', 'planet', 'peace', 'land', 'planet', 'slavery', 'gladiator']
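If you want to see how these pieces could fit together, here is a minimal sketch, not a reference solution for the assignment. It assumes each line of movies.txt looks like `Movie A :description`. The names `parse_movie_line`, `most_similar_title`, and `load_and_recommend` are made up for illustration, and `nlp` can be any callable that returns an object with a `.similarity()` method, such as a loaded spaCy pipeline.

```python
def parse_movie_line(line):
    """Split a line such as 'Movie A :Some text' into (title, description)."""
    title, _, description = line.partition(":")
    return title.strip(), description.strip()


def most_similar_title(description, movie_lines, nlp):
    """Return the title whose description scores highest against `description`."""
    target = nlp(description)
    best_title, best_score = None, float("-inf")
    for line in movie_lines:
        if not line.strip():
            continue  # skip blank lines
        title, text = parse_movie_line(line)
        score = nlp(text).similarity(target)
        if score > best_score:
            best_title, best_score = title, score
    return best_title


def load_and_recommend(description, path="movies.txt", model="en_core_web_md"):
    """Convenience wrapper; assumes spaCy and the named model are installed."""
    import spacy  # imported lazily so the helpers above work without spaCy

    nlp = spacy.load(model)
    with open(path, encoding="utf-8") as f:
        return most_similar_title(description, f, nlp)
```

With spaCy and the model installed, `load_and_recommend(sentence_to_compare)` would return the title with the highest score. Averaged word vectors are a coarse signal, so treat the ranking as a starting point; filtering each description down to its nouns first is one variation worth trying.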