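python - Google Cloud NL entity recognizer groups words together
Problem description
When I try to find entities in a long text input, Google Cloud's Natural Language API groups words together and then returns an incorrect entity for each group. Here is my program: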
def entity_recognizer(nouns):
    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/Users/superaitor/Downloads/link"
    text = ""
    for words in nouns:
        text += words + " "
    client = language.LanguageServiceClient()
    if isinstance(text, six.binary_type):
        text = text.decode('utf-8')
    document = types.Document(
        content=text.encode('utf-8'),
        type=enums.Document.Type.PLAIN_TEXT)
    encoding = enums.EncodingType.UTF32
    if sys.maxunicode == 65535:
        encoding = enums.EncodingType.UTF16
    entity = client.analyze_entities(document, encoding).entities
    entity_type = ('UNKNOWN', 'PERSON', 'LOCATION', 'ORGANIZATION',
                   'EVENT', 'WORK_OF_ART', 'CONSUMER_GOOD', 'OTHER')
    for entity in entity:
        #if entity_type[entity.type] is "PERSON":
        print(entity_type[entity.type])
        print(entity.name)
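Here `nouns` is a list of words. I turn it into a single string (I tried several ways of doing this, all with the same result), but the program prints output like this: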
PERSON
liberty secularism etching domain professor lecturer tutor royalty
government adviser commissioner
OTHER
business view society economy
OTHER
business
OTHER
verge industrialization market system custom shift rationality
OTHER
family kingdom life drunkenness college student appearance income family
brink poverty life writer variety attitude capitalism age process
production factory system
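Any suggestions on how to fix this?
Solution
To analyze the entities in a text, you can use the sample from the documentation, which looks like this: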
import argparse
import sys

from google.cloud import language
from google.cloud.language import enums
from google.cloud.language import types
import six


def entities_text(text):
    """Detects entities in the text."""
    client = language.LanguageServiceClient()

    if isinstance(text, six.binary_type):
        text = text.decode('utf-8')

    # Instantiates a plain text document.
    document = types.Document(
        content=text,
        type=enums.Document.Type.PLAIN_TEXT)

    # Detects entities in the document. You can also analyze HTML with:
    #   document.type == enums.Document.Type.HTML
    entities = client.analyze_entities(document).entities

    # entity types from enums.Entity.Type
    entity_type = ('UNKNOWN', 'PERSON', 'LOCATION', 'ORGANIZATION',
                   'EVENT', 'WORK_OF_ART', 'CONSUMER_GOOD', 'OTHER')

    for entity in entities:
        print('=' * 20)
        print(u'{:<16}: {}'.format('name', entity.name))
        print(u'{:<16}: {}'.format('type', entity_type[entity.type]))
        print(u'{:<16}: {}'.format('metadata', entity.metadata))
        print(u'{:<16}: {}'.format('salience', entity.salience))
        print(u'{:<16}: {}'.format('wikipedia_url',
              entity.metadata.get('wikipedia_url', '-')))


entities_text("Donald Trump is president of United States of America")
The output of this sample is:
====================
name : Donald Trump
type : PERSON
metadata : <google.protobuf.pyext._message.ScalarMapContainer object at 0x7fd9d0125170>
salience : 0.9564903974533081
wikipedia_url : https://en.wikipedia.org/wiki/Donald_Trump
====================
name : United States of America
type : LOCATION
metadata : <google.protobuf.pyext._message.ScalarMapContainer object at 0x7fd9d01252b0>
salience : 0.04350961744785309
wikipedia_url : https://en.wikipedia.org/wiki/United_States
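As you can see in this example, entity analysis inspects the given text for known entities (proper nouns such as public figures, landmarks, and so on). It does not provide an entity for every word in the text.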
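If you only care about those known entities, one possible approach is to keep just the results that carry a `wikipedia_url` in their metadata, as the two entities in the sample output do. The following is a minimal sketch, not part of the documentation sample: `known_entities` is a hypothetical helper, and it assumes each entity exposes `.name` and a mapping-like `.metadata`, as in the code above. The `SimpleNamespace` objects are stand-ins so the sketch runs without credentials.

```python
from types import SimpleNamespace


def known_entities(entities):
    """Keep only entities whose metadata contains a 'wikipedia_url'.

    Hypothetical helper: `entities` is any iterable of objects exposing
    `.name` and a mapping-like `.metadata`, as in the sample above.
    """
    result = []
    for entity in entities:
        # Copy into a plain dict, which also prints readably.
        metadata = dict(entity.metadata)
        if 'wikipedia_url' in metadata:
            result.append((entity.name, metadata['wikipedia_url']))
    return result


# Stand-in data (assumption): real code would pass
# client.analyze_entities(document).entities instead.
sample = [
    SimpleNamespace(
        name='Donald Trump',
        metadata={'wikipedia_url': 'https://en.wikipedia.org/wiki/Donald_Trump'}),
    SimpleNamespace(name='business view society economy', metadata={}),
]

print(known_entities(sample))
```

With the stand-in data, only the first entry is kept. A word group like the ones in the question is dropped, assuming it comes back without a `wikipedia_url`.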