huggingface-transformers - Parsing the output of a Hugging Face transformer
Question
I want to use bert-english-uncased-finetuned-pos, the transformer mentioned here.
I query the model like this...
from transformers import AutoTokenizer, AutoModelForTokenClassification
tokenizer = AutoTokenizer.from_pretrained("vblagoje/bert-english-uncased-finetuned-pos")
model = AutoModelForTokenClassification.from_pretrained("vblagoje/bert-english-uncased-finetuned-pos")
text = "My name is Clara and I live in Berkeley, California."
input_ids = tokenizer.encode(text + '</s>', return_tensors='pt')
outputs = model(input_ids)
But outputs ends up looking like this:
(tensor([[[-1.8196e+00, -1.9783e+00, -1.7416e+00,  1.2082e+00, -7.0337e-02,
           -7.0322e-03,  3.4300e-01, -9.6914e-01, -1.3546e+00,  7.7266e-03,
            3.7128e+00, -3.4061e-01,  4.8385e+00, -1.2548e+00, -5.1845e-01,
            7.0140e-01,  1.0394e+00],
          [-1.2702e+00, -1.5518e+00, -1.1553e+00, -4.4077e-01, -9.8661e-01,
           -3.2680e-01, -6.5338e-01, -3.9779e-01, -7.5383e-01, -1.2677e+00,
            9.6353e+00,  1.9938e-01, -1.0282e+00, -7.5071e-01, -1.0307e+00,
           -8.0589e-01,  4.2073e-01],
          ...,
          [-9.5640e-01, -5.7213e-01, -1.0245e+00, -5.3566e-01, -1.5287e-01,
           -6.6977e-01, -5.3392e-01, -3.1967e-02, -7.3077e-01, -3.1048e-01,
           -7.2973e-01, -3.1701e-01,  1.0196e+01, -5.2346e-01,  4.0820e-01,
           -2.1350e-01,  1.0340e+00]]], grad_fn=<...>),)
But according to the documentation, I expected the output to be in JSON format...
... "word": "live"}, {"entity_group": "ADP", "score": 0.999370276927948, "word": "in"}, {"entity_group": "PROPN", "score": 0.9987357258796692, "word": "berkeley"}, {"entity_group": "PUNCT", "score": 0.9996636509895325, "word": ","}, {"entity_group": "PROPN", "score": 0.9985638856887817, "word": "california"}, {"entity_group": "PUNCT", "score": 0.9996631145477295, "word": "."}]
What am I doing wrong? How can I parse the current output to the desired JSON output?
Solution
What you see there is the output of Hugging Face's hosted inference API. That API is not part of the transformers library, but you can build something similar yourself. All you need is the TokenClassificationPipeline:
from transformers import AutoTokenizer, AutoModelForTokenClassification, TokenClassificationPipeline
tokenizer = AutoTokenizer.from_pretrained("vblagoje/bert-english-uncased-finetuned-pos")
model = AutoModelForTokenClassification.from_pretrained("vblagoje/bert-english-uncased-finetuned-pos")
p = TokenClassificationPipeline(model=model, tokenizer=tokenizer)
p('My name is Clara and I live in Berkeley, California.')
Output:
[{'word': 'my', 'score': 0.9994694590568542, 'entity': 'PRON', 'index': 1},
{'word': 'name', 'score': 0.9971255660057068, 'entity': 'NOUN', 'index': 2},
{'word': 'is', 'score': 0.9938186407089233, 'entity': 'AUX', 'index': 3},
{'word': 'clara', 'score': 0.9983252882957458, 'entity': 'PROPN', 'index': 4},
{'word': 'and', 'score': 0.9991229772567749, 'entity': 'CCONJ', 'index': 5},
{'word': 'i', 'score': 0.9994894862174988, 'entity': 'PRON', 'index': 6},
{'word': 'live', 'score': 0.9983154535293579, 'entity': 'VERB', 'index': 7},
{'word': 'in', 'score': 0.999370276927948, 'entity': 'ADP', 'index': 8},
{'word': 'berkeley',
'score': 0.9987357258796692,
'entity': 'PROPN',
'index': 9},
{'word': ',', 'score': 0.9996636509895325, 'entity': 'PUNCT', 'index': 10},
{'word': 'california',
'score': 0.9985638856887817,
'entity': 'PROPN',
'index': 11},
{'word': '.', 'score': 0.9996631145477295, 'entity': 'PUNCT', 'index': 12}]
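Note that the pipeline above emits one dict per token, while the inference API merges consecutive tokens with the same tag into entity_group entries. A minimal sketch of that grouping step in plain Python (the input dicts below are hypothetical but follow the shape of the pipeline output above; merging of "##" subword pieces is deliberately left out):

```python
def group_entities(tokens):
    """Merge consecutive tokens sharing the same tag into one group,
    mimicking the inference API's entity_group format. The group score
    is the mean of the member scores."""
    groups = []
    for tok in tokens:
        if groups and groups[-1]["entity_group"] == tok["entity"]:
            # Same tag as the previous token: extend the current group.
            groups[-1]["word"] += " " + tok["word"]
            groups[-1]["scores"].append(tok["score"])
        else:
            groups.append({"entity_group": tok["entity"],
                           "word": tok["word"],
                           "scores": [tok["score"]]})
    return [{"entity_group": g["entity_group"],
             "score": sum(g["scores"]) / len(g["scores"]),
             "word": g["word"]} for g in groups]

# Hypothetical per-token input; "new york" shows two consecutive
# PROPN tokens being merged into one group.
tokens = [{"word": "new", "score": 0.99, "entity": "PROPN"},
          {"word": "york", "score": 0.97, "entity": "PROPN"},
          {"word": ".", "score": 0.99, "entity": "PUNCT"}]
print(group_entities(tokens))
```

Depending on your transformers version, the pipeline may also do this for you via a grouping option (grouped_entities in older releases, aggregation_strategy in newer ones); check the TokenClassificationPipeline documentation for the variant your version supports.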
You can find the other available pipelines which might be used by the inference API here.
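If you would rather keep the raw model call from the question, you can decode the logits yourself: softmax each token's row over the label dimension, take the argmax, and map the index through the model's id-to-label table. A stdlib-only sketch (the label map and logit rows below are made up for illustration; with the real model you would pass outputs[0][0].tolist() and model.config.id2label instead):

```python
import math

def decode_logits(token_logits, id2label):
    """Softmax each token's logit row and pick the highest-probability tag."""
    decoded = []
    for row in token_logits:
        m = max(row)
        exps = [math.exp(x - m) for x in row]  # numerically stable softmax
        total = sum(exps)
        best = max(range(len(row)), key=lambda i: exps[i])
        decoded.append({"entity": id2label[best],
                        "score": exps[best] / total})
    return decoded

# Made-up 3-tag label map and two token rows, purely for illustration.
id2label = {0: "PRON", 1: "NOUN", 2: "VERB"}
logits = [[4.0, 0.5, -1.0],
          [-0.5, 3.0, 0.1]]
print(decode_logits(logits, id2label))
```

This is essentially what the pipeline does internally before attaching the word strings and token indices.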