Parsing the output of a Hugging Face transformer

Problem description

I'm looking to use the bert-english-uncased-finetuned-pos transformer model mentioned here:

https://huggingface.co/vblagoje/bert-english-uncased-finetuned-pos?text=My+name+is+Clara+and+I+live+in+Berkeley%2C+California

I query the model like this...

from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("vblagoje/bert-english-uncased-finetuned-pos")

model = AutoModelForTokenClassification.from_pretrained("vblagoje/bert-english-uncased-finetuned-pos")

text = "My name is Clara and I live in Berkeley, California."
input_ids = tokenizer.encode(text + '</s>', return_tensors='pt')
outputs = model(input_ids)

But `outputs` comes out looking like this:

(tensor([[[-1.8196e+00, -1.9783e+00, -1.7416e+00,  1.2082e+00, -7.0337e-02, -7.0322e-03,  3.4300e-01, -9.6914e-01, -1.3546e+00,  7.7266e-03,  3.7128e+00, -3.4061e-01,  4.8385e+00, -1.2548e+00, -5.1845e-01,  7.0140e-01,  1.0394e+00],
[-1.2702e+00, -1.5518e+00, -1.1553e+00, -4.4077e-01, -9.8661e-01, -3.2680e-01, -6.5338e-01, -3.9779e-01, -7.5383e-01, -1.2677e+00,  9.6353e+00,  1.9938e-01, -1.0282e+00, -7.5071e-01, -1.0307e+00, -8.0589e-01,  4.2073e-01],
[-9.6988e-01, -5.0090e-01, -1.3858e+00, -1.0554e+00, -1.4040e+00, -7.5977e-01, -7.4156e-01,  8.0594e+00, -5.1854e-01, -1.9098e+00, -1.6362e-02,  1.0594e+00, -8.4962e-01, -1.7415e+00, -1.0628e+00, -1.7485e-01, -1.1490e+00],
[-1.4368e+00, -1.6313e-01, -1.3202e+00,  8.7465e+00, -1.3782e+00, -9.8889e-01, -1.1371e+00, -1.0917e+00, -9.8495e-01, -9.3237e-01, -9.6111e-01, -4.1658e-01, -7.3133e-01, -9.6004e-01, -9.5337e-01,  3.1836e+00, -8.3462e-01],
[-7.9476e-01, -7.9640e-01, -9.0027e-01, -6.9506e-01, -8.9706e-01, -6.9383e-01, -3.1590e-01,  1.2390e+00, -1.0443e+00, -9.9977e-01, -8.8189e-01,  8.7941e+00, -9.9445e-01, -1.2076e+00, -1.1424e+00, -9.7801e-01,  5.6683e-01],
[-8.2837e-01, -5.5060e-01, -2.1352e-01, -8.8721e-01,  9.5536e+00,  1.0478e+00, -5.6208e-01, -7.1037e-01, -7.0248e-01,  1.1298e-01,

...

-7.3788e-01,  4.3640e-03,  1.6994e+00,  1.1528e-01, -1.0983e+00, -8.9202e-01, -1.2869e+00,  4.9141e+00, -6.2096e-01,  4.8374e+00,  3.2384e-01,  4.6213e-01],
[-1.3622e+00,  2.0772e+00, -1.6680e+00, -8.8679e-01, -8.6959e-01, -1.7468e+00, -1.1424e+00,  1.6996e+00,  3.5800e-01, -4.3927e-01, -3.6129e-01, -4.2220e-01, -1.7912e+00,  8.0154e-01,  7.4594e-01, -1.0620e+00,  3.8152e+00],
[-1.2889e+00, -2.9379e-01, -1.6543e+00, -4.3326e-01, -2.4919e-01, -4.0112e-01, -4.4255e-01,  2.2697e-01, -4.6042e-01, -3.7862e-03, -6.3061e-01, -1.3280e+00,  8.5533e+00, -4.6881e-01,  2.3882e+00,  2.4533e-01, -1.4095e-01],
[-9.5640e-01, -5.7213e-01, -1.0245e+00, -5.3566e-01, -1.5287e-01, -6.6977e-01, -5.3392e-01, -3.1967e-02, -7.3077e-01, -3.1048e-01, -7.2973e-01, -3.1701e-01,  1.0196e+01, -5.2346e-01,  4.0820e-01, -2.1350e-01,  1.0340e+00]]], grad_fn=<...>),)

But according to the documentation, I expected output in JSON format...

... "word": "live" }, { "entity_group": "ADP", "score": 0.999370276927948, "word": "in" }, { "entity_group": "PROPN", "score": 0.9987357258796692, "word": "berkeley" }, { "entity_group": "PUNCT", "score": 0.9996636509895325, "word": "," }, { "entity_group": "PROPN", "score": 0.9985638856887817, "word": "california" }, { "entity_group": "PUNCT", "score": 0.9996631145477295, "word": "." }]

What am I doing wrong? How can I parse the current output to the desired JSON output?

Tags: huggingface-transformers, huggingface-tokenizers

Solution


What you see there is the hosted inference API from Hugging Face. This API is not part of the transformers library, but you can build something similar. All you need is the TokenClassificationPipeline:

from transformers import AutoTokenizer, AutoModelForTokenClassification, TokenClassificationPipeline

tokenizer = AutoTokenizer.from_pretrained("vblagoje/bert-english-uncased-finetuned-pos")

model = AutoModelForTokenClassification.from_pretrained("vblagoje/bert-english-uncased-finetuned-pos")
p = TokenClassificationPipeline(model=model, tokenizer=tokenizer)
p('My name is Clara and I live in Berkeley, California.')

Output:

[{'word': 'my', 'score': 0.9994694590568542, 'entity': 'PRON', 'index': 1},
 {'word': 'name', 'score': 0.9971255660057068, 'entity': 'NOUN', 'index': 2},
 {'word': 'is', 'score': 0.9938186407089233, 'entity': 'AUX', 'index': 3},
 {'word': 'clara', 'score': 0.9983252882957458, 'entity': 'PROPN', 'index': 4},
 {'word': 'and', 'score': 0.9991229772567749, 'entity': 'CCONJ', 'index': 5},
 {'word': 'i', 'score': 0.9994894862174988, 'entity': 'PRON', 'index': 6},
 {'word': 'live', 'score': 0.9983154535293579, 'entity': 'VERB', 'index': 7},
 {'word': 'in', 'score': 0.999370276927948, 'entity': 'ADP', 'index': 8},
 {'word': 'berkeley',
  'score': 0.9987357258796692,
  'entity': 'PROPN',
  'index': 9},
 {'word': ',', 'score': 0.9996636509895325, 'entity': 'PUNCT', 'index': 10},
 {'word': 'california',
  'score': 0.9985638856887817,
  'entity': 'PROPN',
  'index': 11},
 {'word': '.', 'score': 0.9996631145477295, 'entity': 'PUNCT', 'index': 12}]
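Since the pipeline returns a plain Python list of dicts, producing the JSON you were expecting is just a `json.dumps` call. A minimal sketch, hard-coding a slice of the output above so it runs without downloading the model:

```python
import json

# A slice of the pipeline output shown above (hard-coded so this
# sketch runs without downloading the model)
predictions = [
    {"word": "my", "score": 0.9994694590568542, "entity": "PRON", "index": 1},
    {"word": "name", "score": 0.9971255660057068, "entity": "NOUN", "index": 2},
    {"word": "is", "score": 0.9938186407089233, "entity": "AUX", "index": 3},
]

# Serialize the list of dicts to a JSON string
print(json.dumps(predictions, indent=2))
```

Note the keys here are `entity`/`index` rather than the `entity_group` keys the hosted API returns; newer versions of transformers also accept an aggregation argument on the pipeline that merges consecutive tokens into `entity_group` entries, so check the pipeline documentation for your installed version.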

You can find the other available pipelines which might be used by the inference API here.
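If you do want to decode the raw tuple from your original code yourself instead of using the pipeline: the first element holds one row of logits per token, so you take the argmax over the label dimension and map the resulting ids through the model's label mapping (`model.config.id2label`). A minimal sketch with stand-in logits and a hypothetical three-label mapping:

```python
# Stand-in for outputs[0].tolist()[0]: one row of logits per token
# (4 tokens x 3 labels here; the real model has one column per POS tag)
logits = [
    [2.0, 0.1, -1.0],
    [0.2, 3.0, 0.0],
    [0.0, 0.5, 2.5],
    [1.5, 0.0, 0.3],
]

# Hypothetical stand-in for model.config.id2label
id2label = {0: "PRON", 1: "NOUN", 2: "VERB"}

# argmax over the label dimension, then map label ids to tag names
label_ids = [max(range(len(row)), key=row.__getitem__) for row in logits]
labels = [id2label[i] for i in label_ids]
print(labels)  # ['PRON', 'NOUN', 'VERB', 'PRON']
```

The pipeline does the same thing internally, plus softmax scores and token-to-word alignment, which is why it is the more convenient route.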

