How can I improve DBpedia Spotlight results?

Problem description

I am using DBpedia Spotlight to extract DBpedia resources, as shown below.

import json
from SPARQLWrapper import SPARQLWrapper, JSON
import requests
import urllib.parse

# Initial constants
BASE_URL = 'http://api.dbpedia-spotlight.org/en/annotate?text={text}&confidence={confidence}&support={support}'
TEXT = "Tolerance, safety and efficacy of Hedera helix extract in inflammatory bronchial diseases under clinical practice conditions: a prospective, open, multicentre postmarketing study in 9657 patients.     In this postmarketing study 9657 patients (5181 children) with bronchitis (acute or chronic bronchial inflammatory disease) were treated with a syrup containing dried ivy leaf extract. After 7 days of therapy, 95% of the patients showed improvement or healing of their symptoms. The safety of the therapy was very good with an overall incidence of adverse events of 2.1% (mainly gastrointestinal disorders with 1.5%). In those patients who got concomitant medication as well, it could be shown that the additional application of antibiotics had no benefit respective to efficacy but did increase the relative risk for the occurrence of side effects by 26%. In conclusion, it is to say that the dried ivy leaf extract is effective and well tolerated in patients with bronchitis. In view of the large population considered, future analyses should approach specific issues concerning therapy by age group, concomitant therapy and baseline conditions."
CONFIDENCE = '0.5'
SUPPORT = '10'
REQUEST = BASE_URL.format(
    text=urllib.parse.quote_plus(TEXT), 
    confidence=CONFIDENCE, 
    support=SUPPORT
)
HEADERS = {'Accept': 'application/json'}
sparql = SPARQLWrapper("http://dbpedia.org/sparql")
all_urls = []

r = requests.get(url=REQUEST, headers=HEADERS)
response = r.json()
resources = response['Resources']
for res in resources:
    all_urls.append(res['@URI'])
print(all_urls)

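My text is as follows:

Tolerance, safety and efficacy of Hedera helix extract in inflammatory bronchial diseases under clinical practice conditions: a prospective, open, multicentre postmarketing study in 9657 patients. In this postmarketing study 9657 patients (5181 children) with bronchitis (acute or chronic bronchial inflammatory disease) were treated with a syrup containing dried ivy leaf extract. After 7 days of therapy, 95% of the patients showed improvement or healing of their symptoms. The safety of the therapy was very good with an overall incidence of adverse events of 2.1% (mainly gastrointestinal disorders with 1.5%). In those patients who got concomitant medication as well, it could be shown that the additional application of antibiotics had no benefit respective to efficacy but did increase the relative risk for the occurrence of side effects by 26%. In conclusion, it is to say that the dried ivy leaf extract is effective and well tolerated in patients with bronchitis. In view of the large population considered, future analyses should approach specific issues concerning therapy by age group, concomitant therapy and baseline conditions.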

The results I get are as follows.

['http://dbpedia.org/resource/Hedera', 
'http://dbpedia.org/resource/Helix', 
'http://dbpedia.org/resource/Bronchitis', 
'http://dbpedia.org/resource/Cough_medicine',
'http://dbpedia.org/resource/Hedera', 
'http://dbpedia.org/resource/After_7',
'http://dbpedia.org/resource/Gastrointestinal_tract',
'http://dbpedia.org/resource/Antibiotics',
'http://dbpedia.org/resource/Relative_risk',
'http://dbpedia.org/resource/Hedera',
'http://dbpedia.org/resource/Bronchitis']

As you can see, the results are not very good.

For example, consider "Hedera helix extract" in the text above. Even though DBpedia has a resource for Hedera helix (http://dbpedia.org/resource/Hedera_helix), Spotlight still outputs it as two URIs: http://dbpedia.org/resource/Hedera and http://dbpedia.org/resource/Helix

For my dataset, I would like to get the longest matching term that exists in DBpedia as the result. What improvements can I make to get this output?

I am happy to provide more details if needed.

Tags: sparql, wikipedia, dbpedia, linked-data, spotlight-dbpedia

Solution


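Although I am answering this question quite late: you can use the BabelNet API from Python (through the Babelfy client in babelpy) to get DBpedia URIs that cover longer text spans. I reproduced your example with the following code: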

from babelpy.babelfy import BabelfyClient

# Adjacent string literals inside parentheses are joined into one string,
# so the character offsets match the original single-line text.
text = (
    "Tolerance, safety and efficacy of Hedera helix extract in inflammatory "
    "bronchial diseases under clinical practice conditions: a prospective, open, "
    "multicentre postmarketing study in 9657 patients.     In this postmarketing "
    "study 9657 patients (5181 children) with bronchitis (acute or chronic "
    "bronchial inflammatory disease) were treated with a syrup containing dried ivy "
    "leaf extract. After 7 days of therapy, 95% of the patients showed improvement "
    "or healing of their symptoms. The safety of the therapy was very good with an "
    "overall incidence of adverse events of 2.1% (mainly gastrointestinal disorders "
    "with 1.5%). In those patients who got concomitant medication as well, it could "
    "be shown that the additional application of antibiotics had no benefit "
    "respective to efficacy but did increase the relative risk for the occurrence "
    "of side effects by 26%. In conclusion, it is to say that the dried ivy leaf "
    "extract is effective and well tolerated in patients with bronchitis. In view "
    "of the large population considered, future analyses should approach specific "
    "issues concerning therapy by age group, concomitant therapy and baseline "
    "conditions."
)

# Instantiate the Babelfy client (replace the placeholder with your own API key).
params = dict()
params['lang'] = 'english'
babel_client = BabelfyClient("**Your Registration Code For API**", params)

# Babelfy the text.
babel_client.babelfy(text)

# Get all merged entities.
print(babel_client.all_merged_entities)

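For every merged entity in the text, the output has the format shown in the sample below. You can store and process this dictionary structure further to extract the DBpedia URIs.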

{'start': 34,
'end': 45,
'text': 'Hedera helix',
'isEntity': True,
'tokenFragment': {'start': 6, 'end': 7},
'charFragment': {'start': 34, 'end': 45},
'babelSynsetID': 'bn:00021109n',
'DBpediaURL': 'http://dbpedia.org/resource/Hedera_helix',
'BabelNetURL': 'http://babelnet.org/rdf/s00021109n',
'score': 1.0,
'coherenceScore': 0.0847457627118644,
'globalScore': 0.0013494092960806407,
'source': 'BABELFY'},
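
As a minimal sketch of that post-processing step, the function below collects the DBpedia URIs and keeps only the longest span when one entity lies inside another. It assumes each entity is a dict with the 'start', 'end' and 'DBpediaURL' keys shown in the sample output, and that entities without a DBpedia link carry an empty or missing 'DBpediaURL'. The helper name longest_span_uris and the second and third sample entries are hypothetical and only there for illustration.

```python
def longest_span_uris(entities):
    """Return DBpedia URIs in text order, dropping entities nested in a longer one."""
    # Keep only entities that actually carry a DBpedia link.
    linked = [e for e in entities if e.get('DBpediaURL')]
    # Visit the longest spans first so shorter, nested spans can be checked against them.
    linked.sort(key=lambda e: e['end'] - e['start'], reverse=True)
    kept = []
    for ent in linked:
        nested = any(
            k['start'] <= ent['start'] and ent['end'] <= k['end'] for k in kept
        )
        if not nested:
            kept.append(ent)
    # Restore the order in which the entities appear in the text.
    kept.sort(key=lambda e: e['start'])
    return [e['DBpediaURL'] for e in kept]


# First entry is taken from the sample output; the other two are hypothetical.
sample = [
    {'start': 34, 'end': 45, 'text': 'Hedera helix',
     'DBpediaURL': 'http://dbpedia.org/resource/Hedera_helix'},
    {'start': 34, 'end': 39, 'text': 'Hedera',
     'DBpediaURL': 'http://dbpedia.org/resource/Hedera'},
    {'start': 41, 'end': 45, 'text': 'helix', 'DBpediaURL': ''},
]
uris = longest_span_uris(sample)
print(uris)
```

With real data you would pass babel_client.all_merged_entities instead of the sample list.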
