sparql - How can I improve the results of DBpedia Spotlight?
Question
I am using DBpedia Spotlight to extract DBpedia resources, as shown below.
import json
from SPARQLWrapper import SPARQLWrapper, JSON
import requests
import urllib.parse
# Initial constants
BASE_URL = 'http://api.dbpedia-spotlight.org/en/annotate?text={text}&confidence={confidence}&support={support}'
TEXT = "Tolerance, safety and efficacy of Hedera helix extract in inflammatory bronchial diseases under clinical practice conditions: a prospective, open, multicentre postmarketing study in 9657 patients. In this postmarketing study 9657 patients (5181 children) with bronchitis (acute or chronic bronchial inflammatory disease) were treated with a syrup containing dried ivy leaf extract. After 7 days of therapy, 95% of the patients showed improvement or healing of their symptoms. The safety of the therapy was very good with an overall incidence of adverse events of 2.1% (mainly gastrointestinal disorders with 1.5%). In those patients who got concomitant medication as well, it could be shown that the additional application of antibiotics had no benefit respective to efficacy but did increase the relative risk for the occurrence of side effects by 26%. In conclusion, it is to say that the dried ivy leaf extract is effective and well tolerated in patients with bronchitis. In view of the large population considered, future analyses should approach specific issues concerning therapy by age group, concomitant therapy and baseline conditions."
CONFIDENCE = '0.5'
SUPPORT = '10'
REQUEST = BASE_URL.format(
    text=urllib.parse.quote_plus(TEXT),
    confidence=CONFIDENCE,
    support=SUPPORT
)
HEADERS = {'Accept': 'application/json'}
sparql = SPARQLWrapper("http://dbpedia.org/sparql")
all_urls = []
r = requests.get(url=REQUEST, headers=HEADERS)
response = r.json()
resources = response['Resources']
for res in resources:
    all_urls.append(res['@URI'])
print(all_urls)
The input text is the English study abstract assigned to TEXT in the code above.
The results I get are as follows.
['http://dbpedia.org/resource/Hedera',
'http://dbpedia.org/resource/Helix',
'http://dbpedia.org/resource/Bronchitis',
'http://dbpedia.org/resource/Cough_medicine',
'http://dbpedia.org/resource/Hedera',
'http://dbpedia.org/resource/After_7',
'http://dbpedia.org/resource/Gastrointestinal_tract',
'http://dbpedia.org/resource/Antibiotics',
'http://dbpedia.org/resource/Relative_risk',
'http://dbpedia.org/resource/Hedera',
'http://dbpedia.org/resource/Bronchitis']
As you can see, the results are not very good.
For example, consider "Hedera helix extract" in the text above. Even though DBpedia has a resource for Hedera helix (http://dbpedia.org/resource/Hedera_helix), Spotlight still outputs it as two URIs: http://dbpedia.org/resource/Hedera and http://dbpedia.org/resource/Helix.
For my dataset, I want the result to be the longest term that exists in DBpedia. In that case, what improvements can I make to get the output I want?
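To make "the longest term" concrete, here is a minimal sketch of one possible post-processing step. It is not a Spotlight feature. It assumes each Spotlight resource carries the `@surfaceForm` and `@offset` fields of the JSON response, merges two mentions that are separated only by whitespace, and keeps the merged name if a lookup says the longer resource exists. The function name `prefer_longest` and the `exists` callable are hypothetical; a real `exists` might issue a SPARQL `ASK` query against the DBpedia endpoint.

```python
DBPEDIA_PREFIX = "http://dbpedia.org/resource/"


def prefer_longest(text, resources, exists):
    """Replace two adjacent annotations with one longer resource when it exists.

    text:      the annotated text
    resources: Spotlight-style dicts with '@URI', '@surfaceForm', '@offset'
    exists:    hypothetical callable mapping a URI to True/False
    """
    # Convert each annotation to (start, end, uri) and order by position.
    spans = sorted(
        (
            (int(r["@offset"]), int(r["@offset"]) + len(r["@surfaceForm"]), r["@URI"])
            for r in resources
        ),
        key=lambda span: span[0],
    )
    uris = []
    i = 0
    while i < len(spans):
        start, end, uri = spans[i]
        if i + 1 < len(spans):
            next_start, next_end, _ = spans[i + 1]
            # Merge only when nothing but whitespace separates the two mentions.
            if text[end:next_start].isspace():
                surface = "_".join(text[start:next_end].split())
                # Assumption: the resource name is the surface form with an initial capital.
                candidate = DBPEDIA_PREFIX + surface[0].upper() + surface[1:]
                if exists(candidate):
                    uris.append(candidate)
                    i += 2
                    continue
        uris.append(uri)
        i += 1
    return uris


# Usage with a stubbed lookup; a real one would query DBpedia.
sample = "efficacy of Hedera helix extract in bronchitis"
annotations = [
    {"@URI": DBPEDIA_PREFIX + "Hedera", "@surfaceForm": "Hedera", "@offset": "12"},
    {"@URI": DBPEDIA_PREFIX + "Helix", "@surfaceForm": "helix", "@offset": "19"},
    {"@URI": DBPEDIA_PREFIX + "Bronchitis", "@surfaceForm": "bronchitis", "@offset": "36"},
]
known = {DBPEDIA_PREFIX + "Hedera_helix"}
print(prefer_longest(sample, annotations, known.__contains__))
```

The sketch only merges pairs of mentions; longer runs of adjacent mentions would need a loop that keeps extending the span while the lookup succeeds.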
I am happy to provide more details if needed.
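Answer
Although I am answering this question quite late, you can use the Babelnet API from Python (through the Babelfy client in the babelpy package) to get DBpedia URIs that cover longer spans of text. I reproduced the problem with the following code: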
from babelpy.babelfy import BabelfyClient

# Adjacent string literals are concatenated, so the abstract stays one line of text.
text = (
    "Tolerance, safety and efficacy of Hedera helix extract in inflammatory "
    "bronchial diseases under clinical practice conditions: a prospective, open, "
    "multicentre postmarketing study in 9657 patients. In this postmarketing "
    "study 9657 patients (5181 children) with bronchitis (acute or chronic "
    "bronchial inflammatory disease) were treated with a syrup containing dried ivy "
    "leaf extract. After 7 days of therapy, 95% of the patients showed improvement "
    "or healing of their symptoms. The safety of the therapy was very good with an "
    "overall incidence of adverse events of 2.1% (mainly gastrointestinal disorders "
    "with 1.5%). In those patients who got concomitant medication as well, it could "
    "be shown that the additional application of antibiotics had no benefit "
    "respective to efficacy but did increase the relative risk for the occurrence "
    "of side effects by 26%. In conclusion, it is to say that the dried ivy leaf "
    "extract is effective and well tolerated in patients with bronchitis. In view "
    "of the large population considered, future analyses should approach specific "
    "issues concerning therapy by age group, concomitant therapy and baseline "
    "conditions."
)

# Instantiate the Babelfy client.
params = dict()
params['lang'] = 'english'
babel_client = BabelfyClient("**Your Registration Code For API**", params)

# Babelfy the text.
babel_client.babelfy(text)

# Print all merged entities.
print(babel_client.all_merged_entities)
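For every merged entity in the text, the output has the format shown in the example below. You can store and process this dictionary structure further to extract the DBpedia URIs.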
{'start': 34,
'end': 45,
'text': 'Hedera helix',
'isEntity': True,
'tokenFragment': {'start': 6, 'end': 7},
'charFragment': {'start': 34, 'end': 45},
'babelSynsetID': 'bn:00021109n',
'DBpediaURL': 'http://dbpedia.org/resource/Hedera_helix',
'BabelNetURL': 'http://babelnet.org/rdf/s00021109n',
'score': 1.0,
'coherenceScore': 0.0847457627118644,
'globalScore': 0.0013494092960806407,
'source': 'BABELFY'},
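As a rough illustration of that last step, the sketch below collects the DBpedia URIs from a list of dictionaries shaped like the sample above. The helper name `dbpedia_uris` is hypothetical, and the code assumes that fragments which are not entities either set `isEntity` to False or have no `DBpediaURL` value.

```python
def dbpedia_uris(entities):
    """Return the unique DBpedia URLs of the entities, in order of first appearance."""
    seen = set()
    uris = []
    for entity in entities:
        uri = entity.get("DBpediaURL")
        # Skip non-entity fragments and entities without a DBpedia mapping (assumed shape).
        if entity.get("isEntity") and uri and uri not in seen:
            seen.add(uri)
            uris.append(uri)
    return uris


# Stand-in for babel_client.all_merged_entities, trimmed to the fields used here.
merged = [
    {"text": "Hedera helix", "isEntity": True,
     "DBpediaURL": "http://dbpedia.org/resource/Hedera_helix"},
    {"text": "of", "isEntity": False},
    {"text": "Hedera helix", "isEntity": True,
     "DBpediaURL": "http://dbpedia.org/resource/Hedera_helix"},
]
print(dbpedia_uris(merged))
```

With a live client, passing `babel_client.all_merged_entities` in place of `merged` should work the same way if the returned dictionaries have the shape shown in the sample output.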