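python - How to parse the GENIA corpus XML in Python
Problem description
I have the following XML structure, and I want to parse it to get all complete sentences in one list and all text between the tags in another list.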
<article>
<articleinfo>
<bibliomisc>MEDLINE:95369245</bibliomisc>
</articleinfo>
<title>
<sentence><cons lex="IL-2_gene_expression" sem="G#other_name"><cons lex="IL-2_gene" sem="G#DNA_domain_or_region">IL-2 gene</cons> expression</cons> and <cons lex="NF-kappa_B_activation" sem="G#other_name"><cons lex="NF-kappa_B" sem="G#protein_molecule">NF-kappa B</cons> activation</cons> through <cons lex="CD28" sem="G#protein_molecule">CD28</cons> requires reactive oxygen production by <cons lex="5-lipoxygenase" sem="G#protein_molecule">5-lipoxygenase</cons>.</sentence>
</title>
<abstract>
<sentence>Activation of the <cons lex="CD28_surface_receptor" sem="G#protein_family_or_group"><cons lex="CD28" sem="G#protein_molecule">CD28</cons> surface receptor</cons> provides a major costimulatory signal for <cons lex="T_cell_activation" sem="G#other_name">T cell activation</cons> resulting in enhanced production of <cons lex="interleukin-2" sem="G#protein_molecule">interleukin-2</cons> (<cons lex="IL-2" sem="G#protein_molecule">IL-2</cons>) and <cons lex="cell_proliferation" sem="G#other_name">cell proliferation</cons>.</sentence>
<sentence>In <cons lex="primary_T_lymphocyte" sem="G#cell_type">primary T lymphocytes</cons> we show that <cons lex="CD28" sem="G#protein_molecule">CD28</cons> ligation leads to the rapid intracellular formation of <cons lex="reactive_oxygen_intermediate" sem="G#inorganic">reactive oxygen intermediates</cons> (<cons lex="ROI" sem="G#inorganic">ROIs</cons>) which are required for <cons lex="CD28-mediated_activation" sem="G#other_name"><cons lex="CD28" sem="G#protein_molecule">CD28</cons>-mediated activation</cons> of the <cons lex="NF-kappa_B" sem="G#protein_molecule">NF-kappa B</cons>/<cons lex="CD28-responsive_complex" sem="G#protein_complex"><cons lex="CD28" sem="G#protein_molecule">CD28</cons>-responsive complex</cons> and <cons lex="IL-2_expression" sem="G#other_name"><cons lex="IL-2" sem="G#protein_molecule">IL-2</cons> expression</cons>.</sentence>
<sentence>Delineation of the <cons lex="CD28_signaling_cascade" sem="G#other_name"><cons lex="CD28" sem="G#protein_molecule">CD28</cons> signaling cascade</cons> was found to involve <cons lex="protein_tyrosine_kinase_activity" sem="G#other_name"><cons lex="protein_tyrosine_kinase" sem="G#protein_family_or_group">protein tyrosine kinase</cons> activity</cons>, followed by the activation of <cons lex="phospholipase_A2" sem="G#protein_molecule">phospholipase A2</cons> and <cons lex="5-lipoxygenase" sem="G#protein_molecule">5-lipoxygenase</cons>.</sentence>
<sentence>Our data suggest that <cons lex="lipoxygenase_metabolite" sem="G#protein_family_or_group"><cons lex="lipoxygenase" sem="G#protein_molecule">lipoxygenase</cons> metabolites</cons> activate <cons lex="ROI_formation" sem="G#other_name"><cons lex="ROI" sem="G#inorganic">ROI</cons> formation</cons> which then induce <cons lex="IL-2" sem="G#protein_molecule">IL-2</cons> expression via <cons lex="NF-kappa_B_activation" sem="G#other_name"><cons lex="NF-kappa_B" sem="G#protein_molecule">NF-kappa B</cons> activation</cons>.</sentence>
<sentence>These findings should be useful for <cons lex="therapeutic_strategies" sem="G#other_name">therapeutic strategies</cons> and the development of <cons lex="immunosuppressants" sem="G#other_name">immunosuppressants</cons> targeting the <cons lex="CD28_costimulatory_pathway" sem="G#other_name"><cons lex="CD28" sem="G#protein_molecule">CD28</cons> costimulatory pathway</cons>.</sentence>
</abstract>
</article>
</set>
I tried something like this:
import xml.etree.ElementTree as ET
root = ET.parse("test.xml").getroot()
sent= [elem.text for elem in root.iter('sentence')]
print(sent)
terms = [elem.text for elem in root.iter('cons')]
print(terms)
But this gives the following output:
[None, 'Activation of the ', 'In ', 'Delineation of the ', 'Our data suggest that ', 'These findings should be useful for ']
[None, 'IL-2 gene', None, 'NF-kappa B', 'CD28', '5-lipoxygenase', None, 'CD28', 'T cell activation', 'interleukin-2', 'IL-2', 'cell proliferation', 'primary T lymphocytes', 'CD28', 'reactive oxygen intermediates', 'ROIs', None, 'CD28', 'NF-kappa B', None, 'CD28', None, 'IL-2', None, 'CD28', None, 'protein tyrosine kinase', 'phospholipase A2', '5-lipoxygenase', None, 'lipoxygenase', None, 'ROI', 'IL-2', None, 'NF-kappa B', 'therapeutic strategies', 'immunosuppressants', None, 'CD28']
I would like output closer to the following:
['IL-2 gene expression and NF-kappa B activation through CD28 requires oxygen production by 5-lipoxygenase', ...]
['IL-2 gene','NF-kappa B', 'CD28', '5-lipoxygenase',...]
The terms list in my output looks fine, but how do I get complete sentences in my sent list instead of the broken sentences I currently get?
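Solution
The tricky part is that some of the text in your XML is not in .text; it is in .tail. An element's .text holds only the text before its first child, and the text that follows a child's closing tag is stored in that child's .tail.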
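To make the .text/.tail split concrete, here is a small sketch on a shortened sentence. The snippet string is a trimmed, illustrative fragment in the same shape as the question's markup (it is parsed with ET.fromstring rather than from test.xml), so treat it as an assumption rather than real corpus data.

```python
import xml.etree.ElementTree as ET

# Shortened, illustrative fragment shaped like the GENIA markup in the question.
snippet = (
    '<sentence>Activation of the '
    '<cons lex="CD28_surface_receptor"><cons lex="CD28">CD28</cons>'
    ' surface receptor</cons> provides a signal.</sentence>'
)

sentence = ET.fromstring(snippet)
outer = sentence[0]  # the outer <cons>
inner = outer[0]     # the nested <cons>

print(repr(sentence.text))  # text before the first child of <sentence>
print(repr(outer.text))     # None: the outer <cons> starts with a child element
print(repr(inner.text))     # text inside the nested <cons>
print(repr(inner.tail))     # text after the nested </cons>, inside the outer <cons>
print(repr(outer.tail))     # text after the outer </cons>

# itertext() walks .text and .tail in document order, so joining it
# rebuilds the full sentence.
full_sentence = ''.join(sentence.itertext())
print(full_sentence)
```

This is why reading only elem.text on a sentence returns just the fragment before its first cons child.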
For the sentences, it is easy to do something like:
sent = [''.join(elem.itertext()) for elem in root.iter('sentence')]
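For the terms (cons), it is a little different, because it looks like you want to ignore the text of cons elements that have a child cons. (In this sample, such an outer cons starts directly with the nested cons, so its own .text is None, and the rest of its text sits in the child's .tail, which you do not want.)
In that case, just grab .text when it is not None: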
terms = [elem.text for elem in root.iter('cons') if elem.text]
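As a quick check of what the filter removes, the sketch below compares the unfiltered and filtered lists on a fragment shortened from the first sentence in the question. The snippet string is illustrative and is parsed with ET.fromstring instead of reading test.xml.

```python
import xml.etree.ElementTree as ET

# Fragment shortened from the title sentence in the question.
snippet = (
    '<sentence><cons lex="IL-2_gene_expression">'
    '<cons lex="IL-2_gene">IL-2 gene</cons> expression</cons>'
    ' through <cons lex="CD28">CD28</cons>.</sentence>'
)
root = ET.fromstring(snippet)

# Without the filter, the outer <cons> contributes None.
unfiltered = [elem.text for elem in root.iter('cons')]
# With the filter, only <cons> elements that have their own leading text remain.
filtered = [elem.text for elem in root.iter('cons') if elem.text]

print(unfiltered)  # [None, 'IL-2 gene', 'CD28']
print(filtered)    # ['IL-2 gene', 'CD28']
```

Note that `if elem.text` also skips empty strings, not only None.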
Full example:
import xml.etree.ElementTree as ET
tree = ET.parse('test.xml')
sent = [''.join(elem.itertext()) for elem in tree.iter('sentence')]
print(sent)
terms = [elem.text for elem in tree.iter('cons') if elem.text]
print(terms)
This prints:
['IL-2 gene expression and NF-kappa B activation through CD28 requires reactive oxygen production by 5-lipoxygenase.', 'Activation of the CD28 surface receptor provides a major costimulatory signal for T cell activation resulting in enhanced production of interleukin-2 (IL-2) and cell proliferation.', 'In primary T lymphocytes we show that CD28 ligation leads to the rapid intracellular formation of reactive oxygen intermediates (ROIs) which are required for CD28-mediated activation of the NF-kappa B/CD28-responsive complex and IL-2 expression.', 'Delineation of the CD28 signaling cascade was found to involve protein tyrosine kinase activity, followed by the activation of phospholipase A2 and 5-lipoxygenase.', 'Our data suggest that lipoxygenase metabolites activate ROI formation which then induce IL-2 expression via NF-kappa B activation.', 'These findings should be useful for therapeutic strategies and the development of immunosuppressants targeting the CD28 costimulatory pathway.']
['IL-2 gene', 'NF-kappa B', 'CD28', '5-lipoxygenase', 'CD28', 'T cell activation', 'interleukin-2', 'IL-2', 'cell proliferation', 'primary T lymphocytes', 'CD28', 'reactive oxygen intermediates', 'ROIs', 'CD28', 'NF-kappa B', 'CD28', 'IL-2', 'CD28', 'protein tyrosine kinase', 'phospholipase A2', '5-lipoxygenase', 'lipoxygenase', 'ROI', 'IL-2', 'NF-kappa B', 'therapeutic strategies', 'immunosuppressants', 'CD28']
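Note: terms will contain duplicates. If you need to remove them, there are a few different ways to do that. For example, using set() (keep in mind that a set does not preserve the original order):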
terms = list(set(elem.text for elem in tree.iter('cons') if elem.text))
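If the order of first appearance matters, one alternative is dict.fromkeys, which keeps insertion order on Python 3.7 and later. The sketch below uses a short hand-written list of terms taken from the output above instead of parsing the file; this is a swapped-in technique, not part of the original answer.

```python
# A few terms copied from the printed output above, including duplicates.
terms = ['IL-2 gene', 'NF-kappa B', 'CD28', '5-lipoxygenase', 'CD28', 'NF-kappa B']

# dict keys are unique and keep insertion order (Python 3.7+),
# so this removes duplicates while preserving first-seen order.
unique_terms = list(dict.fromkeys(terms))

print(unique_terms)  # ['IL-2 gene', 'NF-kappa B', 'CD28', '5-lipoxygenase']
```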