Parsing an HPO obo file to extract xrefs

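Problem description

I need to extract information from an OBO file.

What I need is, for each term, the information in its id and xref lines. For 13,000 terms, the information in the file looks like this: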

[Term]
id: HP:0011540
name: Congenitally corrected transposition of the great arteries
def: "The essence of the lesion is the combination of discordant atrioventricular and ventriculo-arterial connections. Thus, the morphologically right atrium is connected to a morphologically left ventricle across the mitral valve, with the left ventricle then connected to the pulmonary trunk. The morphologically left atrium is connected to the morphologically right ventricle across the tricuspid valve, with the morphologically right ventricle connected to the aorta." [DDD:dbrown, pmid:21569592]
synonym: "L-transposition" RELATED []
synonym: "Ventricular inversion" RELATED []
xref: EPCC:01.01.03
xref: ICD-10:Q20.5
xref: MSH:C535426
xref: SNOMEDCT_US:56743000
xref: SNOMEDCT_US:83799000
xref: UMLS:C0232301
xref: UMLS:C0344616
is_a: HP:0011534 ! Abnormal spatial orientation of the cardiac segments
is_a: HP:0011603 ! Congenital malformation of the great arteries
created_by: peter
creation_date: 2012-04-07T10:48:56Z

[Term]
id: HP:0011555
name: Double inlet left ventricle
def: "The condition in which both atria are joined to the left ventricle each by its own atrioventricular valve. Usually there is a hypoplastic right ventricle, which may be on the opposite side of the heart as usual." [DDD:dbrown, HPO:probinson]
xref: EPCC:01.04.04
xref: ICD-10:Q20.4
xref: SNOMEDCT_US:253283000
xref: UMLS:C0344622
is_a: HP:0001750 ! Single ventricle
is_a: HP:0011554 ! Double inlet atrioventricular connection
created_by: peter
creation_date: 2012-04-07T11:53:33Z

[Term]
id: HP:0011589
name: Common origin of the right brachiocephalic artery and left common carotid artery
def: "The left common carotid artery has a common origin with the innominate artery." [DDD:dbrown, HPO:probinson, pmid:17138027]
comment: Commonly the three great vessels (innominate artery, left common carotid artery, and the left subclavian artery) originate from the arch of the aorta. The second most common variant of aortic arch branching occurs when the left common carotid artery has a common origin with the innominate artery.
synonym: "Bovine arch" RELATED []
synonym: "Common brachiocephalic trunk" EXACT []
synonym: "Ovine arch" RELATED []
xref: SNOMEDCT_US:460890003
xref: UMLS:C3532020
xref: UMLS:C4020746
xref: UMLS:C4021141
is_a: HP:0011587 ! Abnormal branching pattern of the aortic arch
created_by: peter
creation_date: 2012-04-08T01:38:36Z

The result, in txt or xlsx format, should look like this:

id          UMLS                        SNOMEDCT_US        MSH      EPCC      ICD-10  ICD-9  ICD-O  Fyler  MEDDRA
HP:0011540  C0232301;C0344616           56743000;83799000  C535426  01.01.03  Q20.5
HP:0011555  C0344622                    253283000                   01.04.04  Q20.4
HP:0011589  C3532020;C4020746;C4021141  460890003

The headers (UMLS, SNOMEDCT_US, MSH, MEDDRA, ...) are all the possible xrefs.

Tags: r, parsing, xlsx

Solution


Here is one approach using ontologyIndex and tidyverse:

library(tidyverse)
library(ontologyIndex)
hpo <- get_ontology("https://raw.githubusercontent.com/obophenotype/human-phenotype-ontology/master/hp.obo",
                    extract_tags = "everything") # download the HPO file from GitHub and import it
simplify2array(hpo) %>% # convert to array
  as_tibble() %>% # convert to tibble
  select(id, xref) %>% # select HPO ID and xref
  unnest(c(id, xref)) %>% # unnest list columns: one row per (id, xref) pair
  separate(xref, into = c("Ontology", "Term"), sep = ":",
           extra = "merge") %>% # split at the first ":" only, so codes that contain ":" stay whole
  pivot_wider(id_cols = id, names_from = "Ontology",
              values_from = Term,
              values_fn = \(x) paste(x, collapse = ";")) # pivot wider and combine terms with paste (\(x) needs R >= 4.1)
## A tibble: 11,652 x 22
#   id         UMLS              MSH     SNOMEDCT_US         MEDDRA Fyler NCIT  COHD  EFO   ICD10 ICD9  `ICD-10` EPCC  DOID  MONDO `ICD-O` MP    MPATH PMID  ORPHA SNOMED_CT `ICD-9`
#   <chr>      <chr>             <chr>   <chr>               <chr>  <chr> <chr> <chr> <chr> <chr> <chr> <chr>    <chr> <chr> <chr> <chr>   <chr> <chr> <chr> <chr> <chr>     <chr>  
# 1 HP:0000001 C0444868          NA      NA                  NA     NA    NA    NA    NA    NA    NA    NA       NA    NA    NA    NA      NA    NA    NA    NA    NA        NA     
# 2 HP:0000002 C4025901          NA      NA                  NA     NA    NA    NA    NA    NA    NA    NA       NA    NA    NA    NA      NA    NA    NA    NA    NA        NA     
# 3 HP:0000003 C3714581          D021782 204962002;82525005  NA     NA    NA    NA    NA    NA    NA    NA       NA    NA    NA    NA      NA    NA    NA    NA    NA        NA     
# 4 HP:0000005 C1708511          NA      NA                  NA     NA    NA    NA    NA    NA    NA    NA       NA    NA    NA    NA      NA    NA    NA    NA    NA        NA     
# 5 HP:0000006 C0443147          NA      263681008           NA     NA    NA    NA    NA    NA    NA    NA       NA    NA    NA    NA      NA    NA    NA    NA    NA        NA     
# 6 HP:0000007 C0441748;C4020899 NA      258211005           NA     NA    NA    NA    NA    NA    NA    NA       NA    NA    NA    NA      NA    NA    NA    NA    NA        NA     
# 7 HP:0000008 C4025900          NA      NA                  NA     NA    NA    NA    NA    NA    NA    NA       NA    NA    NA    NA      NA    NA    NA    NA    NA        NA     
# 8 HP:0000009 C3806583          NA      NA                  NA     NA    NA    NA    NA    NA    NA    NA       NA    NA    NA    NA      NA    NA    NA    NA    NA        NA     
# 9 HP:0000010 C0262655          NA      197927001           NA     NA    NA    NA    NA    NA    NA    NA       NA    NA    NA    NA      NA    NA    NA    NA    NA        NA     
#10 HP:0000011 C0005697          D001750 397732007;398064005 NA     NA    NA    NA    NA    NA    NA    NA       NA    NA    NA    NA      NA    NA    NA    NA    NA        NA     
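
To make the separate() and pivot_wider() steps easier to follow, here is a minimal sketch that runs them on a hand-built long table instead of the full ontology. The rows are copied from two of the terms in the question; the object names xrefs and wide are only illustrative, and extra = "merge" is an assumption added so that a code containing ":" would not be truncated.

```r
library(tidyr)
library(tibble)

# Toy long-format input: one row per (term id, xref), copied from the question
xrefs <- tribble(
  ~id,          ~xref,
  "HP:0011555", "EPCC:01.04.04",
  "HP:0011555", "ICD-10:Q20.4",
  "HP:0011555", "SNOMEDCT_US:253283000",
  "HP:0011555", "UMLS:C0344622",
  "HP:0011589", "SNOMEDCT_US:460890003",
  "HP:0011589", "UMLS:C3532020",
  "HP:0011589", "UMLS:C4020746",
  "HP:0011589", "UMLS:C4021141"
)

wide <- xrefs %>%
  # Split "UMLS:C0344622" into Ontology = "UMLS" and Term = "C0344622"
  separate(xref, into = c("Ontology", "Term"), sep = ":", extra = "merge") %>%
  # One column per ontology; several codes for the same id are joined with ";"
  pivot_wider(id_cols = id, names_from = Ontology, values_from = Term,
              values_fn = \(x) paste(x, collapse = ";"))

print(wide)
```

Terms that have no xref for a given ontology get NA in that column, which is why most cells in the full output above are NA.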

From here you can write out the result with write.table() or write_delim().
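
As a rough sketch of that last step, the snippet below writes a small stand-in table to a tab-separated text file with base R. The data frame xref_wide and the file name hpo_xrefs.txt are hypothetical placeholders for the tibble produced by the pipeline and for whatever output path you prefer.

```r
# Hypothetical stand-in for the wide table produced by the pipeline above
xref_wide <- data.frame(
  id   = c("HP:0011555", "HP:0011589"),
  UMLS = c("C0344622", "C3532020;C4020746;C4021141"),
  EPCC = c("01.04.04", NA),
  stringsAsFactors = FALSE
)

# Hypothetical output path; a temporary directory keeps the example self-contained
out_file <- file.path(tempdir(), "hpo_xrefs.txt")

# Tab-separated, no row names, no quoting, and empty cells instead of "NA"
write.table(xref_wide, out_file, sep = "\t", quote = FALSE,
            row.names = FALSE, na = "")

# The readr equivalent would be:
# readr::write_delim(xref_wide, out_file, delim = "\t", na = "")

# Show what was written
cat(readLines(out_file), sep = "\n")
```

For the xlsx format mentioned in the question, one option, assuming the writexl package is installed, is writexl::write_xlsx(xref_wide, "hpo_xrefs.xlsx").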

