Reading a complex HTML file into R with rvest

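Question

I am new to R and Stack Overflow, so please be gentle; I will do my best to keep this post in proper form. I am working on a project that compares whole exome sequencing (WES) results with proteomic data. Our WES tool only provides its results as HTML files, so I need to read them into R to continue my work.

I tried to follow the DataCamp tutorial for rvest, but I think the HTML file may be too complex, because all I get is a jumble of \t\t\t\n\n\t with some text in between. I suspect the problem is an incorrect selector in html_nodes?

Here is my R code, followed by a shortened HTML file in which the variants have been altered.

What I would like to get is a data frame with the same columns as in the HTML. As the example shows, some variants affect multiple transcripts; in those cases one row per transcript would be perfect, but it is by no means a must.

Thank you very much for your help!

Sebastian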

library(tidyverse)
library(rvest)

htmlALL <- read_html("Example_html")

getDATA <- function(html) {
  html %>%
    html_nodes(".table") %>%
    html_text() %>%
    str_trim() %>%
    unlist()
}

df_html <- getDATA(htmlALL)

<!DOCTYPE html
	PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
	 "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-US" xml:lang="en-US">
<head>
  <!-- add title in the browser tab bar -->
  <title>Homozygous variants of sample XXX </title>
  <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
</head>


<!-- change style to look nice -->
<style type="text/css">


html { 
  text-align: center;
  vertical-align: middle;
  height: 100%;
  width: 100%;
}
body { 
  background: #eee url('http://i.imgur.com/eeQeRmk.png'); /* http://subtlepatterns.com/weave/ */
  font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif;
  font-size: 62.5%;
  line-height: 1;
  color: #585858;
  padding: 22px 10px;
  padding-bottom: 55px;

}

::selection { background: #5f74a0; color: #fff; }
::-moz-selection { background: #5f74a0; color: #fff; }
::-webkit-selection { background: #5f74a0; color: #fff; }

br { display: block; line-height: 1.6em; }

input, textarea { 
  -webkit-font-smoothing: antialiased;
  -webkit-text-size-adjust: 100%;
  -ms-text-size-adjust: 100%;
  -webkit-box-sizing: border-box;
  -moz-box-sizing: border-box;
  box-sizing: border-box;
  outline: none;
}

blockquote, q { quotes: none; }
blockquote:before, blockquote:after, q:before, q:after { content: ''; content: none; }
strong, b { font-weight: bold; } 


h1 {
  font-weight: bold;
  font-size: 3.6em;
  line-height: 1.7em;
  margin-bottom: 10px;
  text-align: center;
}

h2 {
  font-weight: bold;
  font-size: 2.6em;
  line-height: 1.7em;
  margin-bottom: 10px;
  text-align: center;
}

/** big white sheet everything is on **/
.wrapper {
  display: block;
  width: 95%;
  background: #fff;
  margin: 0 auto;
  padding: 10px 17px 100px;
  box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
  -webkit-box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
  -moz-box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
  -ms-box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
  -o-box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
  overflow-x: auto;
  overflow-y: visible;
}

/* smaller box the family information is on */
.info{
  display: block;
  width: 800px;
  background: #f2f2f2;
  margin: 0 auto;
  padding: 10px 17px 10px 10px;
  box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
  -webkit-box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
  -moz-box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
  -ms-box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
  -o-box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
  font-size: 1.8em;
  margin-bottom: 10px;
}


/* this is what actually contains the info */
.table {
  display: table;
  margin: 0 auto;
  width: 99%;
  font-size: 1.2em;
  margin-bottom: 15px;
  border-collapse: collapse;
  overflow: visible;
}

/* one row of the variants */
.tablerow {
  display: table-row;
  overflow: visible;
  border: 1px solid gray;
  width: 100%;
}

/* headers are bigger and may in the future be clickable to sort accordingly */
.tableheader {
  display: table-cell;
  background: #f2f2f2;
  padding: 3px 10px;
  margin-bottom: 25px;
  font-size: 1.8em;
  box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
  -webkit-box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
  -moz-box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
  -ms-box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
  -o-box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
}

/* in the following, each column gets specified to increase readability */

.position {
  display: table-cell;
  padding: 3px 10px;
  font-size: 1.4em;
  height: 100%;
  text-align: center;
  vertical-align: middle;
}

.variants {
  display: table-cell;
  height: 100%;
  vertical-align: middle;
  overflow: visible;
  white-space: nowrap;
  
}

.stacked {
  display: table;
  height: 50%;
  width: 100%;

}

.center {
  display: table-cell;
  vertical-align: middle;
  width: 100%;
  padding: 0px 5px;
}


.consequences {
  display: table-cell;
  height: 100%;
  vertical-align: middle;
  padding: 3px 10px;
}

.gene {
  display: table-cell;
  padding: 3px 15px;
  height: 100%;
  vertical-align: middle;
  font-size: 1.4em;
  font-weight: bold;
}

.transcripts {
  display: table-cell;
  vertical-align: middle;
  height: 100%;
}

.list {
  height: 100%;
  width: 100%;
  display: table;
  table-layout: fixed;
}
.row {
  display: table-row;
  overflow: visible;
  vertical-align: middle;
}
.entry {
  display: table-cell;
  vertical-align:middle;
  padding: 0% 1% 0% 1%;
  white-space: nowrap;
  text-overflow: ellipsis;
  overflow: hidden;
}

.cdspos {
  display: table-cell;
  vertical-align: middle;
  height: 100%;
}

.exon {
  display: table-cell;
  vertical-align: middle;
  height: 100%;
}



.hgvs {
  display: table-cell;
  height: 100%;
  vertical-align: middle;
}

.hgvs .list .row{
  display: table-row;
  vertical-align: middle;
}

.polyphen {
  display: table-cell;
  height: 100%;
  vertical-align: middle;
}
.polyphen .list .row{
  display: table-row;
  vertical-align: middle;
}

.sift {
  display: table-cell;
  height: 100%;
  vertical-align: middle;
}
.sift .list .row{
  display: table-row;
  vertical-align: middle;
}

.allelefreq {
  display: table-cell;
  height: 100%;
  vertical-align: middle;
}



/* Tooltip container */
.tooltip_gene, .tooltip_allelefrq ,.tooltip_qual{
    position: relative;
    display: inline-block;
    border-bottom: 1px dotted black; /* If you want dots under the hoverable text */
    
}



.tooltiptext{
    visibility: hidden;
    overflow: auto;
    min-width: 400px;
    background-color: #ffb380;
    color: black;
    text-align: left;
    padding: 5px 10px;
    border-radius: 6px;
    font-size: 12pt;
    font-weight: normal;
    
    /* Position the tooltip text - see examples below! */
    position: absolute;
    z-index:1;
    
    /* shadow */
    box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
    -webkit-box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
    -moz-box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
    -ms-box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
    -o-box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
    
    opacity: 0.95;
    filter: alpha(opacity=95);

}

/* Tooltip text */
.tooltip_gene .tooltiptext {
    top: -5px;
    left: 105%;
 
}


/* Tooltip text */
.tooltip_allelefrq .tooltiptext {
    top: -5px;
    right: 105%;
    min-width: 120px;
    
 
}

/* Show the tooltip text when you mouse over the tooltip container */
.tooltip_allelefrq:hover .tooltiptext, .tooltip_gene:hover .tooltiptext {
    visibility: visible;
}


.clin {
  display: table-cell;
  height: 100%;
  vertical-align: middle;
  padding: 0% 1% 0% 1%;
  white-space: nowrap;
  text-overflow: ellipsis;
  overflow: hidden;
}

</style>


<body>
  <div class="wrapper">
      <!-- add info about patients -->
      <h1>Homozygous variants of sample XXX</h1>
      <h2>Tue Jan 23 09:01:56 2018</h2>
      <div class="info">
	
	  Patient only<br>
	
      </div>
      <!-- variants table start -->
      <div class="table">
	<!-- table header start -->
	<div class="tablerow">
	  <div class="tableheader">
	    Position
	  </div>
	  <div class="tableheader">
	    Variant
	  </div>
	  <div class="tableheader">
	    Cons
	  </div>
	  <div class="tableheader">
	    Gene
	  </div>
	  <div class="tableheader">
	    Transcript
	  </div>
	  <div class="tableheader">
	    HGVSC
	  </div>
	  <div class="tableheader">
	    HGVSP
	  </div>
	  <div class="tableheader">
	    PolyPhen
	  </div>
	  <div class="tableheader">
	    SIFT
	  </div>
	  <div class="tableheader">
	    AF
	  </div>
	  <div class="tableheader">
	    Clin
	  </div>
	</div>
	<!-- table header stop -->
	<!-- var loop start -->
	
	  <div class="tablerow" >
	    <!-- position start -->
	    <div class="position">
	      <a href="http://gnomad.broadinstitute.org/region/1-117635467-117635507">1:117635487</a>
	    </div>
	    <!-- position stop -->
	    <!-- variants start -->
	    <div class="variants">
	      
		
		  G->T
		
	      
	    </div>
	    <!-- variants stop -->
	    <!-- consequences start -->
	    <div class="consequences" style="background: rgb(196, 197, 198);">
	      
		synonymous
	      
	    </div>
	    <!-- consequences stop -->
	    <!-- gene start -->
	    <div class="gene" >
	      
	      
	      
		
		  <div class="tooltip_gene">
		    <a href="http://www.genecards.org/cgi-bin/carddisp.pl?gene=TTF2" >
		      TTF2
		    </a>
		    <span class="tooltiptext">GeneCards Summary<hr>
TTF2 (Transcription Termination Factor 2) is a Protein Coding gene.
Diseases associated with TTF2 include Sexual Sadism and Narcissistic Personality Disorder.
Among its related pathways are Human Thyroid Stimulating Hormone (TSH) signaling pathway and Insulin secretion.
GO annotations related to this gene include hydrolase activity and DNA-dependent ATPase activity.
An important paralog of this gene is HLTF.</span>
		  </div>
		
	    </div>
	    <!-- gene stop -->
	    <!-- transcripts start -->
	    <div class="transcripts">
	      <div class="list">
		
		  <div class="row">
		    <div class="entry">
		      <a href="http://grch37.ensembl.org/Homo_sapiens/Transcript/Summary?db=core;t=ENST00000369466">ENST00000369466
		      </a>
		    </div>
		  </div>
		
	      </div>
	    </div>
	    <!-- transcripts stop -->
	    <!-- exon start -->
	<!--    <div class="exon">
	      <div class="list">
		
	      </div>
	    </div>-->
	    <!-- exon stop -->
	    <!-- hgvsc start -->
	    <div class="hgvs">
	      <div class="list">
		
		  <div class="row">
		    <div class="entry">
		      
			c.2940G>T
		      
		    </div>
		  </div>
		
	      </div>
	    </div>
	    <!-- hgvsc stop -->
	    <!-- hgvsp start -->
	    <div class="hgvs">
	      <div class="list">
		
		  <div class="row">
		    <div class="entry">
		      
			c.2940G>T(p.%3D)
		      
		    </div>
		  </div>
		
	      </div>
	    </div>
	    <!-- hgvsp stop -->
	    <!-- polyphen start -->
	    <div class="polyphen">
	      <div class="list">
		
		  <div class="row">
		    <div class="entry">
		      
			
		      
		    </div>
		  </div>
		
	      </div>
	    </div>
	    <!-- polyphen stop -->
	    <!-- sift start -->
	    <div class="sift">
	      <div class="list">
		
		  <div class="row">
		    <div class="entry">
		      
			
		      
		    </div>
		  </div>
		
	      </div>
	    </div>
	    <!-- sift stop -->
	    <!--.allelefreq start -->
	    <div class="allelefreq">
	      
		
		  <div class="tooltip_allelefrq">
		    0.00000
		    <span class="tooltiptext">allele counts<hr>ht: <span style='float:right;'>0</span><br>hm: <span style='float:right;'>0</span><br>wt: <span style='float:right;'>0</span><hr>inhouse:<span style='float:right;'>0.00118</span></span>
		  </div>
		
	      
	    </div>
	    <!--.allelefreq stop -->
	    <!--.allelefreq start -->
	    <div class="clin">
	      
		
	      
	    </div>
	    <!--.allelefreq stop -->
	  </div>
	  <!-- table row stop-->
	
	 	
	  <div class="tablerow" >
	    <!-- position start -->
	    <div class="position">
	      <a href="http://gnomad.broadinstitute.org/region/1-149898435-149898475">1:149898455</a>
	    </div>
	    <!-- position stop -->
	    <!-- variants start -->
	    <div class="variants">
	      
		
		  
		      <a href="https://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=rs143105666">G->A</a>
		  
		
	      
	    </div>
	    <!-- variants stop -->
	    <!-- consequences start -->
	    <div class="consequences" style="background: rgb(196, 197, 198);">
	      
		synonymous
	      
	    </div>
	    <!-- consequences stop -->
	    <!-- gene start -->
	    <div class="gene" >
	      
	      
	      
		
		  <div class="tooltip_gene">
		    <a href="http://www.genecards.org/cgi-bin/carddisp.pl?gene=SF3B4" >
		      SF3B4
		    </a>
		    <span class="tooltiptext">GeneCards Summary<hr>
SF3B4 (Splicing Factor 3b Subunit 4) is a Protein Coding gene.
Diseases associated with SF3B4 include Acrofacial Dysostosis 1, Nager Type and Acrofacial Dysostosis Syndrome Of Rodriguez.
Among its related pathways are mRNA Splicing - Major Pathway and Gene Expression.
GO annotations related to this gene include nucleic acid binding and nucleotide binding.
</span>
		  </div>
		
	    </div>
	    <!-- gene stop -->
	    <!-- transcripts start -->
	    <div class="transcripts">
	      <div class="list">
		
		  <div class="row">
		    <div class="entry">
		      <a href="http://grch37.ensembl.org/Homo_sapiens/Transcript/Summary?db=core;t=ENST00000457312">ENST00000457312
		      </a>
		    </div>
		  </div>
		
		  <div class="row">
		    <div class="entry">
		      <a href="http://grch37.ensembl.org/Homo_sapiens/Transcript/Summary?db=core;t=ENST00000271628">ENST00000271628
		      </a>
		    </div>
		  </div>
		
	      </div>
	    </div>
	    <!-- transcripts stop -->
	    <!-- exon start -->
	<!--    <div class="exon">
	      <div class="list">
		
	      </div>
	    </div>-->
	    <!-- exon stop -->
	    <!-- hgvsc start -->
	    <div class="hgvs">
	      <div class="list">
		
		  <div class="row">
		    <div class="entry">
		      
			c.390C>A
		      
		    </div>
		  </div>
		
		  <div class="row">
		    <div class="entry">
		      
			c.519C>A
		      
		    </div>
		  </div>
		
	      </div>
	    </div>
	    <!-- hgvsc stop -->
	    <!-- hgvsp start -->
	    <div class="hgvs">
	      <div class="list">
		
		  <div class="row">
		    <div class="entry">
		      
			c.390C>A(p.%3D)
		      
		    </div>
		  </div>
		
		  <div class="row">
		    <div class="entry">
		      
			c.519C>A(p.%3D)
		      
		    </div>
		  </div>
		
	      </div>
	    </div>
	    <!-- hgvsp stop -->
	    <!-- polyphen start -->
	    <div class="polyphen">
	      <div class="list">
		
		  <div class="row">
		    <div class="entry">
		      
			
		      
		    </div>
		  </div>
		
		  <div class="row">
		    <div class="entry">
		      
			
		      
		    </div>
		  </div>
		
	      </div>
	    </div>
	    <!-- polyphen stop -->
	    <!-- sift start -->
	    <div class="sift">
	      <div class="list">
		
		  <div class="row">
		    <div class="entry">
		      
			
		      
		    </div>
		  </div>
		
		  <div class="row">
		    <div class="entry">
		      
			
		      
		    </div>
		  </div>
		
	      </div>
	    </div>
	    <!-- sift stop -->
	    <!--.allelefreq start -->
	    <div class="allelefreq">
	      
		
		  <div class="tooltip_allelefrq">
		    0.00021
		    <span class="tooltiptext">allele counts<hr>ht: <span style='float:right;'>57</span><br>hm: <span style='float:right;'>0</span><br>wt: <span style='float:right;'>277082</span><hr>inhouse:<span style='float:right;'>0.00236</span></span>
		  </div>
		
	      
	    </div>
	    <!--.allelefreq stop -->
	    <!--.allelefreq start -->
	    <div class="clin">
	      
		
	      
	    </div>
	    <!--.allelefreq stop -->
	  </div>
	  <!-- table row stop-->
	 	
	<!-- var loop stop -->
      </div>
      <!-- variant table stop -->
    </div>
</body>
</html>

Tags: html, r, rvest

Answer


This is the best I can offer. Note that the output includes the "tooltip text" that pops up when you hover over the data in the Gene column.

library(rvest)

# I saved your sample to my Desktop as test.html
pg = read_html('~/Desktop/test.html')

# count rows (including header):
n_rows = pg %>% html_nodes('div.tablerow') %>% length

# sprintf-friendly format to get the %d-th node matching
#   //div[@class="tablerow"] (same as div.tablerow in CSS)
#   All of the /div after this are columns
xp_fmt = '//div[@class="tablerow"][%d]/div'

# div.tableheader nodes contain column names
col_names = pg %>% html_nodes(xpath = sprintf(xp_fmt, 1L)) %>% 
  html_text %>% trimws

# rows 2:n contain the actual data; gsub is
#   stripping leading/trailing whitespace and 
#   any duplicate internal whitespace
rows = lapply(2:n_rows, function(ii) {
  pg %>% html_nodes(xpath = sprintf(xp_fmt, ii)) %>% 
    html_text %>% gsub('^\\s+|\\s{2,}|\\s+$', '', .)
})

# can't forget those pesky factors
DF = as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
names(DF) = col_names
DF
#      Position Variant       Cons
# 1 1:117635487    G->T synonymous
# 2 1:149898455    G->A synonymous
#                                                                                                                                                                                                                                                                                                                                                                                                                                                     Gene
# 1 TTF2GeneCards Summary\nTTF2 (Transcription Termination Factor 2) is a Protein Coding gene.\nDiseases associated with TTF2 include Sexual Sadism and Narcissistic Personality Disorder.\nAmong its related pathways are Human Thyroid Stimulating Hormone (TSH) signaling pathway and Insulin secretion.\nGO annotations related to this gene include hydrolase activity and DNA-dependent ATPase activity.\nAn important paralog of this gene is HLTF.
# 2                                                       SF3B4GeneCards Summary\nSF3B4 (Splicing Factor 3b Subunit 4) is a Protein Coding gene.\nDiseases associated with SF3B4 include Acrofacial Dysostosis 1, Nager Type and Acrofacial Dysostosis Syndrome Of Rodriguez.\nAmong its related pathways are mRNA Splicing - Major Pathway and Gene Expression.\nGO annotations related to this gene include nucleic acid binding and nucleotide binding.
#                       Transcript            HGVSC
# 1                ENST00000369466        c.2940G>T
# 2 ENST00000457312ENST00000271628 c.390C>Ac.519C>A
#                            HGVSP PolyPhen SIFT
# 1               c.2940G>T(p.%3D)              
# 2 c.390C>A(p.%3D)c.519C>A(p.%3D)              
#                                                         AF
# 1       0.00000allele countsht: 0hm: 0wt: 0inhouse:0.00118
# 2 0.00021allele countsht: 57hm: 0wt: 277082inhouse:0.00236
#   Clin
# 1     
# 2     
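
The question asks for one row per transcript, and the Gene and AF columns in the output above still carry the tooltip text. The following is a minimal sketch of one way to handle both, not the only one. It assumes the class names from the sample HTML (`tooltiptext`, `transcripts`, `hgvs`, `entry`) and that every per-transcript column lists exactly one entry per transcript. The inline `html` string is an abbreviated stand-in for the report, and `cell_text()` and `one_row()` are hypothetical helper names, not rvest functions.

```r
library(rvest)
library(xml2)

# Abbreviated stand-in for the report: a header row plus one variant
# that affects two transcripts (structure copied from the sample).
html <- '
<div class="table">
  <div class="tablerow"><div class="tableheader">Position</div></div>
  <div class="tablerow">
    <div class="position"><a href="#">1:149898455</a></div>
    <div class="variants"> G->A </div>
    <div class="gene"><div class="tooltip_gene"><a href="#"> SF3B4 </a>
      <span class="tooltiptext">GeneCards Summary (abbreviated)</span></div></div>
    <div class="transcripts"><div class="list">
      <div class="row"><div class="entry"><a href="#">ENST00000457312</a></div></div>
      <div class="row"><div class="entry"><a href="#">ENST00000271628</a></div></div>
    </div></div>
    <div class="hgvs"><div class="list">
      <div class="row"><div class="entry"> c.390C>A </div></div>
      <div class="row"><div class="entry"> c.519C>A </div></div>
    </div></div>
    <div class="allelefreq"><div class="tooltip_allelefrq"> 0.00021
      <span class="tooltiptext">allele counts (abbreviated)</span></div></div>
  </div>
</div>'

pg <- read_html(html)

# Drop the hover-only tooltip spans; xml_remove() modifies `pg` in place.
xml_remove(html_nodes(pg, ".tooltiptext"))

# Hypothetical helper: trimmed text of every node matching `css` below `node`.
cell_text <- function(node, css) {
  trimws(html_text(html_nodes(node, css)))
}

# Hypothetical helper: one data frame per variant row. Single-valued cells
# are recycled by data.frame() to the number of transcripts.
one_row <- function(row) {
  hgvs <- html_nodes(row, ".hgvs")  # in the sample, the first .hgvs cell is HGVSC
  data.frame(
    Position   = cell_text(row, ".position"),
    Variant    = cell_text(row, ".variants"),
    Gene       = cell_text(row, ".gene"),
    Transcript = cell_text(row, ".transcripts .entry"),
    HGVSC      = cell_text(hgvs[[1]], ".entry"),
    AF         = cell_text(row, ".allelefreq"),
    stringsAsFactors = FALSE
  )
}

# The first .tablerow is the header, so skip it.
data_rows <- html_nodes(pg, "div.tablerow")[-1]
variants  <- do.call(rbind, lapply(data_rows, one_row))
print(variants)
```

The remaining columns (Cons, HGVSP, PolyPhen, SIFT, Clin) could be added to `one_row()` in the same way. A row whose per-transcript columns have differing numbers of entries would make `data.frame()` fail, which is probably the safer behaviour for this kind of data.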

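Note that it makes no difference here, since all of your columns appear to be of type character, but a more sophisticated approach would write the rows out to a regular file (e.g. CSV) and then read it back with read.table (or, better, fread), which reads the text and detects the column types automatically.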
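
The round trip described above can be sketched as follows. This is only an illustration under assumptions: the small `DF` is made up from values in the sample, with an AF column that is already free of tooltip text, and base R's `read.csv()` is used in place of `fread()` to avoid the data.table dependency. `type.convert()` is shown as a shortcut that skips the temporary file.

```r
# Small all-character data frame standing in for the scraped result
# (values taken from the sample; AF assumed to be free of tooltip text).
DF <- data.frame(
  Position = c("1:117635487", "1:149898455"),
  Cons     = c("synonymous", "synonymous"),
  AF       = c("0.00000", "0.00021"),
  stringsAsFactors = FALSE
)

# Write to a temporary CSV and read it back so that column types
# are detected while reading.
tmp <- tempfile(fileext = ".csv")
write.csv(DF, tmp, row.names = FALSE)
typed <- read.csv(tmp, stringsAsFactors = FALSE)
unlink(tmp)

# Shortcut without a file: convert the columns in memory.
typed2 <- type.convert(DF, as.is = TRUE)

str(typed)
```

In both results AF becomes numeric, while Position and Cons stay character because their values cannot be parsed as numbers.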

