How to apply rvest to a dataframe column of HTML to make a column of extracted emboldened words

Problem description

I have a dataframe, of which one column - raw - is HTML:

other column | raw
First row | <p id="PARA339" style="TEXT-ALIGN: left; MARGIN: 0pt; LINE-HEIGHT: 1.25"><font style="FONT-SIZE: 10pt; FONT-FAMILY: Times New Roman, Times, serif"><i>We had a net loss of $1.</i><i><b>55</b></i><i> million for the year ended December 31, 201</i><i>6</i><i> and have an accumulated deficit of $</i><i>61.5</i><i> million as of December 31, 201</i><i>6</i><i>. To achieve sustainable profitability, we must generate increased revenue.</i></font></p>
Second row | <div style="line-height:174%;text-align:left;font-size:9pt;"><font style="font-family:inherit;font-size:9pt;font-style:italic;font-weight:bold;">We have a history of losses, and we cannot assure you that we will achieve profitability.</font></div>

I would like to build some new columns off the raw column. I would like one column per common styling attribute (bold, italic, underlining etc.) - where each entry in the is_bold column, for example, is either "bold" or just blank. So my final desired output looks like this:

other column | raw | is_bold | is_italic
First row | <p id="PARA339" style="TEXT-ALIGN: left; MARGIN: 0pt; LINE-HEIGHT: 1.25"><font style="FONT-SIZE: 10pt; FONT-FAMILY: Times New Roman, Times, serif"><i>We had a net loss of $1.</i><i><b>55</b></i><i> million for the year ended December 31, 201</i><i>6</i><i> and have an accumulated deficit of $</i><i>61.5</i><i> million as of December 31, 201</i><i>6</i><i>. To achieve sustainable profitability, we must generate increased revenue.</i></font></p> |  | italic
Second row | <div style="line-height:174%;text-align:left;font-size:9pt;"><font style="font-family:inherit;font-size:9pt;font-style:italic;font-weight:bold;">We have a history of losses, and we cannot assure you that we will achieve profitability.</font></div> | bold | italic

As demonstrated in the above example, several of my HTML paragraphs have some text in some styles, and others not. E.g. my first row has two characters ("55") in bold, and the rest not, while the whole paragraph is italic - so if, say, at least 50% of the text of the HTML is in bold, I'd want to label the row as bold.

So, to achieve this desired output, I want to extract any text that is in bold, count its combined length (even if the bold parts are spread across different parts of the paragraph), divide by the total length of the paragraph, and if this number exceeds 0.5, flag that row as being in bold. So my questions are:

  1. How do I implement this in a dataframe setting? For a single string of html rather than a dataframe, the following code works:
html <- "some html here"
bold_parts <- html %>% html_nodes("b, strong") %>% html_text()

So, applying this to my dataframe column, can someone please help me figure out how to modify the code below to extract any emboldened words to a new column called bold_words? From there, I can count the length of these bold words and divide it by the length of the raw column.

dataframe <- dataframe %>% 
  rowwise() %>% 
  mutate(
    bold_words = read_html(raw) %>%
      html_nodes("b, strong") %>%
      html_text()
    ) 

  2. Once this is working, it should be fine for styles defined by <b>, <strong>, <i>, <em>, and <u>. However, I am not sure how to go about applying this to HTML like that in row 2 - where instead of <b> or <i> or <u>, the appearance is determined by "font-style:italic", "text-decoration:underline" and "font-weight:bold". I could split it at these parts using regex, but I would rather parse the HTML.

  3. If anyone spots a better way of doing any of this, it'd be appreciated, even if it means using an entirely different approach.

Thank you

Tags: html, r, dataframe, web-scraping, rvest

Solution


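You can use a CSS attribute selector with the *= (contains) operator to match elements whose style attribute contains font-weight:bold, e.g. [style*="font-weight:bold"]. The match is a plain substring test, so it is sensitive to case and whitespace: a style written as "FONT-WEIGHT: bold" would not match this pattern.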
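
As a minimal sketch of that selector on a single fragment (the html string below is a made-up example, not taken from the data in the question), the following collects the text of every node that is bold either by tag or by inline style, then compares its length with the total text length:

```r
library(rvest)

# Hypothetical fragment: one inline-styled bold span, one plain span
html <- '<div><span style="font-weight:bold;">Risk</span><span> factors</span></div>'
page <- read_html(html)

# Text of every node that is bold by tag or by inline style
bold_parts <- page %>%
  html_nodes('b, strong, [style*="font-weight:bold"]') %>%
  html_text()

# Share of all characters that sit inside a bold node
bold_share <- sum(nchar(bold_parts)) / nchar(html_text(page))
```

Here only the first span matches, so bold_share stays below the 0.5 threshold described in the question.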

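The code below defines a crude general-purpose function: you pass it a CSS selector and the label to write to the output column, and it returns that label when at least half of the row's text sits inside matching nodes (otherwise an empty string). The selectors for is_bold and is_italic are shown.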

TODO: You probably want to add some error handling e.g. in case of HTML parsing errors.

library(tidyverse)
library(rvest)

df <- data.frame(
  other= c("First Row", "Second Row"),
  raw =  c(
    '<p id="PARA339" style="TEXT-ALIGN: left; MARGIN: 0pt; LINE-HEIGHT: 1.25"><font style="FONT-SIZE: 10pt; FONT-FAMILY: Times New Roman, Times, serif"><i>We had a net loss of $1.</i><i><b>55</b></i><i> million for the year ended December 31, 201</i><i>6</i><i> and have an accumulated deficit of $</i><i>61.5</i><i> million as of December 31, 201</i><i>6</i><i>. To achieve sustainable profitability, we must generate increased revenue.</i></font></p>',
    '<div style="line-height:174%;text-align:left;font-size:9pt;"><font style="font-family:inherit;font-size:9pt;font-style:italic;font-weight:bold;">We have a history of losses, and we cannot assure you that we will achieve profitability.</font></div>'
  )
)

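# Return `return_text` when at least half of the text in the HTML string `i`
# sits inside nodes matching `css_selector`; otherwise return ''.
is_pattern <- function(i, css_selector, return_text) {
  page <- read_html(i)
  all_text <- nchar(html_text(page))
  # Characters inside matching nodes (a match nested in another match is counted twice)
  pattern_text <- sum(nchar(page %>% html_nodes(css_selector) %>% html_text()))
  # Guard against empty text so the ratio is never 0 / 0
  flag <- isTRUE(all_text > 0 && (pattern_text / all_text) >= .5)
  return(if (flag) return_text else '')
}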

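# vapply() returns a plain character vector; lapply() would create a list column
df$is_bold <- vapply(df$raw, is_pattern, character(1), 'b, strong, [style*="font-weight:bold"]', 'bold', USE.NAMES = FALSE)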
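
For the error handling mentioned in the TODO above, one option is a thin wrapper around tryCatch(). This is only a sketch: safe_share is a hypothetical helper, and returning NA for unparseable or empty input is an assumed policy rather than anything from the original answer.

```r
library(rvest)

# Hypothetical helper: share of text inside nodes matching `css_selector`,
# or NA when the input cannot be parsed or contains no text
safe_share <- function(html, css_selector) {
  tryCatch({
    page <- read_html(html)
    total <- nchar(html_text(page))
    if (total == 0) {
      NA_real_
    } else {
      sum(nchar(page %>% html_nodes(css_selector) %>% html_text())) / total
    }
  }, error = function(e) NA_real_)  # any parsing error becomes NA
}

ok  <- safe_share('<p><b>All bold</b></p>', 'b, strong')
bad <- safe_share('', 'b, strong')
```

Returning NA instead of stopping lets one bad row pass through a long column without aborting the whole run; the NA rows can then be inspected separately.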

mutate example:

is_pattern <- Vectorize(is_pattern)

df <- df %>%
  mutate(
    is_bold = is_pattern(raw, 'b, strong, [style*="font-weight:bold"]', 'bold'),
    is_italic = is_pattern(raw, 'em, i, [style*="font-style:italic"]', 'italic'),
  )

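I noted from an answer by @r2evans that I needed to Vectorize the function: mutate passes the whole raw column to the function at once, while read_html expects a single string.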
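
To also keep the matched words themselves, as the question asks with bold_words, one sketch is to collapse the matches for each row into a single string, since html_text() can return zero, one, or several pieces per row while mutate needs exactly one value per row. The helper name extract_words, the demo data frame, and the space separator are assumptions for illustration.

```r
library(dplyr)
library(rvest)

# Hypothetical helper: all text matching `css_selector`, joined into one string
extract_words <- function(html, css_selector) {
  read_html(html) %>%
    html_nodes(css_selector) %>%
    html_text() %>%
    paste(collapse = " ")
}

# Small made-up data frame standing in for the real one
demo <- data.frame(
  raw = c("<p><i>Loss of $1.</i><b>55</b> million in <b>2016</b></p>",
          "<p>No emphasis here</p>"),
  stringsAsFactors = FALSE
)

# rowwise() makes `raw` a single string inside mutate()
demo <- demo %>%
  rowwise() %>%
  mutate(bold_words = extract_words(raw, "b, strong")) %>%
  ungroup()
```

Rows without any match get an empty string. If lengths are computed from bold_words afterwards, the separators added by paste() should be excluded from the count.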

