Scraping a list in R

Question

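I want to scrape a list of elements (player name, cost, buyer, seller, date) from a local HTML file, but I run into trouble when I try to scrape the buyer and seller (in this example, "computer" and "peter" for the first transfer, and "computer" and "james" for the second):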

document.querySelector("#pressReleases > ul > li:nth-child(2) > ul > li.text > div > strong:nth-child(2)")

document.querySelector("#pressReleases > ul > li:nth-child(3) > ul > li.text > div > strong:nth-child(2)")

How can I scrape the li elements to fill these 2 variables?

I tried this in R:

library(rvest)  # provides html_nodes(), html_text() and the %>% pipe

# mylocalfile stands for the already parsed local HTML document
dades <- mylocalfile

player <- dades %>% html_nodes("ul.player li.text strong") %>% html_text() %>% trimws()
cost <- dades %>% html_nodes("ul.player li.text span") %>% html_text() %>% trimws()
buyer <- dades %>% html_nodes("#pressReleases > ul > li:nth-child(2) > ul > li.text > div > strong:nth-child(2)") %>% html_text() %>% trimws()
seller <- dades %>% html_nodes("#pressReleases > ul > li:nth-child(2) > ul > li.text > div > strong:nth-child(1)") %>% html_text() %>% trimws()
day <- dades %>% html_nodes("ul.player li.text time") %>% html_text() %>% trimws()

I noticed that the 2 in #pressReleases > ul > li:nth-child(2) changes for each li class="post pressRelease".

HTML code:

<div class="newsList" id="pressReleases">
<ul>
 <li class="date" style="background-color: rgb(128, 128, 128);">
   <strong>Fitxatges del dia</strong>
    09/08/2019
  </li>
  <li class="post pressRelease">
    <ul class="player">
      <li class="photo">
        <img src="./futmondo - Fútbol fantasy manager - futmondo_files/espanyol.png" onerror="Futmondo.Helpers.Resources.onErrorPlayerPhoto(this, &quot;L&quot;, &quot;espanyol.png&quot;)">
        <img src="./futmondo - Fútbol fantasy manager - futmondo_files/espanyol(1).png" alt="Espanyol" class="crest">
      </li>
      <li class="text">
         <strong>Player1</strong>
         <time>09/08/2019 - 05:30</time>
         <span>16.245.485 €</span>
         <div class="from">
           D'
         <strong>computer</strong>
           a 
         <strong>peter</strong>
        </div>
       </li>
      <a class="icon-revert">
      </a>
     </ul>
     <div class="bid second">
        <span class="triangle"></span>
        <strong class="second">2º puja</strong>
        <strong>matheu:</strong>
        <span class="price">15.925.828 €</span>
     </div>
  </li>
  <li class="post pressRelease">
    <ul class="player">
      <li class="photo">
        <img src="./futmondo - Fútbol fantasy manager - futmondo_files/real-sociedad.png" onerror="Futmondo.Helpers.Resources.onErrorPlayerPhoto(this, &quot;L&quot;, &quot;real-sociedad.png&quot;)">
        <img src="./futmondo - Fútbol fantasy manager - futmondo_files/real-sociedad(1).png" alt="Real Sociedad" class="crest">
      </li>
      <li class="text">
       <strong>Player2</strong>
       <time>09/08/2019 - 05:30</time>
       <span>1.111.711 €</span>
       <div class="from">
          D'
         <strong>computer</strong>
          a 
         <strong>james</strong>
       </div>
      </li>
      <a class="icon-revert">
      </a>
    </ul>
   </li>

Tags: r, list, web-scraping, rvest

Answer


Here is a possible solution for getting buyer/seller:

# Read the local file
URL <- 'D:/Test/Test.html'
wp <- xml2::read_html(URL, encoding = 'utf-8')
# Extract the relevant nodes (one <div class="from"> per transfer)
node <- rvest::html_nodes(wp, '.from')
# Pattern for "D' <seller> a <buyer>"; it expects Windows (CRLF) line endings in the file
pattern <- '.*D\'\r\n\\s+(.*?)\r\n\\s+a\\s?\r\n\\s+(.*?)\r\n.*'
# Extract the names: capture group 1 is the seller, group 2 the buyer
seller <- gsub(pattern, '\\1', rvest::html_text(node))
# [1] "computer" "computer"
buyer <- gsub(pattern, '\\2', rvest::html_text(node))
# [1] "peter" "james"
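
The regular expression above depends on the exact whitespace and line endings in the file. As an alternative, here is a minimal sketch that relies only on CSS selectors: it first selects every li with class "post pressRelease" and then picks one node per post, so the changing li:nth-child(n) index is never needed. The inline HTML is a trimmed copy of the markup in the question, and the names page, posts and transfers are only illustrative. With the real file, replace the string with the file path.

```r
library(rvest)

# Trimmed copy of the markup from the question, so the sketch runs on its own.
# With the real file, pass its path to read_html() instead of this string.
page <- read_html('
<div class="newsList" id="pressReleases">
<ul>
  <li class="date"><strong>Fitxatges del dia</strong> 09/08/2019</li>
  <li class="post pressRelease">
    <ul class="player">
      <li class="text">
        <strong>Player1</strong>
        <time>09/08/2019 - 05:30</time>
        <span>16.245.485 €</span>
        <div class="from">D\' <strong>computer</strong> a <strong>peter</strong></div>
      </li>
    </ul>
  </li>
  <li class="post pressRelease">
    <ul class="player">
      <li class="text">
        <strong>Player2</strong>
        <time>09/08/2019 - 05:30</time>
        <span>1.111.711 €</span>
        <div class="from">D\' <strong>computer</strong> a <strong>james</strong></div>
      </li>
    </ul>
  </li>
</ul>
</div>')

# One node per transfer
posts <- html_nodes(page, "li.post.pressRelease")

# html_node() (singular) returns exactly one match per post, or NA if the
# element is missing, so all columns keep the same length and stay aligned.
transfers <- data.frame(
  player = posts %>% html_node("li.text > strong") %>% html_text(trim = TRUE),
  cost   = posts %>% html_node("li.text > span") %>% html_text(trim = TRUE),
  seller = posts %>% html_node("div.from strong:nth-of-type(1)") %>% html_text(trim = TRUE),
  buyer  = posts %>% html_node("div.from strong:nth-of-type(2)") %>% html_text(trim = TRUE),
  day    = posts %>% html_node("li.text > time") %>% html_text(trim = TRUE),
  stringsAsFactors = FALSE
)

print(transfers)
```

The child combinator in "li.text > strong" keeps the player name apart from the two strong elements inside div.from, which "ul.player li.text strong" would also match.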
