Extracting text labels from RSS feed XML (using JavaScript/React)

Question

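I just parsed an RSS feed (Upwork's) and got job fields such as the title and link as separate data points (items.title, items.link). However, most of the job data I need (category, skills, and so on) is dumped as one large block of text in the "content" item. In general, the heading for each piece of information I need is a <b> tag, and the information itself is just a block of text followed by a <br /> tag.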

Here is a sample from the XML (items.content):

We are looking for a developer with capabilities as a Wordpress Frontend/Backend Developer&nbsp;or&nbsp;Full Stack Wordpress Developer. <br /><br /> It is important for us to have experience with hosting, SSL, and&nbsp;Pagebuilders&nbsp;(Elementor/Visual Composer).<br /><br /><b>Hourly Range</b>: $20.00-$45.00 <br /><b>Posted On</b>: December 16, 2020 23:12 UTC<br /><b>Category</b>: Full Stack Development<br /><b>Skills</b>:Website Development, API, Website Redesign, WordPress Plugin, Website Optimization, Google Analytics, Java, JavaScript, PHP, Ruby, Scala, Kotlin, Python, SQL, Very Small (1-9 employees), CSS, Website Security, HTML, Graphic Design, Web Design, jQuery, Adobe Photoshop, Adobe Illustrator <br /><b>Location Requirement</b>: Only freelancers located in the United States may apply. <br /><b>Country</b>: United States <br /><a href="https://www.upwork.com/jobs/Ongoing-Website-development-specialist_%7E018e7e903a64f4e78e?source=rss">click to apply</a>

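For example, how can I extract the label "Hourly Range" and the data that goes with it ($20.00-$45.00)? To add complexity, ideally I need to be able to separate each listed item (e.g. HTML, CSS) into its own individual data item.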

I don't know how to read this text and extract the data I need in an organized way. Any help is appreciated!

Tags: javascript, reactjs, xml, xml-parsing, rss

Answer


Anything in a DOM is a node. The labels are b element nodes, and their data are the text nodes that follow them as siblings.

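// Parse the HTML fragment into a document so it can be queried like any DOM.
const snippet = new DOMParser().parseFromString(getHTML(), 'text/html');
const data = {};

// Each label is a <b> element; its value is the text node right after it.
for (const label of snippet.querySelectorAll('b')) {
  const name = normalizeSpace(label.textContent);
  // Guard against a <b> with no following sibling, then drop the leading colon.
  let value = normalizeSpace(label.nextSibling?.textContent ?? '').replace(/^:\s*/, '');
  if (name === 'Skills') {
    // Split the comma-separated skills into an array of individual items.
    value = value.split(/\s*,\s*/);
  }
  data[name] = value;
}
console.log(data);

// Collapse every run of whitespace (including line breaks and &nbsp;) into one space.
function normalizeSpace(value) {
  return value.replace(/\s+/g, ' ').trim();
}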

function getHTML(){
  return `We are looking for a developer with capabilities as a Wordpress   
    Frontend/Backend Developer&nbsp;or&nbsp;Full Stack Wordpress Developer. 
    <br /><br /> It is important for us to have experience with hosting, SSL, 
    and&nbsp;Pagebuilders&nbsp;(Elementor/Visual Composer).<br /><br /><b>Hourly 
    Range</b>: $20.00-$45.00 <br /><b>Posted On</b>: December 16, 2020 23:12 UTC
    <br /><b>Category</b>: Full Stack Development<br /><b>Skills</b>:Website 
    Development, API, Website Redesign, WordPress Plugin, Website Optimization,   
    Google Analytics, Java, JavaScript, PHP, Ruby, Scala, Kotlin, Python, SQL, 
    Very Small (1-9 employees), CSS, Website Security, HTML, Graphic Design, Web 
    Design, jQuery, Adobe Photoshop, Adobe Illustrator <br /><b>Location 
    Requirement</b>: Only freelancers located in the United States may apply. 
    <br /><b>Country</b>: United States <br />
    <a href="https://www.upwork.com/jobs/Ongoing-Website-development-specialist_%7E018e7e903a64f4e78e?source=rss">click to apply</a>`;
}
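
DOMParser is a browser API, so the snippet above will not run as-is where no DOM is available (for example, plain Node.js). As a rough sketch for that case, the same label/value idea can be approximated with a regular expression instead of DOM traversal. This swaps in a different technique from the answer above: extractFields and clean are hypothetical helper names, the sample string is a shortened version of the feed content, and the pattern assumes every value is plain text that ends at the next tag, which holds for the sample but may not hold for every feed item.

```javascript
// Hypothetical helper: pulls "<b>Label</b>: value" pairs out of an HTML string
// without a DOM. Assumes each value is plain text ending at the next tag.
function extractFields(html) {
  const fields = {};
  // <b>label</b>, an optional colon, then everything up to the next "<".
  const pattern = /<b>([^<]*)<\/b>\s*:?([^<]*)/g;
  for (const [, rawName, rawValue] of html.matchAll(pattern)) {
    const name = clean(rawName);
    const value = clean(rawValue);
    // Skills are comma-separated, so split them into individual items.
    fields[name] = name === 'Skills' ? value.split(/\s*,\s*/) : value;
  }
  return fields;
}

// Decode &nbsp; (the only entity in the sample) and collapse whitespace.
function clean(text) {
  return text.replace(/&nbsp;/g, ' ').replace(/\s+/g, ' ').trim();
}

// Shortened sample in the same shape as items.content.
const sample =
  '<b>Hourly Range</b>: $20.00-$45.00 <br />' +
  '<b>Category</b>: Full Stack Development<br />' +
  '<b>Skills</b>:Website Development, API,  CSS, HTML <br />' +
  '<b>Country</b>: United States <br />';

const fields = extractFields(sample);
console.log(fields['Hourly Range']); // $20.00-$45.00
console.log(fields.Skills); // [ 'Website Development', 'API', 'CSS', 'HTML' ]
```

Where a DOM is available, the DOMParser approach is likely the more robust choice, since a regular expression will miss values that contain nested markup.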

