Extracting img src from a text element in an XML feed

Question

I have an XML feed that looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<smf:xml-feed xmlns:smf="http://www.simplemachines.org/" xmlns="http://www.simplemachines.org/xml/recent" xml:lang="en-US">
  <recent-post>
    <time>April 04, 2021, 04:20:47 pm</time>
    <id>1909114</id>
    <subject>Title</subject>
    <body><![CDATA[<a href="#"><img src="image.png">Lorem ipsum dolor sit amet, consectetur adipisicing elit. Iure rerum in tempore sit ducimus doloribus quod commodi eligendi ipsam porro non fugiat nisi eaque delectus harum aspernatur recusandae incidunt quasi.</a>]]></body>
  </recent-post>
</smf:xml-feed>

I would like to pull the image src out of the body element and then save it to a new XML file that has an element for image.

So far, I have

$xml = 'https://example.com/feed.xml';
$dom = new DOMDocument();
$dom->preserveWhiteSpace = false;
$dom->formatOutput = true;
$dom->recover = true;
libxml_use_internal_errors(true);
$dom->loadXML($xml);

$xpath = new DOMXPath( $dom );
$nodes = $xpath->query( 'smf:xml-feed/recent-post/body' );

foreach( $nodes as $node )
{
    $html = new DOMDocument();
    $html->loadHTML( $node->nodeValue );
    $src = $html->getElementsByTagName( 'img' )->item(0)->getAttribute('src');
    echo $src;
}

But when I try to print $nodes, I get nothing. What am I missing?

Tags: php, xml, domdocument, domxpath

Solution


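This looks like a Simple Machines feed. However, the namespaces are missing from your XPath expression, and the "body" element should be a CDATA section that contains an HTML fragment as text. I would expect the feed to look like this: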

<smf:xml-feed 
  xmlns:smf="http://www.simplemachines.org/" 
  xmlns="http://www.simplemachines.org/xml/recent" 
  xml:lang="en-US">
    <recent-post>
    <time>April 04, 2021, 04:20:47 pm</time>
    <id>1909114</id>
    <subject>Title</subject>
    <body><![CDATA[
    <a href="#"><img src="image.png">Lorem ipsum dolor sit amet, consectetur adipisicing elit. Iure rerum in tempore sit ducimus doloribus quod commodi eligendi ipsam porro non fugiat nisi eaque delectus harum aspernatur recusandae incidunt quasi.</a>
    ]]>
    </body>
  </recent-post>
</smf:xml-feed>

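The XML defines two namespaces. To use them in an XPath expression, you have to register prefixes for them. I suggest iterating over the recent-post elements, then using expressions with a string typecast to fetch the text content of specific child nodes.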
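The effect of the default namespace can be checked in isolation. The following is a minimal sketch, assuming a trimmed-down copy of the feed inlined as a string (the variable names and the `recent` prefix are illustrative choices, not part of the feed). It runs the same path twice, once with an unprefixed `recent-post` step and once with the registered prefix:

```php
<?php
// Assumption: a shortened copy of the feed, inlined so the sketch needs no network access.
$xmlString = <<<'XML'
<smf:xml-feed xmlns:smf="http://www.simplemachines.org/" xmlns="http://www.simplemachines.org/xml/recent" xml:lang="en-US">
  <recent-post>
    <subject>Title</subject>
  </recent-post>
</smf:xml-feed>
XML;

$document = new DOMDocument();
$document->loadXML($xmlString);
$xpath = new DOMXPath($document);

// The prefixes are local to the XPath object; only the namespace URIs must match the feed.
$xpath->registerNamespace('smf', 'http://www.simplemachines.org/');
$xpath->registerNamespace('recent', 'http://www.simplemachines.org/xml/recent');

// In XPath 1.0 an unprefixed name matches only elements in no namespace,
// so the default namespace of the feed hides recent-post from this query.
$withoutPrefix = $xpath->query('/smf:xml-feed/recent-post');

// With the registered prefix, the element is found.
$withPrefix = $xpath->query('/smf:xml-feed/recent:recent-post');

echo $withoutPrefix->length, "\n"; // 0
echo $withPrefix->length, "\n";    // 1
```

Note also that loadXML() expects the XML source itself as a string. To read from a URL or file path, as in the question, load() is the matching method.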

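The body element contains the HTML fragment as text, so you need to load it into a separate document. You can then use XPath on that document to fetch the src of the img.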

$feedDocument = new DOMDocument();
$feedDocument->preserveWhiteSpace = false;
$feedDocument->loadXML($xmlString);
$feedXpath = new DOMXPath($feedDocument);

// register namespaces
$feedXpath->registerNamespace('smf', 'http://www.simplemachines.org/');
$feedXpath->registerNamespace('recent', 'http://www.simplemachines.org/xml/recent');

// iterate the posts
foreach($feedXpath->evaluate('/smf:xml-feed/recent:recent-post') as $post) {
    // demo: fetch post subject as string
    var_dump($feedXpath->evaluate('string(recent:subject)', $post));
    
    // create a document for the HTML fragment
    $html = new DOMDocument();
    $html->loadHTML(
        // load the text content of the body element
        $feedXpath->evaluate('string(recent:body)', $post),
        // just a fragment, no need for html document elements or DTD
        LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD
    );
    // Xpath instance for the html document
    $htmlXpath = new DOMXPath($html);
    // fetch first src attribute of an img 
    $src = $htmlXpath->evaluate('string(//img/@src)');
    var_dump($src);
}

Output:

string(5) "Title"
string(9) "image.png"
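The question also asks for the src values to be saved to a new XML file with an image element. A minimal sketch of that step, assuming the values were collected into an array inside the loop; the `$sources` array, the `images` root element, and the file name are hypothetical and not prescribed by the feed:

```php
<?php
// Assumption: $sources stands in for the src values gathered while iterating the posts.
$sources = ['image.png'];

$output = new DOMDocument('1.0', 'UTF-8');
$output->formatOutput = true;

// Hypothetical root element; the question only names the "image" element.
$root = $output->appendChild($output->createElement('images'));

foreach ($sources as $src) {
    $image = $root->appendChild($output->createElement('image'));
    // Adding the value as a text node lets the DOM escape characters such as "&".
    $image->appendChild($output->createTextNode($src));
}

$xml = $output->saveXML();
echo $xml;

// To write to disk instead of printing:
// $output->save('images.xml'); // hypothetical file name
```

Adding the value through createTextNode() rather than passing it as the second argument of createElement() avoids problems with unescaped ampersands in image URLs.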
