Extract URLs & Anchor Texts from Links on a Webpage Fetched by PHP or cURL

Problem Description

PHP masters,

Here is the code that fetches links from Google.

<?php

# Use the Curl extension to query Google and get back a page of results
$url = "http://www.google.com";
$ch = curl_init();
$timeout = 5;
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
$html = curl_exec($ch);
curl_close($ch);

# Create a DOM parser object
$dom = new DOMDocument();

# Parse the HTML from Google.
# The @ before the method call suppresses any warnings that
# loadHTML might throw because of invalid HTML in the page.
@$dom->loadHTML($html);

# Iterate over all the <a> tags
foreach($dom->getElementsByTagName('a') as $link) {
        # Show the <a href>
        echo $link->getAttribute('href');
        echo "<br />";

?>

It echoes results similar to this:

https://www.google.com/webhp?tab=ww
http://www.google.com/imghp?hl=bn&tab=wi

Now, I am still a learner and need your help. I want to convert the code above so that, using the DOM, it can extract every URL and its anchor text from all the links residing on a chosen webpage, no matter what format the links take. Formats such as:

<a href="http://example1.com">Test 1</a>
<a class="foo" id="bar" href="http://example2.com">Test 2</a>
<a onclick="foo();" id="bar" href="http://example3.com">Test 3</a>

The anchor text should appear beneath each extracted URL, with a blank line separating each listed item, and so on for every link found on the page. For example:

http://stackoverflow.com<br>
A programmer's forum<br>
<br>
http://google.com<br>
A searchengine<br>
<br>
http://yahoo.com<br>
An Index<br>
<br>
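The kind of DOM loop I am imagining is roughly this (only a rough sketch; it assumes $html already holds the fetched page source, and I am not sure it handles every link format):

<?php

# Rough sketch: assumes $html already contains the fetched page source.
$dom = new DOMDocument();
libxml_use_internal_errors(true);               # ignore warnings about malformed HTML
$dom->loadHTML($html);

foreach ($dom->getElementsByTagName('a') as $link) {
    echo $link->getAttribute('href'), "<br>\n"; # the URL
    echo trim($link->textContent), "<br>\n";    # its anchor text
    echo "<br>\n";                              # blank line between items
}

?>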

I would also appreciate a cURL version from you (one that does not use the DOM) that produces the same result. This cURL attempt does not work exactly the way I want:

<?php

$curl = curl_init('http://stackoverflow.com/');
curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);

$page = curl_exec($curl);

if(curl_errno($curl)) // check for execution errors
{
    echo 'Scraper error: ' . curl_error($curl);
    exit;
}

curl_close($curl);

$regex = '/<\s*a\s+[^>]*href\s*=\s*["\']?([^"\' >]+)["\' >]/i';
if ( preg_match($regex, $page, $list) )
    echo $list[0];
else 
    print "Not found"; 

?>

Can cURL (without using the DOM) achieve this without a regular expression? I would like to see one sample with regex and one without. Finally, I really do not want to use limited functions such as get_file().

Thanks!

First edit: This does not work:

<?php

# Use the Curl extension to query Google and get back a page of results
$url = "http://fiverr.com/";
$ch = curl_init();
$timeout = 5;
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
$html = curl_exec($ch);
curl_close($ch);

# Create a DOM parser object
$dom = new DOMDocument();

# Parse the HTML from the fetched page.
# The @ before the method call suppresses any warnings that
# loadHTML might throw because of invalid HTML in the page.
@$dom->loadHTML($html);

# Iterate over all the <a> tags
foreach($dom->getElementsByTagName('a') as $link) {
        # Show the <a href>
        echo $link->getAttribute('href');
        echo "<br />";
        echo $link->nodeValue;      
}

?>

I see a completely blank white page. Nothing is echoed.
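A completely blank page usually means either a fatal PHP error occurring while display_errors is off, or a cURL request that failed silently. A quick debugging sketch to surface the cause (same fiverr.com URL; CURLOPT_FOLLOWLOCATION is added on the assumption that the site redirects):

<?php

# Debugging sketch: surface hidden errors instead of a blank page.
error_reporting(E_ALL);
ini_set('display_errors', '1');

$ch = curl_init('http://fiverr.com/');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);    # follow redirects (e.g. http -> https)
$html = curl_exec($ch);

if ($html === false) {
    die('cURL error: ' . curl_error($ch));      # say why the request failed
}
curl_close($ch);

echo strlen($html) . " bytes fetched";          # confirm something actually came back

?>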


Second edit: I updated the script and saw these errors:

Warning: DOMDocument::loadHTML(): Tag header invalid in Entity, line: 97 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag svg invalid in Entity, line: 119 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag path invalid in Entity, line: 119 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag nav invalid in Entity, line: 123 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag svg invalid in Entity, line: 149 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag path invalid in Entity, line: 149 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag svg invalid in Entity, line: 159 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag path invalid in Entity, line: 159 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag svg invalid in Entity, line: 162 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag path invalid in Entity, line: 162 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag svg invalid in Entity, line: 168 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag path invalid in Entity, line: 168 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag svg invalid in Entity, line: 174 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag path invalid in Entity, line: 174 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag svg invalid in Entity, line: 179 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag path invalid in Entity, line: 179 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): htmlParseEntityRef: expecting ';' in Entity, line: 184 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): htmlParseEntityRef: expecting ';' in Entity, line: 185 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): htmlParseEntityRef: expecting ';' in Entity, line: 348 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag svg invalid in Entity, line: 352 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag g invalid in Entity, line: 352 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag path invalid in Entity, line: 352 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag path invalid in Entity, line: 352 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag path invalid in Entity, line: 352 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag path invalid in Entity, line: 352 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag svg invalid in Entity, line: 356 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag path invalid in Entity, line: 356 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag svg invalid in Entity, line: 358 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag path invalid in Entity, line: 358 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): htmlParseEntityRef: expecting ';' in Entity, line: 361 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): htmlParseEntityRef: expecting ';' in Entity, line: 838 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag svg invalid in Entity, line: 845 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag g invalid in Entity, line: 845 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag path invalid in Entity, line: 845 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag path invalid in Entity, line: 845 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag path invalid in Entity, line: 845 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag path invalid in Entity, line: 845 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag svg invalid in Entity, line: 848 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag path invalid in Entity, line: 848 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag svg invalid in Entity, line: 851 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag g invalid in Entity, line: 851 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag path invalid in Entity, line: 851 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag path invalid in Entity, line: 851 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): ID display-name already defined in Entity, line: 895 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): ID m-address already defined in Entity, line: 899 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag svg invalid in Entity, line: 1155 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag path invalid in Entity, line: 1155 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag footer invalid in Entity, line: 1168 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag svg invalid in Entity, line: 1172 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag g invalid in Entity, line: 1172 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag path invalid in Entity, line: 1172 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag path invalid in Entity, line: 1172 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag nav invalid in Entity, line: 1175 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag svg invalid in Entity, line: 1208 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag path invalid in Entity, line: 1208 in C:\xampp\htdocs\cURL\crawler.php on line 194
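These warnings are libxml complaining about HTML5 elements (header, nav, svg and so on) that it does not recognize, plus a few unescaped & characters in the page. They can be silenced by telling libxml to collect parse errors internally instead of printing them. A minimal sketch, assuming $html already holds the fetched page:

<?php

# Sketch: stop libxml from printing warnings about tags it does not know.
libxml_use_internal_errors(true);   # collect parse errors instead of emitting warnings

$dom = new DOMDocument();
$dom->loadHTML($html);              # assumes $html holds the fetched page source

libxml_clear_errors();              # discard the collected errors once done

?>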

Update:

<?php

/*
Using PHP's DOM functions to
  fetch hyperlinks and their anchor text
*/


$dom = new DOMDocument;
$dom->loadHTML(file_get_contents('https://stackoverflow.com/questions/50381348/extract-urls-anchor-texts-from-links-on-a-webpage-fetched-by-php-or-curl')); 

// echo Links and their anchor text
echo '<pre>';
echo "Link\tAnchor\n";
foreach($dom->getElementsByTagName('a') as $link) {
    $href = $link->getAttribute('href');
    $anchor = $link->nodeValue;
    echo $href,"\t",$anchor,"\n";
}
echo '</pre>';

?>

Third edit: OK. So far, Luis Munoz's sample works for me. However, neither his sample nor my original one is a crawler that follows the links it finds on the fetched page. So now I would like to extend the script's functionality so the crawler follows the links found on the fetched page. Below are my two attempts, done in two different ways, at building a simple link-following crawler. What I want is to learn to build a simple web crawler that follows links and extracts the links found on each new page it reaches.

Steps: First, I give it a URL to start with. It then fetches that page, extracts all of its links into a single array, and echoes the extracted links, so on each page load you only see the extracted links echoed. It then fetches each of those linked pages and likewise extracts all of their links into an array and echoes them. It keeps doing this until it reaches the maximum link-depth level that has been set.
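In other words, something with roughly this shape (an untested sketch with a placeholder starting URL and depth limit; I do not know whether this is the right approach):

<?php

// Untested sketch of a depth-limited link crawler: fetch a page, echo the
// links it contains, then fetch each of those links in turn, stopping
// once the maximum depth is reached.

function fetch_links($url)
{
    $curl = curl_init($url);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
    $html = curl_exec($curl);
    curl_close($curl);
    if ($html === false) {
        return array();                       // request failed, nothing to extract
    }

    libxml_use_internal_errors(true);         // hide warnings about messy HTML
    $dom = new DOMDocument();
    $dom->loadHTML($html);

    $links = array();
    foreach ($dom->getElementsByTagName('a') as $a) {
        $href = $a->getAttribute('href');
        if (strpos($href, 'http') === 0) {    // keep absolute http(s) links only
            $links[] = $href;
        }
    }
    return array_unique($links);
}

$start_url = 'https://example.com/';          // placeholder starting URL
$max_depth = 2;                               // placeholder depth limit
$queue     = array($start_url);
$visited   = array();

for ($depth = 0; $depth < $max_depth; $depth++) {
    $next_queue = array();
    foreach ($queue as $url) {
        if (isset($visited[$url])) {
            continue;                         // do not fetch the same page twice
        }
        $visited[$url] = true;

        echo "Crawling (depth $depth): $url<br>\n";
        foreach (fetch_links($url) as $link) {
            echo $link, "<br>\n";
            $next_queue[] = $link;
        }
    }
    $queue = $next_queue;                     // links found here get crawled next round
}

?>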

Attempt 1

<?php 

include('simple_html_dom.php'); 

$current_link_crawling_level = 0; 
$link_crawling_level_max = 2;

if($current_link_crawling_level == $link_crawling_level_max)
{
    echo "link crawling depth level reached!"; 
    sleep(5);
    exit(); 
}
else
{
    $url = 'http://php.net/manual-lookup.php?pattern=str_get_html&scope=quickref'; 
    $curl = curl_init($url); 
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1); 
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, 1); 
    curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, 0); 
    curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, 0); 
    $response_string = curl_exec($curl); 

    $html = str_get_html($response_string);

    $current_link_crawling_level++; 

    //to fetch all hyperlinks from the webpage 
    $links = array(); 
    foreach($html->find('a') as $a) 
    { 
        $links[] = $a->href; 
        echo "Value: $a<br />\n"; 
        print_r($links); 

        sleep(1);

        $url = '$value'; 
        $curl = curl_init($a); 
        curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1); 
        curl_setopt($curl, CURLOPT_FOLLOWLOCATION, 1); 
        curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, 0); 
        curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, 0); 
        $response_string = curl_exec($curl); 

        $html = str_get_html($response_string);

        $current_link_crawling_level++; 

        //to fetch all hyperlinks from the webpage 
        $links = array(); 
        foreach($html->find('a') as $a) 
        { 
            $links[] = $a->href; 
            echo "Value: $a<br />\n";
            print_r($links); 

            sleep(1);           
        } 
    echo "Value: $a<br />\n";
    print_r($links); 
    }
}

?>

Attempt 2:

<?php 

include('simple_html_dom.php'); 

$current_link_crawling_level = 0; 
$link_crawling_level_max = 2;

if($current_link_crawling_level == $link_crawling_level_max)
{
    echo "link crawling depth level reached!"; 
    sleep(5);
    exit(); 
}
else
{
    $url = 'http://php.net/manual-lookup.php?pattern=str_get_html&scope=quickref'; 
    $curl = curl_init($url); 
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1); 
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, 1); 
    curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, 0); 
    curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, 0); 
    $response_string = curl_exec($curl); 

    $html = str_get_html($response_string);

    $current_link_crawling_level++; 

    //to fetch all hyperlinks from the webpage 
    // Hide HTML warnings
    libxml_use_internal_errors(true);
    $dom = new DOMDocument;
    if($dom->loadHTML($html, LIBXML_NOWARNING))
    {
        // echo Links and their anchor text
        echo '<pre>';
        echo "Link\tAnchor\n";
        foreach($dom->getElementsByTagName('a') as $link) 
        {
            $href = $link->getAttribute('href');
            $anchor = $link->nodeValue;
            echo $href,"\t",$anchor,"\n";

            sleep(1);

            $url = 'http://php.net/manual-lookup.php?pattern=str_get_html&scope=quickref'; 
            $curl = curl_init($url); 
            curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1); 
            curl_setopt($curl, CURLOPT_FOLLOWLOCATION, 1); 
            curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, 0); 
            curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, 0); 
            $response_string = curl_exec($curl); 

            $html = str_get_html($response_string);

            $current_link_crawling_level++; 

            //to fetch all hyperlinks from the webpage 
            // Hide HTML warnings
            libxml_use_internal_errors(true);
            $dom = new DOMDocument;
            if($dom->loadHTML($html, LIBXML_NOWARNING))
            {
                // echo Links and their anchor text
                echo '<pre>';
                echo "Link\tAnchor\n";
                foreach($dom->getElementsByTagName('a') as $link) 
                {
                    $href = $link->getAttribute('href');
                    $anchor = $link->nodeValue;
                    echo $href,"\t",$anchor,"\n";

                    sleep(1);
                }
                echo '</pre>';
            }
            else
            {
                echo "Failed to load html.";
            }
        }
    }
    else
    {
        echo "Failed to load html.";
    }
}
?>

I would be grateful for any code sample that is really simple for a beginner. Even better if it is in procedural style, since I am a beginner.

Thank you!

Tags: curl, dom, hyperlink, extract

Solution


Let's use cURL instead of file_get_contents, since it is a better option for handling HTTPS requests. Also, add warning suppression to avoid messages about the broken HTML:

<?php

/*
Using PHP's DOM functions to
fetch hyperlinks and their anchor text
*/

$url = 'https://stackoverflow.com/questions/50381348/extract-urls-anchor-texts-from-links-on-a-webpage-fetched-by-php-or-curl';
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, 0);
curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, 0);
$data = curl_exec($curl);

// Hide HTML warnings
libxml_use_internal_errors(true);
$dom = new DOMDocument;
if($dom->loadHTML($data, LIBXML_NOWARNING)){
    // echo Links and their anchor text
    echo '<pre>';
    echo "Link\tAnchor\n";
    foreach($dom->getElementsByTagName('a') as $link) {
        $href = $link->getAttribute('href');
        $anchor = $link->nodeValue;
        echo $href,"\t",$anchor,"\n";
    }
    echo '</pre>';
}else{
    echo "Failed to load html.";

}
?>
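For the regex-based cURL version the question also asks for, here is a rough sketch. It is deliberately hedged: regular expressions cannot fully parse HTML, so a pattern like this will miss unusual markup, but it captures the href value and the anchor text of ordinary <a> tags.

<?php

// Rough regex-only sketch (no DOM). Fragile by nature, but illustrates the idea.
$curl = curl_init('http://stackoverflow.com/');
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
$page = curl_exec($curl);

if (curl_errno($curl)) {
    echo 'Scraper error: ' . curl_error($curl);
    exit;
}
curl_close($curl);

// Capture the href value (group 1) and the anchor text (group 2) of each <a> tag.
$regex = '/<a\s[^>]*href\s*=\s*["\']([^"\']+)["\'][^>]*>(.*?)<\/a>/is';
if (preg_match_all($regex, $page, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $m) {
        echo $m[1], "<br>\n";               // URL
        echo strip_tags($m[2]), "<br>\n";   // anchor text with any inner tags removed
        echo "<br>\n";                      // blank line between items
    }
} else {
    echo "Not found";
}

?>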
