php - 为什么 PHP 有时无法从标题标签中提取标题?
问题描述
以下代码设法从某些页面的标题标签中提取标题,而无法从其他标题完整的页面中提取标题。我无法弄清楚为什么它会失火。我在本地主机上运行它。这个片段应该有效:
$doc = new DOMDocument();
libxml_use_internal_errors( true );
$doc->loadHTML( $contents['body'] );
//Get title
$title = '';
$titleTags = $doc->getElementsByTagName('title');
if( count( $titleTags ) > 0 ) {
$title = mysqli_real_escape_string( $mysql_conn, $titleTags[0]->nodeValue );
}
上面的代码片段有什么问题?它应该从所有标题完整的页面中提取标题。
$mysql_username = 'root';
$mysql_password = '';
$mysql_database = 'powerpage';
$mysql_conn = mysqli_connect( $mysql_host, $mysql_username, $mysql_password, $mysql_database );
if ( !$mysql_conn ) {
echo "Error: Unable to connect to MySQL." . PHP_EOL;
echo "Debugging errno: " . mysqli_connect_errno() . PHP_EOL;
echo "Debugging error: " . mysqli_connect_error() . PHP_EOL;
exit;
}
/**
* Download a Webpage via the HTTP GET Protocol using libcurl
* Developed By: Potent Pages, LLC (https://potentpages.com/)
*/
function _http ( $target, $referer ) {
//Initialize Handle
$handle = curl_init();
//Define Settings
curl_setopt ( $handle, CURLOPT_HTTPGET, true );
curl_setopt ( $handle, CURLOPT_HEADER, true );
curl_setopt ( $handle, CURLOPT_COOKIEJAR, "cookie_jar.txt" );
curl_setopt ( $handle, CURLOPT_COOKIEFILE, "cookies.txt" );
curl_setopt ( $handle, CURLOPT_USERAGENT, "web-crawler-tutorial-test" );
curl_setopt ( $handle, CURLOPT_URL, $target );
curl_setopt ( $handle, CURLOPT_REFERER, $referer );
curl_setopt ( $handle, CURLOPT_FOLLOWLOCATION, true );
curl_setopt ( $handle, CURLOPT_MAXREDIRS, 4 );
curl_setopt ( $handle, CURLOPT_RETURNTRANSFER, true );
//Execute Request
$output = curl_exec ( $handle );
//Close cURL handle
curl_close ( $handle );
//Separate Header and Body
$separator = "\r\n\r\n";
$header = substr( $output, 0, strpos( $output, $separator ) );
$body_start = strlen( $header ) + strlen( $separator );
$body = substr( $output, $body_start, strlen( $output ) - $body_start );
//Parse Headers
$header_array = Array();
foreach ( explode ( "\r\n", $header ) as $i => $line ) {
if($i === 0) {
$header_array['http_code'] = $line;
$status_info = explode( " ", $line );
$header_array['status_info'] = $status_info;
} else {
list ( $key, $value ) = explode ( ': ', $line );
$header_array[$key] = $value;
}
}
//Form Return Structure
$ret = Array("headers" => $header_array, "body" => $body );
return $ret;
}
/**
* Convert Relative to Absolute URL
* Developed By: Potent Pages, LLC (https://potentpages.com/)
* From: https://potentpages.com/web-crawler-development/tutorials/php/simple-php-web-spider
* Based On: https://stackoverflow.com/questions/4444475/transfrom-relative-path-into-absolute-url-using-php
*/
function relativeToAbsolute( $relative, $base )
{
if($relative == "" || $base == "") return "";
//Check Base
$base_parsed = parse_url($base);
if( !array_key_exists( 'scheme', $base_parsed ) || !array_key_exists( 'host', $base_parsed ) || !array_key_exists( 'path', $base_parsed ) ) {
echo "Base Path \"$base\" Not Absolute Link\n";
return "";
}
//Parse Relative
$relative_parsed = parse_url($relative);
//If relative URL already has a scheme, it's already absolute
if( array_key_exists( 'scheme', $relative_parsed ) && $relative_parsed['scheme'] != '' ) {
return $relative;
}
//If only a query or a fragment, return base (without any fragment or query) + relative
if( !array_key_exists( 'scheme', $relative_parsed ) && !array_key_exists( 'host', $relative_parsed ) && !array_key_exists( 'path', $relative_parsed ) ) {
return $base_parsed['scheme']. '://'. $base_parsed['host']. $base_parsed['path']. $relative;
}
//Remove non-directory portion from path
$path = preg_replace( '#/[^/]*$#', '', $base_parsed['path'] );
//If relative path already points to root, remove base return absolute path
if( $relative[0] == '/' ) {
$path = '';
}
//Working Absolute URL
$abs = '';
//If user in URL
if( array_key_exists( 'user', $base_parsed ) ) {
$abs .= $base_parsed['user'];
//If password in URL as well
if( array_key_exists( 'pass', $base_parsed ) ) {
$abs .= ':'. $base_parsed['pass'];
}
//Append location prefix
$abs .= '@';
}
//Append Host
$abs .= $base_parsed['host'];
//If port in URL
if( array_key_exists( 'port', $base_parsed ) ) {
$abs .= ':'. $base_parsed['port'];
}
//Append New Relative Path
$abs .= $path. '/'. $relative;
//Replace any '//' or '/./' or '/foo/../' with '/'
$regex = array('#(/\.?/)#', '#/(?!\.\.)[^/]+/\.\./#');
for( $n=1; $n>0; $abs = preg_replace( $regex, '/', $abs, -1, $n ) ) {}
//Return Absolute URL
return $base_parsed['scheme']. '://'. $abs;
}
function parsePage( $target, $referer ) {
global $mysql_conn;
//Parse URL and get Components
$url_components = parse_url( $target );
if($url_components === false) {
die( 'Unable to Parse URL' );
}
$url_host = $url_components['host'];
$url_path = '';
if(array_key_exists( 'path', $url_components ) == false) {
//If not a valid path, mark as done
$query = "INSERT INTO pages (path, download_time) VALUES (\"". mysqli_real_escape_string( $mysql_conn, $target ). "\", NOW()) ON DUPLICATE KEY UPDATE download_time=NOW()";
if( !mysqli_query($mysql_conn, $query) ) {
echo "Line 128";//MY OWN ADDED LINE
die( "Error: Unable to perform Download Time Update Query (path)\n" );
}
return false;
} else {
$url_path = $url_components['path'];
}
//Download Page
echo "Downloading: $target\n";
$contents = _http ( $target, $referer );
echo "Done\n";
//Check Status
if( $contents['headers']['status_info'][1] != 200 ) {
//If not ok, mark as downloaded but skip
$query = "INSERT INTO pages (path, download_time) VALUES (\"". mysqli_real_escape_string( $mysql_conn, $url_path ). "\", NOW()) ON DUPLICATE KEY UPDATE download_time=NOW()";
if( !mysqli_query($mysql_conn, $query) ) {
echo "Line 144";//MY OWN ADDED LINE
die( "Error: Unable to perform Download Time Update Query (http status)\n" );
}
return false;
}
//Parse Contents
$doc = new DOMDocument();
libxml_use_internal_errors( true );
$doc->loadHTML( $contents['body'] );
//Get title
$title = '';
$titleTags = $doc->getElementsByTagName('title');
if( count( $titleTags ) > 0 ) {
$title = mysqli_real_escape_string( $mysql_conn, $titleTags[0]->nodeValue );
}
//Get Description
$description = '';
$metaTags = $doc->getElementsByTagName('meta');
foreach( $metaTags as $tag ) {
if( $tag->getAttribute('name') == 'description' ) {
$description = mysqli_real_escape_string( $mysql_conn, $tag->getAttribute( 'content' ) );
}
}
//Get first h1
$h1 = '';
$h1Tags = $doc->getElementsByTagName('h1');
if( count( $h1Tags ) > 0 ) {
$h1 = mysqli_real_escape_string( $mysql_conn, $h1Tags[0]->nodeValue );
}
//Insert/Update Page Data
$query = "INSERT INTO pages (path, title, description, h1, download_time) VALUES (\"". mysqli_real_escape_string( $mysql_conn, $url_path ). "\", \"$title\", \"$description\", \"$h1\", NOW()) ON DUPLICATE KEY UPDATE title=\"$title\", description=\"$description\", h1=\"$h1\", download_time=NOW()";
if( !mysqli_query($mysql_conn, $query) ) {
echo "Line 176";//MY OWN ADDED LINE
die( "Error: Unable to perform Insert Query\n" );
}
//Get Links
$links = Array();
$link_tags = $doc->getElementsByTagName( 'a' );
foreach( $link_tags as $tag ) {
if( ($href_value = $tag->getAttribute( 'href' ))) {
$link_absolute = relativeToAbsolute( $href_value, $target );
$link_parsed = parse_url( $link_absolute );
if($link_parsed === null || $link_parsed === false) {
die( 'Unable to Parse Link URL' );
}
if(( !array_key_exists( 'host', $link_parsed ) || $link_parsed['host'] == "" || $link_parsed['host'] == $url_host ) && array_key_exists( 'path', $link_parsed ) && $link_parsed['path'] != "" && array_search( $link_parsed['path'], $links ) === false) {
$links[] = $link_parsed['path'];
}
}
}
//Insert Links
foreach($links as $link) {
$link_escaped = mysqli_real_escape_string( $mysql_conn, $link );
$query = "INSERT IGNORE INTO pages (path, referer, download_time) VALUES (\"$link_escaped\", \"". mysqli_real_escape_string( $mysql_conn, $target ). "\", NULL)";
if( !mysqli_query($mysql_conn, $query) ) {
echo "Line 199";//MY OWN ADDED LINE
die( "Error: Unable to perform Insert Link Value Query\n" );
}
}
return true;
}
//Define Seed Settings
$seed_url = "http://localhost/ebrute.com/crawler/5/1.php";
$seed_components = parse_url( $seed_url );
if($seed_components === false) {
die( 'Unable to Seed Parse URL' );
}
$seed_scheme = $seed_components['scheme'];
$seed_host = $seed_components['host'];
$url_start = $seed_scheme. '://'. $seed_host;
//Download Seed URL
parsePage( $seed_url, "" );
//Loop through all pages on site.
while(1) {
$counter = 0;
$select_query = "SELECT * FROM pages WHERE download_time IS NULL";
if(($select_result = mysqli_query( $mysql_conn, $select_query )) !== false) {
if( ($rowCount = mysqli_num_rows($select_result)) > 0 ) {
for( $i = 0; $i < $rowCount; $i++ ) {
if( ( $row = mysqli_fetch_assoc($select_result) ) !== false ) {
$path = $row['path'];
$referer = $row['referer'];
//Check if first character isn't a '/'
if( $path[0] != '/' ) {
continue;
}
$path = $row['path'];
$referer = $row['referer'];
if( parsePage( $url_start. $path, $referer ) ) {
$counter++;
}
sleep(1);
}
}
} else {
break;
}
} else {
die( "Unable to select un-downloaded pages\n" );
}
if($counter == 0) {
break;
}
}
?>
设法从 1.php 中提取标题,如下所示:
<html>
<head>
<title>
1st Page Title
</title>
<head>
<meta charset="UTF-8">
<meta name="description" content="Page 1 description is this">
<meta name="keywords" content="page, 1, one">
<meta name="author" content="Mr 1">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
</head>
<body>
This is page <a href="2.php"> 2</a> or <a href="2.php">two</a>.
This is page <a href="10.php"> 10</a> or <a href="10.php">ten</a>.
</body>
</html>
无法从 2.php 和 10.php 中提取。
2.php 看起来像这样:
<html>
<head>
<title>
2nd Page Title
</title>
<head>
<meta charset="UTF-8">
<meta name="description" content="Page 2 description is this">
<meta name="keywords" content="page, 2, two">
<meta name="author" content="Mr 2">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
</head>
<body>
This is page <a href="3.php"> 3</a> or <a href="3.php">three</a>.
This is page <a href="20.php"> 20</a> or <a href="20.php">twenty</a>.
</body>
</html>
10.php 看起来像这样:
<html>
<head>
<title>
10th Page Title
</title>
<head>
<meta charset="UTF-8">
<meta name="description" content="Page 10 description is this">
<meta name="keywords" content="page, 10, ten">
<meta name="author" content="Mr 10">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
</head>
<body>
This is page <a href="20.php"> 20</a> or <a href="20.php">twenty</a>.
This is page <a href="2.php"> 2</a> or <a href="2.php">two</a>.
</body>
</html>
这些是我在 localhost (Xampp) 上的文件,用于测试目的。
解决方案
推荐阅读
- reactjs - 如何在我的超级账本资源管理器的反应组件中获取超级账本结构中区块链的最新块?
- python - 在给定输入中查找python中的最大和最小数字
- html - 根据设备更改文本大小
- c++ - 在 Raspberry PI 上使用 libtorch 构建程序时的链接问题
- java - AndroidStudio 中的 MainActivity 未找到自定义 Java 类
- delphi - 在方法链中用作输出和(非参考)输入的值
- python - Tkinter:将条目设置为只读不起作用
- javascript - 使用自定义原型创建 JS 函数的最佳方法是什么?
- javascript - 配对和挑战
- c++ - 通过命令行在 Visual Studio 中编译单个独立文件