php - W3 Validator API for file upload / direct input
Question
I am trying to use the http://validator.w3.org/nu/ API for direct input via the POST method.
https://github.com/validator/validator/wiki/Service-%C2%BB-Input-%C2%BB-textarea
This is what I tried, without success:
class frontend {
public static function file_get_contents_curl($url)
{
$ch = curl_init();
curl_setopt($ch, CURLOPT_HEADER, 0);
$user_agent = self::random_user_agent();
//var_dump($user_agent);
curl_setopt($ch,CURLOPT_USERAGENT,$user_agent);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
if (strpos($url, 'https') !== false) {
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
}
$data = curl_exec($ch);
curl_close($ch);
return $data;
}
}
$domain = 'yahoo.com';
$url = 'https://'.$domain;
$html = frontend::file_get_contents_curl($url);
libxml_use_internal_errors(true);
$doc = new DOMDocument;
$doc->loadHTML($html);
$html_file_output = $domain.'.html';
$dir = $_SERVER['DOCUMENT_ROOT'].'/tmp/';
if(!file_exists($dir)) {
mkdir($dir);
}
$file_path = $dir.$html_file_output;
$doc->saveHTMLFile($file_path);
var_dump($file_path); // the filepath where the file is saved /www.domain.com/tmp/html_file.html
$url_validator = 'http://validator.w3.org/nu/';
$query = [
'out' => 'json',
'content' => $html // the HTML resulting from $url variable %3C%21DOCTYPE+html%3E%0....
//'content' => $file_path tried also => /www.domain.com/tmp/the_file.html
];
$query_string = http_build_query($query);
var_dump($query_string); // returns string 'out=json&content=doctype html....' or 'out=json&content=F:/SERVER/www/www.domain.com/tmp/yahoo.com.html'
$ch = curl_init();
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, $query_string);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$str_html = curl_exec($ch);
curl_close($ch);
$data = json_decode($str_html);
var_dump($data); // returns null
unlink($file_path);
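Answer
First, the "direct input" API only accepts POST requests in multipart/form-data format, but when you run the data through http_build_query() you convert it to application/x-www-form-urlencoded format, which the API does not understand. (Give CURLOPT_POSTFIELDS an array and it will automatically be encoded as multipart/form-data.)
Second, this API blocks requests that lack a User-Agent header, and libcurl has no default UA (the curl CLI program has one, but libcurl does not), so you have to supply one yourself, and your request to the validator does not.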
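To make the two points concrete, here is a minimal offline sketch of the difference between a string body and an array body. The sample HTML is made up for illustration, and curl_exec() is deliberately left out so that nothing is sent over the network.

```php
<?php
// Illustrative input only; any HTML string would do.
$fields = [
    'showsource' => 'yes',
    'content'    => '<p>hi</p>',
];

// What the question sent: a string, which libcurl posts as
// application/x-www-form-urlencoded.
$urlencoded = http_build_query($fields);
echo $urlencoded, "\n"; // prints: showsource=yes&content=%3Cp%3Ehi%3C%2Fp%3E

// What the answer recommends: pass the array itself so that libcurl
// builds a multipart/form-data body, and set a User-Agent explicitly.
$userAgent = 'PHP/' . PHP_VERSION . ' libcurl/' . curl_version()['version'];
$ch = curl_init('http://validator.w3.org/nu/'); // the URL has to be set on the handle
curl_setopt_array($ch, [
    CURLOPT_POST           => true,
    CURLOPT_POSTFIELDS     => $fields,
    CURLOPT_USERAGENT      => $userAgent,
    CURLOPT_RETURNTRANSFER => true,
]);
// curl_exec($ch) would send the request; it is omitted so the sketch runs offline.
```

Note that the second handle in the question is created with curl_init() and never gets a CURLOPT_URL, so that request has no target regardless of the body format.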
...fix those two things, and add some simple error-message parsing:
<?php
$ch=curl_init();
$html=<<<'HTML'
<!DOCTYPE html>
<html lang="">
<head>
<title>Test</title>
</head><ERR&OR
<body>
<p></p>
</body>
</html>
HTML;
curl_setopt_array($ch,array(
CURLOPT_URL=>'http://validator.w3.org/nu/',
CURLOPT_ENCODING=>'',
CURLOPT_USERAGENT=>'PHP/'.PHP_VERSION.' libcurl/'.(curl_version()['version']),
CURLOPT_POST=>1,
CURLOPT_POSTFIELDS=>array(
'showsource'=>'yes',
'content'=>$html
),
CURLOPT_RETURNTRANSFER=>1,
));
$html=curl_exec($ch);
curl_close($ch);
$parsed=array();
$domd=new DOMDocument();
@$domd->loadHTML($html); // loadHTML() is an instance method; calling it statically fails on PHP 8
$xp=new DOMXPath($domd);
$res=$domd->getElementById("results");
foreach($xp->query("//*[@class='error']",$res) as $message){
$parsed['errors'][]=trim($message->textContent);
}
var_dump($html);
var_dump($parsed);
This prints:
array(1) {
["errors"]=>
array(4) {
[0]=>
string(156) "Error: Saw < when expecting an attribute name. Probable cause: Missing > immediately before.At line 6, column 1</head><ERR&ORâ©<body>â©<p></p>â©"
[1]=>
string(254) "Error: Element err&or not allowed as child of element body in this context. (Suppressing further errors from this subtree.)From line 5, column 8; to line 6, column 6e>â©</head><ERR&ORâ©<body>â©<p></Content model for element body:Flow content."
[2]=>
string(144) "Error: End tag for body seen, but there were unclosed elements.From line 8, column 1; to line 8, column 7>â©<p></p>â©</body>â©</htm"
[3]=>
string(118) "Error: Unclosed element err&or.From line 5, column 8; to line 6, column 6e>â©</head><ERR&ORâ©<body>â©<p></"
}
}
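...and the Unicode problem comes from DOMDocument's default charset being... I don't know, something other than UTF-8. As far as I know there is no good way to set a default charset for DOMDocument, but you can work around it like this: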
$domd=new DOMDocument();
@$domd->loadHTML('<?xml encoding="UTF-8">'.$html);
This makes it print:
array(1) {
["errors"]=>
array(4) {
[0]=>
string(147) "Error: Saw < when expecting an attribute name. Probable cause: Missing > immediately before.At line 6, column 1</head><ERR&OR↩<body>↩<p></p>↩"
[1]=>
string(245) "Error: Element err&or not allowed as child of element body in this context. (Suppressing further errors from this subtree.)From line 5, column 8; to line 6, column 6e>↩</head><ERR&OR↩<body>↩<p></Content model for element body:Flow content."
[2]=>
string(135) "Error: End tag for body seen, but there were unclosed elements.From line 8, column 1; to line 8, column 7>↩<p></p>↩</body>↩</htm"
[3]=>
string(109) "Error: Unclosed element err&or.From line 5, column 8; to line 6, column 6e>↩</head><ERR&OR↩<body>↩<p></"
}
}
...which is better, but it still contains the arrows that the web page uses to mark line breaks. Those can be removed with:
foreach($xp->query("//*[@class='lf']") as $remove){
$remove->parentNode->removeChild($remove);
}
This makes it print:
array(1) {
["errors"]=>
array(4) {
[0]=>
string(138) "Error: Saw < when expecting an attribute name. Probable cause: Missing > immediately before.At line 6, column 1</head><ERR&OR<body><p></p>"
[1]=>
string(236) "Error: Element err&or not allowed as child of element body in this context. (Suppressing further errors from this subtree.)From line 5, column 8; to line 6, column 6e></head><ERR&OR<body><p></Content model for element body:Flow content."
[2]=>
string(126) "Error: End tag for body seen, but there were unclosed elements.From line 8, column 1; to line 8, column 7><p></p></body></htm"
[3]=>
string(100) "Error: Unclosed element err&or.From line 5, column 8; to line 6, column 6e></head><ERR&OR<body><p></"
}
}
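The parsing steps above can be tried without contacting the validator. The sketch below runs them on a small hand-written fragment; the markup is made up and only imitates the structure the answer relies on (messages with class "error", line-break markers with class "lf"), so treat it as an assumption about the real page rather than a copy of it.

```php
<?php
// Made-up markup imitating the validator's result page.
$html = '<div id="results">'
      . '<p class="error">Error: first line<span class="lf">LF</span>second line</p>'
      . '</div>';

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($html);
$xp = new DOMXPath($doc);

// Drop the marker elements first so they do not leak into textContent.
foreach ($xp->query("//*[@class='lf']") as $marker) {
    $marker->parentNode->removeChild($marker);
}

// Collect the text of every element marked as an error.
$errors = [];
foreach ($xp->query("//*[@class='error']") as $message) {
    $errors[] = trim($message->textContent);
}
var_dump($errors);
```

The markers have to be removed before textContent is read, because textContent concatenates the text of all descendants, including the marker elements.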