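php - Scraping data from a website and getting its HTML as plain text
Problem description
Please check the code below. I am trying to scrape a website through a proxy, and that part now works. The problem is that the data printed by print_r
is in an unreadable format. I need it as "normal" HTML source code. How can I do that?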
<?php
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'https://www.amazon.com');
curl_setopt($ch, CURLOPT_HEADER, 1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_HTTPPROXYTUNNEL, 1);
curl_setopt($ch, CURLOPT_PROXY, '142.234.203.59:12345');
curl_setopt($ch, CURLOPT_PROXYUSERPWD, 'haris20202:veryfastplease123');
$data = curl_exec($ch);
curl_close($ch);
print_r($data);
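Solution
With a slightly more full-featured curl function, the response to the request above looks fine, but it contains a Robot Check. Verbose curl output for the request: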
* Rebuilt URL to: https://www.amazon.com/
* Trying 142.234.203.59...
* TCP_NODELAY set
* Connected to 142.234.203.59 (142.234.203.59) port 12345 (#0)
* allocate connect buffer!
* Establish HTTP proxy tunnel to www.amazon.com:443
* Proxy auth using Basic with user 'haris20202'
> CONNECT www.amazon.com:443 HTTP/1.1
Host: www.amazon.com:443
Proxy-Authorization: Basic aGFyaXMyMDIwMjp2ZXJ5ZmFzdHBsZWFzZTEyMw==
User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36
Proxy-Connection: Keep-Alive
< HTTP/1.1 200 Connection established
<
* Proxy replied 200 to CONNECT request
* CONNECT phase completed!
* ALPN, offering http/1.1
* successfully set certificate verify locations:
CAfile: c:/wwwroot/cacert.pem
CApath: none
* CONNECT phase completed!
* CONNECT phase completed!
* SSL connection using TLSv1.2 / ECDHE-RSA-AES128-GCM-SHA256
* ALPN, server accepted to use http/1.1
* Server certificate:
* subject: C=US; ST=Washington; L=Seattle; O=Amazon.com, Inc.; CN=www.amazon.com
* start date: Sep 18 00:00:00 2019 GMT
* expire date: Aug 23 12:00:00 2020 GMT
* subjectAltName: host "www.amazon.com" matched cert's "www.amazon.com"
* issuer: C=US; O=DigiCert Inc; CN=DigiCert Global CA G2
* SSL certificate verify ok.
> GET / HTTP/1.1
Host: www.amazon.com
User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36
Accept: */*
Accept-Encoding: deflate, gzip
< HTTP/1.1 200 OK
< Content-Type: text/html
< Content-Length: 2097
< Connection: keep-alive
< Server: Server
< Date: Tue, 26 Nov 2019 10:14:10 GMT
< Vary: Content-Type,Cookie,Referer,Accept-Encoding,X-Amzn-CDN-Cache,X-Amzn-AX-Treatment,User-Agent
< Content-Encoding: gzip
< x-amz-rid: DTAY61T1CN3HGSADJG16
< Edge-Control: no-store
< X-Cache: Miss from cloudfront
< Via: 1.1 274469ea4a9ada6e05630e17982ca5de.cloudfront.net (CloudFront)
< X-Amz-Cf-Pop: PHL50
< X-Amz-Cf-Id: R3hAZb_0qdQYB25p3WwZ5D-wK_1ujzleVSOS7EZo_zsTyMx9oYU6CA==
<
* Connection #0 to host 142.234.203.59 left intact
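The log above shows a gzip-encoded body (`Content-Encoding: gzip`), and the question's code also sets `CURLOPT_HEADER`, which prepends the response headers to the returned string. A minimal sketch of one way to get readable source, assuming the unreadable output is either compressed bytes or HTML being rendered by the browser: let curl decode the body with `CURLOPT_ENCODING`, leave headers out, and escape the markup before printing. The helper names `fetch_html`, `to_source_text`, and `looks_like_robot_check` are hypothetical and not from the original answer; the proxy values in the usage comment are placeholders.

```php
<?php
// Sketch only: all three function names below are hypothetical helpers.

/**
 * Fetch a URL, optionally through an HTTP proxy tunnel, and return the decoded body.
 */
function fetch_html(string $url, ?string $proxy = null, ?string $proxyAuth = null): string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,   // return the body instead of printing it
        CURLOPT_HEADER         => false,  // keep response headers out of the returned string
        CURLOPT_ENCODING       => '',     // accept and transparently decode gzip/deflate
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_TIMEOUT        => 30,
    ]);

    if ($proxy !== null) {
        curl_setopt($ch, CURLOPT_HTTPPROXYTUNNEL, true);
        curl_setopt($ch, CURLOPT_PROXY, $proxy);
        if ($proxyAuth !== null) {
            curl_setopt($ch, CURLOPT_PROXYUSERPWD, $proxyAuth);
        }
    }

    $body = curl_exec($ch);
    if ($body === false) {
        throw new RuntimeException('curl failed: ' . curl_error($ch));
    }
    return $body; // the handle is released when $ch goes out of scope
}

/**
 * Escape markup so a browser shows the HTML source instead of rendering it.
 */
function to_source_text(string $html): string
{
    return '<pre>' . htmlspecialchars($html, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8') . '</pre>';
}

/**
 * Rough check for the "Robot Check" page mentioned in the answer.
 */
function looks_like_robot_check(string $html): bool
{
    return stripos($html, 'Robot Check') !== false;
}

// Usage (placeholder proxy values):
// $html = fetch_html('https://www.amazon.com', 'proxy-host:port', 'user:password');
// echo looks_like_robot_check($html) ? 'Blocked by Robot Check' : to_source_text($html);
```

When the script runs from the command line rather than in a browser, the escaping step is unnecessary and the decoded string can be printed directly.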
Amazon has an API; have you considered using it? See "Amazon for Developers".