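php - PHP curl: log in once, then scrape the site with multi_curl
Problem description
I need to scrape some data from a website that requires logging in first, and I did manage to log in using curl. Here is my login code: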
$login = 'https://example.com/login';
$ch = curl_init($login);
curl_setopt_array($ch, [
CURLOPT_RETURNTRANSFER => true,
CURLOPT_COOKIEJAR => COOKIE_FILE,
CURLOPT_COOKIEFILE => COOKIE_FILE
]);
$response = curl_exec($ch);
$re = '/<input type="hidden" name="csrf" value="(.*?)" \/>/m';
preg_match_all($re, $response, $matches, PREG_SET_ORDER, 0);
$arr = array(
'email' => 'email@example.com',
'password' => 'Password123',
'csrf' => $matches[0][1]
);
curl_setopt_array($ch, [
CURLOPT_URL => $login,
CURLOPT_USERAGENT => 'Mozilla/5.0',
CURLOPT_POST => true,
CURLOPT_POSTFIELDS => http_build_query($arr),
CURLOPT_COOKIEJAR => COOKIE_FILE,
CURLOPT_COOKIEFILE => COOKIE_FILE,
CURLOPT_FOLLOWLOCATION => true
]);
curl_exec($ch);
Now, after logging in, I have to scrape 70-100 pages. I managed to do that with a foreach loop, but it takes forever. Here is my code:
$arr = [
[
'id' => '1',
'csrf' => $matches[0][1] //same csrf as in login
],[
'id' => '2',
'csrf' => $matches[0][1] //same csrf as in login
],[
...
],[
'id' => '100',
'csrf' => $matches[0][1] //same csrf as in login
]
];
foreach($arr as $v){
curl_setopt_array($ch,[
CURLOPT_URL => 'https://example.com/submit',
CURLOPT_USERAGENT => 'Mozilla/5.0',
CURLOPT_POST => true,
CURLOPT_POSTFIELDS => http_build_query($v),
CURLOPT_FOLLOWLOCATION => true
]);
$return = curl_exec($ch);
$info = curl_getinfo($ch);
//do something with the returned data
}
However, if I try to use multi_curl, I can't stay logged in and I get a 405 http_code.
Is there a way to log in with curl and then scrape with multi? Thanks!
Edit: this is the code I use for multi_curl (found on Stack Overflow):
function multiRequest($data, $options = array()) {
// array of curl handles
$curly = array();
// data to be returned
$result = array();
// multi handle
$mh = curl_multi_init();
// loop through $data and create curl handles
// then add them to the multi-handle
foreach ($data as $id => $d) {
$curly[$id] = curl_init();
$url = (is_array($d) && !empty($d['url'])) ? $d['url'] : $d;
curl_setopt($curly[$id], CURLOPT_URL, $url);
curl_setopt($curly[$id], CURLOPT_USERAGENT, 'Mozilla/5.0');
curl_setopt($curly[$id], CURLOPT_RETURNTRANSFER, 1);
// post?
if (is_array($d)) {
if (!empty($d['post'])) {
curl_setopt($curly[$id], CURLOPT_POST, true);
curl_setopt($curly[$id], CURLOPT_POSTFIELDS, http_build_query($d['post']));
curl_setopt($curly[$id], CURLOPT_FOLLOWLOCATION, true);
}
}
// extra options?
if (!empty($options)) {
curl_setopt_array($curly[$id], $options);
}
curl_multi_add_handle($mh, $curly[$id]);
}
// execute the handles
$running = null;
do {
curl_multi_exec($mh, $running);
} while($running > 0);
// get content and remove handles
foreach($curly as $id => $c) {
$result[$id] = curl_multi_getcontent($c);
curl_multi_remove_handle($mh, $c);
}
// all done
curl_multi_close($mh);
return $result;
}
Solution
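One possible cause, going only by the code shown: every handle that multiRequest creates with curl_init() starts with an empty cookie engine, so unless cookie options are passed in through $options, the session from the login handle is never sent. In addition, libcurl only writes the CURLOPT_COOKIEJAR file when the login handle is cleaned up, so the file may still be empty while $ch is open. The sketch below avoids the file entirely by copying the cookies straight from the login handle onto each new handle before running them in parallel. The function names buildLoggedInHandles and runHandles are hypothetical, and it assumes the site accepts the login session cookie and the same csrf token on /submit, which the question does not confirm.

```php
<?php
// Hypothetical helper: creates one POST handle per payload and copies the
// cookies held by the already logged-in handle onto each of them.
function buildLoggedInHandles($loginHandle, string $url, array $payloads): array
{
    // All cookies currently known to the login handle, one
    // Netscape-format line per cookie.
    $cookies = curl_getinfo($loginHandle, CURLINFO_COOKIELIST);

    $handles = [];
    foreach ($payloads as $id => $post) {
        $h = curl_init($url);
        curl_setopt_array($h, [
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_USERAGENT      => 'Mozilla/5.0',
            CURLOPT_POST           => true,
            CURLOPT_POSTFIELDS     => http_build_query($post),
            CURLOPT_FOLLOWLOCATION => true,
        ]);
        // Adding a cookie line also switches on the cookie engine
        // for this handle.
        foreach ($cookies as $line) {
            curl_setopt($h, CURLOPT_COOKIELIST, $line);
        }
        $handles[$id] = $h;
    }
    return $handles;
}

// Hypothetical helper: runs the prepared handles concurrently and returns
// the status code and body of each response, keyed like the input.
function runHandles(array $handles): array
{
    $mh = curl_multi_init();
    foreach ($handles as $h) {
        curl_multi_add_handle($mh, $h);
    }

    do {
        $status = curl_multi_exec($mh, $running);
        if ($running > 0) {
            // Wait for socket activity instead of spinning the CPU.
            curl_multi_select($mh, 1.0);
        }
    } while ($running > 0 && $status === CURLM_OK);

    $result = [];
    foreach ($handles as $id => $h) {
        $result[$id] = [
            'http_code' => curl_getinfo($h, CURLINFO_HTTP_CODE),
            'body'      => curl_multi_getcontent($h),
        ];
        curl_multi_remove_handle($mh, $h);
    }
    curl_multi_close($mh);
    return $result;
}
```

With the login code from the question, this would be called after the login POST as runHandles(buildLoggedInHandles($ch, 'https://example.com/submit', $arr)), where $arr is the list of id/csrf payloads. If the 405 persists with the cookies in place, the cause is probably elsewhere, for example the server rejecting a reused csrf token or rate-limiting parallel requests.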