php - Nodes randomly missing from the HTML when scraping with GuzzleClient
Problem
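I am dealing with a scraping problem caused by inconsistent child elements: some of them are present on one page and missing on another.
Because I save the values by position in a $values[] array, $values[18] is sometimes the email address and sometimes the phone or fax number.
A sample array from three iterations looks like this: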
[0] => [
[1] => Firm: The Firm One Name
[2] => Firm:
[3] => The Firm One Name
[4] => Office: 5th Av. 18980, NY
[5] => Office:
[6] => 5th Av. 18980, NY
[7] => City: New York
[8] => City:
[9] => New York
[10] => Country: USA
[11] => Country:
[12] => USA
[13] => Tel: +123 4 567 890
[14] => Tel:
[15] => +123 4 567 890
[16] => Email: person.one@example.com
[17] => Email:
[18] => person.one@example.com
],
[1] => [
[1] => Firm: The Firm Two Name
[2] => Firm:
[3] => The Firm Two Name
[4] => Office: 5th Av. 342680, NY
[5] => Office:
[6] => 5th Av. 342680, NY
[7] => City: New York
[8] => City:
[9] => New York
[10] => Country: USA
[11] => Country:
[12] => USA
[13] => Tel: +123 4 567 890
[14] => Tel:
[15] => +123 4 567 890
[16] => Fax: +123 4 567 891
[17] => Fax:
[18] => +123 4 567 891
[19] => Email: person.two@example.com
[20] => Email:
[21] => person.two@example.com
],
[2] => [
[1] => Firm: The Firm Three Name
[2] => Firm:
[3] => The Firm Three Name
[4] => Office: 5th Av. 89280, NY
[5] => Office:
[6] => 5th Av. 89280, NY
[7] => Country: USA
[8] => Country:
[9] => USA
[10] => Fax: +123 4 567 899
[11] => Fax:
[12] => +123 4 567 899
[13] => Email: person.three@example.com
[14] => Email:
[15] => person.three@example.com
]
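You may notice that when I iterate and save $values[15] from the last array (the email address), [0][15] in the first array corresponds to the Tel. number.
My question is: is there a simpler way than doing "crazy loops" over the fields to make sure the email is always saved as the email and not as a phone number?
I am using GuzzleClient() with $node->filterXPath() and/or $node->filter(), depending on what I have to grab.
The HTML structure I am working with is quite short, as in the example below, and sometimes nodes are missing: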
<div id="profiledtails">
<div class="abc-g">
<div class="abc-gf">
<div class="abc-u first">Firm:</div>
<div class="abc-u">
<a href="http://example.com/123456/" title="More information here" class="Item" abc-tracker="office" abc-tracking="true">Person One</a>
</div>
</div>
<div class="abc-gf">
<div class="abc-u first">Office:</div>
<div class="abc-u">
<address>
5th Av.<br>18980,<br>NY
</address>
</div>
</div>
<div class="abc-gf">
<div class="abc-u first">City:</div>
<div class="abc-u">New York</div>
</div>
<div class="abc-gf">
<div class="abc-u first">Country:</div>
<div class="abc-u">USA</div>
</div>
<div class="abc-gf">
<div class="abc-u first">Tel:</div>
<div class="abc-u">+123 4 567 890</div>
</div>
<div class="abc-gf">
<div class="abc-u first">Fax:</div>
<div class="abc-u">+123 4 567 891</div>
</div>
<div class="abc-gf">
<div class="abc-u first">Email:</div>
<div class="abc-u">
<a href="mailto:mperson.one@example.com">person.one@example.com</a></div>
</div>
</div>
Solution
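I have dealt with the same situation before, and the approach that worked for me was a regular expression. The HTML elements change from page to page, so you cannot track the values by position; matching on the label text keeps each value tied to its field. Here is your fix: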
// Match the Email label, then capture the link text; \s* tolerates any whitespace between the tags
$re = '/<div class="abc-u first">Email:<\/div>\s*<div class="abc-u">\s*<a href="mailto:[^"]*">(.*?)<\/a>/';
$str = '<div id="profiledtails">
<div class="abc-g">
<div class="abc-gf">
<div class="abc-u first">Firm:</div>
<div class="abc-u">
<a href="http://example.com/123456/" title="More information here" class="Item" abc-tracker="office" abc-tracking="true">Person One</a>
</div>
</div>
<div class="abc-gf">
<div class="abc-u first">Office:</div>
<div class="abc-u">
<address>
5th Av.<br>18980,<br>NY
</address>
</div>
</div>
<div class="abc-gf">
<div class="abc-u first">City:</div>
<div class="abc-u">New York</div>
</div>
<div class="abc-gf">
<div class="abc-u first">Country:</div>
<div class="abc-u">USA</div>
</div>
<div class="abc-gf">
<div class="abc-u first">Tel:</div>
<div class="abc-u">+123 4 567 890</div>
</div>
<div class="abc-gf">
<div class="abc-u first">Fax:</div>
<div class="abc-u">+123 4 567 891</div>
</div>
<div class="abc-gf">
<div class="abc-u first">Email:</div>
<div class="abc-u">
<a href="mailto:mperson.one@example.com">person.one@example.com</a></div>
</div>
</div>';
preg_match($re, $str, $matches, PREG_OFFSET_CAPTURE, 0);
// Print the entire match result
var_dump($matches);
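In the same way, you have to prepare regular expressions for the other values. The code above looks messy, but you can remove the whitespace from the string and the regular expression to clean it up.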
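Rather than writing one pattern per field, a single pattern can capture every label together with its value, so each value is stored under its label instead of a numeric position. The following is only a minimal sketch, not part of the original answer: the function name extractProfileFields is hypothetical, and it assumes that every row has the abc-u first / abc-u shape shown in the question and that a value cell never contains a nested div.

```php
<?php
// Hypothetical helper (not from the original answer): collects every
// "Label:" / value pair of the profile block into an associative array.
function extractProfileFields(string $html): array
{
    // Group 1 is the label without its colon, group 2 the raw value markup.
    // Assumption: the value cell never contains a nested <div>.
    $re = '/<div class="abc-u first">\s*([^<:]+):\s*<\/div>\s*<div class="abc-u">(.*?)<\/div>/s';

    preg_match_all($re, $html, $matches, PREG_SET_ORDER);

    $fields = [];
    foreach ($matches as $m) {
        // Turn <br> into spaces, drop the remaining tags, collapse whitespace.
        $text = preg_replace('/<br\s*\/?>/i', ' ', $m[2]);
        $text = html_entity_decode(strip_tags($text), ENT_QUOTES, 'UTF-8');
        $fields[trim($m[1])] = trim(preg_replace('/\s+/', ' ', $text));
    }

    return $fields;
}

// Trimmed sample in the same shape as the markup above, with no Tel or Fax row.
$sample = '<div class="abc-gf">
<div class="abc-u first">Office:</div>
<div class="abc-u">
<address>
5th Av.<br>18980,<br>NY
</address>
</div>
</div>
<div class="abc-gf">
<div class="abc-u first">Email:</div>
<div class="abc-u">
<a href="mailto:person.three@example.com">person.three@example.com</a></div>
</div>';

$fields = extractProfileFields($sample);
echo $fields['Email'] ?? '(no email)', PHP_EOL;
echo $fields['Tel'] ?? '(no tel)', PHP_EOL;
```

With this shape a missing Tel or Fax row simply leaves that key unset, so $fields['Email'] refers to the email address no matter how many rows come before it.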