web-scraping - Web scraping - data extraction - Web Scraper Google Chrome extension
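Problem description
Good afternoon,
I am trying to extract all products (name, price, image) from a grocery store.
I am using Web Scraper (the Google Chrome extension). When I start scraping, I can see that it is running, but it returns no data. When I click Data preview, I can see the data, yet I keep getting the message that no data was scraped.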
This is the sitemap I created: {"_id":"collectandgo","startUrl":["https://colruyt.collectandgo.be/cogo/nl/home"],"selectors":[{"id":"categories","type":"SelectorLink","parentSelectors":["_root"],"selector":"div#arbo.nav__branch.branch","multiple":true,"delay":0},{"id":"items","type":"SelectorElement","parentSelectors":["categories"],"selector":"div.product__inner","multiple":true,"delay":0},{"id":"productbody","type":"SelectorElement","parentSelectors":["items"],"selector":"div.product__body","multiple":true,"delay":0},{"id":"image","type":"SelectorImage","parentSelectors":["productbody"],"selector":"a.product__image","multiple":false,"delay":0},{"id":"productname","type":"SelectorText","parentSelectors":["productbody"],"selector":"div.product__name","multiple":false,"regex":"","delay":0},{"id":"productdescription","type":"SelectorText","parentSelectors":["productbody"],"selector":"div.product__description","multiple":false,"regex":"","delay":0},{"id":"productweight","type":"SelectorText","parentSelectors":["productbody"],"selector":"div.product__weight","multiple":false,"regex":"","delay":0},{"id":"prijs","type":"SelectorText","parentSelectors":["productbody"],"selector":"div.product__price-piece","multiple":false,"regex":"","delay":0},{"id":"eenheidsprijs","type":"SelectorText","parentSelectors":["productbody"],"selector":"div.product__price-unit","multiple":false,"regex":"","delay":0},{"id":"korting-aankoop-hoeveelheid","type":"SelectorText","parentSelectors":["productbody"],"selector":"a.promotion__min-amount","multiple":false,"regex":"","delay":0}]}
Solution
I copied your JSON and validated it, then saved it to a file, stack.json, and loaded that file into a BaseX database, foo, after setting the parser to JSON, as follows:
thufir@dur:~/json$
thufir@dur:~/json$ basex
BaseX 9.0.1 [Standalone]
Try 'help' to get more information.
>
> list
Name Resources Size Input Path
-------------------------------------------------------------------------------
com.w3schools.books 1 6290 https://www.w3schools.com/xml/books.xml
twitter 75 457900
w3school_data 1 5209 https://www.w3schools.com/xml/note.xml
3 database(s).
>
> create database foo
Database 'foo' created in 138.51 ms.
>
> set parser json
PARSER: json
>
> add stack.json
Resource(s) added in 74.72 ms.
>
> list
Name Resources Size Input Path
-------------------------------------------------------------------------------
com.w3schools.books 1 6290 https://www.w3schools.com/xml/books.xml
foo 1 5600
twitter 75 457900
w3school_data 1 5209 https://www.w3schools.com/xml/note.xml
4 database(s).
>
> open foo
Database 'foo' was opened in 0.04 ms.
>
> xquery /
<json type="object">
<__id>collectandgo</__id>
<startUrl type="array">
<_>https://colruyt.collectandgo.be/cogo/nl/home</_>
</startUrl>
<selectors type="array">
<_ type="object">
<id>categories</id>
<type>SelectorLink</type>
<parentSelectors type="array">
<_>_root</_>
</parentSelectors>
<selector>div#arbo.nav__branch.branch</selector>
<multiple type="boolean">true</multiple>
<delay type="number">0</delay>
</_>
<_ type="object">
<id>items</id>
<type>SelectorElement</type>
<parentSelectors type="array">
<_>categories</_>
</parentSelectors>
<selector>div.product__inner</selector>
<multiple type="boolean">true</multiple>
<delay type="number">0</delay>
</_>
<_ type="object">
<id>productbody</id>
<type>SelectorElement</type>
<parentSelectors type="array">
<_>items</_>
</parentSelectors>
<selector>div.product__body</selector>
<multiple type="boolean">true</multiple>
<delay type="number">0</delay>
</_>
<_ type="object">
<id>image</id>
<type>SelectorImage</type>
<parentSelectors type="array">
<_>productbody</_>
</parentSelectors>
<selector>a.product__image</selector>
<multiple type="boolean">false</multiple>
<delay type="number">0</delay>
</_>
<_ type="object">
<id>productname</id>
<type>SelectorText</type>
<parentSelectors type="array">
<_>productbody</_>
</parentSelectors>
<selector>div.product__name</selector>
<multiple type="boolean">false</multiple>
<regex/>
<delay type="number">0</delay>
</_>
<_ type="object">
<id>productdescription</id>
<type>SelectorText</type>
<parentSelectors type="array">
<_>productbody</_>
</parentSelectors>
<selector>div.product__description</selector>
<multiple type="boolean">false</multiple>
<regex/>
<delay type="number">0</delay>
</_>
<_ type="object">
<id>productweight</id>
<type>SelectorText</type>
<parentSelectors type="array">
<_>productbody</_>
</parentSelectors>
<selector>div.product__weight</selector>
<multiple type="boolean">false</multiple>
<regex/>
<delay type="number">0</delay>
</_>
<_ type="object">
<id>prijs</id>
<type>SelectorText</type>
<parentSelectors type="array">
<_>productbody</_>
</parentSelectors>
<selector>div.product__price-piece</selector>
<multiple type="boolean">false</multiple>
<regex/>
<delay type="number">0</delay>
</_>
<_ type="object">
<id>eenheidsprijs</id>
<type>SelectorText</type>
<parentSelectors type="array">
<_>productbody</_>
</parentSelectors>
<selector>div.product__price-unit</selector>
<multiple type="boolean">false</multiple>
<regex/>
<delay type="number">0</delay>
</_>
<_ type="object">
<id>korting-aankoop-hoeveelheid</id>
<type>SelectorText</type>
<parentSelectors type="array">
<_>productbody</_>
</parentSelectors>
<selector>a.promotion__min-amount</selector>
<multiple type="boolean">false</multiple>
<regex/>
<delay type="number">0</delay>
</_>
</selectors>
</json>
Query executed in 270.99 ms.
>
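The validation step mentioned at the start of this answer can also be approximated without BaseX. The following is a minimal Python sketch, not what was run above: it parses the sitemap and checks that every entry in `parentSelectors` names either `_root` or an existing selector `id`. The function name `check_sitemap` is hypothetical, and the embedded sitemap is an abbreviated copy (three of the ten selectors) of the one in the question.

```python
import json

# Abbreviated copy of the sitemap from the question (three of the ten selectors).
SITEMAP_TEXT = """
{"_id": "collectandgo",
 "startUrl": ["https://colruyt.collectandgo.be/cogo/nl/home"],
 "selectors": [
  {"id": "categories", "type": "SelectorLink", "parentSelectors": ["_root"],
   "selector": "div#arbo.nav__branch.branch", "multiple": true, "delay": 0},
  {"id": "items", "type": "SelectorElement", "parentSelectors": ["categories"],
   "selector": "div.product__inner", "multiple": true, "delay": 0},
  {"id": "productbody", "type": "SelectorElement", "parentSelectors": ["items"],
   "selector": "div.product__body", "multiple": true, "delay": 0}
 ]}
"""


def check_sitemap(text):
    """Return a list of problems found; an empty list means the checks passed.

    Hypothetical helper: it only checks JSON syntax and parent references,
    not whether the CSS selectors match anything on the live site.
    """
    try:
        sitemap = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]

    selectors = sitemap.get("selectors", [])
    # "_root" is the implicit top-level parent used by the sitemap.
    known_ids = {"_root"} | {sel.get("id") for sel in selectors}

    problems = []
    for sel in selectors:
        for parent in sel.get("parentSelectors", []):
            if parent not in known_ids:
                problems.append(f"{sel.get('id')}: unknown parent {parent!r}")
    return problems


print(check_sitemap(SITEMAP_TEXT))
```

A check like this only confirms that the sitemap is well-formed. It says nothing about why the extension reports that no data was scraped.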
What query do you want to run on the data?
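As one possible example of such a query, the sketch below prints the selector hierarchy starting from `_root`, which makes it easy to see how `categories`, `items`, and `productbody` nest. It is written in Python rather than XQuery and is only an illustration: the function name `outline` is hypothetical, the embedded sitemap is abbreviated (five selectors, with the `multiple`, `regex`, and `delay` keys omitted), and it assumes the parent references contain no cycles.

```python
import json
from collections import defaultdict

# Abbreviated copy of the sitemap from the question; keys not needed
# for the outline (multiple, regex, delay) are omitted here.
SITEMAP_TEXT = """
{"_id": "collectandgo",
 "startUrl": ["https://colruyt.collectandgo.be/cogo/nl/home"],
 "selectors": [
  {"id": "categories", "type": "SelectorLink", "parentSelectors": ["_root"],
   "selector": "div#arbo.nav__branch.branch"},
  {"id": "items", "type": "SelectorElement", "parentSelectors": ["categories"],
   "selector": "div.product__inner"},
  {"id": "productbody", "type": "SelectorElement", "parentSelectors": ["items"],
   "selector": "div.product__body"},
  {"id": "productname", "type": "SelectorText", "parentSelectors": ["productbody"],
   "selector": "div.product__name"},
  {"id": "prijs", "type": "SelectorText", "parentSelectors": ["productbody"],
   "selector": "div.product__price-piece"}
 ]}
"""


def outline(sitemap):
    """Return indented lines describing the selector tree below _root.

    Hypothetical helper; assumes parentSelectors never form a cycle.
    """
    # Group selectors under each parent id they list.
    children = defaultdict(list)
    for sel in sitemap["selectors"]:
        for parent in sel["parentSelectors"]:
            children[parent].append(sel)

    lines = []

    def walk(parent, depth):
        # Depth-first traversal, indenting two spaces per level.
        for sel in children.get(parent, []):
            lines.append(f"{'  ' * depth}{sel['id']} ({sel['type']}): {sel['selector']}")
            walk(sel["id"], depth + 1)

    walk("_root", 0)
    return lines


for line in outline(json.loads(SITEMAP_TEXT)):
    print(line)
```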
You might want to look at Selenium or other tools for scraping the data. Both Selenium and BaseX (with XQuery) provide Java APIs.