Webscraping - data extraction - Web Scraper Google Chrome extension

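Problem description

Good afternoon,

I am trying to extract all products (name, price, image) from a grocery store.

I am using Web Scraper (the Google Chrome extension). When I start scraping, I can see it running, but it does not return any data. When I click on the data preview, I can see the data. Yet I keep getting the message that no data has been scraped.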

This is the sitemap I created: {"_id":"collectandgo","startUrl":["https://colruyt.collectandgo.be/cogo/nl/home"],"selectors":[{"id":"categories","type":"SelectorLink","parentSelectors":["_root"],"selector":"div#arbo.nav__branch.branch","multiple":true,"delay":0},{"id":"items","type":"SelectorElement","parentSelectors":["categories"],"selector":"div.product__inner","multiple":true,"delay":0},{"id":"productbody","type":"SelectorElement","parentSelectors":["items"],"selector":"div.product__body","multiple":true,"delay":0},{"id":"image","type":"SelectorImage","parentSelectors":["productbody"],"selector":"a.product__image","multiple":false,"delay":0},{"id":"productname","type":"SelectorText","parentSelectors":["productbody"],"selector":"div.product__name","multiple":false,"regex":"","delay":0},{"id":"productdescription","type":"SelectorText","parentSelectors":["productbody"],"selector":"div.product__description","multiple":false,"regex":"","delay":0},{"id":"productweight","type":"SelectorText","parentSelectors":["productbody"],"selector":"div.product__weight","multiple":false,"regex":"","delay":0},{"id":"prijs","type":"SelectorText","parentSelectors":["productbody"],"selector":"div.product__price-piece","multiple":false,"regex":"","delay":0},{"id":"eenheidsprijs","type":"SelectorText","parentSelectors":["productbody"],"selector":"div.product__price-unit","multiple":false,"regex":"","delay":0},{"id":"korting-aankoop-hoeveelheid","type":"SelectorText","parentSelectors":["productbody"],"selector":"a.promotion__min-amount","multiple":false,"regex":"","delay":0}]}

Tags: web-scraping, google-chrome-extension, screen-scraping, data-extraction

Solution


I copied your JSON, validated it, and saved it to a file, stack.json. After setting the parser to JSON, I loaded that file into a BaseX database named foo, as follows:

thufir@dur:~/json$ 
thufir@dur:~/json$ basex
BaseX 9.0.1 [Standalone]
Try 'help' to get more information.
> 
> list
Name                 Resources  Size    Input Path                               
-------------------------------------------------------------------------------
com.w3schools.books  1          6290    https://www.w3schools.com/xml/books.xml  
twitter              75         457900                                           
w3school_data        1          5209    https://www.w3schools.com/xml/note.xml   

3 database(s).
> 
> create database foo
Database 'foo' created in 138.51 ms.
> 
> set parser json
PARSER: json
> 
> add stack.json
Resource(s) added in 74.72 ms.
> 
> list
Name                 Resources  Size    Input Path                               
-------------------------------------------------------------------------------
com.w3schools.books  1          6290    https://www.w3schools.com/xml/books.xml  
foo                  1          5600                                             
twitter              75         457900                                           
w3school_data        1          5209    https://www.w3schools.com/xml/note.xml   

4 database(s).
> 
> open foo
Database 'foo' was opened in 0.04 ms.
> 
> xquery /
<json type="object">
  <__id>collectandgo</__id>
  <startUrl type="array">
    <_>https://colruyt.collectandgo.be/cogo/nl/home</_>
  </startUrl>
  <selectors type="array">
    <_ type="object">
      <id>categories</id>
      <type>SelectorLink</type>
      <parentSelectors type="array">
        <_>_root</_>
      </parentSelectors>
      <selector>div#arbo.nav__branch.branch</selector>
      <multiple type="boolean">true</multiple>
      <delay type="number">0</delay>
    </_>
    <_ type="object">
      <id>items</id>
      <type>SelectorElement</type>
      <parentSelectors type="array">
        <_>categories</_>
      </parentSelectors>
      <selector>div.product__inner</selector>
      <multiple type="boolean">true</multiple>
      <delay type="number">0</delay>
    </_>
    <_ type="object">
      <id>productbody</id>
      <type>SelectorElement</type>
      <parentSelectors type="array">
        <_>items</_>
      </parentSelectors>
      <selector>div.product__body</selector>
      <multiple type="boolean">true</multiple>
      <delay type="number">0</delay>
    </_>
    <_ type="object">
      <id>image</id>
      <type>SelectorImage</type>
      <parentSelectors type="array">
        <_>productbody</_>
      </parentSelectors>
      <selector>a.product__image</selector>
      <multiple type="boolean">false</multiple>
      <delay type="number">0</delay>
    </_>
    <_ type="object">
      <id>productname</id>
      <type>SelectorText</type>
      <parentSelectors type="array">
        <_>productbody</_>
      </parentSelectors>
      <selector>div.product__name</selector>
      <multiple type="boolean">false</multiple>
      <regex/>
      <delay type="number">0</delay>
    </_>
    <_ type="object">
      <id>productdescription</id>
      <type>SelectorText</type>
      <parentSelectors type="array">
        <_>productbody</_>
      </parentSelectors>
      <selector>div.product__description</selector>
      <multiple type="boolean">false</multiple>
      <regex/>
      <delay type="number">0</delay>
    </_>
    <_ type="object">
      <id>productweight</id>
      <type>SelectorText</type>
      <parentSelectors type="array">
        <_>productbody</_>
      </parentSelectors>
      <selector>div.product__weight</selector>
      <multiple type="boolean">false</multiple>
      <regex/>
      <delay type="number">0</delay>
    </_>
    <_ type="object">
      <id>prijs</id>
      <type>SelectorText</type>
      <parentSelectors type="array">
        <_>productbody</_>
      </parentSelectors>
      <selector>div.product__price-piece</selector>
      <multiple type="boolean">false</multiple>
      <regex/>
      <delay type="number">0</delay>
    </_>
    <_ type="object">
      <id>eenheidsprijs</id>
      <type>SelectorText</type>
      <parentSelectors type="array">
        <_>productbody</_>
      </parentSelectors>
      <selector>div.product__price-unit</selector>
      <multiple type="boolean">false</multiple>
      <regex/>
      <delay type="number">0</delay>
    </_>
    <_ type="object">
      <id>korting-aankoop-hoeveelheid</id>
      <type>SelectorText</type>
      <parentSelectors type="array">
        <_>productbody</_>
      </parentSelectors>
      <selector>a.promotion__min-amount</selector>
      <multiple type="boolean">false</multiple>
      <regex/>
      <delay type="number">0</delay>
    </_>
  </selectors>
</json>
Query executed in 270.99 ms.
> 

What query do you want to run on the data?
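As one example of the kind of query that is possible once the sitemap parses cleanly, the sketch below lists how the selectors nest under `_root`. It uses only Python's standard `json` module rather than BaseX, and it works on an abbreviated copy of the sitemap (four of the ten selectors). The helper names `selector_tree` and `render` are made up for this illustration and are not part of Web Scraper or BaseX.

```python
import json

# Abbreviated copy of the sitemap from the question (assumption: only four
# of the ten selectors are kept so the sketch stays short).
SITEMAP = """
{"_id": "collectandgo",
 "startUrl": ["https://colruyt.collectandgo.be/cogo/nl/home"],
 "selectors": [
  {"id": "categories", "type": "SelectorLink", "parentSelectors": ["_root"],
   "selector": "div#arbo.nav__branch.branch", "multiple": true, "delay": 0},
  {"id": "items", "type": "SelectorElement", "parentSelectors": ["categories"],
   "selector": "div.product__inner", "multiple": true, "delay": 0},
  {"id": "productbody", "type": "SelectorElement", "parentSelectors": ["items"],
   "selector": "div.product__body", "multiple": true, "delay": 0},
  {"id": "productname", "type": "SelectorText", "parentSelectors": ["productbody"],
   "selector": "div.product__name", "multiple": false, "regex": "", "delay": 0}
 ]}
"""


def selector_tree(sitemap):
    """Map each parent selector id to the ids of its child selectors."""
    tree = {}
    for sel in sitemap["selectors"]:
        for parent in sel["parentSelectors"]:
            tree.setdefault(parent, []).append(sel["id"])
    return tree


def render(tree, node="_root", depth=0):
    """Return the hierarchy below `node` as indented lines."""
    lines = []
    for child in tree.get(node, []):
        lines.append("  " * depth + child)
        lines.extend(render(tree, child, depth + 1))
    return lines


# json.loads raises json.JSONDecodeError if the text is not valid JSON,
# so this line doubles as a validation step.
sitemap = json.loads(SITEMAP)
tree = selector_tree(sitemap)
print("\n".join(render(tree)))
```

With the full sitemap pasted in place of the abbreviated one, the same two helpers would show every text and image selector hanging off `productbody`.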

You might want to look at Selenium or other tools for scraping the data. Both Selenium and BaseX (which uses XQuery) provide a Java API.

