Extracting script data with Scrapy using regular expressions

Question

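I am trying to use Scrapy to extract the contents of a script tag on a store locator page, but I am a bit stuck.

In the page source, the script looks like this: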

<script>
    var map_locations = [{"col_id":"1","col_postcode":"DN18 5DE","col_latitude":"53.6825556","col_longitude":"-0.438675","col_address1":"9a Market Lane","col_name":"XX","col_website":"https:\/\/branches.XX.co.uk\/barton-upon-humber\/9a-market-lane.html?type=0&stores=DN18+5DE?utm_source=directories&utm_medium=local&utm_campaign=yext&utm_content=1444","col_facebook":"https:\/\/www.facebook.com\/XXDN185DE\/","col_city":"Barton-Upon-Humber","col_state":"North Lincolnshire","col_yextid":"1444"}...
</script>

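I copied the XPath and retrieved the script in the terminal with response.xpath('/html/body/script[1]/text()').

Now I want to parse the information in the script into separate columns and eventually load it into a CSV file.

How should I parse this information? For example, what if I want col_postcode? I have read other posts where people use regular expressions and json.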

Tags: python, json, parsing, xpath, scrapy

Solution


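The pattern \[(.*)\] matches a literal [ and ], and .* captures the zero or more characters between them. Because .* is greedy and does not cross line breaks by default, the match extends to the last ] on the same line.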

import re
import json

# response.xpath() returns a list of Selector objects; extract() returns
# their contents as a list of strings.
for script in response.xpath("/html/body/script[1]/text()").extract():

    # Raw string so that \[ and \] reach the regex engine unchanged.
    search_ = re.search(r"\[(.*)\]", script)
    # If the XPath matches several script tags, only handle those
    # that contain a bracketed array.
    if search_:
        # group(0) is the whole match including the brackets, i.e. a JSON array.
        for doc in json.loads(search_.group(0)):
            print(doc['col_postcode'])

Output

DN18 5DE
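
The question also mentions loading the result into a CSV file. A minimal sketch of that step, assuming the script text has already been extracted as a string, might look like the following. The sample string is a shortened copy of the record from the question, and the helper names parse_locations and to_csv are hypothetical, introduced only for illustration. The regex is anchored on the map_locations variable name and uses re.DOTALL so that an array spanning several lines would still match.

```python
import csv
import io
import json
import re

# Shortened stand-in for the text of the <script> tag shown in the question.
script_text = '''
    var map_locations = [{"col_id":"1","col_postcode":"DN18 5DE","col_city":"Barton-Upon-Humber"}];
'''


def parse_locations(text):
    """Return the list of dicts assigned to map_locations, or [] if not found."""
    # Greedy match from the first [ after the assignment to the last ] in the text.
    match = re.search(r"var\s+map_locations\s*=\s*(\[.*\])", text, re.DOTALL)
    if not match:
        return []
    return json.loads(match.group(1))


def to_csv(rows, fieldnames):
    """Serialize the chosen fields of each row to CSV text."""
    buffer = io.StringIO()
    # extrasaction="ignore" drops keys that are not listed in fieldnames.
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


locations = parse_locations(script_text)
csv_text = to_csv(locations, ["col_postcode", "col_city"])
print(csv_text)
```

Inside a Scrapy spider, yielding each dict as an item and letting a feed export write the CSV is a common alternative to building the file by hand.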
