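python - How can I modify the XPath in my code to retrieve the data I need
Problem description
I have a Scrapy spider that retrieves data from tags and writes it to a CSV file. I now need to modify the XPath so that it can read a JavaScript variable like the one shown below. From "var digitalData" I need the data under "product". My code is posted below as well.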
var digitalData = {
    "page" : {
        "pageInfo" : {
            "siteCode" : siteCode,
            "siteSection": "",
            "pageName" : "",
            "pageURL" : pageURL,
            "pageTrack" : ""
        },
        "pathIndicator" : {
            "depth_2" : "mobile",
            "depth_3" : "mobile",
            "depth_4" : "smartphones",
            "depth_5" : "galaxy-s9"
        }
    },
    "user" : {
        "loginStatus" : ""
    },
    "product" : {
        "category" : "",
        "model_code" : "SM-G960FZPDBTU",
        "model_name" : "SM-G960F/DS",
        "displayName" : "Galaxy S9 Hybrid Sim 64GB",
        "pvi_type_code" : "",
        "pvi_type_name" : "Mobile",
        "pvi_subtype_code" : "",
        "pvi_subtype_name" : "Smartphone"
    }
};
Here is my code:
import scrapy
import json
import csv
import re


class QuotesSpider(scrapy.Spider):
    name = "quotes1"

    def start_requests(self):
        with open('so_52069753.csv', 'r') as csvf:
            urlreader = csv.reader(csvf, delimiter=',', quotechar='"')
            for url in urlreader:
                if url[0] == "y":
                    yield scrapy.Request(url[1])
        with open('so_52069753_out.csv', 'w') as csvfile:
            fieldnames = ['Category', 'Type', 'Model', 'SK']
            writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
            writer.writeheader()

    def parse(self, response):
        regex = re.compile(r'"product"\s*:\s*(.+?\})', re.DOTALL)
        source_json = response.xpath("//script[contains(., 'var digitalData')]/text()").re_first(regex)
        if source_json:
            source_json = re.sub(r'//[^\n]+', "", source_json)
            product = json.loads(source_json)
            product_category = product["pvi_type_name"]
            product_type = product["pvi_subtype_name"]
            product_model = product["displayName"]
            product_name = product["model_name"]
        if source_json:
            source = source_json[0]
            #yield ({'Category': get_values("pvi_subtype_name", source), 'Type': get_values("pathIndicator.depth_5", source), 'SK': get_values("model_name", source)})
            with open('so_52069753_out.csv', 'a') as csvfile:
                fieldnames = ['Category', 'Type', 'Model', 'SK']
                writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
                writer.writerow({'Category': product_category, 'Type': product_type, 'Model': product_model, 'SK': product_name})
How can I modify my XPath to read "var digitalData"? Thanks in advance!
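Solution
We can't parse the whole digitalData variable with json.loads(), because the values in these two entries are bare JavaScript identifiers, which is not valid JSON:
"siteCode" : siteCode,
and
"pageURL" : pageURL,
So I tried extracting only the product part: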
def parse(self, response):
    regex = re.compile(r'"product"\s*:\s*(.+?\})', re.DOTALL)
    source_json = response.xpath("//script[contains(., 'var digitalData')]/text()").re_first(regex)
    if source_json:
        # Now we need to remove comments from the JSON:
        # "category" : "", // pathIndicator depth정보 이용하여 설정
        # source_json = re.sub(r'//.+$', "", source_json, re.MULTILINE) # this regex doesn't work for me
        source_json = re.sub(r'//[^\n]+', "", source_json)
        product = json.loads(source_json)
        product_category = product["category"]
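The same extraction can be tried outside Scrapy, which makes the regex easier to check. The following is a minimal sketch, not the answerer's exact code: SAMPLE_SCRIPT is a shortened, hypothetical copy of the script from the question, and extract_product and to_row are illustrative helper names. It also suggests why the commented-out re.sub call probably failed: the fourth positional parameter of re.sub is count, not flags, so re.MULTILINE has to be passed as flags=re.MULTILINE or compiled into the pattern, as done here.

```python
import json
import re

# Hypothetical, shortened sample of the inline <script> text from the question.
SAMPLE_SCRIPT = '''
var digitalData = {
    "page" : {
        "pageInfo" : {
            "siteCode" : siteCode,
            "pageURL" : pageURL
        }
    },
    "product" : {
        "category" : "", // pathIndicator depth
        "model_code" : "SM-G960FZPDBTU",
        "model_name" : "SM-G960F/DS",
        "displayName" : "Galaxy S9 Hybrid Sim 64GB",
        "pvi_type_name" : "Mobile",
        "pvi_subtype_name" : "Smartphone"
    }
};
'''

# Same pattern as in the answer: capture the first {...} after "product".
# This assumes the product object contains no nested braces.
PRODUCT_RE = re.compile(r'"product"\s*:\s*(.+?\})', re.DOTALL)

# Strip // line comments; MULTILINE makes $ match at the end of every line.
# Caveat: this would also cut string values that contain "//", such as URLs.
COMMENT_RE = re.compile(r'//.*$', re.MULTILINE)


def extract_product(script_text):
    """Return the "product" object as a dict, or None if it is not found."""
    match = PRODUCT_RE.search(script_text)
    if match is None:
        return None
    cleaned = COMMENT_RE.sub("", match.group(1))
    return json.loads(cleaned)


def to_row(product):
    """Map product fields to the CSV columns used in the question."""
    return {
        "Category": product["pvi_type_name"],
        "Type": product["pvi_subtype_name"],
        "Model": product["displayName"],
        "SK": product["model_name"],
    }


if __name__ == "__main__":
    print(to_row(extract_product(SAMPLE_SCRIPT)))
```

Inside the spider, the script text selected by the XPath (for example via .get() or .extract_first(), depending on the Scrapy version) could be passed to extract_product, and the resulting row either written with csv.DictWriter or yielded as an item.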