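python - Scraping JSON with Scrapy in Python
Problem description
I'm struggling to figure out how to scrape a JSON response using Scrapy in Python. I was able to scrape JSON successfully from a different page on the same site. Any help would be appreciated.
How would I scrape the values inside "tournamentGroup" (i.e. id, name), as well as year, title, and so on?
Partial code: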
start_url = 'https://api.wtatennis.com/tennis/tournaments/?page=0&pageSize=100&excludeLevels=ITF&from=2020-09-01&to=2020-09-30'
with urllib.request.urlopen(start_url) as start_url:
    json_obj = start_url.read()
    rank_list = json.loads(json_obj)
for item in rank_list:
    rank_data = []
    tourney_id = item['content']['id']
    tourney_year = item['year']
    rank_data = [tourney_id, tourney_year]
    cur.execute("""insert into wta_rankings(tourney_id, tourney_year)
                values(%s, %s)
                ON CONFLICT DO NOTHING"""
                ,(rank_data))
conn.commit()
cur.close()
JSON:
{
  "pageInfo":{
    "page":0,
    "numPages":0,
    "pageSize":100,
    "numEntries":10
  },
  "content":[
    {
      "tournamentGroup":{
        "id":2023,
        "name":"Prague 125K",
        "level":"125K",
        "metadata":null
      },
      "year":2020,
      "title":"Prague Open",
      "startDate":"2020-08-29",
      "endDate":"2020-09-06",
      "surface":"Clay",
      "inOutdoor":"O",
      "city":"PRAGUE",
      "country":"Czech Republic",
      "singlesDrawSize":128,
      "doublesDrawSize":32,
      "prizeMoney":3125000,
      "prizeMoneyCurrency":"USD",
      "liveScoringId":"2023"
    },
Solution
Try this:
import requests

url = "https://api.wtatennis.com/tennis/tournaments/?page=0&pageSize=100&excludeLevels=ITF&from=2020-09-01&to=2020-09-30"
response = requests.get(url).json()
for item in response["content"]:
    print(f"{item['tournamentGroup']['name']} - {item['year']} - {item['title']}")
This gives you the following (it's only a sample; you can pull out any field you want):
Prague 125K - 2020 - Prague Open
US OPEN - 2020 - US Open - New York, United States, NY
WARSAW - 2020 - BNP Paribas Warsaw Open - Warsaw, Poland
ISTANBUL - 2020 - TEB BNP Paribas Tennis Championship Istanbul - Istanbul, Turkey
MADRID - 2020 - Mutua Madrid Open - Madrid, Spain
HIROSHIMA - 2020 - Hana-cupid Japan Women's Open - Hiroshima, Japan
ROME - 2020 - Internazionali BNL d'Italia - Rome, Italy
STRASBOURG - 2020 - Internationaux de Strasbourg - Strasbourg, France
ROLAND GARROS - 2020 - Roland Garros - Paris, France
TASHKENT - 2020 - Tashkent Open - Tashkent, Uzbekistan
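The loop in the question iterates over the top-level object, so each `item` is a key string such as "pageInfo" or "content" rather than a tournament; the tournaments sit in the "content" list, and the id and name sit one level deeper under "tournamentGroup". A minimal sketch of pulling out the four requested fields, assuming the response keeps the shape shown in the question. The `extract_rows` helper and the trimmed `SAMPLE` string are illustrative names and data, not part of the API:

```python
import json

# Trimmed sample based on the JSON shown in the question (assumed shape).
SAMPLE = """
{
  "pageInfo": {"page": 0, "numPages": 0, "pageSize": 100, "numEntries": 10},
  "content": [
    {
      "tournamentGroup": {"id": 2023, "name": "Prague 125K", "level": "125K", "metadata": null},
      "year": 2020,
      "title": "Prague Open"
    }
  ]
}
"""


def extract_rows(payload):
    """Return one (group_id, group_name, year, title) tuple per tournament."""
    rows = []
    # Tournaments are in the "content" list, not at the top level.
    for item in payload.get("content", []):
        # Fall back to an empty dict if "tournamentGroup" is missing or null.
        group = item.get("tournamentGroup") or {}
        rows.append((group.get("id"), group.get("name"), item.get("year"), item.get("title")))
    return rows


rows = extract_rows(json.loads(SAMPLE))
print(rows)
```

Each tuple could then be passed as the parameters of an insert like the `cur.execute` call in the question, once the SQL lists matching columns. In a Scrapy callback the same helper could be fed `json.loads(response.text)`.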
If you have trouble "navigating" the JSON, just copy the response content into an online JSON formatter, click the wrench icon to fix it, and then choose Format / Beautify.
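If you would rather not paste the response into a website, the standard library can do the same pretty-printing locally. A small sketch; the `pretty` helper and the `sample` dict are made up for illustration:

```python
import json


def pretty(payload):
    """Return the parsed JSON re-serialized with two-space indentation."""
    return json.dumps(payload, indent=2, ensure_ascii=False)


# Illustrative stand-in for a parsed API response.
sample = {"content": [{"tournamentGroup": {"id": 2023, "name": "Prague 125K"}, "year": 2020}]}
print(pretty(sample))
```

The indented output makes it easy to see which keys are nested under which before you write the lookups.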