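sql - How do I select values with a SQL query from a table whose values are dictionaries with numeric keys?
Problem description
So, I have a table from a BigQuery public dataset (Google Analytics):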
print(bigquery_client.query(
"""
SELECT hits.0.productName
from `bigquery-public-data.google_analytics_sample.ga_sessions_*`,
where date between '20160101' and '20161231'
""").to_dataframe())
Additional code:
import os
from google.cloud import bigquery  # needed for bigquery.Client()
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] ='/Users/<UserName>/Desktop/folder/key/<Key_name>.json'
bigquery_client = bigquery.Client()
Error in the Jupyter notebook:
BadRequest Traceback (most recent call last)
<ipython-input-31-424833cf8827> in <module>
----> 1 print(bigquery_client.query(
2 """
3 SELECT hits.0.productName
4 from `bigquery-public-data.google_analytics_sample.ga_sessions_*`,
5 where date between '20160101' and '20161231'
~/opt/anaconda3/lib/python3.8/site-packages/google/cloud/bigquery/job/query.py in to_dataframe(self, bqstorage_client, dtypes, progress_bar_type, create_bqstorage_client, date_as_object, max_results, geography_as_object)
1563 :mod:`shapely` library cannot be imported.
1564 """
-> 1565 query_result = wait_for_query(self, progress_bar_type, max_results=max_results)
1566 return query_result.to_dataframe(
1567 bqstorage_client=bqstorage_client,
~/opt/anaconda3/lib/python3.8/site-packages/google/cloud/bigquery/_tqdm_helpers.py in wait_for_query(query_job, progress_bar_type, max_results)
86 )
87 if progress_bar is None:
---> 88 return query_job.result(max_results=max_results)
89
90 i = 0
~/opt/anaconda3/lib/python3.8/site-packages/google/cloud/bigquery/job/query.py in result(self, page_size, max_results, retry, timeout, start_index, job_retry)
1370 do_get_result = job_retry(do_get_result)
1371
-> 1372 do_get_result()
1373
1374 except exceptions.GoogleAPICallError as exc:
~/opt/anaconda3/lib/python3.8/site-packages/google/api_core/retry.py in retry_wrapped_func(*args, **kwargs)
281 self._initial, self._maximum, multiplier=self._multiplier
282 )
--> 283 return retry_target(
284 target,
285 self._predicate,
~/opt/anaconda3/lib/python3.8/site-packages/google/api_core/retry.py in retry_target(target, predicate, sleep_generator, deadline, on_error)
188 for sleep in sleep_generator:
189 try:
--> 190 return target()
191
192 # pylint: disable=broad-except
~/opt/anaconda3/lib/python3.8/site-packages/google/cloud/bigquery/job/query.py in do_get_result()
1360 self._job_retry = job_retry
1361
-> 1362 super(QueryJob, self).result(retry=retry, timeout=timeout)
1363
1364 # Since the job could already be "done" (e.g. got a finished job
~/opt/anaconda3/lib/python3.8/site-packages/google/cloud/bigquery/job/base.py in result(self, retry, timeout)
711
712 kwargs = {} if retry is DEFAULT_RETRY else {"retry": retry}
--> 713 return super(_AsyncJob, self).result(timeout=timeout, **kwargs)
714
715 def cancelled(self):
~/opt/anaconda3/lib/python3.8/site-packages/google/api_core/future/polling.py in result(self, timeout, retry)
135 # pylint: disable=raising-bad-type
136 # Pylint doesn't recognize that this is valid in this case.
--> 137 raise self._exception
138
139 return self._result
BadRequest: 400 Syntax error: Unexpected keyword WHERE at [4:1]
(job ID: 3c15e031-ee7d-4594-a577-0237f8282695)
-----Query Job SQL Follows-----
| . | . | . | . | . | . |
1:
2:SELECT hits.0.productName
3:from `bigquery-public-data.google_analytics_sample.ga_sessions_*`,
4:where date between '20160101' and '20161231'
| . | . | . | . | . | . |
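As the screenshot shows, I have a hits column whose values are dictionaries, and I need to get the inner dictionary value from the '0' column, but I get an error. In fact, I need to get the 'productName' value from all of the numeric columns.
Solution
One way to solve this is to filter the data you want directly in the query.
Filtering in BigQuery: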
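First, for a better understanding, look at the data schema of the fields that contain the product name:
- The first possible field is hits.item.productName
  - hits is a RECORD
  - item is a RECORD inside hits
  - productName is a STRING inside hits.item
- The second possible field is hits.product.v2ProductName
  - product is a RECORD inside hits
  - v2ProductName is a STRING inside hits.product
To query a repeated RECORD such as hits or product, you have to flatten it, that is, turn it into a table with the UNNEST([record]) expression. So, to return all unique product names from hits.product.v2ProductName: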
from google.cloud import bigquery
import pandas as pd  # to_dataframe() needs pandas installed

client = bigquery.Client()

# product is nested inside hits, so each repeated level is unnested in turn.
sql = """
SELECT
  DISTINCT p.v2ProductName
FROM
  `bigquery-public-data.google_analytics_sample.ga_sessions_*`,
  UNNEST(hits) AS h,
  UNNEST(h.product) AS p
WHERE
  date BETWEEN '20160101'
  AND '20161231'
  AND (p.v2ProductName IS NOT NULL);
"""
v2productname = client.query(sql).to_dataframe()
print(v2productname)
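To see what flattening with UNNEST does, it can help to mimic it in plain Python. The following is only a rough sketch, not how BigQuery executes the query: the sample rows, the product names, and the helper distinct_product_names are all made up for illustration. Each comma join with UNNEST behaves like one more nested loop over a repeated field.

```python
# Toy stand-in for three ga_sessions rows; all product names are invented.
sessions = [
    {"date": "20160801", "hits": [
        {"product": [{"v2ProductName": "Hypothetical Mug"},
                     {"v2ProductName": "Hypothetical Pen"}]},
        {"product": []},  # a hit without products contributes no rows
    ]},
    {"date": "20170105", "hits": [
        {"product": [{"v2ProductName": "Hypothetical Mug"}]},
    ]},
    {"date": "20161201", "hits": [
        {"product": [{"v2ProductName": None},
                     {"v2ProductName": "Hypothetical Cap"}]},
    ]},
]


def distinct_product_names(rows, start, end):
    """Mimic FROM rows, UNNEST(hits) AS h, UNNEST(h.product) AS p."""
    names = set()
    for row in rows:
        if not (start <= row["date"] <= end):  # WHERE date BETWEEN start AND end
            continue
        for h in row["hits"]:                  # UNNEST(hits) AS h
            for p in h["product"]:             # UNNEST(h.product) AS p
                name = p["v2ProductName"]
                if name is not None:           # AND p.v2ProductName IS NOT NULL
                    names.add(name)            # DISTINCT
    return sorted(names)


print(distinct_product_names(sessions, "20160101", "20161231"))
# prints ['Hypothetical Cap', 'Hypothetical Mug', 'Hypothetical Pen']
```

The numeric keys 0, 1, 2 seen in the hits column are just positions in this repeated field, which is why the loop visits every element instead of addressing hits.0 directly.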
To use the hits.item.productName field instead, run the following query, although every record is null:
SELECT
DISTINCT h.item.productname
FROM
`bigquery-public-data.google_analytics_sample.ga_sessions_*`,
UNNEST(hits) AS h,
UNNEST(product) AS p
WHERE
date BETWEEN '20160101'
AND '20161231'
AND (h.item.productname IS NOT NULL);
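Filtering from the DataFrame:
I tried to handle it with a DataFrame, but because of the chain of nested records in the dataset it was not possible; the to_dataframe() function could not handle it.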
In summary:
Try to filter and process as much data as possible in BigQuery; it is faster and more cost-effective.
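One way to push more of the filtering into BigQuery is to restrict which daily tables the wildcard matches. The sketch below is an assumption-laden illustration, not part of the original answer: build_product_query is a hypothetical helper, and it assumes the table suffixes are YYYYMMDD strings, as the ga_sessions_* naming suggests. It relies on _TABLE_SUFFIX, the pseudo-column BigQuery exposes for wildcard tables.

```python
import re

TABLE = "bigquery-public-data.google_analytics_sample.ga_sessions_*"


def build_product_query(start, end):
    """Return SQL for the distinct product names between two YYYYMMDD dates."""
    for value in (start, end):
        # Only 8-digit strings are interpolated, so no arbitrary text reaches the SQL.
        if not re.fullmatch(r"[0-9]{8}", value):
            raise ValueError(f"expected YYYYMMDD, got {value!r}")
    return f"""
SELECT DISTINCT p.v2ProductName
FROM `{TABLE}`,
  UNNEST(hits) AS h,
  UNNEST(h.product) AS p
WHERE _TABLE_SUFFIX BETWEEN '{start}' AND '{end}'
  AND p.v2ProductName IS NOT NULL
"""


sql = build_product_query("20160101", "20161231")
print(sql)
```

The resulting string can be passed to client.query(sql).to_dataframe() in the same way as the query shown earlier.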