首页 > 解决方案 > Google Big Query 中的扩展资源

问题描述

我正在尝试在 bigquery 脚本上使用 ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...) 并继续遇到“超出资源”错误。

该表的大小为 219.96 GB,行数为 1,611,220,127

这是脚本:

With cte as (
SELECT 
    Source, 
    ROW_NUMBER() OVER (PARTITION BY FirstName, LastName, MiddleName, Address, Address2, City, State, Zip ORDER BY Attom_ID DESC) as rnk
    ,FirstName, LastName, MiddleName, Gender, Age, DOB, Address, Address2, City, State, Zip, Zip4, TimeZone, Income, HomeValue, Networth, MaritalStatus, IsRenter, HasChildren, CreditRating, Investor, LinesOfCredit, InvestorRealEstate, Traveler, Pets, MailResponder, Charitable, PolicalDonations, PoliticalParty, Attom_ID, GEOID, Score, Score1, Score2, Score3, Score4, Score5, Latitude, Longitude
    from `db.ds.tblA`
) select * from cte where rnk = 1

虽然这是一张表,但它是连接的结果,其中 PRIOR 到 ATTOM_ID 的所有列都来自一个表,而 ATTOM_ID、GEOID 等来自第二个表。我相信结果集中有一些笛卡尔。

表中有多种重复项,我正在尝试对表进行重复数据删除。我害怕尝试使用 MAX(ATTOM_ID) 进行 GROUP_BY,因为我想确保使用正确关联的 GEOID 和 SCORES。我不想把它混在一起。

问题是这个特定的查询超出了资源,所以我想知道我这里是否有任何选项。谢谢!

标签: sqlgoogle-bigquery

解决方案


以下相当于您的原始查询(结果方面),通常解决“超出资源”问题

#standardSQL
SELECT r.* FROM (
  SELECT 
    ARRAY_AGG(STRUCT(Source,FirstName, LastName, MiddleName, Gender, Age, DOB, Address, Address2, City, State, Zip, Zip4, TimeZone, Income, HomeValue, Networth, MaritalStatus, IsRenter, HasChildren, CreditRating, Investor, LinesOfCredit, InvestorRealEstate, Traveler, Pets, MailResponder, Charitable, PolicalDonations, PoliticalParty, Attom_ID, GEOID, Score, Score1, Score2, Score3, Score4, Score5, Latitude, Longitude) ORDER BY Attom_ID DESC LIMIT 1)[OFFSET(0)]
  FROM `db.ds.tblA`
  GROUP BY FirstName, LastName, MiddleName, Address, Address2, City, State, Zip
)

推荐阅读