sql - 如何“分组”一系列重叠的点坐标
问题描述
我有一个相当大的(250k 行)数据集,其中有很多重叠的点坐标。我需要映射这些坐标,但每个实际设施只能使用一个点。
本质上,我需要想办法将所有差异化数据(源代码、表单代码、WastW)合并到每个设施中。
我可以使用 SQL(通过 ORACLE 数据库)来提取原始数据,或者在事后使用 R 或 ArcGIS/Qgis 进行合并。
拉取代码时,我在 SQL 中运行了 group by 语句,但它只能提供如下所示的原始数据。
select distinct
REPORTING.TONS,
REPORTING.YEAR,
REPORTING.NAME,
REPORTING.ID,
REPORTING.source_code,
REPORTING.form_code,
REPORTING.primary_naics,
REPORTING.WASTEW
REPORTING.LAT
REPORTING.LON
from REPORTING where 1=1 and
upper(REPORTING.REPORT_CYCLE) = 2017
order by
REPORTING.TONS,
REPORTING.YEAR,
REPORTING.NAME,
REPORTING.ID,
REPORTING.source_code,
REPORTING.form_code,
REPORTING.primary_naics,
REPORTING.WASTEW
REPORTING.LAT
REPORTING.LON
原始数据:
| Tons | Year | Name | ID | SOURCE_CODE | FORM_CODE | primary_naics | WASTEW | Lat | Lon |
| 0 | 2017 | Airborne | TN5210020140 | G11 | W801 | 928110 | N | 36.611123 | -87.46234 |
| 0 | 2017 | Airborne | TN5210020140 | G13 | W101 | 928110 | Y | 36.611123 | -87.46234 |
| 0 | 2017 | Airborne | TN5210020140 | G14 | W200 | 928110 | N | 36.611123 | -87.46234 |
| 0 | 2017 | Airborne | TN5210020140 | G14 | W101 | 928110 | N | 36.611123 | -87.46234 |
| 0 | 2017 | Airborne | TN5210020140 | G32 | W310 | 928110 | N | 36.611123 | -87.46234 |
| 0 | 2017 | Century | NVD980895338 | G01 | W001 | 562211 | N | 39.61028 | -119.2031 |
| 0 | 2017 | Century | NVD980895338 | G02 | W001 | 562211 | Y | 39.61028 | -119.2031 |
| 0 | 2017 | Century | NVD980895338 | G03 | W002 | 562211 | N | 39.61028 | -119.2031 |
| 0 | 2017 | Century | NVD980895338 | G03 | W004 | 562211 | N | 39.61028 | -119.2031 |
| 0 | 2017 | Century | NVD980895338 | G04 | W004 | 562211 | Y | 39.61028 | -119.2031 |
来自 SQL 拉取的新数据:
+--------+------+-------------------------------------------------------+--------------+-------------+----------------+--------+-------+
| 6.874 | 2017 | FLINT HILLS RESOURCES ALASKA, LLC NORTH POLE TERMINAL | AKD000850701 | G23 | W609 | 424710 | N |
+--------+------+-------------------------------------------------------+--------------+-------------+----------------+--------+-------+
| 7.2015 | 2017 | FLINT HILLS RESOURCES ALASKA, LLC NORTH POLE TERMINAL | AKD000850701 | G19 | W609 | 424710 | N |
| 10.333 | 2017 | FLINT HILLS RESOURCES ALASKA, LLC NORTH POLE TERMINAL | AKD000850701 | G19 | W319 | 424710 | N |
| 0.05 | 2017 | SWANSON RIVER OIL PIPELINE | AKD035419795 | G19 | W310 | 486110 | N |
| 3.242 | 2017 | SWANSON RIVER OIL PIPELINE | AKD035419795 | G07 | W002 | 486110 | N |
| 0.0005 | 2017 | UNIVERSITY OF ALASKA FAIRBANKS | AKD048679567 | G11,G11,G11 | W001,W004,W801 | 61131 | N,N,N |
| 0.001 | 2017 | UNIVERSITY OF ALASKA FAIRBANKS | AKD048679567 | G11,G11,G11 | W001,W004,W801 | 61131 | N,N,N |
| 0.0015 | 2017 | UNIVERSITY OF ALASKA FAIRBANKS | AKD048679567 | G11,G11 | W001,W801 | 61131 | N,N |
| 0.0025 | 2017 | UNIVERSITY OF ALASKA FAIRBANKS | AKD048679567 | G11 | W001 | 61131 | N |
| 0.005 | 2017 | UNIVERSITY OF ALASKA FAIRBANKS | AKD048679567 | G11,G11 | W001,W801 | 61131 | N,N |
| 0.006 | 2017 | UNIVERSITY OF ALASKA FAIRBANKS | AKD048679567 | G22 | W001 | 61131 | N |
| 0.0095 | 2017 | UNIVERSITY OF ALASKA FAIRBANKS | AKD048679567 | G11 | W001 | 61131 | N |
| 0.01 | 2017 | UNIVERSITY OF ALASKA FAIRBANKS | AKD048679567 | G11 | W001 | 61131 | N |
| 0.015 | 2017 | UNIVERSITY OF ALASKA FAIRBANKS | AKD048679567 | G11 | W001 | 61131 | N |
| 0.025 | 2017 | UNIVERSITY OF ALASKA FAIRBANKS | AKD048679567 | G09,G11,G15 | W001,W001,W002 | 61131 | N,N,N |
| 0.03 | 2017 | UNIVERSITY OF ALASKA FAIRBANKS | AKD048679567 | G11 | W001 | 61131 | N |
+--------+------+-------------------------------------------------------+--------------+-------------+----------------+--------+-------+
解决方案
关于您希望如何组合某些列中的值有点不清楚 - 特别是WASTEW
,但这里有一个 tidyverse 方法来获取您的原始数据集并对其进行总结。
首先,我们制作一些数据以匹配您的示例数据:
library(dplyr)
library(readr)
"Tons Year Name ID SOURCE_CODE FORM_CODE NAICS WASTEW Lat Lon
0 2017 101ST TN521002 G11 W801 928110 N 36.61112 -87.4623
0 2017 101ST TN521002 G13 Wl0l 928110 Y 36.61112 -87.4623
0 2017 101ST TN521002 G14 W200 928110 N 36.61112 -87.4623
0 2017 101ST TN521002 G14 W603 928110 N 36.61112 -87.4623
0 2017 101ST TN521002 G16 W103 928110 N 36.61112 -87.4623
0 2017 101ST TN521002 G16 W310 928110 N 36.61112 -87.4623
0 2017 101ST TN521002 G22 W113 928110 Y 36.61112 -87.4623
0 2017 101ST TN521002 G22 W219 928110 N 36.61112 -87.4623
0 2017 101ST TN521002 G32 W117 928110 N 36.61112 -87.4623
0 2017 101ST TN521002 G32 W301 928110 N 36.61112 -87.4623
0 2017 101ST TN521002 G32 W310 928110 N 36.61112 -87.4623
0 2017 21ST NVD9808 G12 W00l 562211 N 39.61028 -119.203
0 2017 21ST NVD9808 G98 W00l 562211 Y 39.61028 -119.203
0 2017 21ST NVD9808 G54 W002 562211 N 39.61028 -119.203
0 2017 21ST NVD9808 G34 W004 562211 N 39.61028 -119.203
0 2017 21ST NVD9808 G12 W004 562211 Y 39.61028 -119.203" %>%
read_table2() -> tbl.raw
然后我们可以对您的数据进行分组,以便当我们将这些组汇总为单个观察结果时,数据会以您希望的方式保留:
tbl.raw %>%
group_by(Tons, Year, Name, ID, NAICS, Lat, Lon) -> tbl.grouped
接下来,我们可以将每个组中的SOURCE_CODE
andFORM_CODE
变量折叠成逗号分隔的形式,并计算您想要的WASTEW
变量:
tbl.grouped %>%
summarize(SOURCE_CODE = paste(SOURCE_CODE, collapse = ", "),
FORM_CODE = paste(FORM_CODE, collapse = ", "),
WASTEW = case_when("N" %in% WASTEW & "Y" %in% WASTEW ~ "Y, N",
"N" %in% WASTEW ~ "N",
"Y" %in% WASTEW ~ "Y",
TRUE ~ "")) -> tbl.summarized
最后,我们可以删除分组并仅选择您希望保留的变量:
tbl.summarized %>%
ungroup() %>%
select(Tons, Year, Name, ID, SOURCE_CODE, FORM_CODE, NAICS, WASTEW, Lat, Lon)
推荐阅读
- mysql - 通过过滤状态查找最新的行
- javascript - Make height responsive according to width in JavaScript
- keycloak - 基于子域切换 Keycloak Realm
- cordova - 未定义的属性'then'
- raspberry-pi3 - 具有多个 CAN 总线的 RaspberryPi 3b+ (MPC2515))
- python - MacOS:ModuleNotFoundError:没有名为“object_detection”的模块
- identityserver4 - 注销重定向 Uri 在 Google 上的操作?
- tensorflow - CNN准确率没有提高
- c# - 当我想在设置一个 coockie 后重定向时,它说 HTTP 错误 401.0 - 未经授权
- javascript - RNLocation 包问题:subscribeToLocationUpdates 有时永远不会触发,有时它会自动开始工作