首页 > 解决方案 > 如何“分组”一系列重叠的点坐标

问题描述

我有一个相当大的(250k 行)数据集,其中有很多重叠的点坐标。我需要映射这些坐标,但每个实际设施只能使用一个点。

本质上,我需要想办法将所有差异化数据(源代码、表单代码、WastW)合并到每个设施中。

这是我的数据集看起来原始的内容 原始数据

这是我希望它看起来的样子 我希望它看起来像什么

我可以使用 SQL(通过 ORACLE 数据库)来提取原始数据,或者在事后使用 R 或 ArcGIS/Qgis 进行合并。

拉取代码时,我在 SQL 中运行了 group by 语句,但它只能提供如下所示的原始数据。

select distinct 
REPORTING.TONS, 
REPORTING.YEAR, 
REPORTING.NAME, 
REPORTING.ID,
REPORTING.source_code,
REPORTING.form_code,
REPORTING.primary_naics,
REPORTING.WASTEW
REPORTING.LAT
REPORTING.LON

from REPORTING where 1=1 and 
upper(REPORTING.REPORT_CYCLE) = 2017 
order by 
REPORTING.TONS, 
REPORTING.YEAR, 
REPORTING.NAME, 
REPORTING.ID,
REPORTING.source_code,
REPORTING.form_code,
REPORTING.primary_naics,
REPORTING.WASTEW
REPORTING.LAT
REPORTING.LON

使用来自 PARFAIT 的新代码更新结果 在此处输入图像描述

原始数据:

| Tons | Year | Name     | ID           | SOURCE_CODE | FORM_CODE | primary_naics  | WASTEW | Lat       | Lon       |
| 0    | 2017 | Airborne | TN5210020140 | G11         | W801      | 928110 | N          | 36.611123 | -87.46234 |
| 0    | 2017 | Airborne | TN5210020140 | G13         | W101      | 928110 | Y          | 36.611123 | -87.46234 |
| 0    | 2017 | Airborne | TN5210020140 | G14         | W200      | 928110 | N          | 36.611123 | -87.46234 |
| 0    | 2017 | Airborne | TN5210020140 | G14         | W101      | 928110 | N          | 36.611123 | -87.46234 |
| 0    | 2017 | Airborne | TN5210020140 | G32         | W310      | 928110 | N          | 36.611123 | -87.46234 |
| 0    | 2017 | Century  | NVD980895338 | G01         | W001      | 562211 | N          | 39.61028  | -119.2031 |
| 0    | 2017 | Century  | NVD980895338 | G02         | W001      | 562211 | Y          | 39.61028  | -119.2031 |
| 0    | 2017 | Century  | NVD980895338 | G03         | W002      | 562211 | N          | 39.61028  | -119.2031 |
| 0    | 2017 | Century  | NVD980895338 | G03         | W004      | 562211 | N          | 39.61028  | -119.2031 |
| 0    | 2017 | Century  | NVD980895338 | G04         | W004      | 562211 | Y          | 39.61028  | -119.2031 |

来自 SQL 拉取的新数据:

    +--------+------+-------------------------------------------------------+--------------+-------------+----------------+--------+-------+
| 6.874  | 2017 | FLINT HILLS RESOURCES ALASKA, LLC NORTH POLE TERMINAL | AKD000850701 |     G23     |      W609      | 424710 |   N   |
+--------+------+-------------------------------------------------------+--------------+-------------+----------------+--------+-------+
| 7.2015 | 2017 | FLINT HILLS RESOURCES ALASKA, LLC NORTH POLE TERMINAL | AKD000850701 | G19         | W609           | 424710 | N     |
| 10.333 | 2017 | FLINT HILLS RESOURCES ALASKA, LLC NORTH POLE TERMINAL | AKD000850701 | G19         | W319           | 424710 | N     |
| 0.05   | 2017 | SWANSON RIVER OIL PIPELINE                            | AKD035419795 | G19         | W310           | 486110 | N     |
| 3.242  | 2017 | SWANSON RIVER OIL PIPELINE                            | AKD035419795 | G07         | W002           | 486110 | N     |
| 0.0005 | 2017 | UNIVERSITY OF ALASKA FAIRBANKS                        | AKD048679567 | G11,G11,G11 | W001,W004,W801 |  61131 | N,N,N |
| 0.001  | 2017 | UNIVERSITY OF ALASKA FAIRBANKS                        | AKD048679567 | G11,G11,G11 | W001,W004,W801 |  61131 | N,N,N |
| 0.0015 | 2017 | UNIVERSITY OF ALASKA FAIRBANKS                        | AKD048679567 | G11,G11     | W001,W801      |  61131 | N,N   |
| 0.0025 | 2017 | UNIVERSITY OF ALASKA FAIRBANKS                        | AKD048679567 | G11         | W001           |  61131 | N     |
| 0.005  | 2017 | UNIVERSITY OF ALASKA FAIRBANKS                        | AKD048679567 | G11,G11     | W001,W801      |  61131 | N,N   |
| 0.006  | 2017 | UNIVERSITY OF ALASKA FAIRBANKS                        | AKD048679567 | G22         | W001           |  61131 | N     |
| 0.0095 | 2017 | UNIVERSITY OF ALASKA FAIRBANKS                        | AKD048679567 | G11         | W001           |  61131 | N     |
| 0.01   | 2017 | UNIVERSITY OF ALASKA FAIRBANKS                        | AKD048679567 | G11         | W001           |  61131 | N     |
| 0.015  | 2017 | UNIVERSITY OF ALASKA FAIRBANKS                        | AKD048679567 | G11         | W001           |  61131 | N     |
| 0.025  | 2017 | UNIVERSITY OF ALASKA FAIRBANKS                        | AKD048679567 | G09,G11,G15 | W001,W001,W002 |  61131 | N,N,N |
| 0.03   | 2017 | UNIVERSITY OF ALASKA FAIRBANKS                        | AKD048679567 | G11         | W001           |  61131 | N     |
+--------+------+-------------------------------------------------------+--------------+-------------+----------------+--------+-------+

标签: sqlroraclegis

解决方案


关于您希望如何组合某些列中的值有点不清楚 - 特别是WASTEW,但这里有一个 tidyverse 方法来获取您的原始数据集并对其进行总结。

首先,我们制作一些数据以匹配您的示例数据:

library(dplyr)
library(readr)

"Tons Year Name ID SOURCE_CODE FORM_CODE NAICS WASTEW Lat Lon
 0 2017 101ST TN521002 G11 W801 928110 N 36.61112 -87.4623
 0 2017 101ST TN521002 G13 Wl0l 928110 Y 36.61112 -87.4623
 0 2017 101ST TN521002 G14 W200 928110 N 36.61112 -87.4623
 0 2017 101ST TN521002 G14 W603 928110 N 36.61112 -87.4623
 0 2017 101ST TN521002 G16 W103 928110 N 36.61112 -87.4623
 0 2017 101ST TN521002 G16 W310 928110 N 36.61112 -87.4623
 0 2017 101ST TN521002 G22 W113 928110 Y 36.61112 -87.4623
 0 2017 101ST TN521002 G22 W219 928110 N 36.61112 -87.4623
 0 2017 101ST TN521002 G32 W117 928110 N 36.61112 -87.4623
 0 2017 101ST TN521002 G32 W301 928110 N 36.61112 -87.4623
 0 2017 101ST TN521002 G32 W310 928110 N 36.61112 -87.4623
 0 2017 21ST NVD9808 G12 W00l 562211 N 39.61028 -119.203
 0 2017 21ST NVD9808 G98 W00l 562211 Y 39.61028 -119.203
 0 2017 21ST NVD9808 G54 W002 562211 N 39.61028 -119.203
 0 2017 21ST NVD9808 G34 W004 562211 N 39.61028 -119.203
 0 2017 21ST NVD9808 G12 W004 562211 Y 39.61028 -119.203" %>% 
  read_table2() -> tbl.raw

然后我们可以对您的数据进行分组,以便当我们将这些组汇总为单个观察结果时,数据会以您希望的方式保留:

tbl.raw %>% 
  group_by(Tons, Year, Name, ID, NAICS, Lat, Lon) -> tbl.grouped

接下来,我们可以将每个组中的SOURCE_CODEandFORM_CODE变量折叠成逗号分隔的形式,并计算您想要的WASTEW变量:

tbl.grouped %>% 
  summarize(SOURCE_CODE = paste(SOURCE_CODE, collapse = ", "), 
            FORM_CODE = paste(FORM_CODE, collapse = ", "), 
            WASTEW = case_when("N" %in% WASTEW & "Y" %in% WASTEW ~ "Y, N", 
                               "N" %in% WASTEW ~ "N", 
                               "Y" %in% WASTEW ~ "Y", 
                               TRUE ~ "")) -> tbl.summarized

最后,我们可以删除分组并仅选择您希望保留的变量:

tbl.summarized %>% 
  ungroup() %>% 
  select(Tons, Year, Name, ID, SOURCE_CODE, FORM_CODE, NAICS, WASTEW, Lat, Lon)

推荐阅读