首页 > 解决方案 > Aggregate rows with conditions, sum quantitative values, and keep character values of the max

问题描述

I have a sf object of points within same polygons. I want to aggregate all info by polygon Id and keep only one point per polygon : summing quantitative values and keeping character values of the max population with a fcode starting with "P".

I don't know how I can do that.

Here, an exemple with a sample of my sf (6 points in one polygon)

#My sf with 6 points in the same polygon (Poly.id =1), so I want to aggregate by Poly.id
str(Paris)
Classes ‘sf’, ‘tbl_df’, ‘tbl’ and 'data.frame': 6 obs. of  15 variables:
 $ geonameId             : chr  "2988507" "9844895" "2988623" "9843563" ...
 $ asciiName             : chr  "Paris" "Paris 01 Ancien - Quartier Champs-Elysees" "Paris 05 Pantheon" "Paris 09 Ancien - Quartier Hotel-de-Ville" ...
 $ members_etmun         : num  30 0 0 0 0 0
 $ members_urbact        : num  3 0 0 0 0 0
 $ participations_eucicop: num  175 1 9 13 1 6
 $ partners_eucicop      : num  151 1 7 8 1 4
 $ projects_eucicop      : num  156 1 8 13 1 6
 $ fcodeName             : chr  "capital of a political entity" "historical populated place" "populated place" "historical populated place" ...
 $ fcode                 : chr  "PPLC" "PPLH" "PPL" "PPLH" ...
 $ countryCode           : chr  "FR" "FR" "FR" "FR" ...
 $ continentCode         : chr  "EU" "EU" "EU" "EU" ...
 $ population            : chr  "2138551" "0" "0" "0" ...
 $ CityId                : int  2 3 4 5 6 7
 $ Poly.id               : int  1 1 1 1 1 1
 $ geometry              :sfc_POINT of length 6; first list element:  'XY' num  3760488 2889158
 - attr(*, "sf_column")= chr "geometry"
 - attr(*, "agr")= Factor w/ 3 levels "constant","aggregate",..: NA NA NA NA NA NA NA NA NA NA ...
  ..- attr(*, "names")= chr  "geonameId" "asciiName" "members_etmun" "members_urbact" ...

Here, the output I want (Get all character variables of the first row, because it's the max of "population" and the fcode starts with "P", then sum variables(3:7)). So something like that :

str(Paris)
Classes ‘sf’, ‘tbl_df’, ‘tbl’ and 'data.frame': 1 obs. of  15 variables:
 $ geonameId             : chr  "2988507" 
 $ asciiName             : chr  "Paris"
 $ members_etmun         : num  30 
 $ members_urbact        : num  3 
 $ participations_eucicop: num  205
 $ partners_eucicop      : num  172
 $ projects_eucicop      : num  185
 $ fcodeName             : chr  "capital of a political entity" 
 $ fcode                 : chr  "PPLC" 
 $ countryCode           : chr  "FR" 
 $ continentCode         : chr  "EU" 
 $ population            : chr  "2138551" 
 $ CityId                : int  2 
 $ Poly.id               : int  1 
 $ geometry              :sfc_POINT of length 1; first list element:  'XY' num  3760488 2889158
  ...



I've tried several things with dplyr or the function aggregate but I don't know how to implement conditions (sum of some quantitative variables, and conditional selection of qualitative values)

Below : the info for the reprex

  Paris <-
    structure(
      list(
        geonameId = c(
          "2988507",
          "9844895",
          "2988623",
          "9843563",
          "9843589",
          "2970479"
        ),
        asciiName = c(
          "Paris",
          "Paris 01 Ancien - Quartier Champs-Elysees",
          "Paris 05 Pantheon",
          "Paris 09 Ancien - Quartier Hotel-de-Ville",
          "Paris 10 Ancien - Quartier Invalides",
          "Paris 15 Vaugirard"
        ),
        members_etmun = c(30, 0, 0, 0, 0, 0),
        members_urbact = c(3,
                           0, 0, 0, 0, 0),
        participations_eucicop = c(175, 1, 9, 13,
                                   1, 6),
        partners_eucicop = c(151, 1, 7, 8, 1, 4),
        projects_eucicop = c(156,
                             1, 8, 13, 1, 6),
        fcodeName = c(
          "capital of a political entity",
          "historical populated place",
          "populated place",
          "historical populated place",
          "historical populated place",
          "populated place"
        ),
        fcode = c("PPLC",
                  "PPLH", "PPL", "PPLH", "PPLH", "PPL"),
        countryCode = c("FR",
                        "FR", "FR", "FR", "FR", "FR"),
        continentCode = c("EU", "EU",
                          "EU", "EU", "EU", "EU"),
        population = c("2138551", "0", "0",
                       "0", "0", "0"),
        CityId = 2:7,
        Poly.id = c(1L, 1L, 1L, 1L,
                    1L, 1L),
        geometry = structure(
          list(
            structure(
              c(3760487.79387524,
                2889157.59339236),
              class = c("XY", "POINT", "sfg")
            ),
            structure(
              c(3757309.96074126,
                2891227.55260787),
              class = c("XY", "POINT", "sfg")
            ),
            structure(
              c(3760266.48357105,
                2888217.53980758),
              class = c("XY", "POINT", "sfg")
            ),
            structure(
              c(3760964.29140438,
                2889376.36175204),
              class = c("XY", "POINT", "sfg")
            ),
            structure(
              c(3757255.00710537,
                2889909.86397746),
              class = c("XY", "POINT", "sfg")
            ),
            structure(
              c(3756810.81051837,
                2888176.61058656),
              class = c("XY", "POINT", "sfg")
            )
          ),
          class = c("sfc_POINT",
                    "sfc"),
          precision = 0,
          bbox = structure(
            c(
              xmin = 3756810.81051837,
              ymin = 2888176.61058656,
              xmax = 3760964.29140438,
              ymax = 2891227.55260787
            ),
            class = "bbox"
          ),
          crs = structure(
            list(epsg = 3035L, proj4string = "+proj=laea +lat_0=52 +lon_0=10 +x_0=4321000 +y_0=3210000 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs"),
            class = "crs"
          ),
          n_empty = 0L
        )
      ),
      row.names = c(NA,
                    6L),
      sf_column = "geometry",
      agr = structure(
        c(
          geonameId = NA_integer_,
          asciiName = NA_integer_,
          members_etmun = NA_integer_,
          members_urbact = NA_integer_,
          participations_eucicop = NA_integer_,
          partners_eucicop = NA_integer_,
          projects_eucicop = NA_integer_,
          fcodeName = NA_integer_,
          fcode = NA_integer_,
          countryCode = NA_integer_,
          continentCode = NA_integer_,
          population = NA_integer_,
          CityId = NA_integer_,
          Poly.id = NA_integer_
        ),
        class = "factor",
        .Label = c("constant",
                   "aggregate", "identity")
      ),
      class = c("sf", "tbl_df", "tbl", "data.frame")
    )




What kind of function do I have to use in order to do something like this ? Thank you in advance

标签: raggregatetidyversesf

解决方案


There is probably an easier way to do this but...

You could use by to create the row of the dataframe for each polygon id, then rbind to bind them back together into a single dataframe.

I don't fully understand your 'keeping character values of the max population with an fcode starting with p' but you can adapt the code below if I haven't quite got it right:

## First change the population to a numeric:
Paris$population <- as.numeric(Paris$population)

# This will apply the function in FUN to each subset of d defined by a unique poly.id
polygons <- by(data=Paris, INDICES = list(Paris$Poly.id) , FUN = function(d){

  # Identify all the rows with fcode that starts with P
  startswithP <- d[substr(d$fcode,1,1)=="P",]

  # Now get the row with the highest population from those rows
  # There might be more than one of these so just take the first row.
  maxpop <- startswithP[startswithP$population == max(startswithP$population),][1,]

  # Now construct the row of my dataframe.
  # First sum all the numeric rows of d.
  # Then identify all the character rows of maxpop
  data.frame( as.list(apply(d[sapply(d, is.numeric)],2,sum)), 
              maxpop[sapply(maxpop, is.character)])
})

# Now I have a list of one-row dataframes I can rbind them together.
Paris.perpolygon <- do.call(rbind, polygons)                
Paris.perpolygon

推荐阅读