首页 > 解决方案 > Adding columns by splitting number, and removing duplicates

问题描述

I have a dataframe like the following (this is a reduced example, I have many more rows and columns):

    CH1  CH2  CH3
1  3434  282 7622
2  4442 6968 8430
3  4128 6947  478
4  6718 6716 3017
5  3735 9171 1128
6    65 4876 4875
7  9305 6944 3309
8  4283 6060  650
9  5588 2285  203
10  205 2345 9225
11 8634 4840  780
12 6383    0 1257
13 4533 7692 3760
14 9363 9846 4697
15 3892   79 4372
16 6130 5312 9651
17 7880 7386 6239
18 8515 8021 2295
19 1356   74 8467
20 9024 8626 4136

I need to create additional columns by splitting the values. For example, value 1356 would have to be split into 6, 56, and 356. I do this on a for loop splitting by string. I do this to keep the leading zeros. So far, decent.

# CREATE ADDITIONAL COLUMNS
for(col in 1:3) {

  # Create a temporal variable
  temp <- as.character(data[,col] )

  # Save the new column
  for(mod in c(-1, -2, -3)) {
    # Create the column
    temp <- cbind(temp, str_sub(as.character(data[,col]), mod))
  }

  # Merge to the row
  data <- cbind(data, temp)

}

My problem is that not all cells have 4 digits: some may have 1, 2 or 3 digits. Therefore, I get repeated values when I split. For example, for 79 I get: 79 (original), 9, 79, 79, 79.

Problem: I need to remove the repeated values. Of course, I could do unique, but that gives me rows of uneven number of columns. I need to fill those missing (i.e. the removed repeated values) with NA. I can only compare this by row.

I checked CJ Yetman's answer here, but they only replace consecutive numbers. I only need to keep unique values.

Reproducible Example: Here is a fiddle with my code working: http://rextester.com/IKMP73407

Expected outcome: For example, for rows 11 & 12 of the example (see the link for the reproducible example), if this is my original:

8634  4 34 634 4840  0 40 840  780  0 80 780    
6383  3 83 383    0  0  0   0 1257  7 57 257

I'd like to get this:

8634  4 34 634 4840  0 40 840  780  NA 80 NA    
6383  3 83 383    0  NA  NA   NA 1257  7 57 257

标签: rdataframeunique

解决方案


You can use apply():

The data:

data <- structure(list(CH1 = c(3434L, 4442L, 4128L, 6718L, 3735L, 65L, 
                               9305L, 4283L, 5588L, 205L, 8634L, 6383L, 4533L, 9363L, 3892L, 
                               6130L, 7880L, 8515L, 1356L, 9024L), CH2 = c(282L, 6968L, 6947L, 
                                                                           6716L, 9171L, 4876L, 6944L, 6060L, 2285L, 2345L, 4840L, 0L, 7692L, 
                                                                           9846L, 79L, 5312L, 7386L, 8021L, 74L, 8626L), CH3 = c(7622L, 
                                                                                                                                 8430L, 478L, 3017L, 1128L, 4875L, 3309L, 650L, 203L, 9225L, 780L, 
                                                                                                                                 1257L, 3760L, 4697L, 4372L, 9651L, 6239L, 2295L, 8467L, 4136L
                                                                           )), .Names = c("CH1", "CH2", "CH3"), row.names = c(NA, 20L), class = "data.frame")

Select row 11 and 12:

data <- data[11:12, ]

Using your code:

# CREATE ADDITIONAL COLUMNS
for(col in 1:3) {

  # Create a temporal variable
  temp <- data[,col]

  # Save the new column
  for(mod in c(10, 100, 1000)) {
    # Create the column
    temp <- cbind(temp, data[, col] %% mod)
  }

  data <- cbind(data, temp)
}


data[,1:3] <- NULL

The result is:

   temp V2 V3  V4 temp V2 V3  V4 temp V2 V3  V4
11 8634  4 34 634 4840  0 40 840  780  0 80 780
12 6383  3 83 383    0  0  0   0 1257  7 57 257

Then go through the data row by row and remove duplicates and transpose the outcome:

t(apply(data, 1, function(row) {
  row[duplicated(row)] <- NA
  return(row)
}))

The result is:

   temp V2 V3  V4 temp V2 V3  V4 temp V2 V3  V4
11 8634  4 34 634 4840  0 40 840  780 NA 80  NA
12 6383  3 83 383    0 NA NA  NA 1257  7 57 257

推荐阅读