首页 > 解决方案 > r data frame- convert values to text and rank from most to least occurrences per row

问题描述

New to R and having an issue converting formulas from from excel to R. Any advice would be great. I have a data frame called allwins_df, where I want to change each value to the first letter of the column name.

Then I need to calculate the number of occurrences of each letter on each date and rank them from highest to lowest number of occurrences. Lastly, if there is a tie the tie-winner is the letter who's underlying number was the greatest. i.e on 1/5/01 one of the B values had a higher value than the A's so B is first, then A, then C. Sample of data frame:

Date        A.B  A.C  B.A   B.C  C.A  C.B…
2001-01-01  N/A  NA   0.14  0.35 0.43 NA
2001-01-02  0.8  NA   NA    0.5  NA   0.32
2001-01-03  0.75 0.8  NA    0.65 NA   0.9
2001-01-04  NA   0.66 0.91  NA   NA   NA
2001-01-05  0.52 0.62 0.48  0.82 0.4  NA

With some help I used the following code to turn the values into letters but not sure if I can use the resulting output to rank from most to least occurrences and handle ties.

 cols <- sub("\\..*", "",names(allwins_df)[-1])
    mat <- which(!is.na(allwins_df[-1]), arr.ind = TRUE)
    allwins_df[-1][mat]<- allwins_df[-1][mat] <- cols[mat[, 2]]

Output from code above -only changing values to letters:

Date        A.B A.C B.A B.C C.A C.B…
2001-01-01  N/A NA  B   B   C   NA
2001-01-02  A   A   NA  B   NA  C
2001-01-03  A   A   NA  B   NA  C
2001-01-04  NA  A   B   NA  NA  NA
2001-01-05  A   A   B   B   C   NA

Final Goal Output Example:

Date            
2001-01-01  B   C   
2001-01-02  A   B   C
2001-01-03  A   C   B
2001-01-04  B   A   
2001-01-05  B   A   C

Thanks in advance.

标签: rdataframeranking

解决方案


Here's a possible strategy with tidyverse libraries tidyr and dplyr

Using the data

dd<-read.table(text="Date        A.B  A.C  B.A   B.C  C.A  C.B
2001-01-01  N/A  NA   0.14  0.35 0.43 NA
2001-01-02  0.8  NA   NA    0.5  NA   0.32
2001-01-03  0.75 0.8  NA    0.65 NA   0.9
2001-01-04  NA   0.66 0.91  NA   NA   NA
2001-01-05  0.52 0.62 0.48  0.82 0.4  NA", header=TRUE, na.strings=c("NA","N/A"))

We can do the following

library(tidyr)
library(dplyr)
dd %>% 
  pivot_longer(-Date) %>% 
  separate(name, c("first","second")) %>% 
  group_by(Date, first) %>% 
  filter(!is.na(value)) %>% 
  summarize(count=n(), max=max(value)) %>% 
  arrange(Date, desc(count), desc(max)) %>% 
  mutate(rank=row_number()) %>% 
  pivot_wider(Date, names_from=rank, values_from=first, values_fill=NA)

Which returns

  Date       `1`   `2`   `3`  
  <chr>      <chr> <chr> <chr>
1 2001-01-01 B     C     NA   
2 2001-01-02 A     B     C    
3 2001-01-03 A     C     B    
4 2001-01-04 B     A     NA   
5 2001-01-05 B     A     C   

The idea is that you reshape your data into a more tidy format. Here we turn those columns into rows. Then split the names up so can we look at just the first letter. Then we count them and keep track of the max value for each better. We sort them, and then we reshape them to the wide format that matches your desired shape. Though you might want to rethink that because such a shape isn't always easy to work with in R. "Tidy" data is typically "rectangular" where each row has the same number of columns. Having ragged data like isn't pretty but it all depends on what you are going to do down stream.

Note that in general you want to be careful about trying to directly translate code from excel to R. These two programs work very differently and sometime solving the same problem will involve very different strategy to be efficient in the different environments.


推荐阅读