How to get the most frequent distinct values from a dataset

Problem description

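I'm playing around with Los Angeles police data that I got through the Mayor's Office website. For 2017 to 2018, I'm trying to see which charges were given out in Council District 5 and how many times each specific charge occurred. CHARGE and CITY_COUNCIL_DIST are the two variables/columns I'm looking at.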

I used table(ArrestData$CHARGE) to count the occurrences of each distinct value.

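I realized there are more than 2,400 unique entries, so most of the output was omitted. I'd like to know whether there is code to see the top 5 charges that the LAPD gives out.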

In addition, I'm trying to find the top 5 charges within one specific Council District (again, another variable/column). Is there code for this?

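Aside: how do I add sample data to my post? What are the steps to do this in RStudio? Someone asked me to do this in a previous post, but I don't know how. They told me to use dput(head(df,n)), but my data is too large, even with 10 rows. They told me to do it through an RScript, but I'm not sure what they meant.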

Tags: r

Solution


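Posting a reference to the actual dataset or to sample data would help others create a solution, and it would help the post meet the reproducibility standards that others have mentioned. For the sake of this example, we will create a dataset explicitly.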

ArrestData <- data.frame(
  # nine charges, from CHARGEA (18 rows) down to CHARGEI (2 rows)
  CHARGE = rep(paste0("CHARGE", LETTERS[1:9]), times = seq(18, 2, by = -2)),
  # districts alternate 0, 5, 0, 5, ... across the 90 rows
  CITY_COUNCIL_DIST = rep(c(0, 5), times = 45)
)
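
On the aside about sharing sample data: when dput(head(df, n)) prints too much, one option is to keep only the columns the question is about before calling dput(). The sketch below is only an illustration. RawArrests and OTHER_COLUMN are hypothetical stand-ins for the full downloaded data, not names from the real dataset.

```r
# Hypothetical stand-in for the full downloaded data (names are assumptions)
RawArrests <- data.frame(
  CHARGE = c("CHARGEA", "CHARGEB", "CHARGEA"),
  CITY_COUNCIL_DIST = c(0, 5, 5),
  OTHER_COLUMN = c("x", "y", "z")  # stands for columns the question does not need
)

# Keep only the relevant columns, then take at most the first 10 rows
sample_rows <- head(RawArrests[, c("CHARGE", "CITY_COUNCIL_DIST")], 10)

# dput() prints R code that recreates sample_rows; paste that output into the post
dput(sample_rows)
```

Dropping the unused columns first is usually what keeps the dput() output short enough to paste.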

Assuming your dataset is named ArrestData and your CHARGE and CITY_COUNCIL_DIST columns are named as described, this code should work. The code below returns the top 5 CHARGE values for every CITY_COUNCIL_DIST.

# Install these packages if you do not have them (only needed once)
install.packages("magrittr")
install.packages("dplyr")

# Load the libraries (dplyr also re-exports the %>% pipe from magrittr)
library(magrittr)
library(dplyr)

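ArrestData %>%
  # count arrests for each charge within each council district
  group_by(CHARGE, CITY_COUNCIL_DIST) %>%
  summarize(count = n()) %>%
  # sort by district, most frequent charges first
  arrange(CITY_COUNCIL_DIST, desc(count)) %>%
  # rank charges within each district; tied counts share the lowest rank
  group_by(CITY_COUNCIL_DIST) %>%
  mutate(rank = rank(desc(count), ties.method = "min")) %>%
  # keep the top 5 per district (ties can return more than 5 rows)
  filter(rank <= 5)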
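
The question also asks for the top 5 charges overall, regardless of district. A minimal base R sketch of one way to get that, assuming the same column names and rebuilding the 90-row sample data with rep() so the snippet runs on its own:

```r
# Sample data: CHARGEA appears 18 times, CHARGEB 16, ... CHARGEI 2
ArrestData <- data.frame(
  CHARGE = rep(paste0("CHARGE", LETTERS[1:9]), times = seq(18, 2, by = -2)),
  CITY_COUNCIL_DIST = rep(c(0, 5), times = 45)
)

# Count every charge, sort the counts from largest to smallest, keep the first 5
top5_overall <- head(sort(table(ArrestData$CHARGE), decreasing = TRUE), 5)

print(top5_overall)
# On the sample data: CHARGEA 18, CHARGEB 16, CHARGEC 14, CHARGED 12, CHARGEE 10
```

Sorting the table before calling head() avoids the truncated output that printing all 2,400+ counts would produce.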

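To keep only the results for CITY_COUNCIL_DIST 5, change the filter statement to something like the following (depending on the actual values in your CITY_COUNCIL_DIST column):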

filter(rank<=5, CITY_COUNCIL_DIST==5)
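
For completeness, here is a sketch of the whole pipeline with that filter in place. It assumes dplyr is installed and that district 5 is stored as the number 5, and it rebuilds the sample data so it runs on its own. The name top5_district5 is introduced here purely for illustration.

```r
library(dplyr)

# Sample data: each charge is split evenly between districts 0 and 5
ArrestData <- data.frame(
  CHARGE = rep(paste0("CHARGE", LETTERS[1:9]), times = seq(18, 2, by = -2)),
  CITY_COUNCIL_DIST = rep(c(0, 5), times = 45)
)

top5_district5 <- ArrestData %>%
  group_by(CHARGE, CITY_COUNCIL_DIST) %>%
  summarize(count = n()) %>%
  arrange(CITY_COUNCIL_DIST, desc(count)) %>%
  group_by(CITY_COUNCIL_DIST) %>%
  mutate(rank = rank(desc(count), ties.method = "min")) %>%
  # keep the top 5 charges, and only the rows for district 5
  filter(rank <= 5, CITY_COUNCIL_DIST == 5)

print(top5_district5)
```

If the real column stores districts as text, the comparison would need to be CITY_COUNCIL_DIST == "5" instead.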
