r - How to create a binary variable for logistic regression by using key words in text variable
Problem description
I have criminal sentencing data with a text variable that contains phrases like "2 months jail", "14 months prison", and "12 months community supervision". I would like to run a logistic regression to estimate the odds that a particular defendant is sent to prison or jail rather than released to community supervision. So I want to create a binary variable that shows a 1 for someone sent to "jail"/"prison" and a 0 for those sent to another program.
I have tried using library(qdap) but have not had any luck. I have also tried ifelse(df$text %in% "jail", "1", "0"), but it only shows 1 observation when I know there are several thousand.
Small data sample:
data<-data.frame('caseid'=c(1,2,3),'text'=c("went to prison","went to jail","released"))
caseid text
1 1 went to prison
2 2 went to jail
3 3 released
I am trying to create a binary variable, sentenced, to use in the logistic regression, like this:
caseid text sentenced
1 1 went to prison 1
2 2 went to jail 1
3 3 released 0
Thank you for any help you can offer!
Solution
You can do the following in base R:
transform(data, sentenced = +grepl("(jail|prison)", text))
# caseid text sentenced
#1 1 went to prison 1
#2 2 went to jail 1
#3 3 released 0
Explanation: the pattern "(jail|prison)" matches "jail" or "prison" anywhere in the string, and the unary operator + turns the logical output of grepl into an integer.
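To see why the %in% attempt from the question finds almost nothing, it may help to compare it with grepl side by side. The sketch below reuses the question's sample data and adds two hypothetical rows (caseid 4 and 5) for illustration; ignore.case = TRUE is an optional extra that assumes the real data may contain mixed capitalization.

```r
# Sample data from the question, plus two hypothetical rows for illustration
data <- data.frame(
  caseid = 1:5,
  text = c("went to prison", "went to jail", "released",
           "2 months Jail", "12 months community supervision")
)

# %in% tests whole-string equality: a row matches only if the entire
# text is exactly "jail", which none of these rows are
exact <- data$text %in% "jail"

# grepl tests whether the pattern occurs anywhere inside each string;
# ignore.case = TRUE (an assumption about the real data) also catches "Jail"
data$sentenced <- as.integer(grepl("jail|prison", data$text, ignore.case = TRUE))

print(sum(exact))
print(data)
```

The resulting sentenced column can then serve as the response in a call such as glm(sentenced ~ ..., family = binomial, data = data), where the predictors would come from whatever other columns the real data contains.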