How to create a binary variable for logistic regression by using key words in text variable

Problem description

I have criminal sentencing data with a text variable that holds phrases like "2 months jail", "14 months prison", "12 months community supervision." I would like to run a logistic regression to estimate the odds that a particular defendant is sent to prison or jail rather than released to community supervision. So I want to create a binary variable that shows a 1 for someone sent to "jail"/"prison" and a 0 for those sent to another program.

I have tried using library(qdap) but have not had any luck. I have also tried ifelse(df$text %in% "jail", "1", "0") but it only shows 1 observation when I know there are several thousand.

Small data sample:

data<-data.frame('caseid'=c(1,2,3),'text'=c("went to prison","went to jail","released"))

  caseid           text
1      1 went to prison
2      2   went to jail
3      3       released

I am trying to create a binary variable, sentenced, to use as the outcome in the logistic regression, like this:

  caseid           text sentenced
1      1 went to prison         1
2      2   went to jail         1
3      3       released         0

Thank you for any help you can offer!

Tags: r, text, nlp

Solution


You can do the following in base R:

transform(data, sentenced = +grepl("(jail|prison)", text))
#  caseid           text sentenced
#1      1 went to prison         1
#2      2   went to jail         1
#3      3       released         0

Explanation: the pattern "(jail|prison)" matches any string that contains "jail" or "prison". grepl returns a logical vector, and the unary operator + coerces it to integer (TRUE becomes 1, FALSE becomes 0).
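As a minimal sketch of why the %in% attempt from the question finds so few rows: %in% compares whole elements for equality, so only a row whose text is exactly "jail" would count, whereas grepl looks for the pattern anywhere inside each string. The last step is an assumed variant, not part of the original answer: it ignores case and requires whole-word matches, which may or may not suit the real data.

```r
# Sample data from the question
data <- data.frame(caseid = c(1, 2, 3),
                   text = c("went to prison", "went to jail", "released"))

# %in% tests whole-string equality: no row is exactly "jail"
exact <- data$text %in% "jail"

# grepl() tests whether the pattern occurs anywhere in each string
partial <- grepl("jail|prison", data$text)

# Assumed variant: case-insensitive, whole words only (\\b is a word boundary)
data$sentenced <- as.integer(
  grepl("\\b(jail|prison)\\b", data$text, ignore.case = TRUE, perl = TRUE)
)

print(exact)    # FALSE FALSE FALSE
print(partial)  # TRUE TRUE FALSE
print(data)
```

The resulting 0/1 column can then be used as the response in glm(..., family = binomial) together with whatever predictors the full data set provides.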

