How to classify multiple objects in an R data frame

Problem description

This is just a small part of the data frame I am working with:

id       drug        start        stop          dose    unit    route   
2010003  Amlodipine  2009-02-04   2009-11-19    1.5     mg      Oral    
2010003  Amlodipine  2009-11-19   2010-01-11    1.5     mg      Oral      
2010004  Cefprozil   2004-03-12   2004-03-19    175     mg      Oral    
2010004  Clobazam    2002-12-30   2003-01-01    5       mg      Oral

I have a Stata do-file that shows what I am trying to do:

replace class = "ACE Inhibitor" if strmatch(upper(drug), "CAPTOPRIL*")
replace class = "ACE Inhibitor" if strmatch(upper(drug), "ENALAPRIL*")
replace class = "ACE Inhibitor" if strmatch(upper(drug), "ENALAPRILAT*")
replace class = "ACE Inhibitor" if strmatch(upper(drug), "FOSINOPRIL*")
replace class = "ACE Inhibitor" if strmatch(upper(drug), "LISINOPRIL*")
replace class = "ACE Inhibitor" if strmatch(upper(drug), "RAMIPRIL*")
replace class = "Acne Medication" if strmatch(upper(drug), "ADAPALENE*")
replace class = "Acne Medication" if strmatch(upper(drug), "ADAPALENE/BENZOYL PEROXIDE*")
replace class = "Acne Medication" if strmatch(upper(drug), "BENZOYL PEROXIDE*")
replace class = "Acne Medication" if strmatch(upper(drug), "BENZOYL PEROXIDE/CLINDAMYCIN*")
replace class = "Acne Medication" if strmatch(upper(drug), "ISOTRETINOIN*")
replace class = "Acne Medication" if strmatch(upper(drug), "ERYTHROMYCIN/TRETINOIN*")
replace class = "Acne Medication/Acute Promyelocytic Leukemia Medication" if strmatch(upper(drug), "TRETINOIN*")
replace class = "Alpha Agonist" if strmatch(upper(drug), "XYLOMETAZOLINE*")
replace class = "Alpha Blocker" if strmatch(upper(drug), "DOXAZOSIN*")
replace class = "Alpha Blocker" if strmatch(upper(drug), "PHENOXYBENZAMINE*")
replace class = "Alpha Blocker" if strmatch(upper(drug), "PHENTOLAMINE*")
replace class = "Alpha Blocker" if strmatch(upper(drug), "PRAZOSIN*")
replace class = "Alpha Blocker" if strmatch(upper(drug), "TAMSULOSIN*")
replace class = "Alpha Blocker" if strmatch(upper(drug), "TERAZOSIN*")
replace class = "Alpha/Beta Blocker" if strmatch(upper(drug), "CARVEDILOL*")
replace class = "Alpha/Beta Blocker" if strmatch(upper(drug), "LABETALOL*")
replace class = "Alpha-1 Agonist" if strmatch(upper(drug), "PHENYLEPHRINE*")
replace class = "Alpha-1 Agonist" if strmatch(upper(drug), "MIDODRINE*")
replace class = "Alpha-2 Agonist" if strmatch(upper(drug), "CLONIDINE*")
replace class = "Alpha-2 Agonist" if strmatch(upper(drug), "DEXMEDETOMIDINE*")
replace class = "Anaesthetic, general" if strmatch(upper(drug), "KETAMINE*")
replace class = "Anaesthetic, general" if strmatch(upper(drug), "THIOPENTAL*")
replace class = "Anaesthetic, local" if strmatch(upper(drug), "BENZOCAINE*")
replace class = "Anaesthetic, local" if strmatch(upper(drug), "BUPIVACAINE*")
replace class = "Anaesthetic, local" if strmatch(upper(drug), "BUPIVACAINE/FENTANYL*")
replace class = "Anaesthetic, local" if strmatch(upper(drug), "TETRACAINE*")
replace class = "Anaesthetic, local" if strmatch(upper(drug), "XYLOCAINE*")
replace class = "Anaesthetic, local/Antiarrythmic" if strmatch(upper(drug), "LIDOCAINE*")
replace class = "Anaesthetic, local/Antiseptic" if strmatch(upper(drug), "HEXYLRESORCINOL*")
replace class = "Anaesthetic, topical" if strmatch(upper(drug), "LIDOCAINE/PRILOCAINE*")
replace class = "Anaesthetic, topical" if strmatch(upper(drug), "PROPARACAINE*")
replace class = "Analgesic" if strmatch(upper(drug), "ACETAMINOPHEN*")
replace class = "Analgesic" if strmatch(upper(drug), "BELLADONNA & OPIUM SUPPOSITORY*")

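I would like to do the same classification in R, but I don't know Stata.

Note that a drug can have more than one class.

Any suggestions and help would be greatly appreciated.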

Tags: r, dataframe, stata, text-classification

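Solution

As a first step, I would import all of the drug data from your Stata script (assuming the data is not already available in a clean, usable format):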

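# Split each line of the Stata script on the double-quote character
drug_class_data <- read.table("Desktop/stata_script", header = FALSE, sep = '"', stringsAsFactors = FALSE)
# Fields 2 and 4 hold the quoted class name and the quoted drug pattern
drug_class_data <- drug_class_data[, c(2, 4)]
colnames(drug_class_data) <- c('Drug_class', 'Drug')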

Remove the trailing * - used as a wildcard in the Stata script:

drug_class_data$Drug = gsub("\\*", "", drug_class_data$Drug)

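This gives you a data frame with 2 columns ('Drug_class' & 'Drug') - the code extracts whatever sits inside quotation marks on each line of the Stata script (the two quoted strings in the line below):

replace class = "ACE Inhibitor" if strmatch(upper(drug), "CAPTOPRIL*")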
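
If splitting on the quote character feels fragile, the same extraction can be sketched with a regular expression. This is only a minimal sketch, assuming every line has the form replace class = "..." if strmatch(upper(drug), "..."); the vector stata_lines stands in for the contents of the do-file and is hypothetical, as is reading it with readLines() in practice.

```r
# Hypothetical sample of lines from the Stata do-file
# (in practice something like readLines("Desktop/stata_script"))
stata_lines <- c(
  'replace class = "ACE Inhibitor" if strmatch(upper(drug), "CAPTOPRIL*")',
  'replace class = "Anaesthetic, local/Antiarrythmic" if strmatch(upper(drug), "LIDOCAINE*")'
)

# Pull out every double-quoted string on each line (quotes included)
quoted <- regmatches(stata_lines, gregexpr('"[^"]*"', stata_lines))

# Strip the surrounding quotes, then the trailing * wildcard
clean <- lapply(quoted, function(x) sub("\\*$", "", gsub('"', "", x)))

# First quoted string is the class, second is the drug pattern
drug_class_data <- data.frame(
  Drug_class = vapply(clean, `[`, character(1), 1),
  Drug       = vapply(clean, `[`, character(1), 2),
  stringsAsFactors = FALSE
)
print(drug_class_data)
```

Because the pattern only looks at quoted strings, commas and slashes inside a class name are left untouched.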

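I would then save this as a file that you can import whenever you need it (I am assuming this data is not yet available elsewhere, since all of these values are hard-coded in your Stata example).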

write.csv(drug_class_data, file = "drug_class_data.csv",row.names=FALSE)
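
Reading the lookup table back in later could look like the sketch below. It writes to a temporary directory so that it is self-contained; the one-row data frame and the csv_path variable are placeholders, not part of the original workflow.

```r
# Placeholder lookup table with a single row
drug_class_data <- data.frame(
  Drug_class = "ACE Inhibitor",
  Drug       = "CAPTOPRIL",
  stringsAsFactors = FALSE
)

# Write to a temporary location, then read the file back
csv_path <- file.path(tempdir(), "drug_class_data.csv")
write.csv(drug_class_data, file = csv_path, row.names = FALSE)

# stringsAsFactors = FALSE keeps both columns as plain character vectors
reloaded <- read.csv(csv_path, stringsAsFactors = FALSE)
print(reloaded)
```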

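From there, it depends on whether you want:

1) Multiple rows per drug instance, with a single text column that explicitly states the drug class. The number of rows per drug = the number of drug classes it belongs to. This approach has some advantages, but it results in a lot of duplicated data.

2) A single row per drug and one boolean column per drug class - "ACE Inhibitor", "Acne Medication", etc. - each holding TRUE or FALSE to indicate whether the drug is a member of that class.

Personally, I would lean towards option 2 as the starting point for downstream analysis. (As you mention, a drug may fall into more than one class, and several of the drug classes also appear to be hierarchical - "Anaesthetic, local" could be the parent term of "Anaesthetic, local/Antiarrythmic", "Anaesthetic, local/Antiseptic", etc.)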
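
For comparison, option 1 (one row per drug/class pair) can be sketched with a plain merge(). Both data frames below are toy data made up for illustration: the drug names are changed so that some of them match, and the lookup lists LIDOCAINE under two separate classes, which is an assumption about how you might split the combined class strings.

```r
# Toy prescription data (drug names altered so that some match the lookup)
drug_data <- data.frame(
  id   = c(2010003, 2010004, 2010004),
  drug = c("Lidocaine", "Tretinoin", "Clobazam"),
  stringsAsFactors = FALSE
)

# Hypothetical lookup in which a drug may appear under several classes
drug_class_data <- data.frame(
  Drug_class = c("Anaesthetic, local", "Antiarrythmic", "Acne Medication"),
  Drug       = c("LIDOCAINE", "LIDOCAINE", "TRETINOIN"),
  stringsAsFactors = FALSE
)

# Join on the upper-cased drug name; all.x = TRUE keeps unmatched
# drugs, which end up with Drug_class set to NA
drug_data$Drug <- toupper(drug_data$drug)
long_df <- merge(drug_data, drug_class_data, by = "Drug", all.x = TRUE)
print(long_df)
```

Each drug is repeated once per class it belongs to, which is the duplication mentioned in option 1.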

Extract all of the unique drug classes from your data frame into a list:

drug_class_list <- unique(drug_class_data[,1])

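I would then use the ugly code below to create a new data frame (drug_data is your original data frame, with the drug name in its second column):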

create_flat_table <- function(df_drugs, df_classes) {
  # Extract the unique drug classes present in the lookup table
  class_list <- unique(df_classes[, 1])
  results <- df_drugs
  # Loop over the classes, adding one logical column per class
  for (class in class_list) {
    class_drugs <- df_classes[df_classes$Drug_class == class, ]
    # TRUE where the upper-cased drug name (column 2) is listed under this class
    boolean_list <- toupper(df_drugs[, 2]) %in% class_drugs[, 2]
    results <- cbind(results, boolean_list)
  }
  colnames(results) <- c(colnames(df_drugs), class_list)
  return(results)
}

combined_df <- create_flat_table(drug_data, drug_class_data)
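
One difference from the Stata script is worth noting: %in% only matches exact names, whereas a Stata pattern such as "CAPTOPRIL*" also matches longer strings that begin with CAPTOPRIL. If your drug column contains such entries, a prefix-matching variant might look like the sketch below. The function name flag_classes_by_prefix, its drug_col argument, and the toy data are all hypothetical.

```r
flag_classes_by_prefix <- function(df_drugs, df_classes, drug_col = "drug") {
  drug_upper <- toupper(df_drugs[[drug_col]])
  results <- df_drugs
  for (cls in unique(df_classes$Drug_class)) {
    prefixes <- df_classes$Drug[df_classes$Drug_class == cls]
    # TRUE if the drug name starts with any prefix listed for this class
    results[[cls]] <- vapply(
      drug_upper,
      function(d) any(startsWith(d, prefixes)),
      logical(1),
      USE.NAMES = FALSE
    )
  }
  results
}

# Toy data: the first drug name carries extra text after the prefix
drug_data <- data.frame(
  id   = c(1, 2, 3),
  drug = c("Captopril 25mg tab", "Lidocaine", "Clobazam"),
  stringsAsFactors = FALSE
)
drug_class_data <- data.frame(
  Drug_class = c("ACE Inhibitor", "Anaesthetic, local/Antiarrythmic"),
  Drug       = c("CAPTOPRIL", "LIDOCAINE"),
  stringsAsFactors = FALSE
)

combined_df <- flag_classes_by_prefix(drug_data, drug_class_data)
print(combined_df)
```

Assigning with results[[cls]] keeps class names that contain spaces, commas, or slashes as column names without any renaming.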

The resulting data frame will look like this:

[Image: resulting data frame]

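Note that in this example I have altered the data so that at least one drug in your toy dataset matches one of the classes in your abbreviated list of drug classes.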

