Extracting the leaf indices that each sample is assigned to in the forest, from random forests (RF) in R

Problem description

I am trying to translate code from Python to R in order to do supervised dimensionality reduction with random forests and UMAP, following the instructions in this blog post.

I need to obtain an array containing the leaf index that each sample was assigned to in each tree of the forest, so that I can feed this information into the {uwot} package (for UMAP).

I would like to get this information from the following R packages: {randomForest}, {ranger}, and {extraTrees}. Of these, ranger should require the lowest computation time (critical for my project), and ExtraTrees can, in some cases, outperform RF implementations (in terms of AUC).
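
For reference, once such a leaf-index matrix is available (one row per sample, one column per tree), it can be fed to {uwot} directly, since uwot supports the Hamming metric. A minimal sketch, assuming an integer matrix leaves of terminal-node IDs like the ones extracted later in this post:

library(uwot)

# UMAP on the leaf-index matrix; under the Hamming metric each tree's
# terminal-node ID is treated as one categorical coordinate
sup_embed <- umap(leaves, metric = "hamming")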

In Python, scikit-learn's ExtraTreesClassifier returns a numpy array containing the leaf indices that each sample was assigned to in the forest, like this:

[figure: numpy array of leaf indices returned by ExtraTreesClassifier]

To keep things as comparable as possible between R and Python, we will run Python from within R via {reticulate}, following the blog post (but reducing the size of the simulated data), so that, in essence, only the various RF implementations are being compared.

Python implementation

library(reticulate)
repl_python()

Now, in the Python interpreter:

import numpy as np
import pandas as pd
import scipy as sp
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier 
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.model_selection import cross_val_predict, StratifiedKFold
from sklearn.metrics import roc_auc_score
from sklearn.datasets import make_classification
from tqdm import tqdm
from umap import UMAP  ## if this doesn't work, try: from umap.umap_ import UMAP
from pynndescent import NNDescent
from fastcluster import single
from scipy.cluster.hierarchy import cut_tree, fcluster, dendrogram
from scipy.spatial.distance import squareform
from sklearn.tree import DecisionTreeClassifier, ExtraTreeClassifier
from sklearn.model_selection import train_test_split

# setting the plot style
plt.style.use('bmh')

# let us generate some data with 10 clusters per class
X, y = make_classification(n_samples=10000, n_features=500, n_informative=5, 
                           n_redundant=0, n_clusters_per_class=10, weights=[0.80],
                           flip_y=0.05, class_sep=3.5, random_state=42)

# normalizing to eliminate scaling differences
X = pd.DataFrame(StandardScaler().fit_transform(X))

# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42) # 70% training and 30% test

Getting the leaf-index information from ExtraTreesClassifier, and plotting.

In the blog post, the author used StratifiedKFold & cross_val_predict. Here, I use train_test_split instead to create a train/test split. That way I can use the same training/test split for both Python and R, and make sure the area under the ROC curve measurements are comparable.

# model instance
et = ExtraTreesClassifier(n_estimators=100, min_samples_leaf=500, 
                          max_features=0.80, bootstrap=True, class_weight='balanced', n_jobs=-1, random_state=42)

# Train ExtraTreesClassifier
et.fit(X_train, y_train)

# Print the AUC
print('Area under the ROC Curve:', roc_auc_score(y_test, et.predict_proba(X_test)[:,1]))

Model performance is 0.8504 AUC. Now we move on to the embedding.

# let us train our model with the full data
et.fit(X, y)

# and get the leaves that each sample was assigned to
leaves = et.apply(X)

What are the dimensions of the numpy array?

leaves.shape

(10000, 100)

# calculating the embedding with hamming distance
sup_embed_et = UMAP(metric='hamming').fit_transform(leaves)

# plotting the embedding
plt.figure(figsize=(12,7), dpi=150)
plt.scatter(sup_embed_et[y == 0,0], sup_embed_et[y == 0,1], s=1, c='C0', cmap='viridis', label='$y=0$')
plt.scatter(sup_embed_et[y == 1,0], sup_embed_et[y == 1,1], s=1, c='C1', cmap='viridis', label='$y=1$')
plt.title('Supervised embedding with ExtraTrees')
plt.xlabel('$x_0$'); plt.ylabel('$x_1$')
plt.legend(fontsize=16, markerscale=5);
plt.show()

[figure: supervised embedding with ExtraTrees]

# taking a sample of the dataframe
embed_sample = pd.DataFrame(sup_embed_et).sample(5000, random_state=42)

# running fastcluster hierarchical clustering on the improved embedding
H = single(embed_sample)

# getting the clusters
clusters = cut_tree(H, height=0.35)
print('Number of clusters:', len(np.unique(clusters)))

This shows 19 clusters (20 were simulated).

# creating a dataframe for the clustering sample
clust_sample_df = pd.DataFrame({'cluster': clusters.reshape(-1), 'cl_sample':range(len(clusters))})

# creating an index with the sample used for clustering
index = NNDescent(embed_sample, n_neighbors=10)

# querying for all the data
nn = index.query(sup_embed_et, k=1)

# creating a dataframe with nearest neighbors for all samples
to_cluster_df = pd.DataFrame({'sample':range(sup_embed_et.shape[0]), 'cl_sample': nn[0].reshape(-1)})

# merging to assign cluster to all other samples, and tidying it
final_cluster_df = to_cluster_df.merge(clust_sample_df, on='cl_sample')
final_cluster_df = final_cluster_df.set_index('sample').sort_index()

# plotting the embedding
plt.figure(figsize=(12,7), dpi=150)
plt.scatter(sup_embed_et[:,0], sup_embed_et[:,1], s=1, c=final_cluster_df['cluster'], cmap='plasma')
plt.title('Hierarchical clustering and extraTrees')
plt.xlabel('$x_0$'); plt.ylabel('$x_1$')
plt.show()
exit

[figure: hierarchical clustering on the ExtraTrees embedding]

R implementation

Now, in R, create a data.frame from the same data we simulated in Python, and reuse the train/test split produced by sklearn's train_test_split:

library(dplyr)

df <- data.frame(py$X)
df$labels <- as.factor(py$y) # convert to factor for classification

d_train <- data.frame(py$X_train)
d_test <- data.frame(py$X_test)
d_train$labels <- py$y_train
d_test$labels <- py$y_test

# Identify the response column
ycol <- "labels"

# Identify the predictor columns
xcols <- setdiff(names(d_train), ycol)

# Convert response to factor (required by randomForest)
d_train[,ycol] <- as.factor(d_train[,ycol])
d_test[,ycol] <- as.factor(d_test[,ycol])

randomForest

First, the randomForest package (one of the oldest and best-known packages, though not the most optimized):

library(randomForest)
library(cvAUC)

# Train a default RF model with 100 trees
set.seed(123)  # For reproducibility
system.time(
  model <- randomForest(
    x = d_train[,xcols], 
    y = d_train[,ycol],
    xtest = d_test[,xcols],
    ntree = 100,
    nodes = TRUE # intended to keep leaf-assignment info (note: nodes is documented for predict.randomForest, which is how the leaves are extracted below)
    )
  )  ## user: 30.835 system: 0.008 elapsed: 30.837

# Generate predictions on test dataset
preds <- model$test$votes[, 2]
labels <- d_test[,ycol]

# Compute AUC on the test set
cvAUC::AUC(predictions = preds, labels = labels)

Model performance is 0.8919 AUC. That is somewhat better than what we saw in Python (though a little slower).

Now, just as we did in Python, we need to train on and apply the model to the whole dataset, keeping track of which leaves in the forest each sample is assigned to.

set.seed(123)  # For reproducibility
md_full <- randomForest(formula = labels ~ ., data = df, ntree = 100, keep.forest = TRUE)
phat_full <- predict(md_full, newdata = df, type = "prob", nodes = TRUE)

# get the leaf indices that each sample was assigned to in the forest
leaves <- attr(phat_full, "nodes")
dim(leaves)
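
If everything lines up with the Python run, this should be a 10000 x 100 matrix (one row per sample, one column per tree), mirroring leaves.shape in the Python section. A quick sanity check, assuming the objects above:

stopifnot(nrow(leaves) == nrow(df))  # one row per sample
stopifnot(ncol(leaves) == 100)       # one column per tree (ntree)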

Bring it back to Python with repl_python() and plot:

# Assign R object as Python object
leafs = r.leaves
# Get embeddings from UMAP
sup_embed_rf = UMAP(metric='hamming').fit_transform(leafs)

# plotting the embedding
plt.figure(figsize=(12,7), dpi=150)
plt.scatter(sup_embed_rf[y == 0,0], sup_embed_rf[y == 0,1], s=1, c='C0', cmap='viridis', label='$y=0$')
plt.scatter(sup_embed_rf[y == 1,0], sup_embed_rf[y == 1,1], s=1, c='C1', cmap='viridis', label='$y=1$')
plt.title('Supervised embedding with randomForest')
plt.xlabel('$x_0$'); plt.ylabel('$x_1$')
plt.legend(fontsize=16, markerscale=5);
plt.show()
exit

[figure: supervised embedding with randomForest]

# taking a sample of the dataframe
embed_sample = pd.DataFrame(sup_embed_rf).sample(5000, random_state=42)

# running fastcluster hierarchical clustering on the improved embedding
H = single(embed_sample)

# getting the clusters
clusters = cut_tree(H, height=0.35)
print('Number of clusters:', len(np.unique(clusters)))

This shows 12 clusters (even though 20 were simulated). I am confused: given that this approach achieved a higher AUC, why do I find fewer clusters? (One thing to keep in mind is that cut_tree is applied at a fixed height of 0.35, and the number of clusters recovered at a fixed cut height is sensitive to the overall scale of each embedding, which can differ between implementations.)

# creating a dataframe for the clustering sample
clust_sample_df = pd.DataFrame({'cluster': clusters.reshape(-1), 'cl_sample':range(len(clusters))})

# creating an index with the sample used for clustering
index = NNDescent(embed_sample, n_neighbors=10)

# querying for all the data
nn = index.query(sup_embed_rf, k=1)

# creating a dataframe with nearest neighbors for all samples
to_cluster_df = pd.DataFrame({'sample':range(sup_embed_rf.shape[0]), 'cl_sample': nn[0].reshape(-1)})

# merging to assign cluster to all other samples, and tidying it
final_cluster_df = to_cluster_df.merge(clust_sample_df, on='cl_sample')
final_cluster_df = final_cluster_df.set_index('sample').sort_index()

# plotting the embedding
plt.figure(figsize=(12,7), dpi=150)
plt.scatter(sup_embed_rf[:,0], sup_embed_rf[:,1], s=1, c=final_cluster_df['cluster'], cmap='plasma')
plt.title('Hierarchical clustering and randomForest')
plt.xlabel('$x_0$'); plt.ylabel('$x_1$')
plt.show()
exit

[figure: hierarchical clustering on the randomForest embedding]

Other R packages are known to outperform randomForest in both accuracy and speed, so I would also like to get this information from ranger and from ExtraTrees. For example, ranger offers excellent speed and supports high-dimensional or wide data (e.g., scRNA sequencing data).

ranger

Here is what I have so far with the ranger approach:

library(ranger)
library(pROC)
set.seed(123)

# ranger speed
system.time(
  df_ranger <- ranger(
    formula = labels ~ .,
    data = d_train,
    num.trees = 100,
    num.threads = 1, # default is the number of CPUs on machine
    probability = TRUE
  )
) # user 13.047 system: 0.01 elapsed: 13.048

# probability predictions (type = "response" is the default) for scoring
pred.ranger <- predict(df_ranger, data = d_test)
# get model accuracy
ranger.roc <- roc(d_test$labels, pred.ranger$predictions[,2])
pROC::auc(ranger.roc)

The model originally appeared to perform poorly, at 0.5471 AUC. That figure, however, was computed from the output of predict(..., type = "terminalNodes"), which contains terminal-node IDs rather than class probabilities, so it is not a meaningful score; the corrected code above uses the default probability predictions instead.

set.seed(123)
df_ranger <- ranger(formula = labels ~ ., data = df, num.trees = 100, probability = TRUE)

pred.ranger <- predict(df_ranger, data = df, type = "terminalNodes")
lyves <- pred.ranger$predictions
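
As with randomForest, this should yield one terminal-node ID per tree for every sample. A quick check, assuming the objects above:

stopifnot(nrow(lyves) == nrow(df))  # one row per sample
stopifnot(ncol(lyves) == 100)       # one column per tree (num.trees)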

Now enter repl_python() and, in Python, type:

# Assign R object as Python object
lyves = r.lyves
# Get embeddings from UMAP
sup_embed_rg = UMAP(metric='hamming').fit_transform(lyves)

# plotting the embedding
plt.figure(figsize=(12,7), dpi=150)
plt.scatter(sup_embed_rg[y == 0,0], sup_embed_rg[y == 0,1], s=1, c='C0', cmap='viridis', label='$y=0$')
plt.scatter(sup_embed_rg[y == 1,0], sup_embed_rg[y == 1,1], s=1, c='C1', cmap='viridis', label='$y=1$')
plt.title('Supervised embedding with ranger')
plt.xlabel('$x_0$'); plt.ylabel('$x_1$')
plt.legend(fontsize=16, markerscale=5);
plt.show()

[figure: supervised embedding with ranger]

# taking a sample of the dataframe
embed_sample = pd.DataFrame(sup_embed_rg).sample(5000, random_state=42)

# running fastcluster hierarchical clustering on the improved embedding
H = single(embed_sample)

# getting the clusters
clusters = cut_tree(H, height=0.35)
print('Number of clusters:', len(np.unique(clusters)))

This shows 17 clusters (even though 20 were simulated).

# creating a dataframe for the clustering sample
clust_sample_df = pd.DataFrame({'cluster': clusters.reshape(-1), 'cl_sample':range(len(clusters))})

# creating an index with the sample used for clustering
index = NNDescent(embed_sample, n_neighbors=10)

# querying for all the data
nn = index.query(sup_embed_rg, k=1)

# creating a dataframe with nearest neighbors for all samples
to_cluster_df = pd.DataFrame({'sample':range(sup_embed_rg.shape[0]), 'cl_sample': nn[0].reshape(-1)})

# merging to assign cluster to all other samples, and tidying it
final_cluster_df = to_cluster_df.merge(clust_sample_df, on='cl_sample')
final_cluster_df = final_cluster_df.set_index('sample').sort_index()

# plotting the embedding
plt.figure(figsize=(12,7), dpi=150)
plt.scatter(sup_embed_rg[:,0], sup_embed_rg[:,1], s=1, c=final_cluster_df['cluster'], cmap='plasma')
plt.title('Hierarchical clustering and ranger')
plt.xlabel('$x_0$'); plt.ylabel('$x_1$')
plt.show()
exit

[figure: hierarchical clustering on the ranger embedding]

extraTrees

The extraTrees package is probably the closest comparison to scikit-learn's ExtraTreesClassifier.

library(extraTrees)
y <- "labels"
characteristics <- setdiff(names(d_train), y)
train <- d_train[, characteristics]
test <- d_test[, characteristics]
set.seed(123)

system.time({
model_extraTrees <- extraTrees(x = train, 
                               y = as.factor(d_train$labels),
                               ntree = 500,
                               numThreads = 1 # the default is 1; set explicitly here for a fair single-core comparison
                               )
}) # user 10.184 system: 0.104 elapsed: 9.770

UPDATE: it is not currently possible to get the leaf indices from {extraTrees}, so I have filed a feature request.

h2o

In this excellent benchmark repo, the machine learning library h2o was shown to achieve higher AUC, so I want to try it as well. As with all the other methods, I will use 1 core and ntrees = 100.

[figure: benchmark results from the repo]

library(h2o)
h2o.init(nthreads = 1)

# convert data to h2o objects
train <- as.h2o(d_train)
test <- as.h2o(d_test)

# Convert response to factor (so h2o treats the task as classification)
train[,ycol] <- as.factor(train[,ycol])
test[,ycol] <- as.factor(test[,ycol])

system.time(
  model <- h2o.randomForest(
    x = xcols,
    y = ycol,
    training_frame = train,
    seed = 123,
    ntrees = 100
    )
  ) ## user: 0.199 system: 0.018 elapsed: 18.399

perf <- h2o.performance(model = model, newdata = test)
h2o.auc(perf)

Model performance is 0.9077 AUC, the best so far. I am not sure whether it is possible to get the assigned leaf indices here?
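
For what it's worth, h2o's tree-based models appear to expose per-tree leaf assignments through h2o.predict_leaf_node_assignment(). A sketch, assuming the model and frames above (the type = "Node_ID" option may depend on the h2o version installed):

# per-tree leaf assignments, returned as an H2OFrame with one column per tree
leaf_assign <- h2o.predict_leaf_node_assignment(model, newdata = train, type = "Node_ID")

# pull into R as a plain matrix (n_samples x ntrees), ready for UMAP
leaves_h2o <- as.matrix(as.data.frame(leaf_assign))
dim(leaves_h2o)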

There are at least 32 R packages for random forests.

If anyone is familiar with other well-performing packages from which the leaf indices can be obtained, I would love to hear about them. And if you spot any mistakes, please let me know. Thanks!

Tags: r, random-forest, r-ranger
