How can I combine PCA with SOM to get proper clusters of data points in Python?

Problem description

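I have about 5 different cases, and from each case I extracted about 13 to 14 statistical features. I want to build an anomaly-detection-like approach in which I reduce the feature matrix with principal component analysis (PCA) and then use a self-organizing map (SOM) to help organize the clusters so that they become more distinct. I then thought of using the following code, which can perform anomaly detection (I got it from this link: Machine learning for anomaly detection and condition monitoring):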

The question is as follows:

Code:

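import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from minisom import MiniSom

def cov_matrix(data, verbose=False):
    # Covariance of the columns (features); rows are observations.
    covariance_matrix = np.cov(data, rowvar=False)
    # Raise instead of printing, so the caller never unpacks a None result.
    if not is_pos_def(covariance_matrix):
        raise ValueError("Covariance matrix is not positive definite!")
    inv_covariance_matrix = np.linalg.inv(covariance_matrix)
    if not is_pos_def(inv_covariance_matrix):
        raise ValueError("Inverse of covariance matrix is not positive definite!")
    return covariance_matrix, inv_covariance_matrix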

def MahalanobisDist(inv_cov_matrix, mean_distr, data, verbose=False):
    inv_covariance_matrix = inv_cov_matrix
    vars_mean = mean_distr
    diff = data - vars_mean
    md = []
    for i in range(len(diff)):
        md.append(np.sqrt(diff[i].dot(inv_covariance_matrix).dot(diff[i])))
    return md

def MD_detectOutliers(dist, extreme=False, verbose=False):
    k = 3. if extreme else 2.
    threshold = np.mean(dist) * k
    outliers = []
    for i in range(len(dist)):
        if dist[i] >= threshold:
            outliers.append(i)  # index of the outlier
    return np.array(outliers)

def MD_threshold(dist, extreme=False, verbose=False):
    k = 3. if extreme else 2.
    threshold = np.mean(dist) * k
    return threshold

def is_pos_def(A):
    if np.allclose(A, A.T):
        try:
            np.linalg.cholesky(A)
            return True
        except np.linalg.LinAlgError:
            return False
    else:
        return False

## Get the statistical features
## Form matrix
## Obtain the principal components
## Apply the SOM to the principal components (I am using MiniSom)
# Initialization of SOM and training:
som_shape = (1, 5)
full_PCA_dataframe_np = full_pca_dataframe.to_numpy()
som = MiniSom(som_shape[0], som_shape[1], full_PCA_dataframe_np.shape[1], sigma=.4, learning_rate=.15, neighborhood_function='gaussian')
som.train_batch(full_PCA_dataframe_np, 8000, verbose=True)

# each neuron represents a cluster
winner_coordinates = np.array([som.winner(x) for x in full_PCA_dataframe_np]).T
# with np.ravel_multi_index we convert the bidimensional coordinates to a monodimensional index
cluster_index = np.ravel_multi_index(winner_coordinates, som_shape)

# plotting the clusters using the first 2 dimensions of the data
for c in np.unique(cluster_index):
    plt.scatter(full_PCA_dataframe_np[cluster_index == c, 0], full_PCA_dataframe_np[cluster_index == c, 1], label='cluster='+str(c), alpha=.5)

# plotting centroids
for centroid in som.get_weights():
    plt.scatter(centroid[:, 0], centroid[:, 1], marker='x',  s=25, linewidths=5, color='k', label='centroid')
plt.legend()
plt.show()

## Get the datapoints and Implement the Mahalanobis distance metric on each case:
data_train = np.array(X_train_PCA.values) # Say Case 1
data_test = np.array(X_test_PCA.values) # Say Case 3

# Obtain the covariance matrix and implement Mahalanobis distance:
# (the result is named `cov` so that it does not shadow the cov_matrix function)
cov, inv_cov_matrix = cov_matrix(data_train)
mean_distr = data_train.mean(axis=0)
dist_test = MahalanobisDist(inv_cov_matrix, mean_distr, data_test, verbose=False)
dist_train = MahalanobisDist(inv_cov_matrix, mean_distr, data_train, verbose=False)
threshold = MD_threshold(dist_train, extreme = True)

# Form matrix with anomaly column:
anomaly_train = pd.DataFrame()
anomaly_train['Mob dist']= dist_train
anomaly_train['Thresh'] = threshold
# If Mob dist above threshold: Flag as anomaly
anomaly_train['Anomaly'] = anomaly_train['Mob dist'] > anomaly_train['Thresh']
anomaly_train.index = X_train_PCA.index
anomaly = pd.DataFrame()
anomaly['Mob dist']= dist_test
anomaly['Thresh'] = threshold
# If Mob dist above threshold: Flag as anomaly
anomaly['Anomaly'] = anomaly['Mob dist'] > anomaly['Thresh']
anomaly.index = X_test_PCA.index
anomaly.head()
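
As a side note on the distance step above, the per-row loop in MahalanobisDist can also be written as one vectorized NumPy expression. The following is only a sketch under stated assumptions: the helper name `mahalanobis_distances` is hypothetical, and the random arrays are synthetic stand-ins for the PCA scores, not the real data. The threshold follows the same rule as MD_threshold with extreme=True (three times the mean training distance).

```python
import numpy as np

def mahalanobis_distances(data, mean, inv_cov):
    """Mahalanobis distance of every row of `data` from `mean` (hypothetical helper)."""
    diff = data - mean
    # Row-wise quadratic form diff_i^T @ inv_cov @ diff_i, without a Python loop.
    squared = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)
    # Clip tiny negative values caused by floating-point round-off.
    return np.sqrt(np.clip(squared, 0.0, None))

# Synthetic stand-ins for the PCA scores (assumed shapes, not the real data).
rng = np.random.default_rng(0)
train = rng.normal(size=(200, 2))
# The last test row is placed far from the training cloud on purpose.
test = np.vstack([rng.normal(size=(20, 2)), [[10.0, 10.0]]])

mean = train.mean(axis=0)
inv_cov = np.linalg.inv(np.cov(train, rowvar=False))

dist_train = mahalanobis_distances(train, mean, inv_cov)
dist_test = mahalanobis_distances(test, mean, inv_cov)

# Same rule as MD_threshold(..., extreme=True): 3 x mean training distance.
threshold = 3.0 * dist_train.mean()
is_anomaly = dist_test > threshold
```

Here `is_anomaly` is a boolean array aligned with the rows of `test`, so it can be assigned directly to an 'Anomaly' column of a DataFrame.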

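Another question about the SOM: the PCA output that I feed into the SOM has 50 rows and 2 columns, and it contains 5 clusters. What do I need to pass to the SOM as input? Here is my code using MiniSom: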

# Initialization of SOM and training:
som_shape = (7, 7)
full_PCA_dataframe_np = full_pca_dataframe.to_numpy()
som = MiniSom(som_shape[0], som_shape[1], full_PCA_dataframe_np.shape[1], sigma=.5, learning_rate=.5, neighborhood_function='gaussian')
som.train_batch(full_PCA_dataframe_np, 8000, verbose=True)
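
On the input question: MiniSom's third constructor argument, `input_len`, is the number of features per sample, so for a 50 x 2 PCA matrix it is 2 (which `full_PCA_dataframe_np.shape[1]` already supplies), and the training data is the 50 x 2 array itself. The grid shape determines how many neurons exist, and therefore how many distinct clusters the winner lookup can produce: a 7 x 7 map can spread the points over as many as 49 neurons, while a 1 x 5 map yields at most 5. The NumPy-only sketch below illustrates this. The `winner_index` helper is hypothetical and only mimics a best-matching-unit lookup under Euclidean distance; the random weights and samples are stand-ins, not a trained SOM.

```python
import numpy as np

def winner_index(weights, samples):
    """Flat index of the closest grid neuron for each sample (Euclidean distance).

    weights: array of shape (rows, cols, n_features)
    samples: array of shape (n_samples, n_features)
    """
    rows, cols, n_features = weights.shape
    flat_weights = weights.reshape(rows * cols, n_features)
    # Pairwise distances, shape (n_samples, rows * cols).
    dists = np.linalg.norm(samples[:, None, :] - flat_weights[None, :, :], axis=2)
    return dists.argmin(axis=1)

rng = np.random.default_rng(0)
samples = rng.normal(size=(50, 2))  # stand-in for the 50 x 2 PCA scores

for som_shape in [(1, 5), (7, 7)]:
    # Random stand-in weights; a real SOM would learn these during training.
    weights = rng.normal(size=(*som_shape, samples.shape[1]))
    cluster_index = winner_index(weights, samples)
    # Convert flat indices back to (row, col) grid coordinates.
    coords = np.unravel_index(cluster_index, som_shape)
    print(som_shape, "-> at most", som_shape[0] * som_shape[1], "clusters,",
          len(np.unique(cluster_index)), "used")
```

If the goal is one cluster per case, a map with as many neurons as expected clusters (such as the 1 x 5 shape used earlier) keeps the neuron-to-cluster mapping direct; a larger map would need its neurons grouped in a further step.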

Tags: python, pca, anomaly-detection, self-organizing-maps

Solution

