Feature selection in Python

Problem description

I am trying to implement the algorithm described in http://venom.cs.utsa.edu/dmz/techrep/2007/CS-TR-2007-011.pdf

import pandas as pd
import pathlib
import gaitrec
from tsfresh import extract_features
from collections import defaultdict
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import euclidean_distances

class PFA(object):
    def __init__(self, n_features, q=None):
        self.q = q
        self.n_features = n_features

    def fit(self, X):
        if not self.q:
            self.q = X.shape[1]
        pca = PCA(n_components=self.q).fit(X)
        A_q = pca.components_.T
        kmeans = KMeans(n_clusters=self.n_features).fit(A_q)
        clusters = kmeans.predict(A_q)
        cluster_centers = kmeans.cluster_centers_
        dists = defaultdict(list)
        for i, c in enumerate(clusters):
            dist = euclidean_distances(A_q[i, :].reshape(1,-1), cluster_centers[c, :].reshape(1,-1))[0][0]
            dists[c].append((i, dist))
        self.indices_ = [sorted(f, key=lambda x: x[1])[0][0] for f in dists.values()]
        self.features_ = X[:, self.indices_]


p = pathlib.Path(gaitrec.__file__).parent
dataset_file = p / 'DatasetC' / 'subj_001' / 'walk0' / 'subj_0010.csv'
read_csv = pd.read_csv(dataset_file, sep=';', decimal='.', names=['time','x','y', 'z', 'id'])
read_csv['id'] = 0

if __name__ == '__main__':
    print(read_csv)
    extracted_features = extract_features(read_csv, column_id="id", column_sort="time")
    features_withno_nanvalues = extracted_features.dropna(how='all', axis=1)
    print(features_withno_nanvalues)
    X = features_withno_nanvalues.to_numpy()
    pfa = PFA(n_features=2274, q=1)
    pfa.fit(X)
    Y = pfa.features_
    print(Y) #feature extracted
    column_indices = pfa.indices_ #index of the features
    print(column_indices)

C:\Users\Thund\AppData\Local\Programs\Python\Python37\python.exe C:/Users/Thund/Desktop/RepoBitbucket/Gaitrec/gaitrec/extraction.py
      time         x         y         z  id
0        0 -0.833333  0.416667 -0.041667   0
1        1 -0.833333  0.416667 -0.041667   0
2        2 -0.833333  0.416667 -0.041667   0
3        3 -0.833333  0.416667 -0.041667   0
4        4 -0.833333  0.416667 -0.041667   0
...    ...       ...       ...       ...  ..
1337  1337 -0.833333  0.416667  0.083333   0
1338  1338 -0.833333  0.416667  0.083333   0
1339  1339 -0.916667  0.416667  0.083333   0
1340  1340 -0.958333  0.416667  0.083333   0
1341  1341 -0.958333  0.416667  0.083333   0

[1342 rows x 5 columns]
Feature Extraction: 100%|██████████| 3/3 [00:04<00:00,  1.46s/it]
C:\Users\Thund\AppData\Local\Programs\Python\Python37\lib\site-packages\sklearn\decomposition\_pca.py:461: RuntimeWarning: invalid value encountered in true_divide
  explained_variance_ = (S ** 2) / (n_samples - 1)
variable  x__abs_energy  ...  z__variation_coefficient
id                       ...                          
0           1430.496338  ...                  5.521904

[1 rows x 2274 columns]
C:/Users/Thund/Desktop/RepoBitbucket/Gaitrec/gaitrec/extraction.py:21: ConvergenceWarning: Number of distinct clusters (2) found smaller than n_clusters (2274). Possibly due to duplicate points in X.
  kmeans = KMeans(n_clusters=self.n_features).fit(A_q)
[[1430.49633789   66.95824   ]]
[0, 1]

Process finished with exit code 0

I don't understand the reason for the warnings, or why only the first 2 of the 2k+ features are extracted. This is what I did:

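  1. Generate the covariance matrix from the raw data.
  2. Compute the eigenvectors and eigenvalues of the covariance matrix using the SVD method.
  3. Together, these two steps are what is called PCA: the principal components are the eigenvectors of the covariance matrix of the original data. The K-means algorithm is then applied to them.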
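
For reference, steps 1 and 2 can be checked numerically. The following is a minimal sketch, assuming synthetic data (the array `X` below is made up for illustration and is not the gait dataset): it compares the eigendecomposition of the covariance matrix with an SVD of the centered data, which is the route the `_pca.py` line in the log above takes.

```python
import numpy as np

# Synthetic data (an assumption for illustration): 50 samples, 4 features,
# scaled differently so the eigenvalues are well separated.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4)) * np.array([3.0, 2.0, 1.0, 0.5])

# Step 1: covariance matrix of the centered data.
n_samples = X.shape[0]
X_centered = X - X.mean(axis=0)
cov = X_centered.T @ X_centered / (n_samples - 1)

# Step 2a: eigendecomposition of the covariance matrix.
# eigh returns eigenvalues in ascending order, so reverse them.
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Step 2b: SVD of the centered data gives the same quantities.
_, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
svd_variances = S ** 2 / (n_samples - 1)

# Eigenvectors are only defined up to sign, so compare absolute values.
same_variances = np.allclose(eigvals, svd_variances)
same_directions = np.allclose(np.abs(eigvecs), np.abs(Vt.T), atol=1e-6)
print(same_variances, same_directions)
```

Note that both routes divide by `n_samples - 1`, so they need more than one sample (row) to produce a meaningful result.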

My questions are:

  1. How do I fix the warnings it gives me?
  2. It selects only 2 features out of the 2k+ features, so is something wrong?

Tags: python, scikit-learn, k-means, pca, feature-selection

Solution


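As mentioned in the comments, the features selected after fitting are taken using indices into the A_q matrix, which has the reduced number of features produced by PCA. Because of the reshape, you get two features rather than q features (1 in this case). self.features_ should probably be taken from A_q rather than from X.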
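
A further point worth checking, going only by the log in the question: the extracted feature table is reported as `[1 rows x 2274 columns]`, and the RuntimeWarning is raised on the line `explained_variance_ = (S ** 2) / (n_samples - 1)`. The sketch below uses NumPy only, with made-up values in `X`, to show how a single-row input would lead to a 0/0 division at that step. It imitates the arithmetic of that one line and is not a copy of scikit-learn's implementation.

```python
import numpy as np

# Hypothetical single-sample feature matrix: one row, like the
# "[1 rows x 2274 columns]" table in the log (values are made up).
X = np.array([[1430.5, 67.0, 5.5, 0.1, 2.0]])
n_samples = X.shape[0]

# Centering a single row subtracts the row from itself, leaving all zeros.
X_centered = X - X.mean(axis=0)

# The singular values of an all-zero matrix are all zero.
_, S, _ = np.linalg.svd(X_centered, full_matrices=False)

# Same arithmetic as the line quoted in the warning: 0 / (1 - 1).
# Without np.errstate, NumPy emits an "invalid value" RuntimeWarning here.
with np.errstate(invalid="ignore", divide="ignore"):
    explained_variance = (S ** 2) / (n_samples - 1)

print(explained_variance)
```

If this is what happens in the question's script, then passing tsfresh more than one distinct `id`, so that `X` has several rows, and choosing `n_features` smaller than the number of columns would be the things to try first. That suggestion rests on the log output rather than on anything stated in the thread.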

