Dropping useless columns from a DataFrame

Problem description

I use the following code to build and prepare my pandas DataFrame:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plp
from sklearn import preprocessing
from sklearn.cluster import KMeans

data = pd.read_csv('statistic.csv', parse_dates=True, index_col=['DATE'],
                   low_memory=False)
data[['QUANTITY']] = data[['QUANTITY']].apply(pd.to_numeric, errors='coerce')
data_extracted = data.groupby(['DATE', 'ARTICLENO'])['QUANTITY'].sum().unstack()
# replace string nan with numpy data type
data_extracted = data_extracted.fillna(value=np.nan)
# remove footer of csv file
data_extracted.index = pd.to_datetime(data_extracted.index.str[:-2],
                                      errors="coerce")
# resample to a weekly rhythm; loffset was removed in pandas 2.0, so the
# labels are shifted by one day afterwards instead
data_resampled = data_extracted.resample('W-MON', label='left').sum()
data_resampled.index = data_resampled.index + pd.DateOffset(days=1)
# reduce to one year
data_extracted = data_extracted.loc['2015-01-01':'2015-12-31']
# fill possible NaNs with 1 (not 0, to avoid division by zero in pct_change)
data_extracted = data_extracted.replace([np.inf, -np.inf], np.nan).fillna(1)
data_pct_change = (data_extracted.astype(float).pct_change(axis=0)
                   .replace([np.inf, -np.inf], np.nan).fillna(0))
# actual dropping logic if column has no values at all
# (Series.iteritems() was removed in pandas 2.0; items() is the replacement)
data_pct_change.drop([col for col, val in data_pct_change.sum().items()
                      if val == 0], axis=1, inplace=True)
normalized_modeling_data = preprocessing.normalize(data_pct_change,
                                                   norm='l2', axis=0)
normalized_data_headers = pd.DataFrame(normalized_modeling_data,
                                       columns=data_pct_change.columns)
# transpose so that every article (column) becomes one sample for k-means
normalized_modeling_data = normalized_modeling_data.transpose()
kmeans = KMeans(n_clusters=3, random_state=0).fit(normalized_modeling_data)
print(kmeans.labels_)
np.savetxt('log_2016.txt', kmeans.labels_, newline="\n")
for i, cluster_center in enumerate(kmeans.cluster_centers_):
    plp.plot(cluster_center, label='Center {0}'.format(i))
plp.legend(loc='best')
plp.show()

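Unfortunately, my DataFrame contains a lot of zeros (the articles do not all start on the same date, so if A starts in 2015 and B starts in 2016, B gets 0 for the whole of 2015). Here is the grouped DataFrame: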

ARTICLENO     205123430604  205321436644  405659844106  305336746308  
DATE                                                                     
2015-01-05            9.0            6.0          560.0         2736.0   
2015-01-19            2.0            1.0          560.0         3312.0   
2015-01-26            NaN            5.0          600.0         2196.0   
2015-02-02            NaN            NaN           40.0         3312.0   
2015-02-16            7.0            6.0          520.0         5004.0   
2015-02-23           12.0            4.0          480.0         4212.0   
2015-04-13           11.0            6.0          920.0         4230.0 

And here are the corresponding percentage changes:

ARTICLENO     205123430604   205321436644  405659844106  305336746308  
DATE                                                                     
2015-01-05       0.000000       0.000000       0.000000       0.000000   
2015-01-19      -0.777778      -0.833333       0.000000       0.210526   
2015-01-26      -0.500000       4.000000       0.071429      -0.336957   
2015-02-02       0.000000      -0.800000      -0.933333       0.508197   
2015-02-16       6.000000       5.000000      12.000000       0.510870   
2015-02-23       0.714286      -0.333333      -0.076923      -0.158273 

The factor of 12 for 405659844106 is "correct": the quantity really did rise from 40 to 520. Here is another example from my DataFrame:

ARTICLENO     305123446353  205423146377  305669846421  905135949255  
DATE                                                                     
2015-01-05         2175.0          200.0            NaN            NaN   
2015-01-19         2550.0            NaN            NaN            NaN   
2015-01-26          925.0            NaN            NaN            NaN   
2015-02-02          675.0            NaN            NaN            NaN   
2015-02-16         1400.0          200.0          120.0            NaN   
2015-02-23         6125.0          320.0            NaN            NaN   

And the corresponding percentage changes:

ARTICLENO     305123446353  205423146377  305669846421  905135949255
DATE
2015-01-05        0.000000      0.000000      0.000000      0.000000
2015-01-19        0.172414     -0.995000      0.000000     -0.058824
2015-01-26       -0.637255      0.000000      0.000000      0.047794
2015-02-02       -0.270270      0.000000      0.000000     -0.996491
2015-02-16        1.074074    199.000000    119.000000    279.000000
2015-02-23        3.375000      0.600000     -0.991667      0.310714

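As you can see, there are changes by a factor of 200-300 that come from a replaced NaN turning into a real value. For 205423146377, for example, the filled-in 1 is followed by 200, which gives a change of 199.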
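
To make the effect concrete, here is a minimal sketch that reproduces the 205423146377 column in isolation. It only assumes the six quantities shown in the table above and applies the same fill-with-1 and pct_change steps as the code in the question; the variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Quantities for article 205423146377, copied from the table above.
quantity = pd.Series(
    [200.0, np.nan, np.nan, np.nan, 200.0, 320.0],
    index=pd.to_datetime(
        ["2015-01-05", "2015-01-19", "2015-01-26",
         "2015-02-02", "2015-02-16", "2015-02-23"]
    ),
    name="205423146377",
)

# Same steps as in the question: fill NaN with 1, then take percentage changes.
filled = quantity.fillna(1)
pct = filled.pct_change().fillna(0)

# The drop from 200 to the filled-in 1 and the jump from 1 back to 200
# are the two artificial spikes.
print(pct.round(6))
```

The two spikes (200 to 1, and 1 to 200) are produced purely by the placeholder value, not by any real change in the data.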

This data is used for k-means clustering, and this "nonsense" data ruins my k-means centers.

Does anyone know how to drop these columns?

Tags: python-3.x, pandas

Solution


I dropped the meaningless columns with the following statement, which must run before the NaNs are filled with 1:

max_nan_value_count = 5
# drop every column that contains more than max_nan_value_count NaN values
nan_counts = data_extracted.isnull().sum()
data_extracted = data_extracted.drop(columns=nan_counts[nan_counts > max_nan_value_count].index)
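
As a rough illustration of what this filter does, the sketch below runs it on a small made-up frame (the `toy` data and its column names are hypothetical, not taken from the question). It also shows that `DataFrame.dropna` with `thresh` should give the same result, since "at most 5 NaNs" is the same condition as "at least `len(df) - 5` non-NaN values".

```python
import numpy as np
import pandas as pd

max_nan_value_count = 5

# Hypothetical toy frame: 8 rows, three columns with 0, 3 and 7 missing values.
toy = pd.DataFrame({
    "A": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "B": [1.0, np.nan, np.nan, np.nan, 5.0, 6.0, 7.0, 8.0],
    "C": [np.nan] * 7 + [8.0],
})

# Approach from the answer: count NaNs per column and drop the sparse ones.
nan_counts = toy.isnull().sum()
dropped = toy.drop(columns=nan_counts[nan_counts > max_nan_value_count].index)

# Alternative formulation: keep columns with at least
# (number of rows - max_nan_value_count) non-NaN values.
kept = toy.dropna(axis=1, thresh=len(toy) - max_nan_value_count)

print(list(dropped.columns))
print(list(kept.columns))
```

Either form works; the right value for `max_nan_value_count` depends on how many missing weeks are tolerable before the percentage changes stop being meaningful.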
