python-3.x - Dropping useless columns from a DataFrame
Problem description
I use the following code to build and prepare my pandas DataFrame:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plp
from sklearn import preprocessing
from sklearn.cluster import KMeans

data = pd.read_csv('statistic.csv',
                   parse_dates=True, index_col=['DATE'], low_memory=False)
data[['QUANTITY']] = data[['QUANTITY']].apply(pd.to_numeric, errors='coerce')
data_extracted = data.groupby(['DATE', 'ARTICLENO'])['QUANTITY'].sum().unstack()
# replace string nan with numpy data type
data_extracted = data_extracted.fillna(value=np.nan)
# remove footer of csv file
data_extracted.index = pd.to_datetime(data_extracted.index.str[:-2],
                                      errors="coerce")
# resample to a one-week rhythm; the loffset argument was removed in
# pandas 2.0, so the one-day offset is added to the index afterwards
data_resampled = data_extracted.resample('W-MON', label='left').sum()
data_resampled.index = data_resampled.index + pd.Timedelta(days=1)
# reduce to one year
data_extracted = data_extracted.loc['2015-01-01':'2015-12-31']
# fill possible NaNs with 1 (not 0, because of division by zero when doing
# pct_change)
data_extracted = data_extracted.replace([np.inf, -np.inf], np.nan).fillna(1)
data_pct_change = (data_extracted.astype(float).pct_change(axis=0)
                   .replace([np.inf, -np.inf], np.nan).fillna(0))
# actual dropping logic if column has no values at all
# (Series.iteritems() was removed in pandas 2.0; items() replaces it)
data_pct_change.drop([col for col, val in data_pct_change.sum().items()
                      if val == 0], axis=1, inplace=True)
normalized_modeling_data = preprocessing.normalize(data_pct_change,
                                                   norm='l2', axis=0)
normalized_data_headers = pd.DataFrame(normalized_modeling_data,
                                       columns=data_pct_change.columns)
# transpose so that each article (column) becomes one sample for KMeans
normalized_modeling_data = normalized_modeling_data.transpose()
kmeans = KMeans(n_clusters=3, random_state=0).fit(normalized_modeling_data)
print(kmeans.labels_)
np.savetxt('log_2016.txt', kmeans.labels_, newline="\n")
for i, cluster_center in enumerate(kmeans.cluster_centers_):
    plp.plot(cluster_center, label='Center {0}'.format(i))
plp.legend(loc='best')
plp.show()
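Unfortunately, my DataFrame contains a lot of 0s (the articles do not all start on the same date, so if A starts in 2015 and B starts in 2016, B gets 0 for the whole of 2015). Here is the grouped DataFrame: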
ARTICLENO 205123430604 205321436644 405659844106 305336746308
DATE
2015-01-05 9.0 6.0 560.0 2736.0
2015-01-19 2.0 1.0 560.0 3312.0
2015-01-26 NaN 5.0 600.0 2196.0
2015-02-02 NaN NaN 40.0 3312.0
2015-02-16 7.0 6.0 520.0 5004.0
2015-02-23 12.0 4.0 480.0 4212.0
2015-04-13 11.0 6.0 920.0 4230.0
And here is the corresponding percentage change:
ARTICLENO 205123430604 205321436644 405659844106 305336746308
DATE
2015-01-05 0.000000 0.000000 0.000000 0.000000
2015-01-19 -0.777778 -0.833333 0.000000 0.210526
2015-01-26 -0.500000 4.000000 0.071429 -0.336957
2015-02-02 0.000000 -0.800000 -0.933333 0.508197
2015-02-16 6.000000 5.000000 12.000000 0.510870
2015-02-23 0.714286 -0.333333 -0.076923 -0.158273
The factor of 12 at 405659844106 is "correct" (the quantity really does rise from 40 to 520). Here is another example from my DataFrame:
ARTICLENO 305123446353 205423146377 305669846421 905135949255
DATE
2015-01-05 2175.0 200.0 NaN NaN
2015-01-19 2550.0 NaN NaN NaN
2015-01-26 925.0 NaN NaN NaN
2015-02-02 675.0 NaN NaN NaN
2015-02-16 1400.0 200.0 120.0 NaN
2015-02-23 6125.0 320.0 NaN NaN
And the corresponding percentage change:
ARTICLENO 305123446353 205423146377 305669846421 905135949255
DATE
2015-01-05 0.000000 0.000000 0.000000 0.000000
2015-01-19 0.172414 -0.995000 0.000000 -0.058824
2015-01-26 -0.637255 0.000000 0.000000 0.047794
2015-02-02 -0.270270 0.000000 0.000000 -0.996491
2015-02-16 1.074074 199.000000 119.000000 279.000000
2015-02-23 3.375000 0.600000 -0.991667 0.310714
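As you can see, there are changes by a factor of 200-300 that come from a replaced NaN turning into an actual value.
This data is used for k-means clustering, and this kind of "nonsense" data ruins my k-means centers.
Does anyone know how to remove such columns?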
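The jump can be reproduced in isolation. The sketch below is only an illustration: it takes the quantities of article 205423146377 from the second example table and applies the same fill-then-pct_change steps as the question's code. The variable names are chosen here and are not part of the original script.

```python
import numpy as np
import pandas as pd

# Quantities of article 205423146377, copied from the second example table.
quantity = pd.Series(
    [200.0, np.nan, np.nan, np.nan, 200.0, 320.0],
    index=pd.to_datetime(
        ["2015-01-05", "2015-01-19", "2015-01-26",
         "2015-02-02", "2015-02-16", "2015-02-23"]
    ),
    name="205423146377",
)

# Same steps as in the question: NaN -> 1, percentage change, NaN -> 0.
filled = quantity.fillna(1)
pct = filled.pct_change().replace([np.inf, -np.inf], np.nan).fillna(0)

print(pct)
```

The placeholder 1 followed by a real 200 gives (200 - 1) / 1 = 199, which is the value shown in the table for 2015-02-16. The spike is therefore produced by the fill value and does not reflect a real change in demand.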
Solution
I removed the meaningless columns with the following statement:
# drop every column that has more than max_nan_value_count NaN entries
max_nan_value_count = 5
data_extracted = data_extracted.drop(
    data_extracted.columns[data_extracted.apply(
        lambda col: col.isnull().sum() > max_nan_value_count)],
    axis=1)
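As a rough, self-contained check of this approach, the sketch below rebuilds the second example table and applies the same NaN-count rule in two other ways that should be equivalent: a boolean column mask and `DataFrame.dropna` with `thresh`. The threshold of 2 is an assumption picked for this six-row sample, not the value used above.

```python
import numpy as np
import pandas as pd

nan = np.nan
# Second example table from the question (six weekly rows, four articles).
data_extracted = pd.DataFrame(
    {
        "305123446353": [2175.0, 2550.0, 925.0, 675.0, 1400.0, 6125.0],
        "205423146377": [200.0, nan, nan, nan, 200.0, 320.0],
        "305669846421": [nan, nan, nan, nan, 120.0, nan],
        "905135949255": [nan] * 6,
    },
    index=pd.to_datetime(
        ["2015-01-05", "2015-01-19", "2015-01-26",
         "2015-02-02", "2015-02-16", "2015-02-23"]
    ),
)

max_nan_value_count = 2  # hypothetical threshold for this small sample

# Count the NaN entries per column and keep columns at or below the limit.
nan_counts = data_extracted.isnull().sum()
kept_by_mask = data_extracted.loc[:, nan_counts <= max_nan_value_count]

# Same rule via dropna: thresh is the minimum number of non-NaN values
# a column needs in order to be kept.
kept_by_dropna = data_extracted.dropna(
    axis=1, thresh=len(data_extracted) - max_nan_value_count
)

print(nan_counts.to_dict())
print(list(kept_by_mask.columns))
```

The filter has to run before the `fillna(1)` step, because after filling there are no NaN values left to count.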