Error with multivariate time series clustering using dtwclust: "There are missing values in the series"

Problem description

I am working on time series clustering with the dtwclust package in R, using a fundamentals data set for S&P 500 stocks. The head of the data frame is as follows:

> print(head(DFCleanPF))

  ticker reportperiod  price    ebitusd epsusd    pe   marketcap         gp
1    ADS   2019-06-29 140.13 1761800000  16.13  8.44  7340500696 3185200000
2    ADS   2019-03-30 174.98 1818600000  17.42  9.78  9273534921 3295100000
3    ADS   2018-12-30 150.08 1894300000  17.56  8.49  8175386182 3570300000
4    ADS   2018-09-29 236.16 1735100000  17.21 13.67 12975478451 3439600000
5    ADS   2018-06-29 233.20 1692000000  16.01 14.58 12919222400 3398900000
6    ADS   2018-03-30 212.86 1652700000  14.55 14.64 11805497214 3402700000

So I have quarterly filing data for a few stocks, ordered by ticker and date. I then standardize the data with BBmisc::normalize and try to cluster the resulting data frame.

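library(dtwclust)  # tsclust(), tsclust_args()
library(Amelia)    # missmap() (assumed to come from Amelia)

# Standardize the columns, then check for missing values
DFCleanPF.norm <- BBmisc::normalize(DFCleanPF, method = "standardize")
#show(DFCleanPF.norm)
missmap(DFCleanPF.norm)
# Attempt a 4-cluster partition with the GAK distance
mvc <- tsclust(DFCleanPF.norm, k = 4L, distance = "gak", seed = 390,
               args = tsclust_args(dist = list(sigma = 100)))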

missmap(DFCleanPF.norm) shows zero missing values in the data frame.

However, tsclust still fails with a missing-values error, even though the data set has none. The error is as follows:

Error in FUN(X[[i]], ...) : There are missing values in the series
In addition: There were 50 or more warnings (use warnings() to see the first 50)

I am new to time series clustering. My goal is to find a suitable method for clustering the stocks by risk and return, based on key parameters captured over time (multiple records per stock). I would therefore appreciate help with:

1. Building a suitable input from the source data set above and getting rid of the error.
2. Choosing the right dtwclust call and a value of k that gives the best clustering.

A sample of the data is included. Other clustering models that would help achieve these goals are also welcome. I am not sure whether the current data frame can be used as input directly, or whether the numeric variables need to be split into one matrix per ticker symbol.

Here is the sample data set: [image not available]

Tags: r, time-series, cluster-analysis, dtw

Solution
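
A possible direction, offered as a sketch rather than a confirmed fix: dtwclust documents multivariate series as a list of matrices, one per series, with time in the rows and variables in the columns. When a data frame is passed instead, each row is treated as a separate series, so the non-numeric ticker and reportperiod columns are a plausible source of the NA values and coercion warnings (this is an assumption; the question does not show the warnings). The example below uses synthetic data with hypothetical tickers (AAA to FFF) in place of DFCleanPF, keeps only three of the numeric columns, standardizes them with base R instead of BBmisc::normalize, and reuses sigma = 100 from the question. It also tries several values of k and compares them with the silhouette index, which is only one of several ways to choose k.

```r
# Synthetic stand-in for DFCleanPF: 6 hypothetical tickers x 8 quarters.
set.seed(1)
tickers  <- c("AAA", "BBB", "CCC", "DDD", "EEE", "FFF")
quarters <- seq(as.Date("2017-09-30"), by = "quarter", length.out = 8)

toy <- data.frame(
  ticker       = rep(tickers, each = length(quarters)),
  reportperiod = rep(quarters, times = length(tickers)),
  stringsAsFactors = FALSE
)
n <- nrow(toy)
toy$price  <- rnorm(n, mean = 150, sd = 40)   # made-up values
toy$epsusd <- rnorm(n, mean = 10,  sd = 5)    # made-up values
toy$pe     <- rnorm(n, mean = 12,  sd = 3)    # made-up values

# Standardize the numeric columns only (z-scores across all rows).
num_cols <- c("price", "epsusd", "pe")
toy[num_cols] <- lapply(toy[num_cols], function(x) (x - mean(x)) / sd(x))

# One matrix per ticker: rows = quarters in time order, columns = variables.
toy    <- toy[order(toy$ticker, toy$reportperiod), ]
series <- lapply(split(toy, toy$ticker), function(d) as.matrix(d[num_cols]))

# Sanity check before clustering: numeric matrices without NA.
stopifnot(all(sapply(series, is.numeric)), !anyNA(unlist(series)))

if (requireNamespace("dtwclust", quietly = TRUE)) {
  suppressPackageStartupMessages(library(dtwclust))

  # Passing a vector of k values returns one clustering object per k.
  ks   <- 2L:3L
  fits <- tsclust(series, type = "partitional", k = ks,
                  distance = "gak", centroid = "pam", seed = 390,
                  args = tsclust_args(dist = list(sigma = 100)))
  names(fits) <- paste0("k_", ks)

  # Silhouette index per k (higher is better).
  print(sapply(fits, cvi, type = "Sil"))

  # Cluster assignment of each ticker for k = 2.
  print(setNames(fits$k_2@cluster, names(series)))
}
```

With the real data, the same reshaping would be applied to DFCleanPF after dropping or setting aside ticker and reportperiod. Because the toy values are random, the cluster assignments it prints carry no meaning beyond showing that the call runs on this input shape.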

