# K-means for stock market timeseries

Abdelrazzaq on 19 Jan 2014
Answered: Abdelrazzaq on 2 Feb 2014
Hi
I am researching the accuracy of different volatility models in forecasting stock market volatility, using index time series. I need to cluster the data with K-means into two groups. I already have the time series from different stock markets, and they all have the same length. I just need to split each of them into two subsets: the first will be used to train the models, and the second to test the models' forecasts. Could you give me the code directly, or at least show me how to get started with k-means in MATLAB?
I seriously look forward to hearing from you very soon.
Regards, Abdelrazzaq.

AJ von Alt on 20 Jan 2014
Edited: AJ von Alt on 20 Jan 2014
The function kmeans is part of the Statistics Toolbox in MATLAB. The following code demonstrates how to use k-means to cluster data into two groups and pull out the individual groups.
% Generate random data
nSamples = 100;
sampleWidth = 5;
X = rand(nSamples,sampleWidth);
trainingSetSize = 20;
% Separate into two groups using squared Euclidean distance.
% IDX will be nSamples x 1, where each element is the cluster label of
% the corresponding row of X
IDX = kmeans( X , 2 , 'distance' , 'sqEuclidean');
% separate the data into two groups
G1 = X(IDX == 1 , : );
G2 = X(IDX == 2 , : );
As a result of the k-means clustering, the groups will be self-similar and would likely make very bad training and test data for an ML algorithm. A much more suitable function for generating training and test sets is randsample, also in the Statistics Toolbox. By sampling the population uniformly at random, it gives your ML algorithm more diverse training data and helps improve its robustness.
% Randomly select trainingSetSize row indices without replacement
rsIDX = randsample( size(X,1) , trainingSetSize );
% Create a logical mask and mark the selected rows as training samples
tsMASK = false( nSamples , 1 );
tsMASK( rsIDX ) = true;
% Separate the data into training and test samples.
GTraining = X( tsMASK , : );
GTest = X( ~tsMASK , : );
Abdelrazzaq on 20 Jan 2014
Dear AJ von Alt, thanks for your response. You gave me the code, but I still cannot partition my time series data: I am using stock price series and need to convert them into log return series before partitioning. I also need to know what the sample size is. Is it the number of observations in each series? And how do I use the code you provided on my series? I am a new MATLAB user.
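One way to connect the random split to price data is sketched below. It assumes a hypothetical price matrix `P` with one row per date and one column per index (synthetic data here, not real prices), so the sample size is the number of rows of the return matrix. `randperm` from base MATLAB is swapped in for `randsample` so the sketch runs without the Statistics Toolbox, and the 80% training fraction is an arbitrary choice for illustration.

```matlab
% Hypothetical price matrix: rows are dates, columns are indexes (synthetic data)
rng(0);                                   % fix the seed so the run is repeatable
nDays = 500;
nIndexes = 3;
P = 100 * exp( cumsum( 0.01 * randn(nDays, nIndexes) ) );

% Log returns: r(t) = log(P(t)) - log(P(t-1)), so R has one row fewer than P
R = diff( log(P) );

% The sample size is the number of observations (rows) in the return matrix
nSamples = size(R, 1);
trainingSetSize = round( 0.8 * nSamples );   % assumed 80/20 split

% Pick training rows at random without replacement and build a logical mask
rsIDX = randperm( nSamples, trainingSetSize );
tsMASK = false( nSamples, 1 );
tsMASK( rsIDX ) = true;

% Separate the returns into training and test samples
GTraining = R( tsMASK, : );
GTest = R( ~tsMASK, : );
```

To use it on real data, replace the synthetic `P` with your own price matrix and keep the rest unchanged.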
Moreover, I need to decompose the data and regularize them before running my tests. Then I need to train a group of univariate and multivariate GARCH models and test their forecasting accuracy using the RMSE error measure. In the second step, I need to apply extreme value theory, and finally a group of hybrid models, namely HMM-GARCH and GA-HMM. I know that optimtool and the statistics toolboxes can handle those models, but I wonder if you can give me as much help as you can, so that I save time and get the statistical results as soon as possible. I am in contact with professional academic staff in the CS department of our university, and they will help me as well. I am reading books on MATLAB and I recently bought the latest version of the MATLAB and Simulink student DVD, but I already have the 2012 version with all toolboxes included. However, I still need more help. I look forward to hearing from you, as I must at least preprocess the data and apply my classic models as soon as possible, within 3 days at most. I have found code for almost everything, but I need technical support to avoid any possible mistake. Regards, Abdelrazzaq
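For the RMSE step, a minimal sketch in base MATLAB follows. The vectors `realizedVol` and `forecastVol` are hypothetical placeholders with made-up values; in practice they would hold the realized volatility on the test set and the corresponding model forecasts.

```matlab
% Hypothetical realized and forecast volatilities on the test set (made-up values)
realizedVol = [0.010; 0.020; 0.015; 0.030];
forecastVol = [0.012; 0.018; 0.015; 0.026];

% Forecast errors, one per test observation
forecastError = forecastVol - realizedVol;

% Root mean squared error: square the errors, average them, take the square root
rmseValue = sqrt( mean( forecastError .^ 2 ) );

fprintf('RMSE = %.6f\n', rmseValue);
```

Computing the same quantity for each model on the same test set gives a directly comparable accuracy measure, with lower values indicating closer forecasts.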

Abdelrazzaq on 2 Feb 2014
So what ?!!!!
Please reply. I used the code for more than one time horizon, but look at the bad result I got for the weekly and quarterly return series matrices:
GTraining = Empty matrix: 754-by-0
Really strange! I am sure I am using exactly the same code as above, and I already ran it on other time series and obtained results without this error message.
I need to complete the data analysis part within two days at most. So please advise and give me your attention. I double-checked everything: all functions are correct and there are no zero values.
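A 754-by-0 result means the matrix has 754 rows but zero columns, so one thing worth checking is whether the input already had no columns or whether a mask was applied to the column dimension and selected nothing. The sketch below reproduces that shape with a hypothetical stand-in matrix `X`; it illustrates how such a result can arise and is not a diagnosis of the actual data.

```matlab
% Hypothetical stand-in for a 754-observation return matrix with 4 series
X = rand(754, 4);
disp( size(X) );                 % check the input dimensions first

% Row mask: selects observations and keeps every column
rowMask = false( size(X,1), 1 );
rowMask(1:100) = true;
A = X( rowMask, : );             % 100 rows, 4 columns

% Column mask with nothing selected: keeps every row and drops every column
colMask = false( 1, size(X,2) );
B = X( :, colMask );             % 754 rows, 0 columns

disp( size(A) );
disp( size(B) );
```

Printing `size` of the input matrix and `nnz` of the mask just before the indexing step shows which of the two cases applies.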