Monday, March 2, 2015

Experiments in Time Series Clustering

Last night I spotted this tweet about the R package TSclust.

I should start by saying that I really don’t know what I’m doing, so be warned. I thought it would be interesting to apply TSclust to the S&P 500 price time series. I took the 1-day simple rate of change, grouped by year with dplyr, and then indexed by the day of the year, all in one pipeR pipeline. Since the TSclust paper

TSclust: An R Package for Time Series Clustering

Journal of Statistical Software, Volume 62, Issue 1

November 2014

http://www.jstatsoft.org/v62/i01/paper

demonstrates interoperability with hclust in its OECD interest rate example (Section 5.2), I thought I could visualize the results nicely with treewidget from the epiwidgets package. Just because the htmlwidget was designed for phylogeny doesn’t mean we can’t use it for finance. Here is the result.

For reference and searching, I’ll copy the code below, but all of this can be found in this GitHub repo.

```r
library(TSclust)
library(quantmod)
library(dplyr)
library(pipeR)
library(tidyr)
library(epiwidgets)

# daily S&P 500 data from Yahoo; keep only the 4th column (close)
sp5 <- getSymbols("^GSPC", auto.assign = FALSE, from = "1900-01-01")[, 4]

sp5 %>>%
  # dplyr doesn't like xts, so make a data.frame
  (
    data.frame(
      date = index(.)
      ,price = .[, 1, drop = TRUE]
    )
  ) %>>%
  # add a column for Year
  mutate( year = as.numeric(format(date, "%Y")) ) %>>%
  # group by our new Year column
  group_by( year ) %>>%
  # within each year, find what day in the year so we can join
  mutate( pos = rank(date) ) %>>%
  # 1-day simple rate of change; dplyr::lag takes n, not k
  mutate( roc = price / dplyr::lag(price, n = 1) - 1 ) %>>%
  # can remove date and price
  select( -c(date, price) ) %>>%
  as.data.frame %>>%
  # years as columns and pos as rows
  spread( year, roc ) %>>%
  # remove last year since we assume it is not complete
  ( .[, -ncol(.)] ) %>>%
  # remove pos since index will be same
  select( -pos ) %>>%
  # fill NAs with 0
  na.fill( 0 ) %>>%
  # transpose so each row is one year's series
  t %>>%
  # use TSclust diss; note lots of METHOD options
  diss( METHOD = "ACF" ) %>>%
  hclust %>>%
  ape::as.phylo() %>>%
  treewidget
```
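To get a feel for what the ACF dissimilarity is doing without downloading any data, here is a minimal base R sketch on simulated series. It assumes that the ACF method amounts to a Euclidean distance between estimated autocorrelation vectors, which is my reading of the paper's unweighted default rather than a copy of the TSclust internals. The simulated AR(1) series, the names `sim`, `acf_mat`, and `lag_max`, and the choice of 10 lags are all illustrative assumptions.

```r
# Sketch only: ACF-based dissimilarity in base R on simulated data.
set.seed(42)

n_obs <- 250  # roughly one trading year (assumed length)

# Hypothetical stand-ins for yearly return series: three AR(1) series with
# positive autocorrelation and three with negative autocorrelation.
sim <- cbind(
  sapply(1:3, function(i) as.numeric(arima.sim(list(ar = 0.7), n = n_obs))),
  sapply(1:3, function(i) as.numeric(arima.sim(list(ar = -0.7), n = n_obs)))
)
colnames(sim) <- c(paste0("pos", 1:3), paste0("neg", 1:3))

# Estimated autocorrelations for lags 1..lag_max, one row per series.
lag_max <- 10
acf_mat <- t(apply(sim, 2, function(x) {
  acf(x, lag.max = lag_max, plot = FALSE)$acf[-1]  # drop lag 0, which is always 1
}))

# Euclidean distance between the ACF vectors, then hierarchical clustering.
d      <- dist(acf_mat)
hc     <- hclust(d)
groups <- cutree(hc, k = 2)
print(groups)
```

Because `hc` is an ordinary hclust object, it can be passed to `plot()` for a plain dendrogram or converted with `ape::as.phylo()` as in the pipeline above.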

2. Thanks so much for responding. I'll work through the paper thoroughly. On a first skim I get the point, but I'm not sure I would classify all efforts on all time series as meaningless. If nothing else, hopefully readers will get some efficient dplyr code to get xts data into the pivoted format.

Thanks again. If I can get confident enough in my understanding, I'll try to write a post, or if you would like, I'd be happy to put up a guest post :)

3. I worked through the paper, and I think their method of clustering and its potential application differ considerably from what I intended. For instance, in this post I am clustering years (of returns, not prices) based on ACF, and the output seems reasonable. More than anything, this has given me a reason to develop some interactive explorations.

Thanks again.

4. Nice to know someone sees my tweets and that they generate some inquiry! This package did work well for me in a sales transaction example that I'll present at INFORMS 2015, n=3345, with the DWT wavelets method. Very pleased with its findings. I could not get DTWARP to work with this large sample size.

For equities, you might compare the 500 equities against each other over the entire time series, rather than years. Then you can see how they wiggle against each other. Year-on-year will have seasonal effects that you could pull using decompose() and larger economic cycles that would dominate the series.

For equities you might look at distance methods including SAX or DWT, maybe DTWARP. Good luck!!
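
The decompose() idea mentioned above can be sketched on simulated data. This is only an illustration under stated assumptions: the monthly series `x`, its trend, and its 12-period cycle are made up, and for daily equity data the right `frequency` is a judgment call that this sketch does not settle.

```r
# Sketch only: estimate and remove a seasonal component with decompose().
set.seed(1)

# Hypothetical monthly series: 10 years of trend + seasonal cycle + noise.
n_years  <- 10
months   <- seq_len(12 * n_years)
seasonal <- 2 * sin(2 * pi * months / 12)
x <- ts(0.05 * months + seasonal + rnorm(length(months), sd = 0.5),
        frequency = 12)

# Additive decomposition (the default) into trend, seasonal, and random parts.
parts <- decompose(x)

# Subtract the estimated seasonal component before any clustering step.
deseasonalized <- x - parts$seasonal
```

The `deseasonalized` series keeps the trend and noise, so any larger cycle would still need separate treatment.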