Sunday, December 29, 2013

Price Spread based Mean Reversion Strategy within R and Python

The code below, in both R and Python, shows how to apply a basic mean-reversion strategy based on the price spread (optionally the log-price spread) of the Gold and US Oil ETFs.
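In outline: each day $t$, a rolling regression of the oil price on the gold price over the trailing window gives the hedge ratio $\beta_t$; the spread is the hedged portfolio value, and its negative rolling z-score is used as the number of units to hold:
$$ o_t = c + \beta_t g_t + \epsilon_t, \qquad s_t = o_t - \beta_t g_t $$
$$ u_t = -\frac{s_t - \bar{s}_t}{\hat{\sigma}_{s,t}} $$
where $\bar{s}_t$ and $\hat{\sigma}_{s,t}$ are the rolling mean and standard deviation of the spread over the same window.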

#
# R code
#
# packages used below: tseries for adf.test, caTools for runmean/runsd
library(tseries)
library(caTools)

# load price data of the Gold and US Oil ETFs 
g=read.csv("gold.csv", header=F)
o=read.csv("uso.csv", header=F)

# one month window length
wLen=22 

len=dim(g)[1]
hedgeRatio=matrix(rep(0,len),len)

# to verify if spread is stationary 
adfResP=0
# flag to enable log price
isLogPrice=0
for (t in wLen:len){
  g_w=g[(t-wLen+1):t,1]
  o_w=o[(t-wLen+1):t,1]
  
  if (isLogPrice==1){
    g_w=log(g_w)
    o_w=log(o_w)
  }
# linear regression
  reg=lm(o_w~g_w)
# get hedge ratio 
  hedgeRatio[t]=reg$coefficients[2];  
# verify if spread (residual) is stationary 
 adfRes=adf.test(reg$residuals, alternative='stationary')
# sum of p values  
  adfResP=adfResP+adfRes$p.value
}
# estimate mean p value
avgPValue=adfResP/(len-wLen)
# > 0.5261476
# as the average p-value (0.5261476) indicates, the spread is actually not stationary, so the strategy won't make much return. 


portf=cbind(g,o)
sportf=portf
if (isLogPrice==1){
  sportf=log(portf)
}
# estimate spread of portfolio = oil - hedgeRatio*gold
spread=matrix(rowSums(cbind(-1*hedgeRatio,1)*sportf))

plot(spread[,1],type='l')

# trim N/A sections
start=wLen+1
hedgeRatio=hedgeRatio[start:len,1]
portf=portf[start:len,1:2]
spread=matrix(spread[start:len,1])

# negative Z score will be used as number of shares
# runmean and runsd are in caTools package
meanSpread=runmean(spread,wLen,endrule="constant") 
stdSpread=runsd(spread,wLen,endrule="constant")
numUnits=-(spread-meanSpread)/stdSpread #

positions=cbind(numUnits,numUnits)*cbind(-1*hedgeRatio,1)*portf

# daily profit and loss
lagPortf=lags(portf,1)[,3:4]
lagPos=lags(positions,1)[,3:4]
pnl=rowSums(lagPos*(portf-lagPortf)/lagPortf);

# return is P&L divided by gross market value of portfolio
ret=tail(pnl,-1)/rowSums(abs(lagPos))
plot(cumprod(1+ret)-1,type='l')

# annual percentage rate
APR=prod(1+ret)^(252/length(ret)) 
# > 1.032342 
sharpRatio=sqrt(252)*mean(ret)/stdev(ret)
# > 0.3713589

'''

Python code

Created on 29 Dec 2013

@author: deniz turan (denizstij@gmail.com)
'''

import numpy as np
import pandas as pd
from scipy.stats import linregress

o=pd.read_csv("uso.csv",header=0,names=["price"])
g=pd.read_csv("gold.csv",header=0,names=["price"])

len=o.price.count()
wLen=22
hedgeRatio= np.zeros((len,2))

for t in range(wLen, len):
    o_w=o.price[t-wLen:t]
    g_w=g.price[t-wLen:t]

    slope, intercept, r, p, stderr = linregress(g_w, o_w)
    hedgeRatio[t,0]=slope*-1
    hedgeRatio[t,1]=1


portf=np.vstack((g.price,o.price)).T
# spread 
spread=np.sum(np.multiply(portf,hedgeRatio),1)

# negative Z score will be used as number of shares
meanSpread=pd.rolling_mean(spread,wLen); 
stdSpread=pd.rolling_std(spread,wLen); 
numUnits=-(spread-meanSpread)/stdSpread #

#drop NaN values
start=wLen
g=g.drop(g.index[:start])
o=o.drop(o.index[:start])
hedgeRatio=hedgeRatio[start:,]
portf=portf[start:,]
spread=spread[start:,]
# number of units

numUnits=numUnits[start:,]
# position
positions=np.multiply(np.vstack((numUnits,numUnits)).T,np.multiply(portf,hedgeRatio))

# get lag 1
lagPortf=np.roll(portf,1,0);
lagPortf[0,]=lagPortf[1,];
lagPos=np.roll(positions,1,0);
lagPos[0,]=lagPos[1,];

spread=np.sum(np.multiply(portf,hedgeRatio),1)
pnl=np.sum(np.divide(np.multiply(lagPos,(portf-lagPortf)),lagPortf),1)

# return
ret=np.divide(pnl,np.sum(np.abs(lagPos),1))

APR=np.power(np.prod(1+ret),(252/float(np.size(ret,0)))) 
sharpRatio=np.sqrt(252)*float(np.mean(ret))/float(np.std(ret))
print " APR %f, sharpeRatio=%f" %( APR,sharpRatio)

Although the ADF-test p-value, the APR (annual percentage rate) and the Sharpe ratio all indicate that this strategy is not profitable, it is a very simple strategy to implement.

Tuesday, December 03, 2013

Generalized AutoRegressive Conditional Heteroskedasticity (GARCH) within R

Generalized AutoRegressive Conditional Heteroskedasticity (GARCH) is a kind of ARMA (p,q) model to represent volatility:
$$ \sigma_t^2=\alpha_0 + \alpha_1 \epsilon_{t-1}^2 + \cdots + \alpha_q \epsilon_{t-q}^2 + \beta_1 \sigma_{t-1}^2 + \cdots + \beta_p\sigma_{t-p}^2 = \alpha_0 + \sum_{i=1}^q \alpha_i \epsilon_{t-i}^2 + \sum_{i=1}^p \beta_i \sigma_{t-i}^2 $$
where $\epsilon_t$ is an i.i.d. random variable, generally drawn from a standard normal N(0,1) or Student-t distribution, and $\alpha_0 > 0$, $\alpha_i \geq 0$, $\beta_i \geq 0$, $\sum_i \alpha_i + \sum_i \beta_i < 1$. In practice, low-order GARCH models such as GARCH(1,1), GARCH(1,2) and GARCH(2,1) are used in many applications.

With GARCH(1,1), the k-step-ahead variance forecast converges, as k grows, to the unconditional variance:
$$ \sigma_t^2 (k)=\frac{\alpha_0}{1-\alpha_1 - \beta_1} ; k \to \infty$$
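For a finite horizon the forecast can be built recursively; the standard GARCH(1,1) relation is
$$ \sigma_t^2 (k)= \alpha_0 + (\alpha_1 + \beta_1)\,\sigma_t^2 (k-1), \qquad k > 1, $$
which converges to the limit above as $k \to \infty$ provided $\alpha_1 + \beta_1 < 1$.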


Let's have a look at how to apply GARCH to the monthly returns of the S&P 500 (from 1926 onwards) within R.
>library("fGarch") ## Library for GARCH
# get data
>data=read.table("http://www.mif.vu.lt/~rlapinskas/DUOMENYS/Tsay_fts3/sp500.dat",header=F)
> dim(data)
[1] 792   1
## First step is to model the monthly return.  
# let's find out the lag order
>ret=pacf(data)
> which.max(abs(ret$acf))
[1] 3
## Lets use ARMA(3,0) and GARCH(1,1) to model return 
> m1=garchFit(~arma(3,0)+garch(1,1),data=data,trace=F)
> summary(m1)

Title:
 GARCH Modelling 

Call:
 garchFit(formula = ~arma(3, 0) + garch(1, 1), data = data, trace = F) 

Mean and Variance Equation:
 data ~ arma(3, 0) + garch(1, 1)

 [data = data]

Conditional Distribution:
 norm 

Coefficient(s):
         mu          ar1          ar2          ar3        omega       alpha1  
 7.7077e-03   3.1968e-02  -3.0261e-02  -1.0649e-02   7.9746e-05   1.2425e-01  
      beta1  
 8.5302e-01  

Std. Errors:
 based on Hessian 

Error Analysis:
         Estimate  Std. Error  t value Pr(>|t|)    
mu      7.708e-03   1.607e-03    4.798 1.61e-06 ***
ar1     3.197e-02   3.837e-02    0.833  0.40473    
ar2    -3.026e-02   3.841e-02   -0.788  0.43076    
ar3    -1.065e-02   3.756e-02   -0.284  0.77677    
omega   7.975e-05   2.810e-05    2.838  0.00454 ** 
alpha1  1.242e-01   2.247e-02    5.529 3.22e-08 ***
beta1   8.530e-01   2.183e-02   39.075  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

So, monthly return can be modeled as:
$$ r_t= 0.0078+0.032r_{t-1}-0.03r_{t-2}-0.01r_{t-3} + a_t$$ $$ \sigma_t^2=0.000080 + 0.12a_{t-1}^2+0.85\sigma_{t-1}^2$$
But since the t-values of the AR coefficients are insignificant, we can use GARCH(1,1) directly to model the series:
> m2=garchFit(~garch(1,1),data=data,trace=F)
> summary(m2)

Title:
 GARCH Modelling 

Call:
 garchFit(formula = ~garch(1, 1), data = data, trace = F) 

Mean and Variance Equation:
 data ~ garch(1, 1)

 [data = data]

Conditional Distribution:
 norm 

Coefficient(s):
        mu       omega      alpha1       beta1  
7.4497e-03  8.0615e-05  1.2198e-01  8.5436e-01  

Std. Errors:
 based on Hessian 

Error Analysis:
        Estimate  Std. Error  t value Pr(>|t|)    
mu     7.450e-03   1.538e-03    4.845 1.27e-06 ***
omega  8.061e-05   2.833e-05    2.845  0.00444 ** 
alpha1 1.220e-01   2.202e-02    5.540 3.02e-08 ***
beta1  8.544e-01   2.175e-02   39.276  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
So we can reduce the model to a simpler form:
$$ r_t= 0.0074+a_t$$ $$ \sigma_t^2=0.000080 + 0.12a_{t-1}^2+0.85\sigma_{t-1}^2$$
Hence, the unconditional variance of $a_t$ is:
$$ \frac{0.000080}{1-0.12 - 0.85} = 0.0027$$
Let's analyse whether the standardised residuals are serially independent.
>stress=residuals(m2,standardize=T)
> Box.test(stress,12,type='Ljung')

 Box-Ljung test

data:  stress
X-squared = 11.9994, df = 12, p-value = 0.4457
So the test above indicates that there is no significant serial correlation in the residuals, and we can conclude GARCH(1,1) is a good model for the S&P 500. We can also predict the volatility of the monthly S&P 500 return for the next 5 months:
> predict(m2,5)
  meanForecast  meanError standardDeviation
1  0.007449721 0.05377242        0.05377242
2  0.007449721 0.05388567        0.05388567
3  0.007449721 0.05399601        0.05399601
4  0.007449721 0.05410353        0.05410353
5  0.007449721 0.05420829        0.05420829

Friday, November 29, 2013

Testing AutoRegressive Conditional Heteroskedasticity (ARCH) Effect within R

Even though the title "AutoRegressive Conditional Heteroskedasticity (ARCH)" itself sounds scary, the basic idea is simple: the conditional volatility (heteroskedasticity) of a time series is itself (auto) regressive (ARMA(p,q)); in other words, volatility is not constant over time:
$$ \sigma_t^2=\alpha_0 + \alpha_1 \epsilon_{t-1}^2 + \cdots + \alpha_q \epsilon_{t-q}^2 = \alpha_0 + \sum_{i=1}^q \alpha_i \epsilon_{t-i}^2 $$

In this text, I explain how to test whether a time series has an ARCH effect. In the formula above, if the null hypothesis is $\alpha_i=0$ for all $i$, we conclude there is an ARCH effect whenever that null hypothesis is rejected. The Box.test command in R can be used for this purpose; note that Box.test computes the Ljung-Box test statistic for examining the null hypothesis of independence in a given time series. Below we analyse the log returns of Intel from 1973 to 2008.
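For reference, the Ljung-Box statistic for a series of length $T$ tested at $m$ lags is
$$ Q(m) = T(T+2)\sum_{k=1}^{m}\frac{\hat{\rho}_k^2}{T-k}, $$
which is compared against a $\chi^2$ distribution with $m$ degrees of freedom; applying it to the squared mean-adjusted returns, as done below, is the usual check for an ARCH effect.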

>data=read.table("http://www.mif.vu.lt/~rlapinskas/DUOMENYS/Tsay_fts3/m-intc7308.txt",header=T)
>ret=log(data[,2]+1)
# lets find out lag 
>a=pacf(ret)
>summary(a)
       Length Class  Mode     
acf    26     -none- numeric  
type    1     -none- character
n.used  1     -none- numeric  
lag    26     -none- numeric  
series  1     -none- character
snames  0     -none- NULL  
# now test if there is a serial correlation
>Box.test(ret,lag=26,type='Ljung')

 Box-Ljung test

data:  ret
X-squared = 37.0152, df = 26, p-value = 0.07452

## As we cannot reject the null hypothesis (independence), we assume there is no serial correlation. 
## So we can now test if variance is constant or not.
> var=(ret-mean(ret))^2
> Box.test(var,lag=26,type='Ljung')

 Box-Ljung test

data:  var
X-squared = 104.7286, df = 26, p-value = 2.073e-11
Box.test shows we can reject the null hypothesis (independence) for the squared deviations, so they have significant serial correlation; in other words, there is an ARCH effect.

Friday, November 22, 2013

Java Garbage Collection Log File Scraper

'''
 
Created by Deniz Turan
 
Oracle Java (tested on 7) young-generation garbage collection activity scraper. 
Extracts the following fields from a GC log file and saves them to a CSV file.

Count,LogTime,logGCOffsetTime,logGCOffsetTime2, 
YGPreSize,YGPostSize,YGTotalSize, YGElapsedTime,     # Young generation
OLDPreSize,OLDPostSize,OLDTotalSize,OLDElapsedTime   # Old generation 

 
'''
from subprocess import call
import glob
import os
 
logDir="C:\\temp\\gc\\"
finalResultFileName=logDir+"finalResults.csv"
filterExtension="*.log";
 
def getLogFileList(search_dir):
        files = filter(os.path.isfile, glob.glob(search_dir + filterExtension))
        files.sort(key=lambda x: os.path.getmtime(x))       
        return files
 
def openResultFile():
    print "Creating result file : %s"% (finalResultFileName)
    # remove previous file
    call("rm "+finalResultFileName,shell=True)
    resultFileFD = open( finalResultFileName ,"a")
    ## create header
    resultFileFD.write("Count,LogTime,logGCOffsetTime,logGCOffsetTime2,")
    resultFileFD.write("YGPreSize,YGPostSize,YGTotalSize, YGElapsedTime,")
    resultFileFD.write("OLDPreSize,OLDPostSize,OLDTotalSize,OLDElapsedTime\n")
    return resultFileFD
       
def closeResultFile(resultFileFD):   
    print "Closing result file "
    resultFileFD.close();
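
# Note on the expected input: the parsing below assumes a ParNew log entry roughly of
# this shape (a hypothetical sample line; the exact layout depends on the JVM flags used):
#
#   2013-11-22T10:15:30.123+0000: 12.345: [GC 12.345: [ParNew: 943744K->58624K(943744K), 0.0567890 secs] 1203456K->325678K(4136960K), 0.0571230 secs] ...
#
# After splitting on spaces: field 0 is the log timestamp, fields 1 and 3 are the GC offset
# times, field 5 holds the young-generation pre->post(total) sizes, field 6 its pause time,
# field 8 the heap sizes used for the OLD* columns, and field 9 its elapsed time.
# getFieldValue() slices the three numbers out of a "preK->postK(totalK)," token.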
 
def getFieldValue(strVal):
    index=strVal.index("K")
    index2=strVal.index("K", index+1)
    index3=strVal.index("K", index2+1)
    
    part1=strVal[:index]
    part2=strVal[index+3:index2]
    part3=strVal[index2+2:index3]
    return (part1,part2,part3)
 
#####################################################
# Main
#####################################################
if __name__ == '__main__':
    # prepare result file   
    resultFileFD=openResultFile ()
    
    print "Started to process log files"
    logFileList=getLogFileList(logDir)
    count=0
    for f in logFileList:
        print "Processing GC Log file %s"%f
        logFD = open(f)
        line = logFD.readline()
        while (line != "" ):           
            if "ParNew" in line :
                    count=count+1                   
                    fields=line.split(" ")
                    logTime=fields[0]
                    logGCOffsetTime=fields[1]
                    logGCOffsetTime2=fields[3]
                    res=getFieldValue(fields[5])
                    YGPreSize,YGPostSize,YGTotalSize=res
                    YGElapsedTime=fields[6]
                    res=getFieldValue(fields[8])
                    OLDPreSize,OLDPostSize,OLDTotalSize=res
                    OLDElapsedTime=fields[9]
                    print line
                   
                    print "%d %s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s\n"%(count,logTime,logGCOffsetTime,logGCOffsetTime2, \
                                                                   YGPreSize,YGPostSize,YGTotalSize, YGElapsedTime,\
                                                                   OLDPreSize,OLDPostSize,OLDTotalSize,OLDElapsedTime)
                    # print to file as CSV now
                    resultFileFD.write("%d,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s\n"%(count,logTime,logGCOffsetTime,logGCOffsetTime2, \
                                                                   YGPreSize,YGPostSize,YGTotalSize, YGElapsedTime,\
                                                                   OLDPreSize,OLDPostSize,OLDTotalSize,OLDElapsedTime))                   
                    
            line = logFD.readline()
            continue
        logFD.close();
    closeResultFile(resultFileFD);           
    print "finished processing log files"
    pass

Monday, November 18, 2013

Simple Passive Momentum Trading with Bollinger Band

Below you can see a simple trading algorithm based on momentum and Bollinger Bands, written for Quantopian.com.
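The entry signal is the rolling z-score of the price: a position is opened only when the price is more than two rolling standard deviations away from its moving mean, i.e. outside the Bollinger Band:
$$ z_t = \frac{P_t - \mu_t}{\sigma_t}, \qquad \text{trade only if } |z_t| > 2 $$
where $\mu_t$ and $\sigma_t$ are the mean and standard deviation of the price over the trailing 30-day window.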

# Simple Passive Momentum Trading with Bollinger Band
import numpy as np
import statsmodels.api as stat
import statsmodels.tsa.stattools as ts

# globals for batch transform decorator
R_P = 1 # refresh period in days
W_L = 30 # window length in days
lookback=22
def initialize(context):
    context.stock = sid(24) # Apple (ignoring look-ahead bias)
    # for long and shorting 
    context.max_notional = 1000000
    context.min_notional = -1000000.0
    # set a fixed slippage
    set_slippage(slippage.FixedSlippage(spread=0.01))
                
def handle_data(context, data):
    # find moving average 
    rVal=getMeanStd(data)

    # let's not do anything if we don't have enough data yet    
    if rVal is None:
        return    
    
    meanPrice,stdPrice = rVal
    price=data[context.stock].price
    notional = context.portfolio.positions[context.stock].amount * price
    
    # Passive momentum trading: the Z-score is used as the trading signal
    h=((price-meanPrice)/stdPrice)
    # Bollinger band: if the price is more than 2 std away from the moving mean, then trade     
    if h>2 and notional < context.max_notional  :
       # long
       order(context.stock,h*1000)
    if h<-2 and notional > context.min_notional:
       # short
       order(context.stock,h*1000)
     
@batch_transform(window_length=W_L, refresh_period=R_P) 
def getMeanStd(datapanel):
    prices = datapanel['price']
    meanPrice=prices.mean()
    stdPrice=prices.std()
    if meanPrice is not None and stdPrice is not None :
        return (meanPrice, stdPrice)
    else:
        return None

A screenshot of the backtesting result:
Click here to run the algorithm on Quantopian.com.

Sunday, November 10, 2013

Cointegration Tests (ADF and Johansen) within R

In pair trading, in addition to correlation, cointegration can be a very useful tool for deciding which securities to pick. In this text, I demonstrate two approaches for testing cointegration between two financial time series, the ETFs EWA and EWC, within R. As the chart below shows, there has been cointegration between EWA and EWC over the last 5 years.


ADF Test

In this test, we use linear regression to estimate the spread between the two securities and then the ADF test to check whether that spread is stationary, which in effect also tests whether the two securities are cointegrated.
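In symbols, with the zero-drift assumption used below, the regression and the spread are
$$ EWC_t = \gamma \, EWA_t + \epsilon_t, \qquad s_t = EWC_t - \gamma \, EWA_t, $$
and the ADF test asks whether $s_t$ has a unit root; rejecting the unit-root null means the spread is stationary, i.e. the pair is cointegrated.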
>library("quantmod")  # To get data for symbols
> library("fUnitRoots") # Unit test 

## First, let's get data for EWA and EWC from Yahoo Finance and extract the adjusted close prices
>getSymbols("EWA")
>getSymbols("EWC")
>ewaAdj=unclass(EWA$EWA.Adjusted)
>ewcAdj=unclass(EWC$EWC.Adjusted)

## Now let's run the linear regression, where we assume the drift is zero. Since we are not sure which security is dependent and which is independent, we need to run it both ways

## EWC is dependent here 
> reg=lm (ewcAdj~ewaAdj+0)

## And now let's run the ADF test on the spread (which is just the residuals of the regression in the step above)
> adfTest(reg$residuals, type="nc")

Title:
 Augmented Dickey-Fuller Test

Test Results:
  PARAMETER:
    Lag Order: 1
  STATISTIC:
    Dickey-Fuller: -1.8082
  P VALUE:
    0.07148 

## EWA is dependent here this time
> reg=lm (ewaAdj~ewcAdj+0)
> adfTest(reg$residuals, type="nc") 

Title:
 Augmented Dickey-Fuller Test

Test Results:
  PARAMETER:
    Lag Order: 1
  STATISTIC:
    Dickey-Fuller: -1.7656
  P VALUE:
    0.07793 

We use the most negative Dickey-Fuller statistic (-1.8082 vs -1.7656) to choose which regression to use; based on that, we pick EWC as the dependent variable. At the 90% confidence level (the p-value is about 7%), we can reject the null hypothesis (unit root), so we can assume the spread (residual) is stationary and therefore that there is cointegration. Below is a function coded for this purpose:

cointegrationTestLM_ADF <-function(A, B, startDate) {
  cat("Processing stock:",A ," and ", B, " start date:",startDate)
  
  aData=getSymbols(A,from=startDate,auto.assign = FALSE)
  aAdj=unclass(aData[,6])
  bData=getSymbols(B,from=startDate,auto.assign = FALSE)
  bAdj=unclass(bData[,6])
  lenA=length(aAdj)
  lenB=length(bAdj)
  N= min(lenA,lenB) 
  startA=0
  startB=0
  if (lenA!=N || lenB!=N){
    startA=lenA-N+1
    startB=lenB-N+1
  }
  cat("\nIndex start",A,":",startA," Length ",lenA )
  cat("\nIndex start",B,":",startB," Length ",lenB)
  aAdj=aAdj[startA:lenA,]
  bAdj=bAdj[startB:lenB,]
  
  regA=lm(aAdj~bAdj+0)
  
  summary(regA)
  regB=lm(bAdj~aAdj+0)
  summary(regB)
  
  coA <- adfTest(regA$residuals, type="nc")
  coB=adfTest(regB$residuals, type="nc")   
  
  
  cat("\n",A," p-value",coA@test$p.value," statistics:",coA@test$statistic)     
  cat("\n",B," p-value",coB@test$p.value," statistics:",coB@test$statistic)     
  
  
  # Let's choose the most negative statistic
  if (coA@test$statistic < coB@test$statistic){
   cat("\nStock ",A, " is dependent on stock ",B)
    cat("\np-value",coA@test$p.value," statistics:",coA@test$statistic)     
    p=coA@test$p.value
    s=coA@test$statistic
  }else {
    cat("\n Stock ",B, " is dependent on stock:",A)
    cat("\n p-value",coB@test$p.value," statistics:",coB@test$statistic)     
    p=coB@test$p.value
    s=coB@test$statistic     
   }   
  return(c(s,p))
}
How to run it:
res=cointegrationTestLM_ADF("EWA","EWC",'2007-01-01')
Processing stock: EWA  and  EWC  start date: 2007-01-01
Index start EWA : 0  Length  1731
Index start EWC : 0  Length  1731
 EWA  p-value 0.0501857  statistics: -1.948774
 EWC  p-value 0.04719164  statistics: -1.981454
 Stock  EWC  is dependent on stock: EWA
 p-value 0.04719164  statistics: -1.981454

res
  -1.98145360    0.04719164 

Johansen Test

As seen above, the ADF approach has some drawbacks:
- it is not clear which security should be treated as dependent and which as independent
- it cannot test more than two instruments at once

The Johansen test addresses these points.
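Ignoring deterministic terms, the Johansen procedure works on the vector error-correction form of the series,
$$ \Delta y_t = \Pi y_{t-1} + \sum_{i=1}^{K-1} \Gamma_i \Delta y_{t-i} + \epsilon_t, $$
and tests the rank of $\Pi$: rank 0 means no cointegration, while rank $r > 0$ means there are $r$ cointegrating relations among the series.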
> library("urca") # For cointegration 

> coRes=ca.jo(data.frame(ewaAdj,ewcAdj),type="trace",K=2,ecdet="none", spec="longrun")
> summary(coRes)

###################### 
# Johansen-Procedure # 
###################### 

Test type: trace statistic , with linear trend 

Eigenvalues (lambda):
[1] 0.004881986 0.001200577

Values of teststatistic and critical values of test:

          test 10pct  5pct  1pct
r <= 1 |  2.07  6.50  8.18 11.65
r = 0  | 10.51 15.66 17.95 23.52

Eigenvectors, normalised to first column:
(These are the cointegration relations)

                EWA.Adjusted.l2 EWC.Adjusted.l2
EWA.Adjusted.l2        1.000000       1.0000000
EWC.Adjusted.l2       -1.253545      -0.3702406

Weights W:
(This is the loading matrix)

               EWA.Adjusted.l2 EWC.Adjusted.l2
EWA.Adjusted.d     0.007172485    -0.003894786
EWC.Adjusted.d     0.011970316    -0.001504604

The Johansen test estimates the rank (r) of the given matrix of time series at a chosen confidence level. In our example we have two time series, so the Johansen procedure tests the null hypotheses r = 0 (no cointegration at all) up to r <= n-1, where n = 2 here. If the r <= 1 test statistic (2.07) had been greater than a critical value (say at 10%: 6.50), we would conclude there is cointegration at that rank. But as you can see, none of the test statistics exceed their critical values at r = 0 or r <= 1, so there is no cointegration. This is the opposite of the ADF result found above; based on some research, I have found that the Johansen test can be misleading in some extreme cases (see that discussion for more info). Once cointegration is established, the eigenvector (normalised first column) can be used as the portfolio weights, as sketched below.
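As a rough sketch (assuming the urca ca.jo object exposes the normalised eigenvectors in its V slot and using the price series loaded earlier), the first cointegrating relation could be turned into a spread like this:

# hypothetical example: weights from the first (normalised) eigenvector
w=coRes@V[,1]
# spread implied by the cointegrating relation: w[1]*EWA + w[2]*EWC
spreadJo=as.matrix(data.frame(ewaAdj,ewcAdj)) %*% w
plot(spreadJo, type='l')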

In addition to the above methods, the KPSS (Kwiatkowski–Phillips–Schmidt–Shin) test can also be used to test stationarity.

Sunday, November 03, 2013

Stationary Tests : Augmented Dickey–Fuller (ADF), Hurst Exponent, Variance Ratio (VRTest) of Time Series within R

Several trading strategies (momentum, mean reverting, ...) depend on whether the data is stationary or not. In this text, I demonstrate how to test this statistically.
library("quantmod") # for downloading fx data
library("pracma") # for hurst exponent 
library("vrtest") # variance ratio test
library("tseries") # for adf test
library("fUnitRoots")  # for adf test

## first let's fetch daily USD/CAD data
getFX("USD/CAD")
usdCad=unclass(USDCAD) # unwrap price column
# log prices
n=length(usdCad)
usdcadLog=log(usdCad[1:n])

## First use the Augmented Dickey-Fuller test (adfTest from the fUnitRoots package) to check whether USD/CAD is stationary
>adfTest(usdCad, lag=1)

Title:
 Augmented Dickey-Fuller Test

Test Results:
  PARAMETER:
    Lag Order: 1
  STATISTIC:
    Dickey-Fuller: 0.2556
  P VALUE:
    0.6978 

Description:
 Sun Nov 03 16:47:27 2013 by user: deniz

## As you see above, the null hypothesis (unit root) cannot be rejected, with a p-value of ~70%

## So we have demonstrated it is not stationary. The next question is whether it is trending or mean reverting. The Hurst exponent (H) can be used for this purpose (the Hurst exponent relies on the fact that a random walk diffuses in proportion to the square root of time; the scaling relation is noted after the list below). 
# Value of H can be interpreted as: 
# H=0.5: Brownian motion (random walk)
# H<0.5: mean reverting  
# H>0.5: trending 
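# Behind these thresholds is the scaling relation Var(z_{t+tau} - z_t) ~ tau^(2H):
# H=0.5 recovers the sqrt(time) diffusion of a random walk, H<0.5 means deviations
# die out faster than that (mean reversion), and H>0.5 means they persist (trending).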
> hurst(usdcadLog) # Hurst exponent
[1] 0.9976377

#So, USDCAD is in trending phase. 

## Another way to test stationarity is to use variance-ratio tests:
> vrtest::Auto.VR(usdcadLog)
[1] 83.37723
> vrtest::Lo.Mac(usdcadLog,c(2,4,10))
$Stats
           M1       M2
k=2  22.10668 15.98633
k=4  35.03888 25.46031
k=10 56.21660 41.58861

## Another way to analyse stationarity is via linear regression, in which we try to establish whether the change in the series depends on its lagged level

>deltaUsdcadLog=c(0,usdcadLog[2:n]-usdcadLog[1:n-1])
> r=lm(deltaUsdcadLog ~ usdcadLog)
> summary(r)

Call:
lm(formula = deltaUsdcadLog ~ usdcadLog)

Residuals:
       Min         1Q     Median         3Q        Max 
-0.0121267 -0.0013094 -0.0000982  0.0012327  0.0103982 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)  
(Intercept) -7.133e-05  1.306e-04  -0.546   0.5853  
usdcadLog    8.772e-03  5.234e-03   1.676   0.0944 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.002485 on 498 degrees of freedom
Multiple R-squared:  0.005608, Adjusted R-squared:  0.003611 
F-statistic: 2.808 on 1 and 498 DF,  p-value: 0.0944

> r$coefficients[2]
  usdcadLog 
0.008771754 

## The coefficient (beta) on the lagged level gives a clue about mean reversion: if it is negative, the series is mean reverting. As you see above it is positive, so, as we already concluded, the series is trending. If it were negative, we would use the following to find the half-life of mean reversion:

>-log(2)/r$coefficients[2]
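The half-life expression comes from reading the fitted relation $\Delta y_t = \lambda y_{t-1} + c + \epsilon_t$ as a discretised Ornstein-Uhlenbeck process: deviations from the mean decay roughly as $e^{\lambda t}$, so the time for a deviation to halve is
$$ t_{1/2} = -\frac{\ln 2}{\lambda}. $$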

Wednesday, October 30, 2013

Downloading daily USD/CAD fx rate within R

The code below illustrates how to download the daily USD/CAD fx rate within R from http://ratedata.gaincapital.com
downloadAndExtract_USD_CAD <- function(years,outFile) {
  mStr=c("01","02","03","04","05","06","07","08","09","10","11","12")

  if (file.exists(outFile)) {
    file.remove(outFile)
  }
  
  URL="http://ratedata.gaincapital.com"
  for (y in years) {
    for (m in mStr) {
      for (w in 1:5){
        mName=paste(m,month.name[as.numeric(m)])
        wName=paste("USD_CAD_Week",w,".zip",sep="");
        fullName=paste(URL,y,mName,wName,sep="/")
        a=paste("Downloading:",fullName,"\n" ,sep=" ",collapse="")
        cat(a)
        try(downloadAndExtractData(fullName,outFile),silent=TRUE)
      }
    }
  }
}

downloadAndExtractData <- function(zipfile,outFile) {
  # Create a name for the dir where we'll unzip
  zipdir <- tempfile()
  # Create the dir using that name
  dir.create(zipdir)
  dFile=paste(zipdir,"\\zzz.zip",sep="");
  print (dFile)
  try(download.file(zipfile,destfile=dFile, mode="wb"),silent=T)
  # Unzip the file into the dir
  unzip(dFile, exdir=zipdir)
  # Get the files into the dir
  files <- list.files(zipdir)
  # Throw an error if there's more than one
  if(length(files)>2) stop("More files than expected inside zip")
  # Get the full name of the file
  f<- paste(zipdir, files[1], sep="/")
  size=file.info(f)$size
  if (size==0) { 
    cat("Zero file size\n")
    return()
  }
  # Read the file
  cat("\n")
  print(c("Downladed tmp file:",f))
  cat("\n")
  dat=read.csv(f,header=F)
  len=dim(dat)
  print(c("Downloaded #nrows:",len[1]))
   
  #we are just interested in prices at 16:59
  index=grepl(" 16:59",dat[,"V3"])
  d=dat[index,]
  
  # append to output file
  write.table(d,outFile,append=TRUE,row.names=FALSE,col.names=FALSE) 
  
  # lets read all written data so far
  dat=read.csv(outFile)  
  cat("\n")
  print(c("Total rows:",dim(dat)[1]))
  
  # tidy up a bit
  file.remove(f)
}

Friday, October 25, 2013

fBasics (Basic Stats) library in R

# Load the package fBasics.
> library(fBasics)

# Load the data.
# header=T means the 1st row of the data file contains variable names. The default is header=F, i.e., no names.
> da=read.table("http://www.mif.vu.lt/~rlapinskas/DUOMENYS/Tsay_fts3/d-ibm3dx7008.txt",header=T) 

> ibm=da[,2]   # Obtain IBM simple returns
> sibm=ibm*100 # Percentage simple returns

> basicStats(sibm) # Compute the summary statistics

# Turn to log returns in percentages
> libm=log(ibm+1)*100
> t.test(libm) # Test mean being zero.

Sunday, September 29, 2013

Efficient Frontier Portfolio Monte Carlo Simulation in Python

'''
Created on 29 Sep 2013
@author: deniz turan (denizstij AT gmail DOT com)
'''

import numpy as np
import QSTK.qstkutil.qsdateutil as du
import QSTK.qstkutil.tsutil as tsu
import QSTK.qstkutil.DataAccess as da

import datetime as dt
import matplotlib.pyplot as plt
import pandas as pd
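
# Overview of the script (descriptive comments only):
# - simulate() pulls close-price data via QSTK for the given symbols and dates, builds a
#   weighted portfolio value series, and returns its mean daily return, volatility,
#   annualised Sharpe ratio (risk-free rate taken as 0) and final cumulative value.
# - genareteWeight() draws random portfolio weights, either long-only or with shorts
#   allowed, normalised to sum to 1.
# - The main section runs 10,000 random-weight trials over 2011 for four symbols and
#   scatter-plots volatility against return to trace out the efficient frontier.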


class PortfolioEfficientFrontierMonteCarloSimulator():
    
    
    def simulate(self, dt_start, dt_end, ls_symbols, weights):        
        dt_timeofday = dt.timedelta(hours=16)
        ldt_timestamps = du.getNYSEdays(dt_start, dt_end, dt_timeofday)

        c_dataobj = da.DataAccess('Yahoo')
        ls_keys = ['open', 'high', 'low', 'close', 'volume', 'actual_close']
        ldf_data = c_dataobj.get_data(ldt_timestamps, ls_symbols, ls_keys)
        d_data = dict(zip(ls_keys, ldf_data))        
        closePrices = d_data['close'].values
        
        rows = closePrices.shape[0]
        
        dailyReturns = (closePrices[0:rows, :] / closePrices[0, :])
        dailyPortfolioCum = dailyReturns * weights;
        dailyPortfolioSum = np.sum(dailyPortfolioCum, axis=1)
        dailyPortfolioRet = (dailyPortfolioSum[1:rows] / dailyPortfolioSum[0:rows - 1]) - 1
        dailyPortfolioRet = np.append(0, dailyPortfolioRet) 

        mean = np.mean(dailyPortfolioRet, axis=0)
        vol = np.std(dailyPortfolioRet, axis=0)
        cumReturn = dailyPortfolioSum[-1]
        T = 252;
        riskFree = 0
        sharpRatio = np.sqrt(T) * (mean - riskFree) / vol
        return [mean, vol, sharpRatio,cumReturn]

    def generateShortWeights(self, numWeight):        
        weights = np.zeros(numWeight)
        weights[0:numWeight - 1] = np.random.randint(-101, 101, numWeight - 1)
        weights[numWeight - 1] = 100 - np.sum(weights)
        np.random.shuffle(weights) # eliminate bias on last element
        weights = weights / np.sum(weights)    
        return weights

    def generateNoShortWeights(self, numWeight):        
        weights = np.random.rand(numWeight)
        weights = weights / np.sum(weights)
        return weights

    
    def genareteWeight(self, numWeight, isShortAllowed):
        if isShortAllowed :
            return self.generateShortWeights(numWeight)
        else:
            return self.generateNoShortWeights(numWeight)
        
                
myClass = PortfolioEfficientFrontierMonteCarloSimulator()
dt_start = dt.datetime(2011, 1, 1)
dt_end = dt.datetime(2011, 12, 31)
stockList = ['AAPL', 'GLD', 'GOOG', 'XOM']
# weight=[0.4, 0.4, 0.0, 0.2]
numTrial=10000
res= np.zeros((numTrial,4))
weights= np.zeros((numTrial,4))

for i in range(0, numTrial):
    weights[i] = myClass.genareteWeight(4, True)    
    res[i,:]=myClass.simulate(dt_start, dt_end, stockList, weights[i])
    
#  find index of min/max of mean and vol
minMeanIndex=res[:,0].argmin()
maxMeanIndex=res[:,0].argmax()
minVolIndex=res[:,1].argmin()
maxVolIndex=res[:,1].argmax()

# min and max mean and vol
maxVol=res[maxVolIndex,1]
maxMean=res[maxMeanIndex,0]
minVol=res[minVolIndex,1]
minMean=res[minMeanIndex,0]

# lets plot now 
plt.clf()
plt.scatter(res[:,1],res[:,0],marker="+", linewidths=0.5)
# Plot global mean variance portfolio
plt.scatter(res[minVolIndex,1],res[minVolIndex,0],c='m',marker='x',linewidths=3) 
plt.xlim([minVol*0.8,maxVol*1.2])
plt.ylim([minMean*0.8,maxMean*1.2])
plt.ylabel('Return')
plt.xlabel('Vol')
plt.savefig('efficientFrontier.png', format='png')

# lets print some stats now
print "Global Mean-Variance, mean=%s, vol=%s, weights: %s" %( res[minVolIndex,0],res[minVolIndex,1], weights[minVolIndex,:])
print "minMeanIndex  mean=%s, vol=%s, weights: %s" %( res[minMeanIndex,0],res[minMeanIndex,1], weights[minMeanIndex,:])
print "maxMeanIndex  mean=%s, vol=%s, weights: %s" %( res[maxMeanIndex,0],res[maxMeanIndex,1], weights[maxMeanIndex,:])
print "maxVolIndex mean=%s, vol=%s, weights: %s" %( res[maxVolIndex,0],res[maxVolIndex,1], weights[maxVolIndex,:])

Efficient Frontier Portfolio (no short allowed)