""" Created on Sat Jan-03-2015 @author: Deniz Turan (http://denizstij.blogspot.co.uk/) """ import numpy as np import statsmodels.api as sm from scipy.interpolate import interp1d def kpssTest(x, regression="LEVEL",lshort = True): """ KPSS Test for Stationarity Computes the Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test for the null hypothesis that x is level or trend stationary. Parameters ---------- x : array_like, 1d data series regression : str {'LEVEL','TREND'} Indicates the null hypothesis and must be one of "Level" (default) or "Trend". lshort : bool a logical indicating whether the short or long version of the truncation lag parameter is used. Returns ------- stat : float Test statistic pvalue : float the p-value of the test. usedlag : int Number of lags used. Notes ----- Based on kpss.test function of tseries libraries in R. To estimate sigma^2 the Newey-West estimator is used. If lshort is TRUE, then the truncation lag parameter is set to trunc(3*sqrt(n)/13), otherwise trunc(10*sqrt(n)/14) is used. The p-values are interpolated from Table 1 of Kwiatkowski et al. (1992). If the computed statistic is outside the table of critical values, then a warning message is generated. Missing values are not handled. References ---------- D. Kwiatkowski, P. C. B. Phillips, P. Schmidt, and Y. Shin (1992): Testing the Null Hypothesis of Stationarity against the Alternative of a Unit Root. Journal of Econometrics 54, 159--178. Examples -------- x=numpy.random.randn(1000) # is level stationary kpssTest(x) y=numpy.cumsum(x) # has unit root kpssTest(y) z=x+0.3*arange(1,len(x)+1) # is trend stationary kpssTest(z,"TREND") """ x = np.asarray(x,float) if len(x.shape)>1: raise ValueError("x is not an array or univariate time series") if regression not in ["LEVEL", "TREND"]: raise ValueError(("regression option %s not understood") % regression) n = x.shape[0] if regression=="TREND": t=range(1,n+1) t=sm.add_constant(t) res=sm.OLS(x,t).fit() e=res.resid table=[0.216, 0.176, 0.146, 0.119] else: t=np.ones(n) res=sm.OLS(x,t).fit() e=res.resid table=[0.739, 0.574, 0.463, 0.347] tablep=[0.01, 0.025, 0.05, 0.10] s=np.cumsum(e) eta=np.sum(np.power(s,2))/(np.power(n,2)) s2 = np.sum(np.power(e,2))/n if lshort: l=np.trunc(3*np.sqrt(n)/13) else: l=np.trunc(10*np.sqrt(n)/14) usedlag =int(l) s2=R_pp_sum(e,len(e),usedlag ,s2) stat=eta/s2 pvalue , msg=approx(table, tablep, stat) print "KPSS Test for ",regression," Stationarity\n" print ("KPSS %s=%f" % (regression, stat)) print ("Truncation lag parameter=%d"% usedlag ) print ("p-value=%f"%pvalue ) if msg is not None: print "\nWarning:",msg return ( stat,pvalue , usedlag ) def R_pp_sum (u, n, l, s): tmp1 = 0.0 for i in range(1,l+1): tmp2 = 0.0 for j in range(i,n): tmp2 += u[j]*u[j-i] tmp2 = tmp2*(1.0-(float(i)/((float(l)+1.0)))) tmp1 = tmp1+tmp2 tmp1 = tmp1/float(n) tmp1 = tmp1*2.0 return s + tmp1 def approx(x,y,v): if (v>x[0]): return (y[0],"p-value smaller than printed p-value") if (v
Showing posts with label Financial Time Series Analysis. Show all posts
Showing posts with label Financial Time Series Analysis. Show all posts
Saturday, January 03, 2015
Stationarity Test with KPSS (Kwiatkowski–Phillips–Schmidt–Shin) in Python
In addition to augmented Dickey–Fuller test (ADF), KPSS (Kwiatkowski–Phillips–Schmidt–Shin) is widely used to verify stationarity of a signal. Python statsmodels contains ADF test, but I could not find any implementation of KPSS in python. Therefore, I converted R implementation of kpss.test in tseries library to python as follows (Click here to download the source code) :
Sunday, November 10, 2013
Cointegration Tests (ADF and Johansen) within R
In pair trading, in addition to correlation, cointegration can be a very useful tool for deciding which securities to pick. In this post, I demonstrate two approaches to testing cointegration between two financial time series, the ETFs EWA and EWC, within R. As the chart below shows, there has been cointegration between EWA and EWC over the last 5 years.
ADF Test
In this test, we use linear regression to estimate the spread between two securities and then the ADF test to check whether that spread is stationary, which in effect is also a test of cointegration between the two securities.

> library("quantmod")   # to get data for the symbols
> library("fUnitRoots") # unit root tests

## First get data for EWA and EWC from Yahoo Finance and extract the adjusted close prices
> getSymbols("EWA")
> getSymbols("EWC")
> ewaAdj=unclass(EWA$EWA.Adjusted)
> ewcAdj=unclass(EWC$EWC.Adjusted)

## Now run a linear regression with the drift assumed to be zero. Since we are not sure
## which security is dependent and which is independent, we do this both ways.
## EWC is dependent here
> reg=lm(ewcAdj~ewaAdj+0)
## ADF test on the spread (which is just the residuals of the regression above)
> adfTest(reg$residuals, type="nc")

Title:
 Augmented Dickey-Fuller Test

Test Results:
  PARAMETER:
    Lag Order: 1
  STATISTIC:
    Dickey-Fuller: -1.8082
  P VALUE:
    0.07148

## EWA is dependent this time
> reg=lm(ewaAdj~ewcAdj+0)
> adfTest(reg$residuals, type="nc")

Title:
 Augmented Dickey-Fuller Test

Test Results:
  PARAMETER:
    Lag Order: 1
  STATISTIC:
    Dickey-Fuller: -1.7656
  P VALUE:
    0.07793

We use the more negative Dickey-Fuller statistic (-1.8082 vs -1.7656) to decide which regression to keep; on that basis EWC is taken as the dependent variable. At the 90% confidence level (the p-value is about 7%) we can reject the null hypothesis of a unit root, so the spread (the residuals) can be treated as stationary and the two series as cointegrated. Below is a function that wraps this procedure:
cointegrationTestLM_ADF <- function(A, B, startDate) {
  cat("Processing stock:", A, " and ", B, " start date:", startDate)
  aData = getSymbols(A, from = startDate, auto.assign = FALSE)
  aAdj = unclass(aData[, 6])
  bData = getSymbols(B, from = startDate, auto.assign = FALSE)
  bAdj = unclass(bData[, 6])
  lenA = length(aAdj)
  lenB = length(bAdj)
  N = min(lenA, lenB)

  # Align both series to the same (most recent) window
  startA = 0
  startB = 0
  if (lenA != N || lenB != N) {
    startA = lenA - N + 1
    startB = lenB - N + 1
  }
  cat("\nIndex start", A, ":", startA, " Length ", lenA)
  cat("\nIndex start", B, ":", startB, " Length ", lenB)
  aAdj = aAdj[startA:lenA, ]
  bAdj = bAdj[startB:lenB, ]

  # Regress in both directions, since we do not know which security is dependent
  regA = lm(aAdj ~ bAdj + 0)
  summary(regA)
  regB = lm(bAdj ~ aAdj + 0)
  summary(regB)

  # ADF test on the residuals (the spread) of each regression
  coA <- adfTest(regA$residuals, type = "nc")
  coB = adfTest(regB$residuals, type = "nc")
  cat("\n", A, " p-value", coA@test$p.value, " statistics:", coA@test$statistic)
  cat("\n", B, " p-value", coB@test$p.value, " statistics:", coB@test$statistic)

  # Choose the most negative test statistic
  if (coA@test$statistic < coB@test$statistic) {
    cat("\nStock ", A, " is dependent on stock ", B)
    cat("\np-value", coA@test$p.value, " statistics:", coA@test$statistic)
    p = coA@test$p.value
    s = coA@test$statistic
  } else {
    cat("\n Stock ", B, " is dependent on stock:", A)
    cat("\n p-value", coB@test$p.value, " statistics:", coB@test$statistic)
    p = coB@test$p.value
    s = coB@test$statistic
  }
  return(c(s, p))
}

How to run it:
res=cointegrationTestLM_ADF("EWA","EWC",'2007-01-01')
Processing stock: EWA and EWC start date: 2007-01-01
Index start EWA : 0  Length  1731
Index start EWC : 0  Length  1731
EWA  p-value 0.0501857   statistics: -1.948774
EWC  p-value 0.04719164  statistics: -1.981454
Stock EWC is dependent on stock: EWA
p-value 0.04719164  statistics: -1.981454
res
-1.98145360  0.04719164
Johansen Test
As you can see, the ADF approach above has some drawbacks:
- It is not clear which security should be treated as dependent and which as independent.
- It cannot test more than two instruments at once.
The Johansen test addresses these points.
> library("urca") # For cointegration > coRes=ca.jo(data.frame(ewaAdj,ewcAdj),type="trace",K=2,ecdet="none", spec="longrun") > summary(coRes) ###################### # Johansen-Procedure # ###################### Test type: trace statistic , with linear trend Eigenvalues (lambda): [1] 0.004881986 0.001200577 Values of teststatistic and critical values of test: test 10pct 5pct 1pct r <= 1 | 2.07 6.50 8.18 11.65 r = 0 | 10.51 15.66 17.95 23.52 Eigenvectors, normalised to first column: (These are the cointegration relations) EWA.Adjusted.l2 EWC.Adjusted.l2 EWA.Adjusted.l2 1.000000 1.0000000 EWC.Adjusted.l2 -1.253545 -0.3702406 Weights W: (This is the loading matrix) EWA.Adjusted.l2 EWC.Adjusted.l2 EWA.Adjusted.d 0.007172485 -0.003894786 EWC.Adjusted.d 0.011970316 -0.001504604Johansen test estimates the rank (r) of given matrix of time series with confidence level. In our example we have two time series, therefore Johansen tests null hypothesis of r=0 < (no cointegration at all), r<1 (till n-1, where n=2 in our example). As in example above, if r<=1 test value (2.07) was greater than a confidence level's value (say 10%: 6.50), we would assume there is a cointegration of r time series (in this case r<=1). But as you see, none of our test values are greater than than critical values at r<0 and r<=1, therefore there is no cointegration. This is opposite of ADF result we found above. Based on my some research, i've found that Johansen test can be misleading in some extreme case (see that discussion for more info). Once a cointegration is established, eigenvector (normalized first column) would be used as weight for a portfolio. In addition above methods, KPSS(Kwiatkowski–Phillips–Schmidt–Shin)can be also used to test stationarity.
Sunday, November 03, 2013
Stationarity Tests: Augmented Dickey–Fuller (ADF), Hurst Exponent, Variance Ratio (VRTest) of Time Series within R
Several trading strategies (momentum, mean-reverting, ...) depend on whether the data is stationary. In this post, I demonstrate how to test this statistically.
library("quantmod") # for downloading fx data library("pracma") # for hurst exponent library("vrtest") # variance ratio test library("tseries") # for adf test library("fUnitRoots") # for adf test ## first lets fetch USD/CAD data for last 5 years getFX("UDSCAD") usdCad=unclass(USDCAD) # unwrap price column # estimate log return n=length(usdCad) usdcadLog=log(usdCad[1:n]) ## First use Augmented Dickey–Fuller Test (adf.test) to test USD/CAD is statationary >adfTest(usdCad, lag=1) Title: Augmented Dickey-Fuller Test Test Results: PARAMETER: Lag Order: 1 STATISTIC: Dickey-Fuller: 0.2556 P VALUE: 0.6978 Description: Sun Nov 03 16:47:27 2013 by user: deniz ## As you see above, null hypothesis (unit root) can not be rejected with p-value ~70 % ## So we demonstrated it is not stationary. So if there is a trend or mean reverting. Hurst exponent (H) can be used for this purpose (Note Hursy exponent relies on that random walk diffuse in proportion to square root of time.). #Value of H can be interpreted such as: #H=0.5:Brownian motion (Random walk) #H<0.5:Mean reverting #H>0.5:Trending > hurst(usdcadLog) Hurst exponent [1] 0.9976377 #So, USDCAD is in trending phase. ## Another way to test stationary is to use V: > vrtest::Auto.VR(usdcadLog) [1] 83.37723 > vrtest::Lo.Mac(usdcadLog,c(2,4,10)) $Stats M1 M2 k=2 22.10668 15.98633 k=4 35.03888 25.46031 k=10 56.21660 41.58861 ## Another way to analyse stationary condition is via linear regression in which we will try to establish if there is a link between data and diff(data(t-1)) >deltaUsdcadLog=c(0,usdcadLog[2:n]-usdcadLog[1:n-1]) > r=lm(deltaUsdcadLog ~ usdcadLog) > summary(r) Call: lm(formula = deltaUsdcadLog ~ usdcadLog) Residuals: Min 1Q Median 3Q Max -0.0121267 -0.0013094 -0.0000982 0.0012327 0.0103982 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -7.133e-05 1.306e-04 -0.546 0.5853 usdcadLog 8.772e-03 5.234e-03 1.676 0.0944 . --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.002485 on 498 degrees of freedom Multiple R-squared: 0.005608, Adjusted R-squared: 0.003611 F-statistic: 2.808 on 1 and 498 DF, p-value: 0.0944 > r$coefficients[2] usdcadLog 0.008771754 ## Coefficient (Beta) gives clue about if there is mean reverting. If it is negative, there is a mean reverting. As you see above, it is positive, therefore as we already concluded before, it is trending. If it was negative, we would use following to find out half life of mean revertion: >-log(2)/r$coefficients[2]
Friday, October 25, 2013
fBasics (Basic Stats) library in R
% Load the package fBasics.
> library(fBasics)
% Load the data.
% header=T means the 1st row of the data file contains variable names.
% The default is header=F, i.e., no names.
> da=read.table("http://www.mif.vu.lt/~rlapinskas/DUOMENYS/Tsay_fts3/d-ibm3dx7008.txt",header=T)
> ibm=da[,2]          % Obtain IBM simple returns
> sibm=ibm*100        % Percentage simple returns
> basicStats(sibm)    % Compute the summary statistics
% Turn to log returns in percentages
> libm=log(ibm+1)*100
> t.test(libm)        % Test mean being zero.
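As a quick sanity check on what t.test reports, note that the one-sample t statistic is just the sample mean divided by its standard error. A small sketch (assuming libm from above):

% Manual computation of the one-sample t statistic and its two-sided p-value
> tstat = mean(libm)/(sd(libm)/sqrt(length(libm)))
> pval  = 2*pt(-abs(tstat), df=length(libm)-1)
> c(t=tstat, p.value=pval)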
Labels: Basic Statistic, finance, Financial Time Series Analysis, R, T Test