Monday, February 24, 2014

Linear Algebra and Modern (Markowitz) Portfolio Theory within R

In this post, I demonstrate basic linear algebra, matrix factorization (LU and QR) for solving linear equations and regression, the Lagrange multiplier method, and basic numerical methods (bisection and Newton-Raphson) in the context of modern (Markowitz, efficient frontier) portfolio theory, to estimate optimum portfolio weights for given constraints within R. First, let's do some basic linear equation exercises in R.

Load the matrixcalc library:

library("matrixcalc")

Let's assume we have a linear equation Ax=b, where A is a square matrix, x is the unknown parameter column vector, and b is also a column vector.

A = matrix(c(2, 2, 4, 9), 2, 2, byrow = T)
A
##      [,1] [,2]
## [1,]    2    2
## [2,]    4    9
b = c(8, 21)

To solve this equation, the solve command can be used:

solve(A, b)
## [1] 3 1

We can also use various factorization methods to solve this linear equation.

LU Factorization

LU factorization produces lower and upper triangular elimination matrices, which are then used to solve linear equations:

  1. The upper triangular U has the pivots on its diagonal
  2. The lower triangular L has ones on its diagonal
  3. The lower triangular L has the multipliers \( l_{ij} \) below the diagonal

lu = lu.decomposition(A)
lu
## $L
##      [,1] [,2]
## [1,]    1    0
## [2,]    2    1
## 
## $U
##      [,1] [,2]
## [1,]    2    2
## [2,]    0    5

To solve for x via L (lower triangular) and U (upper triangular), we take two steps.

The first step is to find a column vector c which solves the equation Lc=b:

c = solve(lu$L, b)
c
## [1] 8 5

In the last step, the equation Ux=c is solved, which yields the unknown vector x:

x = solve(lu$U, c)
x
## [1] 3 1
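
As a quick sanity check (my addition, not in the original post), multiplying the factors back together reproduces A:

lu$L %*% lu$U
##      [,1] [,2]
## [1,]    2    2
## [2,]    4    9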

QR Factorization

QR factorization decomposes a matrix A into two matrices Q and R, where A=QR, Q is an orthogonal matrix and R is an upper triangular matrix.


aqr = qr(A)
# orthogonal matrix
aQ = qr.Q(aqr)
aQ
##         [,1]    [,2]
## [1,] -0.4472 -0.8944
## [2,] -0.8944  0.4472
# upper triangular matrix
aR = qr.R(aqr)
aR
##        [,1]   [,2]
## [1,] -4.472 -8.944
## [2,]  0.000  2.236

Since an orthogonal matrix has the property \( QQ^T=I \), the equation Ax=b can be rewritten as \( Rx=Q^Tb \): first A is decomposed into QR, and then both sides are multiplied by \( Q^T \).

aQT = t(aQ)
qb = aQT %*% b
x = backsolve(aR, qb)
x
##      [,1]
## [1,]    3
## [2,]    1

# all of the above steps can be done with a single command too
x = qr.solve(A, b)
x
## [1] 3 1

QR factorization for Linear regression

QR factorization is also useful for fitting a linear regression (\( y=\beta x+\alpha \)). For example, let's try to estimate the coefficients (\( \beta \) and \( \alpha \)) of a linear regression between Apple's and Google's monthly returns.

library(quantmod)

# download prices of Google and Apple
getSymbols(c("GOOG", "AAPL"))
## [1] "GOOG" "AAPL"

google = c(coredata(monthlyReturn(GOOG)))
apple = c(coredata(monthlyReturn(AAPL)))
# let's plot the returns
plot(google, apple)

(Figure: scatter plot of Google vs. Apple monthly returns)

Let's now apply QR factorization to fit the linear regression (apple ~ google):

x = cbind(1, google)
xqr = qr(x)

xQ = qr.Q(xqr, complete = T)
xR = qr.R(xqr, complete = T)
# compute u = Q^T y
u = t(xQ) %*% apple
# and finally solve for the intercept and slope (alpha and beta)
backsolve(xR[1:2, 1:2], u[1:2])
## [1] 0.01636 0.66522

# let's verify the result with the lm command
lm(apple ~ google)
## 
## Call:
## lm(formula = apple ~ google)
## 
## Coefficients:
## (Intercept)       google  
##      0.0164       0.6652

Linear Algebra in Portfolio Theory

Linear equations are common in modern (Markowitz, efficient frontier) portfolio theory in finance, where optimal weights are estimated subject to constraints such as maximizing return or minimizing variance. For example:

\[ \text{maximize: } \mu^{T}w \] \[ \text{subject to: } e^{T}w=1, \quad w^T\Sigma w=\sigma^2 \]

where we have the following return vector, target risk and covariance matrix:

mu = c(0.08, 0.1, 0.13, 0.15, 0.2)
mu
## [1] 0.08 0.10 0.13 0.15 0.20
s = 0.25^2  # target risk 
s
## [1] 0.0625
Sigma = matrix(c(0.0196, -0.00756, 0.01288, 0.00875, -0.0098, -0.00756, 0.0324, 
    -0.00414, -0.009, 0.00945, 0.01288, -0.00414, 0.0529, 0.020125, 0.020125, 
    0.00875, -0.009, 0.020125, 0.0625, -0.013125, -0.0098, 0.00945, 0.020125, 
    -0.013125, 0.1225), 5, 5, byrow = T)
Sigma
##          [,1]     [,2]     [,3]     [,4]     [,5]
## [1,]  0.01960 -0.00756  0.01288  0.00875 -0.00980
## [2,] -0.00756  0.03240 -0.00414 -0.00900  0.00945
## [3,]  0.01288 -0.00414  0.05290  0.02013  0.02013
## [4,]  0.00875 -0.00900  0.02013  0.06250 -0.01312
## [5,] -0.00980  0.00945  0.02013 -0.01312  0.12250

To find the optimum weights, the above constrained problem needs to be solved. We will use the Lagrange multiplier method for this purpose. The Lagrangian is:

\[ F(w,\lambda)= \mu^{T}w + \lambda_1(e^{T}w-1)+\lambda_2(w^T\Sigma w-\sigma^2) \]

To maximize return, we set the first partial derivatives to zero, \( F'(w, \lambda)=0 \):

\[ G(w, \lambda) = F'(w, \lambda) = \left[ \begin{array}{c} \mu+\lambda_1 e + 2\lambda_2 \Sigma w \\ e^{T}w-1 \\ w^T\Sigma w-\sigma^2 \end{array} \right] = 0 \]

This system is nonlinear (quadratic in w), so we can deploy numerical root-finding methods such as the bisection or Newton-Raphson method. Let me briefly explain these methods.

Bisection Method

If a function f(x) is continuous on [a,b] and f(a) and f(b) have different signs, then a root of f(x)=0 can be found using the following algorithm:

  1. Increase count = count + 1
  2. Compute f(c), where c = (a + b)/2 is the midpoint of [a, b]
  3. If sign(f(c)) == sign(f(a)), let a = c; otherwise b = c
  4. If |b - a| > tolerance and count < totalIteration, then go to step 1

R implementation would be:

bisection <- function(f, a, b, tol = 0.001, iteration = 100) {
    count = 0
    while (b - a > tol & count < iteration) {
        c <- (a + b)/2  # midpoint of the current interval
        # keep the half of the interval where the sign change occurs
        if (sign(f(c)) == sign(f(a))) 
            a <- c else b <- c
        count = count + 1
    }
    (a + b)/2  # midpoint of the final interval
}

This is analogous to the binary search algorithm in computer science.
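
As a quick usage example (my addition, not from the original post), let's find the root of \( f(x)=x^2-2 \) on [0, 2]; it should converge to \( \sqrt{2} \approx 1.414214 \) within the default tolerance:

f <- function(x) x^2 - 2
bisection(f, 0, 2)
## [1] 1.414551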

Newton-Raphson method

The Newton-Raphson method uses the derivative of a continuous function f(x) to find a root of f(x)=0 through a recursive update, starting from a given point:

\[ x_{k+1}=x_k- \frac{f(x_k)}{f'(x_k)} \]

Compared to bisection, the Newton-Raphson method converges faster, if it converges at all.
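
A minimal one-dimensional sketch of this iteration in R (my own illustration; the function and argument names are arbitrary), applied to the same \( f(x)=x^2-2 \) example:

newton <- function(f, df, x0, tol = 1e-08, iteration = 100) {
    x <- x0
    for (i in 1:iteration) {
        x1 <- x - f(x)/df(x)  # Newton-Raphson update
        if (abs(x1 - x) < tol) 
            break  # stop once the update is below tolerance
        x <- x1
    }
    x1
}
newton(function(x) x^2 - 2, function(x) 2 * x, 1)
## [1] 1.414214

Note that, unlike bisection, it needs the derivative and a sensible starting point.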

Lagrange's Method & Newton-Raphson Method

Armed with the Newton-Raphson method, we can now solve \( G(w, \lambda) = F'(w, \lambda) = 0 \). The algorithm to solve it:

  1. Compute G(x) and G'(x)
  2. Pick a starting point $x_0$
  3. Solve the linear system: $G'(x_k)u = G(x_k)$
  4. Update $x_{k+1} = x_k - u$
  5. Repeat steps 3 and 4 a fixed number of times, or until $x_{k+1} - x_k$ is smaller than a predefined tolerance threshold

First, we need the partial derivatives (the Jacobian) of G, which will be used in the Newton-Raphson iteration:

\[ G'(w, \lambda) = \left[ \begin{array}{ccc} 2\lambda_2 \Sigma & e & 2 \Sigma w\\ e^{T} & 0 & 0 \\ 2 w^T \Sigma & 0 & 0 \end{array} \right] \]

The function to compute $ G(w, \lambda) $:

G <- function(x, mu, Sigma, sigmaP2) {
    n <- length(mu)
    # x[1:n] holds the weights w; x[n+1] and x[n+2] hold lambda_1 and lambda_2
    c(mu + rep(x[n + 1], n) + 2 * x[n + 2] * (Sigma %*% x[1:n]),  # dF/dw
      sum(x[1:n]) - 1,                                            # budget constraint
      t(x[1:n]) %*% Sigma %*% x[1:n] - sigmaP2)                   # risk constraint
}

The derivative (Jacobian) of G:

DG <- function(x, mu, Sigma, sigmaP2) {
    n <- length(mu)
    grad <- matrix(0, n + 2, n + 2)
    grad[1:n, 1:n] <- 2 * x[n + 2] * Sigma       # derivative of dF/dw w.r.t. w
    grad[1:n, n + 1] <- 1                        # derivative w.r.t. lambda_1
    grad[1:n, n + 2] <- 2 * (Sigma %*% x[1:n])   # derivative w.r.t. lambda_2
    grad[n + 1, 1:n] <- 1                        # gradient of budget constraint
    grad[n + 2, 1:n] <- 2 * t(x[1:n]) %*% Sigma  # gradient of risk constraint
    grad
}

Initial weights and $ \lambda $ values:

x = c(rep(0.5, 5), 1, 1)
x
## [1] 0.5 0.5 0.5 0.5 0.5 1.0 1.0

Let's apply the Newton-Raphson iteration now:

for (i in 1:100) {
    x <- x - solve(DG(x, mu, Sigma, s), G(x, mu, Sigma, s))
}

and the numerical solution is:

x
## [1] -0.39550  0.09606  0.04584  0.70988  0.54372 -0.09201 -0.85715

We need to verify that the upper-left 5x5 weight block of $ G'(x) $ is negative definite in order for the critical point to be a maximum. Let's first see what $ G'(x) $ looks like:

DG(x, mu, Sigma, s)[1:5, 1:5]
##          [,1]      [,2]      [,3]     [,4]    [,5]
## [1,] -0.03360  0.012960 -0.022080 -0.01500  0.0168
## [2,]  0.01296 -0.055543  0.007097  0.01543 -0.0162
## [3,] -0.02208  0.007097 -0.090686 -0.03450 -0.0345
## [4,] -0.01500  0.015429 -0.034500 -0.10714  0.0225
## [5,]  0.01680 -0.016200 -0.034500  0.02250 -0.2100

Since $ G'(x) $ is symmetric and square, all eigenvalues of $ G'(x) $ being negative confirms that $ G'(x) $ is negative definite:

eigen(DG(x, mu, Sigma, s)[1:5, 1:5])$values
## [1] -0.02024 -0.05059 -0.05806 -0.14445 -0.22364

Since all eigenvalues are negative, we can conclude that x[1:5] holds the optimum weights for the constraints, and the maximum return is:

t(x[1:5]) %*% mu
##        [,1]
## [1,] 0.1992
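
As a quick check (my addition, not in the original post), the solution should also satisfy both constraints: the weights sum to one and the portfolio variance hits the target \( \sigma^2 = 0.0625 \):

sum(x[1:5])
## [1] 1
t(x[1:5]) %*% Sigma %*% x[1:5]
##        [,1]
## [1,] 0.0625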

Sunday, February 02, 2014

Conjugate Prior, Poisson, Exponential and Gamma Distributions for Website statistics within R

The Poisson, Exponential and Gamma distributions are ideal for modeling, respectively, the number of visitors, the arrival time between visitors, and the waiting time for a given number of visitors in website statistics. Say a website attracts 1000 visitors in a day; we can use the Poisson distribution to find the probability of k visitors in a day. The PMF of the Poisson distribution is:
$$ pmf(k)= \frac{\lambda^k}{k!} e^{-\lambda} $$
Let's use R to estimate some probabilities with the Poisson distribution:

# let's assume we have 1000 visitors per day
> rate=1000

# What is the prob of having exactly 700 visitors in a day
> dpois(700,lambda=rate)
[1] 2.095737e-24

# What is the prob of having no visitors at all in a day
> dpois(0,lambda=rate)
[1] 0

# What is the prob of having less than 700 visitors in a day
> ppois(700,lambda=rate,lower=TRUE)
[1] 6.93301e-24

# What is the prob of having more than 1100 visitors in a day
> ppois(1100,lambda=rate,lower=FALSE) # we are interested in upper tail now, 1100 > rate
[1] 0.000867641

# Maximum arrivals with at least 95% confidence
> qpois(0.95,lambda=rate)
[1] 1052


On the other hand, the Exponential distribution is used to estimate probabilities of inter-arrival (hit) times for a given Poisson process. The probability density function (pdf) of an exponential distribution is
$$ f(x;\lambda) = \begin{cases} \lambda e^{-\lambda x} & x \ge 0, \\ 0 & x < 0. \end{cases} $$
The cumulative distribution function is given by
$$ F(x;\lambda) = \begin{cases} 1-e^{-\lambda x} & x \ge 0, \\ 0 & x < 0. \end{cases} $$
Let's use R to estimate some probabilities with the Exponential distribution:

# let's assume we have 1000 visitors per day; the average time between visitors is then 1.44 minutes:
> AvgVisitorNumberDay=1000
> avgMin=24*60/AvgVisitorNumberDay
> avgMin
[1] 1.44  # 1.44 minutes per visitor
# then the rate (lambda) for the Exponential distribution is (lambda=1/E(X))
> rate=1/avgMin
> rate
[1] 0.6944444

# let's find the probability that the first hit happens within 45 sec (P(X < 45/60) = 1 - e^(-rate*45/60))
> pexp(45/60,rate)
[1] 0.4059747

# how long do we have to wait, at most, to observe the first hit with 95% confidence
> qexp(0.95,rate)
[1] 4.313854 # about 259 seconds



Another useful distribution is the Gamma($ \alpha, \beta $) distribution, which can be used to model the time required for $ \alpha $ events to happen in a Poisson process with mean inter-arrival time $ \beta $. In other words, it is the waiting time until a certain number of events happen. For example, Gamma(shape=5, rate=1/3) is the distribution of the length of time (in minutes) you'd expect to wait for 5 visitor hits, given that on average one visitor arrives every 3 minutes. Another example: an insurance company observes that large commercial fire claims occur randomly in time, with a mean of 0.7 years between claims. For its financial planning it would like to estimate how long it will be before it pays out the 5th such claim; that time is given by Gamma(shape=5, scale=0.7). Let's have a look at some examples in R.
> rate  # number of hits per minute, estimated above
[1] 0.6944444 

# density of waiting exactly 1 minute for 5 visitors, given the rate
> dgamma(1,rate=rate,shape=5)

# What is the probability that the site admin has to wait between 2 and 4 minutes before 5 visitors arrive at the website?
> pgamma(4,rate=rate,shape=5)-pgamma(2,rate=rate,shape=5)
[1] 0.1350604


As seen above, it is trivial to estimate probabilities if the parameters (rates) are known. If the true parameters are not known, a conjugate prior distribution can be used to estimate them. For example, to estimate $ \lambda $ of the Poisson distribution, we can use the Gamma($ \alpha, \beta $) distribution as the conjugate prior.

Let's assume we want to estimate $ \lambda $ of the Poisson distribution on the fly from real-time hit data, using the number of visits in each fixed time interval. These are the steps (a short R sketch follows the list):

- Prior hyperparameter initialization: Decide on $ \alpha $ and $ \beta $ of Gamma($ \alpha, \beta $). A simple choice for these parameters is the mean and standard deviation of historical arrivals: set $ \alpha' = \mu $ and $ \beta' = \sigma $
- Posterior hyperparameter update: In each fixed time period $ t $, count the number of web page hits $ x_t $ and update the hyperparameters: $ \alpha' = \alpha' + x_t, \beta' = \beta' + 1 $
- Posterior predictive: $\lambda$ is the mean of $\operatorname{NB}(\tilde{x}|\alpha',\frac{1}{1+\beta'})$, where NB is the Negative binomial distribution. Or more simply:
$$ \lambda= \frac{\alpha'}{\beta'}$$
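
A minimal sketch of this online update in R (my own illustration; the prior hyperparameters and the observed counts below are made-up values, not from the original post):

# prior hyperparameters from historical data (assumed, per the steps above)
alphaP = 1000  # alpha' = mean of historical arrivals per period
betaP = 30     # beta'  = standard deviation of historical arrivals

# hits observed in each fixed time period (hypothetical stream)
hits = c(950, 1020, 980)
for (x_t in hits) {
    alphaP = alphaP + x_t  # alpha' = alpha' + x_t
    betaP = betaP + 1      # beta'  = beta' + 1
}

# posterior estimate of the Poisson rate
lambda = alphaP/betaP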


Similarly, we can estimate $ \lambda $ of the Exponential distribution on the fly from real-time hit data, using the time interval between successive visits. These are the steps (again, a sketch follows the list):

- Prior hyperparameter initialization: Decide on $ \alpha $ and $ \beta $ of Gamma($ \alpha, \beta $), again using the mean and standard deviation of historical arrivals: set $ \alpha' = \mu $ and $ \beta' = \sigma $
- Posterior hyperparameter update: With each page visit, measure the time interval ($ \delta $) since the previous visit and update the hyperparameters: $ \alpha' = \alpha' + 1, \beta' = \beta' + \delta $
- Posterior predictive: the predictive waiting time follows $\operatorname{Lomax}(\tilde{x}|\beta',\alpha')$, where Lomax is the Lomax distribution; its mean $\frac{\beta'}{\alpha'-1}$ estimates the expected waiting time $1/\lambda$. The posterior mean of the rate itself is simply:
$$ \lambda= \frac{\alpha'}{\beta'}$$
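
And a matching sketch for the Exponential rate (again my own illustration; the prior values and inter-arrival times are made up):

# prior hyperparameters (assumed)
alphaP = 2
betaP = 3

# time intervals (in minutes) between successive visits (hypothetical stream)
deltas = c(1.2, 0.8, 2.5)
for (delta in deltas) {
    alphaP = alphaP + 1    # alpha' = alpha' + 1
    betaP = betaP + delta  # beta'  = beta' + delta
}

# posterior mean of the Exponential rate
lambda = alphaP/betaP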