Sunday, February 02, 2014

Conjugate Prior, Poisson, Exponential and Gamma Distributions for Website statistics within R

Poisson , Exponential and Gamma distributions are ideal distributions to model number of visitors, arrival time between each visitors and number of visitors between an interval for Website statistics. Let say, if a website attracts 1000 visitors in a day, we can use Poisson distributions to find the probability of k number of visitors in a day. PMF of Poisson is distribution:
$$ pmf(k)= \frac{\lambda^k}{k!} e^{-\lambda} $$
Lets use R to estimate some probabilities with poisson distribution:

# lets assume we have 1000 visitors per day 
>rate=1000

# What is the prob of having exactly 700 visitors in a day
> dpois(700,lambda=rate)
[1] 2.095737e-24

# What is the prob of having no visitors at all in a day
> dpois(0,lambda=rate)
[1] 0

# What is the prob of having less than 700 visitors in a day
> ppois(700,lambda=rate,lower=TRUE)
[1] 6.93301e-24

# What is the prob of having more than 1100 visitors in a day
> ppois(1100,lambda=rate,lower=FALSE) # we are interested in upper tail now, 1100 > rate
[1] 0.000867641

# Maximum arrivals with at least 95% confidence
> qpois(0.95,lambda=rate)
[1] 1052


On the other hand, Exponential distribution is used to estimate probabilities of hit times for a given Poisson distribution. The probability density function (pdf) of an exponential distribution is
$$ f(x;\lambda) = \begin{cases} \lambda e^{-\lambda x} & x \ge 0, \\ 0 & x < 0. \end{cases} $$
The cumulative distribution function is given by
$$ F(x;\lambda) = \begin{cases} 1-e^{-\lambda x} & x \ge 0, \\ 0 & x < 0. \end{cases} $$
Lets use R to estimate some probabilities with Exponential distribution:
# lets assume we have 1000 visitors per day, then average number of visitors per minute time is 1.44:
>AvgVisitorNumberDay=1000
>avgMin=24*60/AvgVisitorNumberDay
> avgMin
[1] 1.44  # 1.44 visitor per minute
# then rate (lambda) for Exponential distribution is (lambda=1/E(X))
> rate=1/avgMin
> rate
[1] 0.6944444 

# lets find the probability that first hit happens at most 45 sec? (P(<45/60) = 1- e^(-rate*60/45))
> pexp(45/60,rate)
[1] 0.4059747

# How long do we have to wait maximum to observe first hit with 95% confidence
> qexp(0.95,rate)
[1] 4.313854 # 258.6 seconds



Another useful distribution is Gamma ($ \alpha, \beta $) distribution which can be used to model the time required for $ \alpha $ events to happen given a Poisson process with mean time $ \beta $. In another words, it is waiting times until a certain number of events happen. For example, Gamma(shape=5, rate=1/3) is the distribution of the length of time (in min) you’d expect to have to wait for 5 visitor hit, given that in average 3 visitors arrive per min. Another example is an insurance company observes that large commercial fire claims occur randomly in time with a mean of 0.7 years between claims. For its financial planning it would like to estimate how long it will be before it pays out the 5th such claim, The time is given by Gamma(5,0.7). Lets have a look at some examples in R.
> rate  # number of hits in a minute , estimated above
[1] 0.6944444 

# pdf of 1 events given rate and have to wait 5 
dgamma(1,rate=rate,shape=5)

# What is the probability that site admin have to wait between 2 to 4 minutes before 5 visitors arrive to website.  
> pgamma(4,rate=rate,shape=5)-pgamma(2,rate=rate,shape=5)
[1] 0.1350604


As you see above, it is trivial to estimate probabilities if parameters (rate) are known. If true parameters are not known, conjugate distribution can be used to estimate parameters. For example, to estimate $ \lambda $ in Poisson distribution, we can use Gamma ($ \alpha, \beta $) distribution as conjugate prior.

Lets assume, we want to estimate $ \lambda $ of Poisson distribution with real-time hit data, with number of visits in fixed time interval on the fly. These are the steps:

- Prior, hyperparameters initialize: Decide on $ \alpha $ and $ \beta $ of Gamma ($ \alpha, \beta $). Best estimator for these parameters are mean and standard deviation of historical arrivals. Set $ \alpha' = \mu $ and $ \beta' = \sigma $
- Posterior hyperparameters update : In each fixed time period $ t $, find number of web page hits $ x_t $ and then update hyperparameters such as $ \alpha' = \alpha' + x_t, \beta' = \beta'+1 $
- Posterior predictive: $\lambda$ is mean of $\operatorname{NB}(\tilde{x}|\alpha',\frac{1}{1+\beta'})$ where NM is Negative binomial distribution. Or more simply:
$$ \lambda= \frac{\alpha'}{\beta'}$$


Similarly we can estimate $ \lambda $ of Exponential distribution with real-time hit data by using number of visits and interval between each visit on the fly. These are the steps:

- Prior, hyperparameters initialize: Decide on $ \alpha $ and $ \beta $ of Gamma ($ \alpha, \beta $). Best estimator for these parameters are mean and standard deviation of historical arrivals. Set $ \alpha' = \mu $ and $ \beta' = \sigma $
- Posterior hyperparameters update : With each page visit and estimate time interval ($ \delta$), and then update parameters such as $ \alpha' = \alpha' + 1, \beta' = \beta'+ \delta $
- Posterior predictive: $\lambda$ is mean of $\operatorname{Lomax}(\tilde{x}|\beta',\alpha')$ where Lomax is Lomax distribution. Or more simply:
$$ \lambda= \frac{\beta'}{\alpha -1'}$$

No comments: