Hidden Markov model

Difference between a mixture model and an HMM

  • If we examine a single time slice of the model, it can be seen as a mixture distribution with component densities given by p(X|Z)
  • It can be interpreted as an extension of a mixture model in which the choice of mixture component for each observation is not independent, but depends on the component chosen for the previous observation through p(Z_n|Z_{n-1}) (see the sketch below)
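A minimal sampling sketch of this difference, assuming hypothetical two-component Gaussian parameters: the mixture model draws each component independently, while the HMM draws Z_n conditioned on Z_{n-1}.

```python
import numpy as np

rng = np.random.default_rng(0)

pi = np.array([0.5, 0.5])      # initial / mixing weights
A = np.array([[0.9, 0.1],      # A[j, k] = p(Z_n = k | Z_{n-1} = j)
              [0.2, 0.8]])
means = np.array([-2.0, 2.0])  # component densities p(X | Z = k) = N(means[k], 1)

def sample_mixture(N):
    # Mixture model: component choices are independent across observations.
    z = rng.choice(2, size=N, p=pi)
    return rng.normal(means[z], 1.0)

def sample_hmm(N):
    # HMM: the component at step n depends on the component at step n-1.
    z = np.empty(N, dtype=int)
    z[0] = rng.choice(2, p=pi)
    for n in range(1, N):
        z[n] = rng.choice(2, p=A[z[n - 1]])
    return rng.normal(means[z], 1.0)
```

With A close to the identity, the HMM produces long runs of observations from the same component, a pattern an ordinary mixture model cannot express.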

Applications

  • Speech recognition
  • Natural language modeling
  • On-line handwriting recognition
  • Analysis of biological sequences such as protein and DNA

Transition probability

  • Latent variables: discrete multinomial variables Z_n that describe which component of the mixture is responsible for generating the corresponding observation X_n
  • The probability distribution of Z_n depends on the previous latent variable Z_{n-1} through conditional distribution p(Z_n|Z_{n-1})
  • Conditional distribution (checked numerically in the sketch after this list)

p(Z_n|Z_{n-1}, A) = \displaystyle \prod_{k=1}^{K}\prod_{j=1}^{K} A_{jk}^{z_{n-1,j}z_{nk}}

  • Initial latent node Z_1 does not have a parent node, so it has a marginal distribution

p(Z_1|\pi) = \displaystyle \prod_{k=1}^{K} \pi_{k}^{z_{1k}}

  • Lattice or trellis diagram
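The double product above looks opaque, but because Z_{n-1} and Z_n are one-hot vectors, every exponent is 0 except for the single active (j, k) pair. A small numerical check, assuming a hypothetical 3-state transition matrix:

```python
import numpy as np

A = np.array([[0.7, 0.2, 0.1],   # A[j, k] = p(Z_n = k | Z_{n-1} = j); rows sum to 1
              [0.3, 0.4, 0.3],
              [0.1, 0.1, 0.8]])

z_prev = np.array([0, 1, 0])     # one-hot: previous state j = 1
z_curr = np.array([0, 0, 1])     # one-hot: current state k = 2

# p(Z_n | Z_{n-1}, A) = prod_j prod_k A[j, k] ** (z_{n-1,j} * z_{nk})
p = np.prod(A ** np.outer(z_prev, z_curr))
assert np.isclose(p, A[1, 2])    # the product collapses to the single entry A[j, k]
```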

Emission probability

Examples (the general form of the emission distribution is given after the list):

  • Mixture of three Gaussians / the two-dice (fair vs. loaded) problem
  • Handwriting
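In the same one-hot notation, the emission probabilities factorize over the K components, where \phi_k denotes the parameters (e.g. the mean and covariance of a Gaussian) governing component k:

p(X_n|Z_n, \phi) = \displaystyle \prod_{k=1}^{K} p(X_n|\phi_k)^{z_{nk}}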

What is a Markov model?

Markov Models

  • A way to exploit the sequential structure of the data (e.g. correlations between observations that are close in the sequence).
  • For example, whether or not it rains on each day. If we treat the data as i.i.d., the only information we glean from them is the relative frequency of rainy days, ignoring weather trends that last for a few days. Knowing whether or not it rains today therefore helps to predict tomorrow’s weather.
  • A Markov Model is one of the simplest ways to relax such i.i.d. assumptions and express such effects in a probabilistic model.

Joint distribution for Markov chain 

  • General expression of the joint distribution for a sequence of observations

p(X_1, ..., X_N) = p(X_1)\displaystyle\prod_{n=2}^{N} p(X_n|X_1, ..., X_{n-1})

  • First-order Markov chain

If we assume that each of the conditional distributions on the right-hand side is independent of all previous observations except the most recent one, we obtain a first-order Markov chain.

p(X_1, ..., X_N) = p(X_1)\displaystyle\prod_{n=2}^{N} p(X_n|X_{n-1})

In other words, the conditional distribution for observation X_n is given by

p(X_n|X_1, ..., X_{n-1}) = p(X_n|X_{n-1})
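A minimal sketch of this factorization, assuming a hypothetical two-state rainy/dry weather chain:

```python
import numpy as np

p0 = np.array([0.6, 0.4])   # p(X_1): [dry, rainy]
T = np.array([[0.8, 0.2],   # T[i, j] = p(X_n = j | X_{n-1} = i)
              [0.4, 0.6]])

def joint_prob(seq):
    # p(X_1, ..., X_N) = p(X_1) * prod_{n=2}^{N} p(X_n | X_{n-1})
    p = p0[seq[0]]
    for prev, curr in zip(seq[:-1], seq[1:]):
        p *= T[prev, curr]
    return p

print(joint_prob([0, 0, 1, 1]))  # dry, dry, rainy, rainy -> 0.6*0.8*0.2*0.6 = 0.0576
```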

  • Second-order Markov chain

Likewise, for a second-order Markov chain [Figure],

p(X_n|X_1, ..., X_{n-1}) = p(X_n|X_{n-1}, X_{n-2})

which means each observation is influenced only by the two previous observations and is independent of all earlier observations.
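The first-order sketch above extends to second order by indexing the transition table with the two previous observations. A minimal sketch for a hypothetical binary chain (p2 handles the second observation, which has only one predecessor):

```python
import numpy as np

p1 = np.array([0.5, 0.5])                  # p(X_1)
p2 = np.array([[0.7, 0.3], [0.4, 0.6]])    # p(X_2 | X_1)
T2 = np.array([[[0.9, 0.1], [0.5, 0.5]],   # T2[i, j, k] = p(X_n = k | X_{n-2} = i, X_{n-1} = j)
               [[0.6, 0.4], [0.2, 0.8]]])

def joint_prob_2nd(seq):
    # p(X_1, ..., X_N) = p(X_1) p(X_2|X_1) * prod_{n=3}^{N} p(X_n | X_{n-1}, X_{n-2})
    p = p1[seq[0]] * p2[seq[0], seq[1]]
    for n in range(2, len(seq)):
        p *= T2[seq[n - 2], seq[n - 1], seq[n]]
    return p
```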

Sequential data

What is sequential data?

  • Data for which the i.i.d. assumption is poor or does not hold at all
  • Often found in time series data. For instance,
    • Rainfall measurements on successive days at a particular location
    • Daily values of a currency exchange rate
    • Acoustic features at successive time frames used for speech recognition

Stationary vs. nonstationary sequential distributions

  • Stationary: data evolves in time but the distribution from which it is generated remains the same
  • Nonstationary: the generative distribution itself evolves with time (see the sketch below)
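A minimal illustration of the distinction, using hypothetical Gaussian data: the stationary series is drawn from a fixed distribution throughout, while the nonstationary one has a mean that drifts with time.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200

stationary = rng.normal(loc=0.0, scale=1.0, size=N)  # fixed generative distribution
drift = 0.01 * np.arange(N)                          # distribution evolves with time
nonstationary = rng.normal(loc=drift, scale=1.0)
```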

Applications of Markov Models

  • Financial forecasting: predict the next value in a time series given observations of the previous values
  • Speech recognition: predict the next speech spectrum given observations of the previous speech spectrum values
  • Note: Markov models are useful in applications where more recent observations are more informative than less recent observations

Latent variable

  • Hidden Markov model: where the latent variables are discrete
  • Linear dynamical systems: where latent variables are Gaussian

 

Hypothesis testing steps

Question: What is the mean age of our population?

  1. Data
    • Take a random sample
    • Sample mean = 27
    • Sample std = 3
    • n = 10
  2. Assumption
    • The population follows normal distribution
  3. Hypothesis
    • H0: Mean = 30
    • Ha: Mean != 30
  4. Pick test statistics
    • Z = (SampleMean – M0)/SE(SampleMean) = (SampleMean – M0)/(sigma/sqrt(n))
    • Since we don’t know sigma, the sample std is used in its place –> the statistic is no longer Z but t
  5. Distribution
    • t-distribution
  6. Decision rule
    • Fix the type I error rate (the probability of incorrectly rejecting a true null hypothesis) at alpha = 5%
    • ==> table lookup of the t-distribution critical value (e.g. 2.262 for a two-sided test at the 5% level with n - 1 = 9 degrees of freedom)
  7. Calculate test statistic
    • t = (SampleMean – M0)/(s/sqrt(n))
    • t = (27 - 30)/(3/sqrt(10)) = -3.16
  8. Statistical decision
    1. -3.15 < -2.262 ==> reject the null
    2. confidence interval
  9. Conclusion
    • At the 5% significance level, the data suggest that the mean age of the population differs from 30
  10. P-value
    • Two-sided p ≈ 0.011 < 0.05 (computed in the sketch below)
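A sketch reproducing steps 6-10 with scipy, using the summary statistics given above (a two-sided test is assumed):

```python
import numpy as np
from scipy import stats

x_bar, mu0, s, n = 27.0, 30.0, 3.0, 10

t_stat = (x_bar - mu0) / (s / np.sqrt(n))        # step 7: -3.162...
t_crit = stats.t.ppf(1 - 0.05 / 2, df=n - 1)     # step 6: 2.262 for alpha = 5%, df = 9
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 1)  # step 10: two-sided p-value ~ 0.011

# step 8: |t| > t_crit, equivalently p < 0.05, so reject H0
print(t_stat, t_crit, p_value)
```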

Hypothesis testing

What is p-value?

  • The p-value is the probability of getting the observed value of the test statistic, or a value with even greater evidence against H0, if the null hypothesis is actually true.
  • = A measure of the strength of the evidence against the null hypothesis
  • The smaller the p-value, the greater the evidence against the null hypothesis

 

[Figure: p-value in statistical significance testing]

Rough guideline:

  • p < 0.01: Very strong evidence against H0
  • 0.01 < p < 0.05: Strong evidence against H0
  • 0.05 < p < 0.10: Some weak evidence against H0
  • p > 0.10: Little or no evidence against H0