
Outer State Space Race: Markov chains and matrices

I had heard of Markov chains before, but because my perspective had always been tightly focused on language, I never felt they were really useful.

To review, a Markov chain basically describes a scenario or system where one event leads to other events (and so on, perhaps in a cycle) with varying probabilities, but the probability of any one event leading to the next depends only on the current event, not on any of the events that came before.

In the HIV/AIDS scenario model that originally got me interested in learning about Markov chains, the chances of your next “step”—future health outcome—depend only on your present status.

This is called the Markov property. It’s like the opposite of the Monty Hall problem I wrote about previously. There, the probability of there being a car behind Door 2 changes after Monty opens Door 3. In a Markov chain, by contrast, the transition probabilities never change based on what happened earlier.
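In formal notation, the Markov property says that the distribution of the next state depends only on the current state:

$$ P(X_{n+1} = j \mid X_n = i, X_{n-1} = i_{n-1}, \ldots, X_0 = i_0) = P(X_{n+1} = j \mid X_n = i) $$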

Markov chains, as far as I know, have limited usefulness in natural language processing, because sequences of language data—letters, words, sentences, paragraphs—depend not only on the previous item but, to some extent, on all previous items.

You can see the implications of this in Markov text generators, based on everything from Shakespeare to Eminem, that struggle to produce anything meaningful. Or in part-of-speech tagging, where it is simply inaccurate to assume that the POS of a word depends only on the POS of the previous word. (A concrete sketch of such a generator follows below.)

(Even human brains falter in processing so-called garden path sentences, such as “The horse raced past the barn fell,” where you expect perhaps an adverb of some sort after the word “barn.”)
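To make the text-generator point concrete, here is a minimal word-level sketch of one (the tiny corpus and the function names are mine, just for illustration). Each word’s possible successors are collected from the training text, and generation is a random walk that looks only at the current word:

import random
from collections import defaultdict

def build_chain(words):
    # Map each word to the list of words observed immediately after it
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return chain

def generate(chain, start, length=10):
    # Random walk: each next word depends only on the current word
    out = [start]
    for _ in range(length - 1):
        followers = chain.get(out[-1])
        if not followers:  # dead end: this word never had a successor
            break
        out.append(random.choice(followers))
    return ' '.join(out)

words = 'to be or not to be that is the question'.split()
print(generate(build_chain(words), 'to'))  # e.g. 'to be that is the question'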

Matrix-izing a Markov chain

The simplest representation of a Markov chain is called a transition diagram, which basically looks like a flowchart (although it could be cyclical) with arrows going between different states and probability values attached to those arrows.

Now, what recently got me interested in Markov chains, as I wrote previously, is the fact that they can also be represented by matrices, and manipulated with the tools of linear algebra.

The following is a Markov chain as a matrix, also known as a transition matrix, since the values in the matrix represent the probabilities of transitioning from one state to another. In this case, suppose our system has five states.

(Annotating a matrix is beyond my \(\LaTeX\) skills—and perhaps the capabilities of the \(\KaTeX\) renderer—so I will use a code block.)

                  you want to go to state
                      A  B  C  D  E
                  A [ probabilities ] sum to 1 reading across
                  B [ probabilities ]          ""
 you are in state C [ probabilities ]          ""
                  D [ probabilities ]          ""
                  E [ probabilities ]          ""

The collection of possible states of a system is called its state space (what a cool term!) and notated as \(S = \{1,2,3,\cdots,N\}\) in the general case. For our system, it’s \(S = \{A, B, C, D, E\}\).

To give a more concrete example, for a text-generating Markov chain, the state space would include all the relevant units (letters of the alphabet plus spaces and punctuation for a character-by-character generator, words in the vocabulary for a word-by-word generator, etc.).

In formal notation, a possible transition matrix (a.k.a. Markov matrix, stochastic matrix) might look something like this:

$$ P = \begin{bmatrix} \frac{1}{4} & \frac{1}{2} & \frac{1}{4} \\[0.5em] \frac{1}{3} & 0 & \frac{2}{3} \\[0.5em] \frac{1}{2} & 0 & \frac{1}{2} \end{bmatrix} $$

Capital \(P\) represents the whole matrix, and each small \(p\) with subscripts (\(p_{ij}\), like \(p_{DE}\)) represents the probability of going from state \(i\) (some row) to state \(j\) (some column in that row). \(p_{ij}\) is also used generically to refer to any entry of the matrix.

Each row must add up to 1 because it includes every possible next step from that state.
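Checking the example matrix above row by row:

$$ \tfrac{1}{4}+\tfrac{1}{2}+\tfrac{1}{4} \;=\; \tfrac{1}{3}+0+\tfrac{2}{3} \;=\; \tfrac{1}{2}+0+\tfrac{1}{2} \;=\; 1 $$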

Also note that the probabilities of the state staying the same from one time step to the next run along the diagonal from top left to bottom right, which means that the identity matrix would represent a static system—one that never changes:

$$ P = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} $$

When \(\pi\) is not \(\pi\)

Now, a transition matrix by itself doesn’t really have any added value compared to the transition diagram it represents, other than looking nicely organized. So let’s make a simple matrix with some random values and do matrix-y stuff with it, like multiply it by something.

For the time being, I live in Southeast Asia and I eat mostly local food, which means that for any given meal, there’ll either be rice or noodles on my plate. (Not that I’m complaining—everything is ridiculously tasty!)

Consider a system \(X\) (that’s the conventional choice of letter) that describes my meals with a state space \(S = \{N, R\}\). The transition matrix \(P\) is then

$$ P = \begin{bmatrix} 0.4 & 0.6 \\ 0.2 & 0.8 \end{bmatrix} $$

Again, I can’t label the rows and columns of the matrix, so let’s stick to alphabetical order and make \(N\) first. Thus:

  • \(p_{NN}\), probability of repeating noodles: \(0.4\)
  • \(p_{NR}\), probability of rice after noodles: \(0.6\)
  • \(p_{RN}\), probability of noodles after rice: \(0.2\)
  • \(p_{RR}\), probability of repeating rice: \(0.8\)

The initial state can be represented by a row vector with \(|S|\) values (i.e., as many values as there are possible states), with a 1 in the spot corresponding to the current state and a 0 for all other states.

If my first meal was a noodle dish, then I can express that initial state like this:

$$ \pi^{(0)} = \begin{bmatrix}1 & 0\end{bmatrix} $$

The math notation for this is not so straightforward.

  1. Lowercase pi here represents a given state, not the constant pi that relates to circles! I don’t know who came up with that idea, but I don’t think it was a wise choice.

  2. The superscript number is also not what it seems: it’s not an exponent, but rather the number of time steps from now. So \(\pi^{(0)}\) means our state at 0 time steps from now, or in other words, now.

  3. Why is this vector of the less common horizontal variety (i.e., a row vector)? That’s a more complicated question (and the reason that I got stuck on the warmup problem in the Linear Algebra for Coders course).

This is the part where I turned to Python because I couldn’t get any farther with the math alone.

import numpy as np

meals = np.array([[0.4, 0.6],   # row N: p_NN, p_NR
                  [0.2, 0.8]])  # row R: p_RN, p_RR
now = np.array([[1, 0]])        # pi^(0): first meal was noodles

The next step (and the ones thereafter)

To find the probabilities for the next meal, we can multiply this vector and the transition matrix together. If you imagine matrix-vector multiplication as a vector undergoing a linear transformation, then this sort of makes sense.

There’s a catch, though. Transition matrices are set up to be read across, from left to right, with each row telling us the next-step probabilities if we are in that state:

                you want to go to state
                      N    R
 you are in state N [ 0.4  0.6 ]
                  R [ 0.2  0.8 ]

So if we already know that we will be in state \(N\) to begin with, we can essentially filter out the other rows, leaving a single row:

$$ \pi^{(1)} = \begin{bmatrix}p_{NN} & p_{NR}\end{bmatrix} = \begin{bmatrix}0.4 & 0.6\end{bmatrix}$$
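In NumPy, that filtering is just row indexing on the meals matrix defined earlier:

meals[0]  # array([0.4, 0.6]): the noodle row, i.e. the next-meal probabilities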

Now, if we suppose that our state vector is a column vector \(\vec v\),

$$ \vec v = \begin{bmatrix}1 \\ 0\end{bmatrix}$$

and multiply it by the matrix,

$$ \pi^{(1)} \stackrel{?}{=} P\vec v = \begin{bmatrix} 0.4 & 0.6 \\ 0.2 & 0.8 \end{bmatrix} \begin{bmatrix}1 \\ 0\end{bmatrix} = \begin{bmatrix} 1\cdot0.4 + \textcolor{lightgray}{0\cdot0.6} \\ 1\cdot0.2 + \textcolor{lightgray}{0\cdot0.8} \end{bmatrix} = \begin{bmatrix} 0.4 \\ 0.2 \end{bmatrix} $$

Hmm, that’s not quite right. The values don’t match what we know about our system, and we want a row, not a column.

Since the product of a square matrix and a vector has the same shape as the vector, \(\pi^{(0)}\) should be a row vector if we want \(\pi^{(1)}\) to be a row vector. We could transpose the matrix instead, but then its rows would no longer represent the probabilities of transitioning from a given state. Using a row vector preserves both the idea that rows describe origin states and the property that each row adds up to 1.

$$ \begin{aligned} \pi^{(1)} = \pi^{(0)}P &= \begin{bmatrix}1 & 0\end{bmatrix} \begin{bmatrix} 0.4 & 0.6 \\ 0.2 & 0.8 \end{bmatrix} \\ &= \begin{bmatrix} 1\cdot0.4 + \textcolor{lightgray}{0\cdot0.2} & 1\cdot0.6 + \textcolor{lightgray}{0\cdot0.8} \end{bmatrix} \\ &= \begin{bmatrix} 0.4 & 0.6 \end{bmatrix} \end{aligned} $$

Intuitively, you can imagine that you’re “feeding” the current state through the collection of probabilities and getting the “transformed” state out the other side.

In [11]: now @ meals
Out[11]: array([[0.4, 0.6]])

Also, since the probabilities at time \(t + 1\) only depend on the probabilities at time \(t\), we can do this recursively to find \(\pi^{(n)}\).

In [15]: p1 = now @ meals

In [16]: p1 @ meals
Out[16]: array([[0.28, 0.72]]) # this is pi^(2), or probabilities at time 2
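Equivalently, since every step is just one more multiplication by \(P\), \(n\) steps collapse into a single matrix power: \(\pi^{(n)} = \pi^{(0)}P^n\). NumPy can compute that directly, and the result matches (up to floating-point noise):

In [17]: now @ np.linalg.matrix_power(meals, 2)
Out[17]: array([[0.28, 0.72]])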

On a day when I have noodles for breakfast, then, I have a 40% chance of having noodles and a 60% chance of having rice for lunch, and then a 28% chance of having noodles and a 72% chance of having rice for dinner.

Other possibilities for the state vector

Actually, I oversimplified the above example a bit. The state vector doesn’t have to be a binary thing. Since its values must sum to 1, it can also reflect an initial probability distribution:

$$ \pi^{(0)} = \begin{bmatrix}0.45 & 0.55\end{bmatrix} $$

As you can see, this changes the subsequent probabilities:

In [19]: now = np.array([[0.45,0.55]])

In [20]: now @ meals
Out[20]: array([[0.29, 0.71]])
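Working out the multiplication by hand confirms it:

$$ \pi^{(1)} = \begin{bmatrix}0.45 & 0.55\end{bmatrix} \begin{bmatrix} 0.4 & 0.6 \\ 0.2 & 0.8 \end{bmatrix} = \begin{bmatrix} 0.45(0.4) + 0.55(0.2) & 0.45(0.6) + 0.55(0.8) \end{bmatrix} = \begin{bmatrix} 0.29 & 0.71 \end{bmatrix} $$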

In the HIV/AIDS scenario I referred to at the beginning of this post, the initial state vector is used to represent the population (since subsets of a population can be expressed as the probability that a random person will belong to that subset).

Here are the givens of that problem. The states are HIV asymptomatic, HIV symptomatic, AIDS, and death.

$$ \begin{gathered} P = \begin{bmatrix} 0.90 & 0.07 & 0.02 & 0.01 \\ 0 & 0.93 & 0.05 & 0.02 \\ 0 & 0 & 0.85 & 0.15 \\ 0 & 0 & 0 & 1.00 \end{bmatrix} \\ \pi^{(0)} = \begin{bmatrix} 0.85 & 0.10 & 0.05 & 0 \end{bmatrix} \end{gathered} $$

This means that the initial population was 85% asymptomatic, 10% symptomatic, 5% AIDS patients, and 0% dead.

If we used a vector with a single 1 and the rest 0s, then that would represent the outcomes for a homogeneous population, or more likely, a single person.

hiv = np.array([[0.9, 0.07, 0.02, 0.01],
                [0  , 0.93, 0.05, 0.02],
                [0  , 0   , 0.85, 0.15],
                [0  , 0   , 0   , 1   ]])
population = np.array([[0.85, 0.1, 0.05, 0]])

In [33]: population @ hiv
Out[33]: array([[0.765 , 0.1525, 0.0645, 0.018 ]])

So the outcomes after 1 year (\(t = 1\)) are 76.5% asymptomatic, 15.25% symptomatic, 6.45% AIDS patients, and 1.8% dead.
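And for a single person who starts out asymptomatic, a one-hot vector simply picks out the first row of the matrix, just like the row-filtering earlier (person is an illustrative name):

person = np.array([[1, 0, 0, 0]])  # one asymptomatic person
person @ hiv  # array([[0.9 , 0.07, 0.02, 0.01]]): row 1 of the matrix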

Python-izing a Markov chain

Maybe this is kind of overkill, but we can implement our system as a class.

class MarkovChain:
    def __init__(self, matrix, init_state):
        # Validate the types first
        if not isinstance(matrix, np.ndarray) or not isinstance(init_state, np.ndarray):
            raise TypeError('The Markov matrix and state vector must be NumPy arrays.')
        # Validate the matrix
        elif matrix.shape[0] != matrix.shape[1]:
            raise ValueError('Markov matrices must be square.')
        elif (matrix < 0).any() or (init_state < 0).any():
            raise ValueError('Probabilities cannot be negative.')
        elif not np.allclose(matrix.sum(axis=1), 1):  # allclose avoids float-equality bugs
            raise ValueError('Each row in a Markov matrix must sum to 1.')
        # Validate the init_state
        elif init_state.shape != (1, matrix.shape[1]):
            raise ValueError(f'The state vector must have a shape of 1 x {matrix.shape[1]}.')
        elif not np.isclose(init_state.sum(), 1):
            raise ValueError('The values in the state vector must sum to 1.')

        self.matrix = matrix
        self.init_state = init_state
        self.time = 0

        self.probs = init_state
        self.evolve()

    def evolve(self, times=1):
        # Each time step is one right-multiplication by the transition matrix
        for _ in range(times):
            self.probs = self.probs @ self.matrix
            self.time += 1
        self.status()

    def status(self):
        probs_list = [f'{round(p * 100, 2)}%' for p in self.probs[0]]
        print(f'At time {self.time}, the transition probabilities are {probs_list}.')

    def restart(self, init_state=None):
        # Call with no args to start over with the same initial state
        if init_state is None:  # `not init_state` would be ambiguous for a NumPy array
            init_state = self.init_state
        self.__init__(self.matrix, init_state)
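As a quick check that the validation works, here is a deliberately broken matrix (bad is an illustrative name):

bad = np.array([[0.5, 0.6],   # oops: this row sums to 1.1
                [0.2, 0.8]])
MarkovChain(bad, np.array([[1, 0]]))
# ValueError: Each row in a Markov matrix must sum to 1.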

Now we can see what the population will be like after many years with just a few keystrokes:

In [40]: m = MarkovChain(hiv, population)
At time 1, the transition probabilities are ['76.5%', '15.25%', '6.45%', '1.8%'].

In [41]: m.evolve(10)
At time 11, the transition probabilities are ['26.67%', '31.53%', '13.58%', '28.21%'].

In [42]: m.evolve(10)
At time 21, the transition probabilities are ['9.3%', '23.68%', '11.67%', '55.34%'].

In [43]: m.evolve(10)
At time 31, the transition probabilities are ['3.24%', '14.4%', '7.71%', '74.65%'].

75% dead after 30 years—grim outlook, to say the least. (Though to be fair, the model takes into account that people can and do die of things other than HIV/AIDS. This is represented in the matrix by the nonzero probability of asymptomatic people transitioning directly to death.)
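As a small preview of that follow-up: the death state is what’s called an absorbing state, because its row sends all of its probability back to itself, so in the long run the chain funnels everyone there. A large matrix power makes this visible:

np.round(population @ np.linalg.matrix_power(hiv, 200), 4)
# array([[0., 0., 0., 1.]]): eventually all of the probability pools in death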

…aaaand on that note, I’ll end this post here for now. I will write about Markov chains again in the near future.
