The Dirichlet distribution for cell-cycle flow cytometry statistics

July 8, 2012 • Category: Science

I recently had to analyze some cell cycle flow cytometry data in the form:

Condition	G₀/G₁	S	G₂/M
A	5432	142	503
B	2004	501	142
A	4014	234	599
B	3010	503	120
...

Surprisingly, there isn’t a lot of software written to do this correctly, so I wrote my own, which you can install from my github account using the following command:

pip install git+https://github.com/ericsuh/dirichlet.git

If you want to see if the above data from the two conditions A and B are different from each other, you might think to try is a $\chi^2$ or a G-test, but both of these tests, at least out of the box, test the wrong kind of variation.

To a biologist, the numbers above (if that was all the data there is) would indicate a sample size of 4, whereas to the $\chi^2$ or G-test, the sample size is on the order of the sum of the rows. In other words, the biologist is interested in how large the variation is from experiment to experiment, but the $\chi^2$ or G-tests would give back how much variation there is within a particular experiment.^[That is, the $\chi^2$ and G-tests measure the “technical” variation.] If you aren’t careful, you can get ridiculous p-values like $10^{-300}$. In comparison, the number of atoms in the universe is somewhere near $10^{80}$, so a test giving $p < 10^{-80}$ essentially says that an experiment is so unlikely under the null hypothesis that it would never happen to you in the history of the universe. Anyone who has worked at the lab bench can agree this is monumentally not true.

The alternative to the $\chi^2$ or G-tests is a Dirichlet likelihood ratio test. The main idea is that when you divide each row of the above data by its sum to give percentages or fractions, those numbers can be modeled by a Dirichlet distribution, which is one of the few probability distributions defined over the numbers ${x_0, \ldots, x_K}$ where for all $i$, $x_i \in [0,1]$ and $\sum_i x_i = 1$.^[It’s a multidimensional form of the Beta distribution, as defined on that simplex.] This is perfect for data sets like the fractions of cells in different cell cycle phases.

Unfortunately, there’s no easy solution to the equation to fit a Dirichlet model to data, so the Python package I wrote uses a methods outlined by Thomas Minka in his paper and his MATLAB code. You can then see if the fit with two different Dirichlet distributions is sufficiently better than fitting just one Dirichlet distribution to justify claiming that you have two different data sets.

At some point, I will be making a web interface for the code so that the non-technical folks in my lab (1) don’t have to learn Python, and (2) even if they learn Python, don’t have to go through the hell that is configuring and installing numpy and scipy.