Math FAQ
Table of contents
- What is Bayesian statistics?
- What is a beta distribution?
- What is probability density, a.k.a. the comparative likelihood?
- How do posterior distributions differ from statistical significance tests?
- What is the equation that generates the graph?
- How do you compute the probabilities using the beta distribution equation?
- How was the math coded for this website?
- Why use a beta distribution rather than a normal distribution?
- Why use an uninformative prior for this beta distribution?
- Why do more data points result in narrower intervals?
What is Bayesian statistics?
Bayesian statistics is a mathematical approach to data analysis that uses probability to update beliefs about random events based on new evidence. It's named after English mathematician Thomas Bayes.
The equation is
Posterior = Likelihood × Prior ÷ Evidence
Or, in mathematical terms
\(P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}\)
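As a quick numeric illustration (the numbers here are made up for the example, not data from this site), a minimal Python sketch of that update:

```python
# Minimal sketch of Bayes' theorem with illustrative (made-up) numbers.
# P(A)   : prior probability of the hypothesis A
# P(B|A) : likelihood of seeing evidence B if A is true
# P(B)   : overall probability of seeing evidence B

p_a = 0.30          # prior
p_b_given_a = 0.80  # likelihood
p_b = 0.50          # evidence

p_a_given_b = p_b_given_a * p_a / p_b  # posterior
print(p_a_given_b)  # 0.48 -- belief in A rises from 30% to 48% after seeing B
```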
What is a beta distribution?
A beta distribution is a continuous probability distribution defined on the interval from 0% to 100%. The area under the curve represents all possible "true" percentages, useful for considering a set of people you wish to serve.
What is probability density, a.k.a. the comparative likelihood?
The y axis is probability density for a probability density function. In plain English, probability density represents comparative likelihood.
Example: For the beta distribution from 4 out of 5 participants, the peak is at 80%, where the probability density is 2.46. The 10th percentile, at 49%, has a probability density of 0.88. The comparative likelihood is therefore 2.46 / 0.88 = 2.8, meaning the real-world percentage is 2.8 times more likely to be 80% than 49%.
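A short Python sketch using scipy.stats.beta (the same library as the site's proof of concept, though this exact snippet is only an illustration) reproduces those numbers:

```python
from scipy.stats import beta

# 4 successes out of 5 participants -> Beta(alpha=5, beta=2)
a, b = 4 + 1, 1 + 1

peak = 4 / 5                    # mode of the distribution: 80%
p10 = beta.ppf(0.10, a, b)      # 10th percentile, roughly 0.49

density_peak = beta.pdf(peak, a, b)  # about 2.46
density_p10 = beta.pdf(p10, a, b)    # about 0.88

print(round(p10, 2), round(density_peak, 2), round(density_p10, 2))
print(round(density_peak / density_p10, 1))  # about 2.8
```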
How do posterior distributions differ from statistical significance tests?
Posterior distributions produce plausible real-world percentage rates based on limited data. Because qualitative research naturally deals with small sample sizes, Bayesian methods are well suited to it.
When it comes to analyzing quantitative data, many researchers are familiar with statistical tools like t-tests or chi-squared tests. These belong to the frequentist class of methods and estimate the probability that the data were generated by chance rather than by a defined trend. Many such tests produce confidence intervals along the way, but those intervals are a byproduct of the algorithm rather than the central object of concern.
What is the equation that generates the graph?
The equation for a beta distribution is as follows:
\(f(x) = \frac{x^{\alpha-1} \cdot (1-x)^{\beta-1}}{B(\alpha, \beta)} \quad \text{for } x \in [0,1]\)
Where
- α = number of successes + 1
- β = number of failures + 1
- B = normalization constant, known as the beta function. This ensures that the total area under the curve is 1 (i.e. 100%).
- x = the individual value on the x axis for which we are computing the y axis value
- f(x) = the y axis value we are computing for each x value
- x ∈ [0,1] = compute for values of x between 0 and 1, such as 0, 0.01, 0.02, 0.03...0.98, 0.99, 1.00
The beta function
\(B(\alpha, \beta) = \frac{\Gamma(\alpha) \cdot \Gamma(\beta)}{\Gamma(\alpha+\beta)} = \frac{(\alpha-1)! \cdot (\beta-1)!}{(\alpha+\beta-1)!}\)
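As a sanity check, here is a small Python sketch that evaluates the formula directly with math.gamma and compares the normalization constant against scipy.special.beta; the α = 5, β = 2 values are just an example:

```python
import math
from scipy.special import beta as beta_fn  # the normalization constant B(alpha, beta)

def beta_pdf(x, a, b):
    """Evaluate f(x) = x^(a-1) * (1-x)^(b-1) / B(a, b) by hand."""
    B = math.gamma(a) * math.gamma(b) / math.gamma(a + b)
    return x ** (a - 1) * (1 - x) ** (b - 1) / B

a, b = 5, 2        # e.g. 4 successes and 1 failure, each plus 1
x = 0.8
print(beta_pdf(x, a, b))       # ~2.458
print(beta_fn(a, b), 1 / 30)   # B(5, 2) = 4! * 1! / 6! = 1/30, both ~0.0333
```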
How do you compute the probabilities using the beta distribution equation?
Imagine you have a coin and want to determine the likelihood it is a fair coin which comes up heads and tails at a 50/50 rate. You flip the coin 10 times, and it comes up heads 8 of those times.
- α = number of successes (heads) + 1 = 8 + 1 = 9
- β = number of failures (tails) + 1 = 2 + 1 = 3
Let's get our beta distribution equation
\(f(x) = \frac{x^{\alpha-1} \cdot (1-x)^{\beta-1}}{B(\alpha, \beta)}\)
Replace the normalization constant with the beta function
\(f(x) = \frac{x^{\alpha-1} \cdot (1-x)^{\beta-1}}{\frac{(\alpha-1)! \cdot (\beta-1)!}{(\alpha+\beta-1)!}}\)
Replace α and β with our data points
\(f(x) = \frac{x^{9-1} \cdot (1-x)^{3-1}}{\frac{(9-1)! \cdot (3-1)!}{(9+3-1)!}}\)
Simplify
\(f(x) = \frac{x^{8} \cdot (1-x)^{2}}{\frac{8! \cdot 2!}{11!}}\)
Evaluate the normalization constant (8! · 2! / 11! = 1/495)
\(f(x) = \frac{x^{8} \cdot (1-x)^{2}}{\frac{1}{495}}\)
Simplify
\(f(x) = x^{8} \cdot (1-x)^{2} \cdot 495\)
Now that we have our core equation, we evaluate it at values of x between 0% and 100% (e.g. 0%, 10%, 20%, 30%, all the way up to 100%).
- \(f(0.0) = 0.0^{8} \cdot (1-0.0)^{2} \cdot 495\) = 0
- \(f(0.1) = 0.1^{8} \cdot (1-0.1)^{2} \cdot 495\) = 0.000
- \(f(0.2) = 0.2^{8} \cdot (1-0.2)^{2} \cdot 495\) = 0.001
- \(f(0.3) = 0.3^{8} \cdot (1-0.3)^{2} \cdot 495\) = 0.016
- \(f(0.4) = 0.4^{8} \cdot (1-0.4)^{2} \cdot 495\) = 0.117
- \(f(0.5) = 0.5^{8} \cdot (1-0.5)^{2} \cdot 495\) = 0.483
- \(f(0.6) = 0.6^{8} \cdot (1-0.6)^{2} \cdot 495\) = 1.330
- \(f(0.7) = 0.7^{8} \cdot (1-0.7)^{2} \cdot 495\) = 2.568
- \(f(0.8) = 0.8^{8} \cdot (1-0.8)^{2} \cdot 495\) = 3.322
- \(f(0.9) = 0.9^{8} \cdot (1-0.9)^{2} \cdot 495\) = 2.131
- \(f(1.0) = 1.0^{8} \cdot (1-1.0)^{2} \cdot 495\) = 0
Plot the data points. This matches what the website produces, except that the website calculates many more values of x and therefore draws a smoother curve.
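A small Python sketch in the spirit of the site's proof of concept (scipy.stats.beta, numpy, and matplotlib) reproduces the table and draws the curve:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import beta

a, b = 9, 3  # 8 heads + 1, 2 tails + 1

# Reproduce the table above at 10% steps
for x in np.arange(0, 1.01, 0.1):
    print(f"f({x:.1f}) = {beta.pdf(x, a, b):.3f}")

# Posterior probability that the true heads rate is 50% or lower (~0.033),
# one way to quantify "unlikely to be a fair coin"
print(beta.cdf(0.5, a, b))

# A smoother curve, as on the website, just uses many more x values
xs = np.linspace(0, 1, 501)
plt.plot(xs, beta.pdf(xs, a, b))
plt.xlabel("true heads rate")
plt.ylabel("probability density")
plt.show()
```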
Reviewing the results, it is unlikely that this is a fair coin: the density near 50% is small compared to the peak around 80%.
How was the math coded for this website?
The math proof of concept was built in Python with scipy.stats.beta and uses numpy and matplotlib.
However, the website runs this math in JavaScript, and the core equation relies on factorials (e.g. 10! = 10*9*8*7*6*5*4*3*2*1). JavaScript caps numerical values at about 1.8E+308 and treats anything larger as infinity, so the calculation breaks for the factorials of big numbers (e.g. 100!) that large-N samples require.
Roy Hung refactored the equation to use a log beta function so it does not break for large numbers. His implementation was used for this website because it works elegantly and avoids the JavaScript infinity problem.
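The same log-space idea can be sketched in Python with math.lgamma; this is only an illustration of the technique, not Roy Hung's actual JavaScript implementation:

```python
import math

def log_beta(a, b):
    # log B(a, b) via log-gamma, so large a and b never overflow
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def beta_pdf(x, a, b):
    # Beta density computed in log space, exponentiated only at the end
    if x <= 0 or x >= 1:
        return 0.0  # endpoints (correct for a, b > 1, as with "+ 1" counts)
    log_pdf = (a - 1) * math.log(x) + (b - 1) * math.log(1 - x) - log_beta(a, b)
    return math.exp(log_pdf)

# Works even where a factorial-based formula would overflow,
# e.g. 150 successes out of 200 participants
print(beta_pdf(0.75, 151, 51))
```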
Why use a beta distribution rather than a normal distribution?
Beta distributions can be skewed, unlike normal distributions, which are symmetric about their center.
Because we are dealing with the interval between 0 and 1 -- also known as 0% to 100% -- a symmetric distribution that is not centered on the 50% mark can produce impossible findings. For example, a wide normal distribution centered on the 90% mark would treat values above 100% as possible results, which is impossible when talking about people in the real world.
Therefore, we use a beta distribution, which can skew as needed and is bounded, ensuring all probabilities remain between 0% and 100%.
Why use an uninformative prior for this beta distribution?
For a pure usability study in which we have no prior knowledge of how well or poorly users would be able to use the product in the real world, we treat the design as unknown (neither usable nor unusable) until it has been tested with participants.
In this case, our uninformative prior is Beta(1, 1), which spreads belief evenly across every value from 0% to 100%.
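In code terms, the uninformative prior simply means both parameters start at 1, giving a flat curve; a minimal sketch (the 4-of-5 numbers are just an example):

```python
from scipy.stats import beta

prior_a, prior_b = 1, 1  # Beta(1, 1): flat, every rate from 0% to 100% equally likely
print(beta.pdf(0.2, prior_a, prior_b), beta.pdf(0.8, prior_a, prior_b))  # 1.0 1.0

# After observing, say, 4 successes and 1 failure, the posterior is Beta(5, 2)
successes, failures = 4, 1
post_a, post_b = prior_a + successes, prior_b + failures
print(post_a, post_b)  # 5 2
```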
Why do more data points result in narrower intervals?
Imagine multiple tests are run with different sample sizes, and the results are as follows:
- 3 out of 5 participants experienced a phenomenon.
- 6 out of 10 participants experienced a phenomenon.
- 12 out of 20 participants experienced a phenomenon.
The magic of Bayesian statistics is that more data gets us closer to the truth, while still giving reasonable approximations for smaller-N data. Let's take a look at how these mathematically identical percentages (each result above is 60%) give us different beta probability density functions, with the ranges below taken from their graphs (and recomputed in the sketch after this list):
- 3 out of 5 graph: 33-80%
- 6 out of 10 graph: 40-76%
- 12 out of 20 graph: 46-72%
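A short sketch that recomputes these ranges with scipy.stats.beta, assuming they are the central 80% (10th to 90th percentile) intervals, which matches the numbers quoted above:

```python
from scipy.stats import beta

# Assumption: the quoted ranges are the central 80% (10th-90th percentile) intervals
for successes, n in [(3, 5), (6, 10), (12, 20)]:
    a, b = successes + 1, (n - successes) + 1
    low, high = beta.interval(0.8, a, b)
    print(f"{successes} of {n}: {low:.0%} to {high:.0%}")
    # prints roughly: 33% to 80%, 40% to 76%, 46% to 72%
```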
Early on we can already see the shape that this is taking and get a sense of the likely probability.
But for qualitative research, doubling the number of participants for each test (and roughly doubling the work of collecting, reviewing, and analyzing the data) does not meaningfully change the end results that would inform the team's decision.
This shows that there are diminishing returns to greatly expanding the number of participants from whom we collect data.