Where will the stat wars lead us?
Posted Dec 06, 2017
I am putting myself to the fullest possible use, which is all I think that any conscious entity can ever hope to do. ~ Hal
The Bayesians want us to be Bayesians (e.g., Krueger, 2017). This is just as well because much human learning can be modeled as a process of belief updating. As evidence presents itself, beliefs can and should drift toward that evidence. Bayesian methods provide the math to work out this process quantitatively. However, human cognition also requires inference and action. Interest in inference, decision, and action is usually associated with the frequentist schools of statistics. They have given us significance tests, which allow tentative inferences given the data.
Nonstatisticians, god bless them, find it difficult to appreciate the ideological differences between the Bayesians and the frequentists; and once they understand the differences conceptually, they still find it difficult to understand why the school-bound statisticians get so worked up about the evils of the other school. Is there, perhaps, a religious element of faith in first assumptions, like the idea that hypotheses vary randomly and data are what they are (Bayesians) vs. the inverse idea that hypotheses are selected and set, while data vary with random sampling (frequentists); or the idea that the quantitative input of interest is probabilities (frequentists) or likelihoods (Bayesians)?
Why can’t they all just get along? Some of us – I (Krueger, 2001) count myself in good company (Cohen, 1994; Nickerson, 2000) – have attempted a synthesis, but – as history has taught – the radicals will continue to insist on having their way and their way only. The current debate over best statistical practices, where the Bayesians do most of the yelling, will not be resolved mathematically or by a victory validated by nature, but by sociological forces. Fleck (1935; cf. Krueger, 2012) and Kuhn (1962) taught this. A sea change in science ‘may’ involve progress in the sense of greater 'verisimilitude' (Meehl, 1990) between the models of nature and nature, but the processes leading to that change are rife with rhetoric, bullying, and death (of the old-timers).
I like Bayesian statistics, but I hate the way they are being forced upon us as the only way of knowing God. And I think frequentist statistics have much to recommend themselves, even when embedded in a larger Bayesian context (Krueger & Heck, 2017). For one thing, frequentist methods are easier and more intuitive. People naturally think in terms of frequencies, once they learn how to count. They can, on a good day, wrap their heads around the concept of probability, but 'likelihood!?' A likelihood is a numerical value on a density function, which has no meaning other than being larger or smaller than some other point on that function. Hence, the Bayesian insistence on the 'relative strength' of the evidence. With likelihoods, absolute strength is undefined. Although you may think you have seen the term p(H|D), the probability of the hypothesis given the data, in Bayes’ theorem, contemporary Bayesians don’t want you to think that way. Instead, they ask you to contemplate the so-called Bayes Factor, or BF, which is a double ratio. Now, I am going to take you there, but otherwise, I will retain the notation of probability. Probabilities are nearly perfectly log-log correlated with likelihoods (Krueger & Heck, 2017), so it does not make no never mind.
The BF takes the ratio of the probability of hypothesis H given the observed data D over the probability of the alternative hypothesis (non-H or ~H) given the data and divides it by the ratio of the prior probability of H over the prior of ~H. I am taking it for granted that p(~H|D) = 1 – p(H|D) and that p(~H) = 1 – p(H); that is, we are setting up two hypotheses that are exhaustive and mutually exclusive so that ~H is fully determined by – and perfectly inversely correlated with – H. This means that the terms p(~H) and p(~H|D) do not contribute any information. As the BF is meant to tell us how much we have learned from the data about the hypotheses, that is, how much our beliefs ought to change in response to the evidence, we might as well restrict ourselves to contemplating the ratio of p(H|D) over p(H). How much more (or less) likely has the target hypothesis become as a result of our study?
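For concreteness, the double ratio can be written out numerically. A minimal sketch in Python; the probabilities below are invented for illustration and do not come from any study:

```python
# Invented probabilities, purely for illustration.
p_H = 0.5           # prior probability of hypothesis H
p_H_given_D = 0.8   # posterior probability of H after seeing data D

# With exhaustive, mutually exclusive hypotheses, ~H is fully determined by H:
p_notH = 1 - p_H
p_notH_given_D = 1 - p_H_given_D

# The Bayes factor as a double ratio: posterior odds over prior odds.
posterior_odds = p_H_given_D / p_notH_given_D
prior_odds = p_H / p_notH
BF = posterior_odds / prior_odds

# The simpler belief-updating ratio: how much more probable has H become?
update_ratio = p_H_given_D / p_H

print(BF, update_ratio)
```

The two numbers differ, but notice that once ~H is pinned down by H, the double ratio is computed entirely from p(H|D) and p(H); it carries no information those two terms do not already contain.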
The term p(H|D) can be recovered from Bayes’ theorem as the product of the hypothesis’ prior probability p(H) and the so-called likelihood ratio (for me, a probability ratio), which is p(D|H) over p(D). The probability of the data under the hypothesis, p(D|H), is what significance testing delivers. The probability of the data, p(D), is p(D|H) times p(H) plus p(D|~H) times p(~H). We can now see that the belief updating ratio p(H|D) over p(H) is equal to the ratio of p(D|H) over p(D). Why use a double ratio (like the BF) when a simple ratio does all the work? And in that simple ratio, the numerator (i.e., the analog to the frequentist p-value) plays a strong role in driving the Bayesian outcome. The denominator will vary depending on which alternative hypothesis the researcher sets up. When two specific hypotheses are in play, they can both be tested with frequentist methods. There will be two p values, and a comparison between them will inform inference. Taking their ratio is just a ritual. When there is no specific alternative, researchers may assume diffuse or distributed alternatives to the specific hypothesis being tested, but that in fact undermines the original Bayesian intention of forcing researchers to commit to a minimum of two precise and theory-driven hypotheses. Indeed, the results obtained with so-called Bayesian default tests are nearly perfectly log-log correlated with conventional p values (Krueger & Heck, 2017).
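The equality of the belief-updating ratio and the simple ratio can be checked numerically. A small sketch, with made-up prior and likelihood values:

```python
# Made-up inputs: a prior and the probability of the data under each hypothesis.
p_H = 0.5
p_D_given_H = 0.10
p_D_given_notH = 0.02

# Law of total probability: p(D) = p(D|H)p(H) + p(D|~H)p(~H).
p_D = p_D_given_H * p_H + p_D_given_notH * (1 - p_H)

# Bayes' theorem: p(H|D) = p(D|H) * p(H) / p(D).
p_H_given_D = p_D_given_H * p_H / p_D

update_ratio = p_H_given_D / p_H   # how much belief in H has changed
simple_ratio = p_D_given_H / p_D   # the simple ratio from the text

print(update_ratio, simple_ratio)  # the two ratios coincide
```

The coincidence is not an accident of the chosen numbers; dividing both sides of Bayes' theorem by p(H) yields it for any inputs.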
The Bayesian insistence on capturing the 'relative' strength of the evidence for this or that hypothesis creates another headache (for them), a curious contradictio in adjecto. One of the major complaints Bayesians lob at the frequentist significance testers is that they, the testers, can only reject a hypothesis, and that is often the hypothesis of nothing (the null). Significance testing does not permit a positive inference regarding the null. Yet, such inferences seem desirable in a world that does have some invariances. Bayesians say that their method can support a belief in sameness (between conditions) or in nothing (no difference; no phenomenon). Now, if all belief updating must be relative, we can only conclude that given the data, the idea of ‘nothing’ has stronger support than the idea of ‘not nothing.’ But what does it do for us if we can say that now in light of the evidence the absence of a putative phenomenon (say telepathy) has become five times more likely or that the data support absence five times more strongly than they support presence? If all you have is a measure of the degree to which your belief should change, do you then know what to believe? Hint: you do if you compute p(H|D) with the probability version of Bayes’ theorem.
With likelihoods, you’re on your own.
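To spell out the hint above: a Bayes factor of five for 'no telepathy' tells you how far to move, not where you end up. A sketch, with invented prior values, of how the same fivefold support lands in very different places depending on the prior:

```python
def posterior(prior, bf):
    """Combine a prior p(H) with a Bayes factor favoring H to get p(H|D)."""
    prior_odds = prior / (1 - prior)
    posterior_odds = bf * prior_odds
    return posterior_odds / (1 + posterior_odds)

# A skeptic, an agnostic, and a believer all update by the same BF = 5,
# yet they walk away with quite different posterior beliefs.
for prior in (0.1, 0.5, 0.9):
    print(prior, round(posterior(prior, 5), 3))
```

Knowing what to believe thus requires the prior as well as the factor, which is the probability version of Bayes' theorem doing its work.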
Nachspiel in the Bayes Factory
My description of the Bayesian view is simplified. There are several Bayesian views. Not all Bayesians favor the use of Bayes factors; some prefer to contemplate the posterior probability distribution, estimating the modes of these distributions and their credibility intervals. Of those who prefer Bayes factors, some use them in the way I describe above, that is, as odds ratios, whereas others treat the likelihood ratio as the Bayes factor. The real Bayesians will not stand up. This lack of consensus reminds us of the lack of consensus among the frequentists (classic statisticians, significance [hypothesis] testers). We might end up with a hybrid Bayesian scheme, which none of the sub-school advocates will endorse.
The idea of elevating the posterior distribution to the status of the most significant (i.e., important; informative) statistical item still puzzles me. A probability distribution presents itself as a density distribution, where each point on the x axis has a value on the y axis, which is meaningful only in comparison with the values associated with other points, hence the need for likelihood ratios. When we behold an entire density function, we can do so holistically, or compute likelihood ratios. Any particular choice of a likelihood ratio needs to be theoretically justified with considerations outside of the Bayesian framework of operation, lest we tumble into the paralysis of computing an infinite number of possible likelihood ratios.
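A minimal sketch of why a lone density value is mute: the normal density below is textbook material, while the observation and the two hypothesized means are invented for illustration:

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of a normal distribution at x (a height on a curve, not a probability)."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

obs = 1.0  # a single invented observation

# Each likelihood, taken alone, is just a point on a density function.
lik_H0 = normal_pdf(obs, mu=0.0)  # likelihood under H0: mean = 0
lik_H1 = normal_pdf(obs, mu=1.0)  # likelihood under H1: mean = 1

# Only the comparison carries meaning: here the data favor H1 over H0.
print(lik_H0 / lik_H1)
```

Which two points to compare is exactly the choice that must be justified from outside the machinery; the density function itself offers infinitely many candidate ratios.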
Perhaps more significantly (in the nontechnical sense), I cannot (yet) bring myself to accept the idea that humans, or other animals, think in terms of density functions or selected ratios of individually meaningless density values. I can see (see above) how we contemplate frequencies and simple probabilities. If we have any sympathy for the view that statistics is at base a formalization of cognition (of which Bayesian models often presume to be a representation), then classic frequentism has still something going for it.
Meanwhile, statistics and our selective use thereof remain a patchwork.
Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997-1003.
Fleck, L. (1935). Entstehung und Entwicklung einer wissenschaftlichen Tatsache: Einführung in die Lehre vom Denkstil und Denkkollektiv [Genesis and development of a scientific fact: Introduction to the theory of thought style and thought collective]. Basel, Switzerland: Benno Schwabe & Co.
Krueger, J. (2001). Null hypothesis significance testing: On the survival of a flawed method. American Psychologist, 56, 16-26.
Krueger, J. I. (2012) Fleck is back. Psychology Today Online. https://www.psychologytoday.com/blog/one-among-many/201207/fleck-is-back
Krueger, J. I. (2017). The kingdom of Bayes. Psychology Today Online. https://www.psychologytoday.com/blog/one-among-many/201712/the-kingdom-bayes
Krueger, J. I., & Heck, P. R. (2017). The heuristic value of p in inductive statistical inference. Frontiers in Psychology, 8, 908. https://doi.org/10.3389/fpsyg.2017.00908
Kuhn, T. S. (1962). The structure of scientific revolutions. Chicago, IL: University of Chicago Press.
Meehl, P. E. (1990). Corroboration and verisimilitude: Against Lakatos’ “sheer leap of faith” (Working paper MCPS-90-01). Minneapolis, MN: University of Minnesota, Center for Philosophy of Science.
Nickerson, R. S. (2000). Null hypothesis significance testing: a review of an old and continuing controversy. Psychological Methods, 5, 241–301.