Comparing different methods to calculate Bayes factors for a simple model
 Introduction
 Comparing the four BF methods
 Evidence for null hypothesis:
brms
vs.ttestBF()
 How much more conservative is the
bmrs
approach?  Directed vs. nondirected hypothesis
 Conclusion
Introduction
In this post, I’d like to examine how different ways to calcualte Bayes
factors (BF) compare with each other for a simple model. This simulation
was run on my department’s HPC using the rslurm
package: the
corresponding R code, data and .Rmd for this post are available.
For this simulation, I generated data of a withinsubject model with two levels. The simulation comprised 10,000 iteration per effect size Cohen’s d (0.0, 0.5 and 0.8). Samples of 60 were generated with this function:
dataGenerator < function(id_index, n, d){
data_len < length(id_index:(n + id_index  1))
x < rep(c(0, 1), data_len)
b0 < rep(rnorm(data_len), each = 2)
y < b0 + x * rnorm(data_len*2, d)
y < scale(y)
df < data.frame(id = rep(id_index:(n + id_index  1), each = 2),
x = x,
y = y)
return(df)
}
The equation of this model can also be written as:
$$y_{ji} = b_{0j} + x_{i}N(d, 1)$$
where $y_{ji}$ is the value for the jth subject and the ith level,
$b_{0j}$ a subject specific intercept and $x_{i}$ is the level
dummycoded (0 & 1) that is multiplied with a value drawn from normal
distribution with a mean of $d$ (0.0, 0.5 or 0.8) and a standard
deviation (SD) of 1. For the brms
model, we scaled the data to a mean
of 0 and an SD of 1.
I then compared 4 ways to calculate a BF to examine the mean difference of the two levels.
ttestBF()
: For the first method, I usedttestBF()
from theBayesFactor
package. This is a widely used Bayesian implementation for the classical ttests (Rouder et al., 2009). For this simulation, I used the standard medium rscale ofsqrt(2)/2
≈ 0.707. Note that this approach uses different priors (Cauchy/Jeffry’s prior) and likelihood functions than thebrms
models below. So, I expect differences but it still can be used as a reference point.
The reaming four methods are all based on the same brms
model.
model_full < brm(y ~ x + (1  id), data = df, prior = priors_normal_full)
For the intercept and the beta value, I placed a normal prior, N(0, 1):
priors_normal_full < c(prior(normal(0, 1), class = "Intercept"),
prior(normal(0, 1), class = "b"))
To make full use of our HPC, I only used 1 chain per model. Each run used 36000 samples in total of which 3000 were for warmup.
bayes_factor()
: For this method, I compared the full model with a null model that did not include x.
The other two methods are based on an approximation of the BF that does not calculate the marginal likelihood but instead the point density at zero for the prior and the posterior distribution, which is known as the SavageDickey ratio.

hypothesis()
: This function is a SavageDickey ratio implementation from thebrms
package. Can be used like thishypothesis(model_full, 'x = 0')
. 
manual: This approach estimates the point density via Logspline Density Estimation. This is adapted from code from Wagenmakers et al. (2010) and can be found on his website. This method used
logspline()
from thepolspline
package. Note that instead of also estimating the point density from the prior samples as generated bybrms
, I just used the respective base functiondnorm()
.
fit.posterior < logspline(posterior_samples(model_full)$b_x)
posterior < dlogspline(0, fit.posterior)
prior < dnorm(0, 0, 1)
bf_manual < prior/posterior
This method by the way also allows to easily calculate the BF for directed hypotheses (i.e. d = 0 vs. d > 0):
areaPosterior < sum(posterior_samples(model_full)$b_x > 0)/length(posterior_samples(model_full)$b_x)
posterior.OR < posterior/areaPosterior
prior.OR < prior/0.5
bf_manual_OR < prior.OR/posterior.OR
See Wagenmakers et al. (2010) for more information on this. This will also examined at the end of this post.
Comparing the four BF methods
Below you find the scatterplots comparing the different ways of estimating a BF. To aid illustration, the data is presented in natural logspace. Each plot contains up to 10,000 data points.
The the scatterplots reveal when d = 0.0 that all methods compare well
with each other. In contrast for d = 0.5 and d = 0.8, hypothesis()
function starts to break down at around BF10 = exp(5) ≈ 150 and only
returns extremely inflated estimates. The manual method via Logspline
Density Estimation leads to more stable estimates, while the spread
(e.g. manual vs. bayes_factor()
) also becomes larger at around BF10 =
exp(7) ≈ 1100. However, the estimations are not skewed as the ones
returned by hypothesis()
. In that sense, the manual method based on
Logspline Density Estimation is superior. The reason for that is that
hypothesis()
does not perform well if the estimation is only based on
very few values around zero, which happens when the distribution is
shifted away from zero.
As already noted above, there ttestBF()
and the brms
model have
different priors and likelihood functions, but comparison with the BF
from ttestBF is still useful to examine if they point in the same
general direction and indeed they do. In fact, the correlation for d =
0.5 and d = 0.8 is extremely linear. However, we also see that in the
top right corner that for d = 0.0 there is an interesting increasing
spread towards low values (e.g. see ttestBF()
and bayes_factor()
).
The reason for this spread reflect the different performance of the
methods to produce evidence for H0 (see next section).
All in all, the manual method does a good job with the advantage that it
is a lot after than bayes_factor()
providing sensible estimates in
cases that matter most to use as psychologists/cognitive neurosciences.
In most circumstances, it doesn’t really matter if a BF10 is 1500 or
only 1100.
Evidence for null hypothesis: brms
vs. ttestBF()
To illustrate the difference between brms
and ttestBF()
, I binned
the BF into 8 categories and compared their
frequencies.
A look at the two distributions shows that the brms
model with unit
prior of N(0,1) leads to stronger evidence for the null when the null
hypothesis is true than ttestBF()
with its Cauchy/Jeffrey’s prior
implementation does. Zero percent of ttestBF()
’s BF10 were < 1/10.
This highlights the more conservative nature of the brms
model. As a
comparison, we can look at the distribution of BF10 when effect size is
d =
0.5.
In this case the percentages are very similar despite being more
conservative, the bmrs
model often leads to the same conclusion. Note
that for d = 0.8 all BF10 were above 10, hence they are not displayed
here but this illustrated that brms
is doing well in these cases as
well.
How much more conservative is the bmrs
approach?
The conservative nature can be illustrated by examining the distribution
of the ratios between both BFs
(ttestBF()
/bayes_factor()
).
The distributions of ratios show that the brms
model more conservative
than ttestBF()
but as the effect size increase the max ratio drops
from 1.8 to 1.4. Median values are 1.308, 1.162 and 1.113 for a d of
0.0, 0.5 and 0.8, respectively. It can be seen that the unitprior in
brms
models regularises much more for lower dvalues than ttestBF()
,
which is a good thing if the aim is to provide evidence in favour of the
null hypothesis.
Directed vs. nondirected hypothesis
On each run in these simulations, I also estimated the orderrestricted BF (i.e. BF for directed hypothesis) that d = 0 vs d > 0.
This shows that when the null is true, the orderrestricted BF is much lower (median of 0.989) compared to d = 0.8 because follows are uniform distribution. For positive effect sizes (d = 0.5 & d = 0.8), the ratio is very close to 2.
Conclusion
All in all, that simulation showed that the brms
model with unitprior
are more conservative than ttestBF()
but lead to similar conclusions
if d > 0. With respect to the four models, I would that the manual
implementation is usually enough as it and easy, quick and intuitive way
to calculate a BF. It definitely out performs the hypothesis()
function that easily breaks down if the posterior distribution has
little overlap with zero. Furthermore, hypothesis()
does not work well
for priors that are hard to sample e.g. Cauchy prior. This is an issue
that has been previously
discussed
but my tests show that the manual method gives quite reasonable
estimates in this scenario.