Unlike the previous homework assignments, this one does not have separate short answer questions that are unrelated to coding. Thus, you will primarily turn in your R code. However, some components do also include short answer questions. These are marked by the word Question in bold. You can submit these answers in a separate pdf OR answer them in the comments of your R script at the appropriate positions.
For this assignment, we’ll make use of a dataset of lexical decision responses. As you may be aware, lexical decision is a task in which participants see a string of characters such as ‘flutter’ or ‘blutcher’ and have to rapidly push a button indicating whether it is a real word or not. Your dataset is the set of responses (‘word’ or ‘nonword’) made to actual words by 17 participants and is available as bicknell_et_al_2010_lexdec_ling300.csv.
In particular, we’re interested in using the binary answers (correct / incorrect) to make inferences about the probability of getting a trial correct in this task, which we’ll call \(\pi\). (Remember that ‘correct’ means responding ‘word’, since this dataset is all actual words.) We’ll use two different Bayesian methods to estimate the posterior distribution on \(\pi\). Note that by itself, of course, this probability isn’t very interesting. However, these models are a simpler piece of a more interesting analysis examining whether \(\pi\) differs across conditions (for example, following related versus unrelated prime words).
First steps:

- Load the packages we’ll need for this assignment: dplyr, ggplot2, and rjags.
- Read the dataset into a variable called dat. (Do not use file.choose().)
- Using the dplyr package, add a column to this data frame called correct that indicates whether each response was correct, i.e., whether the response was ‘word’. (A sketch of these steps appears below.)
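A minimal sketch of these steps, assuming the csv file is in the working directory and that the raw response column is named response (adjust to the actual column name in the file):

```r
library(dplyr)
library(ggplot2)
library(rjags)

# Read the data with a plain path rather than file.choose()
dat <- read.csv("bicknell_et_al_2010_lexdec_ling300.csv")

# correct is 1 when the response was 'word' (all stimuli are real words)
dat <- dat %>% mutate(correct = as.numeric(response == "word"))
```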
Part 1

To start analyzing our beliefs about \(\pi\), we’ll use a very simple model, much like the one we discussed in class, and then perform inference with JAGS.
[Note: before completing this problem, you’ll need to install the Hmisc package if it’s not already installed in R. There’s no need to load it in your script with the library() function, though.] To get a sense of the overall data, plot a 95% bootstrapped confidence interval on the overall probability of responding correct. Since this is a single overall probability, not split by condition, the simplest way to get ggplot to do this is to specify a constant number for the x-axis aesthetic, e.g., aes(1, correct). To get this type of interval out of stat_summary, use mean_cl_boot for the fun.data argument instead of the mean_se you used in the previous assignment. (Note that bootstrapped confidence intervals such as these – which we haven’t discussed in class – are usually much better behaved for parameters bounded between 0 and 1 than the more common standard-error-based intervals.) Questions: What are the endpoints of that interval? How much uncertainty does this (frequentist) analysis reveal about the probability of getting a trial correct (i.e., roughly how wide is the interval)?
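A sketch of one way to produce this plot (mean_cl_boot is a ggplot2 wrapper that requires the Hmisc package to be installed):

```r
ggplot(dat, aes(x = 1, y = correct)) +
  stat_summary(fun.data = mean_cl_boot)
```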
The first step in creating this simple model is to create a variable called model_string_simple that describes the model in JAGS. The model should specify that the variable pi is distributed according to a uniform Beta distribution, and that each element of the variable correct is drawn from a Bernoulli trial with parameter pi.
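One way to write this model string; in JAGS, dbeta(1, 1) is the uniform Beta and dbern is the Bernoulli:

```r
model_string_simple <- "
model {
  pi ~ dbeta(1, 1)
  for (i in 1:length(correct)) {
    correct[i] ~ dbern(pi)
  }
}
"
```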
The next step is to initialize variables named num_samples_to_keep, num_samples_to_toss, and num_chains. For now, keep just 50 samples, fix the number of samples to toss to be the same as the number of samples to keep, and use two Markov chains.
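For example:

```r
num_samples_to_keep <- 50
num_samples_to_toss <- num_samples_to_keep
num_chains <- 2
```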
Now, define the initialization values for the two chains. The first chain should start pi at 0.01 and the second at 0.99. The first chain should use a random number seed of 13 and the second 1871. Both chains should use the Mersenne Twister random number generator.
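A sketch of the per-chain initial values; in rjags, the generator and seed are set through the special .RNG.name and .RNG.seed entries:

```r
inits_simple <- list(
  list(pi = 0.01, .RNG.name = "base::Mersenne-Twister", .RNG.seed = 13),
  list(pi = 0.99, .RNG.name = "base::Mersenne-Twister", .RNG.seed = 1871)
)
```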
Next, initialize your JAGS model, using the model string, initialization values, and number of chains you’ve defined. You should also link the correct variable in your JAGS model to the correct variable in your data frame.
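A sketch using rjags::jags.model, passing the model string through textConnection:

```r
model_simple <- jags.model(
  textConnection(model_string_simple),
  data = list(correct = dat$correct),
  inits = inits_simple,
  n.chains = num_chains
)
```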
Run burn-in for num_samples_to_toss iterations.
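In rjags, burn-in is just an update() call whose iterations are discarded:

```r
update(model_simple, n.iter = num_samples_to_toss)
```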
Now, get your samples from the model, specifying that pi is the variable for which we are interested in obtaining samples. Store your samples in a variable named mcmc_samples_simple.
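For example, with coda.samples:

```r
mcmc_samples_simple <- coda.samples(
  model_simple,
  variable.names = "pi",
  n.iter = num_samples_to_keep
)
```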
As usual, the first thing you should do with MCMC output is check convergence: run gelman.diag on your samples. Question: Is the Upper C.I. of \(\hat{R}\) for your variable of interest below 1.05, as desired?
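For example:

```r
gelman.diag(mcmc_samples_simple)
```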
Another useful thing to check is how many effectively independent samples from the posterior the MCMC samples contain for a given variable of interest. Check this with the effectiveSize function. If a variable is mixing very well, then practically every MCMC iterate will be an independent sample from the posterior, and thus the number of effective samples per iterate will be near 1. Conversely, if a variable is mixing poorly, this ratio could be very close to zero. Question: Is pi mixing well or poorly in this model? Justify your answer.
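A sketch of the check and the corresponding per-iterate efficiency:

```r
effectiveSize(mcmc_samples_simple)

# efficiency = effective samples / total iterates (summed across chains)
effectiveSize(mcmc_samples_simple) / (num_samples_to_keep * num_chains)
```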
Next, convert your samples object into a matrix with the as.matrix() function. Then, convert that matrix into a data frame with the as.data.frame() function. Now, extract the pi column from that data frame into a vector called samples_vector_simple. Using this vector, compute the mean of the posterior, and compute a 95% central probability interval with the quantile() function.
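A sketch of these conversions and summaries:

```r
samples_df_simple <- as.data.frame(as.matrix(mcmc_samples_simple))
samples_vector_simple <- samples_df_simple$pi

mean(samples_vector_simple)                       # posterior mean
quantile(samples_vector_simple, c(0.025, 0.975))  # 95% central interval
```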
Part 2

The generative model used in part 1 made the assumption that the correct
responses were all independently generated. Because these data were produced by 17 participants, that isn’t a good assumption. For example, one participant’s overall probability of answering correct is likely to be different than another participant’s. In this part, you’ll extend the model from part 1 into one that doesn’t make this incorrect assumption.
To see the extent to which this is the case in the data, first plot bootstrapped confidence intervals on each participant’s probability of answering a trial correctly. This should be as simple as swapping out the constant x-axis from your plot above for participant. Question: How much do the individual participants’ probabilities of answering correct seem to differ from each other?
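A sketch, assuming the participant column is named participant:

```r
ggplot(dat, aes(x = factor(participant), y = correct)) +
  stat_summary(fun.data = mean_cl_boot)
```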
The first step is to create a variable called model_string_hier that describes this model in JAGS. The model should specify that:

- pi_overall \(\sim \mbox{Beta}(1,1)\)
- k \(\sim \mbox{Uniform}(0, 1000)\)
- for each participant i from 1 to num_ppts: pi[i] \(\sim \mbox{Beta}(k\pi, k(1-\pi))\), where \(\pi\) here is pi_overall
- for each trial i from 1 to length(correct): correct[i] ~ dbern(pi[participant[i]])
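One way to write this model string; j indexes participants to keep it distinct from the trial index i, and dunif(0, 1000) encodes the upper bound on k discussed below (the lower bound of 0 is an assumption, since k must be positive):

```r
model_string_hier <- "
model {
  pi_overall ~ dbeta(1, 1)
  k ~ dunif(0, 1000)
  for (j in 1:num_ppts) {
    pi[j] ~ dbeta(k * pi_overall, k * (1 - pi_overall))
  }
  for (i in 1:length(correct)) {
    correct[i] ~ dbern(pi[participant[i]])
  }
}
"
```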
The next step is to set variables for the number of samples to keep to 100, the number of samples to toss to be equal to the number of samples to keep, and the number of chains to 2. Additionally, create a new variable num_ppts, which is equal to 17.
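For example:

```r
num_samples_to_keep <- 100
num_samples_to_toss <- num_samples_to_keep
num_chains <- 2
num_ppts <- 17
```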
Create variables that will be used to initialize the two chains to relatively disparate parts of the parameter space. Make sure to initialize all of the variables, and to set the random number generators and seeds to ensure reproducibility. Note that pi should be initialized to a vector of length num_ppts.
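One possible choice of disparate starting points (the particular values and seeds here are illustrative, not prescribed):

```r
inits_hier <- list(
  list(pi_overall = 0.01, k = 1, pi = rep(0.01, num_ppts),
       .RNG.name = "base::Mersenne-Twister", .RNG.seed = 13),
  list(pi_overall = 0.99, k = 999, pi = rep(0.99, num_ppts),
       .RNG.name = "base::Mersenne-Twister", .RNG.seed = 1871)
)
```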
Initialize your JAGS model, using the model string, initialization values, and number of chains you’ve defined. You should also link the correct and participant variables in your JAGS model to those in your data frame, and the num_ppts variable in your JAGS model to the corresponding variable in R.
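A sketch, assuming participant is coded as integers 1 through 17 (JAGS needs integer indices for pi[participant[i]]):

```r
model_hier <- jags.model(
  textConnection(model_string_hier),
  data = list(correct = dat$correct,
              participant = dat$participant,
              num_ppts = num_ppts),
  inits = inits_hier,
  n.chains = num_chains
)
```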
Run burn-in and obtain your samples, making sure to keep all three unknown variables (pi_overall, k, and pi) as part of your samples.
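For example:

```r
update(model_hier, n.iter = num_samples_to_toss)
mcmc_samples_hier <- coda.samples(
  model_hier,
  variable.names = c("pi_overall", "k", "pi"),
  n.iter = num_samples_to_keep
)
```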
Check whether your model appears to have converged. Questions: Are all \(\hat{R}\) Upper C.I. values already below 1.05? If not, double the number of samples (ensuring that burn-in is also doubled) and repeat until all values are below 1.05. How many samples did it require?
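For example (multivariate = FALSE restricts the output to the per-variable \(\hat{R}\) values, which are what we need here):

```r
gelman.diag(mcmc_samples_hier, multivariate = FALSE)
```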
Now that you have some basis for believing that your model has converged, the next thing to check is whether our upper bound on the prior for k (of 1000) influenced the posterior distribution at all. To do this, first convert your samples object into a data frame (as described above). Then you can use the max() function on the k column of this data frame to find the largest sampled value of k. Questions: What is this value? Is it safe to conclude that our upper bound of 1000 was not influencing the random walk?
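A sketch:

```r
samples_df_hier <- as.data.frame(as.matrix(mcmc_samples_hier))
max(samples_df_hier$k)
```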
As another diagnostic, use the effectiveSize() function on your mcmc object to determine the approximate number of effectively independent samples your MCMC samples contain for each variable. Questions: Which two variables are mixing the slowest? What are their efficiencies (i.e., the ratio of effective samples to actual iterates; remember to sum across all chains when calculating the number of actual iterates)? How does this compare to the efficiency of pi in the simple non-hierarchical model? (You may have already computed this as part of #10.)
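A sketch of the efficiency calculation:

```r
eff_hier <- effectiveSize(mcmc_samples_hier)
eff_hier / (num_samples_to_keep * num_chains)  # efficiency for each variable
```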
Say our goal is to estimate a 95% central probability interval for pi (in part 1) and pi_overall (in part 2), and say that to estimate a good interval, we want to have 100 independent samples in each tail. Questions: How many independent samples from the whole distribution would we need to obtain in order to have 100 in each tail? Given the efficiencies of pi (in part 1) and pi_overall (in part 2), how many actual samples should we obtain for each of them in order to get about that number of independent samples?
Change the number of samples for the simple and hierarchical models now to get as many as determined in problem 21, and confirm that effectiveSize() now shows at least the appropriate number of effective samples for having 100 in each tail. If it does not, increase the number of samples again. Questions: How many samples did it end up taking to get the appropriate number of effectively independent samples in each model? What has happened to the \(\hat{R}\) values in these models?
Now, with these final sets of samples, estimate the means and 95% central intervals of the posterior in each model. Using the bounds, calculate the width of the two posterior intervals. Questions: Which model has a broader posterior distribution? Why would this be the case?
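A sketch of these summaries, assuming samples_vector_simple and samples_df_hier have been rebuilt from the final runs:

```r
mean(samples_vector_simple)       # posterior mean, simple model
mean(samples_df_hier$pi_overall)  # posterior mean, hierarchical model

ci_simple <- quantile(samples_vector_simple, c(0.025, 0.975))
ci_hier   <- quantile(samples_df_hier$pi_overall, c(0.025, 0.975))

unname(diff(ci_simple))  # width of the simple model's 95% interval
unname(diff(ci_hier))    # width of the hierarchical model's 95% interval
```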