This homework assignment has two parts. The first part tests your understanding of some basic concepts in probability theory and is short answer. The second part requires programming in R, and requires turning in a script you’ll create, in addition to answering some short answer questions.
Probability mass. Say we have a probability mass function \(p(x)\) that puts probability on just three outcomes: 1, 3, and 5. We know that \(p(1) = 0.4\), and we know that \(p(3) = p(5)\). What must \(p(3)\) and \(p(5)\) be? Why?
Probability density. Say we have a uniform distribution on the range from 3.0 to 3.1. This distribution, like all uniform distributions, has the same probability density \(p(x)\) at every point \(x\) within its range \([3.0, 3.1]\). What is the probability density \(p(x)\) at each of these points? Why do you know this? Is it a problem that this probability density is greater than 1? Why or why not?
Bernoulli trial distribution. We often say that the Bernoulli trial has a single parameter, the probability of a ‘success’, often denoted \(\pi\). In terms of \(\pi\), what is the probability of a failure? Why?
Mean and variance. Say we have a multinomial trial distribution \(p(x)\) over three outcomes 1, 2, and 4, where \(p(1) = 0.3\), \(p(2) = 0.4\), and \(p(4) = 0.3\). What is the expectation or mean of this distribution? What is the variance? What is the standard deviation? (show how you got these results)
For this part of the homework, you'll be making use of a data file called schilling_ling300.txt, which you can download here. This file was constructed from a freely available dataset of eye movements in reading called the Schilling corpus (Schilling, Rayner, & Chumbley, 1998). This corpus was created when participants read a number of short sentences while their eyes were being tracked. The file I've provided you with describes, for a number of words and for each of 9 subjects, three binary eye movement measures that relate to ongoing language processing: skip (whether the word was skipped), mfix (whether the word received multiple fixations), and fp.reg (whether a first-pass regression was launched from the word).
In general, words that take less time to process are associated with more skipping, and words that take longer to process are associated with more instances of multiple fixations and more regressions. For the purposes of this homework, we’ll be looking at the relationships between these three eye movement measures and one measure of how long it takes to process a word: its length in characters. On average, words that are longer take longer to process, so the expectation is that as length increases, words will be associated with less skipping, more instances of multiple fixations, and more regressions.
If you're working at a command line, the head command is a great way to look at a dataset. If not, just open the file in your favorite text editor, or open it in RStudio with File -> Open File… (Make sure not to accidentally save any changes to the file.)
Create a new R script called hw1.R, which you'll be modifying in the rest of this homework. As you develop this script, remember to add comments beginning with the # character wherever they would help if you were to read the script again in a month or two. The first thing you'll add to the script is a line to read the data file schilling_ling300.txt into R. Before you can do this, remember that you'll need to change the working directory to the directory that the data file is in. To load the data, you'll need to choose the appropriate function in the read_delim() family from readr (i.e., read_delim(), read_csv(), etc.) and pass it the appropriate arguments. Look at the help for this family of functions by typing ?read_delim to see the available options. The most relevant arguments are whether or not the file has a header and what character is used to separate the columns. In case you haven't seen it before, '\t' is a common representation in programming languages for the tab character in a string. Assign the result of this function to a new variable called dat.
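As a sketch, the loading step might look like the following. (This snippet writes a tiny stand-in file so it can be run anywhere; for the homework, you'd point read_delim() at schilling_ling300.txt itself, which I'm assuming here is tab-separated with a header row.)

```r
library(readr)

# Tiny stand-in for the real data file, just to illustrate the pattern:
writeLines(c("subj\twlen\tskip", "1\t3\t0", "1\t4\t1"), "example.txt")

# delim = "\t" says the columns are tab-separated;
# col_names = TRUE (the default) says the first row is a header.
dat <- read_delim("example.txt", delim = "\t")
```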
What kind of object does the variable dat now store? (Hint: think about what type of object the read_delim() family of functions returns.)

Use the summary() function on the new variable dat to summarize the data in each of its columns. Of the vector types we discussed in class (logical, numeric, integer, character, factor), what type is each of the six columns?

Three of the columns in dat (skip, mfix, and fp.reg) are represented as numeric variables that just happen to take only two values (0 and 1), but in actuality represent a logical value that is either true or false. It is often a good idea to convert such variables into formal logical vectors to make this explicit, so that is what you'll do next. Since we want to do this to all three columns, this is a good candidate for creating a function (so that we don't need to copy and paste code).
Create a function called to_logical that takes a single vector argument containing 0s and 1s (and possibly NAs) and returns the corresponding logical vector of TRUEs and FALSEs (and possibly NAs).

At the beginning of the function, check that the input meets the function's assumptions, and emit an error if not. As in class, this can be done by combining an if statement with a stop() function call in the following form:

if (FALSE) {
  stop("Error message")
}
Replace FALSE in the conditional statement above with an appropriate test of whether every element in the vector is 0, 1, or NA. Testing whether each element in the vector is equal to 1 can be done with the == operator, and similarly for 0. Testing whether each element in a vector is NA, however, requires the special function is.na(). Thus, you can create three logical vectors that denote whether each element is 0, 1, or NA, and then combine them with the logical-or operator | to produce one vector that tells whether each element of the input vector is either 0, 1, or NA. Finally, to collapse this vector into a single logical value that specifies whether all of the elements are TRUE, you can use the all() function. Note: you're likely to also want the logical-not operator ! to perform the test you want.

Replace "Error message" with something more informative about the error.

Next, add the body of the function, which converts 1 to TRUE, 0 to FALSE, and leaves NA as NA. For this step we can assume that those are the only three possible inputs (thanks to the check we've already implemented). There are two ways to achieve this goal. One is to use the ifelse() function (see ?ifelse), which checks whether each element meets a certain condition (say, being 1), and if so, specifies one resulting value (say, TRUE), and if not, another resulting value (say, FALSE). (The ifelse() function, like most elementwise functions in R, leaves NA values as NA.) A more direct approach, valid only because of our input check, is simply to use the logical operator ==, which returns TRUEs and FALSEs (and also leaves NA values as NA).

Test your function on the vector c(0, 1, 1, NA, 0, 1) to see whether it returns c(FALSE, TRUE, TRUE, NA, FALSE, TRUE).

Finally, use this function to transform the three binary columns of dat. Using mutate() from the dplyr package, transform the mfix, skip, and fp.reg columns into logical vectors using your new to_logical() function. Make sure to assign the result of the transformation back to the dat variable.
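Putting the pieces together, one possible implementation of to_logical() looks like this (a sketch; yours may differ, e.g., by using ifelse() instead of ==):

```r
# Convert a vector of 0s, 1s, and NAs to TRUE/FALSE/NA.
to_logical <- function(x) {
  # Input check: every element must be 0, 1, or NA.
  ok <- x == 0 | x == 1 | is.na(x)
  if (!all(ok)) {
    stop("to_logical() expects a vector containing only 0s, 1s, and NAs")
  }
  # Because of the check above, comparing to 1 does all the work:
  # 1 becomes TRUE, 0 becomes FALSE, and NA stays NA.
  x == 1
}

to_logical(c(0, 1, 1, NA, 0, 1))
# FALSE TRUE TRUE NA FALSE TRUE
```

With the function in place, the dplyr step would be along the lines of dat <- mutate(dat, skip = to_logical(skip), mfix = to_logical(mfix), fp.reg = to_logical(fp.reg)).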
Next, you'll aggregate the data with dplyr. As mentioned above, for this dataset, the main question of interest is how the eye movement measures of ongoing language processing (skip rate, multiple fixation rate, and regression rate) vary with word difficulty, as assessed by word length. In the next problem, we'll visualize these relationships using ggplot2. However, in behavioral research on multiple subjects it is customary (and very reasonable) not to plot raw data, but to first aggregate the data to produce means for each condition of interest for each subject, and then plot summary statistics of those means. So, in this problem, you'll aggregate the data to produce a new data frame dat_subj containing a single mean for each of the three measures, for each subject, for each possible value of word length.
First, group the data frame by subj and wlen using dplyr's group_by() function. Assign this new grouped data frame to your new variable dat_subj.

Then, chain this to the summarise() function (note the spelling!) to create one "new" variable skip that is the mean of the old skip variable, e.g., summarise(skip = mean(skip)). Because the data is already grouped, this will calculate the mean for each group, which is exactly what we want. And since this is chained to the group_by() call, it is now the result of this summarise() call that is assigned to the new dat_subj variable.

Add means for mfix and fp.reg to the same summarise() call. Note that because the column fp.reg includes NA values, you'll need to pass a second argument to the mean() function, na.rm = TRUE (see ?mean), to denote that you want to remove the NA values prior to averaging (otherwise, the average will also be NA).

Use the head() function to look at the first few rows of the new data frame dat_subj. Verify that the columns skip, mfix, and fp.reg are no longer binary, but are now proportions between 0 and 1.

Using ggplot2, we'll now plot the transformed data frame dat_subj that we created in the previous problem.
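To make the structure concrete, here is a sketch of the aggregation from the previous problem, run on a toy stand-in for dat (same column names as the homework data; the real dat comes from the data file):

```r
library(dplyr)

# Toy stand-in for dat, just to illustrate the pattern:
dat <- tibble(subj   = c(1, 1, 1, 2),
              wlen   = c(3, 3, 4, 3),
              skip   = c(TRUE, FALSE, TRUE, FALSE),
              mfix   = c(FALSE, FALSE, TRUE, TRUE),
              fp.reg = c(NA, TRUE, FALSE, FALSE))

# One mean per subject per word length; na.rm = TRUE drops the NAs
# in fp.reg before averaging.
dat_subj <- dat %>%
  group_by(subj, wlen) %>%
  summarise(skip   = mean(skip),
            mfix   = mean(mfix),
            fp.reg = mean(fp.reg, na.rm = TRUE))
```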
In the ggplot() function, set the data argument to dat_subj and use the aesthetics function aes() to map wlen to the x axis and skip to the y axis. Add geom_point() to get a scatter plot of the subject means. The result gives a vague sense of the range of the data, but isn't yet a very useful visualization.

One problem is that ggplot doesn't know that wlen is not an arbitrary numeric value, but takes discrete levels, i.e., is a factor. To fix this, add to the previous part of the script, which used mutate() to transform skip, etc., into logical values, code to mutate wlen into a factor using the factor() function. Re-run the script and make sure that the x axis now explicitly shows only the possible values of word length in this dataset.

Next, to distinguish the subjects, add to the aes() call color = subj and group = subj. Do this and make sure the result now shows different subjects in different colors.

The color scale still isn't right: ggplot doesn't know that the subj variable takes only 9 possible values (one for each subject), but is assuming that it can take any value in its range (e.g., 1.42), and so it's assigning a gradient of colors to that range. To fix this, add more code to the part of the script that transformed the variables with mutate() to change subj into a factor as well. Re-run the script and make sure that the colors chosen are now more reasonable.

Finally, we'll summarize across subjects with ggplot2. Create a new plot with the same basic aesthetics as before (x = wlen, y = skip), but omitting the bit about subj. Now, replace the geom_point() call, which plotted every subject mean as a scatter-plot point, with stat_summary(fun.data = mean_se). Instead of directly visualizing the data (what the geom did), this function calculates a statistic of the y values (their mean and the standard error of the mean) for each possible value of the x variable (wlen). Each statistic is associated with a default visualization scheme (here a point with a line through it to indicate a range), so this code is all that's needed to see the means and standard errors of the skip rates for each word length in the dataset. Describe the relationship between word length and skip rate that seems apparent.

Repeat this summary plot for the other two measures (mfix and fp.reg), and describe the relationships between word length and each of those two variables that are apparent there.
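The summary plot described above can be sketched as follows (using a toy dat_subj so the snippet stands alone; in the homework you'd use the data frame from the aggregation step, with wlen converted to a factor):

```r
library(ggplot2)

# Toy stand-in for dat_subj: per-subject skip proportions by word length.
dat_subj <- data.frame(wlen = factor(c(3, 3, 4, 4, 5, 5)),
                       skip = c(0.5, 0.6, 0.3, 0.4, 0.1, 0.2))

# stat_summary() computes the mean and standard error of skip at each
# word length and draws a point with a range line (its default geom).
p <- ggplot(dat_subj, aes(x = wlen, y = skip)) +
  stat_summary(fun.data = mean_se)

p  # print to render the plot
```

The analogous plots for mfix and fp.reg only require swapping the y aesthetic.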