Welcome to Lab 0 of Stat 131A!
This lab introduces R, a statistical programming language, and RStudio, the most popular software for interacting with R.
We expect that parts of Lab 0 will be review for some folks, and other parts will be brand new.
For some of you, R and RStudio are likely brand new. You are likely familiar with many of the concepts in the lab, but the syntax is new.
Others may already know R and RStudio, but there may be coding concepts in the lab that are new to you.
Remember, you do not need to know anything about R before taking 131A.
This lab is self-paced.
The first time you read this, you are probably looking at the pretty HTML version of the lab. Later, you will switch to the interactive (.Rmd) version. In the .Rmd version, you will be able to run R code.
Notice the table of contents on the left-hand side of the page.
RStudio is an open-source Integrated Development Environment (IDE) designed specifically for R.
RStudio can be run from your computer, or in the cloud via DataHub.
The course staff will use the DataHub version of RStudio in 131A, but you are free to use the local version if you prefer.
Here’s what RStudio looks like in DataHub:
The files you see in the lower right corner may look a little different, but you should see a directory named 131a-labs-fall-2024.
It’s time to transition to the interactive version of the lab!
Two important notes:
1. After clicking the link below, open the Lab00.Rmd file, which is located inside the lab00 directory, which is inside the 131a-labs-fall-2024 directory.
2. After Step 1, you should notice the Source and Visual buttons at the top left of the screen. Click Visual to make the lab easier to read!
❗❗❗Did you read the two instructions above? If not, do so before clicking the link below.❗❗❗
At this point in Lab 0, we will assume you are working from RStudio in DataHub, and not looking at the HTML version of the lab.
The short block of code below is called a code chunk or code cell. RMarkdown notebooks, like this lab, function very similarly to Jupyter notebooks.
To run the code cell, press the green ▶️ button on the right side of the code cell.
a = 1:5
a
## [1] 1 2 3 4 5
The output should be [1] 1 2 3 4 5.
1:5 produces a vector of the integers 1 through 5.
For now, ignore the [1] on the left of the output. More on R syntax shortly!
To add a new code cell, press the button towards the top right of the RStudio window.
Try adding a new R code chunk!
R syntax 🤖
Now that you’re familiar with the basics of RMarkdown notebooks, let’s dive into R syntax.
The basics of variable assignment are quite similar to Python.
However, in R, the assignment operator <- can be used for variable assignment, in addition to =.
In general, we recommend sticking to = for your own code, though you may see <- in other people’s code.
val = 3
print(val)
## [1] 3
new_val <- 7
print(new_val)
## [1] 7
val-new_val
## [1] -4
Note that the last line is printed, even though we never explicitly called print().
R will print non-assignment lines of code chunks by default.
# four function calculator
2 + 3
## [1] 5
2 - 3
## [1] -1
2 * 3
## [1] 6
2 / 3
## [1] 0.6666667
# exponentiation
3^4
## [1] 81
# square root
sqrt(16)
## [1] 4
# logarithm with base e
log(10)
## [1] 2.302585
# exponential
exp(2)
## [1] 7.389056
# absolute value
abs(-2)
## [1] 2
R has data types similar to those in Python: numeric values, integer values, characters (i.e., strings), and logicals (i.e., booleans, TRUE/FALSE).
Python Users: In R, the boolean values are TRUE and FALSE (all caps), while in Python they would be True and False.
In R, we can often test the data type with is.XXXX
is.character("my name is")
## [1] TRUE
is.numeric(5)
## [1] TRUE
is.logical(TRUE)
## [1] TRUE
We can cast to new data types with as.XXXX
as.character(10)
## [1] "10"
as.numeric('100')
## [1] 100
as.logical(1)
## [1] TRUE
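As a small illustration of the data types listed above, class() reports the type of a single value. This is only a sketch: class() and the L suffix for writing an integer are not used elsewhere in this lab and are introduced here purely for illustration.

```r
# class() reports the data type of a value
class(5)        # "numeric"
class(5L)       # "integer": the L suffix marks an integer
class("hello")  # "character"
class(TRUE)     # "logical"
```

Notice that a plain number like 5 is stored as numeric, not integer, unless you ask for an integer explicitly.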
The basic numerical comparisons are <, <=, >, >=.
We use & for AND, | for OR, and ! for NEGATE.
Python Users: In Python, these are and, or, and not.
1 > 0
## [1] TRUE
TRUE | FALSE
## [1] TRUE
TRUE & FALSE
## [1] FALSE
! TRUE
## [1] FALSE
Vectors store a series of values of the same data type.
Vectors can be created with c(), which stands for concatenate.
a = c(0.125, 4.75, -1.3)
a
## [1] 0.125 4.750 -1.300
What’s the [1] on the left?
100:150
## [1] 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118
## [20] 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137
## [39] 138 139 140 141 142 143 144 145 146 147 148 149 150
The bracketed numbers provide the index of the value immediately to the right.
[1] means that 100 is the 1st number in the sequence.
Important note: Unlike Python, which is zero-indexed, R is one-indexed. In other words, the first element of an R vector (or any other object) is indexed with [1].
We can also use c() to combine existing vectors:
b = c(0, 1, -1)
b
## [1] 0 1 -1
new_vec = c(a, b)
new_vec
## [1] 0.125 4.750 -1.300 0.000 1.000 -1.000
All elements of a vector must have the same type.
To store values of different types together, use list().
list(TRUE, 1, 'hello')
## [[1]]
## [1] TRUE
##
## [[2]]
## [1] 1
##
## [[3]]
## [1] "hello"
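To see why a list is needed for mixed types, it may help to look at what c() does when it is given values of different types. The sketch below is an added illustration (the variable name mixed is made up for this example): rather than raising an error, c() converts everything to one common type.

```r
# c() silently converts mixed values to a single common type
mixed = c(TRUE, 1, 'hello')
mixed          # all three values are now character strings
class(mixed)   # "character"

# logicals mixed with numbers become numbers (TRUE -> 1, FALSE -> 0)
c(TRUE, FALSE, 5)
```

If you need each value to keep its own type, use list() instead.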
seq 🛤
The function seq generates sequences of numbers that follow a regular pattern.
seq1 = seq(from=4, to=9, by=1)
seq1
## [1] 4 5 6 7 8 9
You can also write seq(from=a, to=b, by=1) as a:b
seq2 = 1:6
seq2
## [1] 1 2 3 4 5 6
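The examples above all step by 1. As a brief sketch of other patterns seq can produce, the chunk below changes the step size with by, and uses the length.out argument (not shown elsewhere in this lab) to ask for a fixed number of evenly spaced values. The variable names are made up for this example.

```r
# step by 2 instead of 1
evens = seq(from=2, to=10, by=2)
evens       # 2 4 6 8 10

# ask for 5 evenly spaced values between 0 and 1
fractions = seq(from=0, to=1, length.out=5)
fractions   # 0.00 0.25 0.50 0.75 1.00
```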
To see the documentation of a function, put ? in front of it.
?seq
In R, mathematical operations on vectors are usually done element-wise, meaning the operation is done on each element of the vector.
Let \(x=(x_1,x_2,\ldots,x_n)\) and \(y=(y_1,y_2,\ldots,y_n)\).
Then x*y will return the vector \((x_1 y_1, x_2 y_2,\ldots,x_n y_n)\).
vec1 = c(1,2,3)
vec2 = c(3,4,5)
vec1 * vec2
## [1] 3 8 15
This vectorization works with other functions, too:
vec1 + vec2
## [1] 4 6 8
vec1 / vec2
## [1] 0.3333333 0.5000000 0.6000000
vec1 > 2
## [1] FALSE FALSE TRUE
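Element-wise behavior also applies when one side is a single number, and to many built-in functions. The chunk below is a small added sketch of both cases, reusing the values of vec1 from above.

```r
vec1 = c(1,2,3)

# a single number is applied to every element
vec1 * 10     # 10 20 30

# many built-in functions work element-wise as well
sqrt(vec1)
vec1^2        # 1 4 9
```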
Python Users: Python and R indexing are different. Pay extra attention to this section!
You can index elements of a vector with [ ].
vector1 <- 11:20
# the first element
vector1[1]
## [1] 11
# the tenth element
vector1[10]
## [1] 20
A negative index will exclude the indicated element.
vector1
## [1] 11 12 13 14 15 16 17 18 19 20
# notice the 11 is now missing!
vector1[-1]
## [1] 12 13 14 15 16 17 18 19 20
# notice the 11 and 20 are missing!
vector1[-c(1,10)]
## [1] 12 13 14 15 16 17 18 19
Python Users: Python uses negative indexing differently. For example, in Python, v[-1] refers to the last element of v.
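Since a negative index does not mean "count from the end" in R, here is one possible way to grab the last element instead. This is a sketch rather than the only approach; it uses length(), which gives the number of elements in a vector, and tail(), which is not otherwise used in this lab.

```r
vector1 <- 11:20

# the last element, using the vector's length as the index
vector1[length(vector1)]   # 20

# tail() returns the final n elements
tail(vector1, n = 1)       # 20
```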
You can grab elements of multiple indices by passing a vector of indices.
vector1[3:6]
## [1] 13 14 15 16
Python Users: When subsetting a vector using 1:10, R returns the 1st through 10th elements. In Python, v[1:10] would return the 2nd through 10th elements, because Python is zero-indexed and excludes the endpoint.
You can also subset using logicals.
vec = 1:10
# prints out 1 to 10
vec
## [1] 1 2 3 4 5 6 7 8 9 10
# shows which values are greater than 5
vec > 5
## [1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE
# grab only the values greater than 5
vec[vec > 5]
## [1] 6 7 8 9 10
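Logical subsetting can be combined with the & and | operators introduced earlier. The following chunk is a small added sketch that keeps values satisfying more than one condition.

```r
vec = 1:10

# keep values that are greater than 3 AND at most 7
vec[vec > 3 & vec <= 7]    # 4 5 6 7

# keep values that are less than 3 OR greater than 8
vec[vec < 3 | vec > 8]     # 1 2 9 10
```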
You can also give a name to each entry of your vector:
vec = c(a = 1, b = 2, c = 3)
vec
## a b c
## 1 2 3
You can index by name.
vec["a"]
## a
## 1
You can grab the names of a vector with names().
names(vec)
## [1] "a" "b" "c"
length() provides the length of a vector.
length(11:20)
## [1] 10
There are lots of built-in summary statistics in R, such as max, min, mean, median, and sum.
vec = c(1,2,3,4,100)
min(vec)
## [1] 1
max(vec)
## [1] 100
mean(vec)
## [1] 22
median(vec)
## [1] 3
sum(vec)
## [1] 110
We can randomly sample from a vector with sample().
For example, we can randomly sample 5 days out of the year, which has 366 possible days (counting February 29).
sample(x = 1:366, size = 5)
## [1] 187 247 244 156 200
If we set a random seed, the random sampling will always give the same result.
Re-run the code cell below several times to see how the 5 chosen days no longer change.
Setting a seed is useful for reproducing analyses across different computers.
set.seed(131)
sample(x = 1:366, size = 5)
## [1] 218 276 52 171 197
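One way to convince yourself that the seed controls the result is to set the same seed twice and compare the two draws. The sketch below does this; the names first_draw and second_draw are made up for this example.

```r
# draw once after setting the seed
set.seed(131)
first_draw = sample(x = 1:366, size = 5)

# reset the same seed and draw again
set.seed(131)
second_draw = sample(x = 1:366, size = 5)

# TRUE: both draws contain the same 5 days in the same order
identical(first_draw, second_draw)
```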
By default, sample() chooses without replacement.
We can sample with replacement by setting the argument replace = TRUE.
sample(1:3, size = 10, replace = TRUE)
## [1] 1 2 2 3 3 1 3 2 1 1
Try running the code above with replace = FALSE. Why does it give an error?
Here’s how to write a function that calculates the length of the hypotenuse of a right triangle whose other two sides have lengths a and b:
get_hypotenuse = function(a, b) {
  c = sqrt(a^2 + b^2)
  return(c)
}
# prints the function definition
get_hypotenuse
## function(a, b) {
##   c = sqrt(a^2 + b^2)
##   return(c)
## }
# prints the hypotenuse of a triangle with sides 3 and 4
get_hypotenuse(a = 3, b = 4)
## [1] 5
# prints the hypotenuse of a triangle with sides 5 and 12
get_hypotenuse(a = 5, b = 12)
## [1] 13
Python Users: Unlike Python, R does not care about indentation. But, for readability, it is good practice to indent.
Here’s how to modify the hypotenuse function so that it prints a message if the side lengths are invalid.
NA is a special value in R that represents a placeholder for a missing value.
NULL is a special value in R that represents a state of emptiness.
In 131A, we will generally avoid using NULL, but it may show up while you code.
get_hypotenuse = function(a, b) {
  # check the type first, so the comparisons below only see numbers
  if (! is.numeric(a) | ! is.numeric(b)){
    print("Side lengths must be numeric.")
    return(NA)
  }
  # side lengths must be strictly positive
  if (a <= 0 | b <= 0){
    print("Invalid side lengths.")
    return(NA)
  }
  c = sqrt(a^2 + b^2)
  return(c)
}
get_hypotenuse(a=3, b=4)
## [1] 5
get_hypotenuse(a=0, b=-1)
## [1] "Invalid side lengths."
## [1] NA
get_hypotenuse(a="3", b="4")
## [1] "Side lengths must be numeric."
## [1] NA
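The difference between NA and NULL described above can be seen by placing each inside a vector. This is a minimal added sketch; is.null() is not used elsewhere in this lab.

```r
# NA holds a place in a vector; NULL contributes nothing
length(c(1, NA, 3))     # 3
length(c(1, NULL, 3))   # 2

# is.null() tests for NULL specifically
is.null(NULL)   # TRUE
is.null(NA)     # FALSE
```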
NA is short for “Not Available”. In R, NA is used to represent missing values.
NA is contagious: most operations involving NA result in NA.
NA + 1
## [1] NA
mean(c(NA, 1, 2, 3))
## [1] NA
NA & TRUE
## [1] NA
NA == NA
## [1] NA
You can check if a value is NA using the is.na() function.
is.na(c(1, 2, NA))
## [1] FALSE FALSE TRUE
Many functions have arguments to exclude NA values.
mean(c(NA, 1, 2, 3))
## [1] NA
mean(c(NA, 1, 2, 3), na.rm = TRUE)
## [1] 2
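Beyond na.rm, is.na() can be combined with tools from earlier in the lab to count or drop missing values yourself. The chunk below is an added sketch; the vector name values is made up for this example.

```r
values = c(NA, 1, 2, NA, 3)

# TRUE counts as 1, so sum() counts the missing entries
sum(is.na(values))        # 2

# keep only the entries that are not missing
values[!is.na(values)]    # 1 2 3
```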
Why does the code below return TRUE even though there’s an NA?
NA | TRUE
## [1] TRUE
Here is how to write a for-loop in R to iterate over a set of values:
for (animal in c('cat', 'dog', 'rabbit')){
  print(animal)
}
## [1] "cat"
## [1] "dog"
## [1] "rabbit"
Here’s how to use a for-loop to add all the numbers from 1 to 100:
running_sum = 0
# loop over the integers 1-100:
for (i in 1:100){
  running_sum = running_sum + i
}
print(running_sum)
## [1] 5050
# Double checking that we get the same answer!
sum(1:100)
## [1] 5050
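A for-loop can also fill in a vector one entry at a time, instead of updating a single running total. The sketch below assumes a vector named squares (made up for this example) and uses numeric(5), which creates a vector of five zeros.

```r
# create a numeric vector of five zeros to hold the results
squares = numeric(5)

for (i in 1:5){
  squares[i] = i^2
}
squares   # 1 4 9 16 25
```

Creating the vector at its full length before the loop means each iteration only has to fill in one slot.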
Claims of voter fraud are widespread. Here’s one example:
“Probably over a million people voted twice in [the 2012 presidential] election.”
Dick Morris, in 2014 on Fox News
Voter fraud can take place in a number of ways, including tampering with voting machines 📠, destroying ballots 🗳️, and impersonating voters 🤖. Today, though, we will explore double voting, which occurs when a single person illegally casts more than one vote in an election.
To start, consider this fact:
In the 2012 election, there were 141 individuals named “John Smith” who were born in 1970, and 27 pairs of those individuals had exactly the same birthday.
Were there 27 fraudulent “John Smith” ballots in the 2012 election? Let’s find out.
The code below defines another function, num_pairs.
Run the code chunk below to define the num_pairs function!
# don't worry at all about studying/understanding this specific function!
# But, if you're looking for an extension problem, try to figure out how this function works.
num_pairs <- function(v) {
  duplicated_indices <- duplicated(v)
  duplicated_values <- v[duplicated_indices]
  duplicated_counts <- table(duplicated_values) + 1
  sum(choose(duplicated_counts, 2))
}
num_pairs returns the number of pairs that can be formed with duplicated values.
# There are 365 days in a standard year! So we represent 12/30 with 364 and 12/31 with 365.
num_pairs(c(364, 364, 365))
## [1] 1
If I had two votes from “John Smiths” born on December 30th, and two votes from “John Smiths” born on December 31st, I could make two pairs of potential double votes.
num_pairs(c(364, 364, 365, 365))
## [1] 2
If I had four votes from “John Smiths” born on December 30th, I could make six pairs of potential double votes.
num_pairs(c(364, 364, 364, 364))
## [1] 6
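If you want to double check the three counts above, one option is to compare every pair of entries directly. The function below, count_pairs_brute, is a hypothetical helper written only for this check; it is not part of the lab and is not how num_pairs works internally.

```r
# hypothetical helper (not part of the lab): count matching pairs directly
count_pairs_brute <- function(v) {
  # equal[i, j] is TRUE when v[i] and v[j] are the same value
  equal <- outer(v, v, "==")
  # the upper triangle holds each pair (i < j) exactly once
  sum(equal[upper.tri(equal)])
}

count_pairs_brute(c(364, 364, 365))        # 1
count_pairs_brute(c(364, 364, 365, 365))   # 2
count_pairs_brute(c(364, 364, 364, 364))   # 6
```

This direct comparison is easy to reason about, but it builds a table with one cell for every pair of entries, so it is best kept for small checks like these.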
Think back to the John Smith example:
In the 2012 election, there were 141 individuals named “John Smith” who were born in 1970. From those 141 individuals, we can make 27 pairs with exactly the same birthday. Are these double votes, or would we expect to see this many pairs by chance?
Generate a vector of 141 random birthdays using a:b and sample().
Then, use num_pairs() to determine how many duplicate pairs can be formed from elements in the vector.
Run the code chunk repeatedly to see how the results can change due to randomness.
# Your code here!
Use a for-loop to repeat the exercise above 10,000 times.
Keep track of the total number of duplicate pairs across all 10,000 iterations.
At the end, divide the total number of duplicate pairs by 10,000 to get the average number of duplicate birthday pairs you would expect to see in a group of 141 “John Smiths” born in 1970.
Are you surprised by the result?
# Your code here!
Fun fact: The method you implemented above was used, in part, to explain approximately three million so-called double votes in the 2012 election: You can read more here!