Example 01: Up and Running with R
We will be working with the most recent stable versions of R and RStudio, as well as with a number of additional packages. You will need to install R, RStudio, and the necessary packages on your own computer.
1. Install R on your computer
Begin by installing R (http://cloud.r-project.org). Choose the version appropriate for your computing platform:
- If you use macOS with an Apple Silicon processor (i.e. an M1, M2, or M3 chip), then install R for macOS’s Apple Silicon build. This version does not work on older, Intel-based Macs.
- If you use macOS with an Intel processor, then install R for macOS’s Intel build.
- If you use Microsoft Windows, then install R for Windows.
- If you use Linux, choose a distribution and install it.
2. Install RStudio on your computer
- Follow this link and download RStudio Desktop for your computer. You will have already completed Step 1.
3. Installing some additional packages
- Once R and RStudio are installed, launch RStudio. Either carefully type in or copy-and-paste the following lines of code at R’s command prompt, located in the RStudio window named “Console”, and then hit return.
course_packages <- c("tidyverse", "babynames", "broom",
                     "gapminder", "here", "janitor", "naniar",
                     "palmerpenguins", "skimr", "slider", "socviz",
                     "usethis", "visdat", "reprex", "remotes")
install.packages(course_packages, repos = "http://cran.rstudio.com")
data_packages <- c("kjhealy/covdata", "kjhealy/congress", "kjhealy/nycdogs",
                   "kjhealy/ukelection2019", "kjhealy/uscenpops")
remotes::install_github(data_packages)
Installing these packages may take a little time. Once you have completed this step, you’ll be ready to begin.
4. Examples from the slides
library(tidyverse)
Arithmetic:
(31 * 12) / 2^4
[1] 23.25
sqrt(25)
[1] 5
log(100)
[1] 4.60517
log10(100)
[1] 2
Logic:
4 < 10
[1] TRUE
4 > 2 & 1 > 0.5 # The "&" means "and"
[1] TRUE
4 < 2 | 1 > 0.5 # The "|" means "or"
[1] TRUE
4 < 2 | 1 < 0.5
[1] FALSE
## A logical test
2 == 2 # Write `=` twice
[1] TRUE
## This will cause an error, because R will think you are trying to assign a value
2 = 2
## Error in 2 = 2 : invalid (do_set) left-hand side to assignment
3 != 7 # Write `!` and then `=` to make `!=`
[1] TRUE
Take care:
3 < 5 & 7
[1] TRUE
But now try 3 < 5 & 1, where your intention is “Three is less than five and also less than one [True or False?]”
3 < 5 & 1
[1] TRUE
Instead:
3 < 5 & 3 < 1
[1] FALSE
You have to make your comparisons explicit.
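One way to see why 3 < 5 & 1 comes out TRUE: comparison operators are evaluated before &, so R reads the expression as (3 < 5) & 1, and a non-zero number such as 1 is treated as TRUE when coerced to a logical value. A minimal sketch of that reasoning (the object names here are only for illustration):

```r
# Comparison operators are evaluated before `&`,
# so this is read as (3 < 5) & 1
implicit <- 3 < 5 & 1

# Any non-zero number counts as TRUE when coerced to logical
one_as_logical <- as.logical(1)

# Spelling out both comparisons gives the intended test
explicit <- 3 < 5 & 3 < 1

implicit        # TRUE
one_as_logical  # TRUE
explicit        # FALSE
```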
Objects:
## We made this before
my_numbers <- c(1, 1, 2, 4, 1, 3, 1, 5)
my_numbers
[1] 1 1 2 4 1 3 1 5
letters # This one is built-in
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
[20] "t" "u" "v" "w" "x" "y" "z"
LETTERS # Different!
[1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S"
[20] "T" "U" "V" "W" "X" "Y" "Z"
pi # Also built-in
[1] 3.141593
Functions are objects too.
mean
function (x, ...)
UseMethod("mean")
<bytecode: 0x1094e1ed0>
<environment: namespace:base>
Assignment:
## name... gets ... this stuff
my_numbers <- c(1, 2, 3, 1, 3, 5, 25, 10)
## name ... gets ... the output of the function `c()`
your_numbers <- c(5, 31, 71, 1, 3, 21, 6, 52)
Assignment with equals:
my_numbers = c(1, 2, 3, 1, 3, 5, 25)
my_numbers
[1] 1 2 3 1 3 5 25
On the other hand, = has a different meaning when used in functions.
I’m going to use <- for assignment throughout.
Just be consistent either way.
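To make the distinction concrete, here is a small sketch. Inside a function call, = pairs a value with one of the function's argument names; at the top level, = and <- both create objects. The object names (scores, avg, a, b) are made up for this illustration.

```r
scores <- c(4, 8, NA, 15)

# Inside a call, `=` matches values to the function's argument names
avg <- mean(x = scores, na.rm = TRUE)
avg   # 9

# At the top level, `=` and `<-` both assign
a = 10
b <- 10
identical(a, b)   # TRUE
```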
Special operators
For example, matrix multiplication is %*%
x <- matrix(c(2,3,3,4,1,8), ncol = 2)
x
     [,1] [,2]
[1,]    2    4
[2,]    3    1
[3,]    3    8
y <- matrix(c(1,2,3,4), nrow = 2)
y
     [,1] [,2]
[1,]    1    3
[2,]    2    4
x %*% y
     [,1] [,2]
[1,]   10   22
[2,]    5   13
[3,]   19   41
Why %*%? In R the notation %<SOMETHING>% is used for some operators, including custom operators.
But the thing in between the % % can be lots of things. E.g.,
x <- letters[1:10]
y <- letters[5:15]
x
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"
y
[1] "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o"
x %in% y
[1] FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE
And we can define our own, too
## Need to refer to the operator in a special way with backticks
`%nin%` <- Negate(`%in%`)
# Now we have "not in"
x %nin% y
[1] TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
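The %nin% example builds a new operator out of an existing one. You can also write one from scratch as an ordinary two-argument function. The sketch below defines a hypothetical %join% operator (not part of any package, just an illustration) that pastes two strings together.

```r
# A hypothetical operator that pastes two strings together with a space.
# The backticks are needed when defining it, just as with `%nin%`.
`%join%` <- function(lhs, rhs) {
  paste(lhs, rhs)
}

greeting <- "Up and" %join% "Running"
greeting   # "Up and Running"
```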
Packages
Packages are loaded into your working environment using the library() function:
## A package containing a dataset rather than functions
library(gapminder)
gapminder
# A tibble: 1,704 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779.
2 Afghanistan Asia 1957 30.3 9240934 821.
3 Afghanistan Asia 1962 32.0 10267083 853.
4 Afghanistan Asia 1967 34.0 11537966 836.
5 Afghanistan Asia 1972 36.1 13079460 740.
6 Afghanistan Asia 1977 38.4 14880372 786.
7 Afghanistan Asia 1982 39.9 12881816 978.
8 Afghanistan Asia 1987 40.8 13867957 852.
9 Afghanistan Asia 1992 41.7 16317921 649.
10 Afghanistan Asia 1997 41.8 22227415 635.
# ℹ 1,694 more rows
You need only install a package once (and occasionally update it):
## Do at least once for each package. Once done, not needed each time.
install.packages("palmerpenguins", repos = "http://cran.rstudio.com")
## Needed sometimes, especially after an R major version upgrade.
update.packages(repos = "http://cran.rstudio.com")
But you must load the package in each R session before you can access its contents:
## To load a package, usually at the start of your RMarkdown document or script file
library(palmerpenguins)
penguins
# A tibble: 344 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
7 Adelie Torgersen 38.9 17.8 181 3625
8 Adelie Torgersen 39.2 19.6 195 4675
9 Adelie Torgersen 34.1 18.1 193 3475
10 Adelie Torgersen 42 20.2 190 4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>
You can “reach in” to an unloaded package and grab a function directly, using <package>::<function>
## A little glimpse of what we'll do soon
penguins |>
  select(species, body_mass_g, sex) |>
  gtsummary::tbl_summary(by = species)
Characteristic | Adelie N = 152 | Chinstrap N = 68 | Gentoo N = 124 |
---|---|---|---|
body_mass_g, Median (IQR) | 3,700 (3,350 – 4,000) | 3,700 (3,475 – 3,950) | 5,000 (4,700 – 5,500) |
Unknown | 1 | 0 | 1 |
sex, n (%) | | | |
female | 73 (50) | 34 (50) | 58 (49) |
male | 73 (50) | 34 (50) | 61 (51) |
Unknown | 6 | 0 | 5 |
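The same <package>::<function> pattern works with any installed package. As a small sketch that needs nothing beyond a standard R installation, the tools package ships with R but is not normally attached with library(), so its functions are a convenient way to try the prefix. The file name below is made up.

```r
# Call a function from `tools` without loading the package first
ext <- tools::file_ext("analysis.qmd")
ext   # "qmd"

# The prefix also works for packages that are already attached
stats::median(c(1, 3, 10))   # 3
```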
The scope of names
x <- c(1:10)
y <- c(90:100)
x
[1] 1 2 3 4 5 6 7 8 9 10
y
[1] 90 91 92 93 94 95 96 97 98 99 100
mean()
## Error in mean.default() : argument "x" is missing, with no default
mean(x) # argument names are internal to functions
[1] 5.5
mean(x = x)
[1] 5.5
mean(x = y)
[1] 95
x
[1] 1 2 3 4 5 6 7 8 9 10
y
[1] 90 91 92 93 94 95 96 97 98 99 100
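The same point holds for functions you write yourself. In the sketch below, add_one() is a hypothetical function invented for this illustration. Its argument happens to be called x, but changing x inside the function leaves the x in your workspace untouched.

```r
x <- c(1:10)

# The argument is also called `x`, but that name lives inside the function
add_one <- function(x) {
  x <- x + 1   # changes the function's local copy only
  x
}

shifted <- add_one(x = x)

shifted   # 2 3 4 5 6 7 8 9 10 11
x         # still 1 2 3 4 5 6 7 8 9 10
```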
Types and Classes
The object inspector in RStudio is your friend.
You can ask an object what it is at the console, too:
class(my_numbers)
[1] "numeric"
typeof(my_numbers)
[1] "double"
Objects can have more than one (nested) class:
summary(my_numbers)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.000 1.500 3.000 5.714 4.000 25.000
my_smry <- summary(my_numbers) # remember, outputs can be assigned to a name, creating an object
class(summary(my_numbers)) # functions can be nested, and are evaluated from the inside out
[1] "summaryDefault" "table"
class(my_smry) # equivalent to the previous line
[1] "summaryDefault" "table"
typeof(my_smry)
[1] "double"
attributes(my_smry)
$names
[1] "Min." "1st Qu." "Median" "Mean" "3rd Qu." "Max."
$class
[1] "summaryDefault" "table"
## In this case, the functions extract the corresponding attribute
class(my_smry)
[1] "summaryDefault" "table"
names(my_smry)
[1] "Min." "1st Qu." "Median" "Mean" "3rd Qu." "Max."
Kinds of vector; and coercion
my_int <- c(1, 3, 5, 6, 10)
is.integer(my_int)
[1] FALSE
is.double(my_int)
[1] TRUE
my_int <- as.integer(my_int)
is.integer(my_int)
[1] TRUE
my_chr <- c("Mary", "had", "a", "little", "lamb")
is.character(my_chr)
[1] TRUE
my_lgl <- c(TRUE, FALSE, TRUE)
is.logical(my_lgl)
[1] TRUE
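As the my_int example shows, numbers typed without a decimal point are still stored as doubles. If you want integers from the start, one option is the L suffix. A brief sketch (int_vec is just an illustrative name):

```r
# Plain numbers are stored as doubles
typeof(c(1, 3, 5))      # "double"

# An `L` suffix asks for integers directly
int_vec <- c(1L, 3L, 5L)
typeof(int_vec)         # "integer"

# The colon operator also produces integers
typeof(1:3)             # "integer"
```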
Factors:
## Factors are for storing unordered or ordered categorical variables
x <- factor(c("Yes", "No", "No", "Maybe", "Yes", "Yes", "Yes", "No"))
x
[1] Yes No No Maybe Yes Yes Yes No
Levels: Maybe No Yes
summary(x) # Alphabetical order by default
Maybe No Yes
1 3 4
typeof(x) # Underneath, a factor is a type of integer ...
[1] "integer"
attributes(x) # ... with labels for its numbers, or "levels"
$levels
[1] "Maybe" "No" "Yes"
$class
[1] "factor"
levels(x)
[1] "Maybe" "No" "Yes"
is.ordered(x)
[1] FALSE
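The factor above is unordered and its levels are alphabetical. If the categories have a natural order, you can state it when creating the factor. The following is a sketch using the same answers; the names answers and x_ord, and the choice of order, are assumptions made for illustration.

```r
answers <- c("Yes", "No", "No", "Maybe", "Yes", "Yes", "Yes", "No")

# Set the level order explicitly instead of relying on alphabetical order
x_ord <- factor(answers, levels = c("No", "Maybe", "Yes"), ordered = TRUE)

levels(x_ord)         # "No" "Maybe" "Yes"
is.ordered(x_ord)     # TRUE
x_ord[1] > x_ord[2]   # TRUE: "Yes" ranks above "No"
```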
Vector types can’t be heterogeneous. Objects can be manually or automatically coerced from one class to another. Take care.
class(my_numbers)
[1] "numeric"
my_new_vector <- c(my_numbers, "Apple")
my_new_vector # vectors are homogeneous/atomic
[1] "1" "2" "3" "1" "3" "5" "25" "Apple"
class(my_new_vector)
[1] "character"
my_dbl <- c(2.1, 4.77, 30.111, 3.14519)
is.double(my_dbl)
[1] TRUE
my_dbl <- as.integer(my_dbl)
my_dbl
[1] 2 4 30 3
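Notice that 4.77 became 4, not 5. Coercing to integer drops the fractional part rather than rounding. A short sketch contrasting as.integer() with round(), using made-up values:

```r
vals <- c(2.9, -2.9, 30.111)

# as.integer() discards everything after the decimal point
as.integer(vals)   # 2 -2 30

# round() goes to the nearest whole number instead
round(vals)        # 3 -3 30
```

If you want rounding, round first and then convert.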
Lists, data frames, tibbles
A table of data is a kind of list
gapminder # tibbles and data frames can contain vectors of different types
# A tibble: 1,704 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779.
2 Afghanistan Asia 1957 30.3 9240934 821.
3 Afghanistan Asia 1962 32.0 10267083 853.
4 Afghanistan Asia 1967 34.0 11537966 836.
5 Afghanistan Asia 1972 36.1 13079460 740.
6 Afghanistan Asia 1977 38.4 14880372 786.
7 Afghanistan Asia 1982 39.9 12881816 978.
8 Afghanistan Asia 1987 40.8 13867957 852.
9 Afghanistan Asia 1992 41.7 16317921 649.
10 Afghanistan Asia 1997 41.8 22227415 635.
# ℹ 1,694 more rows
class(gapminder)
[1] "tbl_df" "tbl" "data.frame"
typeof(gapminder) # hmm
[1] "list"
- Lists can be heterogeneous, in the sense of containing vectors of different types. Underneath, most complex R objects are some kind of list with different components.
- A data frame is a list of vectors of the same length, where the vectors can be of different types (e.g. numeric, character, logical, etc)
- A tibble is an enhanced data frame
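You can check the "data frame is a list" idea directly. The sketch below builds a small made-up data frame (df and its columns are illustrative, not course data) and treats it as a list of columns.

```r
# A small made-up data frame with columns of different types
df <- data.frame(
  name  = c("a", "b", "c"),
  score = c(1.5, 2.5, 3.5),
  pass  = c(FALSE, TRUE, TRUE)
)

is.list(df)         # TRUE
length(df)          # 3: one list element per column
sapply(df, class)   # each column keeps its own class
df[["score"]]       # list-style extraction of one column
```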
Some classes are versions of others
- Base R’s trusty data.frame
library(socviz)
Attaching package: 'socviz'
The following object is masked _by_ '.GlobalEnv':
%nin%
titanic
fate sex n percent
1 perished male 1364 62.0
2 perished female 126 5.7
3 survived male 367 16.7
4 survived female 344 15.6
class(titanic)
[1] "data.frame"
## The `$` idiom picks out a named column here;
## more generally, the named element of a list
titanic$percent
[1] 62.0 5.7 16.7 15.6
- The Tidyverse’s enhanced tibble
## tibbles are built on data frames
titanic_tb <- as_tibble(titanic)
titanic_tb
# A tibble: 4 × 4
fate sex n percent
<fct> <fct> <dbl> <dbl>
1 perished male 1364 62
2 perished female 126 5.7
3 survived male 367 16.7
4 survived female 344 15.6
class(titanic_tb)
[1] "tbl_df" "tbl" "data.frame"
Recycling rules for vectors
Arithmetic on vectors
In R, all numbers are vectors of different sorts. Even single numbers (“scalars”) are conceptually vectors of length 1.
Arithmetic on vectors (and arrays generally) follows a series of recycling rules that favor ease of expression of vectorized, “elementwise” operations.
See if you can predict what the following operations do:
my_numbers
[1] 1 2 3 1 3 5 25
result1 <- my_numbers + 1
result1
[1] 2 3 4 2 4 6 26
result2 <- my_numbers + my_numbers
result2
[1] 2 4 6 2 6 10 50
two_nums <- c(5, 10)
result3 <- my_numbers + two_nums
Warning in my_numbers + two_nums: longer object length is not a multiple of
shorter object length
result3
[1] 6 12 8 11 8 15 30
three_nums <- c(1, 5, 10)
result4 <- my_numbers + three_nums
Warning in my_numbers + three_nums: longer object length is not a multiple of
shorter object length
result4
[1] 2 7 13 2 8 15 26
Note that you get a warning here. It’ll still do it, though! Don’t ignore warnings until you understand what they mean.
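The warning appears because seven is not a multiple of two or three. When the longer length is an exact multiple of the shorter one, R recycles silently, which is worth knowing because nothing alerts you that it happened. A sketch with a made-up six-element vector (six_nums and result5 are illustrative names):

```r
six_nums <- c(1, 2, 3, 4, 5, 6)
two_nums <- c(5, 10)

# two_nums is recycled three times: 5 10 5 10 5 10
result5 <- six_nums + two_nums
result5   # 6 12 8 14 10 16
```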