Example 01: Up and Running with R

We will be working with the most recent stable versions of R and RStudio, as well as with a number of additional packages. You will need to install R, RStudio, and the necessary packages on your own computer.

1. Install R on your computer

Begin by installing R (http://cloud.r-project.org). Choose the version appropriate for your computing platform.

2. Install RStudio on your computer

3. Install some additional packages

  • Once R and RStudio are installed, launch RStudio. Either carefully type in or copy-and-paste the following lines of code at R’s command prompt, located in the RStudio window named “Console”, and then hit return. To copy this chunk of code, mouse over the code and click the clipboard icon that appears in the top right corner of the chunk.
course_packages <- c("tidyverse", "babynames", "broom",
    "gapminder", "here", "janitor", "naniar", 
    "palmerpenguins", "skimr", "slider", "socviz",
    "usethis", "visdat","reprex", "remotes")

install.packages(course_packages, repos = "http://cran.rstudio.com")

data_packages <- c("kjhealy/covdata", "kjhealy/congress", "kjhealy/nycdogs", 
                   "kjhealy/ukelection2019", "kjhealy/uscenpops")


remotes::install_github(data_packages)

Installing these packages may take a little time. Once you have completed this step, you’ll be ready to begin.
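
If you want to confirm that the installation worked, one way is to compare the names you asked for against what R reports as installed. This is only a sketch: the vector `wanted` is a short stand-in chosen for illustration (two packages that ship with R plus one from the list above), so substitute your own vector of package names.

```r
# Assumption: a short stand-in list; replace with your own package names
wanted <- c("stats", "utils", "tidyverse")

# Names of every package installed in your R libraries
have <- rownames(installed.packages())

# Which of the wanted packages are still missing?
still_missing <- setdiff(wanted, have)
still_missing
```

An empty result, `character(0)`, means nothing in `wanted` is missing.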

4. Examples from the slides

library(tidyverse)

Arithmetic:

(31 * 12) / 2^4
[1] 23.25
sqrt(25)
[1] 5
log(100)
[1] 4.60517
log10(100)
[1] 2

Logic:

4 < 10
[1] TRUE
4 > 2 & 1 > 0.5 # The "&" means "and"
[1] TRUE
4 < 2 | 1 > 0.5 # The "|" means "or"
[1] TRUE
4 < 2 | 1 < 0.5
[1] FALSE
## A logical test
2 == 2 # Write `=` twice
[1] TRUE
## This will cause an error, because R will think you are trying to assign a value
2 = 2

## Error in 2 = 2 : invalid (do_set) left-hand side to assignment
3 != 7 # Write `!` and then `=` to make `!=`
[1] TRUE

Take care:

3 < 5 & 7
[1] TRUE

But now try 3 < 5 & 1, where your intention is “Three is less than five and also less than one [True or False?]”

3 < 5 & 1
[1] TRUE

Instead:

3 < 5 & 3 < 1
[1] FALSE

You have to make your comparisons explicit.
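
Why do the shorthand versions come out TRUE? Here is a minimal sketch of what R does with them: the comparison is evaluated first, and then the bare number on the other side of `&` is coerced to a logical value. The object names are just for illustration.

```r
# `<` is evaluated before `&`, so `3 < 5 & 7` means (3 < 5) & 7
step_one <- 3 < 5             # TRUE
step_two <- as.logical(7)     # any nonzero number is coerced to TRUE
step_one & step_two           # TRUE

# Zero is the only number that is coerced to FALSE
implicit_zero <- 3 < 5 & 0    # FALSE

# Spelling out both comparisons says what we actually mean
explicit <- 3 < 5 & 3 < 7     # TRUE
```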

Objects:

## We made this before
my_numbers <- c(1, 1, 2, 4, 1, 3, 1, 5) 
my_numbers
[1] 1 1 2 4 1 3 1 5
letters  # This one is built-in
 [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
[20] "t" "u" "v" "w" "x" "y" "z"
LETTERS # Different!
 [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S"
[20] "T" "U" "V" "W" "X" "Y" "Z"
pi  # Also built-in
[1] 3.141593

Functions are objects too.

mean
function (x, ...) 
UseMethod("mean")
<bytecode: 0x1094e1ed0>
<environment: namespace:base>

Assignment:

## name... gets ... this stuff
my_numbers <- c(1, 2, 3, 1, 3, 5, 25, 10)

## name ... gets ... the output of the function `c()`
your_numbers <- c(5, 31, 71, 1, 3, 21, 6, 52)

Assignment with equals:

my_numbers = c(1, 2, 3, 1, 3, 5, 25)

my_numbers
[1]  1  2  3  1  3  5 25

On the other hand, = has a different meaning when used in functions.

I’m going to use <- for assignment throughout.

Just be consistent either way.
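
To make the two meanings of `=` concrete, here is a small sketch. The object names `avg`, `demo_values`, and `demo_values_eq` are made up for this example.

```r
# Inside a function call, `=` matches a value to an argument name.
# Here it hands the vector to mean()'s argument `x`.
avg <- mean(x = c(2, 4, 6))
avg   # 4

# At the top level, `<-` and `=` both create an object
demo_values <- c(2, 4, 6)
demo_values_eq = c(2, 4, 6)
identical(demo_values, demo_values_eq)   # TRUE
```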

Special operators

For example, matrix multiplication is %*%

x <- matrix(c(2,3,3,4,1,8), ncol = 2)
x
     [,1] [,2]
[1,]    2    4
[2,]    3    1
[3,]    3    8
y <- matrix(c(1,2,3,4), nrow = 2)
y
     [,1] [,2]
[1,]    1    3
[2,]    2    4
x %*% y
     [,1] [,2]
[1,]   10   22
[2,]    5   13
[3,]   19   41
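
It can help to see `%*%` next to the ordinary `*`, which works element by element. A small sketch with two made-up 2 × 2 matrices:

```r
a <- matrix(c(1, 2, 3, 4), nrow = 2)   # filled column by column
b <- matrix(c(5, 6, 7, 8), nrow = 2)

# Ordinary `*` multiplies matching cells
elementwise <- a * b

# `%*%` does row-by-column matrix multiplication
matrix_prod <- a %*% b

elementwise
matrix_prod
```

The two results differ: the top-left cell is 1 * 5 = 5 in `elementwise`, but 1 * 5 + 3 * 6 = 23 in `matrix_prod`.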

Why %*%? In R the notation %<SOMETHING>% is used for some operators, including custom operators.

But the thing in between the % % can be lots of things. E.g.,

x <- letters[1:10]
y <- letters[5:15]

x
 [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"
y
 [1] "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o"
x %in% y
 [1] FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE

And we can define our own, too:

## Need to refer to the operator in a special way with backticks
`%nin%` <- Negate(`%in%`)

# Now we have "not in"
x %nin% y
 [1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
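
A custom operator is just a function of two arguments with a name wrapped in percent signs. As a sketch, here is a made-up operator, `%joined_to%`, that pastes two things together. The name is hypothetical and not part of any package.

```r
# Define the operator with backticks around its name
`%joined_to%` <- function(lhs, rhs) {
  paste(lhs, rhs)   # paste() separates the pieces with a space by default
}

greeting <- "Hello" %joined_to% "world"
greeting   # "Hello world"

# Like most R functions, it is vectorized
labels <- c("a", "b") %joined_to% "x"
labels     # "a x" "b x"
```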

Packages

Packages are loaded into your working environment using the library() function:

## A package containing a dataset rather than functions
library(gapminder)

gapminder
# A tibble: 1,704 × 6
   country     continent  year lifeExp      pop gdpPercap
   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
 1 Afghanistan Asia       1952    28.8  8425333      779.
 2 Afghanistan Asia       1957    30.3  9240934      821.
 3 Afghanistan Asia       1962    32.0 10267083      853.
 4 Afghanistan Asia       1967    34.0 11537966      836.
 5 Afghanistan Asia       1972    36.1 13079460      740.
 6 Afghanistan Asia       1977    38.4 14880372      786.
 7 Afghanistan Asia       1982    39.9 12881816      978.
 8 Afghanistan Asia       1987    40.8 13867957      852.
 9 Afghanistan Asia       1992    41.7 16317921      649.
10 Afghanistan Asia       1997    41.8 22227415      635.
# ℹ 1,694 more rows

You need only install a package once (and occasionally update it):

## Do at least once for each package. Once done, not needed each time.
install.packages("palmerpenguins", repos = "http://cran.rstudio.com")

## Needed sometimes, especially after an R major version upgrade.
update.packages(repos = "http://cran.rstudio.com")

But you must load the package in each R session before you can access its contents:

## To load a package, usually at the start of your RMarkdown document or script file
library(palmerpenguins)
penguins
# A tibble: 344 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>

You can “reach in” to an unloaded package and grab a function directly, using <package>::<function>

## A little glimpse of what we'll do soon
penguins |> 
  select(species, body_mass_g, sex) |> 
  gtsummary::tbl_summary(by = species) 
Characteristic              Adelie                 Chinstrap              Gentoo
                            N = 152                N = 68                 N = 124
body_mass_g, Median (IQR)   3,700 (3,350 – 4,000)  3,700 (3,475 – 3,950)  5,000 (4,700 – 5,500)
    Unknown                 1                      0                      1
sex, n (%)
    female                  73 (50)                34 (50)                58 (49)
    male                    73 (50)                34 (50)                61 (51)
    Unknown                 6                      0                      5
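
The same `::` idiom works with any installed package. Here is a minimal sketch that needs nothing beyond a standard R installation. It uses `file_ext()` from the `tools` package, which ships with R but is typically not attached when a session starts. The file name is made up.

```r
# Call one function from `tools` without running library(tools)
ext <- tools::file_ext("survey_results.csv")
ext   # "csv"

# The alternative attaches everything the package exports:
# library(tools)
# file_ext("survey_results.csv")
```

Using `::` also makes it clear to a reader which package a function comes from.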

The scope of names

x <- c(1:10)
y <- c(90:100)

x
 [1]  1  2  3  4  5  6  7  8  9 10
y
 [1]  90  91  92  93  94  95  96  97  98  99 100
mean()

## Error in mean.default() : argument "x" is missing, with no default
mean(x) # argument names are internal to functions
[1] 5.5
mean(x = x)
[1] 5.5
mean(x = y)
[1] 95
x
 [1]  1  2  3  4  5  6  7  8  9 10
y
 [1]  90  91  92  93  94  95  96  97  98  99 100
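
The same point holds for functions you write yourself. In this sketch (the function `add_ten()` and the vector `scores` are made up for illustration), the `x` inside the function is the function's own name for whatever it is handed, and changing it there leaves your objects alone.

```r
# `x` here is the function's own argument name
add_ten <- function(x) {
  x <- x + 10   # changes only the copy inside the function
  x
}

scores <- c(1, 2, 3)
shifted <- add_ten(x = scores)

shifted  # 11 12 13
scores   # still 1 2 3
```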

Types and Classes

The object inspector in RStudio is your friend.

You can ask an object what it is at the console, too:

class(my_numbers)
[1] "numeric"
typeof(my_numbers)
[1] "double"

Objects can have more than one (nested) class:

summary(my_numbers)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.000   1.500   3.000   5.714   4.000  25.000 
my_smry <- summary(my_numbers) # remember, outputs can be assigned to a name, creating an object

class(summary(my_numbers)) # functions can be nested, and are evaluated from the inside out
[1] "summaryDefault" "table"         
class(my_smry) # equivalent to the previous line
[1] "summaryDefault" "table"         
typeof(my_smry)
[1] "double"
attributes(my_smry)
$names
[1] "Min."    "1st Qu." "Median"  "Mean"    "3rd Qu." "Max."   

$class
[1] "summaryDefault" "table"         
## In this case, the functions extract the corresponding attribute
class(my_smry)
[1] "summaryDefault" "table"         
names(my_smry)
[1] "Min."    "1st Qu." "Median"  "Mean"    "3rd Qu." "Max."   

Kinds of vector, and coercion

my_int <- c(1, 3, 5, 6, 10)
is.integer(my_int)
[1] FALSE
is.double(my_int)
[1] TRUE
my_int <- as.integer(my_int)
is.integer(my_int)
[1] TRUE
my_chr <- c("Mary", "had", "a", "little", "lamb")
is.character(my_chr)
[1] TRUE
my_lgl <- c(TRUE, FALSE, TRUE)
is.logical(my_lgl)
[1] TRUE
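
Whole numbers typed at the console are stored as doubles unless you say otherwise. A short sketch of two ways to get integers directly, using made-up object names:

```r
whole_dbl <- c(1, 3, 5)      # stored as double
whole_int <- c(1L, 3L, 5L)   # the L suffix asks for integers
from_seq  <- 1:3             # the colon operator also yields integers

typeof(whole_dbl)   # "double"
typeof(whole_int)   # "integer"
typeof(from_seq)    # "integer"

# The values compare as equal even though the storage type differs
all(whole_dbl == whole_int)   # TRUE
```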

Factors:

## Factors are for storing unordered or ordered categorical variables
x <- factor(c("Yes", "No", "No", "Maybe", "Yes", "Yes", "Yes", "No"))
x
[1] Yes   No    No    Maybe Yes   Yes   Yes   No   
Levels: Maybe No Yes
summary(x) # Alphabetical order by default
Maybe    No   Yes 
    1     3     4 
typeof(x)       # Underneath, a factor is a type of integer ...
[1] "integer"
attributes(x)   # ... with labels for its numbers, or "levels" 
$levels
[1] "Maybe" "No"    "Yes"  

$class
[1] "factor"
levels(x)
[1] "Maybe" "No"    "Yes"  
is.ordered(x)
[1] FALSE
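
If alphabetical order is not what you want, you can set the levels yourself, and optionally declare the factor to be ordered. A sketch using the same answers as above, with made-up object names:

```r
answers <- c("Yes", "No", "No", "Maybe", "Yes", "Yes", "Yes", "No")

# Set the level order explicitly and mark the factor as ordered
answers_ord <- factor(answers,
                      levels = c("No", "Maybe", "Yes"),
                      ordered = TRUE)

levels(answers_ord)       # "No" "Maybe" "Yes"
is.ordered(answers_ord)   # TRUE
table(answers_ord)        # counts now follow the level order

# as.integer() shows the underlying integer codes
as.integer(answers_ord)   # 3 1 1 2 3 3 3 1
```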

Vector types can’t be heterogeneous. Objects can be manually or automatically coerced from one class to another. Take care.

class(my_numbers)
[1] "numeric"
my_new_vector <- c(my_numbers, "Apple")

my_new_vector # vectors are homogeneous/atomic
[1] "1"     "2"     "3"     "1"     "3"     "5"     "25"    "Apple"
class(my_new_vector)
[1] "character"
my_dbl <- c(2.1, 4.77, 30.111, 3.14519)
is.double(my_dbl)
[1] TRUE
my_dbl <- as.integer(my_dbl)

my_dbl
[1]  2  4 30  3
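
Notice that `as.integer()` did not round: it simply dropped the fractional part. A small sketch contrasting it with `round()`, using made-up values:

```r
measurements <- c(2.9, -2.9, 30.111)

truncated <- as.integer(measurements)  # drops the fractional part
rounded   <- round(measurements)       # rounds to the nearest whole number

truncated         #  2 -2 30
rounded           #  3 -3 30
typeof(rounded)   # "double": round() does not change the storage type
```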

Lists, data frames, tibbles

A table of data is a kind of list

gapminder # tibbles and data frames can contain vectors of different types
# A tibble: 1,704 × 6
   country     continent  year lifeExp      pop gdpPercap
   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
 1 Afghanistan Asia       1952    28.8  8425333      779.
 2 Afghanistan Asia       1957    30.3  9240934      821.
 3 Afghanistan Asia       1962    32.0 10267083      853.
 4 Afghanistan Asia       1967    34.0 11537966      836.
 5 Afghanistan Asia       1972    36.1 13079460      740.
 6 Afghanistan Asia       1977    38.4 14880372      786.
 7 Afghanistan Asia       1982    39.9 12881816      978.
 8 Afghanistan Asia       1987    40.8 13867957      852.
 9 Afghanistan Asia       1992    41.7 16317921      649.
10 Afghanistan Asia       1997    41.8 22227415      635.
# ℹ 1,694 more rows
class(gapminder)
[1] "tbl_df"     "tbl"        "data.frame"
typeof(gapminder) # hmm
[1] "list"
  • Lists can be heterogeneous, in the sense of containing vectors of different types. Underneath, most complex R objects are some kind of list with different components.
  • A data frame is a list of vectors of the same length, where the vectors can be of different types (e.g. numeric, character, logical, etc.).
  • A tibble is an enhanced data frame
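
You can check the list-like nature of a data frame on a tiny example. The `pets` data below is a made-up toy table, not one of the course datasets.

```r
# Three columns of equal length, each with its own type
pets <- data.frame(
  name = c("Rex", "Tom", "Polly"),
  legs = c(4L, 4L, 2L),
  can_fly = c(FALSE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

typeof(pets)           # "list"
length(pets)           # 3, the number of columns
sapply(pets, typeof)   # one type per column
pets[["legs"]]         # list-style extraction, same as pets$legs
```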

Some classes are versions of others

  • Base R’s trusty data.frame
library(socviz)

Attaching package: 'socviz'
The following object is masked _by_ '.GlobalEnv':

    %nin%
titanic
      fate    sex    n percent
1 perished   male 1364    62.0
2 perished female  126     5.7
3 survived   male  367    16.7
4 survived female  344    15.6
class(titanic)
[1] "data.frame"
## The `$` idiom picks out a named column here; 
## more generally, the named element of a list
titanic$percent  
[1] 62.0  5.7 16.7 15.6

  • The Tidyverse’s enhanced tibble

## tibbles are built on data frames 
titanic_tb <- as_tibble(titanic) 
titanic_tb
# A tibble: 4 × 4
  fate     sex        n percent
  <fct>    <fct>  <dbl>   <dbl>
1 perished male    1364    62  
2 perished female   126     5.7
3 survived male     367    16.7
4 survived female   344    15.6
class(titanic_tb)
[1] "tbl_df"     "tbl"        "data.frame"

Recycling rules for vectors

Arithmetic on vectors

In R, all numbers are vectors of different sorts. Even single numbers (“scalars”) are conceptually vectors of length 1.

Arithmetic on vectors (and arrays generally) follows a series of recycling rules that favor ease of expression of vectorized, “elementwise” operations.

See if you can predict what the following operations do:

my_numbers
[1]  1  2  3  1  3  5 25
result1 <- my_numbers + 1
result1
[1]  2  3  4  2  4  6 26
result2 <- my_numbers + my_numbers
result2
[1]  2  4  6  2  6 10 50
two_nums <- c(5, 10)

result3 <- my_numbers + two_nums
Warning in my_numbers + two_nums: longer object length is not a multiple of
shorter object length
result3
[1]  6 12  8 11  8 15 30
three_nums <- c(1, 5, 10)

result4 <- my_numbers + three_nums
Warning in my_numbers + three_nums: longer object length is not a multiple of
shorter object length
result4
[1]  2  7 13  2  8 15 26

Note that you get a warning here. It’ll still do it, though! Don’t ignore warnings until you understand what they mean.
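
For contrast, here is a sketch of the case where recycling is silent: when the longer length is an exact multiple of the shorter one, R repeats the shorter vector and says nothing. The object names are made up.

```r
six_nums <- c(1, 2, 3, 4, 5, 6)
pair     <- c(10, 20)

# 6 is a multiple of 2, so `pair` is used exactly three times, with no warning
even_fit <- six_nums + pair
even_fit   # 11 22 13 24 15 26

# The same thing with the recycling written out by hand
by_hand <- six_nums + rep_len(pair, length(six_nums))
identical(even_fit, by_hand)   # TRUE
```

The absence of a warning does not mean the result is what you intended, so it is worth checking vector lengths yourself.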