Example 01: How R Thinks

We will be working with the most recent stable versions of R and RStudio, as well as with a number of additional packages. You will need to install R, RStudio, and the necessary packages on your own computer.

1. Install R on your computer

Begin by installing R (http://cloud.r-project.org). Choose the version appropriate for your computing platform.

2. Install RStudio on your computer

3. Install some additional packages

  • Once R and RStudio are installed, launch RStudio. Either carefully type in or copy-and-paste the following lines of code at R’s command prompt, located in the RStudio window named “Console”, and then hit return. To copy this chunk of code, mouse over the code and click the clipboard icon that appears in the top right corner of the chunk.
Code
course_packages <- c("tidyverse", "babynames", "broom",
    "gapminder", "here", "janitor", "naniar", 
    "palmerpenguins", "skimr", "slider", "socviz",
    "usethis", "visdat","reprex", "remotes")

install.packages(course_packages, repos = "http://cran.rstudio.com")

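data_packages <- c("covdata", "congress", "nycdogs", 
                   "ukelection2019", "uscenpops")

## install_github() needs names of the form "user/repo". This assumes the
## data packages live under the kjhealy account on GitHub.
remotes::install_github(paste0("kjhealy/", data_packages))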

Installing these packages may take a little time. Once you have completed this step, you’ll be ready to begin.
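
If you want to confirm that everything installed, one option is to compare the package names against what R reports as installed. This is a minimal sketch using base R's installed.packages(). The name missing_packages is just an illustrative choice, and course_packages is re-created so that the chunk runs on its own.

```r
## Re-create the vector of package names so this chunk is self-contained
course_packages <- c("tidyverse", "babynames", "broom",
    "gapminder", "here", "janitor", "naniar",
    "palmerpenguins", "skimr", "slider", "socviz",
    "usethis", "visdat", "reprex", "remotes")

## installed.packages() returns a matrix with one row per installed package
installed <- rownames(installed.packages())

## Anything listed here still needs to be installed
missing_packages <- setdiff(course_packages, installed)
missing_packages
```

An empty result, character(0), means nothing is missing.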

4. Examples from the slides

Code
library(tidyverse)

Arithmetic:

Code
(31 * 12) / 2^4
[1] 23.25
Code
sqrt(25)
[1] 5
Code
log(100)
[1] 4.60517
Code
log10(100)
[1] 2
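
Note that log() is the natural logarithm, which is why log(100) is not 2. As a small sketch, base R's log() also takes a base argument, and log2() and exp() are available. The object names are only there for illustration.

```r
## log() uses base e unless you say otherwise
log_base10 <- log(100, base = 10)   # same idea as log10(100)
log_base10

log_base2 <- log2(8)                # base-2 logarithm
log_base2

back_again <- exp(log(100))         # exp() undoes the natural log
back_again
```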

Logic:

Code
4 < 10
[1] TRUE
Code
4 > 2 & 1 > 0.5 # The "&" means "and"
[1] TRUE
Code
4 < 2 | 1 > 0.5 # The "|" means "or"
[1] TRUE
Code
4 < 2 | 1 < 0.5
[1] FALSE
Code
## A logical test
2 == 2 # Write `=` twice
[1] TRUE
Code
## This will cause an error, because R will think you are trying to assign a value
2 = 2

## Error in 2 = 2 : invalid (do_set) left-hand side to assignment
Code
3 != 7 # Write `!` and then `=` to make `!=`
[1] TRUE

Take care:

Code
3 < 5 & 7
[1] TRUE

But now try 3 < 5 & 1, where your intention is “Three is less than five and also less than one [True or False?]”

Code
3 < 5 & 1
[1] TRUE

Instead:

Code
3 < 5 & 3 < 1
[1] FALSE

You have to make your comparisons explicit: each side of the & must be a complete comparison in its own right.
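
What happens in 3 < 5 & 1 is that R evaluates the comparison 3 < 5 first, and then treats the bare number 1 as a logical value. Any nonzero number counts as TRUE. A short sketch (the names x and in_range are only for illustration):

```r
## Nonzero numbers are treated as TRUE, zero as FALSE
as.logical(1)
as.logical(7)
as.logical(0)

## So `3 < 5 & 1` is really this
(3 < 5) & as.logical(1)

## To test a value against two bounds, name the value in each comparison
x <- 3
in_range <- x > 1 & x < 5
in_range
```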

Floating point math

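Floating point arithmetic interacts badly with tests of exact equality, because most decimal fractions cannot be represented exactly in binary. Sometimes the comparison works: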

Code
0.6 + 0.2 == 0.8
[1] TRUE

Now let’s try 0.6 + 0.3 == 0.9

Code
0.6 + 0.3 == 0.9
[1] FALSE
Code
print(.1 + .2)
[1] 0.3
Code
print(.1 + .2, digits=18)
[1] 0.300000000000000044
Code
all.equal(.1 + .2, 0.3)
[1] TRUE
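
When you need a TRUE or FALSE answer to “are these two numbers equal?”, a common approach is to compare within a small tolerance. The sketch below uses only base R. The helper name nearly_equal and its default tolerance are assumptions chosen for illustration. Wrapping all.equal() in isTRUE() matters because all.equal() returns a character description of the difference, not FALSE, when the values differ.

```r
a <- 0.6 + 0.3
b <- 0.9

a == b                      # exact comparison
isTRUE(all.equal(a, b))     # comparison within a tolerance

## A hypothetical helper: equal if the absolute difference is tiny
nearly_equal <- function(x, y, tol = 1e-8) {
  abs(x - y) < tol
}

nearly_equal(a, b)
nearly_equal(0.1 + 0.2, 0.3)
nearly_equal(1, 1.1)
```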

Objects:

Code
## We made this before
my_numbers <- c(1, 1, 2, 4, 1, 3, 1, 5) 
my_numbers
[1] 1 1 2 4 1 3 1 5
Code
letters  # This one is built-in
 [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
[20] "t" "u" "v" "w" "x" "y" "z"
Code
LETTERS # Different!
 [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S"
[20] "T" "U" "V" "W" "X" "Y" "Z"
Code
pi  # Also built-in
[1] 3.141593

Functions are objects too.

Code
mean
function (x, ...) 
UseMethod("mean")
<bytecode: 0x114cb8318>
<environment: namespace:base>

Assignment:

Code
## name... gets ... this stuff
my_numbers <- c(1, 2, 3, 1, 3, 5, 25, 10)

## name ... gets ... the output of the function `c()`
your_numbers <- c(5, 31, 71, 1, 3, 21, 6, 52)

Assignment with equals:

Code
my_numbers = c(1, 2, 3, 1, 3, 5, 25)

my_numbers
[1]  1  2  3  1  3  5 25

On the other hand, = has a different meaning when used inside a function call, where it matches a value to a named argument rather than creating an object.

I’m going to use <- for assignment throughout.

Just be consistent either way.
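
Here is a small sketch of the two meanings of =. The function double_it() and the names values_arg and kept_values are hypothetical, made up for this example.

```r
## A hypothetical function with an argument called `values_arg`
double_it <- function(values_arg) {
  values_arg * 2
}

## Inside the call, `=` matches c(1, 2, 3) to the argument `values_arg`
double_it(values_arg = c(1, 2, 3))

## No object of that name was created in your workspace
exists("values_arg")

## Outside a call, `<-` (or `=`) creates an object
kept_values <- c(1, 2, 3)
exists("kept_values")
```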

Special operators

For example, matrix multiplication is %*%

Code
x <- matrix(c(2,3,3,4,1,8), ncol = 2)
x
     [,1] [,2]
[1,]    2    4
[2,]    3    1
[3,]    3    8
Code
y <- matrix(c(1,2,3,4), nrow = 2)
y
     [,1] [,2]
[1,]    1    3
[2,]    2    4
Code
x %*% y
     [,1] [,2]
[1,]   10   22
[2,]    5   13
[3,]   19   41

Why %*%? In R the notation %<SOMETHING>% is used for some operators, including custom operators.

But the thing in between the % % can be lots of things. E.g.,

Code
x <- letters[1:10]
y <- letters[5:15]

x
 [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"
Code
y
 [1] "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o"
Code
x %in% y
 [1] FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE

And we can define our own, too:

Code
## Need to refer to the operator in a special way with backticks
`%nin%` <- Negate(`%in%`)

# Now we have "not in"
x %nin% y
 [1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
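
An infix operator is just a function of two arguments whose name has the form %something%. As a sketch, here is a hypothetical %paste% operator that joins two character strings. The name and the argument names lhs and rhs are arbitrary choices for illustration.

```r
## Define the operator as a two-argument function, using backticks
`%paste%` <- function(lhs, rhs) {
  paste(lhs, rhs)
}

"Hello" %paste% "world"

## Operators are functions, so they can also be called in prefix form
`%paste%`("Hello", "world")

## The same is true of built-in operators
`+`(2, 3)
```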

Packages

Packages are loaded into your working environment using the library() function:

Code
## A package containing a dataset rather than functions
library(gapminder)

gapminder
# A tibble: 1,704 × 6
   country     continent  year lifeExp      pop gdpPercap
   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
 1 Afghanistan Asia       1952    28.8  8425333      779.
 2 Afghanistan Asia       1957    30.3  9240934      821.
 3 Afghanistan Asia       1962    32.0 10267083      853.
 4 Afghanistan Asia       1967    34.0 11537966      836.
 5 Afghanistan Asia       1972    36.1 13079460      740.
 6 Afghanistan Asia       1977    38.4 14880372      786.
 7 Afghanistan Asia       1982    39.9 12881816      978.
 8 Afghanistan Asia       1987    40.8 13867957      852.
 9 Afghanistan Asia       1992    41.7 16317921      649.
10 Afghanistan Asia       1997    41.8 22227415      635.
# ℹ 1,694 more rows

You need only install a package once (and occasionally update it):

Code
## Do at least once for each package. Once done, not needed each time.
install.packages("palmerpenguins", repos = "http://cran.rstudio.com")

## Needed sometimes, especially after an R major version upgrade.
update.packages(repos = "http://cran.rstudio.com")

But you must load the package in each R session before you can access its contents:

Code
## To load a package, usually at the start of your RMarkdown document or script file
library(palmerpenguins)
penguins
# A tibble: 344 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>

You can “reach in” to an installed package that you have not loaded and grab a function directly, using <package>::<function>:

Code
## A little glimpse of what we'll do soon
penguins |> 
  select(species, body_mass_g, sex) |> 
  gtsummary::tbl_summary(by = species) 
Characteristic               Adelie, N = 152        Chinstrap, N = 68      Gentoo, N = 124
body_mass_g, Median (IQR)    3,700 (3,350 – 4,000)  3,700 (3,488 – 3,950)  5,000 (4,700 – 5,500)
    Unknown                  1                      0                      1
sex, n (%)
    female                   73 (50)                34 (50)                58 (49)
    male                     73 (50)                34 (50)                61 (51)
    Unknown                  6                      0                      5
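
The same idiom works with any installed package. As a sketch that needs nothing beyond a standard R installation: the tools package ships with R but is not attached by default, so its functions are normally reached with ::. The file name here is made up for illustration.

```r
## file_ext() lives in the tools package, which ships with R
ext <- tools::file_ext("penguins_summary.csv")
ext

## Using :: does not attach the package, so you can check whether
## it has been added to the search path
"package:tools" %in% search()
```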

The scope of names

Code
x <- c(1:10)
y <- c(90:100)

x
 [1]  1  2  3  4  5  6  7  8  9 10
Code
y
 [1]  90  91  92  93  94  95  96  97  98  99 100
Code
mean()

## Error in mean.default() : argument "x" is missing, with no default
Code
mean(x) # argument names are internal to functions
[1] 5.5
Code
mean(x = x)
[1] 5.5
Code
mean(x = y)
[1] 95
Code
x
 [1]  1  2  3  4  5  6  7  8  9 10
Code
y
 [1]  90  91  92  93  94  95  96  97  98  99 100
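
The examples above show that the x in mean(x = y) is the name of the function's argument, not the object x in your workspace. A sketch with a hypothetical function, add_one(), makes the same point from the other direction: changes made to an argument inside a function do not alter an object of the same name outside it.

```r
x <- c(1:10)

## A hypothetical function whose argument happens to be called `x`
add_one <- function(x) {
  x <- x + 1   # modifies the function's local copy only
  x
}

add_one(x = c(90, 91, 92))

## The object `x` in the workspace is unchanged
x
```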

Types and Classes

The object inspector in RStudio is your friend.

You can ask an object what it is at the console, too:

Code
class(my_numbers)
[1] "numeric"
Code
typeof(my_numbers)
[1] "double"

Objects can have more than one (nested) class:

Code
summary(my_numbers)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.000   1.500   3.000   5.714   4.000  25.000 
Code
my_smry <- summary(my_numbers) # remember, outputs can be assigned to a name, creating an object

class(summary(my_numbers)) # functions can be nested, and are evaluated from the inside out
[1] "summaryDefault" "table"         
Code
class(my_smry) # equivalent to the previous line
[1] "summaryDefault" "table"         
Code
typeof(my_smry)
[1] "double"
Code
attributes(my_smry)
$names
[1] "Min."    "1st Qu." "Median"  "Mean"    "3rd Qu." "Max."   

$class
[1] "summaryDefault" "table"         
Code
## In this case, the functions extract the corresponding attribute
class(my_smry)
[1] "summaryDefault" "table"         
Code
names(my_smry)
[1] "Min."    "1st Qu." "Median"  "Mean"    "3rd Qu." "Max."   
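
A class is itself just an attribute, stored as a character vector. As a sketch, inherits() tests whether an object has a given class anywhere in that vector, and unclass() strips the class away to show the underlying data. my_numbers is re-created here so the chunk is self-contained, and plain_smry is an illustrative name.

```r
my_numbers <- c(1, 2, 3, 1, 3, 5, 25)
my_smry <- summary(my_numbers)

## Does the object have this class anywhere in its class vector?
inherits(my_smry, "summaryDefault")
inherits(my_smry, "data.frame")

## Removing the class leaves a plain named numeric vector
plain_smry <- unclass(my_smry)
class(plain_smry)
names(plain_smry)
```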

Kinds of vector, and coercion

Code
my_int <- c(1, 3, 5, 6, 10)
is.integer(my_int)
[1] FALSE
Code
is.double(my_int)
[1] TRUE
Code
my_int <- as.integer(my_int)
is.integer(my_int)
[1] TRUE
Code
my_chr <- c("Mary", "had", "a", "little", "lamb")
is.character(my_chr)
[1] TRUE
Code
my_lgl <- c(TRUE, FALSE, TRUE)
is.logical(my_lgl)
[1] TRUE
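
R stores numbers you type as doubles unless told otherwise, which is why is.integer() was FALSE above. As a sketch, the L suffix creates integer values directly, and the colon operator also produces integers. The name my_int_direct is illustrative.

```r
typeof(5)       # a plain number is a double
typeof(5L)      # the L suffix makes an integer

my_int_direct <- c(1L, 3L, 5L, 6L, 10L)
is.integer(my_int_direct)

typeof(1:5)     # sequences made with `:` are integers
```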

Factors:

Code
## Factors are for storing unordered or ordered categorical variables
x <- factor(c("Yes", "No", "No", "Maybe", "Yes", "Yes", "Yes", "No"))
x
[1] Yes   No    No    Maybe Yes   Yes   Yes   No   
Levels: Maybe No Yes
Code
summary(x) # Alphabetical order by default
Maybe    No   Yes 
    1     3     4 
Code
typeof(x)       # Underneath, a factor is a type of integer ...
[1] "integer"
Code
attributes(x)   # ... with labels for its numbers, or "levels" 
$levels
[1] "Maybe" "No"    "Yes"  

$class
[1] "factor"
Code
levels(x)
[1] "Maybe" "No"    "Yes"  
Code
is.ordered(x)
[1] FALSE
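
If the alphabetical default is not what you want, you can set the levels yourself, and declare the factor ordered when the categories have a natural ranking. A minimal sketch follows. The names answers and x_ord, and the ordering No < Maybe < Yes, are assumptions for illustration.

```r
answers <- c("Yes", "No", "No", "Maybe", "Yes", "Yes", "Yes", "No")

x_ord <- factor(answers,
                levels = c("No", "Maybe", "Yes"),
                ordered = TRUE)

levels(x_ord)
is.ordered(x_ord)

## The underlying integers follow the order of the levels
as.integer(x_ord)

## Ordered factors support comparisons
x_ord < "Yes"
```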

Vectors can’t be heterogeneous: every element of an atomic vector has the same type. Objects can be coerced from one class to another, either manually or automatically. Take care.

Code
class(my_numbers)
[1] "numeric"
Code
my_new_vector <- c(my_numbers, "Apple")

my_new_vector # vectors are homogeneous/atomic
[1] "1"     "2"     "3"     "1"     "3"     "5"     "25"    "Apple"
Code
class(my_new_vector)
[1] "character"
Code
my_dbl <- c(2.1, 4.77, 30.111, 3.14519)
is.double(my_dbl)
[1] TRUE
Code
my_dbl <- as.integer(my_dbl)

my_dbl
[1]  2  4 30  3
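
Two details of coercion are worth seeing directly. As a sketch: automatic coercion moves toward the most flexible type present (logical, then integer, then double, then character), and as.integer() drops the fractional part rather than rounding. Manual coercion that cannot succeed produces NA.

```r
typeof(c(TRUE, 2L))        # logical and integer   -> "integer"
typeof(c(2L, 3.5))         # integer and double    -> "double"
typeof(c(3.5, "a"))        # double and character  -> "character"

## as.integer() truncates toward zero; round() first if you want rounding
as.integer(c(2.9, -2.9))
as.integer(round(c(2.9, -2.9)))

## Coercion that fails gives NA. R also warns about this;
## suppressWarnings() hides the warning here.
suppressWarnings(as.numeric(c("10", "ten")))
```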

Lists, data frames, tibbles

A table of data is a kind of list

Code
gapminder # tibbles and data frames can contain vectors of different types
# A tibble: 1,704 × 6
   country     continent  year lifeExp      pop gdpPercap
   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
 1 Afghanistan Asia       1952    28.8  8425333      779.
 2 Afghanistan Asia       1957    30.3  9240934      821.
 3 Afghanistan Asia       1962    32.0 10267083      853.
 4 Afghanistan Asia       1967    34.0 11537966      836.
 5 Afghanistan Asia       1972    36.1 13079460      740.
 6 Afghanistan Asia       1977    38.4 14880372      786.
 7 Afghanistan Asia       1982    39.9 12881816      978.
 8 Afghanistan Asia       1987    40.8 13867957      852.
 9 Afghanistan Asia       1992    41.7 16317921      649.
10 Afghanistan Asia       1997    41.8 22227415      635.
# ℹ 1,694 more rows
Code
class(gapminder)
[1] "tbl_df"     "tbl"        "data.frame"
Code
typeof(gapminder) # hmm
[1] "list"
  • Lists can be heterogeneous, in the sense of containing vectors of different types. Underneath, most complex R objects are some kind of list with different components.
  • A data frame is a list of vectors of the same length, where the vectors can be of different types (e.g. numeric, character, logical).
  • A tibble is an enhanced data frame
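
To make the points above concrete, here is a sketch that builds a small list and a small data frame by hand. The object names and values are made up for illustration.

```r
## A list can hold vectors of different types and lengths
my_list <- list(name = "Mary", scores = c(90, 85, 77), passed = TRUE)
typeof(my_list)
my_list$scores

## A data frame is a list of equal-length vectors, one per column
my_df <- data.frame(
  animal = c("cat", "dog", "bird"),
  legs = c(4L, 4L, 2L)
)
typeof(my_df)
is.list(my_df)
length(my_df)      # the number of columns, i.e. list elements
my_df[["legs"]]    # list-style extraction of one column
```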

Some classes are versions of others

  • Base R’s trusty data.frame
Code
library(socviz)

Attaching package: 'socviz'
The following object is masked _by_ '.GlobalEnv':

    %nin%
Code
titanic
      fate    sex    n percent
1 perished   male 1364    62.0
2 perished female  126     5.7
3 survived   male  367    16.7
4 survived female  344    15.6
Code
class(titanic)
[1] "data.frame"
Code
## The `$` idiom picks out a named column here; 
## more generally, the named element of a list
titanic$percent  
[1] 62.0  5.7 16.7 15.6

  • The Tidyverse’s enhanced tibble

Code
## tibbles are built on data frames 
titanic_tb <- as_tibble(titanic) 
titanic_tb
# A tibble: 4 × 4
  fate     sex        n percent
  <fct>    <fct>  <dbl>   <dbl>
1 perished male    1364    62  
2 perished female   126     5.7
3 survived male     367    16.7
4 survived female   344    15.6
Code
class(titanic_tb)
[1] "tbl_df"     "tbl"        "data.frame"

Recycling rules for vectors

Arithmetic on vectors

In R, all numbers are vectors of different sorts. Even single numbers (“scalars”) are conceptually vectors of length 1.

Arithmetic on vectors (and arrays generally) follows a series of recycling rules that favor ease of expression of vectorized, “elementwise” operations.

See if you can predict what the following operations do:

Code
my_numbers
[1]  1  2  3  1  3  5 25
Code
result1 <- my_numbers + 1
Code
result1
[1]  2  3  4  2  4  6 26
Code
result2 <- my_numbers + my_numbers
Code
result2
[1]  2  4  6  2  6 10 50
Code
two_nums <- c(5, 10)

result3 <- my_numbers + two_nums
Warning in my_numbers + two_nums: longer object length is not a multiple of
shorter object length
Code
result3
[1]  6 12  8 11  8 15 30
Code
three_nums <- c(1, 5, 10)

result4 <- my_numbers + three_nums
Warning in my_numbers + three_nums: longer object length is not a multiple of
shorter object length
Code
result4
[1]  2  7 13  2  8 15 26

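Note that you get a warning here, because 7 is not a multiple of 2 or of 3. R still carries out the addition, though, recycling the shorter vector and cutting it off partway through its last repetition. Don’t ignore warnings until you understand what they mean.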
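
When the longer length is an exact multiple of the shorter one, recycling happens silently, which is where mistakes tend to hide. A sketch, with rep_len() used to spell out what recycling does. The object names are illustrative.

```r
six_nums <- c(1, 2, 3, 4, 5, 6)
two_nums <- c(5, 10)

## 6 is a multiple of 2, so there is no warning
silent_result <- six_nums + two_nums
silent_result

## Recycling is equivalent to repeating the shorter vector to full length
recycled <- rep_len(two_nums, length.out = length(six_nums))
recycled
six_nums + recycled
```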