## Inside code chunks, lines beginning with a # character are comments
## Comments are ignored by R
my_numbers <- c(1, 1, 2, 4, 1, 3, 1, 5) # Anything after a # character is ignored as well
Soc 690S: Week 02
Duke University
January 2025
We want to draw graphs reproducibly
Easy things are awkward
Hard things are straightforward
Really hard things are possible
Easy things are trivial
Hard things are awkward
Really hard things are impossible
Raw data |> Read, Clean, Analyse |> Tidy table |> Make figures
Stata/SAS/etc |> Tidy table |> Read in to R |> Make figures
RStudio at startup
RStudio schematic overview
Think in terms of Data + Transformations, written out as code, rather than a series of point-and-click steps
Our starting data + our code is what’s “real” in our projects, not the final output or any intermediate objects
Use Quarto to produce and reproduce your work
PDF out
HTML out
Word out
This way of doing things is called a Literate Programming or Notebook approach.
Markdown document
Markdown document annotated
Notebook-style documents like Quarto files are great as part of larger projects. The more complex your project, the less likely it will straightforwardly fit into a single notebook. More likely you will find yourself, first, splitting parts of a complex project up into different notebooks; and then, second, writing R scripts that programmatically clean and pre-process data, run analyses, and produce some outputs (such as key tables and figures) that you then incorporate into a Quarto document indirectly: not by copying and pasting, but by pointing to those outputs.
Desired style | Use the following Markdown annotation |
---|---|
Heading 1 | `# Heading 1` |
Heading 2 | `## Heading 2` |
Heading 3 (actual heading styles will vary) | `### Heading 3` |
Paragraph | Just start typing |
Bold | `**Bold**` |
Italic | `*Italic*` |
Images | `![Alternate text for image](path/image.jpg)` |
Hyperlinks | `[Link text](https://www.visualizingsociety.com/)` |
Unordered Lists | |
- First | `- First` |
- Second | `- Second` |
- Third | `- Third` |
Ordered Lists | |
1. First | `1. First` |
2. Second | `2. Second` |
3. Third | `3. Third` |
Footnote.¹ | `Footnote[^notelabel]` |
¹The note’s content. | `[^notelabel]: The note's content.` |
TYPE OUT
YOUR CODE
BY HAND
Samuel Beckett
GETTING ORIENTED
library(tidyverse)
── Attaching core tidyverse packages ────────────────────────────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ──────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package to force all conflicts to become errors
library(tidyverse)
Loading tidyverse: ggplot2 <| Draw graphs
Loading tidyverse: tibble <| Nicer data tables
Loading tidyverse: tidyr <| Tidy your data
Loading tidyverse: readr <| Get data into R
Loading tidyverse: purrr <| Fancy iteration
Loading tidyverse: dplyr <| Action verbs for tables
Code you can type and run:
Output:
This is equivalent to running the code above, typing `my_numbers` at the console, and hitting enter.
By convention, code output in documents is prefixed by ##
Also by convention, outputting vectors, etc, gets a counter keeping track of the number of elements. For example,
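To make those two conventions concrete, here is a minimal sketch that reuses the `my_numbers` vector from the top of these notes. The `##` lines show the printed output as it would appear in a rendered document. How a longer vector wraps depends on the width of your console, so no output is shown for `letters`.

```r
# Create the vector, then print it by typing its name
my_numbers <- c(1, 1, 2, 4, 1, 3, 1, 5)
my_numbers
## [1] 1 1 2 4 1 3 1 5

# The [1] is a counter: it gives the position of the first element on that line.
# Longer vectors wrap onto several lines, each starting with its own counter.
letters

# You can check a counter by asking for that element directly
letters[20]
## [1] "t"
```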
Logical equality and inequality (yielding a `TRUE` or `FALSE` result) is done with `==` and `!=`. Other logical operators include `<`, `>`, `<=`, `>=`, and `!` for negation.
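A short sketch of these operators in action. The name `is_big` is made up for this illustration. Comparisons work element by element, so a vector on the left gives a vector of `TRUE` and `FALSE` values back.

```r
3 == 3    # equality
## [1] TRUE
3 != 3    # inequality
## [1] FALSE
2 <= 5    # less than or equal to
## [1] TRUE
!TRUE     # negation
## [1] FALSE

# Comparisons are applied to every element of a vector
my_numbers <- c(1, 1, 2, 4, 1, 3, 1, 5)
is_big <- my_numbers >= 3
is_big
## [1] FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE  TRUE
```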
Or it’s a really bad idea to try to use them
There are a few built-in objects:
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
[20] "t" "u" "v" "w" "x" "y" "z"
In fact, this is mostly what we will be doing.
Objects are created by assigning a thing to a name:
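For instance, a minimal sketch. The values in `your_numbers`, and the names `your_numbers` and `all_numbers`, are invented for illustration.

```r
# The name goes on the left, the thing being assigned goes on the right
my_numbers <- c(1, 1, 2, 4, 1, 3, 1, 5)     # c() combines values into one vector
your_numbers <- c(5, 31, 71, 1, 3, 21, 6)   # hypothetical values

# c() can also join existing vectors end to end
all_numbers <- c(my_numbers, your_numbers)
length(all_numbers)
## [1] 15
```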
The `c()` function combines or concatenates things.
To type the assignment operator `<-` in RStudio, press `option` and `-` on a Mac, or `alt` and `-` on Windows.
If you don’t name the arguments, R assumes you are providing them in the order the function expects.
What arguments? Which order? Read the function’s help page
[1] NA
[1] 32.44444
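An `NA` followed by a number, as in the two results above, is the usual before-and-after of dealing with a missing value. As a sketch (the `ages` vector is made up for illustration and is not the data behind those results), here is how named and positional arguments behave with two base R functions, `mean()` and `seq()`:

```r
# Hypothetical data with one missing value
ages <- c(23, 41, NA, 35)

mean(x = ages)                  # the missing value propagates
## [1] NA
mean(x = ages, na.rm = TRUE)    # na.rm = TRUE drops missing values first
## [1] 33

# Unnamed arguments are matched by position: from, to, by
by_position <- seq(2, 10, 4)

# Named arguments can be given in any order
by_name <- seq(by = 4, to = 10, from = 2)

identical(by_position, by_name)
## [1] TRUE

# To see a function's arguments and their order, open its help page:
# ?seq
```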
Or select from one of several options
There are all kinds of functions. They return different things.
You can assign the output of a function to a name, which turns it into an object. (Otherwise it’ll send its output to the console.)
Objects hang around in your work environment until they are overwritten by you, or are deleted.
Nested functions are evaluated from the inside out.
Instead of deeply nesting functions in parentheses, we can use the pipe operator:
Read this operator as “and then”
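A small sketch comparing the two styles on the `my_numbers` vector from earlier. The names `nested` and `piped` are made up for this example, and the `|>` operator needs R 4.1 or later.

```r
my_numbers <- c(1, 1, 2, 4, 1, 3, 1, 5)

# Nested: you have to read from the inside out
nested <- round(sqrt(mean(my_numbers)), 2)

# Piped: take my_numbers, and then take the mean, and then the square root, and then round
piped <- my_numbers |> mean() |> sqrt() |> round(2)

identical(nested, piped)
## [1] TRUE
```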
Better, vertical space is free in R:
eggs |>
  get_from_fridge() |>
  crack_eggs(into = "bowl") |>
  whisk(len = 40) |>
  pour_in_pan(temp = "med-high") |>
  stir() |>
  serve()
Packages are loaded into your working environment using the `library()` function:
# A tibble: 1,704 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779.
2 Afghanistan Asia 1957 30.3 9240934 821.
3 Afghanistan Asia 1962 32.0 10267083 853.
4 Afghanistan Asia 1967 34.0 11537966 836.
5 Afghanistan Asia 1972 36.1 13079460 740.
6 Afghanistan Asia 1977 38.4 14880372 786.
7 Afghanistan Asia 1982 39.9 12881816 978.
8 Afghanistan Asia 1987 40.8 13867957 852.
9 Afghanistan Asia 1992 41.7 16317921 649.
10 Afghanistan Asia 1997 41.8 22227415 635.
# ℹ 1,694 more rows
You need only install a package once (and occasionally update it):
But you must load the package in each R session before you can access its contents:
## To load a package, usually at the start of your RMarkdown document or script file
library(palmerpenguins)
penguins
# A tibble: 344 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
7 Adelie Torgersen 38.9 17.8 181 3625
8 Adelie Torgersen 39.2 19.6 195 4675
9 Adelie Torgersen 34.1 18.1 193 3475
10 Adelie Torgersen 42 20.2 190 4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>
`::` lets you call a single function from a package without loading the whole package:
# A little glimpse of what we'll do soon
penguins |>
  count(species, sex, year) |>
  pivot_wider(names_from = year, values_from = n) |>
  tinytable::tt()
species | sex | 2007 | 2008 | 2009 |
---|---|---|---|---|
Adelie | female | 22 | 25 | 26 |
Adelie | male | 22 | 25 | 26 |
Adelie | NA | 6 | NA | NA |
Chinstrap | female | 13 | 9 | 12 |
Chinstrap | male | 13 | 9 | 12 |
Gentoo | female | 16 | 22 | 20 |
Gentoo | male | 17 | 23 | 21 |
Gentoo | NA | 1 | 1 | 3 |
library(tidyverse)
I’m going to speak somewhat loosely here for now, and gloss over some distinctions between object classes and data structures, as well as kinds of objects and their attributes.
Objects are made of one or more vectors. A vector can, in effect, only have a single type: integer, double, logical, character, factor, date, etc. That is, vectors are “atomic”. Complex objects are mostly made of lists of vectors of different sorts. Or they are made of nested lists of other simpler objects that are themselves ultimately made up of vectors.
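Here is a quick sketch of what "a vector can only have a single type" means in practice, using only base R. The names `mixed` and `kept` are made up for illustration.

```r
typeof(c(1L, 2L, 3L))
## [1] "integer"
typeof(c(1, 2.5))
## [1] "double"
typeof(c(TRUE, FALSE))
## [1] "logical"
typeof(c("a", "b"))
## [1] "character"

# Mixing types in one vector forces everything to a single common type
mixed <- c(1, "a", TRUE)
typeof(mixed)
## [1] "character"

# A list, by contrast, lets each element keep its own type
kept <- list(1, "a", TRUE)
sapply(kept, typeof)
## [1] "double"    "character" "logical"
```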
The object inspector in RStudio is your friend.
You can ask an object what it is at the console, too:
## Factors are for storing unordered or ordered categorical variables
x <- factor(c("Yes", "No", "No", "Maybe", "Yes", "Yes", "Yes", "No"))
x
[1] Yes No No Maybe Yes Yes Yes No
Levels: Maybe No Yes
Maybe No Yes
1 3 4
[1] "integer"
$levels
[1] "Maybe" "No" "Yes"
$class
[1] "factor"
[1] "Maybe" "No" "Yes"
[1] FALSE
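The code that produced the results above did not survive in these notes. As a sketch, the following base R calls give results of the same shape for the factor `x`; treat the exact sequence of calls as an assumption rather than the original code.

```r
x <- factor(c("Yes", "No", "No", "Maybe", "Yes", "Yes", "Yes", "No"))

summary(x)      # counts for each level
typeof(x)       # underneath, a factor is stored as integer codes
attributes(x)   # the levels and the class are attached as attributes
levels(x)       # just the level labels, sorted alphabetically by default
is.ordered(x)   # FALSE: this factor is unordered
```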
# A tibble: 1,704 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779.
2 Afghanistan Asia 1957 30.3 9240934 821.
3 Afghanistan Asia 1962 32.0 10267083 853.
4 Afghanistan Asia 1967 34.0 11537966 836.
5 Afghanistan Asia 1972 36.1 13079460 740.
6 Afghanistan Asia 1977 38.4 14880372 786.
7 Afghanistan Asia 1982 39.9 12881816 978.
8 Afghanistan Asia 1987 40.8 13867957 852.
9 Afghanistan Asia 1992 41.7 16317921 649.
10 Afghanistan Asia 1997 41.8 22227415 635.
# ℹ 1,694 more rows
[1] "tbl_df" "tbl" "data.frame"
[1] "list"
Lists are collections of vectors of possibly different types and lengths, or collections of more complex objects that are themselves ultimately made out of vectors. Underneath, most complex R objects are some kind of list with different components that can be accessed by some function that knows the names of the things inside the list.
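To illustrate, here is a sketch using a fitted linear model, chosen only because it needs nothing beyond base R and the built-in `mtcars` data. The name `fit` is made up for this example.

```r
fit <- lm(mpg ~ wt, data = mtcars)   # mtcars ships with R

class(fit)     # what kind of object it claims to be
## [1] "lm"
typeof(fit)    # what it is underneath
## [1] "list"

names(fit)     # the named components inside the list

# An accessor function that knows where to look inside the list
coef(fit)

# It returns the same thing as reaching in by name
identical(coef(fit), fit$coefficients)
## [1] TRUE
```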
A data frame is a list of vectors of the same length, where the vectors can be of different types (e.g. numeric, character, logical, etc).
A data frame is a natural representation of what most real tables of data look like. Having it be a basic sort of entity in the programming language IS ONE OF R’s BEST IDEAS AND EASILY UNDERRATED!
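A minimal sketch of the "list of equal-length vectors" idea, built from the counts in the `fate` and `sex` table that appears in these notes. The object name `df` is made up for this example.

```r
df <- data.frame(
  fate = c("perished", "perished", "survived", "survived"),
  sex  = c("male", "female", "male", "female"),
  n    = c(1364, 126, 367, 344)
)

# A new column is just another vector of the same length
df$percent <- round(100 * df$n / sum(df$n), 1)

class(df)
## [1] "data.frame"
typeof(df)           # underneath, it is a list
## [1] "list"
sapply(df, typeof)   # each column has a single type of its own
lengths(df)          # and every column has the same length
```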
A tibble is an enhanced data frame
data.frame
fate sex n percent
1 perished male 1364 62.0
2 perished female 126 5.7
3 survived male 367 16.7
4 survived female 344 15.6
[1] "data.frame"
tibble
# A tibble: 2,867 × 32
year id ballot age childs sibs degree race sex region income16
<dbl> <dbl> <labelled> <dbl> <dbl> <labe> <fct> <fct> <fct> <fct> <fct>
1 2016 1 1 47 3 2 Bache… White Male New E… $170000…
2 2016 2 2 61 0 3 High … White Male New E… $50000 …
3 2016 3 3 72 2 3 Bache… White Male New E… $75000 …
4 2016 4 1 43 4 3 High … White Fema… New E… $170000…
5 2016 5 3 55 2 2 Gradu… White Fema… New E… $170000…
6 2016 6 2 53 2 2 Junio… White Fema… New E… $60000 …
7 2016 7 1 50 2 2 High … White Male New E… $170000…
8 2016 8 3 23 3 6 High … Other Fema… Middl… $30000 …
9 2016 9 1 45 3 5 High … Black Male Middl… $60000 …
10 2016 10 3 71 4 1 Junio… White Male Middl… $60000 …
# ℹ 2,857 more rows
# ℹ 21 more variables: relig <fct>, marital <fct>, padeg <fct>, madeg <fct>,
# partyid <fct>, polviews <fct>, happy <fct>, partners <fct>, grass <fct>,
# zodiac <fct>, pres12 <labelled>, wtssall <dbl>, income_rc <fct>,
# agegrp <fct>, ageq <fct>, siblings <fct>, kids <fct>, religion <fct>,
# bigregion <fct>, partners_rc <fct>, obama <dbl>
In R, all numbers are vectors of different sorts. Even single numbers (“scalars”) are conceptually vectors of length 1.
Arithmetic on vectors (and arrays generally) follows a series of recycling rules that favor ease of expression of vectorized, “elementwise” operations.
See if you can predict what the following operations do:
Warning in my_numbers + three_nums: longer object length is not a multiple of
shorter object length
Note that you get a warning here. It’ll still do it, though! Don’t ignore warnings until you understand what they mean.
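Here is a sketch of the recycling rules at work. The warning above names `my_numbers` and `three_nums`; the values given to `three_nums` and `two_nums` below are assumptions made for illustration.

```r
my_numbers <- c(1, 1, 2, 4, 1, 3, 1, 5)
two_nums   <- c(10, 20)        # hypothetical values
three_nums <- c(1, 10, 100)    # hypothetical values

# A single number is recycled across every element
doubled <- my_numbers * 2

# A length-2 vector fits into a length-8 vector exactly four times, so no warning
paired <- my_numbers + two_nums
paired
## [1] 11 21 12 24 11 23 11 25

# A length-3 vector does not fit evenly into length 8: R still recycles it,
# but raises a warning. Here we just record whether the warning happened.
warned <- tryCatch(
  {
    my_numbers + three_nums
    FALSE
  },
  warning = function(w) TRUE
)
warned
## [1] TRUE
```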
Like before:
# A tibble: 1,704 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779.
2 Afghanistan Asia 1957 30.3 9240934 821.
3 Afghanistan Asia 1962 32.0 10267083 853.
4 Afghanistan Asia 1967 34.0 11537966 836.
5 Afghanistan Asia 1972 36.1 13079460 740.
6 Afghanistan Asia 1977 38.4 14880372 786.
7 Afghanistan Asia 1982 39.9 12881816 978.
8 Afghanistan Asia 1987 40.8 13867957 852.
9 Afghanistan Asia 1992 41.7 16317921 649.
10 Afghanistan Asia 1997 41.8 22227415 635.
# ℹ 1,694 more rows
Packages used: `tidyverse` and `gapminder`
`p` gets the output of the `ggplot()` function, given these arguments
One of the arguments, `mapping`, is itself taking the output of a function named `aes()`
The plot is drawn by adding together the `p` object and the `geom_point()` function.
The `+` here acts just like the `|>` pipe, but for ggplot functions only. (This is an accident of history.)
R objects are just lists of stuff to use or things to do
The `p` object
Peek in with the object inspector
Here’s a gotcha. You might think you could write `3 < 5 & 7` and have it be interpreted as “Three is less than five and also less than seven [True or False?]”:
It seems to work!
But now try `3 < 5 & 1`, where your intention is “Three is less than five and also less than one [True or False?]”
`3 < 5` is evaluated first, and resolves to TRUE, leaving us with the expression `TRUE & 1`.
R interprets this as `TRUE & as.logical(1)`.
`1` resolves to TRUE, as does any other nonzero number. Only `0` resolves to FALSE.
So the expression returns TRUE, not the FALSE you intended.
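You can check each step of this at the console. A minimal sketch:

```r
3 < 5 & 7    # looks like it works
## [1] TRUE
3 < 5 & 1    # but 3 is not less than 1
## [1] TRUE

# How numbers turn into logical values: zero is FALSE, anything nonzero is TRUE
as.logical(c(0, 1, 7, -2))
## [1] FALSE  TRUE  TRUE  TRUE

# What you meant to write: spell out both comparisons
3 < 5 & 3 < 7
## [1] TRUE
3 < 5 & 3 < 1
## [1] FALSE
```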
Let’s evaluate `0.6 + 0.2 == 0.8`
Now let’s try `0.6 + 0.3 == 0.9`
Er. That’s not right.
In Base 10, you can’t precisely express fractions like \(\frac{1}{3}\) and \(\frac{1}{9}\). They come out as repeating decimals: 0.3333… or 0.1111… You can cleanly represent fractions that use a prime factor of the base, which in the case of Base 10 are 2 and 5.
Computers represent numbers as binary (i.e. Base 2) floating-point values. In Base 2, the only prime factor is 2. So \(\frac{1}{5}\) or \(\frac{1}{10}\) in binary would be repeating.
When you do binary math on repeating numbers and convert back to decimals you get tiny leftovers, and this can mess up logical comparisons of equality. The `all.equal()` function exists for this purpose.
[1] 0.3
[1] 0.300000000000000044
[1] TRUE
See e.g. https://0.30000000000000004.com
More later on why this might bite you, and how to deal with it
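As a sketch of the safer comparison: `all.equal()` tests for equality up to a small tolerance. When the values differ it returns a description of the difference rather than `FALSE`, so it is usually wrapped in `isTRUE()`.

```r
0.6 + 0.2 == 0.8
## [1] TRUE
0.6 + 0.3 == 0.9
## [1] FALSE

# Ask for more digits than usual to see the leftover
print(0.6 + 0.3, digits = 17)

# Compare with a tolerance instead of exactly
isTRUE(all.equal(0.6 + 0.3, 0.9))
## [1] TRUE
```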
You can use `=` as well as `<-` for assignment.
But `=` has a different meaning when used in functions.
This course uses `<-` for assignment throughout.
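A sketch of that difference. Inside a function call, `=` only says which argument a value belongs to, while `<-` really does create an object. The example is wrapped in `local()` purely so that it cannot touch anything in your own workspace; the names `demo`, `env`, `after_equals`, and `after_arrow` are made up for illustration.

```r
demo <- local({
  env <- environment()   # the temporary environment this code runs in

  mean(x = c(2, 4, 6))   # `=` names the argument; no object called x is created
  after_equals <- exists("x", envir = env, inherits = FALSE)

  mean(x <- c(2, 4, 6))  # `<-` creates x first, then passes its value to mean()
  after_arrow <- exists("x", envir = env, inherits = FALSE)

  c(after_equals = after_equals, after_arrow = after_arrow)
})

demo
## after_equals  after_arrow 
##        FALSE         TRUE 
```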
The Base R pipe, `|>`, is a relatively recent addition to R.
Piping was previously provided by the `magrittr` package, where it took the form `%>%`.
The two pipes do not behave identically in every case. We’ll use the Base R pipe in this course, but you’ll see the Magrittr pipe a lot out in the world.
Objects can have more than one (nested) class:
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.000 1.500 3.000 5.714 4.000 25.000
my_smry <- summary(my_numbers) # remember, outputs can be assigned to a name, creating an object
class(summary(my_numbers)) # functions can be nested, and are evaluated from the inside out
[1] "summaryDefault" "table"
[1] "summaryDefault" "table"
[1] "double"
$names
[1] "Min." "1st Qu." "Median" "Mean" "3rd Qu." "Max."
$class
[1] "summaryDefault" "table"
[1] "summaryDefault" "table"
[1] "Min." "1st Qu." "Median" "Mean" "3rd Qu." "Max."
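Most of the code behind the results above is missing from these notes. As a sketch, the calls below give results of the same shape. The values in `my_numbers` here are an assumption, chosen only because they reproduce the summary shown above.

```r
my_numbers <- c(1, 2, 3, 1, 3, 5, 25)   # assumed values that match the summary above
my_smry <- summary(my_numbers)

class(my_smry)        # more than one class, most specific first
typeof(my_smry)       # underneath, a plain double vector
attributes(my_smry)   # the names and the class are stored as attributes
names(my_smry)

# Because it is a named vector, you can pull out pieces by name
my_smry["Median"]

# inherits() asks whether an object has a given class anywhere in its class vector
inherits(my_smry, "summaryDefault")
## [1] TRUE
```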