Group and Summarize with dplyr

Soc 690S: Week 04

Kieran Healy

Duke University

February 2025

Work with dplyr and ggplot

Load our libraries

library(here)      # manage file paths
library(socviz)    # data and some useful functions
library(tidyverse) # your friend and mine

Tidyverse components

  • library(tidyverse): Load the package and …
  • ggplot2: Draw graphs
  • tibble: Nicer data tables
  • tidyr: Tidy your data
  • readr: Get data into R
  • purrr: Fancy iteration
  • dplyr: Action verbs for tables

Other tidyverse components

  • forcats: Deal with factors
  • haven: Import Stata, SPSS, etc.
  • lubridate: Dates, durations, times
  • readxl: Import from spreadsheets
  • stringr: Strings and regular expressions
  • reprex: Make reproducible examples

Not all of these are attached when we do library(tidyverse)

ggplot’s flow of action

Thinking in terms of layers

Feeding data to ggplot

Transform and summarize first.
Then send your clean tables to ggplot.

Crosstabulation and beyond

U.S. General Social Survey data: gss_sm

gss_sm  
# A tibble: 2,867 × 32
    year    id ballot       age childs sibs   degree race  sex   region income16
   <dbl> <dbl> <labelled> <dbl>  <dbl> <labe> <fct>  <fct> <fct> <fct>  <fct>   
 1  2016     1 1             47      3 2      Bache… White Male  New E… $170000…
 2  2016     2 2             61      0 3      High … White Male  New E… $50000 …
 3  2016     3 3             72      2 3      Bache… White Male  New E… $75000 …
 4  2016     4 1             43      4 3      High … White Fema… New E… $170000…
 5  2016     5 3             55      2 2      Gradu… White Fema… New E… $170000…
 6  2016     6 2             53      2 2      Junio… White Fema… New E… $60000 …
 7  2016     7 1             50      2 2      High … White Male  New E… $170000…
 8  2016     8 3             23      3 6      High … Other Fema… Middl… $30000 …
 9  2016     9 1             45      3 5      High … Black Male  Middl… $60000 …
10  2016    10 3             71      4 1      Junio… White Male  Middl… $60000 …
# ℹ 2,857 more rows
# ℹ 21 more variables: relig <fct>, marital <fct>, padeg <fct>, madeg <fct>,
#   partyid <fct>, polviews <fct>, happy <fct>, partners <fct>, grass <fct>,
#   zodiac <fct>, pres12 <labelled>, wtssall <dbl>, income_rc <fct>,
#   agegrp <fct>, ageq <fct>, siblings <fct>, kids <fct>, religion <fct>,
#   bigregion <fct>, partners_rc <fct>, obama <dbl>

We often want summary tables or graphs of data like this.

Two-way tables: Row percents

bigregion  Protestant  Catholic  Jewish  None  Other  Total
Northeast        32.4      33.3     5.5  23.0    5.7  100.0
Midwest          47.1      24.9     0.4  22.8    4.8  100.0
South            62.4      15.4     1.1  16.3    4.8  100.0
West             37.7      24.6     1.6  28.5    7.6  100.0

Two-way tables: Column percents

bigregion  Protestant  Catholic  Jewish   None  Other
Northeast        11.5      25.0    52.9   18.1   17.6
Midwest          23.7      26.5     5.9   25.4   20.8
South            47.4      24.7    21.6   27.5   31.4
West             17.4      23.9    19.6   29.1   30.2
Total           100.0     100.0   100.0  100.0  100.0

Two-way tables: Full marginals

bigregion  Protestant  Catholic  Jewish  None  Other
Northeast         5.5       5.7     0.9   3.9    1.0
Midwest          11.4       6.0     0.1   5.5    1.2
South            22.8       5.6     0.4   6.0    1.8
West              8.4       5.4     0.4   6.3    1.7
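
The three kinds of table differ only in the denominator. Here is a minimal sketch on made-up counts (the toy `counts` table and its values are hypothetical, not GSS data) that computes all three percentages side by side:

```r
library(dplyr)

# Hypothetical toy counts: two regions by two religions
counts <- tibble(
  region   = c("N", "N", "S", "S"),
  religion = c("A", "B", "A", "B"),
  n        = c(30, 10, 20, 40)
)

pcts <- counts |>
  group_by(region) |>
  mutate(row_pct = 100 * n / sum(n)) |>   # denominator: the region total
  group_by(religion) |>                   # replaces the earlier grouping
  mutate(col_pct = 100 * n / sum(n)) |>   # denominator: the religion total
  ungroup() |>
  mutate(total_pct = 100 * n / sum(n))    # denominator: the whole table

pcts
```

Row percents sum to 100 within each region, column percents within each religion, and the full marginals over the whole table.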

dplyr lets you work with tibbles

  • Remember, tibbles are tables of data where the columns can be of different types, such as numeric, logical, character, factor, etc.
  • We’ll use dplyr to transform and summarize our data.
  • We’ll use the pipe operator, |>, to chain together sequences of actions on our tables.
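
As a quick illustration of what the pipe does, here is a small base-R sketch (the numbers are arbitrary). The pipe passes the left-hand side as the first argument of the function on the right, so the nested and piped versions compute the same thing:

```r
# Nested calls read from the inside out
nested <- round(mean(sqrt(c(1, 4, 9))), 1)

# The piped version reads left to right, one step at a time
piped <- c(1, 4, 9) |>
  sqrt() |>
  mean() |>
  round(1)

identical(nested, piped)
```

The native pipe `|>` needs R 4.1 or later.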

dplyr’s core verbs

dplyr draws on the logic and language of database queries

Some actions to take on a single table

  • Group the data at the level we want, such as “Religion by Region” or “Children by School”.

  • Subset either the rows or the columns of our table, i.e. remove them before doing anything.

  • Mutate the data. That is, change something at the current level of grouping. Mutating adds new columns to the table, or changes the content of an existing column. It never changes the number of rows.

  • Summarize or aggregate the data. That is, make something new at a higher level of grouping. E.g., calculate means or counts by some grouping variable. This will generally result in a smaller, summary table. Usually this will have the same number of rows as there are groups being summarized.

For each action there’s a function

  • Group using group_by().
  • Subset has one action for rows and one for columns. We filter() rows and select() columns.
  • Mutate tables (i.e. add new columns, or re-make existing ones) using mutate().
  • Summarize tables (i.e. perform aggregating calculations) using summarize().
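
To see the verbs side by side, here is a minimal sketch on a made-up table (the `toy` tibble and its column names are hypothetical, chosen only for illustration):

```r
library(dplyr)

# Hypothetical toy data: six respondents in two regions
toy <- tibble(
  id     = 1:6,
  region = c("North", "North", "North", "South", "South", "South"),
  age    = c(30, 40, 50, 20, 60, 70)
)

# Subset: filter() keeps rows, select() keeps columns
adults_40_plus <- toy |>
  filter(age >= 40) |>
  select(region, age)

# Mutate: adds a column; the number of rows does not change
toy_decades <- toy |>
  mutate(decade = floor(age / 10) * 10)

# Group and summarize: one row per group in the result
region_means <- toy |>
  group_by(region) |>
  summarize(n = n(), mean_age = mean(age))

region_means
```

Note the row counts: filter() can shrink the table, mutate() leaves it the same length, and summarize() returns one row per region.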

Group and Summarize

General Social Survey data: gss_sm

## library(socviz) # if not loaded
gss_sm
# A tibble: 2,867 × 32
    year    id ballot       age childs sibs   degree race  sex   region income16
   <dbl> <dbl> <labelled> <dbl>  <dbl> <labe> <fct>  <fct> <fct> <fct>  <fct>   
 1  2016     1 1             47      3 2      Bache… White Male  New E… $170000…
 2  2016     2 2             61      0 3      High … White Male  New E… $50000 …
 3  2016     3 3             72      2 3      Bache… White Male  New E… $75000 …
 4  2016     4 1             43      4 3      High … White Fema… New E… $170000…
 5  2016     5 3             55      2 2      Gradu… White Fema… New E… $170000…
 6  2016     6 2             53      2 2      Junio… White Fema… New E… $60000 …
 7  2016     7 1             50      2 2      High … White Male  New E… $170000…
 8  2016     8 3             23      3 6      High … Other Fema… Middl… $30000 …
 9  2016     9 1             45      3 5      High … Black Male  Middl… $60000 …
10  2016    10 3             71      4 1      Junio… White Male  Middl… $60000 …
# ℹ 2,857 more rows
# ℹ 21 more variables: relig <fct>, marital <fct>, padeg <fct>, madeg <fct>,
#   partyid <fct>, polviews <fct>, happy <fct>, partners <fct>, grass <fct>,
#   zodiac <fct>, pres12 <labelled>, wtssall <dbl>, income_rc <fct>,
#   agegrp <fct>, ageq <fct>, siblings <fct>, kids <fct>, religion <fct>,
#   bigregion <fct>, partners_rc <fct>, obama <dbl>

Notice how the tibble already tells us a lot.

Summarizing a Table

  • Here’s what we’re going to do:

gss_sm |> 
  select(id, bigregion, religion)
# A tibble: 2,867 × 3
      id bigregion religion  
   <dbl> <fct>     <fct>     
 1     1 Northeast None      
 2     2 Northeast None      
 3     3 Northeast Catholic  
 4     4 Northeast Catholic  
 5     5 Northeast None      
 6     6 Northeast None      
 7     7 Northeast None      
 8     8 Northeast Catholic  
 9     9 Northeast Protestant
10    10 Northeast None      
# ℹ 2,857 more rows

We’re just taking a look at the relevant columns here.

Group by one column or variable

gss_sm |> 
  group_by(bigregion)
# A tibble: 2,867 × 32
# Groups:   bigregion [4]
    year    id ballot       age childs sibs   degree race  sex   region income16
   <dbl> <dbl> <labelled> <dbl>  <dbl> <labe> <fct>  <fct> <fct> <fct>  <fct>   
 1  2016     1 1             47      3 2      Bache… White Male  New E… $170000…
 2  2016     2 2             61      0 3      High … White Male  New E… $50000 …
 3  2016     3 3             72      2 3      Bache… White Male  New E… $75000 …
 4  2016     4 1             43      4 3      High … White Fema… New E… $170000…
 5  2016     5 3             55      2 2      Gradu… White Fema… New E… $170000…
 6  2016     6 2             53      2 2      Junio… White Fema… New E… $60000 …
 7  2016     7 1             50      2 2      High … White Male  New E… $170000…
 8  2016     8 3             23      3 6      High … Other Fema… Middl… $30000 …
 9  2016     9 1             45      3 5      High … Black Male  Middl… $60000 …
10  2016    10 3             71      4 1      Junio… White Male  Middl… $60000 …
# ℹ 2,857 more rows
# ℹ 21 more variables: relig <fct>, marital <fct>, padeg <fct>, madeg <fct>,
#   partyid <fct>, polviews <fct>, happy <fct>, partners <fct>, grass <fct>,
#   zodiac <fct>, pres12 <labelled>, wtssall <dbl>, income_rc <fct>,
#   agegrp <fct>, ageq <fct>, siblings <fct>, kids <fct>, religion <fct>,
#   bigregion <fct>, partners_rc <fct>, obama <dbl>

Grouping just changes the logical structure of the tibble.

Group and summarize by one column

gss_sm |>
  group_by(bigregion) |>
  summarize(total = n())
# A tibble: 4 × 2
  bigregion total
  <fct>     <int>
1 Northeast   488
2 Midwest     695
3 South      1052
4 West        632
  • The function n() counts up the rows within each group.
  • All the other columns are dropped in the summary operation
  • Your original gss_sm table is untouched
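
A small sketch of the last two points, using a made-up table (`toy` is hypothetical). The summary only exists if we assign it to a name, and the input table stays as it was:

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  region = c("North", "North", "South"),
  age    = c(30, 40, 50)
)

# Assign the summary to a new object if you want to keep it
by_region <- toy |>
  group_by(region) |>
  summarize(total = n())

names(by_region)  # only the grouping column and the new summary column
nrow(toy)         # the pipeline did not modify toy
```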

Group and summarize by two columns

gss_sm |>
  group_by(bigregion, religion)
# A tibble: 2,867 × 32
# Groups:   bigregion, religion [24]
    year    id ballot       age childs sibs   degree race  sex   region income16
   <dbl> <dbl> <labelled> <dbl>  <dbl> <labe> <fct>  <fct> <fct> <fct>  <fct>   
 1  2016     1 1             47      3 2      Bache… White Male  New E… $170000…
 2  2016     2 2             61      0 3      High … White Male  New E… $50000 …
 3  2016     3 3             72      2 3      Bache… White Male  New E… $75000 …
 4  2016     4 1             43      4 3      High … White Fema… New E… $170000…
 5  2016     5 3             55      2 2      Gradu… White Fema… New E… $170000…
 6  2016     6 2             53      2 2      Junio… White Fema… New E… $60000 …
 7  2016     7 1             50      2 2      High … White Male  New E… $170000…
 8  2016     8 3             23      3 6      High … Other Fema… Middl… $30000 …
 9  2016     9 1             45      3 5      High … Black Male  Middl… $60000 …
10  2016    10 3             71      4 1      Junio… White Male  Middl… $60000 …
# ℹ 2,857 more rows
# ℹ 21 more variables: relig <fct>, marital <fct>, padeg <fct>, madeg <fct>,
#   partyid <fct>, polviews <fct>, happy <fct>, partners <fct>, grass <fct>,
#   zodiac <fct>, pres12 <labelled>, wtssall <dbl>, income_rc <fct>,
#   agegrp <fct>, ageq <fct>, siblings <fct>, kids <fct>, religion <fct>,
#   bigregion <fct>, partners_rc <fct>, obama <dbl>

Group and summarize by two columns

gss_sm |>
  group_by(bigregion, religion) |>
  summarize(total = n())
# A tibble: 24 × 3
# Groups:   bigregion [4]
   bigregion religion   total
   <fct>     <fct>      <int>
 1 Northeast Protestant   158
 2 Northeast Catholic     162
 3 Northeast Jewish        27
 4 Northeast None         112
 5 Northeast Other         28
 6 Northeast <NA>           1
 7 Midwest   Protestant   325
 8 Midwest   Catholic     172
 9 Midwest   Jewish         3
10 Midwest   None         157
# ℹ 14 more rows
  • The function n() counts up the rows within the innermost (i.e. the rightmost) group.

Calculate frequencies

gss_sm |>
  group_by(bigregion, religion) |>
  summarize(total = n()) |>
  mutate(freq = total / sum(total),
           pct = round((freq*100), 1))
# A tibble: 24 × 5
# Groups:   bigregion [4]
   bigregion religion   total    freq   pct
   <fct>     <fct>      <int>   <dbl> <dbl>
 1 Northeast Protestant   158 0.324    32.4
 2 Northeast Catholic     162 0.332    33.2
 3 Northeast Jewish        27 0.0553    5.5
 4 Northeast None         112 0.230    23  
 5 Northeast Other         28 0.0574    5.7
 6 Northeast <NA>           1 0.00205   0.2
 7 Midwest   Protestant   325 0.468    46.8
 8 Midwest   Catholic     172 0.247    24.7
 9 Midwest   Jewish         3 0.00432   0.4
10 Midwest   None         157 0.226    22.6
# ℹ 14 more rows
  • The function n() counts up the rows
  • Which rows? The ones fed down the pipeline
  • The innermost (i.e. the rightmost) group.

Pipelines carry assumptions forward

gss_sm |> 
  group_by(bigregion, religion) |> 
  summarize(total = n()) |> 
  mutate(freq = total / sum(total),
           pct = round((freq*100), 1))
# A tibble: 24 × 5
# Groups:   bigregion [4]
   bigregion religion   total    freq   pct
   <fct>     <fct>      <int>   <dbl> <dbl>
 1 Northeast Protestant   158 0.324    32.4
 2 Northeast Catholic     162 0.332    33.2
 3 Northeast Jewish        27 0.0553    5.5
 4 Northeast None         112 0.230    23  
 5 Northeast Other         28 0.0574    5.7
 6 Northeast <NA>           1 0.00205   0.2
 7 Midwest   Protestant   325 0.468    46.8
 8 Midwest   Catholic     172 0.247    24.7
 9 Midwest   Jewish         3 0.00432   0.4
10 Midwest   None         157 0.226    22.6
# ℹ 14 more rows
  • Groups are carried forward till summarized or explicitly ungrouped
  • Summary calculations are done on the innermost group, which then “disappears”.
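
One way to watch the grouping peel away is to ask for the grouping variables after each step. This is a sketch on made-up data (`toy` is hypothetical); it relies on summarize()'s default behavior of dropping the last grouping level:

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  region   = c("North", "North", "North", "South", "South"),
  religion = c("A", "A", "B", "A", "B")
)

counts <- toy |>
  group_by(region, religion) |>
  summarize(total = n())   # dplyr reports the grouping that remains

# The innermost group (religion) is gone; region is still there
group_vars(counts)

# ungroup() removes whatever grouping is left
group_vars(ungroup(counts))
```

Any mutate() that follows the summarize() step therefore works within region, which is why sum(total) in the slides is a per-region total.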

mutate() is quite clever. See how we can immediately use freq, even though we are creating it in the same mutate() expression.
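
A stripped-down sketch of the same idea on made-up totals (`toy` is hypothetical): the arguments to mutate() are evaluated in order, so a later one can use a column created by an earlier one.

```r
library(dplyr)

# Hypothetical toy totals
toy <- tibble(
  religion = c("A", "B", "C"),
  total    = c(5, 3, 2)
)

shares <- toy |>
  mutate(freq = total / sum(total),    # created first ...
         pct  = round(freq * 100, 1))  # ... and used right away

shares
```

Swapping the order of the two arguments would fail, because pct would then refer to a freq column that does not exist yet.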

Convenience functions

gss_sm |> 
  group_by(bigregion, religion) |> 
  summarize(total = n()) |> 
  mutate(freq = total / sum(total),
           pct = round((freq*100), 1)) 
# A tibble: 24 × 5
# Groups:   bigregion [4]
   bigregion religion   total    freq   pct
   <fct>     <fct>      <int>   <dbl> <dbl>
 1 Northeast Protestant   158 0.324    32.4
 2 Northeast Catholic     162 0.332    33.2
 3 Northeast Jewish        27 0.0553    5.5
 4 Northeast None         112 0.230    23  
 5 Northeast Other         28 0.0574    5.7
 6 Northeast <NA>           1 0.00205   0.2
 7 Midwest   Protestant   325 0.468    46.8
 8 Midwest   Catholic     172 0.247    24.7
 9 Midwest   Jewish         3 0.00432   0.4
10 Midwest   None         157 0.226    22.6
# ℹ 14 more rows

We’re going to be doing this group_by() then n() step a lot. Some shorthand for it would be useful.

Three options for counting up rows

  • Use n()
gss_sm |> 
  group_by(bigregion, religion) |> 
  summarize(n = n()) 
# A tibble: 24 × 3
# Groups:   bigregion [4]
   bigregion religion       n
   <fct>     <fct>      <int>
 1 Northeast Protestant   158
 2 Northeast Catholic     162
 3 Northeast Jewish        27
 4 Northeast None         112
 5 Northeast Other         28
 6 Northeast <NA>           1
 7 Midwest   Protestant   325
 8 Midwest   Catholic     172
 9 Midwest   Jewish         3
10 Midwest   None         157
# ℹ 14 more rows
  • Group it yourself; result is grouped.
  • Use tally()
gss_sm |> 
  group_by(bigregion, religion) |> 
  tally() 
# A tibble: 24 × 3
# Groups:   bigregion [4]
   bigregion religion       n
   <fct>     <fct>      <int>
 1 Northeast Protestant   158
 2 Northeast Catholic     162
 3 Northeast Jewish        27
 4 Northeast None         112
 5 Northeast Other         28
 6 Northeast <NA>           1
 7 Midwest   Protestant   325
 8 Midwest   Catholic     172
 9 Midwest   Jewish         3
10 Midwest   None         157
# ℹ 14 more rows
  • More compact; result is grouped.
  • Use count()
gss_sm |> 
  count(bigregion, religion) 
# A tibble: 24 × 3
   bigregion religion       n
   <fct>     <fct>      <int>
 1 Northeast Protestant   158
 2 Northeast Catholic     162
 3 Northeast Jewish        27
 4 Northeast None         112
 5 Northeast Other         28
 6 Northeast <NA>           1
 7 Midwest   Protestant   325
 8 Midwest   Catholic     172
 9 Midwest   Jewish         3
10 Midwest   None         157
# ℹ 14 more rows
  • One step; result is not grouped.
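
The three approaches can be compared directly. This sketch uses made-up data (`toy` is hypothetical) and dplyr's is_grouped_df() to check what each one hands on to the next step:

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  region   = c("N", "N", "N", "S", "S"),
  religion = c("A", "A", "B", "A", "B")
)

via_n     <- toy |> group_by(region, religion) |> summarize(n = n())
via_tally <- toy |> group_by(region, religion) |> tally()
via_count <- toy |> count(region, religion)

# The counts are the same in all three ...
identical(via_n$n, via_count$n)
identical(via_tally$n, via_count$n)

# ... but only the first two results are still grouped (by region)
is_grouped_df(via_n)
is_grouped_df(via_tally)
is_grouped_df(via_count)
```

The difference in grouping matters as soon as a mutate() with sum(n) follows.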

Pass results on to … a table

gss_sm |> 
  count(bigregion, religion) |> 
  pivot_wider(names_from = bigregion, values_from = n) |>  
  knitr::kable()  
religion    Northeast  Midwest  South  West
Protestant        158      325    650   238
Catholic          162      172    160   155
Jewish             27        3     11    10
None              112      157    170   180
Other              28       33     50    48
NA                  1        5     11     1
  • More on pivot_wider() and kable() soon …
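
In the meantime, here is a minimal sketch of what pivot_wider() does, on a made-up long table (`long` and its values are hypothetical). Each distinct value in the names_from column becomes a new column, filled from the values_from column:

```r
library(dplyr)
library(tidyr)

# Hypothetical long-format counts: one row per region-religion pair
long <- tibble(
  region   = c("North", "North", "South", "South"),
  religion = c("A", "B", "A", "B"),
  n        = c(2L, 1L, 1L, 3L)
)

# One row per religion, one column per region
wide <- long |>
  pivot_wider(names_from = region, values_from = n)

wide
```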

Pass results on to … a graph

gss_sm |> 
  group_by(bigregion, religion) |> 
  tally() |> 
  mutate(pct = round((n/sum(n))*100, 1)) |> 
  drop_na() |> 
  ggplot(mapping = aes(x = pct, y = reorder(religion, -pct), fill = religion)) + 
  geom_col() + 
    labs(x = "Percent", y = NULL) +
    guides(fill = "none") + 
    facet_wrap(~ bigregion, nrow = 1)

Check by summarizing

rel_by_region <- gss_sm |> 
  count(bigregion, religion) |> 
  mutate(pct = round((n/sum(n))*100, 1)) 

rel_by_region
# A tibble: 24 × 4
   bigregion religion       n   pct
   <fct>     <fct>      <int> <dbl>
 1 Northeast Protestant   158   5.5
 2 Northeast Catholic     162   5.7
 3 Northeast Jewish        27   0.9
 4 Northeast None         112   3.9
 5 Northeast Other         28   1  
 6 Northeast <NA>           1   0  
 7 Midwest   Protestant   325  11.3
 8 Midwest   Catholic     172   6  
 9 Midwest   Jewish         3   0.1
10 Midwest   None         157   5.5
# ℹ 14 more rows

Hm, did I sum over the right group?

## Each region should sum to ~100
rel_by_region |> 
  group_by(bigregion) |> 
  summarize(total = sum(pct)) 
# A tibble: 4 × 2
  bigregion total
  <fct>     <dbl>
1 Northeast  17  
2 Midwest    24.3
3 South      36.7
4 West       22  

No! What has gone wrong here?

Check by summarizing

rel_by_region <- gss_sm |> 
  count(bigregion, religion) |> 
  mutate(pct = round((n/sum(n))*100, 1)) 
  • count() returns ungrouped results, so there are no groups to carry forward to the mutate() step.
rel_by_region |> 
  summarize(total = sum(pct))
# A tibble: 1 × 1
  total
  <dbl>
1   100
  • With count(), the pct values here are the marginals for the whole table.
rel_by_region <- gss_sm |> 
  group_by(bigregion, religion) |> 
  tally() |> 
  mutate(pct = round((n/sum(n))*100, 1)) 
# Check
rel_by_region |> 
  group_by(bigregion) |> 
  summarize(total = sum(pct))
# A tibble: 4 × 2
  bigregion total
  <fct>     <dbl>
1 Northeast 100  
2 Midwest    99.9
3 South     100  
4 West      100. 
  • The totals are not all exactly 100 because each pct value was rounded to one decimal place before we summed them.

Two lessons

Check your tables!

  • Pipelines feed their content forward, so a mistake at one step carries through to everything after it. Make sure each result is correct.
  • Often, complex tables and graphs can be disturbingly plausible even when wrong.
  • So, figure out what the result should be and test it!
  • Starting with simple or toy cases can help with this process.
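
A toy case might look like the sketch below. The data are made up (`toy` is hypothetical) and small enough to work out by hand: region N is 75% A and 25% B, and region S is split 50/50. If the pipeline does not reproduce those numbers, something is wrong.

```r
library(dplyr)

# Hypothetical toy data with a known answer
toy <- tibble(
  region   = c("N", "N", "N", "N", "S", "S"),
  religion = c("A", "A", "A", "B", "A", "B")
)

# Percent within region: the grouping survives tally()
within_region <- toy |>
  group_by(region, religion) |>
  tally() |>
  mutate(pct = n / sum(n) * 100)

# Percent of the whole table: count() returns an ungrouped result
whole_table <- toy |>
  count(region, religion) |>
  mutate(pct = n / sum(n) * 100)

within_region$pct   # matches the hand calculation
whole_table$pct     # does not: these are shares of all six rows
```

Both versions run without complaint, which is the point: only the comparison against a known answer reveals which denominator was used.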

Two lessons

Inspect your pipes!

  • Understand pipelines by running them forward or peeling them back a step at a time.
  • This is a very effective way to understand your own and other people’s code.
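
One way to run a pipeline forward a step at a time is to give each stage a temporary name and inspect it. This is a sketch on made-up data (`toy` and the `step` names are hypothetical); in practice you can also just select and run progressively longer parts of the pipeline.

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  region   = c("N", "N", "N", "S", "S"),
  religion = c("A", "A", "B", "A", "B")
)

step1 <- toy |> group_by(region, religion)   # same rows, now grouped
step2 <- step1 |> tally()                    # one row per group
step3 <- step2 |> mutate(pct = round(n / sum(n) * 100, 1))

group_vars(step1)   # both grouping variables
group_vars(step2)   # only region remains after tally()
step3
```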

Use dplyr to make summary tables.
Then send your clean tables to ggplot.

Facets are often better than Guides

Let’s put that table in an object

rel_by_region <- gss_sm |> 
  group_by(bigregion, religion) |> 
  tally() |> 
  mutate(pct = round((n/sum(n))*100, 1)) |> 
  drop_na()


head(rel_by_region)
# A tibble: 6 × 4
# Groups:   bigregion [2]
  bigregion religion       n   pct
  <fct>     <fct>      <int> <dbl>
1 Northeast Protestant   158  32.4
2 Northeast Catholic     162  33.2
3 Northeast Jewish        27   5.5
4 Northeast None         112  23  
5 Northeast Other         28   5.7
6 Midwest   Protestant   325  46.8

We might write …

p <- ggplot(data = rel_by_region, 
                mapping = aes(x = bigregion, 
                              y = pct, 
                              fill = religion))
p_out <- p + geom_col(position = "dodge") +
    labs(x = "Region",
         y = "Percent", 
         fill = "Religion") 

Is this an effective graph? Not really!

Try faceting instead

p <- ggplot(data = rel_by_region, 
                mapping = aes(x = pct, 
                              y = reorder(religion, -pct), 
                              fill = religion))
p_out_facet <- p + geom_col() +
  guides(fill = "none") + 
  facet_wrap(~ bigregion, nrow = 1) +
  labs(x = "Percent",
       y = NULL) 
  • Putting categories on the y-axis is a very useful trick.
  • Faceting reduces the number of guides the viewer needs to consult.

Try putting categories on the y-axis. (And reorder them by x.)

Try faceting variables instead of mapping them to color or shape.

Try to minimize the need for guides and legends.

Two kinds of facet

Facet Children vs Age, by Race

p <-  ggplot(data = gss_sm,
             mapping = aes(x = age, y = childs))

p + geom_point(alpha = 0.2) + 
  geom_smooth() +
  facet_wrap(~ race)

Facet by more than one variable

p <-  ggplot(data = gss_sm,
             mapping = aes(x = age, y = childs))

p + geom_point(alpha = 0.2) + 
  geom_smooth() +
  facet_wrap(~ sex + race) 

Arrange facet_wrap() quite freely

p <-  ggplot(data = gss_sm,
             mapping = aes(x = age, y = childs))

p + geom_point(alpha = 0.2) + 
  geom_smooth() +
  facet_wrap(~ sex + race, nrow = 1) 

facet_grid() is more like a true crosstab

p + geom_point(alpha = 0.2) + 
  geom_smooth() +
  facet_grid(sex ~ race) 

Extend both to multi-way views

p_out <- p + geom_point(alpha = 0.2) + 
  geom_smooth() +
  facet_grid(bigregion ~ race + sex) 

What we’ve built up

Core Grammar

Grouped data; faceting

  • Along with a few peeks at scale transformations, guide adjustments, and theme adjustments

All basic steps

dplyr and Pipelining

The elements of filtering and summarizing

gss_sm |> 
  group_by(bigregion, religion) |> 
  tally() |> 
  mutate(freq = n / sum(n),
         pct = round((freq*100), 1)) 
# A tibble: 24 × 5
# Groups:   bigregion [4]
   bigregion religion       n    freq   pct
   <fct>     <fct>      <int>   <dbl> <dbl>
 1 Northeast Protestant   158 0.324    32.4
 2 Northeast Catholic     162 0.332    33.2
 3 Northeast Jewish        27 0.0553    5.5
 4 Northeast None         112 0.230    23  
 5 Northeast Other         28 0.0574    5.7
 6 Northeast <NA>           1 0.00205   0.2
 7 Midwest   Protestant   325 0.468    46.8
 8 Midwest   Catholic     172 0.247    24.7
 9 Midwest   Jewish         3 0.00432   0.4
10 Midwest   None         157 0.226    22.6
# ℹ 14 more rows

Example and extension: Organ Donation data

organdata is in the socviz package

organdata
# A tibble: 238 × 21
   country   year       donors   pop pop_dens   gdp gdp_lag health health_lag
   <chr>     <date>      <dbl> <int>    <dbl> <int>   <int>  <dbl>      <dbl>
 1 Australia NA          NA    17065    0.220 16774   16591   1300       1224
 2 Australia 1991-01-01  12.1  17284    0.223 17171   16774   1379       1300
 3 Australia 1992-01-01  12.4  17495    0.226 17914   17171   1455       1379
 4 Australia 1993-01-01  12.5  17667    0.228 18883   17914   1540       1455
 5 Australia 1994-01-01  10.2  17855    0.231 19849   18883   1626       1540
 6 Australia 1995-01-01  10.2  18072    0.233 21079   19849   1737       1626
 7 Australia 1996-01-01  10.6  18311    0.237 21923   21079   1846       1737
 8 Australia 1997-01-01  10.3  18518    0.239 22961   21923   1948       1846
 9 Australia 1998-01-01  10.5  18711    0.242 24148   22961   2077       1948
10 Australia 1999-01-01   8.67 18926    0.244 25445   24148   2231       2077
# ℹ 228 more rows
# ℹ 12 more variables: pubhealth <dbl>, roads <dbl>, cerebvas <int>,
#   assault <int>, external <int>, txp_pop <dbl>, world <chr>, opt <chr>,
#   consent_law <chr>, consent_practice <chr>, consistent <chr>, ccode <chr>

First look

p <- ggplot(data = organdata,
            mapping = aes(x = year, y = donors))
p + geom_point()

First look

p <- ggplot(data = organdata,
            mapping = aes(x = year, y = donors))
p + geom_line() 

First look

p <- ggplot(data = organdata,
            mapping = aes(x = year, y = donors))
p + geom_line(aes(group = country)) 

First look

p <- ggplot(data = organdata,
            mapping = aes(x = year, y = donors))
p + geom_line() + 
  facet_wrap(~ country, nrow = 3)

First look

p <- ggplot(data = organdata,
            mapping = aes(x = year, y = donors))
p + geom_line() + 
  facet_wrap(~ reorder(country, donors, na.rm = TRUE), nrow = 3)

First look

p <- ggplot(data = organdata,
            mapping = aes(x = year, y = donors))
p + geom_line() + 
  facet_wrap(~ reorder(country, -donors, na.rm = TRUE), nrow = 3)

Summarize better
with dplyr

Conditional selection

Conditionals in select() & filter()

# library(socviz)
organdata
# A tibble: 238 × 21
   country   year       donors   pop pop_dens   gdp gdp_lag health health_lag
   <chr>     <date>      <dbl> <int>    <dbl> <int>   <int>  <dbl>      <dbl>
 1 Australia NA          NA    17065    0.220 16774   16591   1300       1224
 2 Australia 1991-01-01  12.1  17284    0.223 17171   16774   1379       1300
 3 Australia 1992-01-01  12.4  17495    0.226 17914   17171   1455       1379
 4 Australia 1993-01-01  12.5  17667    0.228 18883   17914   1540       1455
 5 Australia 1994-01-01  10.2  17855    0.231 19849   18883   1626       1540
 6 Australia 1995-01-01  10.2  18072    0.233 21079   19849   1737       1626
 7 Australia 1996-01-01  10.6  18311    0.237 21923   21079   1846       1737
 8 Australia 1997-01-01  10.3  18518    0.239 22961   21923   1948       1846
 9 Australia 1998-01-01  10.5  18711    0.242 24148   22961   2077       1948
10 Australia 1999-01-01   8.67 18926    0.244 25445   24148   2231       2077
# ℹ 228 more rows
# ℹ 12 more variables: pubhealth <dbl>, roads <dbl>, cerebvas <int>,
#   assault <int>, external <int>, txp_pop <dbl>, world <chr>, opt <chr>,
#   consent_law <chr>, consent_practice <chr>, consistent <chr>, ccode <chr>

Conditionals in select() & filter()

organdata |> 
  filter(consent_law == "Informed" & donors > 15) 
# A tibble: 30 × 21
   country year       donors   pop pop_dens   gdp gdp_lag health health_lag
   <chr>   <date>      <dbl> <int>    <dbl> <int>   <int>  <dbl>      <dbl>
 1 Canada  2000-01-01   15.3 30770    0.309 28472   26658   2541       2400
 2 Denmark 1992-01-01   16.1  5171   12.0   19644   19126   1660       1603
 3 Ireland 1991-01-01   19    3534    5.03  13495   12917    884        791
 4 Ireland 1992-01-01   19.5  3558    5.06  14241   13495   1005        884
 5 Ireland 1993-01-01   17.1  3576    5.09  14927   14241   1041       1005
 6 Ireland 1994-01-01   20.3  3590    5.11  15990   14927   1119       1041
 7 Ireland 1995-01-01   24.6  3609    5.14  17789   15990   1208       1119
 8 Ireland 1996-01-01   16.8  3636    5.17  19245   17789   1269       1208
 9 Ireland 1997-01-01   20.9  3673    5.23  22017   19245   1417       1269
10 Ireland 1998-01-01   23.8  3715    5.29  23995   22017   1487       1417
# ℹ 20 more rows
# ℹ 12 more variables: pubhealth <dbl>, roads <dbl>, cerebvas <int>,
#   assault <int>, external <int>, txp_pop <dbl>, world <chr>, opt <chr>,
#   consent_law <chr>, consent_practice <chr>, consistent <chr>, ccode <chr>

Conditionals in select() & filter()

organdata |> 
  select(country, year, where(is.integer)) 
# A tibble: 238 × 8
   country   year         pop   gdp gdp_lag cerebvas assault external
   <chr>     <date>     <int> <int>   <int>    <int>   <int>    <int>
 1 Australia NA         17065 16774   16591      682      21      444
 2 Australia 1991-01-01 17284 17171   16774      647      19      425
 3 Australia 1992-01-01 17495 17914   17171      630      17      406
 4 Australia 1993-01-01 17667 18883   17914      611      18      376
 5 Australia 1994-01-01 17855 19849   18883      631      17      387
 6 Australia 1995-01-01 18072 21079   19849      592      16      371
 7 Australia 1996-01-01 18311 21923   21079      576      17      395
 8 Australia 1997-01-01 18518 22961   21923      525      17      385
 9 Australia 1998-01-01 18711 24148   22961      516      16      410
10 Australia 1999-01-01 18926 25445   24148      493      15      409
# ℹ 228 more rows

Use where() to test columns.

Conditionals in select() & filter()

When telling where() to use is.integer() to test each column, we don’t put parentheses at the end of its name. If we did, R would try to evaluate is.integer() right then, and fail:

> organdata |> 
+   select(country, year, where(is.integer()))
Error: 0 arguments passed to 'is.integer' which requires 1
Run `rlang::last_error()` to see where the error occurred.

The same rule applies anywhere a function is passed as an argument: give its name, without parentheses.
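A minimal sketch of the difference, using a made-up tibble (toy and its columns are illustrative and not part of socviz). The bare name hands the function itself to where(), which then calls it once per column. An anonymous function is passed the same way.

```r
library(dplyr)

# Hypothetical toy data, for illustration only
toy <- tibble(id = 1:3,
              score = c(2.5, 3.1, 4.0),
              label = c("a", "b", "c"))

# Bare name: where() receives the function and applies it to each column
int_cols <- toy |> 
  select(where(is.integer))

# An anonymous function is also a function object, so it works the same way
small_num <- toy |> 
  select(where(\(x) is.numeric(x) && max(x) < 4))

names(int_cols)
names(small_num)
```

In both calls the thing inside where() is a function that has not been run yet. where() does the running.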

Conditionals in select() & filter()

organdata |> 
  select(country, year, where(is.character))
# A tibble: 238 × 8
   country  year       world opt   consent_law consent_practice consistent ccode
   <chr>    <date>     <chr> <chr> <chr>       <chr>            <chr>      <chr>
 1 Austral… NA         Libe… In    Informed    Informed         Yes        Oz   
 2 Austral… 1991-01-01 Libe… In    Informed    Informed         Yes        Oz   
 3 Austral… 1992-01-01 Libe… In    Informed    Informed         Yes        Oz   
 4 Austral… 1993-01-01 Libe… In    Informed    Informed         Yes        Oz   
 5 Austral… 1994-01-01 Libe… In    Informed    Informed         Yes        Oz   
 6 Austral… 1995-01-01 Libe… In    Informed    Informed         Yes        Oz   
 7 Austral… 1996-01-01 Libe… In    Informed    Informed         Yes        Oz   
 8 Austral… 1997-01-01 Libe… In    Informed    Informed         Yes        Oz   
 9 Austral… 1998-01-01 Libe… In    Informed    Informed         Yes        Oz   
10 Austral… 1999-01-01 Libe… In    Informed    Informed         Yes        Oz   
# ℹ 228 more rows

We have predicate functions such as is.character(), is.numeric(), is.logical(), and is.factor(). Each returns a single TRUE or FALSE.

Conditionals in select() & filter()

Sometimes we don’t pass a function itself. Instead we call one and use its result:

organdata |> 
  select(country, year, starts_with("gdp")) 
# A tibble: 238 × 4
   country   year         gdp gdp_lag
   <chr>     <date>     <int>   <int>
 1 Australia NA         16774   16591
 2 Australia 1991-01-01 17171   16774
 3 Australia 1992-01-01 17914   17171
 4 Australia 1993-01-01 18883   17914
 5 Australia 1994-01-01 19849   18883
 6 Australia 1995-01-01 21079   19849
 7 Australia 1996-01-01 21923   21079
 8 Australia 1997-01-01 22961   21923
 9 Australia 1998-01-01 24148   22961
10 Australia 1999-01-01 25445   24148
# ℹ 228 more rows

We have starts_with(), ends_with(), contains(), matches(), and num_range(). Collectively these are “tidy selectors”.
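As a rough illustration of how each selector matches, here is a sketch on a one-row tibble whose column names are invented for the purpose (only gdp, gdp_lag, health, and health_lag echo organdata; the wk columns are hypothetical).

```r
library(dplyr)

# Hypothetical column names, for illustration only
toy <- tibble(gdp = 1, gdp_lag = 2, health = 3, health_lag = 4,
              wk1 = 5, wk2 = 6, wk3 = 7)

by_prefix <- toy |> select(starts_with("gdp")) |> names()          # name begins with "gdp"
by_suffix <- toy |> select(ends_with("_lag")) |> names()           # name ends with "_lag"
by_substr <- toy |> select(contains("ealt")) |> names()            # literal substring
by_regex  <- toy |> select(matches("^(gdp|health)$")) |> names()   # regular expression
by_range  <- toy |> select(num_range("wk", 1:2)) |> names()        # prefix plus numeric range

by_prefix
by_suffix
by_substr
by_regex
by_range
```

Note that contains() matches a literal string while matches() takes a regular expression, which is why the anchors ^ and $ only make sense in the latter.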

Conditionals in select() & filter()

organdata |> 
  filter(country == "Australia" | country == "Canada") 
# A tibble: 28 × 21
   country   year       donors   pop pop_dens   gdp gdp_lag health health_lag
   <chr>     <date>      <dbl> <int>    <dbl> <int>   <int>  <dbl>      <dbl>
 1 Australia NA          NA    17065    0.220 16774   16591   1300       1224
 2 Australia 1991-01-01  12.1  17284    0.223 17171   16774   1379       1300
 3 Australia 1992-01-01  12.4  17495    0.226 17914   17171   1455       1379
 4 Australia 1993-01-01  12.5  17667    0.228 18883   17914   1540       1455
 5 Australia 1994-01-01  10.2  17855    0.231 19849   18883   1626       1540
 6 Australia 1995-01-01  10.2  18072    0.233 21079   19849   1737       1626
 7 Australia 1996-01-01  10.6  18311    0.237 21923   21079   1846       1737
 8 Australia 1997-01-01  10.3  18518    0.239 22961   21923   1948       1846
 9 Australia 1998-01-01  10.5  18711    0.242 24148   22961   2077       1948
10 Australia 1999-01-01   8.67 18926    0.244 25445   24148   2231       2077
# ℹ 18 more rows
# ℹ 12 more variables: pubhealth <dbl>, roads <dbl>, cerebvas <int>,
#   assault <int>, external <int>, txp_pop <dbl>, world <chr>, opt <chr>,
#   consent_law <chr>, consent_practice <chr>, consistent <chr>, ccode <chr>

This could get cumbersome fast.

Use %in% for multiple selections

my_countries <- c("Australia", "Canada", "United States", "Ireland")

organdata |> 
  filter(country %in% my_countries) 
# A tibble: 56 × 21
   country   year       donors   pop pop_dens   gdp gdp_lag health health_lag
   <chr>     <date>      <dbl> <int>    <dbl> <int>   <int>  <dbl>      <dbl>
 1 Australia NA          NA    17065    0.220 16774   16591   1300       1224
 2 Australia 1991-01-01  12.1  17284    0.223 17171   16774   1379       1300
 3 Australia 1992-01-01  12.4  17495    0.226 17914   17171   1455       1379
 4 Australia 1993-01-01  12.5  17667    0.228 18883   17914   1540       1455
 5 Australia 1994-01-01  10.2  17855    0.231 19849   18883   1626       1540
 6 Australia 1995-01-01  10.2  18072    0.233 21079   19849   1737       1626
 7 Australia 1996-01-01  10.6  18311    0.237 21923   21079   1846       1737
 8 Australia 1997-01-01  10.3  18518    0.239 22961   21923   1948       1846
 9 Australia 1998-01-01  10.5  18711    0.242 24148   22961   2077       1948
10 Australia 1999-01-01   8.67 18926    0.244 25445   24148   2231       2077
# ℹ 46 more rows
# ℹ 12 more variables: pubhealth <dbl>, roads <dbl>, cerebvas <int>,
#   assault <int>, external <int>, txp_pop <dbl>, world <chr>, opt <chr>,
#   consent_law <chr>, consent_practice <chr>, consistent <chr>, ccode <chr>

Negating %in%

my_countries <- c("Australia", "Canada", "United States", "Ireland")

organdata |> 
  filter(!(country %in% my_countries)) 
# A tibble: 182 × 21
   country year       donors   pop pop_dens   gdp gdp_lag health health_lag
   <chr>   <date>      <dbl> <int>    <dbl> <int>   <int>  <dbl>      <dbl>
 1 Austria NA           NA    7678     9.16 18914   17425   1344       1255
 2 Austria 1991-01-01   27.6  7755     9.25 19860   18914   1419       1344
 3 Austria 1992-01-01   23.1  7841     9.35 20601   19860   1551       1419
 4 Austria 1993-01-01   26.2  7906     9.43 21119   20601   1674       1551
 5 Austria 1994-01-01   21.4  7936     9.46 21940   21119   1739       1674
 6 Austria 1995-01-01   21.5  7948     9.48 22817   21940   1865       1739
 7 Austria 1996-01-01   24.7  7959     9.49 23798   22817   1986       1865
 8 Austria 1997-01-01   19.5  7968     9.50 24364   23798   1848       1986
 9 Austria 1998-01-01   20.7  7977     9.51 25423   24364   1953       1848
10 Austria 1999-01-01   25.9  7992     9.53 26513   25423   2069       1953
# ℹ 172 more rows
# ℹ 12 more variables: pubhealth <dbl>, roads <dbl>, cerebvas <int>,
#   assault <int>, external <int>, txp_pop <dbl>, world <chr>, opt <chr>,
#   consent_law <chr>, consent_practice <chr>, consistent <chr>, ccode <chr>

Also a bit awkward. There’s no built-in “Not in” operator.

A custom operator

`%nin%` <- Negate(`%in%`) # this operator is included in the socviz package
organdata |> 
  filter(country %nin% my_countries) 
# A tibble: 182 × 21
   country year       donors   pop pop_dens   gdp gdp_lag health health_lag
   <chr>   <date>      <dbl> <int>    <dbl> <int>   <int>  <dbl>      <dbl>
 1 Austria NA           NA    7678     9.16 18914   17425   1344       1255
 2 Austria 1991-01-01   27.6  7755     9.25 19860   18914   1419       1344
 3 Austria 1992-01-01   23.1  7841     9.35 20601   19860   1551       1419
 4 Austria 1993-01-01   26.2  7906     9.43 21119   20601   1674       1551
 5 Austria 1994-01-01   21.4  7936     9.46 21940   21119   1739       1674
 6 Austria 1995-01-01   21.5  7948     9.48 22817   21940   1865       1739
 7 Austria 1996-01-01   24.7  7959     9.49 23798   22817   1986       1865
 8 Austria 1997-01-01   19.5  7968     9.50 24364   23798   1848       1986
 9 Austria 1998-01-01   20.7  7977     9.51 25423   24364   1953       1848
10 Austria 1999-01-01   25.9  7992     9.53 26513   25423   2069       1953
# ℹ 172 more rows
# ℹ 12 more variables: pubhealth <dbl>, roads <dbl>, cerebvas <int>,
#   assault <int>, external <int>, txp_pop <dbl>, world <chr>, opt <chr>,
#   consent_law <chr>, consent_practice <chr>, consistent <chr>, ccode <chr>
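A quick check of how the negated operator behaves, sketched on a made-up character vector with a missing value (the vector is an assumption for illustration; it is not drawn from organdata). The operator is redefined here only so the snippet runs without socviz loaded.

```r
# Defined here so the sketch runs without socviz loaded
`%nin%` <- Negate(`%in%`)

countries <- c("Australia", "Austria", NA)   # made-up vector with a missing value
my_countries <- c("Australia", "Canada")

is_in      <- countries %in% my_countries
not_in     <- countries %nin% my_countries
not_in_alt <- !(countries %in% my_countries)  # same result as %nin%
not_equal  <- countries != "Australia"        # comparison operators propagate NA

is_in
not_in
not_equal
```

Because %in% never returns NA, neither does %nin%. Inside filter(), which drops rows where the condition is NA, this means a missing value is kept by %nin% but dropped by a != comparison.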

Using across()

Do more than one thing

Earlier we saw this:

gss_sm |> 
  group_by(race, sex, degree) |> 
  summarize(n = n(), 
            mean_age = mean(age, na.rm = TRUE), 
            mean_kids = mean(childs, na.rm = TRUE))
# A tibble: 34 × 6
# Groups:   race, sex [6]
   race  sex    degree             n mean_age mean_kids
   <fct> <fct>  <fct>          <int>    <dbl>     <dbl>
 1 White Male   Lt High School    96     52.9      2.45
 2 White Male   High School      470     48.8      1.61
 3 White Male   Junior College    65     47.1      1.54
 4 White Male   Bachelor         208     48.6      1.35
 5 White Male   Graduate         112     56.0      1.71
 6 White Female Lt High School   101     55.4      2.81
 7 White Female High School      587     51.9      1.98
 8 White Female Junior College   101     48.2      1.91
 9 White Female Bachelor         218     49.2      1.44
10 White Female Graduate         138     53.6      1.38
# ℹ 24 more rows

Do more than one thing

Similarly for organdata we might want to do:

organdata |>  
  group_by(consent_law, country) |>
  summarize(donors_mean = mean(donors, na.rm = TRUE),
            donors_sd = sd(donors, na.rm = TRUE),
            gdp_mean = mean(gdp, na.rm = TRUE),
            health_mean = mean(health, na.rm = TRUE),
            roads_mean = mean(roads, na.rm = TRUE))
# A tibble: 17 × 7
# Groups:   consent_law [2]
   consent_law country     donors_mean donors_sd gdp_mean health_mean roads_mean
   <chr>       <chr>             <dbl>     <dbl>    <dbl>       <dbl>      <dbl>
 1 Informed    Australia          10.6     1.14    22179.       1958.      105. 
 2 Informed    Canada             14.0     0.751   23711.       2272.      109. 
 3 Informed    Denmark            13.1     1.47    23722.       2054.      102. 
 4 Informed    Germany            13.0     0.611   22163.       2349.      113. 
 5 Informed    Ireland            19.8     2.48    20824.       1480.      118. 
 6 Informed    Netherlands        13.7     1.55    23013.       1993.       76.1
 7 Informed    United Kin…        13.5     0.775   21359.       1561.       67.9
 8 Informed    United Sta…        20.0     1.33    29212.       3988.      155. 
 9 Presumed    Austria            23.5     2.42    23876.       1875.      150. 
10 Presumed    Belgium            21.9     1.94    22500.       1958.      155. 
11 Presumed    Finland            18.4     1.53    21019.       1615.       93.6
12 Presumed    France             16.8     1.60    22603.       2160.      156. 
13 Presumed    Italy              11.1     4.28    21554.       1757       122. 
14 Presumed    Norway             15.4     1.11    26448.       2217.       70.0
15 Presumed    Spain              28.1     4.96    16933        1289.      161. 
16 Presumed    Sweden             13.1     1.75    22415.       1951.       72.3
17 Presumed    Switzerland        14.2     1.71    27233        2776.       96.4

This works, but it’s really tedious. Also error-prone.

Use across()

Instead, use across() to apply a function to more than one column.

my_vars <- c("gdp", "donors", "roads")

## nested parens again, but it's worth it
organdata |> 
  group_by(consent_law, country) |>
  summarize(across(all_of(my_vars),           
                   list(avg = \(x) mean(x, na.rm = TRUE))
                  )
           )     
# A tibble: 17 × 5
# Groups:   consent_law [2]
   consent_law country        gdp_avg donors_avg roads_avg
   <chr>       <chr>            <dbl>      <dbl>     <dbl>
 1 Informed    Australia       22179.       10.6     105. 
 2 Informed    Canada          23711.       14.0     109. 
 3 Informed    Denmark         23722.       13.1     102. 
 4 Informed    Germany         22163.       13.0     113. 
 5 Informed    Ireland         20824.       19.8     118. 
 6 Informed    Netherlands     23013.       13.7      76.1
 7 Informed    United Kingdom  21359.       13.5      67.9
 8 Informed    United States   29212.       20.0     155. 
 9 Presumed    Austria         23876.       23.5     150. 
10 Presumed    Belgium         22500.       21.9     155. 
11 Presumed    Finland         21019.       18.4      93.6
12 Presumed    France          22603.       16.8     156. 
13 Presumed    Italy           21554.       11.1     122. 
14 Presumed    Norway          26448.       15.4      70.0
15 Presumed    Spain           16933        28.1     161. 
16 Presumed    Sweden          22415.       13.1      72.3
17 Presumed    Switzerland     27233        14.2      96.4

Let’s look at that again

my_vars <- c("gdp", "donors", "roads")

## nested parens again, but it's worth it
organdata
# A tibble: 238 × 21
   country   year       donors   pop pop_dens   gdp gdp_lag health health_lag
   <chr>     <date>      <dbl> <int>    <dbl> <int>   <int>  <dbl>      <dbl>
 1 Australia NA          NA    17065    0.220 16774   16591   1300       1224
 2 Australia 1991-01-01  12.1  17284    0.223 17171   16774   1379       1300
 3 Australia 1992-01-01  12.4  17495    0.226 17914   17171   1455       1379
 4 Australia 1993-01-01  12.5  17667    0.228 18883   17914   1540       1455
 5 Australia 1994-01-01  10.2  17855    0.231 19849   18883   1626       1540
 6 Australia 1995-01-01  10.2  18072    0.233 21079   19849   1737       1626
 7 Australia 1996-01-01  10.6  18311    0.237 21923   21079   1846       1737
 8 Australia 1997-01-01  10.3  18518    0.239 22961   21923   1948       1846
 9 Australia 1998-01-01  10.5  18711    0.242 24148   22961   2077       1948
10 Australia 1999-01-01   8.67 18926    0.244 25445   24148   2231       2077
# ℹ 228 more rows
# ℹ 12 more variables: pubhealth <dbl>, roads <dbl>, cerebvas <int>,
#   assault <int>, external <int>, txp_pop <dbl>, world <chr>, opt <chr>,
#   consent_law <chr>, consent_practice <chr>, consistent <chr>, ccode <chr>

Let’s look at that again

my_vars <- c("gdp", "donors", "roads")

## nested parens again, but it's worth it
organdata |>
  group_by(consent_law, country)
# A tibble: 238 × 21
# Groups:   consent_law, country [17]
   country   year       donors   pop pop_dens   gdp gdp_lag health health_lag
   <chr>     <date>      <dbl> <int>    <dbl> <int>   <int>  <dbl>      <dbl>
 1 Australia NA          NA    17065    0.220 16774   16591   1300       1224
 2 Australia 1991-01-01  12.1  17284    0.223 17171   16774   1379       1300
 3 Australia 1992-01-01  12.4  17495    0.226 17914   17171   1455       1379
 4 Australia 1993-01-01  12.5  17667    0.228 18883   17914   1540       1455
 5 Australia 1994-01-01  10.2  17855    0.231 19849   18883   1626       1540
 6 Australia 1995-01-01  10.2  18072    0.233 21079   19849   1737       1626
 7 Australia 1996-01-01  10.6  18311    0.237 21923   21079   1846       1737
 8 Australia 1997-01-01  10.3  18518    0.239 22961   21923   1948       1846
 9 Australia 1998-01-01  10.5  18711    0.242 24148   22961   2077       1948
10 Australia 1999-01-01   8.67 18926    0.244 25445   24148   2231       2077
# ℹ 228 more rows
# ℹ 12 more variables: pubhealth <dbl>, roads <dbl>, cerebvas <int>,
#   assault <int>, external <int>, txp_pop <dbl>, world <chr>, opt <chr>,
#   consent_law <chr>, consent_practice <chr>, consistent <chr>, ccode <chr>

Let’s look at that again

my_vars <- c("gdp", "donors", "roads")

## nested parens again, but it's worth it
organdata |>
  group_by(consent_law, country) |>
  summarize(across(all_of(my_vars),
                   list(avg = \(x) mean(x, na.rm = TRUE))
                  )
           )
# A tibble: 17 × 5
# Groups:   consent_law [2]
   consent_law country        gdp_avg donors_avg roads_avg
   <chr>       <chr>            <dbl>      <dbl>     <dbl>
 1 Informed    Australia       22179.       10.6     105. 
 2 Informed    Canada          23711.       14.0     109. 
 3 Informed    Denmark         23722.       13.1     102. 
 4 Informed    Germany         22163.       13.0     113. 
 5 Informed    Ireland         20824.       19.8     118. 
 6 Informed    Netherlands     23013.       13.7      76.1
 7 Informed    United Kingdom  21359.       13.5      67.9
 8 Informed    United States   29212.       20.0     155. 
 9 Presumed    Austria         23876.       23.5     150. 
10 Presumed    Belgium         22500.       21.9     155. 
11 Presumed    Finland         21019.       18.4      93.6
12 Presumed    France          22603.       16.8     156. 
13 Presumed    Italy           21554.       11.1     122. 
14 Presumed    Norway          26448.       15.4      70.0
15 Presumed    Spain           16933        28.1     161. 
16 Presumed    Sweden          22415.       13.1      72.3
17 Presumed    Switzerland     27233        14.2      96.4

Let’s look at that again

my_vars <- c("gdp", "donors", "roads")

organdata |> 
  group_by(consent_law, country) |>
  summarize(across(all_of(my_vars),           
                   list(avg = \(x) mean(x, na.rm = TRUE))
                  )
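           )

  • across() selects the columns named in my_vars.
  • We wrap the names in all_of() or any_of() to be explicit: all_of() throws an error if any name is missing, while any_of() skips missing names.
  • The list(), with entries of the form result = function, names the new columns to be calculated. Each result name is appended to the column name, as in gdp_avg.
  • The expression in the list that starts with the “waving person”, \(x), is an anonymous function. It is shorthand for function(x).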
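The points above can be sketched on a tiny made-up tibble (toy, wanted, and the column names are hypothetical). The vector of names deliberately includes one that does not exist, to show how any_of() and all_of() differ, and the last pipeline spells out the anonymous function in its long form.

```r
library(dplyr)

# Hypothetical toy data, for illustration only
toy <- tibble(grp = c("a", "a", "b"),
              x = c(1, NA, 3),
              y = c(10, 20, 30))

wanted <- c("x", "y", "z")   # "z" is deliberately not a column of toy

# any_of() quietly skips names that are missing
lenient <- toy |> 
  group_by(grp) |> 
  summarize(across(any_of(wanted),
                   list(avg = \(v) mean(v, na.rm = TRUE))))

# all_of() insists that every name exists, so this call errors
strict <- tryCatch(
  toy |> select(all_of(wanted)),
  error = \(e) "error: not all names are columns"
)

# \(v) is shorthand for function(v); this gives the same result as lenient
long_form <- toy |> 
  group_by(grp) |> 
  summarize(across(any_of(wanted),
                   list(avg = function(v) mean(v, na.rm = TRUE))))

lenient
strict
```

Which helper to use is a judgment call: all_of() is the safer default when a missing column should stop the analysis, and any_of() suits cases where the list of names is a wish list.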

We can calculate more than one thing

my_vars <- c("gdp", "donors", "roads")

organdata |> 
  group_by(consent_law, country) |>
  summarize(across(all_of(my_vars),           
                   list(avg = \(x) mean(x, na.rm = TRUE), 
                        sdev = \(x) sd(x, na.rm = TRUE), 
                        md = \(x) median(x, na.rm = TRUE)) 
                  )
           )
# A tibble: 17 × 11
# Groups:   consent_law [2]
   consent_law country  gdp_avg gdp_sdev gdp_md donors_avg donors_sdev donors_md
   <chr>       <chr>      <dbl>    <dbl>  <int>      <dbl>       <dbl>     <dbl>
 1 Informed    Austral…  22179.    3959.  21923       10.6       1.14       10.4
 2 Informed    Canada    23711.    3966.  22764       14.0       0.751      14.0
 3 Informed    Denmark   23722.    3896.  23548       13.1       1.47       12.9
 4 Informed    Germany   22163.    2501.  22164       13.0       0.611      13  
 5 Informed    Ireland   20824.    6670.  19245       19.8       2.48       19.2
 6 Informed    Netherl…  23013.    3770.  22541       13.7       1.55       13.8
 7 Informed    United …  21359.    3929.  20839       13.5       0.775      13.5
 8 Informed    United …  29212.    4571.  28772       20.0       1.33       20.1
 9 Presumed    Austria   23876.    3343.  23798       23.5       2.42       23.8
10 Presumed    Belgium   22500.    3171.  22152       21.9       1.94       21.4
11 Presumed    Finland   21019.    3668.  19842       18.4       1.53       19.4
12 Presumed    France    22603.    3260.  21990       16.8       1.60       16.6
13 Presumed    Italy     21554.    2781.  21396       11.1       4.28       11.3
14 Presumed    Norway    26448.    6492.  26218       15.4       1.11       15.4
15 Presumed    Spain     16933     2888.  16416       28.1       4.96       28  
16 Presumed    Sweden    22415.    3213.  22029       13.1       1.75       12.7
17 Presumed    Switzer…  27233     2153.  26304       14.2       1.71       14.4
# ℹ 3 more variables: roads_avg <dbl>, roads_sdev <dbl>, roads_md <dbl>

It’s OK to use the function names

my_vars <- c("gdp", "donors", "roads")

organdata |> 
  group_by(consent_law, country) |>
  summarize(across(all_of(my_vars),           
                   list(mean = \(x) mean(x, na.rm = TRUE), 
                        sd = \(x) sd(x, na.rm = TRUE), 
                        median = \(x) median(x, na.rm = TRUE)) 
                  )
           )
# A tibble: 17 × 11
# Groups:   consent_law [2]
   consent_law country        gdp_mean gdp_sd gdp_median donors_mean donors_sd
   <chr>       <chr>             <dbl>  <dbl>      <int>       <dbl>     <dbl>
 1 Informed    Australia        22179.  3959.      21923        10.6     1.14 
 2 Informed    Canada           23711.  3966.      22764        14.0     0.751
 3 Informed    Denmark          23722.  3896.      23548        13.1     1.47 
 4 Informed    Germany          22163.  2501.      22164        13.0     0.611
 5 Informed    Ireland          20824.  6670.      19245        19.8     2.48 
 6 Informed    Netherlands      23013.  3770.      22541        13.7     1.55 
 7 Informed    United Kingdom   21359.  3929.      20839        13.5     0.775
 8 Informed    United States    29212.  4571.      28772        20.0     1.33 
 9 Presumed    Austria          23876.  3343.      23798        23.5     2.42 
10 Presumed    Belgium          22500.  3171.      22152        21.9     1.94 
11 Presumed    Finland          21019.  3668.      19842        18.4     1.53 
12 Presumed    France           22603.  3260.      21990        16.8     1.60 
13 Presumed    Italy            21554.  2781.      21396        11.1     4.28 
14 Presumed    Norway           26448.  6492.      26218        15.4     1.11 
15 Presumed    Spain            16933   2888.      16416        28.1     4.96 
16 Presumed    Sweden           22415.  3213.      22029        13.1     1.75 
17 Presumed    Switzerland      27233   2153.      26304        14.2     1.71 
# ℹ 4 more variables: donors_median <dbl>, roads_mean <dbl>, roads_sd <dbl>,
#   roads_median <dbl>

Selection with across(where())

organdata |> 
  group_by(consent_law, country) |>
  summarize(across(where(is.numeric),           
                   list(mean = \(x) mean(x, na.rm = TRUE), 
                        sd = \(x) sd(x, na.rm = TRUE), 
                        median = \(x) median(x, na.rm = TRUE)) 
                  )
           ) |> 
  print(n = 3) # just to save slide space
# A tibble: 17 × 41
# Groups:   consent_law [2]
  consent_law country   donors_mean donors_sd donors_median pop_mean pop_sd
  <chr>       <chr>           <dbl>     <dbl>         <dbl>    <dbl>  <dbl>
1 Informed    Australia        10.6     1.14           10.4   18318.  831. 
2 Informed    Canada           14.0     0.751          14.0   29608. 1193. 
3 Informed    Denmark          13.1     1.47           12.9    5257.   80.6
# ℹ 14 more rows
# ℹ 34 more variables: pop_median <int>, pop_dens_mean <dbl>,
#   pop_dens_sd <dbl>, pop_dens_median <dbl>, gdp_mean <dbl>, gdp_sd <dbl>,
#   gdp_median <int>, gdp_lag_mean <dbl>, gdp_lag_sd <dbl>,
#   gdp_lag_median <dbl>, health_mean <dbl>, health_sd <dbl>,
#   health_median <dbl>, health_lag_mean <dbl>, health_lag_sd <dbl>,
#   health_lag_median <dbl>, pubhealth_mean <dbl>, pubhealth_sd <dbl>, …

Name new columns with .names

organdata |> 
  group_by(consent_law, country) |>
  summarize(across(where(is.numeric),           
                   list(mean = \(x) mean(x, na.rm = TRUE), 
                        sd = \(x) sd(x, na.rm = TRUE), 
                        median = \(x) median(x, na.rm = TRUE)
                        ),
                   .names = "{fn}_{col}"
                  )
            ) |> 
  print(n = 3) 
# A tibble: 17 × 41
# Groups:   consent_law [2]
  consent_law country   mean_donors sd_donors median_donors mean_pop sd_pop
  <chr>       <chr>           <dbl>     <dbl>         <dbl>    <dbl>  <dbl>
1 Informed    Australia        10.6     1.14           10.4   18318.  831. 
2 Informed    Canada           14.0     0.751          14.0   29608. 1193. 
3 Informed    Denmark          13.1     1.47           12.9    5257.   80.6
# ℹ 14 more rows
# ℹ 34 more variables: median_pop <int>, mean_pop_dens <dbl>,
#   sd_pop_dens <dbl>, median_pop_dens <dbl>, mean_gdp <dbl>, sd_gdp <dbl>,
#   median_gdp <int>, mean_gdp_lag <dbl>, sd_gdp_lag <dbl>,
#   median_gdp_lag <dbl>, mean_health <dbl>, sd_health <dbl>,
#   median_health <dbl>, mean_health_lag <dbl>, sd_health_lag <dbl>,
#   median_health_lag <dbl>, mean_pubhealth <dbl>, sd_pubhealth <dbl>, …

Name new columns with .names

In tidyverse functions, an argument whose name begins with a “.” usually has the dot to avoid clashing with names you supply yourself, such as column names. Other dotted names are “pronouns” that stand for, e.g., “the name of the thing we’re currently talking about as we evaluate this function”.
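A small sketch of the naming template at work, on invented data (toy and the "_of_" separator are assumptions for illustration). The dplyr documentation writes the two placeholders as {.col} and {.fn}, which are themselves dotted “pronouns” for the current column and the current function.

```r
library(dplyr)

# Hypothetical toy data, for illustration only
toy <- tibble(x = c(1, 2, 3), y = c(10, 20, 60))

out <- toy |> 
  summarize(across(c(x, y),
                   list(mean = mean, max = max),
                   .names = "{.fn}_of_{.col}"))

names(out)
```

The template is filled in once for every combination of column and function, working through all the functions for one column before moving on to the next column.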

This all works with mutate(), too

organdata |> 
  mutate(across(where(is.character), toupper)) |> 
  select(where(is.character))
# A tibble: 238 × 7
   country   world   opt   consent_law consent_practice consistent ccode
   <chr>     <chr>   <chr> <chr>       <chr>            <chr>      <chr>
 1 AUSTRALIA LIBERAL IN    INFORMED    INFORMED         YES        OZ   
 2 AUSTRALIA LIBERAL IN    INFORMED    INFORMED         YES        OZ   
 3 AUSTRALIA LIBERAL IN    INFORMED    INFORMED         YES        OZ   
 4 AUSTRALIA LIBERAL IN    INFORMED    INFORMED         YES        OZ   
 5 AUSTRALIA LIBERAL IN    INFORMED    INFORMED         YES        OZ   
 6 AUSTRALIA LIBERAL IN    INFORMED    INFORMED         YES        OZ   
 7 AUSTRALIA LIBERAL IN    INFORMED    INFORMED         YES        OZ   
 8 AUSTRALIA LIBERAL IN    INFORMED    INFORMED         YES        OZ   
 9 AUSTRALIA LIBERAL IN    INFORMED    INFORMED         YES        OZ   
10 AUSTRALIA LIBERAL IN    INFORMED    INFORMED         YES        OZ   
# ℹ 228 more rows
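By default mutate(across()) overwrites the columns it transforms, as the upper-cased output shows. If the originals should be kept, one option is to combine it with .names. The following is a sketch on a two-row toy tibble (the values echo the country means shown earlier, and the "_k" suffix is an arbitrary choice).

```r
library(dplyr)

# Hypothetical toy data, for illustration only
toy <- tibble(country = c("Spain", "Italy"),
              gdp = c(16933, 21554),
              health = c(1289, 1757))

# Rescale every numeric column to thousands, writing results to new columns
toy_scaled <- toy |> 
  mutate(across(where(is.numeric),
                \(x) x / 1000,
                .names = "{.col}_k"))

names(toy_scaled)
```

Without the .names argument the same call would replace gdp and health in place.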

Summary Data

by_country <- organdata |> 
  group_by(consent_law, country) |>
  summarize(across(where(is.numeric),           
                   list(mean = \(x) mean(x, na.rm = TRUE), 
                        sd = \(x) sd(x, na.rm = TRUE), 
                        median = \(x) median(x, na.rm = TRUE)) 
                  )
           )

by_country
# A tibble: 17 × 41
# Groups:   consent_law [2]
   consent_law country       donors_mean donors_sd donors_median pop_mean pop_sd
   <chr>       <chr>               <dbl>     <dbl>         <dbl>    <dbl>  <dbl>
 1 Informed    Australia            10.6     1.14           10.4   18318. 8.31e2
 2 Informed    Canada               14.0     0.751          14.0   29608. 1.19e3
 3 Informed    Denmark              13.1     1.47           12.9    5257. 8.06e1
 4 Informed    Germany              13.0     0.611          13     80255. 5.16e3
 5 Informed    Ireland              19.8     2.48           19.2    3674. 1.32e2
 6 Informed    Netherlands          13.7     1.55           13.8   15548. 3.73e2
 7 Informed    United Kingd…        13.5     0.775          13.5   58187. 6.26e2
 8 Informed    United States        20.0     1.33           20.1  269330. 1.25e4
 9 Presumed    Austria              23.5     2.42           23.8    7927. 1.09e2
10 Presumed    Belgium              21.9     1.94           21.4   10153. 1.09e2
11 Presumed    Finland              18.4     1.53           19.4    5112. 6.86e1
12 Presumed    France               16.8     1.60           16.6   58056. 8.51e2
13 Presumed    Italy                11.1     4.28           11.3   57360. 4.25e2
14 Presumed    Norway               15.4     1.11           15.4    4386. 9.73e1
15 Presumed    Spain                28.1     4.96           28     39666. 9.51e2
16 Presumed    Sweden               13.1     1.75           12.7    8789. 1.14e2
17 Presumed    Switzerland          14.2     1.71           14.4    7037. 1.70e2
# ℹ 34 more variables: pop_median <int>, pop_dens_mean <dbl>,
#   pop_dens_sd <dbl>, pop_dens_median <dbl>, gdp_mean <dbl>, gdp_sd <dbl>,
#   gdp_median <int>, gdp_lag_mean <dbl>, gdp_lag_sd <dbl>,
#   gdp_lag_median <dbl>, health_mean <dbl>, health_sd <dbl>,
#   health_median <dbl>, health_lag_mean <dbl>, health_lag_sd <dbl>,
#   health_lag_median <dbl>, pubhealth_mean <dbl>, pubhealth_sd <dbl>,
#   pubhealth_median <dbl>, roads_mean <dbl>, roads_sd <dbl>, …
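
The column names in by_country (donors_mean, donors_sd, and so on) come from how across() names its output. A minimal sketch on hypothetical toy data (the toy tibble and its values are made up for illustration): by default a list of named functions yields {.col}_{.fn}, and the .names argument changes the pattern.

```r
library(dplyr)

# Hypothetical toy data (not organdata), small enough to check by eye
toy <- tibble(
  grp    = c("a", "a", "b", "b"),
  donors = c(10, 12, 20, NA),
  roads  = c(100, 110, 90, 95)
)

# Default naming: column name, underscore, function name
toy_default <- toy |>
  group_by(grp) |>
  summarize(across(where(is.numeric),
                   list(mean = \(x) mean(x, na.rm = TRUE),
                        sd   = \(x) sd(x, na.rm = TRUE))))
names(toy_default)

# Custom naming via .names: function name first, then column name
toy_custom <- toy |>
  group_by(grp) |>
  summarize(across(where(is.numeric),
                   list(mean = \(x) mean(x, na.rm = TRUE)),
                   .names = "{.fn}_{.col}"))
names(toy_custom)
```

Predictable names are what let us refer to donors_mean and donors_sd directly when plotting.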

Plot our summary data

by_country |> 
  ggplot(mapping = 
           aes(x = donors_mean, 
               y = reorder(country, donors_mean),
               color = consent_law)) + 
  geom_point(size=3) +
  labs(x = "Donor Procurement Rate",
       y = NULL, 
       color = "Consent Law")

What about faceting it instead?

The problem is that each country belongs to only one Consent Law category, but with a shared y-axis every panel still lists all the countries.

by_country |> 
  ggplot(mapping = 
           aes(x = donors_mean, 
               y = reorder(country, donors_mean),
               color = consent_law)) + 
  geom_point(size=3) +
  guides(color = "none") +
  facet_wrap(~ consent_law) + 
  labs(x = "Donor Procurement Rate",
       y = NULL, 
       color = "Consent Law")

What about faceting it instead?

Restricting to one column doesn’t fix it.

by_country |> 
  ggplot(mapping = 
           aes(x = donors_mean, 
               y = reorder(country, donors_mean),
               color = consent_law)) + 
  geom_point(size=3) +
  guides(color = "none") +
  facet_wrap(~ consent_law, ncol = 1) + 
  labs(x = "Donor Procurement Rate",
       y = NULL, 
       color = "Consent Law")

Allow the y-scale to vary

Normally the point of a facet is to preserve comparability between panels by not allowing the scales to vary. But for categorical measures it can be useful to allow this.

by_country |> 
  ggplot(mapping = 
           aes(x = donors_mean, 
               y = reorder(country, donors_mean),
               color = consent_law)) + 
  geom_point(size=3) +
  guides(color = "none") +
  facet_wrap(~ consent_law, 
             ncol = 1,
             scales = "free_y") +  
  labs(x = "Donor Procurement Rate",
       y = NULL, 
       color = "Consent Law")

Again, these methods are general

by_country |> 
  ggplot(mapping = 
           aes(x = donors_mean, 
               y = reorder(country, donors_mean),
               color = consent_law)) + 
  geom_pointrange(mapping = 
                    aes(xmin = donors_mean - donors_sd, 
                        xmax = donors_mean + donors_sd)) + 
  guides(color = "none") +
  facet_wrap(~ consent_law, 
             ncol = 1,
             scales = "free_y") +  
  labs(x = "Donor Procurement Rate",
       y = NULL, 
       color = "Consent Law")

Plot text directly

geom_text() for basic labels

by_country |> 
  ggplot(mapping = aes(x = roads_mean, 
                       y = donors_mean)) + 
  geom_text(mapping = aes(label = country))

It’s not very flexible

by_country |> 
  ggplot(mapping = aes(x = roads_mean, 
                       y = donors_mean)) + 
  geom_point() + 
  geom_text(mapping = aes(label = country),
            hjust = 0)

There are tricks, but they’re limited

by_country |> 
  ggplot(mapping = aes(x = roads_mean, 
                       y = donors_mean)) + 
  geom_point() + 
  geom_text(mapping = aes(x = roads_mean + 2, 
                          label = country),
            hjust = 0)
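
Another trick of the same kind, sketched on hypothetical toy data standing in for by_country: geom_text() takes nudge_x and nudge_y arguments that shift every label by a fixed amount in data units, so the x mapping does not need rewriting. It is still one fixed offset for all labels, so it has the same limits.

```r
library(ggplot2)

# Hypothetical toy data standing in for by_country
toy <- data.frame(
  country     = c("A", "B", "C"),
  roads_mean  = c(80, 120, 160),
  donors_mean = c(12, 18, 25)
)

p_nudge <- ggplot(toy, mapping = aes(x = roads_mean,
                                     y = donors_mean)) +
  geom_point() +
  geom_text(mapping = aes(label = country),
            nudge_x = 2,   # shift labels 2 units right, in data units
            hjust = 0)     # left-align the text at the nudged position

p_nudge
```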

We’ll use ggrepel instead

The ggrepel package provides geom_text_repel() and geom_label_repel()
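
A minimal sketch of both geoms, assuming ggrepel is installed. The toy data frame and its point names are made up for illustration; the seed argument fixes the random start so the layout is reproducible.

```r
library(ggplot2)
library(ggrepel)

# Hypothetical toy data with three points crowded together
toy <- data.frame(
  x    = c(1, 1.05, 1.1, 3),
  y    = c(1, 1.02, 0.98, 3),
  name = c("Alpha", "Beta", "Gamma", "Delta")
)

p_base <- ggplot(toy, mapping = aes(x = x, y = y, label = name)) +
  geom_point()

# Plain text labels, pushed away from points and from each other
p_text <- p_base + geom_text_repel(seed = 1234)

# The same idea, with each label drawn inside a box
p_label <- p_base + geom_label_repel(seed = 1234)

p_text
p_label
```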

Example: U.S. Historic
Presidential Elections

elections_historic is in socviz

elections_historic
# A tibble: 49 × 19
   election  year winner      win_party ec_pct popular_pct popular_margin  votes
      <int> <int> <chr>       <chr>      <dbl>       <dbl>          <dbl>  <int>
 1       10  1824 John Quinc… D.-R.      0.322       0.309        -0.104  1.13e5
 2       11  1828 Andrew Jac… Dem.       0.682       0.559         0.122  6.43e5
 3       12  1832 Andrew Jac… Dem.       0.766       0.547         0.178  7.03e5
 4       13  1836 Martin Van… Dem.       0.578       0.508         0.142  7.63e5
 5       14  1840 William He… Whig       0.796       0.529         0.0605 1.28e6
 6       15  1844 James Polk  Dem.       0.618       0.495         0.0145 1.34e6
 7       16  1848 Zachary Ta… Whig       0.562       0.473         0.0479 1.36e6
 8       17  1852 Franklin P… Dem.       0.858       0.508         0.0695 1.61e6
 9       18  1856 James Buch… Dem.       0.588       0.453         0.122  1.84e6
10       19  1860 Abraham Li… Rep.       0.594       0.396         0.101  1.86e6
# ℹ 39 more rows
# ℹ 11 more variables: margin <int>, runner_up <chr>, ru_part <chr>,
#   turnout_pct <dbl>, winner_lname <chr>, winner_label <chr>, ru_lname <chr>,
#   ru_label <chr>, two_term <lgl>, ec_votes <dbl>, ec_denom <dbl>

We’ll draw a plot like this

Presidential elections

Keep things neat

## The packages we'll use in addition to ggplot
library(ggrepel) 
library(scales) 

p_title <- "Presidential Elections: Popular & Electoral College Margins"
p_subtitle <- "1824-2016"
p_caption <- "Data for 2016 are provisional."
x_label <- "Winner's share of Popular Vote"
y_label <- "Winner's share of Electoral College Votes"

Base Layer, Lines, Points

p <- ggplot(data = elections_historic, 
            mapping = aes(x = popular_pct, 
                          y = ec_pct,
                          label = winner_label))

p + geom_hline(yintercept = 0.5, 
               linewidth = 1.4, 
               color = "gray80") +
    geom_vline(xintercept = 0.5, 
               linewidth = 1.4, 
               color = "gray80") +
    geom_point()

Add the labels

This looks terrible here because geom_text_repel() uses the dimensions of the available graphics device to work out the label positions iteratively, and there isn't much room. Let’s allow it to draw on the whole slide.

p <- ggplot(data = elections_historic, 
            mapping = aes(x = popular_pct, 
                          y = ec_pct,
                          label = winner_label))

p + geom_hline(yintercept = 0.5, 
               linewidth = 1.4, color = "gray80") +
  geom_vline(xintercept = 0.5, 
             linewidth = 1.4, color = "gray80") +
  geom_point() + 
  geom_text_repel()

Labeling is with respect to the plot size

p <- ggplot(data = elections_historic, 
            mapping  = aes(x = popular_pct, 
                           y = ec_pct,
                           label = winner_label))

p_out <- p + 
  geom_hline(yintercept = 0.5, 
             linewidth = 1.4, 
             color = "gray80") +
  geom_vline(xintercept = 0.5, 
             linewidth = 1.4, 
             color = "gray80") +
  geom_point() + 
  geom_text_repel() 
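
Because the label layout depends on the device size, one way to give the labels room is to set the dimensions explicitly when saving. A sketch with hypothetical toy data and a temporary output file (the file name and the 10 x 6 inch size are arbitrary choices, not from the slides):

```r
library(ggplot2)
library(ggrepel)

# Hypothetical toy data; in the slides this would be elections_historic
toy <- data.frame(
  x    = c(0.48, 0.50, 0.51, 0.60),
  y    = c(0.55, 0.56, 0.57, 0.90),
  name = c("One", "Two", "Three", "Four")
)

p_toy <- ggplot(toy, mapping = aes(x = x, y = y, label = name)) +
  geom_point() +
  geom_text_repel(seed = 1234)

# Labels are placed when the plot is drawn, so the saved file
# uses the width and height given here, not the screen device
out_file <- file.path(tempdir(), "repel_example.pdf")
ggsave(out_file, plot = p_toy, width = 10, height = 6)
```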

Adjust the Scales

p <- ggplot(data = elections_historic, 
            mapping  = aes(x = popular_pct, 
                           y = ec_pct,
                           label = winner_label))
p_out <- p + geom_hline(yintercept = 0.5, 
                        linewidth = 1.4, 
                        color = "gray80") +
    geom_vline(xintercept = 0.5, 
               linewidth = 1.4, 
               color = "gray80") +
    geom_point() +
    geom_text_repel() +
    scale_x_continuous(labels = label_percent()) + 
    scale_y_continuous(labels = label_percent()) 
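
label_percent() comes from the scales package. It does not format anything itself; it returns a function, which the scale then applies to the axis breaks. A small sketch of what that function does on its own (the accuracy values are chosen here for illustration):

```r
library(scales)

# label_percent() builds a formatting function
fmt_whole <- label_percent(accuracy = 1)
fmt_whole(c(0.25, 0.5))

# accuracy controls rounding: 0.1 keeps one decimal place
fmt_tenth <- label_percent(accuracy = 0.1)
fmt_tenth(0.396)
```

This is why the data can stay as proportions like 0.396 while the axis reads as percentages.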

Add the labels

p <- ggplot(data = elections_historic, 
            mapping  = aes(x = popular_pct, 
                           y = ec_pct,
                           label = winner_label))
p_out <- p + geom_hline(yintercept = 0.5, 
                        linewidth = 1.4, 
                        color = "gray80") +
  geom_vline(xintercept = 0.5, 
             linewidth = 1.4, 
             color = "gray80") +
  geom_point() +
  geom_text_repel(family = "Tenso Slide") +
  scale_x_continuous(labels = label_percent()) +
  scale_y_continuous(labels = label_percent()) +
  labs(x = x_label, y = y_label,  
       title = p_title, 
       subtitle = p_subtitle,
       caption = p_caption)   

Labeling points
of interest

Option 1: On the fly in ggplot

by_country |> 
  ggplot(mapping = aes(x = gdp_mean,
                       y = health_mean)) +
  geom_point() + 
  geom_text_repel(data = subset(by_country, gdp_mean > 25000), 
                  mapping = aes(label = country))

Option 1: On the fly inside ggplot

Stuffing everything into the subset() call might get messy

by_country |> 
  ggplot(mapping = aes(x = gdp_mean,
                       y = health_mean)) +
  geom_point() + 
  geom_text_repel(data = subset(by_country, 
                                gdp_mean > 25000 |
                                  health_mean < 1500 |
                                  country %in% "Belgium"), 
                  mapping = aes(label = country))

Option 2: Use dplyr first

df_hl <- by_country |> 
  filter(gdp_mean > 25000 | 
           health_mean < 1500 | 
           country %in% "Belgium")

df_hl
# A tibble: 6 × 41
# Groups:   consent_law [2]
  consent_law country       donors_mean donors_sd donors_median pop_mean  pop_sd
  <chr>       <chr>               <dbl>     <dbl>         <dbl>    <dbl>   <dbl>
1 Informed    Ireland              19.8      2.48          19.2    3674.   132. 
2 Informed    United States        20.0      1.33          20.1  269330. 12545. 
3 Presumed    Belgium              21.9      1.94          21.4   10153.   109. 
4 Presumed    Norway               15.4      1.11          15.4    4386.    97.3
5 Presumed    Spain                28.1      4.96          28     39666.   951. 
6 Presumed    Switzerland          14.2      1.71          14.4    7037.   170. 
# ℹ 34 more variables: pop_median <int>, pop_dens_mean <dbl>,
#   pop_dens_sd <dbl>, pop_dens_median <dbl>, gdp_mean <dbl>, gdp_sd <dbl>,
#   gdp_median <int>, gdp_lag_mean <dbl>, gdp_lag_sd <dbl>,
#   gdp_lag_median <dbl>, health_mean <dbl>, health_sd <dbl>,
#   health_median <dbl>, health_lag_mean <dbl>, health_lag_sd <dbl>,
#   health_lag_median <dbl>, pubhealth_mean <dbl>, pubhealth_sd <dbl>,
#   pubhealth_median <dbl>, roads_mean <dbl>, roads_sd <dbl>, …

Option 2: Use dplyr first

This makes things neater. A geom can be fully “autonomous”. Each one can have its own mapping call and its own data source. This can be very useful when building up plots overlaying several sources or subsets of data.

by_country |> 
  ggplot(mapping = aes(x = gdp_mean,
                       y = health_mean)) +
  geom_point() + 
  geom_text_repel(data = df_hl, 
                  mapping = aes(label = country))
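
The same idea extends to any geom. A sketch on hypothetical toy data (names and values are made up): the base layer draws every row, while two further layers each get their own data argument and draw only the highlighted subset.

```r
library(ggplot2)

# Hypothetical toy data standing in for by_country
toy <- data.frame(
  country     = c("A", "B", "C", "D"),
  gdp_mean    = c(18000, 22000, 26000, 30000),
  health_mean = c(1400, 1800, 2100, 3500)
)

# The subset we want to draw attention to
toy_hl <- subset(toy, gdp_mean > 25000)

p_layers <- ggplot(toy, mapping = aes(x = gdp_mean,
                                      y = health_mean)) +
  geom_point(color = "gray60") +                  # all rows
  geom_point(data = toy_hl,                       # highlighted rows only
             color = "firebrick", size = 3) +
  geom_text(data = toy_hl,                        # labels for the same rows
            mapping = aes(label = country),
            nudge_x = 300, hjust = 0)

p_layers
```

Layers with their own data still inherit the x and y mappings from ggplot(), so only the extra label mapping has to be stated.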

Write and draw
inside the plot area

annotate() can imitate geoms

organdata |> 
  ggplot(mapping = aes(x = roads, 
                       y = donors)) + 
  geom_point() + 
  annotate(geom = "text", 
           family = "Tenso Slide",
           x = 157, 
           y = 33,
           label = "A surprisingly high\nrecovery rate.",
           hjust = 0)

annotate() can imitate geoms

organdata |> 
  ggplot(mapping = aes(x = roads, 
                       y = donors)) + 
  geom_point() +
  annotate(geom = "rect", 
           xmin = 125, xmax = 155,
           ymin = 30, ymax = 35,
           fill = "red", 
           alpha = 0.2) + 
  annotate(geom = "text", 
           x = 157, y = 33,
           family = "Tenso Slide",
           label = "A surprisingly high\nrecovery rate.", 
           hjust = 0)

Scales, Guides, and Themes

Every mapped variable has a scale

  • Aesthetic mappings link quantities or categories in your data to things you can see on the graph. Each mapping therefore has a scale associated with that representation.
  • Scale functions manage this relationship. Remember: not just x and y but also color, fill, shape, size, and alpha have scales.
  • If it can represent your data, it has a scale, and a scale function to manage it.
  • This means you control things like color schemes for data mappings through scale functions, because those colors represent features of your data.
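
A minimal sketch of that last point, on hypothetical toy data (the values and the two hex colors are arbitrary choices for illustration). The colors are set through a scale function because they stand for categories in the data.

```r
library(ggplot2)

# Hypothetical toy data with one categorical variable mapped to color
toy <- data.frame(
  roads       = c(80, 100, 120, 140),
  donors      = c(12, 15, 22, 27),
  consent_law = c("Informed", "Informed", "Presumed", "Presumed")
)

p_scale <- ggplot(toy, mapping = aes(x = roads,
                                     y = donors,
                                     color = consent_law)) +
  geom_point(size = 3) +
  # The color scheme belongs to the scale, keyed by category name
  scale_color_manual(values = c(Informed = "#E69F00",
                                Presumed = "#0072B2")) +
  labs(color = "Consent Law")

p_scale
```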

Naming conventions for scale functions

  • In general, scale functions are named like this:

scale_<MAPPING>_<KIND>()

  • We already know there are a lot of mappings

  • x, y, color, size, shape, and so on.

  • And there are many kinds of scale as well.

  • discrete, continuous, log10, date, binned, and many others.

  • So there’s a whole zoo of scale functions.

  • The naming convention helps us keep track.

Naming conventions

scale_<MAPPING>_<KIND>()

  • scale_x_continuous()
  • scale_y_continuous()
  • scale_x_discrete()
  • scale_y_discrete()
  • scale_x_log10()
  • scale_x_sqrt()
  • scale_color_discrete()
  • scale_color_gradient()
  • scale_color_gradient2()
  • scale_color_brewer()
  • scale_fill_discrete()
  • scale_fill_gradient()
  • scale_fill_gradient2()
  • scale_fill_brewer()
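
To see the zoo for yourself, you can list what ggplot2 exports. A rough sketch: it splits each function name on underscores and treats the second piece as the mapping. Not every exported name follows the convention exactly, so read the counts as approximate.

```r
library(ggplot2)

# Every exported object whose name starts with "scale_"
scale_fns <- ls("package:ggplot2", pattern = "^scale_")
length(scale_fns)

# The second piece of the name is (usually) the mapping
parts   <- strsplit(scale_fns, "_")
mapping <- vapply(parts, \(p) p[2], character(1))

# How many scale functions exist per mapping
sort(table(mapping), decreasing = TRUE)

# All the scale functions for one mapping
grep("^scale_fill_", scale_fns, value = TRUE)
```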

Scale functions in practice

  • Scale functions take arguments appropriate to their mapping and kind.

organdata |> 
  ggplot(mapping = aes(x = roads,
                       y = donors,
                       color = world)) + 
  geom_point() +
  scale_y_continuous(breaks = c(5, 15, 25),
                     labels = c("Five", 
                                "Fifteen", 
                                "Twenty Five"))

More usefully …

organdata |> 
  ggplot(mapping = aes(x = roads,
                       y = donors,
                       color = world)) + 
  geom_point() +
  scale_color_discrete(labels =
                         c("Corporatist", 
                           "Liberal",
                           "Social Democratic", 
                           "Unclassified")) +
  labs(x = "Road Deaths",
       y = "Donor Procurement",
       color = "Welfare State")

The guides() function

guides() controls overall properties of the guides, i.e. legends and similar keys. Most common use: turning one off.

organdata |> 
  ggplot(mapping = aes(x = roads,
                       y = donors,
                       color = consent_law)) + 
  geom_point() +
  facet_wrap(~ consent_law, ncol = 1) +
  guides(color = "none") + 
  labs(x = "Road Deaths",
       y = "Donor Procurement")
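
guides() can do more than switch a legend off. A sketch on hypothetical toy data, using guide_legend() to reverse the key order and enlarge the key symbols without touching the points in the plot (the specific settings are illustrative choices):

```r
library(ggplot2)

# Hypothetical toy data
toy <- data.frame(
  roads       = c(80, 100, 120, 140),
  donors      = c(12, 15, 22, 27),
  consent_law = c("Informed", "Informed", "Presumed", "Presumed")
)

p_guides <- ggplot(toy, mapping = aes(x = roads,
                                      y = donors,
                                      color = consent_law)) +
  geom_point() +
  guides(color = guide_legend(
    title = "Consent Law",
    reverse = TRUE,                  # flip the order of the keys
    override.aes = list(size = 4)    # bigger symbols in the legend only
  ))

p_guides
```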

The theme() function

theme() styles the parts of your plot that do not directly represent your data. It is often the first thing people want to adjust, but logically it’s the last thing.

## Using the "classic" ggplot theme here
organdata |> 
  ggplot(mapping = aes(x = roads,
                       y = donors,
                       color = consent_law)) + 
  geom_point() +
  labs(title = "By Consent Law",
    x = "Road Deaths",
    y = "Donor Procurement", 
    color = "Legal Regime:") + 
  theme(legend.position = "bottom", 
        plot.title = element_text(color = "darkred",
                                  face = "bold"))
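
A related sketch, assuming you start from one of ggplot2's built-in complete themes (theme_classic() is used here as an example, on hypothetical toy data). Order matters: a complete theme replaces the theme settings made before it, so add it first and put your own theme() tweaks after.

```r
library(ggplot2)

# Hypothetical toy data
toy <- data.frame(
  roads       = c(80, 100, 120, 140),
  donors      = c(12, 15, 22, 27),
  consent_law = c("Informed", "Informed", "Presumed", "Presumed")
)

p_theme <- ggplot(toy, mapping = aes(x = roads,
                                     y = donors,
                                     color = consent_law)) +
  geom_point() +
  labs(title = "By Consent Law",
       color = "Legal Regime:") +
  theme_classic() +                          # complete theme first
  theme(legend.position = "bottom",          # then the specific tweaks
        plot.title = element_text(color = "darkred",
                                  face = "bold"))

p_theme
```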