Show the Right Numbers

Soc 690S: Week 03b

Kieran Healy

Duke University

March 2025

Show the Right Numbers

Load the packages we need

library(tidyverse)      # Your friend and mine
library(gapminder)      # Gapminder data
library(here)           # Portable file paths
library(socviz)         # Handy socviz functions

`ggplot` implements a grammar of graphics

A grammar of graphics

The grammar is a set of rules for how to produce graphics from data, by mapping data to or representing it by geometric objects (like points and lines) that have aesthetic attributes (like position, color, size, and shape), together with further rules for transforming data if needed, for adjusting scales and their guides, and for projecting results onto some coordinate system.

Like other rules of syntax, the grammar
limits what you can validly say
but it doesn’t automatically make
what you say
sensible or meaningful

Grouped data and the `group` aesthetic

Try to make a lineplot

p <- ggplot(data = gapminder,
            mapping = aes(x = year,
                       y = gdpPercap))

Try to make a lineplot

p <- ggplot(data = gapminder,
            mapping = aes(x = year,
                       y = gdpPercap)) +
  geom_line()

Try to make a lineplot

p <- ggplot(data = gapminder,
            mapping = aes(x = year,
                       y = gdpPercap)) +
  geom_line()

p

Try to make a lineplot

p <- ggplot(data = gapminder,
            mapping = aes(x = year,
                       y = gdpPercap))

Try to make a lineplot

p <- ggplot(data = gapminder,
            mapping = aes(x = year,
                       y = gdpPercap)) +
  geom_line(mapping = aes(group = country))

Try to make a lineplot

p <- ggplot(data = gapminder,
            mapping = aes(x = year,
                       y = gdpPercap)) +
  geom_line(mapping = aes(group = country))

p

Try to make a lineplot

p <- ggplot(data = gapminder,
            mapping = aes(x = year,
                       y = gdpPercap)) +
  geom_line(mapping = aes(group = country))

p

Facet the plot

gapminder

# A tibble: 1,704 × 6
   country     continent  year lifeExp      pop gdpPercap
   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
 1 Afghanistan Asia       1952    28.8  8425333      779.
 2 Afghanistan Asia       1957    30.3  9240934      821.
 3 Afghanistan Asia       1962    32.0 10267083      853.
 4 Afghanistan Asia       1967    34.0 11537966      836.
 5 Afghanistan Asia       1972    36.1 13079460      740.
 6 Afghanistan Asia       1977    38.4 14880372      786.
 7 Afghanistan Asia       1982    39.9 12881816      978.
 8 Afghanistan Asia       1987    40.8 13867957      852.
 9 Afghanistan Asia       1992    41.7 16317921      649.
10 Afghanistan Asia       1997    41.8 22227415      635.
# ℹ 1,694 more rows

Faceting is very powerful

Faceting

A facet is not a geom; it’s a way of arranging repeated geoms by some additional variable
Facets use R’s “formula” syntax: facet_wrap(~ continent)
Read the ~ as “on” or “by”

Faceting

You can also use this syntax: facet_wrap(vars(continent))
This is newer, and consistent with other ways of referring to variables within tidyverse functions.

Facets in action

p <- ggplot(data = gapminder,
            mapping = aes(x = year,
                          y = gdpPercap))

p_out <- p + geom_line(color="gray70", 
              mapping=aes(group = country)) +
    geom_smooth(size = 1.1,
                method = "loess",
                se = FALSE) +
    scale_y_log10(labels=scales::label_dollar()) +
    facet_wrap(~ continent, ncol = 5) +
    labs(x = "Year",
         y = "log GDP per capita",
         title = "GDP per capita on Five Continents",
         caption = "Data: Gapminder")

One-variable summaries

The `midwest` dataset

County-level census data for Midwestern U.S. Counties

midwest

# A tibble: 437 × 28
     PID county  state  area poptotal popdensity popwhite popblack popamerindian
   <int> <chr>   <chr> <dbl>    <int>      <dbl>    <int>    <int>         <int>
 1   561 ADAMS   IL    0.052    66090      1271.    63917     1702            98
 2   562 ALEXAN… IL    0.014    10626       759      7054     3496            19
 3   563 BOND    IL    0.022    14991       681.    14477      429            35
 4   564 BOONE   IL    0.017    30806      1812.    29344      127            46
 5   565 BROWN   IL    0.018     5836       324.     5264      547            14
 6   566 BUREAU  IL    0.05     35688       714.    35157       50            65
 7   567 CALHOUN IL    0.017     5322       313.     5298        1             8
 8   568 CARROLL IL    0.027    16805       622.    16519      111            30
 9   569 CASS    IL    0.024    13437       560.    13384       16             8
10   570 CHAMPA… IL    0.058   173025      2983.   146506    16559           331
# ℹ 427 more rows
# ℹ 19 more variables: popasian <int>, popother <int>, percwhite <dbl>,
#   percblack <dbl>, percamerindan <dbl>, percasian <dbl>, percother <dbl>,
#   popadults <int>, perchsd <dbl>, percollege <dbl>, percprof <dbl>,
#   poppovertyknown <int>, percpovertyknown <dbl>, percbelowpoverty <dbl>,
#   percchildbelowpovert <dbl>, percadultpoverty <dbl>,
#   percelderlypoverty <dbl>, inmetro <int>, category <chr>

`stat_` functions behind the scenes

p <- ggplot(data = midwest, 
            mapping = aes(x = area))

p + geom_histogram()

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Here the default stat_ function for this geom has to make a choice. It is letting us know we might want to override it.

`stat_` functions behind the scenes

p <- ggplot(data = midwest, 
            mapping = aes(x = area))

p + geom_histogram(bins = 10)

We can choose either the number of bins or the binwidth

Compare two distributions

## Two state codes
oh_wi <- c("OH", "WI")

midwest |> 
  filter(state %in% oh_wi) |> 
  ggplot(mapping = aes(x = percollege, 
                       fill = state)) + 
  geom_histogram(alpha = 0.5, 
                 position = "identity")

Here we do the whole thing in a pipeline using the pipe and the dplyr verb filter() to subset rows of the data by some condition.
Experiment with leaving the position argument out, or changing it to "dodge".

geom_density()

p <- ggplot(data = midwest, 
            mapping = aes(x = area))

p + geom_density()

geom_density()

p <- ggplot(data = midwest,
            mapping = aes(x = area, 
                          fill = state, 
                          color = state))
p + geom_density(alpha = 0.3)

geom_density()

midwest |>
  filter(state %in% oh_wi) |> 
  ggplot(mapping = aes(x = area,
                       fill = state, 
                       color = state)) + 
  geom_density(mapping = aes(y = after_stat(ndensity)), 
               alpha = 0.4)

ndensity here is not in our data! It’s computed. Histogram and density geoms have default statistics, but you can ask them to do more. The after_stat functions can do this work for us.

Avoid counting up,
when necessary

Sometimes no counting is needed

titanic

      fate    sex    n percent
1 perished   male 1364    62.0
2 perished female  126     5.7
3 survived   male  367    16.7
4 survived female  344    15.6

Here we just have a summary table and want to plot a few numbers directly in a bar chart.

`geom_bar()` wants to count up

By default geom_bar() tries to count up data by category. (Really it’s the stat_count() function that does this behind the scenes.) By saying stat="identity" we explicitly tell it not to do that. This also allows us to use a y mapping. Normally this would be the result of the counting up.

p <- ggplot(data = titanic,
            mapping = aes(x = fate, 
                          y = percent, 
                          fill = sex))
p + geom_bar(stat = "identity")

`geom_bar()` stacks by default

p <- ggplot(data = titanic,
            mapping = aes(x = fate, 
                          y = percent, 
                          fill = sex))
p + geom_bar(stat = "identity", 
             position = "dodge")

Position arguments adjust whether the things drawn are placed on top of one another ("stack"), side-by-side ("dodge"), or taken as-is ("identity").

A quick `theme()` adjustment

The theme() function controls the styling of parts of the plot that don’t belong to its “grammatical” structure. That is, that are not contributing to directly representing data.

p <- ggplot(data = titanic,
            mapping = aes(x = fate, 
                          y = percent, 
                          fill = sex))
p + geom_bar(stat = "identity", 
             position = "dodge") +
  theme(legend.position = "top")

For convenience, use `geom_col()`

p <- ggplot(data = titanic,
            mapping = aes(x = fate, 
                          y = percent, 
                          fill = sex))
p + geom_col(position = "dodge") + 
  theme(legend.position = "top")

geom_col() assumes stat = "identity" by default. It’s for when you want to directly plot a table of values, rather than create a bar chart by summing over one varible categorized by another.

Using `geom_col()` for thresholds

oecd_sum

# A tibble: 57 × 5
# Groups:   year [57]
    year other   usa  diff hi_lo
   <int> <dbl> <dbl> <dbl> <chr>
 1  1960  68.6  69.9 1.30  Below
 2  1961  69.2  70.4 1.20  Below
 3  1962  68.9  70.2 1.30  Below
 4  1963  69.1  70   0.900 Below
 5  1964  69.5  70.3 0.800 Below
 6  1965  69.6  70.3 0.700 Below
 7  1966  69.9  70.3 0.400 Below
 8  1967  70.1  70.7 0.600 Below
 9  1968  70.1  70.4 0.300 Below
10  1969  70.1  70.6 0.5   Below
# ℹ 47 more rows

Data comparing U.S. average life expectancy to the rest of the OECD average.
diff is difference in years with respect to the U.S.
hi_lo is a flag saying whether the OECD is above or below the U.S.

Using geom_col() for thresholds

p <- ggplot(data = oecd_sum, 
            mapping = aes(x = year, 
                          y = diff, 
                          fill = hi_lo))

p_out <- p + geom_col() + 
  geom_hline(yintercept = 0, linewidth = 1.2) + 
  guides(fill = "none") + 
  labs(x = NULL, 
       y = "Difference in Years", 
       title = "The U.S. Life Expectancy Gap", 
       subtitle = "Difference between U.S. and 
       OECD average life expectancies, 1960-2015",
       caption = "Data: OECD.")

geom_hline() doesn’t take any data argument. It just draws a horizontal line with a given y-intercept.
x = NULL means “Don’t label the x-axis (not even with the default value, the variable name).

Show the Right Numbers

Show the Right Numbers

Load the packages we need

ggplot implements a grammar of graphics

A grammar of graphics

Grouped data and the group aesthetic

Try to make a lineplot

Try to make a lineplot

Try to make a lineplot

Try to make a lineplot

Try to make a lineplot

Try to make a lineplot

Try to make a lineplot

Facet the plot

Facet the plot

Facet the plot

Facet the plot

Faceting is very powerful

Faceting

Faceting

Facets in action

One-variable summaries

The midwest dataset

stat_ functions behind the scenes

stat_ functions behind the scenes

Compare two distributions

geom_density()

geom_density()

geom_density()

Avoid counting up,when necessary

Sometimes no counting is needed

geom_bar() wants to count up

geom_bar() stacks by default

A quick theme() adjustment

For convenience, use geom_col()

Using geom_col() for thresholds

Using geom_col() for thresholds

Using geom_col() for thresholds

`ggplot` implements a grammar of graphics

Grouped data and the `group` aesthetic

The `midwest` dataset

`stat_` functions behind the scenes

`stat_` functions behind the scenes

Avoid counting up,
when necessary

`geom_bar()` wants to count up

`geom_bar()` stacks by default

A quick `theme()` adjustment

For convenience, use `geom_col()`

Using `geom_col()` for thresholds

Using `geom_col()` for thresholds