library(tidyverse) # Your friend and mine
library(gapminder) # Gapminder data
library(here) # Portable file paths
library(socviz) # Handy socviz functions
Soc 690S: Week 03b
Duke University
January 2025
ggplot implements a grammar of graphics
The grammar is a set of rules for how to produce graphics from data, by mapping data to, or representing it by, geometric objects (like points and lines) that have aesthetic attributes (like position, color, size, and shape), together with further rules for transforming data if needed, for adjusting scales and their guides, and for projecting results onto some coordinate system.
Like other rules of syntax, the grammar limits what you can validly say, but it doesn’t automatically make what you say sensible or meaningful.
group aesthetic
# A tibble: 1,704 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779.
2 Afghanistan Asia 1957 30.3 9240934 821.
3 Afghanistan Asia 1962 32.0 10267083 853.
4 Afghanistan Asia 1967 34.0 11537966 836.
5 Afghanistan Asia 1972 36.1 13079460 740.
6 Afghanistan Asia 1977 38.4 14880372 786.
7 Afghanistan Asia 1982 39.9 12881816 978.
8 Afghanistan Asia 1987 40.8 13867957 852.
9 Afghanistan Asia 1992 41.7 16317921 649.
10 Afghanistan Asia 1997 41.8 22227415 635.
# ℹ 1,694 more rows
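As a minimal sketch of why the group aesthetic matters for this table: each country contributes several rows, so a line geom needs to be told which rows belong together. The object names p_ungrouped and p_grouped are illustrative only and do not come from the slides.

```r
library(ggplot2)
library(gapminder)

p <- ggplot(data = gapminder,
            mapping = aes(x = year, y = gdpPercap))

# No group aesthetic: geom_line() treats all rows as a single series
# and joins them in x order, which gives one jagged, meaningless line.
p_ungrouped <- p + geom_line()

# Grouping by country: one line is drawn per country.
p_grouped <- p + geom_line(mapping = aes(group = country))
```

Printing p_ungrouped and then p_grouped shows the difference. The first plot is grammatically valid but not sensible, which is the point made about the grammar above.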
facet_wrap(~ continent)
Read the ~ as “on” or “by”.
facet_wrap(vars(continent))
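The two faceting spellings can be compared side by side. This is a small sketch that assumes a simple scatterplot of lifeExp against year as the base plot; the base plot and the object names are illustrative choices, not taken from the slides.

```r
library(ggplot2)
library(gapminder)

p <- ggplot(data = gapminder,
            mapping = aes(x = year, y = lifeExp)) +
  geom_point()

# Formula notation: facet "by" continent.
p_formula <- p + facet_wrap(~ continent)

# vars() notation: the same faceting, written without a formula.
p_vars <- p + facet_wrap(vars(continent))
```

Both objects should produce the same set of panels, one per continent.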
p <- ggplot(data = gapminder,
            mapping = aes(x = year,
                          y = gdpPercap))
p_out <- p + geom_line(color = "gray70",
                       mapping = aes(group = country)) +
  geom_smooth(linewidth = 1.1,
              method = "loess",
              se = FALSE) +
  scale_y_log10(labels = scales::label_dollar()) +
  facet_wrap(~ continent, ncol = 5) +
  labs(x = "Year",
       y = "log GDP per capita",
       title = "GDP per capita on Five Continents",
       caption = "Data: Gapminder")
midwest dataset
# A tibble: 437 × 28
PID county state area poptotal popdensity popwhite popblack popamerindian
<int> <chr> <chr> <dbl> <int> <dbl> <int> <int> <int>
1 561 ADAMS IL 0.052 66090 1271. 63917 1702 98
2 562 ALEXAN… IL 0.014 10626 759 7054 3496 19
3 563 BOND IL 0.022 14991 681. 14477 429 35
4 564 BOONE IL 0.017 30806 1812. 29344 127 46
5 565 BROWN IL 0.018 5836 324. 5264 547 14
6 566 BUREAU IL 0.05 35688 714. 35157 50 65
7 567 CALHOUN IL 0.017 5322 313. 5298 1 8
8 568 CARROLL IL 0.027 16805 622. 16519 111 30
9 569 CASS IL 0.024 13437 560. 13384 16 8
10 570 CHAMPA… IL 0.058 173025 2983. 146506 16559 331
# ℹ 427 more rows
# ℹ 19 more variables: popasian <int>, popother <int>, percwhite <dbl>,
# percblack <dbl>, percamerindan <dbl>, percasian <dbl>, percother <dbl>,
# popadults <int>, perchsd <dbl>, percollege <dbl>, percprof <dbl>,
# poppovertyknown <int>, percpovertyknown <dbl>, percbelowpoverty <dbl>,
# percchildbelowpovert <dbl>, percadultpoverty <dbl>,
# percelderlypoverty <dbl>, inmetro <int>, category <chr>
stat_ functions behind the scenes
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Here the default stat_ function for this geom has to make a choice. It is letting us know we might want to override it.
stat_ functions behind the scenes
binwidth
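A short sketch of overriding the default binning, using the area column of the midwest data shown earlier. The choice of area as the variable, and the particular values bins = 10 and binwidth = 0.01, are assumptions made for illustration.

```r
library(ggplot2)

p <- ggplot(data = midwest,
            mapping = aes(x = area))

# Default: stat_bin() picks 30 bins and prints the message quoted above.
p_default <- p + geom_histogram()

# Override by asking for a specific number of bins ...
p_bins <- p + geom_histogram(bins = 10)

# ... or by fixing the width of each bin in data units.
p_width <- p + geom_histogram(binwidth = 0.01)
```

Either argument silences the message, because the stat no longer has to guess.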
Use the dplyr verb filter() to subset rows of the data by some condition.
Try leaving the position argument out, or changing it to "dodge".
ndensity here is not in our data! It’s computed. Histogram and density geoms have default statistics, but you can ask them to do more. The after_stat() function can do this work for us.
fate sex n percent
1 perished male 1364 62.0
2 perished female 126 5.7
3 survived male 367 16.7
4 survived female 344 15.6
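Returning briefly to the computed ndensity variable discussed before this table, here is a hedged sketch of how after_stat() can map it to the y axis. It assumes the midwest area column and bins = 20, neither of which is specified in the slides.

```r
library(ggplot2)

p <- ggplot(data = midwest,
            mapping = aes(x = area))

# ndensity is computed by the histogram's stat, not stored in midwest.
# after_stat() tells ggplot to look it up after the binning is done.
p_scaled <- p +
  geom_histogram(mapping = aes(y = after_stat(ndensity)),
                 bins = 20)
```

Because ndensity is the density rescaled to a maximum of 1, the tallest bar in p_scaled reaches exactly 1 on the y axis.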
geom_bar() wants to count up
geom_bar() tries to count up data by category. (Really it’s the stat_count() function that does this behind the scenes.) By saying stat="identity" we explicitly tell it not to do that. This also allows us to use a y mapping. Normally this would be the result of the counting up.
geom_bar() stacks by default
Position can be stacked ("stack"), side-by-side ("dodge"), or taken as-is ("identity").
theme() adjustment
The theme() function controls the styling of parts of the plot that don’t belong to its “grammatical” structure. That is, parts that do not directly represent data.
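The three ideas above (stat = "identity", the position argument, and a theme() adjustment) can be combined in one sketch. To keep it self-contained, the small fate-by-sex table shown earlier is typed in by hand as a data frame named titanic; the specific choices position = "dodge" and legend.position = "top" are assumptions for illustration.

```r
library(ggplot2)

# Values copied from the fate / sex / n / percent table above.
titanic <- data.frame(
  fate = c("perished", "perished", "survived", "survived"),
  sex = c("male", "female", "male", "female"),
  n = c(1364, 126, 367, 344),
  percent = c(62.0, 5.7, 16.7, 15.6)
)

p <- ggplot(data = titanic,
            mapping = aes(x = fate,
                          y = percent,
                          fill = sex))

# stat = "identity": plot the percent values as given, without counting.
# position = "dodge": place the bars for each sex side by side.
# theme(): move the legend, which is styling rather than data.
p_out <- p +
  geom_bar(stat = "identity", position = "dodge") +
  theme(legend.position = "top")
```

Changing position to "stack" or dropping the argument would stack the bars instead, since stacking is the default.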
geom_col()
geom_col() assumes stat = "identity" by default. It’s for when you want to directly plot a table of values, rather than create a bar chart by summing over one variable categorized by another.
geom_col() for thresholds
# A tibble: 57 × 5
# Groups: year [57]
year other usa diff hi_lo
<int> <dbl> <dbl> <dbl> <chr>
1 1960 68.6 69.9 1.30 Below
2 1961 69.2 70.4 1.20 Below
3 1962 68.9 70.2 1.30 Below
4 1963 69.1 70 0.900 Below
5 1964 69.5 70.3 0.800 Below
6 1965 69.6 70.3 0.700 Below
7 1966 69.9 70.3 0.400 Below
8 1967 70.1 70.7 0.600 Below
9 1968 70.1 70.4 0.300 Below
10 1969 70.1 70.6 0.5 Below
# ℹ 47 more rows
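The diff and hi_lo columns in this table look like derived values. The sketch below shows one way they could be computed, using the first three rows typed in by hand. It assumes that diff is usa minus other, which matches the printed values, and that the label for the opposite case is "Above", which does not appear in the rows shown. The name oecd_small is hypothetical.

```r
library(dplyr)

# First three rows of the table above, entered by hand.
oecd_small <- tibble(
  year = c(1960L, 1961L, 1962L),
  other = c(68.6, 69.2, 68.9),
  usa = c(69.9, 70.4, 70.2)
)

oecd_small <- oecd_small %>%
  mutate(
    # Gap in years, measured relative to the U.S. (assumed: usa - other)
    diff = usa - other,
    # Flag for whether the OECD average is below or above the U.S.
    # ("Above" is an assumed label; only "Below" is visible in the table)
    hi_lo = if_else(other < usa, "Below", "Above")
  )
```

A two-valued flag like hi_lo is what lets the fill aesthetic color bars differently on either side of a threshold.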
diff is the difference in years with respect to the U.S.
hi_lo is a flag saying whether the OECD is above or below the U.S.
p <- ggplot(data = oecd_sum,
            mapping = aes(x = year,
                          y = diff,
                          fill = hi_lo))
p_out <- p + geom_col() +
  geom_hline(yintercept = 0, linewidth = 1.2) +
  guides(fill = "none") +
  labs(x = NULL,
       y = "Difference in Years",
       title = "The U.S. Life Expectancy Gap",
       subtitle = "Difference between U.S. and\nOECD average life expectancies, 1960-2015",
       caption = "Data: OECD.")
geom_hline() doesn’t need a data argument here. It just draws a horizontal line with a given y-intercept.
x = NULL means “Don’t label the x-axis (not even with the default value, the variable name)”.