library(here) # manage file paths
library(socviz) # data and some useful functions
library(tidyverse) # your friend and mine
Soc 690S: Week 04
Duke University
February 2025
library(tidyverse)
Loading tidyverse: ggplot2 (Draw graphs)
Loading tidyverse: tibble (Nicer data tables)
Loading tidyverse: tidyr (Tidy your data)
Loading tidyverse: readr (Get data into R)
Loading tidyverse: purrr (Fancy Iteration)
Loading tidyverse: dplyr (Action verbs for tables)
forcats (Deal with factors)
haven (Import Stata, SPSS, etc)
lubridate (Dates, Durations, Times)
readxl (Import from spreadsheets)
stringr (Strings and Regular Expressions)
reprex (Make reproducible examples)
Not all of these are attached when we do library(tidyverse)
ggplot’s flow of action
Thinking in terms of layers
ggplot
Transform and summarize first.
Then send your clean tables to ggplot.
gss_sm
# A tibble: 2,867 × 32
year id ballot age childs sibs degree race sex region income16
<dbl> <dbl> <labelled> <dbl> <dbl> <labe> <fct> <fct> <fct> <fct> <fct>
1 2016 1 1 47 3 2 Bache… White Male New E… $170000…
2 2016 2 2 61 0 3 High … White Male New E… $50000 …
3 2016 3 3 72 2 3 Bache… White Male New E… $75000 …
4 2016 4 1 43 4 3 High … White Fema… New E… $170000…
5 2016 5 3 55 2 2 Gradu… White Fema… New E… $170000…
6 2016 6 2 53 2 2 Junio… White Fema… New E… $60000 …
7 2016 7 1 50 2 2 High … White Male New E… $170000…
8 2016 8 3 23 3 6 High … Other Fema… Middl… $30000 …
9 2016 9 1 45 3 5 High … Black Male Middl… $60000 …
10 2016 10 3 71 4 1 Junio… White Male Middl… $60000 …
# ℹ 2,857 more rows
# ℹ 21 more variables: relig <fct>, marital <fct>, padeg <fct>, madeg <fct>,
# partyid <fct>, polviews <fct>, happy <fct>, partners <fct>, grass <fct>,
# zodiac <fct>, pres12 <labelled>, wtssall <dbl>, income_rc <fct>,
# agegrp <fct>, ageq <fct>, siblings <fct>, kids <fct>, religion <fct>,
# bigregion <fct>, partners_rc <fct>, obama <dbl>
We often want summary tables or graphs of data like this.
bigregion | Protestant | Catholic | Jewish | None | Other | Total |
---|---|---|---|---|---|---|
Northeast | 32.4 | 33.3 | 5.5 | 23.0 | 5.7 | 100.0 |
Midwest | 47.1 | 24.9 | 0.4 | 22.8 | 4.8 | 100.0 |
South | 62.4 | 15.4 | 1.1 | 16.3 | 4.8 | 100.0 |
West | 37.7 | 24.6 | 1.6 | 28.5 | 7.6 | 100.0 |
bigregion | Protestant | Catholic | Jewish | None | Other |
---|---|---|---|---|---|
Northeast | 11.5 | 25.0 | 52.9 | 18.1 | 17.6 |
Midwest | 23.7 | 26.5 | 5.9 | 25.4 | 20.8 |
South | 47.4 | 24.7 | 21.6 | 27.5 | 31.4 |
West | 17.4 | 23.9 | 19.6 | 29.1 | 30.2 |
Total | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 |
bigregion | Protestant | Catholic | Jewish | None | Other |
---|---|---|---|---|---|
Northeast | 5.5 | 5.7 | 0.9 | 3.9 | 1.0 |
Midwest | 11.4 | 6.0 | 0.1 | 5.5 | 1.2 |
South | 22.8 | 5.6 | 0.4 | 6.0 | 1.8 |
West | 8.4 | 5.4 | 0.4 | 6.3 | 1.7 |
We use the pipe operator, |>, to chain together sequences of actions on our tables.
dplyr draws on the logic and language of database queries.
Group the data at the level we want, such as “Religion by Region” or “Children by School”.
Subset either the rows or columns of our table, i.e. remove them before doing anything.
Mutate the data. That is, change something at the current level of grouping. Mutating adds new columns to the table, or changes the content of an existing column. It never changes the number of rows.
Summarize or aggregate the data. That is, make something new at a higher level of grouping. E.g., calculate means or counts by some grouping variable. This will generally result in a smaller, summary table. Usually this will have the same number of rows as there are groups being summarized.
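As a rough sketch, the four kinds of action can be tried out on a small made-up table. The `schools` tibble and its column names are hypothetical and introduced only for illustration; they are not part of the course data.

```r
library(dplyr)

# Hypothetical toy data: one row per child (not from the course data)
schools <- tibble(
  school = c("A", "A", "A", "B", "B"),
  child  = c("Ann", "Bo", "Cy", "Di", "Ed"),
  score  = c(80, 90, 70, 60, 100)
)

# Subset: keep some rows and some columns
high_scores <- schools |>
  filter(score >= 70) |>
  select(school, score)

# Group, then mutate: a column is added, the number of rows is unchanged
centered <- schools |>
  group_by(school) |>
  mutate(diff_from_school_mean = score - mean(score))

# Group, then summarize: one row per group
by_school <- schools |>
  group_by(school) |>
  summarize(n_children = n(), mean_score = mean(score))
```

The mutated table keeps all five rows, while the summary table has one row for each school.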
Group with group_by().
Subset with filter() for rows and select() for columns.
Mutate with mutate().
Summarize with summarize().
gss_sm
# A tibble: 2,867 × 32
year id ballot age childs sibs degree race sex region income16
<dbl> <dbl> <labelled> <dbl> <dbl> <labe> <fct> <fct> <fct> <fct> <fct>
1 2016 1 1 47 3 2 Bache… White Male New E… $170000…
2 2016 2 2 61 0 3 High … White Male New E… $50000 …
3 2016 3 3 72 2 3 Bache… White Male New E… $75000 …
4 2016 4 1 43 4 3 High … White Fema… New E… $170000…
5 2016 5 3 55 2 2 Gradu… White Fema… New E… $170000…
6 2016 6 2 53 2 2 Junio… White Fema… New E… $60000 …
7 2016 7 1 50 2 2 High … White Male New E… $170000…
8 2016 8 3 23 3 6 High … Other Fema… Middl… $30000 …
9 2016 9 1 45 3 5 High … Black Male Middl… $60000 …
10 2016 10 3 71 4 1 Junio… White Male Middl… $60000 …
# ℹ 2,857 more rows
# ℹ 21 more variables: relig <fct>, marital <fct>, padeg <fct>, madeg <fct>,
# partyid <fct>, polviews <fct>, happy <fct>, partners <fct>, grass <fct>,
# zodiac <fct>, pres12 <labelled>, wtssall <dbl>, income_rc <fct>,
# agegrp <fct>, ageq <fct>, siblings <fct>, kids <fct>, religion <fct>,
# bigregion <fct>, partners_rc <fct>, obama <dbl>
Notice how the tibble already tells us a lot.
# A tibble: 2,867 × 3
id bigregion religion
<dbl> <fct> <fct>
1 1 Northeast None
2 2 Northeast None
3 3 Northeast Catholic
4 4 Northeast Catholic
5 5 Northeast None
6 6 Northeast None
7 7 Northeast None
8 8 Northeast Catholic
9 9 Northeast Protestant
10 10 Northeast None
# ℹ 2,857 more rows
We’re just taking a look at the relevant columns here.
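The call behind the three-column table is not shown here. A minimal sketch that would give a table of this shape, assuming the `socviz` package is installed, might be:

```r
library(dplyr)
library(socviz)  # provides gss_sm

# Keep only the three columns of interest; every row is retained
gss_cols <- gss_sm |>
  select(id, bigregion, religion)

gss_cols
```

The name `gss_cols` is just a label chosen for this example.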
# A tibble: 2,867 × 32
# Groups: bigregion [4]
year id ballot age childs sibs degree race sex region income16
<dbl> <dbl> <labelled> <dbl> <dbl> <labe> <fct> <fct> <fct> <fct> <fct>
1 2016 1 1 47 3 2 Bache… White Male New E… $170000…
2 2016 2 2 61 0 3 High … White Male New E… $50000 …
3 2016 3 3 72 2 3 Bache… White Male New E… $75000 …
4 2016 4 1 43 4 3 High … White Fema… New E… $170000…
5 2016 5 3 55 2 2 Gradu… White Fema… New E… $170000…
6 2016 6 2 53 2 2 Junio… White Fema… New E… $60000 …
7 2016 7 1 50 2 2 High … White Male New E… $170000…
8 2016 8 3 23 3 6 High … Other Fema… Middl… $30000 …
9 2016 9 1 45 3 5 High … Black Male Middl… $60000 …
10 2016 10 3 71 4 1 Junio… White Male Middl… $60000 …
# ℹ 2,857 more rows
# ℹ 21 more variables: relig <fct>, marital <fct>, padeg <fct>, madeg <fct>,
# partyid <fct>, polviews <fct>, happy <fct>, partners <fct>, grass <fct>,
# zodiac <fct>, pres12 <labelled>, wtssall <dbl>, income_rc <fct>,
# agegrp <fct>, ageq <fct>, siblings <fct>, kids <fct>, religion <fct>,
# bigregion <fct>, partners_rc <fct>, obama <dbl>
Grouping just changes the logical structure of the tibble.
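One way to see this for yourself is to compare the table before and after grouping. The sketch below assumes `socviz` is installed; `by_region` is a name chosen for illustration.

```r
library(dplyr)
library(socviz)

by_region <- gss_sm |>
  group_by(bigregion)

# Same dimensions as the original table
dim(gss_sm)
dim(by_region)

# But the grouped version carries grouping metadata
group_vars(by_region)  # the grouping column(s)
n_groups(by_region)    # how many groups there are

# ungroup() removes the grouping again
group_vars(ungroup(by_region))
```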
# A tibble: 4 × 2
bigregion total
<fct> <int>
1 Northeast 488
2 Midwest 695
3 South 1052
4 West 632
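The pipeline for the four-row summary is not shown above. A sketch consistent with that output, assuming `socviz` is installed, could look like this:

```r
library(dplyr)
library(socviz)

region_totals <- gss_sm |>
  group_by(bigregion) |>
  summarize(total = n())  # n() takes no arguments; it counts rows in the current group

region_totals

# gss_sm itself is only changed if we assign a result back to it
dim(gss_sm)
```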
n() counts up the rows within each group.
The gss_sm table is untouched.
# A tibble: 2,867 × 32
# Groups: bigregion, religion [24]
year id ballot age childs sibs degree race sex region income16
<dbl> <dbl> <labelled> <dbl> <dbl> <labe> <fct> <fct> <fct> <fct> <fct>
1 2016 1 1 47 3 2 Bache… White Male New E… $170000…
2 2016 2 2 61 0 3 High … White Male New E… $50000 …
3 2016 3 3 72 2 3 Bache… White Male New E… $75000 …
4 2016 4 1 43 4 3 High … White Fema… New E… $170000…
5 2016 5 3 55 2 2 Gradu… White Fema… New E… $170000…
6 2016 6 2 53 2 2 Junio… White Fema… New E… $60000 …
7 2016 7 1 50 2 2 High … White Male New E… $170000…
8 2016 8 3 23 3 6 High … Other Fema… Middl… $30000 …
9 2016 9 1 45 3 5 High … Black Male Middl… $60000 …
10 2016 10 3 71 4 1 Junio… White Male Middl… $60000 …
# ℹ 2,857 more rows
# ℹ 21 more variables: relig <fct>, marital <fct>, padeg <fct>, madeg <fct>,
# partyid <fct>, polviews <fct>, happy <fct>, partners <fct>, grass <fct>,
# zodiac <fct>, pres12 <labelled>, wtssall <dbl>, income_rc <fct>,
# agegrp <fct>, ageq <fct>, siblings <fct>, kids <fct>, religion <fct>,
# bigregion <fct>, partners_rc <fct>, obama <dbl>
# A tibble: 24 × 3
# Groups: bigregion [4]
bigregion religion total
<fct> <fct> <int>
1 Northeast Protestant 158
2 Northeast Catholic 162
3 Northeast Jewish 27
4 Northeast None 112
5 Northeast Other 28
6 Northeast <NA> 1
7 Midwest Protestant 325
8 Midwest Catholic 172
9 Midwest Jewish 3
10 Midwest None 157
# ℹ 14 more rows
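The 24-row table has one row per region and religion combination, and is still grouped by `bigregion`. A sketch of a pipeline that behaves this way is below. The second version, using `.groups = "drop"`, is an addition for comparison and is not something the slides use.

```r
library(dplyr)
library(socviz)

rel_totals <- gss_sm |>
  group_by(bigregion, religion) |>
  summarize(total = n())

# By default summarize() drops the last grouping variable,
# so the result is still grouped by bigregion
group_vars(rel_totals)

# For comparison: ask for an ungrouped result instead
rel_totals_flat <- gss_sm |>
  group_by(bigregion, religion) |>
  summarize(total = n(), .groups = "drop")

group_vars(rel_totals_flat)
```

Knowing which grouping survives matters for any later `mutate()` step, because functions like `sum()` are then evaluated within the remaining groups.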
n() counts up the rows within the innermost (i.e. the rightmost) group.
gss_sm |>
  group_by(bigregion, religion) |>
  summarize(total = n()) |>
  mutate(freq = total / sum(total),
         pct = round((freq*100), 1))
# A tibble: 24 × 5
# Groups: bigregion [4]
bigregion religion total freq pct
<fct> <fct> <int> <dbl> <dbl>
1 Northeast Protestant 158 0.324 32.4
2 Northeast Catholic 162 0.332 33.2
3 Northeast Jewish 27 0.0553 5.5
4 Northeast None 112 0.230 23
5 Northeast Other 28 0.0574 5.7
6 Northeast <NA> 1 0.00205 0.2
7 Midwest Protestant 325 0.468 46.8
8 Midwest Catholic 172 0.247 24.7
9 Midwest Jewish 3 0.00432 0.4
10 Midwest None 157 0.226 22.6
# ℹ 14 more rows
mutate() is quite clever. See how we can immediately use freq, even though we are creating it in the same mutate() expression.
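A small made-up example shows the same behavior in isolation. The `toy` table and its values are hypothetical.

```r
library(dplyr)

# Hypothetical toy counts, grouped by region
toy <- tibble(
  region = c("North", "North", "South", "South"),
  total  = c(30, 70, 20, 60)
) |>
  group_by(region)

toy_pct <- toy |>
  mutate(freq = total / sum(total),    # sum() is taken within each region
         pct  = round(freq * 100, 1))  # freq is available as soon as it is defined

toy_pct
```

Because the table is grouped, `sum(total)` is 100 for North and 80 for South, so the percentages add up to 100 within each region.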
We’re going to be doing this group_by() … n() step a lot. Some shorthand for it would be useful.
n()
# A tibble: 24 × 3
# Groups: bigregion [4]
bigregion religion n
<fct> <fct> <int>
1 Northeast Protestant 158
2 Northeast Catholic 162
3 Northeast Jewish 27
4 Northeast None 112
5 Northeast Other 28
6 Northeast <NA> 1
7 Midwest Protestant 325
8 Midwest Catholic 172
9 Midwest Jewish 3
10 Midwest None 157
# ℹ 14 more rows
tally()
# A tibble: 24 × 3
# Groups: bigregion [4]
bigregion religion n
<fct> <fct> <int>
1 Northeast Protestant 158
2 Northeast Catholic 162
3 Northeast Jewish 27
4 Northeast None 112
5 Northeast Other 28
6 Northeast <NA> 1
7 Midwest Protestant 325
8 Midwest Catholic 172
9 Midwest Jewish 3
10 Midwest None 157
# ℹ 14 more rows
count()
# A tibble: 24 × 3
bigregion religion n
<fct> <fct> <int>
1 Northeast Protestant 158
2 Northeast Catholic 162
3 Northeast Jewish 27
4 Northeast None 112
5 Northeast Other 28
6 Northeast <NA> 1
7 Midwest Protestant 325
8 Midwest Catholic 172
9 Midwest Jewish 3
10 Midwest None 157
# ℹ 14 more rows
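The code for the three tables labelled n(), tally(), and count() is not shown. Pipelines along the following lines would give tables of those shapes, assuming `socviz` is installed. The object names are chosen for this example.

```r
library(dplyr)
library(socviz)

# 1. Spelled out with n()
with_n <- gss_sm |>
  group_by(bigregion, religion) |>
  summarize(n = n())

# 2. tally() replaces the summarize(n = n()) step
with_tally <- gss_sm |>
  group_by(bigregion, religion) |>
  tally()

# 3. count() groups and counts in one step, and returns an ungrouped table
with_count <- gss_sm |>
  count(bigregion, religion)

group_vars(with_tally)  # still grouped by bigregion
group_vars(with_count)  # no groups
```

The counts are the same in all three. What differs is whether any grouping is carried forward to later steps.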
religion | Northeast | Midwest | South | West |
---|---|---|---|---|
Protestant | 158 | 325 | 650 | 238 |
Catholic | 162 | 172 | 160 | 155 |
Jewish | 27 | 3 | 11 | 10 |
None | 112 | 157 | 170 | 180 |
Other | 28 | 33 | 50 | 48 |
NA | 1 | 5 | 11 | 1 |
More on pivot_wider() and kable() soon …
rel_by_region <- gss_sm |>
  count(bigregion, religion) |>
  mutate(pct = round((n/sum(n))*100, 1))
rel_by_region
# A tibble: 24 × 4
bigregion religion n pct
<fct> <fct> <int> <dbl>
1 Northeast Protestant 158 5.5
2 Northeast Catholic 162 5.7
3 Northeast Jewish 27 0.9
4 Northeast None 112 3.9
5 Northeast Other 28 1
6 Northeast <NA> 1 0
7 Midwest Protestant 325 11.3
8 Midwest Catholic 172 6
9 Midwest Jewish 3 0.1
10 Midwest None 157 5.5
# ℹ 14 more rows
Hm, did I sum over the right group?
count() returns ungrouped results, so there are no groups to carry forward to the mutate() step.
With count(), the pct values here are the marginals for the whole table.
# A tibble: 4 × 2
bigregion total
<fct> <dbl>
1 Northeast 100
2 Midwest 99.9
3 South 100
4 West 100.
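A check of this kind can be sketched by summing `pct` within each region. The code below is an illustration with object names of our own choosing. It compares percentages computed within region against percentages computed over the whole table.

```r
library(dplyr)
library(socviz)

# Percentages within region: groups are carried forward from tally()
pct_within <- gss_sm |>
  group_by(bigregion, religion) |>
  tally() |>
  mutate(pct = round((n / sum(n)) * 100, 1))

# Percentages over the whole table: count() returns no groups
pct_overall <- gss_sm |>
  count(bigregion, religion) |>
  mutate(pct = round((n / sum(n)) * 100, 1))

# Check: sum pct within each region
check_within <- pct_within |>
  group_by(bigregion) |>
  summarize(total = sum(pct))

check_overall <- pct_overall |>
  group_by(bigregion) |>
  summarize(total = sum(pct))

check_within
check_overall
```

If the within-region sums are all close to 100, the percentages were computed over the intended group. If they instead add up to about 100 only across all regions together, they are whole-table marginals.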
We could have used round() after summing originally.
Use dplyr to make summary tables.
Then send your clean tables to ggplot.
rel_by_region <- gss_sm |>
  group_by(bigregion, religion) |>
  tally() |>
  mutate(pct = round((n/sum(n))*100, 1)) |>
  drop_na()
head(rel_by_region)
# A tibble: 6 × 4
# Groups: bigregion [2]
bigregion religion n pct
<fct> <fct> <int> <dbl>
1 Northeast Protestant 158 32.4
2 Northeast Catholic 162 33.2
3 Northeast Jewish 27 5.5
4 Northeast None 112 23
5 Northeast Other 28 5.7
6 Midwest Protestant 325 46.8
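To illustrate the second step, here is one way the summary table might be handed to ggplot. The choice of a dodged column chart and the axis labels are assumptions made for this sketch, not necessarily the figure used in the slides.

```r
library(dplyr)
library(tidyr)    # drop_na()
library(ggplot2)
library(socviz)

rel_by_region <- gss_sm |>
  group_by(bigregion, religion) |>
  tally() |>
  mutate(pct = round((n / sum(n)) * 100, 1)) |>
  drop_na()

# One possible plot of the summary table
p <- ggplot(rel_by_region,
            aes(x = bigregion, y = pct, fill = religion)) +
  geom_col(position = "dodge2") +
  labs(x = "Region", y = "Percent", fill = "Religion")

p
```

The plotting code stays short because all the counting and percentage work was done in the dplyr pipeline first.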
What we’ve built up
Core grammar
All basic steps
dplyr and Pipelining
The elements of filtering and summarizing
gss_sm |>
  group_by(bigregion, religion) |>
  tally() |>
  mutate(freq = n / sum(n),
         pct = round((freq*100), 1))
# A tibble: 24 × 5
# Groups: bigregion [4]
bigregion religion n freq pct
<fct> <fct> <int> <dbl> <dbl>
1 Northeast Protestant 158 0.324 32.4
2 Northeast Catholic 162 0.332 33.2
3 Northeast Jewish 27 0.0553 5.5
4 Northeast None 112 0.230 23
5 Northeast Other 28 0.0574 5.7
6 Northeast <NA> 1 0.00205 0.2
7 Midwest Protestant 325 0.468 46.8
8 Midwest Catholic 172 0.247 24.7
9 Midwest Jewish 3 0.00432 0.4
10 Midwest None 157 0.226 22.6
# ℹ 14 more rows
organdata is in the socviz package
# A tibble: 238 × 21
country year donors pop pop_dens gdp gdp_lag health health_lag
<chr> <date> <dbl> <int> <dbl> <int> <int> <dbl> <dbl>
1 Australia NA NA 17065 0.220 16774 16591 1300 1224
2 Australia 1991-01-01 12.1 17284 0.223 17171 16774 1379 1300
3 Australia 1992-01-01 12.4 17495 0.226 17914 17171 1455 1379
4 Australia 1993-01-01 12.5 17667 0.228 18883 17914 1540 1455
5 Australia 1994-01-01 10.2 17855 0.231 19849 18883 1626 1540
6 Australia 1995-01-01 10.2 18072 0.233 21079 19849 1737 1626
7 Australia 1996-01-01 10.6 18311 0.237 21923 21079 1846 1737
8 Australia 1997-01-01 10.3 18518 0.239 22961 21923 1948 1846
9 Australia 1998-01-01 10.5 18711 0.242 24148 22961 2077 1948
10 Australia 1999-01-01 8.67 18926 0.244 25445 24148 2231 2077
# ℹ 228 more rows
# ℹ 12 more variables: pubhealth <dbl>, roads <dbl>, cerebvas <int>,
# assault <int>, external <int>, txp_pop <dbl>, world <chr>, opt <chr>,
# consent_law <chr>, consent_practice <chr>, consistent <chr>, ccode <chr>
dplyr
select() & filter()
# A tibble: 30 × 21
country year donors pop pop_dens gdp gdp_lag health health_lag
<chr> <date> <dbl> <int> <dbl> <int> <int> <dbl> <dbl>
1 Canada 2000-01-01 15.3 30770 0.309 28472 26658 2541 2400
2 Denmark 1992-01-01 16.1 5171 12.0 19644 19126 1660 1603
3 Ireland 1991-01-01 19 3534 5.03 13495 12917 884 791
4 Ireland 1992-01-01 19.5 3558 5.06 14241 13495 1005 884
5 Ireland 1993-01-01 17.1 3576 5.09 14927 14241 1041 1005
6 Ireland 1994-01-01 20.3 3590 5.11 15990 14927 1119 1041
7 Ireland 1995-01-01 24.6 3609 5.14 17789 15990 1208 1119
8 Ireland 1996-01-01 16.8 3636 5.17 19245 17789 1269 1208
9 Ireland 1997-01-01 20.9 3673 5.23 22017 19245 1417 1269
10 Ireland 1998-01-01 23.8 3715 5.29 23995 22017 1487 1417
# ℹ 20 more rows
# ℹ 12 more variables: pubhealth <dbl>, roads <dbl>, cerebvas <int>,
# assault <int>, external <int>, txp_pop <dbl>, world <chr>, opt <chr>,
# consent_law <chr>, consent_practice <chr>, consistent <chr>, ccode <chr>
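The filtering call behind the 30-row table is not shown. Judging from the visible rows, a condition along the following lines would be consistent with it. The exact condition is an assumption.

```r
library(dplyr)
library(socviz)  # provides organdata

# Assumed filter: both conditions must hold for a row to be kept
informed_high <- organdata |>
  filter(consent_law == "Informed" & donors > 15)

informed_high
```

Rows where `donors` is missing are dropped as well, because `filter()` only keeps rows where the condition evaluates to TRUE.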
# A tibble: 238 × 8
country year pop gdp gdp_lag cerebvas assault external
<chr> <date> <int> <int> <int> <int> <int> <int>
1 Australia NA 17065 16774 16591 682 21 444
2 Australia 1991-01-01 17284 17171 16774 647 19 425
3 Australia 1992-01-01 17495 17914 17171 630 17 406
4 Australia 1993-01-01 17667 18883 17914 611 18 376
5 Australia 1994-01-01 17855 19849 18883 631 17 387
6 Australia 1995-01-01 18072 21079 19849 592 16 371
7 Australia 1996-01-01 18311 21923 21079 576 17 395
8 Australia 1997-01-01 18518 22961 21923 525 17 385
9 Australia 1998-01-01 18711 24148 22961 516 16 410
10 Australia 1999-01-01 18926 25445 24148 493 15 409
# ℹ 228 more rows
Use where() to test columns.
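A sketch of what such a call might look like, assuming `socviz` is installed. The object names are our own.

```r
library(dplyr)
library(socviz)

# where() applies the test to every column and keeps those returning TRUE
int_cols <- organdata |>
  select(country, year, where(is.integer))

# The same pattern with a different test keeps the character columns
chr_cols <- organdata |>
  select(country, year, where(is.character))

names(int_cols)
names(chr_cols)
```

Columns named explicitly, like `country` and `year`, are kept whether or not they pass the test.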
When telling where() to use is.integer() to test each column, we don’t put parentheses at the end of its name. If we did, R would try to evaluate is.integer() right then, and fail:
> organdata |>
+ select(country, year, where(is.integer()))
Error: 0 arguments passed to 'is.integer' which requires 1
Run `rlang::last_error()` to see where the error occurred.
This is true in similar situations elsewhere as well.
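The distinction between naming a function and calling it can be seen on a small made-up table. The `toy` tibble is hypothetical.

```r
library(dplyr)

# Hypothetical toy table
toy <- tibble(id = 1:3, name = c("a", "b", "c"), value = c(1.5, 2.5, 3.5))

# Pass the function itself: no parentheses after its name
toy |> select(where(is.numeric))

# If more logic is needed, wrap the call in an anonymous function.
# Here x stands for each column in turn.
num_not_int <- toy |>
  select(where(\(x) is.numeric(x) && !is.integer(x)))

# The same idea applies outside dplyr, e.g. with sapply()
col_is_int <- sapply(toy, is.integer)
col_is_int
```

In each case the function is handed over as an object, and it is only called later, once per column.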
# A tibble: 238 × 8
country year world opt consent_law consent_practice consistent ccode
<chr> <date> <chr> <chr> <chr> <chr> <chr> <chr>
1 Austral… NA Libe… In Informed Informed Yes Oz
2 Austral… 1991-01-01 Libe… In Informed Informed Yes Oz
3 Austral… 1992-01-01 Libe… In Informed Informed Yes Oz
4 Austral… 1993-01-01 Libe… In Informed Informed Yes Oz
5 Austral… 1994-01-01 Libe… In Informed Informed Yes Oz
6 Austral… 1995-01-01 Libe… In Informed Informed Yes Oz
7 Austral… 1996-01-01 Libe… In Informed Informed Yes Oz
8 Austral… 1997-01-01 Libe… In Informed Informed Yes Oz
9 Austral… 1998-01-01 Libe… In Informed Informed Yes Oz
10 Austral… 1999-01-01 Libe… In Informed Informed Yes Oz
# ℹ 228 more rows
We have functions like is.character(), is.numeric(), is.logical(), is.factor(), etc. All return either TRUE or FALSE.
Sometimes we don’t pass a function, but do want to use the result of one:
# A tibble: 238 × 4
country year gdp gdp_lag
<chr> <date> <int> <int>
1 Australia NA 16774 16591
2 Australia 1991-01-01 17171 16774
3 Australia 1992-01-01 17914 17171
4 Australia 1993-01-01 18883 17914
5 Australia 1994-01-01 19849 18883
6 Australia 1995-01-01 21079 19849
7 Australia 1996-01-01 21923 21079
8 Australia 1997-01-01 22961 21923
9 Australia 1998-01-01 24148 22961
10 Australia 1999-01-01 25445 24148
# ℹ 228 more rows
We have starts_with(), ends_with(), contains(), matches(), and num_range(). Collectively these are “tidy selectors”.
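A sketch of how the four-column table could be produced with one of these selectors, assuming `socviz` is installed. The `ends_with()` call is added only for comparison.

```r
library(dplyr)
library(socviz)

# starts_with() is called here, and its result tells select() which columns to keep
gdp_cols <- organdata |>
  select(country, year, starts_with("gdp"))

# Other selection helpers work the same way
lag_cols <- organdata |>
  select(country, year, ends_with("lag"))

names(gdp_cols)
names(lag_cols)
```

Unlike the `where(is.integer)` case, these helpers are written with parentheses, because we want the result of calling them rather than the function itself.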
# A tibble: 28 × 21
country year donors pop pop_dens gdp gdp_lag health health_lag
<chr> <date> <dbl> <int> <dbl> <int> <int> <dbl> <dbl>
1 Australia NA NA 17065 0.220 16774 16591 1300 1224
2 Australia 1991-01-01 12.1 17284 0.223 17171 16774 1379 1300
3 Australia 1992-01-01 12.4 17495 0.226 17914 17171 1455 1379
4 Australia 1993-01-01 12.5 17667 0.228 18883 17914 1540 1455
5 Australia 1994-01-01 10.2 17855 0.231 19849 18883 1626 1540
6 Australia 1995-01-01 10.2 18072 0.233 21079 19849 1737 1626
7 Australia 1996-01-01 10.6 18311 0.237 21923 21079 1846 1737
8 Australia 1997-01-01 10.3 18518 0.239 22961 21923 1948 1846
9 Australia 1998-01-01 10.5 18711 0.242 24148 22961 2077 1948
10 Australia 1999-01-01 8.67 18926 0.244 25445 24148 2231 2077
# ℹ 18 more rows
# ℹ 12 more variables: pubhealth <dbl>, roads <dbl>, cerebvas <int>,
# assault <int>, external <int>, txp_pop <dbl>, world <chr>, opt <chr>,
# consent_law <chr>, consent_practice <chr>, consistent <chr>, ccode <chr>
This could get cumbersome fast.
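The call behind the 28-row table is not shown. A sketch that gives a table of this size is below. Only Australia is visible in the output, so "Canada" as the second country is an assumption.

```r
library(dplyr)
library(socviz)

# Each comparison has to be written out in full and joined with | ("or").
# The second country is assumed for illustration.
two_countries <- organdata |>
  filter(country == "Australia" | country == "Canada")

two_countries
```

Every additional country needs another `country == "..."` clause, which is why this approach does not scale well.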
Use %in% for multiple selections
my_countries <- c("Australia", "Canada", "United States", "Ireland")
organdata |>
  filter(country %in% my_countries)
# A tibble: 56 × 21
country year donors pop pop_dens gdp gdp_lag health health_lag
<chr> <date> <dbl> <int> <dbl> <int> <int> <dbl> <dbl>
1 Australia NA NA 17065 0.220 16774 16591 1300 1224
2 Australia 1991-01-01 12.1 17284 0.223 17171 16774 1379 1300
3 Australia 1992-01-01 12.4 17495 0.226 17914 17171 1455 1379
4 Australia 1993-01-01 12.5 17667 0.228 18883 17914 1540 1455
5 Australia 1994-01-01 10.2 17855 0.231 19849 18883 1626 1540
6 Australia 1995-01-01 10.2 18072 0.233 21079 19849 1737 1626
7 Australia 1996-01-01 10.6 18311 0.237 21923 21079 1846 1737
8 Australia 1997-01-01 10.3 18518 0.239 22961 21923 1948 1846
9 Australia 1998-01-01 10.5 18711 0.242 24148 22961 2077 1948
10 Australia 1999-01-01 8.67 18926 0.244 25445 24148 2231 2077
# ℹ 46 more rows
# ℹ 12 more variables: pubhealth <dbl>, roads <dbl>, cerebvas <int>,
# assault <int>, external <int>, txp_pop <dbl>, world <chr>, opt <chr>,
# consent_law <chr>, consent_practice <chr>, consistent <chr>, ccode <chr>
%in%
my_countries <- c("Australia", "Canada", "United States", "Ireland")
organdata |>
  filter(!(country %in% my_countries))
# A tibble: 182 × 21
country year donors pop pop_dens gdp gdp_lag health health_lag
<chr> <date> <dbl> <int> <dbl> <int> <int> <dbl> <dbl>
1 Austria NA NA 7678 9.16 18914 17425 1344 1255
2 Austria 1991-01-01 27.6 7755 9.25 19860 18914 1419 1344
3 Austria 1992-01-01 23.1 7841 9.35 20601 19860 1551 1419
4 Austria 1993-01-01 26.2 7906 9.43 21119 20601 1674 1551
5 Austria 1994-01-01 21.4 7936 9.46 21940 21119 1739 1674
6 Austria 1995-01-01 21.5 7948 9.48 22817 21940 1865 1739
7 Austria 1996-01-01 24.7 7959 9.49 23798 22817 1986 1865
8 Austria 1997-01-01 19.5 7968 9.50 24364 23798 1848 1986
9 Austria 1998-01-01 20.7 7977 9.51 25423 24364 1953 1848
10 Austria 1999-01-01 25.9 7992 9.53 26513 25423 2069 1953
# ℹ 172 more rows
# ℹ 12 more variables: pubhealth <dbl>, roads <dbl>, cerebvas <int>,
# assault <int>, external <int>, txp_pop <dbl>, world <chr>, opt <chr>,
# consent_law <chr>, consent_practice <chr>, consistent <chr>, ccode <chr>
Also a bit awkward. There’s no built-in “Not in” operator.
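One common workaround is to define such an operator yourself. The sketch below uses base R's `Negate()`. The operator name `%nin%` is our own choice and is not built into R.

```r
library(dplyr)
library(socviz)

# Define a "not in" operator by negating %in%
`%nin%` <- Negate(`%in%`)

my_countries <- c("Australia", "Canada", "United States", "Ireland")

not_mine <- organdata |>
  filter(country %nin% my_countries)

not_mine
```

This keeps the same rows as `filter(!(country %in% my_countries))`, but reads more directly.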
# A tibble: 182 × 21
country year donors pop pop_dens gdp gdp_lag health health_lag
<chr> <date> <dbl> <int> <dbl> <int> <int> <dbl> <dbl>
1 Austria NA NA 7678 9.16 18914 17425 1344 1255
2 Austria 1991-01-01 27.6 7755 9.25 19860 18914 1419 1344
3 Austria 1992-01-01 23.1 7841 9.35 20601 19860 1551 1419
4 Austria 1993-01-01 26.2 7906 9.43 21119 20601 1674 1551
5 Austria 1994-01-01 21.4 7936 9.46 21940 21119 1739 1674
6 Austria 1995-01-01 21.5 7948 9.48 22817 21940 1865 1739
7 Austria 1996-01-01 24.7 7959 9.49 23798 22817 1986 1865
8 Austria 1997-01-01 19.5 7968 9.50 24364 23798 1848 1986
9 Austria 1998-01-01 20.7 7977 9.51 25423 24364 1953 1848
10 Austria 1999-01-01 25.9 7992 9.53 26513 25423 2069 1953
# ℹ 172 more rows
# ℹ 12 more variables: pubhealth <dbl>, roads <dbl>, cerebvas <int>,
# assault <int>, external <int>, txp_pop <dbl>, world <chr>, opt <chr>,
# consent_law <chr>, consent_practice <chr>, consistent <chr>, ccode <chr>
across()
Earlier we saw this:
gss_sm |>
  group_by(race, sex, degree) |>
  summarize(n = n(),
            mean_age = mean(age, na.rm = TRUE),
            mean_kids = mean(childs, na.rm = TRUE))
# A tibble: 34 × 6
# Groups: race, sex [6]
race sex degree n mean_age mean_kids
<fct> <fct> <fct> <int> <dbl> <dbl>
1 White Male Lt High School 96 52.9 2.45
2 White Male High School 470 48.8 1.61
3 White Male Junior College 65 47.1 1.54
4 White Male Bachelor 208 48.6 1.35
5 White Male Graduate 112 56.0 1.71
6 White Female Lt High School 101 55.4 2.81
7 White Female High School 587 51.9 1.98
8 White Female Junior College 101 48.2 1.91
9 White Female Bachelor 218 49.2 1.44
10 White Female Graduate 138 53.6 1.38
# ℹ 24 more rows
Similarly for organdata we might want to do:
organdata |>
  group_by(consent_law, country) |>
  summarize(donors_mean = mean(donors, na.rm = TRUE),
            donors_sd = sd(donors, na.rm = TRUE),
            gdp_mean = mean(gdp, na.rm = TRUE),
            health_mean = mean(health, na.rm = TRUE),
            roads_mean = mean(roads, na.rm = TRUE))
# A tibble: 17 × 7
# Groups: consent_law [2]
consent_law country donors_mean donors_sd gdp_mean health_mean roads_mean
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Informed Australia 10.6 1.14 22179. 1958. 105.
2 Informed Canada 14.0 0.751 23711. 2272. 109.
3 Informed Denmark 13.1 1.47 23722. 2054. 102.
4 Informed Germany 13.0 0.611 22163. 2349. 113.
5 Informed Ireland 19.8 2.48 20824. 1480. 118.
6 Informed Netherlands 13.7 1.55 23013. 1993. 76.1
7 Informed United Kin… 13.5 0.775 21359. 1561. 67.9
8 Informed United Sta… 20.0 1.33 29212. 3988. 155.
9 Presumed Austria 23.5 2.42 23876. 1875. 150.
10 Presumed Belgium 21.9 1.94 22500. 1958. 155.
11 Presumed Finland 18.4 1.53 21019. 1615. 93.6
12 Presumed France 16.8 1.60 22603. 2160. 156.
13 Presumed Italy 11.1 4.28 21554. 1757 122.
14 Presumed Norway 15.4 1.11 26448. 2217. 70.0
15 Presumed Spain 28.1 4.96 16933 1289. 161.
16 Presumed Sweden 13.1 1.75 22415. 1951. 72.3
17 Presumed Switzerland 14.2 1.71 27233 2776. 96.4
This works, but it’s really tedious. Also error-prone.
Instead, use across() to apply a function to more than one column.
my_vars <- c("gdp", "donors", "roads")
## nested parens again, but it's worth it
organdata |>
  group_by(consent_law, country) |>
  summarize(across(all_of(my_vars),
                   list(avg = \(x) mean(x, na.rm = TRUE))
                  )
           )
# A tibble: 17 × 5
# Groups: consent_law [2]
consent_law country gdp_avg donors_avg roads_avg
<chr> <chr> <dbl> <dbl> <dbl>
1 Informed Australia 22179. 10.6 105.
2 Informed Canada 23711. 14.0 109.
3 Informed Denmark 23722. 13.1 102.
4 Informed Germany 22163. 13.0 113.
5 Informed Ireland 20824. 19.8 118.
6 Informed Netherlands 23013. 13.7 76.1
7 Informed United Kingdom 21359. 13.5 67.9
8 Informed United States 29212. 20.0 155.
9 Presumed Austria 23876. 23.5 150.
10 Presumed Belgium 22500. 21.9 155.
11 Presumed Finland 21019. 18.4 93.6
12 Presumed France 22603. 16.8 156.
13 Presumed Italy 21554. 11.1 122.
14 Presumed Norway 26448. 15.4 70.0
15 Presumed Spain 16933 28.1 161.
16 Presumed Sweden 22415. 13.1 72.3
17 Presumed Switzerland 27233 14.2 96.4
# A tibble: 238 × 21
country year donors pop pop_dens gdp gdp_lag health health_lag
<chr> <date> <dbl> <int> <dbl> <int> <int> <dbl> <dbl>
1 Australia NA NA 17065 0.220 16774 16591 1300 1224
2 Australia 1991-01-01 12.1 17284 0.223 17171 16774 1379 1300
3 Australia 1992-01-01 12.4 17495 0.226 17914 17171 1455 1379
4 Australia 1993-01-01 12.5 17667 0.228 18883 17914 1540 1455
5 Australia 1994-01-01 10.2 17855 0.231 19849 18883 1626 1540
6 Australia 1995-01-01 10.2 18072 0.233 21079 19849 1737 1626
7 Australia 1996-01-01 10.6 18311 0.237 21923 21079 1846 1737
8 Australia 1997-01-01 10.3 18518 0.239 22961 21923 1948 1846
9 Australia 1998-01-01 10.5 18711 0.242 24148 22961 2077 1948
10 Australia 1999-01-01 8.67 18926 0.244 25445 24148 2231 2077
# ℹ 228 more rows
# ℹ 12 more variables: pubhealth <dbl>, roads <dbl>, cerebvas <int>,
# assault <int>, external <int>, txp_pop <dbl>, world <chr>, opt <chr>,
# consent_law <chr>, consent_practice <chr>, consistent <chr>, ccode <chr>
# A tibble: 238 × 21
# Groups: consent_law, country [17]
country year donors pop pop_dens gdp gdp_lag health health_lag
<chr> <date> <dbl> <int> <dbl> <int> <int> <dbl> <dbl>
1 Australia NA NA 17065 0.220 16774 16591 1300 1224
2 Australia 1991-01-01 12.1 17284 0.223 17171 16774 1379 1300
3 Australia 1992-01-01 12.4 17495 0.226 17914 17171 1455 1379
4 Australia 1993-01-01 12.5 17667 0.228 18883 17914 1540 1455
5 Australia 1994-01-01 10.2 17855 0.231 19849 18883 1626 1540
6 Australia 1995-01-01 10.2 18072 0.233 21079 19849 1737 1626
7 Australia 1996-01-01 10.6 18311 0.237 21923 21079 1846 1737
8 Australia 1997-01-01 10.3 18518 0.239 22961 21923 1948 1846
9 Australia 1998-01-01 10.5 18711 0.242 24148 22961 2077 1948
10 Australia 1999-01-01 8.67 18926 0.244 25445 24148 2231 2077
# ℹ 228 more rows
# ℹ 12 more variables: pubhealth <dbl>, roads <dbl>, cerebvas <int>,
# assault <int>, external <int>, txp_pop <dbl>, world <chr>, opt <chr>,
# consent_law <chr>, consent_practice <chr>, consistent <chr>, ccode <chr>
The variables named in my_vars are selected by across().
Use all_of() or any_of() to be explicit.
list() of the form result = function gives the new columns that will be calculated.
\(x) is an anonymous function.
my_vars <- c("gdp", "donors", "roads")
organdata |>
  group_by(consent_law, country) |>
  summarize(across(all_of(my_vars),
                   list(avg = \(x) mean(x, na.rm = TRUE),
                        sdev = \(x) sd(x, na.rm = TRUE),
                        md = \(x) median(x, na.rm = TRUE))
            )
  )
# A tibble: 17 × 11
# Groups: consent_law [2]
consent_law country gdp_avg gdp_sdev gdp_md donors_avg donors_sdev donors_md
<chr> <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 Informed Austral… 22179. 3959. 21923 10.6 1.14 10.4
2 Informed Canada 23711. 3966. 22764 14.0 0.751 14.0
3 Informed Denmark 23722. 3896. 23548 13.1 1.47 12.9
4 Informed Germany 22163. 2501. 22164 13.0 0.611 13
5 Informed Ireland 20824. 6670. 19245 19.8 2.48 19.2
6 Informed Netherl… 23013. 3770. 22541 13.7 1.55 13.8
7 Informed United … 21359. 3929. 20839 13.5 0.775 13.5
8 Informed United … 29212. 4571. 28772 20.0 1.33 20.1
9 Presumed Austria 23876. 3343. 23798 23.5 2.42 23.8
10 Presumed Belgium 22500. 3171. 22152 21.9 1.94 21.4
11 Presumed Finland 21019. 3668. 19842 18.4 1.53 19.4
12 Presumed France 22603. 3260. 21990 16.8 1.60 16.6
13 Presumed Italy 21554. 2781. 21396 11.1 4.28 11.3
14 Presumed Norway 26448. 6492. 26218 15.4 1.11 15.4
15 Presumed Spain 16933 2888. 16416 28.1 4.96 28
16 Presumed Sweden 22415. 3213. 22029 13.1 1.75 12.7
17 Presumed Switzer… 27233 2153. 26304 14.2 1.71 14.4
# ℹ 3 more variables: roads_avg <dbl>, roads_sdev <dbl>, roads_md <dbl>
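The pieces described above can be tried on a small table. This is a minimal sketch: the `toy` data, the `vars` vector, and the `n_ok` summary are made up for illustration and are not part of organdata. It also shows the practical difference between the two selectors: all_of() requires every named column to exist, while any_of() skips names that are missing.

```r
library(dplyr)

# Hypothetical toy data, not from organdata
toy <- tibble(
  grp = c("a", "a", "b", "b"),
  x   = c(1, 3, NA, 5),
  y   = c(10, 20, 30, 50),
  z   = c("p", "q", "r", "s")
)

vars <- c("x", "y")

# Each list entry has the form result = function;
# new columns are named <column>_<result> by default
toy_summary <- toy |>
  group_by(grp) |>
  summarize(across(all_of(vars),
                   list(avg  = \(v) mean(v, na.rm = TRUE),
                        n_ok = \(v) sum(!is.na(v)))))

toy_summary

# any_of() tolerates names that are not in the table
toy_any <- toy |>
  summarize(across(any_of(c("x", "not_a_column")),
                   \(v) mean(v, na.rm = TRUE)))

toy_any
```

Swapping any_of() for all_of() in the second call would raise an error, which is usually what you want when a misspelled column name should not pass silently.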
my_vars <- c("gdp", "donors", "roads")
organdata |>
  group_by(consent_law, country) |>
  summarize(across(all_of(my_vars),
                   list(mean = \(x) mean(x, na.rm = TRUE),
                        sd = \(x) sd(x, na.rm = TRUE),
                        median = \(x) median(x, na.rm = TRUE))
            )
  )
# A tibble: 17 × 11
# Groups: consent_law [2]
consent_law country gdp_mean gdp_sd gdp_median donors_mean donors_sd
<chr> <chr> <dbl> <dbl> <int> <dbl> <dbl>
1 Informed Australia 22179. 3959. 21923 10.6 1.14
2 Informed Canada 23711. 3966. 22764 14.0 0.751
3 Informed Denmark 23722. 3896. 23548 13.1 1.47
4 Informed Germany 22163. 2501. 22164 13.0 0.611
5 Informed Ireland 20824. 6670. 19245 19.8 2.48
6 Informed Netherlands 23013. 3770. 22541 13.7 1.55
7 Informed United Kingdom 21359. 3929. 20839 13.5 0.775
8 Informed United States 29212. 4571. 28772 20.0 1.33
9 Presumed Austria 23876. 3343. 23798 23.5 2.42
10 Presumed Belgium 22500. 3171. 22152 21.9 1.94
11 Presumed Finland 21019. 3668. 19842 18.4 1.53
12 Presumed France 22603. 3260. 21990 16.8 1.60
13 Presumed Italy 21554. 2781. 21396 11.1 4.28
14 Presumed Norway 26448. 6492. 26218 15.4 1.11
15 Presumed Spain 16933 2888. 16416 28.1 4.96
16 Presumed Sweden 22415. 3213. 22029 13.1 1.75
17 Presumed Switzerland 27233 2153. 26304 14.2 1.71
# ℹ 4 more variables: donors_median <dbl>, roads_mean <dbl>, roads_sd <dbl>,
# roads_median <dbl>
across(where())
organdata |>
  group_by(consent_law, country) |>
  summarize(across(where(is.numeric),
                   list(mean = \(x) mean(x, na.rm = TRUE),
                        sd = \(x) sd(x, na.rm = TRUE),
                        median = \(x) median(x, na.rm = TRUE))
            )
  ) |>
  print(n = 3) # just to save slide space
# A tibble: 17 × 41
# Groups: consent_law [2]
consent_law country donors_mean donors_sd donors_median pop_mean pop_sd
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Informed Australia 10.6 1.14 10.4 18318. 831.
2 Informed Canada 14.0 0.751 14.0 29608. 1193.
3 Informed Denmark 13.1 1.47 12.9 5257. 80.6
# ℹ 14 more rows
# ℹ 34 more variables: pop_median <int>, pop_dens_mean <dbl>,
# pop_dens_sd <dbl>, pop_dens_median <dbl>, gdp_mean <dbl>, gdp_sd <dbl>,
# gdp_median <int>, gdp_lag_mean <dbl>, gdp_lag_sd <dbl>,
# gdp_lag_median <dbl>, health_mean <dbl>, health_sd <dbl>,
# health_median <dbl>, health_lag_mean <dbl>, health_lag_sd <dbl>,
# health_lag_median <dbl>, pubhealth_mean <dbl>, pubhealth_sd <dbl>, …
.names
organdata |>
  group_by(consent_law, country) |>
  summarize(across(where(is.numeric),
                   list(mean = \(x) mean(x, na.rm = TRUE),
                        sd = \(x) sd(x, na.rm = TRUE),
                        median = \(x) median(x, na.rm = TRUE)
                   ),
                   .names = "{fn}_{col}"
            )
  ) |>
  print(n = 3)
# A tibble: 17 × 41
# Groups: consent_law [2]
consent_law country mean_donors sd_donors median_donors mean_pop sd_pop
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Informed Australia 10.6 1.14 10.4 18318. 831.
2 Informed Canada 14.0 0.751 14.0 29608. 1193.
3 Informed Denmark 13.1 1.47 12.9 5257. 80.6
# ℹ 14 more rows
# ℹ 34 more variables: median_pop <int>, mean_pop_dens <dbl>,
# sd_pop_dens <dbl>, median_pop_dens <dbl>, mean_gdp <dbl>, sd_gdp <dbl>,
# median_gdp <int>, mean_gdp_lag <dbl>, sd_gdp_lag <dbl>,
# median_gdp_lag <dbl>, mean_health <dbl>, sd_health <dbl>,
# median_health <dbl>, mean_health_lag <dbl>, sd_health_lag <dbl>,
# median_health_lag <dbl>, mean_pubhealth <dbl>, sd_pubhealth <dbl>, …
.names
In tidyverse functions, arguments that begin with a “.” generally have it in order to avoid confusion with existing items, or are “pronouns” referring to e.g. “the name of the thing we’re currently talking about as we evaluate this function”.
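A small sketch of .names in action, on made-up data (`toy` and the "_of_" naming pattern are illustrative only). The dplyr documentation spells the two placeholders with leading dots, {.col} for the column name and {.fn} for the function's name in the list, so that form is used here.

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  grp = c("a", "a", "b"),
  x   = c(1, 2, 3),
  y   = c(4, 5, 6)
)

renamed <- toy |>
  group_by(grp) |>
  summarize(across(c(x, y),
                   list(mean = mean, max = max),
                   # glue-style template: function name first, then column
                   .names = "{.fn}_of_{.col}"))

names(renamed)
```

The template is applied once per column-function pair, so changing it in one place renames every new column consistently.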
mutate(), too
# A tibble: 238 × 7
country world opt consent_law consent_practice consistent ccode
<chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 AUSTRALIA LIBERAL IN INFORMED INFORMED YES OZ
2 AUSTRALIA LIBERAL IN INFORMED INFORMED YES OZ
3 AUSTRALIA LIBERAL IN INFORMED INFORMED YES OZ
4 AUSTRALIA LIBERAL IN INFORMED INFORMED YES OZ
5 AUSTRALIA LIBERAL IN INFORMED INFORMED YES OZ
6 AUSTRALIA LIBERAL IN INFORMED INFORMED YES OZ
7 AUSTRALIA LIBERAL IN INFORMED INFORMED YES OZ
8 AUSTRALIA LIBERAL IN INFORMED INFORMED YES OZ
9 AUSTRALIA LIBERAL IN INFORMED INFORMED YES OZ
10 AUSTRALIA LIBERAL IN INFORMED INFORMED YES OZ
# ℹ 228 more rows
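The code that produced the table above is not shown in this section. One plausible reading is that an upper-casing function was applied to every character column inside mutate(). The following is a sketch under that assumption, using a made-up `toy` table rather than organdata.

```r
library(dplyr)

# Hypothetical toy data standing in for organdata
toy <- tibble(
  country     = c("Australia", "Spain"),
  consent_law = c("Informed", "Presumed"),
  donors      = c(10.6, 28.1)
)

# across() inside mutate() transforms the selected columns in place;
# numeric columns are untouched because where(is.character) skips them
shouted <- toy |>
  mutate(across(where(is.character), toupper))

shouted
```

Unlike summarize(), mutate() keeps every row, so the result has the same shape as the input.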
by_country <- organdata |>
  group_by(consent_law, country) |>
  summarize(across(where(is.numeric),
                   list(mean = \(x) mean(x, na.rm = TRUE),
                        sd = \(x) sd(x, na.rm = TRUE),
                        median = \(x) median(x, na.rm = TRUE))
            )
  )
by_country
# A tibble: 17 × 41
# Groups: consent_law [2]
consent_law country donors_mean donors_sd donors_median pop_mean pop_sd
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Informed Australia 10.6 1.14 10.4 18318. 8.31e2
2 Informed Canada 14.0 0.751 14.0 29608. 1.19e3
3 Informed Denmark 13.1 1.47 12.9 5257. 8.06e1
4 Informed Germany 13.0 0.611 13 80255. 5.16e3
5 Informed Ireland 19.8 2.48 19.2 3674. 1.32e2
6 Informed Netherlands 13.7 1.55 13.8 15548. 3.73e2
7 Informed United Kingd… 13.5 0.775 13.5 58187. 6.26e2
8 Informed United States 20.0 1.33 20.1 269330. 1.25e4
9 Presumed Austria 23.5 2.42 23.8 7927. 1.09e2
10 Presumed Belgium 21.9 1.94 21.4 10153. 1.09e2
11 Presumed Finland 18.4 1.53 19.4 5112. 6.86e1
12 Presumed France 16.8 1.60 16.6 58056. 8.51e2
13 Presumed Italy 11.1 4.28 11.3 57360. 4.25e2
14 Presumed Norway 15.4 1.11 15.4 4386. 9.73e1
15 Presumed Spain 28.1 4.96 28 39666. 9.51e2
16 Presumed Sweden 13.1 1.75 12.7 8789. 1.14e2
17 Presumed Switzerland 14.2 1.71 14.4 7037. 1.70e2
# ℹ 34 more variables: pop_median <int>, pop_dens_mean <dbl>,
# pop_dens_sd <dbl>, pop_dens_median <dbl>, gdp_mean <dbl>, gdp_sd <dbl>,
# gdp_median <int>, gdp_lag_mean <dbl>, gdp_lag_sd <dbl>,
# gdp_lag_median <dbl>, health_mean <dbl>, health_sd <dbl>,
# health_median <dbl>, health_lag_mean <dbl>, health_lag_sd <dbl>,
# health_lag_median <dbl>, pubhealth_mean <dbl>, pubhealth_sd <dbl>,
# pubhealth_median <dbl>, roads_mean <dbl>, roads_sd <dbl>, …
The problem is that countries can only be in one Consent Law category.
Restricting to one column doesn’t fix it.
Normally the point of a facet is to preserve comparability between panels by not allowing the scales to vary. But for categorical measures it can be useful to allow this.
by_country |>
  ggplot(mapping =
           aes(x = donors_mean,
               y = reorder(country, donors_mean),
               color = consent_law)) +
  geom_pointrange(mapping =
                    aes(xmin = donors_mean - donors_sd,
                        xmax = donors_mean + donors_sd)) +
  guides(color = "none") +
  facet_wrap(~ consent_law,
             ncol = 1,
             scales = "free_y") +
  labs(x = "Donor Procurement Rate",
       y = NULL,
       color = "Consent Law")
geom_text() for basic labels
ggrepel instead
The ggrepel package provides geom_text_repel() and geom_label_repel()
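For comparison, here is a minimal sketch of basic labeling with geom_text() alone. The `pts` data frame is made up for illustration. Labels are placed at the point coordinates unless they are nudged or justified by hand, which is the manual work the ggrepel geoms are designed to take over.

```r
library(ggplot2)

# Hypothetical toy data
pts <- data.frame(
  x    = c(1, 2, 3),
  y    = c(2, 4, 3),
  name = c("A", "B", "C")
)

p_lab <- ggplot(pts, aes(x = x, y = y, label = name)) +
  geom_point() +
  # left-justify each label and shift it slightly right of its point
  geom_text(hjust = 0, nudge_x = 0.05)

p_lab
```

With ggrepel loaded, replacing the geom_text() layer with geom_text_repel() keeps the same label mapping and lets the package choose non-overlapping positions.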
elections_historic is in socviz
# A tibble: 49 × 19
election year winner win_party ec_pct popular_pct popular_margin votes
<int> <int> <chr> <chr> <dbl> <dbl> <dbl> <int>
1 10 1824 John Quinc… D.-R. 0.322 0.309 -0.104 1.13e5
2 11 1828 Andrew Jac… Dem. 0.682 0.559 0.122 6.43e5
3 12 1832 Andrew Jac… Dem. 0.766 0.547 0.178 7.03e5
4 13 1836 Martin Van… Dem. 0.578 0.508 0.142 7.63e5
5 14 1840 William He… Whig 0.796 0.529 0.0605 1.28e6
6 15 1844 James Polk Dem. 0.618 0.495 0.0145 1.34e6
7 16 1848 Zachary Ta… Whig 0.562 0.473 0.0479 1.36e6
8 17 1852 Franklin P… Dem. 0.858 0.508 0.0695 1.61e6
9 18 1856 James Buch… Dem. 0.588 0.453 0.122 1.84e6
10 19 1860 Abraham Li… Rep. 0.594 0.396 0.101 1.86e6
# ℹ 39 more rows
# ℹ 11 more variables: margin <int>, runner_up <chr>, ru_part <chr>,
# turnout_pct <dbl>, winner_lname <chr>, winner_label <chr>, ru_lname <chr>,
# ru_label <chr>, two_term <lgl>, ec_votes <dbl>, ec_denom <dbl>
Presidential elections
## The packages we'll use in addition to ggplot
library(ggrepel)
library(scales)
p_title <- "Presidential Elections: Popular & Electoral College Margins"
p_subtitle <- "1824-2016"
p_caption <- "Data for 2016 are provisional."
x_label <- "Winner's share of Popular Vote"
y_label <- "Winner's share of Electoral College Votes"
This looks terrible here because geom_text_repel() uses the dimensions of the available graphics device to iteratively work out where to place the labels. Let’s allow it to draw on the whole slide.
p <- ggplot(data = elections_historic,
            mapping = aes(x = popular_pct,
                          y = ec_pct,
                          label = winner_label))
p_out <- p + geom_hline(yintercept = 0.5,
                        linewidth = 1.4,
                        color = "gray80") +
  geom_vline(xintercept = 0.5,
             linewidth = 1.4,
             color = "gray80") +
  geom_point() +
  geom_text_repel() +
  scale_x_continuous(labels = label_percent()) +
  scale_y_continuous(labels = label_percent())
p <- ggplot(data = elections_historic,
            mapping = aes(x = popular_pct,
                          y = ec_pct,
                          label = winner_label))
p_out <- p + geom_hline(yintercept = 0.5,
                        linewidth = 1.4,
                        color = "gray80") +
  geom_vline(xintercept = 0.5,
             linewidth = 1.4,
             color = "gray80") +
  geom_point() +
  geom_text_repel(mapping = aes(family = "Tenso Slide")) +
  scale_x_continuous(labels = label_percent()) +
  scale_y_continuous(labels = label_percent()) +
  labs(x = x_label, y = y_label,
       title = p_title,
       subtitle = p_subtitle,
       caption = p_caption)
ggplot
ggplot
Stuffing everything into the subset() call might get messy
dplyr first
# A tibble: 6 × 41
# Groups: consent_law [2]
consent_law country donors_mean donors_sd donors_median pop_mean pop_sd
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Informed Ireland 19.8 2.48 19.2 3674. 132.
2 Informed United States 20.0 1.33 20.1 269330. 12545.
3 Presumed Belgium 21.9 1.94 21.4 10153. 109.
4 Presumed Norway 15.4 1.11 15.4 4386. 97.3
5 Presumed Spain 28.1 4.96 28 39666. 951.
6 Presumed Switzerland 14.2 1.71 14.4 7037. 170.
# ℹ 34 more variables: pop_median <int>, pop_dens_mean <dbl>,
# pop_dens_sd <dbl>, pop_dens_median <dbl>, gdp_mean <dbl>, gdp_sd <dbl>,
# gdp_median <int>, gdp_lag_mean <dbl>, gdp_lag_sd <dbl>,
# gdp_lag_median <dbl>, health_mean <dbl>, health_sd <dbl>,
# health_median <dbl>, health_lag_mean <dbl>, health_lag_sd <dbl>,
# health_lag_median <dbl>, pubhealth_mean <dbl>, pubhealth_sd <dbl>,
# pubhealth_median <dbl>, roads_mean <dbl>, roads_sd <dbl>, …
This makes things neater. A geom can be fully “autonomous”. Each one can have its own mapping call and its own data source. This can be very useful when building up plots overlaying several sources or subsets of data.
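A minimal sketch of the idea, on made-up data: filter with dplyr first, then hand the smaller table to a single layer through its own data argument. The `pts` table, the `to_label` name, and the `y > 5` condition are illustrative assumptions, not the filter used for the table above.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data
pts <- tibble(
  name = c("A", "B", "C", "D"),
  x    = c(1, 2, 3, 4),
  y    = c(2, 9, 3, 8)
)

# Step 1: do the subsetting in dplyr, where it is easy to inspect
to_label <- pts |>
  filter(y > 5)

# Step 2: the point layer inherits the full data;
# the text layer gets its own data and its own label mapping
p_hl <- ggplot(pts, aes(x = x, y = y)) +
  geom_point() +
  geom_text(data = to_label,
            mapping = aes(label = name),
            vjust = -1)

p_hl
```

Because the subset is a named object, it can be printed and checked before it ever reaches the plot.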
annotate() can imitate geoms
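A brief sketch of what this means in practice, using the built-in mtcars data. The coordinates, colors, and label text are arbitrary choices for illustration. annotate() takes the name of a geom and draws it from values supplied directly, rather than from columns of a data frame.

```r
library(ggplot2)

p_ann <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  # a rectangle drawn from fixed coordinates, like geom_rect()
  annotate(geom = "rect",
           xmin = 5, xmax = 5.6,
           ymin = 9, ymax = 16,
           fill = "red", alpha = 0.2) +
  # a single piece of text, like geom_text()
  annotate(geom = "text",
           x = 4.9, y = 18,
           label = "Heavy, thirsty cars",
           hjust = 1)

p_ann
```

Since nothing is mapped from the data, these layers produce no legend entries.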
x and y, but also color, fill, shape, size, and alpha, are scales.
scale_<MAPPING>_<KIND>()
We already know there are a lot of mappings: x, y, color, size, shape, and so on.
And there are many kinds of scale as well: discrete, continuous, log10, date, binned, and many others.
So there’s a whole zoo of scale functions.
The naming convention helps us keep track.
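To make the naming convention concrete, here is a small sketch on the built-in mtcars data. The breaks and colors are arbitrary illustrative choices. Each function name pairs a mapping (x, y, color) with a kind of scale (log10, continuous, manual).

```r
library(ggplot2)

p_sc <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point() +
  # mapping: x, kind: log10
  scale_x_log10() +
  # mapping: y, kind: continuous
  scale_y_continuous(breaks = c(10, 20, 30)) +
  # mapping: color, kind: manual (one hand-picked color per level)
  scale_color_manual(values = c("4" = "steelblue",
                                "6" = "gray40",
                                "8" = "firebrick"))

p_sc
```

Reading a scale function's name from left to right tells you which aesthetic it adjusts and how its values are treated.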
guides() function
Control overall properties of the guide labels. Most common use: turning a guide off.
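A short sketch of both uses on the built-in mtcars data. The choice of which legend to hide and the "Horsepower" title are illustrative only.

```r
library(ggplot2)

p_g <- ggplot(mtcars, aes(x = wt, y = mpg,
                          color = factor(cyl),
                          size = hp)) +
  geom_point() +
  # drop the color legend entirely; keep the size legend but retitle it
  guides(color = "none",
         size = guide_legend(title = "Horsepower"))

p_g
```

Each argument to guides() is named after a mapping, so guides are controlled one aesthetic at a time.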
theme() function
theme() styles parts of your plot that are not directly representing your data. Often the first thing people want to adjust; but logically it’s the last thing.
## Using the "classic" ggplot theme here
organdata |>
  ggplot(mapping = aes(x = roads,
                       y = donors,
                       color = consent_law)) +
  geom_point() +
  labs(title = "By Consent Law",
       x = "Road Deaths",
       y = "Donor Procurement",
       color = "Legal Regime:") +
  theme(legend.position = "bottom",
        plot.title = element_text(color = "darkred",
                                  face = "bold"))