Iterating on Data

Soc 690S: Week 10b

Kieran Healy

Duke University

March 2025

Iterating on data with purrr and map

Load the packages, as always

library(here)      # manage file paths
library(socviz)    # data and some useful functions
library(tidyverse) # your friend and mine

Split

Apply

Combine

The Core Idea

  • Split or group the data into pieces
  • Apply a function, i.e. act on the data in some way
  • Combine the results back into a table

Sometimes we start out in a “split” state, as when we have multiple data files, and we need to read them in and then combine them.
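A familiar dplyr pipeline is already an instance of this pattern. A minimal sketch, using the built-in mtcars data:

```r
# Split-apply-combine in a single pipeline
mtcars |>
  group_by(cyl) |>                  # split the rows into groups
  summarize(mpg_mean = mean(mpg))   # apply a function; combine into a table
```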

More than one data file

  • Inside files/data/ is a folder named congress/
# A little trick from the fs package:
fs::dir_tree(here("files", "data", "congress"))
/Users/kjhealy/Documents/courses/socdata.co/files/data/congress
├── 01_79_congress.csv
├── 02_80_congress.csv
├── 03_81_congress.csv
├── 04_82_congress.csv
├── 05_83_congress.csv
├── 06_84_congress.csv
├── 07_85_congress.csv
├── 08_86_congress.csv
├── 09_87_congress.csv
├── 10_88_congress.csv
├── 11_89_congress.csv
├── 12_90_congress.csv
├── 13_91_congress.csv
├── 14_92_congress.csv
├── 15_93_congress.csv
├── 16_94_congress.csv
├── 17_95_congress.csv
├── 18_96_congress.csv
├── 19_97_congress.csv
├── 20_98_congress.csv
├── 21_99_congress.csv
├── 22_100_congress.csv
├── 23_101_congress.csv
├── 24_102_congress.csv
├── 25_103_congress.csv
├── 26_104_congress.csv
├── 27_105_congress.csv
├── 28_106_congress.csv
├── 29_107_congress.csv
├── 30_108_congress.csv
├── 31_109_congress.csv
├── 32_110_congress.csv
├── 33_111_congress.csv
├── 34_112_congress.csv
├── 35_113_congress.csv
├── 36_114_congress.csv
├── 37_115_congress.csv
└── 38_116_congress.csv

More than one data file

Let’s look at one.

read_csv(here("files", "data", "congress", "17_95_congress.csv")) |>
  janitor::clean_names() |>
  head()
# A tibble: 6 × 25
  last     first   middle suffix nickname born  death sex   position party state
  <chr>    <chr>   <chr>  <chr>  <chr>    <chr> <chr> <chr> <chr>    <chr> <chr>
1 Abdnor   James   <NA>   <NA>   <NA>     02/1… 11/0… M     U.S. Re… Repu… SD   
2 Abourezk James   George <NA>   <NA>     02/2… <NA>  M     U.S. Se… Demo… SD   
3 Adams    Brockm… <NA>   <NA>   Brock    01/1… 09/1… M     U.S. Re… Demo… WA   
4 Addabbo  Joseph  Patri… <NA>   <NA>     03/1… 04/1… M     U.S. Re… Demo… NY   
5 Aiken    George  David  <NA>   <NA>     08/2… 11/1… M     U.S. Se… Repu… VT   
6 Akaka    Daniel  Kahik… <NA>   <NA>     09/1… 04/0… M     U.S. Re… Demo… HI   
# ℹ 14 more variables: district <chr>, start <chr>, end <chr>, religion <chr>,
#   race <chr>, educational_attainment <chr>, job_type1 <chr>, job_type2 <chr>,
#   job_type3 <chr>, job_type4 <chr>, job_type5 <lgl>, mil1 <chr>, mil2 <chr>,
#   mil3 <chr>

We often find ourselves in this situation. We know each file has the same structure, and we would like to use them all at once.

Loops?

How to read them all in?

One traditional way to do this in R is to write an explicit loop that iterates over a vector of filenames, reads each file, and then joins the results together into one tall rectangle.

# Pseudocode

filenames <- c("01_79_congress.csv", "02_80_congress.csv", "03_81_congress.csv",
               "04_82_congress.csv") # [etc etc]

collected_files <- NULL

for (i in seq_along(filenames)) {
  new_file <- read_csv(filenames[i])
  collected_files <- bind_rows(collected_files, new_file)
}

Loops?

  • You may have noticed we have not written any loops, however.
  • Loops are still lurking beneath the surface, but what we will do instead is take advantage of the combination of vectors and functions, mapping one to the other in order to generate results.
  • Speaking loosely, think of map() as a way of iterating without writing loops. You start with a vector of things. You feed it one thing at a time to some function. The function does whatever it does. You get back output that is the same length as your input, and of a specific type.
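As a minimal sketch of that idea, we can feed a small numeric vector to sqrt(), one element at a time:

```r
# One input vector, one function, one output of the same length
map_dbl(c(4, 9, 16), sqrt)
#> [1] 2 3 4
```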

Mapping is just a kind of iteration

  • The purrr package provides a big family of mapping functions. One reason there are a lot of them is that purrr, like the rest of the tidyverse, is picky about data types.
  • So in addition to the basic map(), which always returns a list, we also have map_chr(), map_int(), map_dbl(), map_lgl() and others. They always return the data type indicated by their suffix, or die trying.
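A toy illustration of that type strictness: dividing integers by two produces doubles, so map_dbl() succeeds where map_int() refuses to coerce.

```r
map_dbl(1:3, \(x) x / 2)   # returns 0.5 1.0 1.5
# map_int(1:3, \(x) x / 2) # would error: can't coerce double to integer
```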

Vectorized arithmetic again

The simplest cases are not that different from the vectorized arithmetic we’re already familiar with.

a <- c(1:10)

b <- 1

# You know what R will do here
a + b
 [1]  2  3  4  5  6  7  8  9 10 11

R’s vectorized rules add b to every element of a. In a sense, the + operation can be thought of as a function that takes each element of a and does something with it. In this case “add b”.

Vectorized arithmetic again

We can make this explicit by writing a function:

add_b <- function(x) {
  b <- 1
  x + b # for any x
}

Vectorized arithmetic again

We can make this explicit by writing a function:

add_b <- function(x) {
  b <- 1
  x + b # for any x
}

Now:

add_b(x = a)
 [1]  2  3  4  5  6  7  8  9 10 11

Vectorized arithmetic again

Again, R’s vectorized approach means it automatically adds b to every element of the x we give it.

add_b(x = 10)
[1] 11
add_b(x = c(1, 99, 1000))
[1]    2  100 1001

Iterating in a pipeline

Some operations can’t be vectorized directly in this way, which is why we need to iterate manually, or write loops.

library(gapminder)
gapminder |>
  summarize(country_n = n_distinct(country),
            continent_n = n_distinct(continent),
            year_n = n_distinct(year),
            lifeExp_n = n_distinct(lifeExp),
            pop_n = n_distinct(pop))
# A tibble: 1 × 5
  country_n continent_n year_n lifeExp_n pop_n
      <int>       <int>  <int>     <int> <int>
1       142           5     12      1626  1704

That’s tedious to write! Computers are supposed to allow us to avoid that sort of thing.

Iterating in a pipeline

So how would we iterate this? What we want is to apply the n_distinct() function to each column of gapminder, but in a way that still allows us to use pipelines and so on.

library(gapminder)
gapminder |>
  summarize(n_distinct(country),
            n_distinct(continent),
            n_distinct(year),
            n_distinct(lifeExp),
            n_distinct(pop))
# A tibble: 1 × 5
  `n_distinct(country)` `n_distinct(continent)` `n_distinct(year)`
                  <int>                   <int>              <int>
1                   142                       5                 12
# ℹ 2 more variables: `n_distinct(lifeExp)` <int>,
#   `n_distinct(pop)` <int>

Iterating in a pipeline

You’d use across(), like this:

gapminder |>
  summarize(across(everything(), n_distinct))
# A tibble: 1 × 6
  country continent  year lifeExp   pop gdpPercap
    <int>     <int> <int>   <int> <int>     <int>
1     142         5    12    1626  1704      1704

Iterating in a pipeline

But you could also do this …

  map(gapminder, n_distinct)
$country
[1] 142

$continent
[1] 5

$year
[1] 12

$lifeExp
[1] 1626

$pop
[1] 1704

$gdpPercap
[1] 1704
  • Read it as “Feed each column of gapminder to the n_distinct() function.”
  • (This is pretty much what across() does, more nicely.)

Iterating in a pipeline

Or, in pipeline form:

gapminder |>
  map(n_distinct)
$country
[1] 142

$continent
[1] 5

$year
[1] 12

$lifeExp
[1] 1626

$pop
[1] 1704

$gdpPercap
[1] 1704

You can see we are getting a list back.

Iterating in a pipeline

Or, in pipeline form:

result <- gapminder |>
  map(n_distinct)

class(result)
[1] "list"
result$continent
[1] 5
result[[2]]
[1] 5

Iterating in a pipeline

But we know n_distinct() should always return an integer. So we use map_int() instead of the generic map().

gapminder |>
  map_int(n_distinct)
  country continent      year   lifeExp       pop gdpPercap 
      142         5        12      1626      1704      1704 

The thing about the map() family is that they can deal with all kinds of input types and output types.
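For example, a character vector in and an integer vector out:

```r
# nchar() returns an integer, so map_int() is the right variant here
map_int(c("a", "bb", "ccc"), nchar)
#> [1] 1 2 3
```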

Get a vector of filenames

filenames <- dir(path = here("files", "data", "congress"),
                 pattern = "\\.csv$", # a regular expression, not a glob
                 full.names = TRUE)

filenames[1:15] # Just displaying the first 15, to save slide space
 [1] "/Users/kjhealy/Documents/courses/socdata.co/files/data/congress/01_79_congress.csv"
 [2] "/Users/kjhealy/Documents/courses/socdata.co/files/data/congress/02_80_congress.csv"
 [3] "/Users/kjhealy/Documents/courses/socdata.co/files/data/congress/03_81_congress.csv"
 [4] "/Users/kjhealy/Documents/courses/socdata.co/files/data/congress/04_82_congress.csv"
 [5] "/Users/kjhealy/Documents/courses/socdata.co/files/data/congress/05_83_congress.csv"
 [6] "/Users/kjhealy/Documents/courses/socdata.co/files/data/congress/06_84_congress.csv"
 [7] "/Users/kjhealy/Documents/courses/socdata.co/files/data/congress/07_85_congress.csv"
 [8] "/Users/kjhealy/Documents/courses/socdata.co/files/data/congress/08_86_congress.csv"
 [9] "/Users/kjhealy/Documents/courses/socdata.co/files/data/congress/09_87_congress.csv"
[10] "/Users/kjhealy/Documents/courses/socdata.co/files/data/congress/10_88_congress.csv"
[11] "/Users/kjhealy/Documents/courses/socdata.co/files/data/congress/11_89_congress.csv"
[12] "/Users/kjhealy/Documents/courses/socdata.co/files/data/congress/12_90_congress.csv"
[13] "/Users/kjhealy/Documents/courses/socdata.co/files/data/congress/13_91_congress.csv"
[14] "/Users/kjhealy/Documents/courses/socdata.co/files/data/congress/14_92_congress.csv"
[15] "/Users/kjhealy/Documents/courses/socdata.co/files/data/congress/15_93_congress.csv"

And feed it to read_csv()

… using map() and binding the resulting list into a tibble.

df <- filenames |>
  map(read_csv) |> 
  list_rbind(names_to = "congress") |>
  janitor::clean_names()

df
# A tibble: 20,580 × 26
   congress last   first middle suffix nickname born  death sex   position party
      <int> <chr>  <chr> <chr>  <chr>  <chr>    <chr> <chr> <chr> <chr>    <chr>
 1        1 Abern… Thom… Gerst… <NA>   <NA>     05/1… 01/2… M     U.S. Re… Demo…
 2        1 Adams  Sher… <NA>   <NA>   <NA>     01/0… 10/2… M     U.S. Re… Repu…
 3        1 Aiken  Geor… David  <NA>   <NA>     08/2… 11/1… M     U.S. Se… Repu…
 4        1 Allen  Asa   Leona… <NA>   <NA>     01/0… 01/0… M     U.S. Re… Demo…
 5        1 Allen  Leo   Elwood <NA>   <NA>     10/0… 01/1… M     U.S. Re… Repu…
 6        1 Almond J.    Linds… Jr.    <NA>     06/1… 04/1… M     U.S. Re… Demo…
 7        1 Ander… Herm… Carl   <NA>   <NA>     01/2… 07/2… M     U.S. Re… Repu…
 8        1 Ander… Clin… Presba <NA>   <NA>     10/2… 11/1… M     U.S. Re… Demo…
 9        1 Ander… John  Zuing… <NA>   <NA>     03/2… 02/0… M     U.S. Re… Repu…
10        1 Andre… Augu… Herman <NA>   <NA>     10/1… 01/1… M     U.S. Re… Repu…
# ℹ 20,570 more rows
# ℹ 15 more variables: state <chr>, district <chr>, start <chr>, end <chr>,
#   religion <chr>, race <chr>, educational_attainment <chr>, job_type1 <chr>,
#   job_type2 <chr>, job_type3 <chr>, job_type4 <chr>, job_type5 <chr>,
#   mil1 <chr>, mil2 <chr>, mil3 <chr>

read_csv() can do this directly

In fact map() is not required for this particular use:

tmp <- read_csv(filenames, id = "path",
                name_repair = janitor::make_clean_names)

tmp |>
  mutate(congress = stringr::str_extract(path, "_\\d{2,3}_congress"),
         congress = stringr::str_extract(congress, "\\d{2,3}")) |>
  relocate(congress)
# A tibble: 20,580 × 27
   congress path   last  first middle suffix nickname born  death sex   position
   <chr>    <chr>  <chr> <chr> <chr>  <chr>  <chr>    <chr> <chr> <chr> <chr>   
 1 79       /User… Aber… Thom… Gerst… <NA>   <NA>     05/1… 01/2… M     U.S. Re…
 2 79       /User… Adams Sher… <NA>   <NA>   <NA>     01/0… 10/2… M     U.S. Re…
 3 79       /User… Aiken Geor… David  <NA>   <NA>     08/2… 11/1… M     U.S. Se…
 4 79       /User… Allen Asa   Leona… <NA>   <NA>     01/0… 01/0… M     U.S. Re…
 5 79       /User… Allen Leo   Elwood <NA>   <NA>     10/0… 01/1… M     U.S. Re…
 6 79       /User… Almo… J.    Linds… Jr.    <NA>     06/1… 04/1… M     U.S. Re…
 7 79       /User… Ande… Herm… Carl   <NA>   <NA>     01/2… 07/2… M     U.S. Re…
 8 79       /User… Ande… Clin… Presba <NA>   <NA>     10/2… 11/1… M     U.S. Re…
 9 79       /User… Ande… John  Zuing… <NA>   <NA>     03/2… 02/0… M     U.S. Re…
10 79       /User… Andr… Augu… Herman <NA>   <NA>     10/1… 01/1… M     U.S. Re…
# ℹ 20,570 more rows
# ℹ 16 more variables: party <chr>, state <chr>, district <chr>, start <chr>,
#   end <chr>, religion <chr>, race <chr>, educational_attainment <chr>,
#   job_type1 <chr>, job_type2 <chr>, job_type3 <chr>, job_type4 <chr>,
#   job_type5 <chr>, mil1 <chr>, mil2 <chr>, mil3 <chr>

Example: Iterating on the U.S. Census

General strategy

  • Identify the action you want to perform.
  • Solve the single case: what function must I use or write to do the thing once?
  • Consider how many things will vary to do the thing repeatedly.
  • Replace those varying things with a placeholder or variable.
  • Choose a map or walk function that can repeatedly apply the function with those variables.

Iterating on the U.S. Census

Mapped iteration is very general, and not just for local files

## Register for a free Census API key at
## <https://api.census.gov/data/key_signup.html>
library(tidycensus)
out <- get_acs(geography = "county",
                    variables = "B19013_001",
                    state = "NY",
                    county = "New York",
                    survey = "acs1",
                    year = 2005)
out
# A tibble: 1 × 5
  GEOID NAME                      variable   estimate   moe
  <chr> <chr>                     <chr>         <dbl> <dbl>
1 36061 New York County, New York B19013_001    55973  1462

Iterating on the U.S. Census

All counties in New York State for a specific year

out <- get_acs(geography = "county",
                    variables = "B19013_001",
                    state = "NY",
                    survey = "acs1",
                    year = 2005)
out
# A tibble: 38 × 5
   GEOID NAME                         variable   estimate   moe
   <chr> <chr>                        <chr>         <dbl> <dbl>
 1 36001 Albany County, New York      B19013_001    50054  2030
 2 36005 Bronx County, New York       B19013_001    29228   853
 3 36007 Broome County, New York      B19013_001    36394  2340
 4 36009 Cattaraugus County, New York B19013_001    37580  2282
 5 36011 Cayuga County, New York      B19013_001    42057  2406
 6 36013 Chautauqua County, New York  B19013_001    35495  2077
 7 36015 Chemung County, New York     B19013_001    37418  3143
 8 36019 Clinton County, New York     B19013_001    44757  3500
 9 36027 Dutchess County, New York    B19013_001    61889  2431
10 36029 Erie County, New York        B19013_001    41967  1231
# ℹ 28 more rows

Iterating on the U.S. Census

What if we want the results for every available year? First, a handy function: set_names()

x <- c(1:10)

x
 [1]  1  2  3  4  5  6  7  8  9 10
x <- set_names(x, nm = letters[1:10])

x
 a  b  c  d  e  f  g  h  i  j 
 1  2  3  4  5  6  7  8  9 10 

Iterating on the U.S. Census

By default, set_names() will label a vector with that vector’s values:

c(1:10) |>
  set_names()
 1  2  3  4  5  6  7  8  9 10 
 1  2  3  4  5  6  7  8  9 10 

Iterating on the U.S. Census

This works with map() just fine:

df <- 2005:2019 |>
  map(\(x) get_acs(geography = "county",
                   variables = "B19013_001",
                   state = "NY",
                   survey = "acs1",
                   year = x)) |>
  list_rbind(names_to = "year")
df
# A tibble: 580 × 6
    year GEOID NAME                         variable   estimate   moe
   <int> <chr> <chr>                        <chr>         <dbl> <dbl>
 1     1 36001 Albany County, New York      B19013_001    50054  2030
 2     1 36005 Bronx County, New York       B19013_001    29228   853
 3     1 36007 Broome County, New York      B19013_001    36394  2340
 4     1 36009 Cattaraugus County, New York B19013_001    37580  2282
 5     1 36011 Cayuga County, New York      B19013_001    42057  2406
 6     1 36013 Chautauqua County, New York  B19013_001    35495  2077
 7     1 36015 Chemung County, New York     B19013_001    37418  3143
 8     1 36019 Clinton County, New York     B19013_001    44757  3500
 9     1 36027 Dutchess County, New York    B19013_001    61889  2431
10     1 36029 Erie County, New York        B19013_001    41967  1231
# ℹ 570 more rows

Iterating on the U.S. Census

Our id column indexes the years in order, but it isn’t the year itself. We’d like it to be. So, we use set_names():

df <- 2005:2019 |>
  set_names() |>
  map(\(x) get_acs(geography = "county",
                   variables = "B19013_001",
                   state = "NY",
                   survey = "acs1",
                   year = x)) |>
  list_rbind(names_to = "year") |>
  mutate(year = as.integer(year))

Iterating on the U.S. Census

df
# A tibble: 580 × 6
    year GEOID NAME                         variable   estimate   moe
   <int> <chr> <chr>                        <chr>         <dbl> <dbl>
 1  2005 36001 Albany County, New York      B19013_001    50054  2030
 2  2005 36005 Bronx County, New York       B19013_001    29228   853
 3  2005 36007 Broome County, New York      B19013_001    36394  2340
 4  2005 36009 Cattaraugus County, New York B19013_001    37580  2282
 5  2005 36011 Cayuga County, New York      B19013_001    42057  2406
 6  2005 36013 Chautauqua County, New York  B19013_001    35495  2077
 7  2005 36015 Chemung County, New York     B19013_001    37418  3143
 8  2005 36019 Clinton County, New York     B19013_001    44757  3500
 9  2005 36027 Dutchess County, New York    B19013_001    61889  2431
10  2005 36029 Erie County, New York        B19013_001    41967  1231
# ℹ 570 more rows

Now year is just the year. Because list_rbind() takes its labels from the names that set_names() attached, the year column is created as a character vector, so we converted it back to an integer at the end.

Iterating on the U.S. Census

p_out <- 2005:2019 |>
  set_names() |>
  map(\(x) get_acs(geography = "county",
                   variables = "B19013_001",
                   state = "NY",
                   survey = "acs1",
                   year = x)) |>
  list_rbind(names_to = "year") |>
  mutate(year = as.integer(year)) |>
  ggplot(mapping = aes(x = year, y = estimate, group = year)) +
  geom_boxplot(fill = "lightblue", alpha = 0.5, outlier.alpha = 0) +
  geom_jitter(position = position_jitter(width = 0.1), shape = 1) +
  scale_y_continuous(labels = scales::label_dollar()) +
  labs(x = "Year", y = "Dollars",
       title = "Median Household Income by County in New York State, 2005-2019",
       subtitle = "ACS 1-year estimates", caption = "Data: U.S. Census Bureau.")

Iterating on the U.S. Census

print(p_out)

Iterating on the U.S. Census

# This live call to the Census API draws the graph underneath
2005:2019 |>
  set_names() |>
  map(\(x) get_acs(geography = "county", variables = "B19013_001",
                   state = "NY", survey = "acs1", year = x)) |>
  list_rbind(names_to = "year") |>
  mutate(year = as.integer(year)) |>
  ggplot(mapping = aes(x = year, y = estimate, group = year)) +
  geom_boxplot(fill = "lightblue", alpha = 0.5, outlier.alpha = 0) +
  geom_jitter(position = position_jitter(width = 0.1), shape = 1) +
  scale_y_continuous(labels = scales::label_dollar()) +
  labs(x = "Year", y = "Dollars", title = "Median Household Income by County in New York State, 2005-2019",
       subtitle = "ACS 1-year estimates", caption = "Data: U.S. Census Bureau.")

Mapping and Wrangling Models

This is not a statistics seminar!

  • I’ll just give you an example of the sort of thing that many other modeling packages implement for all kinds of modeling techniques.
  • Again, the principle is tidy incorporation of models and their output.

Tidy regression output with broom

library(broom)
library(gapminder)
out <- lm(formula = lifeExp ~ gdpPercap + pop + continent,
          data = gapminder)

Tidy regression output with broom

We can’t do anything with this, programmatically.

summary(out)

Call:
lm(formula = lifeExp ~ gdpPercap + pop + continent, data = gapminder)

Residuals:
    Min      1Q  Median      3Q     Max 
-49.161  -4.486   0.297   5.110  25.175 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)       4.781e+01  3.395e-01 140.819  < 2e-16 ***
gdpPercap         4.495e-04  2.346e-05  19.158  < 2e-16 ***
pop               6.570e-09  1.975e-09   3.326 0.000901 ***
continentAmericas 1.348e+01  6.000e-01  22.458  < 2e-16 ***
continentAsia     8.193e+00  5.712e-01  14.342  < 2e-16 ***
continentEurope   1.747e+01  6.246e-01  27.973  < 2e-16 ***
continentOceania  1.808e+01  1.782e+00  10.146  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.365 on 1697 degrees of freedom
Multiple R-squared:  0.5821,    Adjusted R-squared:  0.5806 
F-statistic: 393.9 on 6 and 1697 DF,  p-value: < 2.2e-16

Tidy regression output with broom

library(broom)
tidy(out)
# A tibble: 7 × 5
  term              estimate     std.error statistic   p.value
  <chr>                <dbl>         <dbl>     <dbl>     <dbl>
1 (Intercept)        4.78e+1 0.340            141.   0        
2 gdpPercap          4.50e-4 0.0000235         19.2  3.24e- 74
3 pop                6.57e-9 0.00000000198      3.33 9.01e-  4
4 continentAmericas  1.35e+1 0.600             22.5  5.19e- 98
5 continentAsia      8.19e+0 0.571             14.3  4.06e- 44
6 continentEurope    1.75e+1 0.625             28.0  6.34e-142
7 continentOceania   1.81e+1 1.78              10.1  1.59e- 23

That’s a lot nicer. Now it’s just a tibble. We know those.
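Because it is a tibble, ordinary dplyr verbs work on it. A sketch (the 0.05 cutoff is just for illustration):

```r
# Pull out the coefficient estimates for terms below a p-value threshold
tidy(out) |>
  filter(p.value < 0.05) |>
  select(term, estimate)
```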

Tidy regression output with broom

out_conf <- tidy(out, conf.int = TRUE)
out_conf
# A tibble: 7 × 7
  term              estimate    std.error statistic   p.value conf.low conf.high
  <chr>                <dbl>        <dbl>     <dbl>     <dbl>    <dbl>     <dbl>
1 (Intercept)        4.78e+1      3.40e-1    141.   0          4.71e+1   4.85e+1
2 gdpPercap          4.50e-4      2.35e-5     19.2  3.24e- 74  4.03e-4   4.96e-4
3 pop                6.57e-9      1.98e-9      3.33 9.01e-  4  2.70e-9   1.04e-8
4 continentAmericas  1.35e+1      6.00e-1     22.5  5.19e- 98  1.23e+1   1.47e+1
5 continentAsia      8.19e+0      5.71e-1     14.3  4.06e- 44  7.07e+0   9.31e+0
6 continentEurope    1.75e+1      6.25e-1     28.0  6.34e-142  1.62e+1   1.87e+1
7 continentOceania   1.81e+1      1.78e+0     10.1  1.59e- 23  1.46e+1   2.16e+1

Tidy regression output with broom

out_conf |>
    filter(term %nin% "(Intercept)") |>
    mutate(nicelabs = prefix_strip(term, "continent")) |>
    select(nicelabs, everything())
# A tibble: 6 × 8
  nicelabs  term       estimate std.error statistic   p.value conf.low conf.high
  <chr>     <chr>         <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>
1 gdpPercap gdpPercap   4.50e-4   2.35e-5     19.2  3.24e- 74  4.03e-4   4.96e-4
2 Pop       pop         6.57e-9   1.98e-9      3.33 9.01e-  4  2.70e-9   1.04e-8
3 Americas  continent…  1.35e+1   6.00e-1     22.5  5.19e- 98  1.23e+1   1.47e+1
4 Asia      continent…  8.19e+0   5.71e-1     14.3  4.06e- 44  7.07e+0   9.31e+0
5 Europe    continent…  1.75e+1   6.25e-1     28.0  6.34e-142  1.62e+1   1.87e+1
6 Oceania   continent…  1.81e+1   1.78e+0     10.1  1.59e- 23  1.46e+1   2.16e+1

Grouped analysis and list columns

Again, single case; then tidily generalize to many cases.

eu77 <- gapminder |> filter(continent == "Europe", year == 1977)
fit <- lm(lifeExp ~ log(gdpPercap), data = eu77)
tidy(fit)
# A tibble: 2 × 5
  term           estimate std.error statistic    p.value
  <chr>             <dbl>     <dbl>     <dbl>      <dbl>
1 (Intercept)       29.5      7.16       4.12 0.000306  
2 log(gdpPercap)     4.49     0.756      5.94 0.00000217

Grouped analysis and list columns

df_nest <- gapminder |>
    group_by(continent, year) |>
    nest()

df_nest
# A tibble: 60 × 3
# Groups:   continent, year [60]
   continent  year data             
   <fct>     <int> <list>           
 1 Asia       1952 <tibble [33 × 4]>
 2 Asia       1957 <tibble [33 × 4]>
 3 Asia       1962 <tibble [33 × 4]>
 4 Asia       1967 <tibble [33 × 4]>
 5 Asia       1972 <tibble [33 × 4]>
 6 Asia       1977 <tibble [33 × 4]>
 7 Asia       1982 <tibble [33 × 4]>
 8 Asia       1987 <tibble [33 × 4]>
 9 Asia       1992 <tibble [33 × 4]>
10 Asia       1997 <tibble [33 × 4]>
# ℹ 50 more rows

Think of nesting as a kind of “super-grouping”. Look in the object inspector.

Grouped analysis and list columns

It’s still in there.

df_nest |>
    filter(continent == "Europe" & year == 1977) |>
    unnest(cols = c(data))
# A tibble: 30 × 6
# Groups:   continent, year [1]
   continent  year country                lifeExp      pop gdpPercap
   <fct>     <int> <fct>                    <dbl>    <int>     <dbl>
 1 Europe     1977 Albania                   68.9  2509048     3533.
 2 Europe     1977 Austria                   72.2  7568430    19749.
 3 Europe     1977 Belgium                   72.8  9821800    19118.
 4 Europe     1977 Bosnia and Herzegovina    69.9  4086000     3528.
 5 Europe     1977 Bulgaria                  70.8  8797022     7612.
 6 Europe     1977 Croatia                   70.6  4318673    11305.
 7 Europe     1977 Czech Republic            70.7 10161915    14800.
 8 Europe     1977 Denmark                   74.7  5088419    20423.
 9 Europe     1977 Finland                   72.5  4738902    15605.
10 Europe     1977 France                    73.8 53165019    18293.
# ℹ 20 more rows

Grouped analysis and list columns

Here we map() a custom function to every row in the data column.

# Function to solve the single case. There's one varying argument, i.e. the data
fit_ols <- function(df) {
    lm(lifeExp ~ log(gdpPercap), data = df)
}

df_nest <- gapminder |>
    group_by(continent, year) |>
    nest() |>
    mutate(model = map(data, fit_ols)) 

Grouped analysis and list columns

df_nest
# A tibble: 60 × 4
# Groups:   continent, year [60]
   continent  year data              model 
   <fct>     <int> <list>            <list>
 1 Asia       1952 <tibble [33 × 4]> <lm>  
 2 Asia       1957 <tibble [33 × 4]> <lm>  
 3 Asia       1962 <tibble [33 × 4]> <lm>  
 4 Asia       1967 <tibble [33 × 4]> <lm>  
 5 Asia       1972 <tibble [33 × 4]> <lm>  
 6 Asia       1977 <tibble [33 × 4]> <lm>  
 7 Asia       1982 <tibble [33 × 4]> <lm>  
 8 Asia       1987 <tibble [33 × 4]> <lm>  
 9 Asia       1992 <tibble [33 × 4]> <lm>  
10 Asia       1997 <tibble [33 × 4]> <lm>  
# ℹ 50 more rows

Grouped analysis and list columns

We can tidy the nested models, too.

fit_ols <- function(df) {
    lm(lifeExp ~ log(gdpPercap), data = df)
}

out_tidy <- gapminder |>
    group_by(continent, year) |>
    nest() |>
    mutate(model = map(data, fit_ols),
           tidied = map(model, tidy)) |>
    unnest(cols = c(tidied)) |>
    filter(term %nin% "(Intercept)" &
           continent %nin% "Oceania")

Grouped analysis and list columns

out_tidy
# A tibble: 48 × 9
# Groups:   continent, year [48]
   continent  year data     model  term     estimate std.error statistic p.value
   <fct>     <int> <list>   <list> <chr>       <dbl>     <dbl>     <dbl>   <dbl>
 1 Asia       1952 <tibble> <lm>   log(gdp…     4.16     1.25       3.33 2.28e-3
 2 Asia       1957 <tibble> <lm>   log(gdp…     4.17     1.28       3.26 2.71e-3
 3 Asia       1962 <tibble> <lm>   log(gdp…     4.59     1.24       3.72 7.94e-4
 4 Asia       1967 <tibble> <lm>   log(gdp…     4.50     1.15       3.90 4.77e-4
 5 Asia       1972 <tibble> <lm>   log(gdp…     4.44     1.01       4.41 1.16e-4
 6 Asia       1977 <tibble> <lm>   log(gdp…     4.87     1.03       4.75 4.42e-5
 7 Asia       1982 <tibble> <lm>   log(gdp…     4.78     0.852      5.61 3.77e-6
 8 Asia       1987 <tibble> <lm>   log(gdp…     5.17     0.727      7.12 5.31e-8
 9 Asia       1992 <tibble> <lm>   log(gdp…     5.09     0.649      7.84 7.60e-9
10 Asia       1997 <tibble> <lm>   log(gdp…     5.11     0.628      8.15 3.35e-9
# ℹ 38 more rows

Grouped analysis and list columns

out_tidy |>
    ungroup() |>
    sample_n(5)
# A tibble: 5 × 9
  continent  year data     model  term      estimate std.error statistic p.value
  <fct>     <int> <list>   <list> <chr>        <dbl>     <dbl>     <dbl>   <dbl>
1 Europe     1962 <tibble> <lm>   log(gdpP…     5.91     0.853      6.93 1.56e-7
2 Americas   2007 <tibble> <lm>   log(gdpP…     4.49     0.752      5.98 4.26e-6
3 Asia       1987 <tibble> <lm>   log(gdpP…     5.17     0.727      7.12 5.31e-8
4 Africa     1992 <tibble> <lm>   log(gdpP…     7.33     1.10       6.64 2.22e-8
5 Europe     1987 <tibble> <lm>   log(gdpP…     4.14     0.752      5.51 6.93e-6

Nest to map other things

Let’s run a very slightly different model:

# New model
fit_ols2 <- function(df) {
    lm(lifeExp ~ log(gdpPercap) + log(pop), data = df)
}

out_tidy <- gapminder |>
    group_by(continent, year) |>
    nest() |>
    mutate(model = map(data, fit_ols2),
           tidied = map(model, tidy))

out_tidy
# A tibble: 60 × 5
# Groups:   continent, year [60]
   continent  year data              model  tidied          
   <fct>     <int> <list>            <list> <list>          
 1 Asia       1952 <tibble [33 × 4]> <lm>   <tibble [3 × 5]>
 2 Asia       1957 <tibble [33 × 4]> <lm>   <tibble [3 × 5]>
 3 Asia       1962 <tibble [33 × 4]> <lm>   <tibble [3 × 5]>
 4 Asia       1967 <tibble [33 × 4]> <lm>   <tibble [3 × 5]>
 5 Asia       1972 <tibble [33 × 4]> <lm>   <tibble [3 × 5]>
 6 Asia       1977 <tibble [33 × 4]> <lm>   <tibble [3 × 5]>
 7 Asia       1982 <tibble [33 × 4]> <lm>   <tibble [3 × 5]>
 8 Asia       1987 <tibble [33 × 4]> <lm>   <tibble [3 × 5]>
 9 Asia       1992 <tibble [33 × 4]> <lm>   <tibble [3 × 5]>
10 Asia       1997 <tibble [33 × 4]> <lm>   <tibble [3 × 5]>
# ℹ 50 more rows

A function to draw a coef plot

# Plot the output from our model
mod_plot <- function(data,
                     title){
  data |>
    filter(term %nin% "(Intercept)") |>
    ggplot(mapping = aes(x = estimate,
                         xmin = estimate - std.error,
                         xmax = estimate + std.error,
                         y = reorder(term, estimate))) +
    geom_pointrange() +
    labs(title = title,
         y = NULL)
}

Add it using map2() or pmap()

  • When we have two arguments to feed a function we can use map2(). The general case is pmap(), for passing any number of arguments in a list.
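A toy case first, assuming two equal-length vectors as the pair of arguments:

```r
# Each element of the first vector is paired with the
# corresponding element of the second
map2_chr(c("a", "b"), c(1, 2), \(x, y) paste(x, y))
#> [1] "a 1" "b 2"
```

In the pipeline below, the two things that vary are the tidied model and its title.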
out_tidy <- gapminder |> group_by(continent, year) |> nest() |>
    mutate(title = paste(continent, year),
           model = map(data, fit_ols2),
           tidied = map(model, tidy),
           ggout = pmap(list(tidied, title), 
                        mod_plot)) 

out_tidy
# A tibble: 60 × 7
# Groups:   continent, year [60]
   continent  year data              title     model  tidied           ggout 
   <fct>     <int> <list>            <chr>     <list> <list>           <list>
 1 Asia       1952 <tibble [33 × 4]> Asia 1952 <lm>   <tibble [3 × 5]> <gg>  
 2 Asia       1957 <tibble [33 × 4]> Asia 1957 <lm>   <tibble [3 × 5]> <gg>  
 3 Asia       1962 <tibble [33 × 4]> Asia 1962 <lm>   <tibble [3 × 5]> <gg>  
 4 Asia       1967 <tibble [33 × 4]> Asia 1967 <lm>   <tibble [3 × 5]> <gg>  
 5 Asia       1972 <tibble [33 × 4]> Asia 1972 <lm>   <tibble [3 × 5]> <gg>  
 6 Asia       1977 <tibble [33 × 4]> Asia 1977 <lm>   <tibble [3 × 5]> <gg>  
 7 Asia       1982 <tibble [33 × 4]> Asia 1982 <lm>   <tibble [3 × 5]> <gg>  
 8 Asia       1987 <tibble [33 × 4]> Asia 1987 <lm>   <tibble [3 × 5]> <gg>  
 9 Asia       1992 <tibble [33 × 4]> Asia 1992 <lm>   <tibble [3 × 5]> <gg>  
10 Asia       1997 <tibble [33 × 4]> Asia 1997 <lm>   <tibble [3 × 5]> <gg>  
# ℹ 50 more rows

Now, every element of ggout is …

A plot!

out_tidy$ggout[[8]]

out_tidy$ggout[[18]]

We don’t just put them in there for fun

  • We can e.g. walk the plots out to disk

We don’t just put them in there for fun

  • We can e.g. walk the plots out to disk
# If a root-level figures directory doesn't exist, create one
if(!dir.exists(here("figures"))) dir.create(here("figures"))

fs::dir_ls(here("figures"))
character(0)

We don’t just put them in there for fun

  • We can e.g. walk the plots out to disk

  • walk() is map() for when you just want a “side-effect” such as printed output. There are also walk2() and pwalk().

  • When to use walk instead of map? Ask yourself “Do I want anything like a vector or a list or a tibble back—either at the console, or in an object—from this thing I’m about to do?” If the answer is no, use walk.
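A minimal illustration of that distinction:

```r
# walk() performs the side effect (here, printing) and returns its
# input invisibly, so a pipeline can continue; map() would return a list
c("first", "second") |>
  walk(print)
#> [1] "first"
#> [1] "second"
```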

pwalk(
  list(
    filename = paste0(out_tidy$title, ".png"),
    plot = out_tidy$ggout,
    path = here("figures"),
    height = 3, width = 4,
    dpi = 300
  ),
  ggsave
)

Peek again in the figures/ folder

fs::dir_ls(here("figures")) |>
  basename()
 [1] "Africa 1952.png"   "Africa 1957.png"   "Africa 1962.png"  
 [4] "Africa 1967.png"   "Africa 1972.png"   "Africa 1977.png"  
 [7] "Africa 1982.png"   "Africa 1987.png"   "Africa 1992.png"  
[10] "Africa 1997.png"   "Africa 2002.png"   "Africa 2007.png"  
[13] "Americas 1952.png" "Americas 1957.png" "Americas 1962.png"
[16] "Americas 1967.png" "Americas 1972.png" "Americas 1977.png"
[19] "Americas 1982.png" "Americas 1987.png" "Americas 1992.png"
[22] "Americas 1997.png" "Americas 2002.png" "Americas 2007.png"
[25] "Asia 1952.png"     "Asia 1957.png"     "Asia 1962.png"    
[28] "Asia 1967.png"     "Asia 1972.png"     "Asia 1977.png"    
[31] "Asia 1982.png"     "Asia 1987.png"     "Asia 1992.png"    
[34] "Asia 1997.png"     "Asia 2002.png"     "Asia 2007.png"    
[37] "Europe 1952.png"   "Europe 1957.png"   "Europe 1962.png"  
[40] "Europe 1967.png"   "Europe 1972.png"   "Europe 1977.png"  
[43] "Europe 1982.png"   "Europe 1987.png"   "Europe 1992.png"  
[46] "Europe 1997.png"   "Europe 2002.png"   "Europe 2007.png"  
[49] "Oceania 1952.png"  "Oceania 1957.png"  "Oceania 1962.png" 
[52] "Oceania 1967.png"  "Oceania 1972.png"  "Oceania 1977.png" 
[55] "Oceania 1982.png"  "Oceania 1987.png"  "Oceania 1992.png" 
[58] "Oceania 1997.png"  "Oceania 2002.png"  "Oceania 2007.png" 

Clean up

fs::dir_delete(here("figures"))