Scraping and APIs

Kieran Healy

Duke University

March 6, 2024

Load the packages, as always

library(here)      # manage file paths
library(socviz)    # data and some useful functions
library(tidyverse) # your friend and mine
library(rvest)     # For web-scraping


Attaching package: 'rvest'

The following object is masked from 'package:readr':

    guess_encoding

Followup: Iterating with Files

Solve the single case

# Draw a bar chart of counts for one variable; `name` goes in the title
make_a_plot <- function(df, name) {
  df |>
    ggplot2::ggplot(ggplot2::aes(x = value, y = n)) +
    ggplot2::geom_col() +
    ggplot2::labs(title = glue::glue("ugly example plot: { name }"))
}

Solve the single case

## E.g., for a single case
tmp <- mtcars |>
  select(cyl) |>
  count(value = cyl)

make_a_plot(tmp, "cyl")

Then generalize

out <- mtcars |>
  as_tibble() |>
  select(cyl, vs, am, gear, carb) |>
  mutate(across(everything(), as.factor)) |>
  pivot_longer(everything()) |>
  count(name, value) |>
  nest(.by = name) |>
  # The "tmp" folder is assumed to exist already in the project root
  mutate(fname = here::here("tmp", paste0(name, ".png")),
         plot = map2(data, name, \(x, y) make_a_plot(x, y)),
         # walk2() returns its first argument invisibly, so fname is unchanged
         fname = walk2(fname, plot, \(x, y) ggsave(x, y, height = 5, width = 5)))
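The pipeline writes its plots into a tmp folder, and depending on your ggplot2 version ggsave() may stop or prompt if that folder is missing. A minimal sketch of guarding against that with base R. Here tempdir() stands in for the project root, and the names root, out_dir, vars, and fnames are illustrative assumptions, not part of the original code:

```r
# Stand-in for the project root; in the slides this would be here::here()
root <- tempdir()
out_dir <- file.path(root, "tmp")

# Create the folder only if it is not already there
if (!dir.exists(out_dir)) {
  dir.create(out_dir, recursive = TRUE)
}

# Build one output path per variable name
vars <- c("cyl", "vs", "am", "gear", "carb")
fnames <- file.path(out_dir, paste0(vars, ".png"))

basename(fnames)
```

Running this once before the pipeline means every path handed to ggsave() points into a folder that exists.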

Scraping

Scraping is fundamentally unclean

It is awkward

It is very prone to error

It is brittle

It’s often against the terms of service of websites

Server-side rendering

If the webpages you are interested in looking at are statically or dynamically assembled on the server side, they arrive in your browser “fully formed”. The tables or other data structures they may contain are actually populated with numbers.

The task is to get them out by identifying the HTML elements (e.g. <table>, <tr>, <td>, <li> etc) or the CSS styling selectors (e.g. .pagetable, .datalist, #website-content-tabledata etc) and then extracting the values they enclose.

HTML element names come from a fixed, standard set; CSS selectors have a fixed identifying format (the . and # prefixes, etc) and internal structure, but the names themselves are chosen according to the needs of the site designer.
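To make the three kinds of selector concrete, here is a small sketch that runs offline. The page is hand-written with rvest's minimal_html() and is purely illustrative; it reuses the example class and id names from the text above rather than those of any real site:

```r
library(rvest)

# A tiny hand-written page standing in for a server-rendered site
page <- minimal_html('
  <table class="pagetable" id="website-content-tabledata">
    <tr><th>type</th><th>n</th></tr>
    <tr><td>Public</td><td>10</td></tr>
    <tr><td>Charter</td><td>3</td></tr>
  </table>
  <ul class="datalist"><li>first</li><li>second</li></ul>
')

# Select by element name, by class (.), or by id (#)
by_element <- page |> html_element("table") |> html_table()
by_class   <- page |> html_element(".pagetable") |> html_table()
by_id      <- page |> html_element("#website-content-tabledata") |> html_table()

# All three routes reach the same table
identical(by_element, by_class)
identical(by_class, by_id)

# Selectors can be combined: list items inside the classed list
items <- page |> html_elements(".datalist li") |> html_text2()
items
```

On a real page the element name is rarely unique, which is why class and id selectors tend to matter more in practice.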

Client-side rendering

If the webpages you are interested in looking at are assembled on the client side, then what comes down is roughly in two pieces: an empty template for the layout of the page, and a set of instructions for pulling in data separately and filling the template.

The process of requesting specific data takes place through some private or public application programming interface or API.

An API takes well-formed requests for data and returns whatever is requested in a structured format (usually JSON or XML) that can be used by the client. So we should just talk to the API directly if we can.
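As a rough illustration of what "a structured format" means in practice, the sketch below parses a JSON string with jsonlite::fromJSON(). The JSON is hand-written to stand in for an API response; its field names and values are made up for the example:

```r
library(jsonlite)

# Hand-written JSON standing in for an API response (not real data)
response_txt <- '{
  "last_updated": 1700000000,
  "data": {
    "stations": [
      {"station_id": "a1", "num_bikes_available": 4},
      {"station_id": "b2", "num_bikes_available": 0}
    ]
  }
}'

parsed <- fromJSON(response_txt)

# An array of records with the same fields is simplified to a data frame
stations <- parsed$data$stations
class(stations)
stations$num_bikes_available
```

The same function accepts a URL in place of the string, which is how it is used with the live feeds later on.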

Scraping Example

Getting a single table

A single table where we know the class.

Consider this table

Type	Mean PBE	Median PBE	Max PBE	Min PBE	N Schools	N Students
Private Waldorf	47.49	44.19	84.21	20	16	513
Public Montessori	17.08	12.24	54.55	5.97	11	706
Charter Montessori	14.28	10.26	31.67	4.35	5	227
Charter	10.76	3.03	70.37	0	314	19,863
Private Christian	6.74	3.70	92.86	0	333	8,763
Private Non-Specific	5.89	0	86.96	0	596	16,795
Private Montessori	4.64	0	35.71	0	98	2,101
Private Jewish or Islamic	2.59	0	14.29	0	8	237
Public	2.33	0.81	75	0	5314	472,802
Private Catholic	1.80	0	27.78	0	333	8,855
Private Christian Montessori	1.25	0	5	0	4	78

Getting a single table

It’s on my website. It’s the only table on the page, which means we can select it just by asking for the <table> element on the page. We use html_element() to isolate the table and html_table() to parse it into a tibble:

vactab <- read_html("https://kieranhealy.org/blog/archives/2015/02/03/another-look-at-the-california-vaccination-data/")

vactab |> 
  html_element("table") |> 
  html_table() |> 
  janitor::clean_names()

# A tibble: 11 × 7
   type                 mean_pbe median_pbe max_pbe min_pbe n_schools n_students
   <chr>                   <dbl>      <dbl>   <dbl>   <dbl>     <int> <chr>     
 1 Private Waldorf         47.5       44.2     84.2   20           16 513       
 2 Public Montessori       17.1       12.2     54.6    5.97        11 706       
 3 Charter Montessori      14.3       10.3     31.7    4.35         5 227       
 4 Charter                 10.8        3.03    70.4    0          314 19,863    
 5 Private Christian        6.74       3.7     92.9    0          333 8,763     
 6 Private Non-Specific     5.89       0       87.0    0          596 16,795    
 7 Private Montessori       4.64       0       35.7    0           98 2,101     
 8 Private Jewish or I…     2.59       0       14.3    0            8 237       
 9 Public                   2.33       0.81    75      0         5314 472,802   
10 Private Catholic         1.8        0       27.8    0          333 8,855     
11 Private Christian M…     1.25       0        5      0            4 78
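In the output above, n_students comes back as <chr> because values such as 19,863 contain thousands separators. One way to deal with that, sketched here on three rows copied from the table, is readr::parse_number(). The object names vac and vac_clean are just for illustration:

```r
library(dplyr)
library(readr)

# Three rows copied from the table above
vac <- tibble(
  type = c("Private Waldorf", "Charter", "Public"),
  n_students = c("513", "19,863", "472,802")
)

# parse_number() drops the grouping commas and returns a numeric column
vac_clean <- vac |>
  mutate(n_students = parse_number(n_students))

vac_clean
```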

When there’s more than one

Next, consider the tables on this page. Again, these are static tables, but there are several of them.

With html_element() we’ll just extract the first one:

philtabs <- read_html("https://kieranhealy.org/blog/archives/2013/06/19/lewis-and-the-women/")

philtabs |> 
  html_element("table") |> 
  html_table() |> 
  janitor::clean_names()

# A tibble: 24 × 4
    rank cites item                                 typically_cited_in        
   <int> <int> <chr>                                <chr>                     
 1     1   180 Kripke S 1980 Naming Necessity       Nous, Philosophical Review
 2     2   131 Lewis D 1986 Plurality Worlds        Nous, Philosophical Review
 3     3    97 Quine W 1960 Word Object             Philosophical Review, Nous
 4     4    83 Williamson T 2000 Knowledge Limits   Nous, Philosophical Review
 5     5    82 Lewis D 1973 Counterfactuals         Mind, Nous                
 6     6    78 Evans G 1982 Varieties Reference     Philosophical Review, Nous
 7     7    77 Chalmers D 1996 Conscious Mind       Philosophical Review, Nous
 8     7    77 Davidson D 1980 Essays Actions Event Philosophical Review, Mind
 9     9    73 Lewis D 1986 Philos Papers           Mind, Nous                
10    10    64 Parfit D 1984 Reasons Persons        Philosophical Review, Nous
# ℹ 14 more rows

When there’s more than one

With html_elements() we’ll extract all of them:

philtabs |> 
  html_elements("table")

{xml_nodeset (5)}
[1] <table>\n<thead><tr>\n<th align="left"><em>Rank</em></th>\n<th align="lef ...
[2] <table>\n<thead><tr>\n<th align="left"><em>Rank</em></th>\n<th align="lef ...
[3] <table>\n<thead><tr class="header">\n<th align="left"><em>Rank</em></th>\ ...
[4] <table>\n<thead><tr>\n<th align="left"><em>Rank</em></th>\n<th align="lef ...
[5] <table>\n<thead><tr>\n<th align="right"><em>Rank</em></th>\n<th align="ri ...

When there’s more than one

We can get the nth one with pluck():

philtabs |> 
  html_elements("table") |> 
  pluck(2) |> 
  html_table() |> 
  janitor::clean_names()

# A tibble: 33 × 4
    rank cites item                              typically_cited_in         
   <int> <int> <chr>                             <chr>                      
 1     2   131 Lewis D 1986 Plurality Worlds     Nous, Philosophical Review 
 2     5    82 Lewis D 1973 Counterfactuals      Mind, Nous                 
 3     9    73 Lewis D 1986 Philos Papers        Mind, Nous                 
 4    16    50 Lewis D 1983 Philos Papers        Nous, Mind                 
 5    18    48 Lewis D 1983 Australas J Philos   Nous, Mind                 
 6    20    44 Lewis D 1996 Australas J Philos   Nous, Philosophical Review 
 7    41    36 Lewis D 1979 J Philos Logic       Nous, Mind                 
 8    47    34 Lewis D 1991 Parts Classes        Nous, Mind                 
 9    67    29 Lewis D 1969 Convention Philos St Mind, Journal of Philosophy
10    67    29 Lewis D 1986 Philos Papers        Philosophical Review, Nous 
# ℹ 23 more rows

Or we can use Selector Gadget (or an equivalent dev tool) to find the CSS or XPath selector to the specific table, if we can find one.

Wikipedia tables

For a long time, Wikipedia tables were fairly straightforward to select because they were static. These days many of them have a JavaScript element that makes them sortable by column, but also makes them harder to grab with a CSS selector. Getting all the table elements on a page and cleaning later is often the path of least resistance.
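A sketch of the "grab everything, then choose" approach, using a hand-written two-table page so that it runs offline. The table contents are toy values; the .wikitable class is the only thing borrowed from Wikipedia:

```r
library(rvest)
library(purrr)

# A hand-written page with two tables of different sizes (toy values)
page <- minimal_html('
  <table class="wikitable">
    <tr><th>k</th><th>v</th></tr>
    <tr><td>a</td><td>1</td></tr>
  </table>
  <table class="wikitable">
    <tr><th>year</th><th>x</th><th>y</th></tr>
    <tr><td>1900</td><td>1</td><td>2</td></tr>
    <tr><td>1901</td><td>3</td><td>4</td></tr>
  </table>
')

# html_table() on a set of nodes returns a list with one tibble per table
all_tabs <- page |>
  html_elements(".wikitable") |>
  html_table()

# Inspect the dimensions to decide which table is the one you want
map(all_tabs, dim)

# Then pull it out by position
wanted <- all_tabs |> pluck(2)
wanted
```

Picking by position is fragile if the page is edited, so it is worth checking the column names of the result each time the code is rerun.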

Wikipedia tables

irl_demog <- read_html("https://en.wikipedia.org/wiki/Demographics_of_the_Republic_of_Ireland")

irl_demog |>
  # Get all the tables classed as `.wikitable` on the page
  html_elements(".wikitable") |>
  pluck(7) |>
  html_table() |>
  janitor::clean_names()

# A tibble: 124 × 10
   x     population_on_1_april live_births deaths natural_change
   <chr> <chr>                 <chr>       <chr>  <chr>         
 1 1900  3,231,000             70,435      ""     ""            
 2 1901  3,234,000             70,194      ""     ""            
 3 1902  3,205,000             71,156      ""     ""            
 4 1903  3,191,000             70,541      ""     ""            
 5 1904  3,169,000             72,261      ""     ""            
 6 1905  3,160,000             71,427      ""     ""            
 7 1906  3,164,000             72,147      ""     ""            
 8 1907  3,145,000             70,773      ""     ""            
 9 1908  3,147,000             71,439      ""     ""            
10 1909  3,135,000             72,119      ""     ""            
# ℹ 114 more rows
# ℹ 5 more variables: crude_birth_rate_per_1000 <dbl>,
#   crude_death_rate_per_1000 <dbl>, natural_change_per_1000 <dbl>,
#   crude_migration_per_1000 <chr>, total_fertility_rate_fn_1_11 <dbl>

Again, scraping is unclean

Scraping can be useful to quickly grab a table or two that you need from a website. For harvesting large amounts of data it is no longer a very good idea, on the whole, and may get you banned from websites if you abuse it.
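If you do need more than one page, spacing out requests is a simple courtesy. The sketch below wraps read_html() with purrr::slowly() so that successive calls are separated by a pause. The half-second pause and the name slow_read_html are arbitrary assumptions, and inline HTML strings stand in for URLs so that the example runs offline:

```r
library(rvest)
library(purrr)

# Wrap read_html() so successive calls are at least half a second apart
slow_read_html <- slowly(read_html, rate = rate_delay(pause = 0.5))

# Inline HTML strings stand in for URLs here
pages <- c(
  "<html><body><p>page one</p></body></html>",
  "<html><body><p>page two</p></body></html>"
)

# Read each page in turn and pull out the paragraph text
texts <- map_chr(pages, \(p) slow_read_html(p) |> html_element("p") |> html_text2())
texts
```

With real URLs a longer pause is usually appropriate, and a site's terms of service still apply however slowly you go.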

APIs

The idea of an API

Zapier provide a nice overview of APIs in a web guide that you can work your way through or skim.

We use APIs when requesting data directly. An API has a restricted set of protocols and methods by which a client (which can be you, but also can be your browser or an application) can ask it for data.

It has a defined set of responses (to say “OK” or “No” or “That didn’t work”) and formats in which it provides the client with an answer. Usually this will be a blob of JSON or XML data.

API endpoints

Just as a URL specifies a request for a specific webpage, an API endpoint is a URL-like request for a specific blob of data. The trick is specifying the correct URL, which in essence specifies a request to the API.
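Because endpoints are structured, they can often be assembled from parts. The helper below is hypothetical: gbfs_endpoint() is not part of any package, and its pattern is inferred from the bike share feed URLs that appear later in these slides. It would not cover every feed, since the top-level gbfs feed there has no language component in its path:

```r
# Hypothetical helper: build a feed URL from a base, a language, and a feed name
gbfs_endpoint <- function(feed, lang = "en",
                          base = "https://gbfs.lyft.com/gbfs/2.3/bkn") {
  paste0(base, "/", lang, "/", feed, ".json")
}

gbfs_endpoint("station_status")
gbfs_endpoint("station_information", lang = "fr")
```

In practice it is safer to read the URLs from the feed listing the API itself provides, as the following slides do, than to guess at the pattern.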

Example: the NY Citibike API

NYC CitiBikes

Many cities have Bike Share programs. Many of those provide data according to the GBFS specification. A spec like this is a set of rules that says “If you adhere to this spec, you will provide data in the following consistent way”, where this includes rules about the data format and the endpoints or URLs where that data can be found.

NYC CitiBikes

The spec has a rule saying “Provide a feed specifying the data available”.

gbfsurl <- "https://gbfs.citibikenyc.com/gbfs/2.3/gbfs.json"
feeds <- jsonlite::fromJSON(gbfsurl) 
str(feeds)

List of 4
 $ data        :List of 3
  ..$ en:List of 1
  .. ..$ feeds:'data.frame':    12 obs. of  2 variables:
  .. .. ..$ name: chr [1:12] "gbfs" "system_information" "station_information" "station_status" ...
  .. .. ..$ url : chr [1:12] "https://gbfs.lyft.com/gbfs/2.3/bkn/gbfs.json" "https://gbfs.lyft.com/gbfs/2.3/bkn/en/system_information.json" "https://gbfs.lyft.com/gbfs/2.3/bkn/en/station_information.json" "https://gbfs.lyft.com/gbfs/2.3/bkn/en/station_status.json" ...
  ..$ es:List of 1
  .. ..$ feeds:'data.frame':    12 obs. of  2 variables:
  .. .. ..$ name: chr [1:12] "gbfs" "system_information" "station_information" "station_status" ...
  .. .. ..$ url : chr [1:12] "https://gbfs.lyft.com/gbfs/2.3/bkn/gbfs.json" "https://gbfs.lyft.com/gbfs/2.3/bkn/es/system_information.json" "https://gbfs.lyft.com/gbfs/2.3/bkn/es/station_information.json" "https://gbfs.lyft.com/gbfs/2.3/bkn/es/station_status.json" ...
  ..$ fr:List of 1
  .. ..$ feeds:'data.frame':    12 obs. of  2 variables:
  .. .. ..$ name: chr [1:12] "gbfs" "system_information" "station_information" "station_status" ...
  .. .. ..$ url : chr [1:12] "https://gbfs.lyft.com/gbfs/2.3/bkn/gbfs.json" "https://gbfs.lyft.com/gbfs/2.3/bkn/fr/system_information.json" "https://gbfs.lyft.com/gbfs/2.3/bkn/fr/station_information.json" "https://gbfs.lyft.com/gbfs/2.3/bkn/fr/station_status.json" ...
 $ last_updated: int 1711645038
 $ ttl         : int 60
 $ version     : chr "2.3"

This comes to us as a nested list.

NYC CitiBikes

JSON data is typically nested. It can look really complex at first glance. This is one reason to read the spec. We can put the feed into a tibble if we like:

feeds_df <- as_tibble(feeds)
feeds_df

# A tibble: 3 × 4
  data             last_updated   ttl version
  <named list>            <int> <int> <chr>  
1 <named list [1]>   1711645038    60 2.3    
2 <named list [1]>   1711645038    60 2.3    
3 <named list [1]>   1711645038    60 2.3

RStudio’s object viewer is a good way to explore the hierarchical structure of unfamiliar lists.
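Outside RStudio, names() and str() with a depth limit do a similar job. The sketch uses a cut-down, hand-written list with the same shape as the feed; the example.com URLs are placeholders, not real endpoints:

```r
# A cut-down list with the same shape as the feed (URLs are placeholders)
mini <- list(
  data = list(
    en = list(feeds = data.frame(
      name = c("gbfs", "station_status"),
      url  = c("https://example.com/gbfs.json",
               "https://example.com/en/station_status.json")
    )),
    fr = list(feeds = data.frame(
      name = c("gbfs", "station_status"),
      url  = c("https://example.com/gbfs.json",
               "https://example.com/fr/station_status.json")
    ))
  ),
  last_updated = 1711645038L,
  ttl = 60L,
  version = "2.3"
)

names(mini)               # what is at the top level
names(mini$data)          # language codes one level down
str(mini, max.level = 2)  # limit how deep str() prints
```

Raising max.level one step at a time keeps the printout readable while you work out where the data frame you want is sitting.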

NYC CitiBikes

If we explore or look at the spec we see that each row provides the same feed information in English, Spanish, or French. We can slice out the English feed and look at what’s in it:

feeds_df |> 
  slice(1) |> 
  unnest(data) |> # It's two levels down
  unnest(data)

# A tibble: 12 × 5
   name                 url                           last_updated   ttl version
   <chr>                <chr>                                <int> <int> <chr>  
 1 gbfs                 https://gbfs.lyft.com/gbfs/2…   1711645038    60 2.3    
 2 system_information   https://gbfs.lyft.com/gbfs/2…   1711645038    60 2.3    
 3 station_information  https://gbfs.lyft.com/gbfs/2…   1711645038    60 2.3    
 4 station_status       https://gbfs.lyft.com/gbfs/2…   1711645038    60 2.3    
 5 free_bike_status     https://gbfs.lyft.com/gbfs/2…   1711645038    60 2.3    
 6 system_hours         https://gbfs.lyft.com/gbfs/2…   1711645038    60 2.3    
 7 system_calendar      https://gbfs.lyft.com/gbfs/2…   1711645038    60 2.3    
 8 system_regions       https://gbfs.lyft.com/gbfs/2…   1711645038    60 2.3    
 9 system_pricing_plans https://gbfs.lyft.com/gbfs/2…   1711645038    60 2.3    
10 system_alerts        https://gbfs.lyft.com/gbfs/2…   1711645038    60 2.3    
11 gbfs_versions        https://gbfs.lyft.com/gbfs/2…   1711645038    60 2.3    
12 vehicle_types        https://gbfs.lyft.com/gbfs/2…   1711645038    60 2.3

More URLs! Let’s extract the station data feed.

NYC CitiBikes

nyc_stations_url <- feeds_df |> 
  slice(1) |> unnest(data) |> unnest(data) |> 
  filter(name == "station_information") |> pull(url)

nyc_stations_url

[1] "https://gbfs.lyft.com/gbfs/2.3/bkn/en/station_information.json"

NYC CitiBikes

That was tedious. If we know precisely what we’re after we can do it faster by plumbing down through the list:

# Base R style
feeds_df[[1]]$en$feeds$url[3]

[1] "https://gbfs.lyft.com/gbfs/2.3/bkn/en/station_information.json"

Or with pluck():

# Pluck by element number or name
feeds_df |> pluck(1,"en", "feeds", "url", 3)

[1] "https://gbfs.lyft.com/gbfs/2.3/bkn/en/station_information.json"
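One thing to keep in mind with pluck() is what happens when the path does not exist. A small sketch on a hand-written list (the "de" language code is deliberately absent, and the u1 to u3 values are placeholders):

```r
library(purrr)

# A hand-written nested list; "de" is deliberately missing
mini <- list(data = list(en = list(feeds = list(url = c("u1", "u2", "u3")))))

# An existing path works as above
found <- pluck(mini, "data", "en", "feeds", "url", 3)

# A missing path quietly returns NULL ...
missing_path <- pluck(mini, "data", "de", "feeds", "url", 3)

# ... unless you supply a default
with_default <- pluck(mini, "data", "de", "feeds", "url", 3,
                      .default = NA_character_)

# chuck() is the strict variant: it throws an error instead
strict <- tryCatch(chuck(mini, "data", "de"),
                   error = \(e) "no such element")
```

The quiet NULL is convenient when exploring, while chuck() is the safer choice in code that should fail loudly if the API changes its structure.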

NYC CitiBikes

Or with some combination of methods:

feeds_df |> 
  pluck(1,"en", "feeds") |> 
  as_tibble()

# A tibble: 12 × 2
   name                 url                                                     
   <chr>                <chr>                                                   
 1 gbfs                 https://gbfs.lyft.com/gbfs/2.3/bkn/gbfs.json            
 2 system_information   https://gbfs.lyft.com/gbfs/2.3/bkn/en/system_informatio…
 3 station_information  https://gbfs.lyft.com/gbfs/2.3/bkn/en/station_informati…
 4 station_status       https://gbfs.lyft.com/gbfs/2.3/bkn/en/station_status.js…
 5 free_bike_status     https://gbfs.lyft.com/gbfs/2.3/bkn/en/free_bike_status.…
 6 system_hours         https://gbfs.lyft.com/gbfs/2.3/bkn/en/system_hours.json 
 7 system_calendar      https://gbfs.lyft.com/gbfs/2.3/bkn/en/system_calendar.j…
 8 system_regions       https://gbfs.lyft.com/gbfs/2.3/bkn/en/system_regions.js…
 9 system_pricing_plans https://gbfs.lyft.com/gbfs/2.3/bkn/en/system_pricing_pl…
10 system_alerts        https://gbfs.lyft.com/gbfs/2.3/bkn/en/system_alerts.json
11 gbfs_versions        https://gbfs.lyft.com/gbfs/2.3/bkn/en/gbfs_versions.json
12 vehicle_types        https://gbfs.lyft.com/gbfs/2.3/bkn/en/vehicle_types.json

Then work as you would with a tibble.

NYC CitiBikes

station_status_url <- feeds_df |> pluck(1,"en", "feeds") |> 
  filter(name == "station_status") |> pull(url)

station_status_df <- jsonlite::fromJSON(station_status_url)

str(station_status_df) # Still a list

List of 4
 $ data        :List of 1
  ..$ stations:'data.frame':    2209 obs. of  13 variables:
  .. ..$ num_docks_available     : int [1:2209] 0 0 0 3 19 27 8 2 9 8 ...
  .. ..$ num_ebikes_available    : int [1:2209] 0 0 0 0 3 1 0 1 2 10 ...
  .. ..$ last_reported           : int [1:2209] 86400 1710177561 1711117451 1711644898 1711644899 1711644899 1711644899 1711644900 1711644899 1711644906 ...
  .. ..$ num_bikes_available     : int [1:2209] 0 0 0 14 26 6 16 17 11 11 ...
  .. ..$ num_bikes_disabled      : int [1:2209] 0 0 0 2 2 0 1 1 0 1 ...
  .. ..$ is_installed            : int [1:2209] 0 0 0 1 1 1 1 1 1 1 ...
  .. ..$ vehicle_types_available :List of 2209
  .. .. ..$ :'data.frame':  2 obs. of  2 variables:
  .. .. .. ..$ vehicle_type_id: chr [1:2] "1" "2"
  .. .. .. ..$ count          : int [1:2] 0 0
  .. .. ..$ :'data.frame':  2 obs. of  2 variables:
  .. .. .. ..$ vehicle_type_id: chr [1:2] "1" "2"
  .. .. .. ..$ count          : int [1:2] 0 0
  .. .. ..$ :'data.frame':  2 obs. of  2 variables:
  .. .. .. ..$ vehicle_type_id: chr [1:2] "1" "2"
  .. .. .. ..$ count          : int [1:2] 0 0
  .. .. ..$ :'data.frame':  2 obs. of  2 variables:
  .. .. .. ..$ vehicle_type_id: chr [1:2] "1" "2"
  .. .. .. ..$ count          : int [1:2] 14 0
  .. .. ..$ :'data.frame':  2 obs. of  2 variables:
  .. .. .. ..$ vehicle_type_id: chr [1:2] "1" "2"
  .. .. .. ..$ count          : int [1:2] 23 3
  .. .. ..$ :'data.frame':  2 obs. of  2 variables:
  .. .. .. ..$ vehicle_type_id: chr [1:2] "1" "2"
  .. .. .. ..$ count          : int [1:2] 5 1
  .. .. ..$ :'data.frame':  2 obs. of  2 variables:
  .. .. .. ..$ vehicle_type_id: chr [1:2] "1" "2"
  .. .. .. ..$ count          : int [1:2] 16 0
  .. .. ..$ :'data.frame':  2 obs. of  2 variables:
  .. .. .. ..$ vehicle_type_id: chr [1:2] "1" "2"
  .. .. .. ..$ count          : int [1:2] 16 1
  .. .. ..$ :'data.frame':  2 obs. of  2 variables:
  .. .. .. ..$ vehicle_type_id: chr [1:2] "1" "2"
  .. .. .. ..$ count          : int [1:2] 9 2
  .. .. ..$ :'data.frame':  2 obs. of  2 variables:
  .. .. .. ..$ vehicle_type_id: chr [1:2] "1" "2"
  .. .. .. ..$ count          : int [1:2] 1 10
  .. .. ..$ :'data.frame':  2 obs. of  2 variables:
  .. .. .. ..$ vehicle_type_id: chr [1:2] "1" "2"
  .. .. .. ..$ count          : int [1:2] 7 5
  .. .. ..$ :'data.frame':  2 obs. of  2 variables:
  .. .. .. ..$ vehicle_type_id: chr [1:2] "1" "2"
  .. .. .. ..$ count          : int [1:2] 9 4
  .. .. ..$ :'data.frame':  2 obs. of  2 variables:
  .. .. .. ..$ vehicle_type_id: chr [1:2] "1" "2"
  .. .. .. ..$ count          : int [1:2] 13 1
  .. .. ..$ :'data.frame':  2 obs. of  2 variables:
  .. .. .. ..$ vehicle_type_id: chr [1:2] "1" "2"
  .. .. .. ..$ count          : int [1:2] 3 1
  .. .. ..$ :'data.frame':  2 obs. of  2 variables:
  .. .. .. ..$ vehicle_type_id: chr [1:2] "1" "2"
  .. .. .. ..$ count          : int [1:2] 4 4
  .. .. ..$ :'data.frame':  2 obs. of  2 variables:
  .. .. .. ..$ vehicle_type_id: chr [1:2] "1" "2"
  .. .. .. ..$ count          : int [1:2] 6 0
  .. .. ..$ :'data.frame':  2 obs. of  2 variables:
  .. .. .. ..$ vehicle_type_id: chr [1:2] "1" "2"
  .. .. .. ..$ count          : int [1:2] 2 3
  .. .. ..$ :'data.frame':  2 obs. of  2 variables:
  .. .. .. ..$ vehicle_type_id: chr [1:2] "1" "2"
  .. .. .. ..$ count          : int [1:2] 5 19
  .. .. ..$ :'data.frame':  2 obs. of  2 variables:
  .. .. .. ..$ vehicle_type_id: chr [1:2] "1" "2"
  .. .. .. ..$ count          : int [1:2] 10 1
  .. .. ..$ :'data.frame':  2 obs. of  2 variables:
  .. .. .. ..$ vehicle_type_id: chr [1:2] "1" "2"
  .. .. .. ..$ count          : int [1:2] 13 3
  .. .. ..$ :'data.frame':  2 obs. of  2 variables:
  .. .. .. ..$ vehicle_type_id: chr [1:2] "1" "2"
  .. .. .. ..$ count          : int [1:2] 7 1
  .. .. ..$ :'data.frame':  2 obs. of  2 variables:
  .. .. .. ..$ vehicle_type_id: chr [1:2] "1" "2"
  .. .. .. ..$ count          : int [1:2] 4 1
  .. .. ..$ :'data.frame':  2 obs. of  2 variables:
  .. .. .. ..$ vehicle_type_id: chr [1:2] "1" "2"
  .. .. .. ..$ count          : int [1:2] 6 1
  .. .. ..$ :'data.frame':  2 obs. of  2 variables:
  .. .. .. ..$ vehicle_type_id: chr [1:2] "1" "2"
  .. .. .. ..$ count          : int [1:2] 10 0
  .. .. ..$ :'data.frame':  2 obs. of  2 variables:
  .. .. .. ..$ vehicle_type_id: chr [1:2] "1" "2"
  .. .. .. ..$ count          : int [1:2] 15 2
  .. .. ..$ :'data.frame':  2 obs. of  2 variables:
  .. .. .. ..$ vehicle_type_id: chr [1:2] "1" "2"
  .. .. .. ..$ count          : int [1:2] 31 2
  .. .. ..$ :'data.frame':  2 obs. of  2 variables:
  .. .. .. ..$ vehicle_type_id: chr [1:2] "1" "2"
  .. .. .. ..$ count          : int [1:2] 18 12
  .. .. ..$ :'data.frame':  2 obs. of  2 variables:
  .. .. .. ..$ vehicle_type_id: chr [1:2] "1" "2"
  .. .. .. ..$ count          : int [1:2] 6 11
  .. .. ..$ :'data.frame':  2 obs. of  2 variables:
  .. .. .. ..$ vehicle_type_id: chr [1:2] "1" "2"
  .. .. .. ..$ count          : int [1:2] 6 6
  .. .. ..$ :'data.frame':  2 obs. of  2 variables:
  .. .. .. ..$ vehicle_type_id: chr [1:2] "1" "2"
  .. .. .. ..$ count          : int [1:2] 1 3
  .. .. ..$ :'data.frame':  2 obs. of  2 variables:
  .. .. .. ..$ vehicle_type_id: chr [1:2] "1" "2"
  .. .. .. ..$ count          : int [1:2] 21 1
  .. .. ..$ :'data.frame':  2 obs. of  2 variables:
  .. .. .. ..$ vehicle_type_id: chr [1:2] "1" "2"
  .. .. .. ..$ count          : int [1:2] 7 29
  .. .. ..$ :'data.frame':  2 obs. of  2 variables:
  .. .. .. ..$ vehicle_type_id: chr [1:2] "1" "2"
  .. .. .. ..$ count          : int [1:2] 0 13
  .. .. ..$ :'data.frame':  2 obs. of  2 variables:
  .. .. .. ..$ vehicle_type_id: chr [1:2] "1" "2"
  .. .. .. ..$ count          : int [1:2] 10 1
  .. .. ..$ :'data.frame':  2 obs. of  2 variables:
  .. .. .. ..$ vehicle_type_id: chr [1:2] "1" "2"
  .. .. .. ..$ count          : int [1:2] 21 33
  .. .. ..$ :'data.frame':  2 obs. of  2 variables:
  .. .. .. ..$ vehicle_type_id: chr [1:2] "1" "2"
  .. .. .. ..$ count          : int [1:2] 6 3
  .. .. ..$ :'data.frame':  2 obs. of  2 variables:
  .. .. .. ..$ vehicle_type_id: chr [1:2] "1" "2"
  .. .. .. ..$ count          : int [1:2] 2 15
  .. .. ..$ :'data.frame':  2 obs. of  2 variables:
  .. .. .. ..$ vehicle_type_id: chr [1:2] "1" "2"
  .. .. .. ..$ count          : int [1:2] 4 3
  .. .. ..$ :'data.frame':  2 obs. of  2 variables:
  .. .. .. ..$ vehicle_type_id: chr [1:2] "1" "2"
  .. .. .. ..$ count          : int [1:2] 5 8
  .. .. ..$ :'data.frame':  2 obs. of  2 variables:
  .. .. .. ..$ vehicle_type_id: chr [1:2] "1" "2"
  .. .. .. ..$ count          : int [1:2] 0 0
  .. .. ..$ :'data.frame':  2 obs. of  2 variables:
  .. .. .. ..$ vehicle_type_id: chr [1:2] "1" "2"
  .. .. .. ..$ count          : int [1:2] 15 0
  .. .. ..$ :'data.frame':  2 obs. of  2 variables:
  .. .. .. ..$ vehicle_type_id: chr [1:2] "1" "2"
  .. .. .. ..$ count          : int [1:2] 42 1
  .. .. ..$ :'data.frame':  2 obs. of  2 variables:
  .. .. .. ..$ vehicle_type_id: chr [1:2] "1" "2"
  .. .. .. ..$ count          : int [1:2] 4 0
  .. .. ..$ :'data.frame':  2 obs. of  2 variables:
  .. .. .. ..$ vehicle_type_id: chr [1:2] "1" "2"
  .. .. .. ..$ count          : int [1:2] 2 0
  .. .. ..$ :'data.frame':  2 obs. of  2 variables:
  .. .. .. ..$ vehicle_type_id: chr [1:2] "1" "2"
  .. .. .. ..$ count          : int [1:2] 9 4
  .. .. ..$ :'data.frame':  2 obs. of  2 variables:
  .. .. .. ..$ vehicle_type_id: chr [1:2] "1" "2"
  .. .. .. ..$ count          : int [1:2] 3 28
  .. .. ..$ :'data.frame':  2 obs. of  2 variables:
  .. .. .. ..$ vehicle_type_id: chr [1:2] "1" "2"
  .. .. .. ..$ count          : int [1:2] 58 3
  .. .. ..$ :'data.frame':  2 obs. of  2 variables:
  .. .. .. ..$ vehicle_type_id: chr [1:2] "1" "2"
  .. .. .. ..$ count          : int [1:2] 10 4
  .. .. ..$ :'data.frame':  2 obs. of  2 variables:
  .. .. .. ..$ vehicle_type_id: chr [1:2] "1" "2"
  .. .. .. ..$ count          : int [1:2] 0 8
  .. .. ..$ :'data.frame':  2 obs. of  2 variables:
  .. .. .. ..$ vehicle_type_id: chr [1:2] "1" "2"
  .. .. .. ..$ count          : int [1:2] 9 8
  .. .. ..$ :'data.frame':  2 obs. of  2 variables:
  .. .. .. ..$ vehicle_type_id: chr [1:2] "1" "2"
  .. .. .. ..$ count          : int [1:2] 12 6
  .. .. ..$ :'data.frame':  2 obs. of  2 variables:
  .. .. .. ..$ vehicle_type_id: chr [1:2] "1" "2"
  .. .. .. ..$ count          : int [1:2] 0 0
  .. .. ..$ :'data.frame':  2 obs. of  2 variables:
  .. .. .. ..$ vehicle_type_id: chr [1:2] "1" "2"
  .. .. .. ..$ count          : int [1:2] 3 3
  .. .. ..$ :'data.frame':  2 obs. of  2 variables:
  .. .. .. ..$ vehicle_type_id: chr [1:2] "1" "2"
  .. .. .. ..$ count          : int [1:2] 7 11
  .. .. ..$ :'data.frame':  2 obs. of  2 variables:
  .. .. .. ..$ vehicle_type_id: chr [1:2] "1" "2"
  .. .. .. ..$ count          : int [1:2] 3 4
  .. .. ..$ :'data.frame':  2 obs. of  2 variables:
  .. .. .. ..$ vehicle_type_id: chr [1:2] "1" "2"
  .. .. .. ..$ count          : int [1:2] 20 2
  .. .. ..$ :'data.frame':  2 obs. of  2 variables:
  .. .. .. ..$ vehicle_type_id: chr [1:2] "1" "2"
  .. .. .. ..$ count          : int [1:2] 14 0
  .. .. ..$ :'data.frame':  2 obs. of  2 variables:
  .. .. .. ..$ vehicle_type_id: chr [1:2] "1" "2"
  .. .. .. ..$ count          : int [1:2] 13 0
  .. .. ..$ :'data.frame':  2 obs. of  2 variables:
  .. .. .. ..$ vehicle_type_id: chr [1:2] "1" "2"
  .. .. .. ..$ count          : int [1:2] 0 22
  .. .. ..$ :'data.frame':  2 obs. of  2 variables:
  .. .. .. ..$ vehicle_type_id: chr [1:2] "1" "2"
  .. .. .. ..$ count          : int [1:2] 12 8
  .. .. ..$ :'data.frame':  2 obs. of  2 variables:
  .. .. .. ..$ vehicle_type_id: chr [1:2] "1" "2"
  .. .. .. ..$ count          : int [1:2] 9 1
  .. .. ..$ :'data.frame':  2 obs. of  2 variables:
  .. .. .. ..$ vehicle_type_id: chr [1:2] "1" "2"
  .. .. .. ..$ count          : int [1:2] 7 4
  .. .. ..$ :'data.frame':  2 obs. of  2 variables:
  .. .. .. ..$ vehicle_type_id: chr [1:2] "1" "2"
  .. .. .. ..$ count          : int [1:2] 8 24
  .. .. ..$ :'data.frame':  2 obs. of  2 variables:
  .. .. .. ..$ vehicle_type_id: chr [1:2] "1" "2"
  .. .. .. ..$ count          : int [1:2] 3 5
  .. .. ..$ :'data.frame':  2 obs. of  2 variables:
  .. .. .. ..$ vehicle_type_id: chr [1:2] "1" "2"
  .. .. .. ..$ count          : int [1:2] 7 2
  .. .. ..$ :'data.frame':  2 obs. of  2 variables:
  .. .. .. ..$ vehicle_type_id: chr [1:2] "1" "2"
  .. .. .. ..$ count          : int [1:2] 15 22
  .. .. ..$ :'data.frame':  2 obs. of  2 variables:
  .. .. .. ..$ vehicle_type_id: chr [1:2] "1" "2"
  .. .. .. ..$ count          : int [1:2] 4 1
  .. .. ..$ :'data.frame':  2 obs. of  2 variables:
  .. .. .. ..$ vehicle_type_id: chr [1:2] "1" "2"
  .. .. .. ..$ count          : int [1:2] 3 0
  .. .. ..$ :'data.frame':  2 obs. of  2 variables:
  .. .. .. ..$ vehicle_type_id: chr [1:2] "1" "2"
  .. .. .. ..$ count          : int [1:2] 0 3
  .. .. ..$ :'data.frame':  2 obs. of  2 variables:
  .. .. .. ..$ vehicle_type_id: chr [1:2] "1" "2"
  .. .. .. ..$ count          : int [1:2] 17 3
  .. .. ..$ :'data.frame':  2 obs. of  2 variables:
  .. .. .. ..$ vehicle_type_id: chr [1:2] "1" "2"
  .. .. .. ..$ count          : int [1:2] 28 9
  .. .. ..$ :'data.frame':  2 obs. of  2 variables:
  .. .. .. ..$ vehicle_type_id: chr [1:2] "1" "2"
  .. .. .. ..$ count          : int [1:2] 1 1
  .. .. ..$ :'data.frame':  2 obs. of  2 variables:
  .. .. .. ..$ vehicle_type_id: chr [1:2] "1" "2"
  .. .. .. ..$ count          : int [1:2] 5 2
  .. .. ..$ :'data.frame':  2 obs. of  2 variables:
  .. .. .. ..$ vehicle_type_id: chr [1:2] "1" "2"
  .. .. .. ..$ count          : int [1:2] 1 10
  .. .. ..$ :'data.frame':  2 obs. of  2 variables:
  .. .. .. ..$ vehicle_type_id: chr [1:2] "1" "2"
  .. .. .. ..$ count          : int [1:2] 4 11
  .. .. ..$ :'data.frame':  2 obs. of  2 variables:
  .. .. .. ..$ vehicle_type_id: chr [1:2] "1" "2"
  .. .. .. ..$ count          : int [1:2] 12 2
  .. .. ..$ :'data.frame':  2 obs. of  2 variables:
  .. .. .. ..$ vehicle_type_id: chr [1:2] "1" "2"
  .. .. .. ..$ count          : int [1:2] 4 3
  .. .. ..$ :'data.frame':  2 obs. of  2 variables:
  .. .. .. ..$ vehicle_type_id: chr [1:2] "1" "2"
  .. .. .. ..$ count          : int [1:2] 6 6
  .. .. ..$ :'data.frame':  2 obs. of  2 variables:
  .. .. .. ..$ vehicle_type_id: chr [1:2] "1" "2"
  .. .. .. ..$ count          : int [1:2] 10 8
  .. .. ..$ :'data.frame':  2 obs. of  2 variables:
  .. .. .. ..$ vehicle_type_id: chr [1:2] "1" "2"
  .. .. .. ..$ count          : int [1:2] 7 6
  .. .. ..$ :'data.frame':  2 obs. of  2 variables:
  .. .. .. ..$ vehicle_type_id: chr [1:2] "1" "2"
  .. .. .. ..$ count          : int [1:2] 17 1
  .. .. ..$ :'data.frame':  2 obs. of  2 variables:
  .. .. .. ..$ vehicle_type_id: chr [1:2] "1" "2"
  .. .. .. ..$ count          : int [1:2] 2 0
  .. .. ..$ :'data.frame':  2 obs. of  2 variables:
  .. .. .. ..$ vehicle_type_id: chr [1:2] "1" "2"
  .. .. .. ..$ count          : int [1:2] 21 3
  .. .. ..$ :'data.frame':  2 obs. of  2 variables:
  .. .. .. ..$ vehicle_type_id: chr [1:2] "1" "2"
  .. .. .. ..$ count          : int [1:2] 9 0
  .. .. ..$ :'data.frame':  2 obs. of  2 variables:
  .. .. .. ..$ vehicle_type_id: chr [1:2] "1" "2"
  .. .. .. ..$ count          : int [1:2] 0 0
  .. .. ..$ :'data.frame':  2 obs. of  2 variables:
  .. .. .. ..$ vehicle_type_id: chr [1:2] "1" "2"
  .. .. .. ..$ count          : int [1:2] 20 2
  .. .. ..$ :'data.frame':  2 obs. of  2 variables:
  .. .. .. ..$ vehicle_type_id: chr [1:2] "1" "2"
  .. .. .. ..$ count          : int [1:2] 14 1
  .. .. ..$ :'data.frame':  2 obs. of  2 variables:
  .. .. .. ..$ vehicle_type_id: chr [1:2] "1" "2"
  .. .. .. ..$ count          : int [1:2] 6 1
  .. .. ..$ :'data.frame':  2 obs. of  2 variables:
  .. .. .. ..$ vehicle_type_id: chr [1:2] "1" "2"
  .. .. .. ..$ count          : int [1:2] 5 2
  .. .. ..$ :'data.frame':  2 obs. of  2 variables:
  .. .. .. ..$ vehicle_type_id: chr [1:2] "1" "2"
  .. .. .. ..$ count          : int [1:2] 4 3
  .. .. ..$ :'data.frame':  2 obs. of  2 variables:
  .. .. .. ..$ vehicle_type_id: chr [1:2] "1" "2"
  .. .. .. ..$ count          : int [1:2] 16 5
  .. .. ..$ :'data.frame':  2 obs. of  2 variables:
  .. .. .. ..$ vehicle_type_id: chr [1:2] "1" "2"
  .. .. .. ..$ count          : int [1:2] 19 0
  .. .. ..$ :'data.frame':  2 obs. of  2 variables:
  .. .. .. ..$ vehicle_type_id: chr [1:2] "1" "2"
  .. .. .. ..$ count          : int [1:2] 8 4
  .. .. ..$ :'data.frame':  2 obs. of  2 variables:
  .. .. .. ..$ vehicle_type_id: chr [1:2] "1" "2"
  .. .. .. ..$ count          : int [1:2] 13 0
  .. .. ..$ :'data.frame':  2 obs. of  2 variables:
  .. .. .. ..$ vehicle_type_id: chr [1:2] "1" "2"
  .. .. .. ..$ count          : int [1:2] 7 3
  .. .. ..$ :'data.frame':  2 obs. of  2 variables:
  .. .. .. ..$ vehicle_type_id: chr [1:2] "1" "2"
  .. .. .. ..$ count          : int [1:2] 0 6
  .. .. ..$ :'data.frame':  2 obs. of  2 variables:
  .. .. .. ..$ vehicle_type_id: chr [1:2] "1" "2"
  .. .. .. ..$ count          : int [1:2] 7 9
  .. .. ..$ :'data.frame':  2 obs. of  2 variables:
  .. .. .. ..$ vehicle_type_id: chr [1:2] "1" "2"
  .. .. .. ..$ count          : int [1:2] 40 1
  .. .. ..$ :'data.frame':  2 obs. of  2 variables:
  .. .. .. ..$ vehicle_type_id: chr [1:2] "1" "2"
  .. .. .. ..$ count          : int [1:2] 6 2
  .. .. .. [list output truncated]
  .. ..$ num_docks_disabled      : int [1:2209] 0 0 0 0 0 0 0 0 0 0 ...
  .. ..$ is_renting              : int [1:2209] 0 0 0 1 1 1 1 1 1 1 ...
  .. ..$ is_returning            : int [1:2209] 0 0 0 1 1 1 1 1 1 1 ...
  .. ..$ station_id              : chr [1:2209] "06439006-11b6-44f0-8545-c9d39035f32a" "19d17911-1e4a-41fa-b62b-719aa0a6182e" "66dd4a52-0aca-11e7-82f6-3863bb44ef7c" "1848241791125254746" ...
  .. ..$ num_scooters_available  : int [1:2209] NA NA NA 0 0 0 0 0 0 0 ...
  .. ..$ num_scooters_unavailable: int [1:2209] NA NA NA 0 0 0 0 0 0 0 ...
 $ last_updated: int 1711645041
 $ ttl         : int 60
 $ version     : chr "2.3"

Like the overall feeds data, the station status data is a nested list. It records when the feed was last updated and the software version as well as the actual data, which is in station_status_df$data$stations. We can get it out.

NYC CitiBikes

station_status_df <- station_status_df$data$stations |> 
  as_tibble()

station_status_df

# A tibble: 2,209 × 13
   num_docks_available num_ebikes_available last_reported num_bikes_available
                 <int>                <int>         <int>               <int>
 1                   0                    0         86400                   0
 2                   0                    0    1710177561                   0
 3                   0                    0    1711117451                   0
 4                   3                    0    1711644898                  14
 5                  19                    3    1711644899                  26
 6                  27                    1    1711644899                   6
 7                   8                    0    1711644899                  16
 8                   2                    1    1711644900                  17
 9                   9                    2    1711644899                  11
10                   8                   10    1711644906                  11
# ℹ 2,199 more rows
# ℹ 9 more variables: num_bikes_disabled <int>, is_installed <int>,
#   vehicle_types_available <list>, num_docks_disabled <int>, is_renting <int>,
#   is_returning <int>, station_id <chr>, num_scooters_available <int>,
#   num_scooters_unavailable <int>

As you can see, there is more nested data in here too: the vehicle_types_available column holds, for each station, a small data frame of counts by vehicle type.
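One way to flatten a list-column like this is tidyr's unnest(). Here is a minimal sketch on a small made-up tibble. The station ids are invented for illustration and the object names toy_status and toy_long are hypothetical; only the column layout mirrors the feed.

```r
library(tibble)
library(tidyr)

# Toy stand-in for station_status_df: each row carries a small data frame
# of counts per vehicle type, like the real vehicle_types_available column.
toy_status <- tibble(
  station_id = c("a", "b"),
  vehicle_types_available = list(
    data.frame(vehicle_type_id = c("1", "2"), count = c(9L, 0L)),
    data.frame(vehicle_type_id = c("1", "2"), count = c(20L, 2L))
  )
)

# unnest() expands the list-column: one row per station per vehicle type
toy_long <- toy_status |>
  unnest(vehicle_types_available)

toy_long
```

The result has one row for each station and vehicle type combination, which is usually the easier shape to count or plot.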

NYC CitiBikes

Let’s do the same for the feed of static information about stations.

station_info_url <- feeds_df |> pluck(1,"en", "feeds") |> 
  filter(name == "station_information") |> pull(url)

station_info_df <- station_info_url |> 
  jsonlite::fromJSON() |> 
  pluck("data", "stations") |> 
  as_tibble()

station_info_df

# A tibble: 2,209 × 8
   name     short_name region_id rental_uris$ios   lat   lon capacity station_id
   <chr>    <chr>      <chr>     <chr>           <dbl> <dbl>    <int> <chr>     
 1 Vesey S… 5216.06    71        https://bkn.lf…  40.7 -74.0       48 06439006-…
 2 6 Ave &… 5430.10    71        https://bkn.lf…  40.7 -74.0       39 19d17911-…
 3 W 67 St… 7116.04    71        https://bkn.lf…  40.8 -74.0       31 66dd4a52-…
 4 51 Ave … 6113.07    <NA>      https://bkn.lf…  40.7 -73.9       19 184824179…
 5 Bridge … 4968.03    71        https://bkn.lf…  40.7 -74.0       47 c71cca54-…
 6 Divisio… 5084.03    71        https://bkn.lf…  40.7 -74.0       33 66dce9b5-…
 7 1 Ave &… 7522.02    71        https://bkn.lf…  40.8 -73.9       25 0b613329-…
 8 Decatur… 8664.06    71        https://bkn.lf…  40.9 -73.9       20 6542d952-…
 9 62 St &… 6191.04    71        https://bkn.lf…  40.7 -73.9       20 a45a712c-…
10 6 St & … 3834.10    71        https://bkn.lf…  40.7 -74.0       21 66dde69c-…
# ℹ 2,199 more rows
# ℹ 1 more variable: rental_uris$android <chr>
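The rental_uris$ios header in that printout means rental_uris is itself a two-column data frame sitting inside the tibble. If that gets in the way, tidyr's unpack() can spread it into ordinary columns. This is a sketch on toy data: the URLs and the names toy_info and toy_flat are made up for illustration.

```r
library(tibble)
library(tidyr)

# Toy stand-in for station_info_df with a data-frame column
toy_info <- tibble(
  station_id = c("a", "b"),
  rental_uris = tibble(
    ios     = c("https://example.com/ios/a", "https://example.com/ios/b"),
    android = c("https://example.com/android/a", "https://example.com/android/b")
  )
)

# unpack() turns the inner columns into regular top-level columns,
# prefixing them with the outer column name
toy_flat <- toy_info |>
  unpack(rental_uris, names_sep = "_")

names(toy_flat)
```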

NYC CitiBikes

We can join the status and information tables by station_id.

stations_df <- station_status_df |> 
  left_join(station_info_df, by = "station_id") |> 
  relocate(name, capacity, num_bikes_available, lat, lon)

stations_df

# A tibble: 2,209 × 20
   name             capacity num_bikes_available   lat   lon num_docks_available
   <chr>               <int>               <int> <dbl> <dbl>               <int>
 1 Vesey St & Chur…       48                   0  40.7 -74.0                   0
 2 6 Ave & Walker …       39                   0  40.7 -74.0                   0
 3 W 67 St & Broad…       31                   0  40.8 -74.0                   0
 4 51 Ave & Juncti…       19                  14  40.7 -73.9                   3
 5 Bridge St & Wat…       47                  26  40.7 -74.0                  19
 6 Division Ave & …       33                   6  40.7 -74.0                  27
 7 1 Ave & E 110 St       25                  16  40.8 -73.9                   8
 8 Decatur Ave & B…       20                  17  40.9 -73.9                   2
 9 62 St & 43 Ave         20                  11  40.7 -73.9                   9
10 6 St & 7 Ave           21                  11  40.7 -74.0                   8
# ℹ 2,199 more rows
# ℹ 14 more variables: num_ebikes_available <int>, last_reported <int>,
#   num_bikes_disabled <int>, is_installed <int>,
#   vehicle_types_available <list>, num_docks_disabled <int>, is_renting <int>,
#   is_returning <int>, station_id <chr>, num_scooters_available <int>,
#   num_scooters_unavailable <int>, short_name <chr>, region_id <chr>,
#   rental_uris <df[,2]>
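Before relying on a join like this, it can be worth checking whether any station ids fail to match. A minimal sketch with two toy tables follows; the values and the object names are invented, not taken from the feeds.

```r
library(dplyr)
library(tibble)

# Toy status and information tables; station "c" has no information row
toy_status <- tibble(station_id = c("a", "b", "c"),
                     num_bikes_available = c(0L, 14L, 26L))
toy_info   <- tibble(station_id = c("a", "b"),
                     capacity = c(48L, 19L))

# Rows of the status table with no match in the information table
unmatched <- toy_status |>
  anti_join(toy_info, by = "station_id")

# left_join() keeps those rows, filling the joined columns with NA
joined <- toy_status |>
  left_join(toy_info, by = "station_id")

unmatched
joined
```

If unmatched has zero rows, every status record found its station information.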

NYC CitiBikes

stations_df |> 
  ggplot(aes(x = lon, 
             y = lat, 
             color = num_bikes_available/capacity)) + 
  geom_point(size = 0.5) +
  scale_color_viridis_c(option = "plasma", 
                        labels = scales::percent_format()) + 
  coord_equal() + 
  labs(color = "Availability",
       title = "New York CitiBike Stations",
       subtitle = "Current bike availability as a percentage of station capacity") + 
  theme_void()

NYC CitiBikes

Finally, the “last updated” tag is a good old Unix time number, expressed in seconds since the start of 1970 (UTC):

feeds_df$last_updated[1]

[1] 1711645038

as_datetime(feeds_df$last_updated)[1]

[1] "2024-03-28 16:57:18 UTC"
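A short sketch of what that conversion is doing, using the timestamp printed above. The variable names are just for illustration; as_datetime() and with_tz() come from lubridate, and the last line is a base R equivalent.

```r
library(lubridate)

ts <- 1711645038  # the last_updated value shown above

# Interpret the number as seconds since 1970-01-01 00:00:00 UTC
utc_time <- as_datetime(ts)

# The same instant, displayed in New York local time
ny_time <- with_tz(utc_time, tzone = "America/New_York")

# Base R equivalent of as_datetime() for a numeric timestamp
base_time <- as.POSIXct(ts, origin = "1970-01-01", tz = "UTC")

utc_time
ny_time
```

The same function should work on a whole column, for example on last_reported in the station status table.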

Hidden APIs

Remember, dynamic or client-side content has to come from somewhere.

Example: Forbes

(Courtesy of Hadley Wickham.)

Example: Iterating on the US Census

Iterating on the US Census

Mapped iteration is very general, and not just for local files

## Register for a free Census API key and set it with census_api_key()
library(tidycensus)

out <- get_acs(geography = "county",
               variables = "B19013_001",
               state = "NY",
               county = "New York",
               survey = "acs1",
               year = 2005)

out

# A tibble: 1 × 5
  GEOID NAME                      variable   estimate   moe
  <chr> <chr>                     <chr>         <dbl> <dbl>
1 36061 New York County, New York B19013_001    55973  1462

Iterating on the US Census

All counties in New York State for a specific year

out <- get_acs(geography = "county",
               variables = "B19013_001",
               state = "NY",
               survey = "acs1",
               year = 2005)

out

# A tibble: 38 × 5
   GEOID NAME                         variable   estimate   moe
   <chr> <chr>                        <chr>         <dbl> <dbl>
 1 36001 Albany County, New York      B19013_001    50054  2030
 2 36005 Bronx County, New York       B19013_001    29228   853
 3 36007 Broome County, New York      B19013_001    36394  2340
 4 36009 Cattaraugus County, New York B19013_001    37580  2282
 5 36011 Cayuga County, New York      B19013_001    42057  2406
 6 36013 Chautauqua County, New York  B19013_001    35495  2077
 7 36015 Chemung County, New York     B19013_001    37418  3143
 8 36019 Clinton County, New York     B19013_001    44757  3500
 9 36027 Dutchess County, New York    B19013_001    61889  2431
10 36029 Erie County, New York        B19013_001    41967  1231
# ℹ 28 more rows

Iterating on the US Census

What if we want the results for every available year? First, a handy function: set_names()

x <- c(1:10)

x

 [1]  1  2  3  4  5  6  7  8  9 10

x <- set_names(x, nm = letters[1:10])

x

 a  b  c  d  e  f  g  h  i  j 
 1  2  3  4  5  6  7  8  9 10

Iterating on the US Census

By default, set_names() will label a vector with that vector’s values:

c(1:10) |> 
  set_names()

 1  2  3  4  5  6  7  8  9 10 
 1  2  3  4  5  6  7  8  9 10

Iterating on the US Census

This works with map() just fine:

df <- 2005:2019 |> 
  map(\(x) get_acs(geography = "county",
                   variables = "B19013_001",
                   state = "NY",
                   survey = "acs1",
                   year = x)) |> 
  list_rbind(names_to = "year")

df

# A tibble: 580 × 6
    year GEOID NAME                         variable   estimate   moe
   <int> <chr> <chr>                        <chr>         <dbl> <dbl>
 1     1 36001 Albany County, New York      B19013_001    50054  2030
 2     1 36005 Bronx County, New York       B19013_001    29228   853
 3     1 36007 Broome County, New York      B19013_001    36394  2340
 4     1 36009 Cattaraugus County, New York B19013_001    37580  2282
 5     1 36011 Cayuga County, New York      B19013_001    42057  2406
 6     1 36013 Chautauqua County, New York  B19013_001    35495  2077
 7     1 36015 Chemung County, New York     B19013_001    37418  3143
 8     1 36019 Clinton County, New York     B19013_001    44757  3500
 9     1 36027 Dutchess County, New York    B19013_001    61889  2431
10     1 36029 Erie County, New York        B19013_001    41967  1231
# ℹ 570 more rows

Iterating on the US Census

Our year column only records each result’s position in the list (1, 2, 3, and so on). We’d like it to hold the actual year. So, we name the input vector with set_names():

df <- 2005:2019 |> 
  set_names() |> 
  map(\(x) get_acs(geography = "county",
                   variables = "B19013_001",
                   state = "NY",
                   survey = "acs1",
                   year = x)) |> 
  list_rbind(names_to = "year") |>
  mutate(year = as.integer(year))

Iterating on the US Census

df

# A tibble: 580 × 6
    year GEOID NAME                         variable   estimate   moe
   <int> <chr> <chr>                        <chr>         <dbl> <dbl>
 1  2005 36001 Albany County, New York      B19013_001    50054  2030
 2  2005 36005 Bronx County, New York       B19013_001    29228   853
 3  2005 36007 Broome County, New York      B19013_001    36394  2340
 4  2005 36009 Cattaraugus County, New York B19013_001    37580  2282
 5  2005 36011 Cayuga County, New York      B19013_001    42057  2406
 6  2005 36013 Chautauqua County, New York  B19013_001    35495  2077
 7  2005 36015 Chemung County, New York     B19013_001    37418  3143
 8  2005 36019 Clinton County, New York     B19013_001    44757  3500
 9  2005 36027 Dutchess County, New York    B19013_001    61889  2431
10  2005 36029 Erie County, New York        B19013_001    41967  1231
# ℹ 570 more rows

Now year is just the year. Because list_rbind() fills that column from the names of the list, it is created as a character vector, so we convert it back to an integer at the end.
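The difference is easy to see without touching the Census API at all. In this sketch fake_get() is a hypothetical stand-in for get_acs() that just returns a one-row tibble with made-up numbers; everything else is the same map() and list_rbind() pattern.

```r
library(purrr)
library(tibble)

# Hypothetical stand-in for get_acs(): one made-up row per year
fake_get <- function(yr) {
  tibble(GEOID = "36001", estimate = yr * 10)
}

# Unnamed input: names_to records the position in the list
by_position <- 2005:2007 |>
  map(fake_get) |>
  list_rbind(names_to = "year")

# Named input: names_to records the names, as character strings
by_name <- 2005:2007 |>
  set_names() |>
  map(fake_get) |>
  list_rbind(names_to = "year")

by_position$year
by_name$year
```

The names survive map() because map() preserves the names of its input, which is why set_names() goes before the call rather than after it.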

Iterating on the US Census

p_out <- 2005:2019 |>
  set_names() |>
  map(\(x) get_acs(geography = "county",
                   variables = "B19013_001",
                   state = "NY",
                   survey = "acs1",
                   year = x)) |>
  list_rbind(names_to = "year") |>
  mutate(year = as.integer(year)) |>
  ggplot(mapping = aes(x = year, y = estimate, group = year)) +
  geom_boxplot(fill = "lightblue", alpha = 0.5, outlier.alpha = 0) +
  geom_jitter(position = position_jitter(width = 0.1), shape = 1) +
  scale_y_continuous(labels = scales::label_dollar()) +
  labs(x = "Year", y = "Dollars",
       title = "Median Household Income by County in New York State, 2005-2019",
       subtitle = "ACS 1-year estimates", caption = "Data: U.S. Census Bureau.")