Scraping and APIs

Soc 690S: Week 09a

Kieran Healy

Duke University

March 2025

Load the packages, as always

library(here)      # manage file paths
library(socviz)    # data and some useful functions
library(tidyverse) # your friend and mine
library(rvest)     # For web-scraping

Attaching package: 'rvest'
The following object is masked from 'package:readr':

    guess_encoding

Scraping

Scraping is fundamentally unclean

It is awkward

It is very prone to error

It is brittle

It’s often against the terms of service of websites

Server-side rendering

If the webpages you are interested in looking at are statically or dynamically assembled on the server side, they arrive in your browser “fully formed”. The tables or other data structures they may contain are actually populated with numbers.

The task is to get them out by identifying the HTML elements (e.g. <table>, <tr>, <td>, <li> etc) or the CSS styling selectors (e.g. .pagetable, .datalist, #website-content-tabledata etc) and then extracting the values they enclose.

HTML has a fixed set of element names. CSS selectors have a fixed syntax (the . prefix for classes, the # prefix for IDs, etc) and internal structure, but the class and ID names themselves are chosen by the site designer.
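As a minimal sketch of the difference between selecting by element and selecting by CSS class or ID, the snippet below builds a tiny page inline with rvest's minimal_html(), so nothing is downloaded. The page content is hypothetical; the class and ID names are borrowed from the examples above for illustration only.

```r
library(rvest)

# A tiny hypothetical page: one table carrying a class and an id, plus a list
page <- minimal_html('
  <table class="pagetable" id="website-content-tabledata">
    <tr><th>name</th><th>n</th></tr>
    <tr><td>a</td><td>1</td></tr>
    <tr><td>b</td><td>2</td></tr>
  </table>
  <ul class="datalist"><li>first</li><li>second</li></ul>
')

# Three routes to the same table: element name, class (.), and id (#)
by_element <- page |> html_element("table") |> html_table()
by_class   <- page |> html_element(".pagetable") |> html_table()
by_id      <- page |> html_element("#website-content-tabledata") |> html_table()

# Selectors can be combined: every <li> inside the .datalist element
list_items <- page |> html_elements(".datalist li") |> html_text2()
```

On a real page the element name alone is rarely specific enough, which is why class and ID selectors tend to do most of the work.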

Client-side rendering

If the webpages you are interested in looking at are assembled on the client side, then what comes down is roughly in two pieces: an empty template for the layout of the page, and a set of instructions for pulling in data separately to fill the template.

The process of requesting specific data takes place through some private or public application programming interface or API.

An API takes well-formed requests for data and returns whatever is requested in a structured format (usually JSON or XML) that can be used by the client. So we should just talk to the API directly if we can.
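To make "a structured format that can be used by the client" concrete, here is a small sketch that parses a JSON document with jsonlite. The JSON text is a hypothetical response written inline, so no request is made; the field names are assumptions chosen to resemble a bike-share feed.

```r
library(jsonlite)

# Hypothetical API response, written inline as a string
response_text <- '{
  "last_updated": 1742932833,
  "data": {
    "stations": [
      {"station_id": "a1", "num_bikes_available": 17},
      {"station_id": "b2", "num_bikes_available": 12}
    ]
  }
}'

# fromJSON() turns JSON objects into lists, and arrays of
# similar objects into data frames
parsed <- fromJSON(response_text)
stations <- parsed$data$stations
```

The same fromJSON() call accepts a URL in place of the string, which is how a live API would be read.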

Scraping Example

Getting a single table

A single table where we know the class.

Consider this table

Type                          Mean PBE  Median PBE  Max PBE  Min PBE  N Schools  N Students
Private Waldorf                  47.49       44.19    84.21       20         16         513
Public Montessori                17.08       12.24    54.55     5.97         11         706
Charter Montessori               14.28       10.26    31.67     4.35          5         227
Charter                          10.76        3.03    70.37        0        314      19,863
Private Christian                 6.74        3.70    92.86        0        333       8,763
Private Non-Specific              5.89           0    86.96        0        596      16,795
Private Montessori                4.64           0    35.71        0         98       2,101
Private Jewish or Islamic         2.59           0    14.29        0          8         237
Public                            2.33        0.81       75        0       5314     472,802
Private Catholic                  1.80           0    27.78        0        333       8,855
Private Christian Montessori      1.25           0        5        0          4          78

It’s on my website. It’s the only table on the page, which means we can select it just by saying we want the <table> element on the page. We use html_element() to isolate the table and html_table() to parse it to a tibble:

vactab <- read_html("https://kieranhealy.org/blog/archives/2015/02/03/another-look-at-the-california-vaccination-data/")

vactab |>
  html_element("table") |>
  html_table() |>
  janitor::clean_names()
# A tibble: 11 × 7
   type                 mean_pbe median_pbe max_pbe min_pbe n_schools n_students
   <chr>                   <dbl>      <dbl>   <dbl>   <dbl>     <int> <chr>     
 1 Private Waldorf         47.5       44.2     84.2   20           16 513       
 2 Public Montessori       17.1       12.2     54.6    5.97        11 706       
 3 Charter Montessori      14.3       10.3     31.7    4.35         5 227       
 4 Charter                 10.8        3.03    70.4    0          314 19,863    
 5 Private Christian        6.74       3.7     92.9    0          333 8,763     
 6 Private Non-Specific     5.89       0       87.0    0          596 16,795    
 7 Private Montessori       4.64       0       35.7    0           98 2,101     
 8 Private Jewish or I…     2.59       0       14.3    0            8 237       
 9 Public                   2.33       0.81    75      0         5314 472,802   
10 Private Catholic         1.8        0       27.8    0          333 8,855     
11 Private Christian M…     1.25       0        5      0            4 78        
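Note that n_students came back as a character column, because the source table writes its numbers with thousands separators. One way to repair this, sketched here on three rows copied from the table above, is readr's parse_number(); the object names are hypothetical.

```r
library(tibble)
library(dplyr)
library(readr)

# Three rows copied from the scraped table; n_students arrives as text
vac <- tibble(
  type = c("Private Waldorf", "Charter", "Public"),
  n_students = c("513", "19,863", "472,802")
)

# parse_number() drops the grouping commas and returns a numeric vector
vac_clean <- vac |>
  mutate(n_students = parse_number(n_students))
```

In the pipeline above, the same mutate() step could follow janitor::clean_names().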

When there’s more than one

Next, consider the tables on this page. Again, these are static tables, but there are several of them.

With html_element() we’ll just extract the first one:

philtabs <- read_html("https://kieranhealy.org/blog/archives/2013/06/19/lewis-and-the-women/")

philtabs |>
  html_element("table") |>
  html_table() |>
  janitor::clean_names()
# A tibble: 24 × 4
    rank cites item                                 typically_cited_in        
   <int> <int> <chr>                                <chr>                     
 1     1   180 Kripke S 1980 Naming Necessity       Nous, Philosophical Review
 2     2   131 Lewis D 1986 Plurality Worlds        Nous, Philosophical Review
 3     3    97 Quine W 1960 Word Object             Philosophical Review, Nous
 4     4    83 Williamson T 2000 Knowledge Limits   Nous, Philosophical Review
 5     5    82 Lewis D 1973 Counterfactuals         Mind, Nous                
 6     6    78 Evans G 1982 Varieties Reference     Philosophical Review, Nous
 7     7    77 Chalmers D 1996 Conscious Mind       Philosophical Review, Nous
 8     7    77 Davidson D 1980 Essays Actions Event Philosophical Review, Mind
 9     9    73 Lewis D 1986 Philos Papers           Mind, Nous                
10    10    64 Parfit D 1984 Reasons Persons        Philosophical Review, Nous
# ℹ 14 more rows

With html_elements() we’ll extract all of them:

philtabs |>
  html_elements("table")
{xml_nodeset (5)}
[1] <table>\n<thead><tr>\n<th align="left"><em>Rank</em></th>\n<th align="lef ...
[2] <table>\n<thead><tr>\n<th align="left"><em>Rank</em></th>\n<th align="lef ...
[3] <table>\n<thead><tr class="header">\n<th align="left"><em>Rank</em></th>\ ...
[4] <table>\n<thead><tr>\n<th align="left"><em>Rank</em></th>\n<th align="lef ...
[5] <table>\n<thead><tr>\n<th align="right"><em>Rank</em></th>\n<th align="ri ...

We can get the nth one with pluck():

philtabs |>
  html_elements("table") |>
  pluck(2) |>
  html_table() |>
  janitor::clean_names()
# A tibble: 33 × 4
    rank cites item                              typically_cited_in         
   <int> <int> <chr>                             <chr>                      
 1     2   131 Lewis D 1986 Plurality Worlds     Nous, Philosophical Review 
 2     5    82 Lewis D 1973 Counterfactuals      Mind, Nous                 
 3     9    73 Lewis D 1986 Philos Papers        Mind, Nous                 
 4    16    50 Lewis D 1983 Philos Papers        Nous, Mind                 
 5    18    48 Lewis D 1983 Australas J Philos   Nous, Mind                 
 6    20    44 Lewis D 1996 Australas J Philos   Nous, Philosophical Review 
 7    41    36 Lewis D 1979 J Philos Logic       Nous, Mind                 
 8    47    34 Lewis D 1991 Parts Classes        Nous, Mind                 
 9    67    29 Lewis D 1969 Convention Philos St Mind, Journal of Philosophy
10    67    29 Lewis D 1986 Philos Papers        Philosophical Review, Nous 
# ℹ 23 more rows

Or we can use Selector Gadget (or an equivalent dev tool) to find a CSS or XPath selector for the specific table, if there is one.
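A sketch of what a more specific selector looks like in practice, using a hypothetical two-table page built inline. The id "lewis" is invented for illustration; a real page may offer no such hook, in which case a positional XPath expression is one fallback.

```r
library(rvest)

# Hypothetical page with two tables; only the second has an id
page <- minimal_html('
  <table><tr><th>rank</th></tr><tr><td>1</td></tr></table>
  <table id="lewis"><tr><th>rank</th></tr>
    <tr><td>2</td></tr><tr><td>5</td></tr></table>
')

# CSS: select the table by its id
by_css <- page |> html_element("#lewis") |> html_table()

# XPath: select the second <table> in document order
by_xpath <- page |> html_element(xpath = "(//table)[2]") |> html_table()
```

A selector tied to an id or class usually survives page edits better than one tied to position, though neither is guaranteed to.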

Wikipedia tables

For a long time, Wikipedia tables were fairly straightforward to select because they were static. These days many of them have a JavaScript element that makes them sortable by column, but that also makes them harder to grab with a CSS selector. Getting all the table elements on a page and cleaning up later is often the path of least resistance.

irl_demog <- read_html("https://en.wikipedia.org/wiki/Demographics_of_the_Republic_of_Ireland")

irl_demog |>
  # Get all the tables classed as `.wikitable` on the page
  html_elements(".wikitable") |>
  pluck(7) |>
  html_table() |>
  janitor::clean_names()
# A tibble: 125 × 10
       x population_on_1_april live_births deaths natural_change
   <int> <chr>                 <chr>       <chr>  <chr>         
 1  1900 3,231,000             70,435      ""     ""            
 2  1901 3,234,000             70,194      ""     ""            
 3  1902 3,205,000             71,156      ""     ""            
 4  1903 3,191,000             70,541      ""     ""            
 5  1904 3,169,000             72,261      ""     ""            
 6  1905 3,160,000             71,427      ""     ""            
 7  1906 3,164,000             72,147      ""     ""            
 8  1907 3,145,000             70,773      ""     ""            
 9  1908 3,147,000             71,439      ""     ""            
10  1909 3,135,000             72,119      ""     ""            
# ℹ 115 more rows
# ℹ 5 more variables: crude_birth_rate_per_1000 <dbl>,
#   crude_death_rate_per_1000 <dbl>, natural_change_per_1000 <dbl>,
#   crude_migration_per_1000 <chr>, total_fertility_rate_fn_1_10 <chr>
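Picking a table by position, as with pluck(7), breaks as soon as the page gains or loses a table. A hedged alternative is to parse every table and then keep the ones whose column names match what you expect. The sketch below uses a hypothetical inline page rather than the live Wikipedia article; only the first data row is taken from the output above, and the second table is invented.

```r
library(rvest)
library(purrr)

# Hypothetical stand-in for a page with several .wikitable tables
page <- minimal_html('
  <table class="wikitable"><tr><th>Year</th><th>Population</th></tr>
    <tr><td>1900</td><td>3,231,000</td></tr></table>
  <table class="wikitable"><tr><th>Region</th><th>Code</th></tr>
    <tr><td>A</td><td>x1</td></tr></table>
')

# html_table() on a set of nodes returns a list of tibbles, one per table
all_tabs <- page |>
  html_elements("table.wikitable") |>
  html_table()

# Keep only the tables that have a column named "Population"
pop_tabs <- all_tabs |>
  keep(\(tab) "Population" %in% names(tab))
```

Matching on column names is still fragile if the page renames its headers, but it fails loudly (an empty list) rather than silently returning the wrong table.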

Again, scraping is unclean

Scraping can be useful to quickly grab a table or two that you need from a website. For harvesting large amounts of data it is no longer a very good idea, on the whole, and may get you banned from websites if you abuse it.

APIs

The idea of an API

Zapier provide a nice overview of APIs in a web guide that you can work your way through or skim.

We use APIs when requesting data directly. An API has a restricted set of protocols and methods by which a client (which can be you, but also can be your browser or an application) can ask it for data.

It has a defined set of responses (to say “OK” or “No” or “That didn’t work”) and formats in which it provides the client with an answer. Usually this will be a blob of JSON or XML data.

API endpoints

In much the same way that a URL specifies a request for a specific webpage, an API endpoint is a URL-like request for a specific blob of data. The trick is specifying the correct URL, which in essence specifies a request to the API.
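To see the pieces that make up an endpoint, it can help to take one apart. The sketch below uses xml2::url_parse() (xml2 is installed alongside rvest) on a made-up URL; the host, path, and query parameters are all hypothetical and do not belong to any real API.

```r
library(xml2)

# A hypothetical endpoint: host, resource path, and query parameters
endpoint <- "https://api.example.com/v1/stations?city=nyc&limit=10"

# url_parse() returns a one-row data frame of URL components
parts <- url_parse(endpoint)

parts$scheme   # protocol
parts$server   # which API we are talking to
parts$path     # which resource we are asking for
parts$query    # parameters that narrow the request

# Split the query string into its individual key=value parameters
params <- strsplit(parts$query, "&")[[1]]
```

Reading an API's documentation largely means learning which paths and which parameters it accepts.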

Example: the NY Citibike API

NYC CitiBikes

Many cities have Bike Share programs. Many of those provide data according to the GBFS specification. A spec like this is a set of rules that says “If you adhere to this spec, you will provide data in the following consistent way”, where this includes rules about the data format and the endpoints or URLs where that data can be found.

The spec has a rule saying “Provide a feed specifying the data available”.

gbfsurl <- "https://gbfs.citibikenyc.com/gbfs/2.3/gbfs.json"
feeds <- jsonlite::fromJSON(gbfsurl)
str(feeds)
List of 4
 $ data        :List of 3
  ..$ en:List of 1
  .. ..$ feeds:'data.frame':    12 obs. of  2 variables:
  .. .. ..$ url : chr [1:12] "https://gbfs.lyft.com/gbfs/2.3/bkn/gbfs.json" "https://gbfs.lyft.com/gbfs/2.3/bkn/en/system_information.json" "https://gbfs.lyft.com/gbfs/2.3/bkn/en/station_information.json" "https://gbfs.lyft.com/gbfs/2.3/bkn/en/station_status.json" ...
  .. .. ..$ name: chr [1:12] "gbfs" "system_information" "station_information" "station_status" ...
  ..$ fr:List of 1
  .. ..$ feeds:'data.frame':    12 obs. of  2 variables:
  .. .. ..$ url : chr [1:12] "https://gbfs.lyft.com/gbfs/2.3/bkn/gbfs.json" "https://gbfs.lyft.com/gbfs/2.3/bkn/fr/system_information.json" "https://gbfs.lyft.com/gbfs/2.3/bkn/fr/station_information.json" "https://gbfs.lyft.com/gbfs/2.3/bkn/fr/station_status.json" ...
  .. .. ..$ name: chr [1:12] "gbfs" "system_information" "station_information" "station_status" ...
  ..$ es:List of 1
  .. ..$ feeds:'data.frame':    12 obs. of  2 variables:
  .. .. ..$ url : chr [1:12] "https://gbfs.lyft.com/gbfs/2.3/bkn/gbfs.json" "https://gbfs.lyft.com/gbfs/2.3/bkn/es/system_information.json" "https://gbfs.lyft.com/gbfs/2.3/bkn/es/station_information.json" "https://gbfs.lyft.com/gbfs/2.3/bkn/es/station_status.json" ...
  .. .. ..$ name: chr [1:12] "gbfs" "system_information" "station_information" "station_status" ...
 $ last_updated: int 1742932833
 $ ttl         : int 60
 $ version     : chr "2.3"

This comes to us as a nested list.

JSON data is typically nested. It can look really complex at first glance. This is one reason to read the spec. We can put the feed into a tibble if we like:

feeds_df <- as_tibble(feeds)
feeds_df
# A tibble: 3 × 4
  data             last_updated   ttl version
  <named list>            <int> <int> <chr>  
1 <named list [1]>   1742932833    60 2.3    
2 <named list [1]>   1742932833    60 2.3    
3 <named list [1]>   1742932833    60 2.3    

RStudio’s object viewer is a good way to explore the hierarchical structure of unfamiliar lists.
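Outside RStudio, a few console tools do a similar job. This is a sketch on a cut-down, hypothetical stand-in for the feed list (the URLs are placeholders), showing names(), str() with a depth limit, and pluck().

```r
library(purrr)

# A cut-down, hypothetical stand-in for the parsed feed list
feeds_mock <- list(
  data = list(
    en = list(feeds = data.frame(
      name = c("gbfs", "station_information"),
      url  = c("https://example.com/gbfs.json",
               "https://example.com/en/station_information.json")
    )),
    fr = list(feeds = data.frame(
      name = c("gbfs", "station_information"),
      url  = c("https://example.com/gbfs.json",
               "https://example.com/fr/station_information.json")
    ))
  ),
  ttl = 60L,
  version = "2.3"
)

names(feeds_mock)               # top-level elements
names(feeds_mock$data)          # one entry per language
str(feeds_mock, max.level = 2)  # print only the top two levels

# Once the path is known, pluck() walks straight down it
en_feeds <- pluck(feeds_mock, "data", "en", "feeds")
```

Limiting str() with max.level keeps the printout readable when a list contains thousands of nested elements.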

If we explore or look at the spec we see that each row provides the same feed information in English, French, or Spanish, in that order. We can slice out the English feed and look at what’s in it:

feeds_df |>
  slice(1) |>
  unnest(data) |> # It's two levels down
  unnest(data)
# A tibble: 12 × 5
   url                                          name  last_updated   ttl version
   <chr>                                        <chr>        <int> <int> <chr>  
 1 https://gbfs.lyft.com/gbfs/2.3/bkn/gbfs.json gbfs    1742932833    60 2.3    
 2 https://gbfs.lyft.com/gbfs/2.3/bkn/en/syste… syst…   1742932833    60 2.3    
 3 https://gbfs.lyft.com/gbfs/2.3/bkn/en/stati… stat…   1742932833    60 2.3    
 4 https://gbfs.lyft.com/gbfs/2.3/bkn/en/stati… stat…   1742932833    60 2.3    
 5 https://gbfs.lyft.com/gbfs/2.3/bkn/en/free_… free…   1742932833    60 2.3    
 6 https://gbfs.lyft.com/gbfs/2.3/bkn/en/syste… syst…   1742932833    60 2.3    
 7 https://gbfs.lyft.com/gbfs/2.3/bkn/en/syste… syst…   1742932833    60 2.3    
 8 https://gbfs.lyft.com/gbfs/2.3/bkn/en/syste… syst…   1742932833    60 2.3    
 9 https://gbfs.lyft.com/gbfs/2.3/bkn/en/syste… syst…   1742932833    60 2.3    
10 https://gbfs.lyft.com/gbfs/2.3/bkn/en/syste… syst…   1742932833    60 2.3    
11 https://gbfs.lyft.com/gbfs/2.3/bkn/en/gbfs_… gbfs…   1742932833    60 2.3    
12 https://gbfs.lyft.com/gbfs/2.3/bkn/en/vehic… vehi…   1742932833    60 2.3    

More URLs! Let’s extract the station data feed.

nyc_stations_url <- feeds_df |>
  slice(1) |> unnest(data) |> unnest(data) |>
  filter(name == "station_information") |> pull(url)

nyc_stations_url
[1] "https://gbfs.lyft.com/gbfs/2.3/bkn/en/station_information.json"

That was tedious. If we know precisely what we’re after we can do it faster by plumbing down through the list:

# Base R style
feeds_df[[1]]$en$feeds$url[3]
[1] "https://gbfs.lyft.com/gbfs/2.3/bkn/en/station_information.json"

Or with pluck():

# Pluck by element number or name
feeds_df |> pluck(1,"en", "feeds", "url", 3)
[1] "https://gbfs.lyft.com/gbfs/2.3/bkn/en/station_information.json"

Or with some combination of methods:

feeds_df |>
  pluck(1,"en", "feeds") |>
  as_tibble()
# A tibble: 12 × 2
   url                                                             name         
   <chr>                                                           <chr>        
 1 https://gbfs.lyft.com/gbfs/2.3/bkn/gbfs.json                    gbfs         
 2 https://gbfs.lyft.com/gbfs/2.3/bkn/en/system_information.json   system_infor…
 3 https://gbfs.lyft.com/gbfs/2.3/bkn/en/station_information.json  station_info…
 4 https://gbfs.lyft.com/gbfs/2.3/bkn/en/station_status.json       station_stat…
 5 https://gbfs.lyft.com/gbfs/2.3/bkn/en/free_bike_status.json     free_bike_st…
 6 https://gbfs.lyft.com/gbfs/2.3/bkn/en/system_hours.json         system_hours 
 7 https://gbfs.lyft.com/gbfs/2.3/bkn/en/system_calendar.json      system_calen…
 8 https://gbfs.lyft.com/gbfs/2.3/bkn/en/system_regions.json       system_regio…
 9 https://gbfs.lyft.com/gbfs/2.3/bkn/en/system_pricing_plans.json system_prici…
10 https://gbfs.lyft.com/gbfs/2.3/bkn/en/system_alerts.json        system_alerts
11 https://gbfs.lyft.com/gbfs/2.3/bkn/en/gbfs_versions.json        gbfs_versions
12 https://gbfs.lyft.com/gbfs/2.3/bkn/en/vehicle_types.json        vehicle_types

Then work as you would with a tibble.

station_status_url <- feeds_df |> pluck(1,"en", "feeds") |>
  filter(name == "station_status") |> pull(url)

station_status_df <- jsonlite::fromJSON(station_status_url)

str(station_status_df) # Still a list
List of 4
 $ data        :List of 1
  ..$ stations:'data.frame':    2234 obs. of  13 variables:
  .. ..$ station_id              : chr [1:2234] "579e583e-60ab-4f2a-83d0-fabebddfcd0f" "1851254468343716806" "7a1ad7c0-4958-4354-91a1-11bf13348abd" "53ef19c6-eaa7-47f7-807e-f51b574b1be0" ...
  .. ..$ num_bikes_available     : int [1:2234] 17 17 18 12 1 29 19 9 18 8 ...
  .. ..$ num_scooters_unavailable: int [1:2234] 0 0 0 0 0 0 0 0 0 0 ...
  .. ..$ vehicle_types_available :List of 2234
  .. .. ..$ :'data.frame':  2 obs. of  2 variables:
  .. .. .. ..$ vehicle_type_id: chr [1:2] "1" "2"
  .. .. .. ..$ count          : int [1:2] 6 11
  .. .. ..$ :'data.frame':  2 obs. of  2 variables:
  .. .. .. ..$ vehicle_type_id: chr [1:2] "1" "2"
  .. .. .. ..$ count          : int [1:2] 8 9
  .. .. .. [list output truncated]
  .. ..$ is_installed            : int [1:2234] 1 1 1 1 1 1 1 1 1 1 ...
  .. ..$ num_bikes_disabled      : int [1:2234] 2 2 1 7 3 2 2 0 2 1 ...
  .. ..$ num_docks_disabled      : int [1:2234] 0 0 0 0 0 0 0 0 0 0 ...
  .. ..$ is_returning            : int [1:2234] 1 1 1 1 1 1 1 1 1 1 ...
  .. ..$ num_scooters_available  : int [1:2234] 0 0 0 0 0 0 0 0 0 0 ...
  .. ..$ is_renting              : int [1:2234] 1 1 1 1 1 1 1 1 1 1 ...
  .. ..$ num_ebikes_available    : int [1:2234] 11 9 4 1 0 0 3 7 0 1 ...
  .. ..$ num_docks_available     : int [1:2234] 0 1 0 0 31 13 9 12 3 10 ...
  .. ..$ last_reported           : int [1:2234] 1742932698 1742932698 1742932705 1742932706 1742932706 1742932702 1742932701 1742932700 1742932701 1742932701 ...
 $ last_updated: int 1742932835
 $ ttl         : int 60
 $ version     : chr "2.3"

Like the overall feeds data, the station status data is a nested list. It records when the feed was last updated and the software version, as well as the actual data, which is in station_status_df$data$stations. We can get it out.

NYC CitiBikes

station_status_df <- station_status_df$data$stations |>
  as_tibble()

station_status_df
# A tibble: 2,234 × 13
   station_id  num_bikes_available num_scooters_unavail…¹ vehicle_types_availa…²
   <chr>                     <int>                  <int> <list>                
 1 579e583e-6…                  17                      0 <df [2 × 2]>          
 2 1851254468…                  17                      0 <df [2 × 2]>          
 3 7a1ad7c0-4…                  18                      0 <df [2 × 2]>          
 4 53ef19c6-e…                  12                      0 <df [2 × 2]>          
 5 66dd3d7f-0…                   1                      0 <df [2 × 2]>          
 6 c7604869-5…                  29                      0 <df [2 × 2]>          
 7 bacfe3c2-7…                  19                      0 <df [2 × 2]>          
 8 1869753654…                   9                      0 <df [2 × 2]>          
 9 cd2d9dab-7…                  18                      0 <df [2 × 2]>          
10 bf880f45-4…                   8                      0 <df [2 × 2]>          
# ℹ 2,224 more rows
# ℹ abbreviated names: ¹​num_scooters_unavailable, ²​vehicle_types_available
# ℹ 9 more variables: is_installed <int>, num_bikes_disabled <int>,
#   num_docks_disabled <int>, is_returning <int>, num_scooters_available <int>,
#   is_renting <int>, num_ebikes_available <int>, num_docks_available <int>,
#   last_reported <int>

As you can see, there is more nested data in here too: the vehicle_types_available column is a list-column that holds, for each station, a small data frame of counts by vehicle type.
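A list-column like this can be flattened with tidyr. The following is a minimal sketch on toy data, not the live feed: toy_status, toy_long, and toy_wide are made-up names, and the two stations are stand-ins that only mimic the shape of the str() output above.

```r
library(tibble)
library(tidyr)

# Toy stand-in for station_status_df (hypothetical data): one row per
# station, with a list-column of counts per vehicle type.
toy_status <- tibble(
  station_id = c("a", "b"),
  vehicle_types_available = list(
    data.frame(vehicle_type_id = c("1", "2"), count = c(20L, 44L)),
    data.frame(vehicle_type_id = c("1", "2"), count = c(0L, 0L))
  )
)

# unnest() expands each nested data frame into rows,
# repeating station_id once per vehicle type.
toy_long <- toy_status |>
  unnest(vehicle_types_available)

# Optionally pivot back to one row per station,
# with one count column per vehicle type.
toy_wide <- toy_long |>
  pivot_wider(names_from = vehicle_type_id,
              values_from = count,
              names_prefix = "type_")

toy_wide
```

The long form is usually the easier one to summarize or plot; the wide form is handy if you want the counts to sit alongside the other station-level columns.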

NYC CitiBikes

Let’s do the same for the feed of static information about stations.

station_info_url <- feeds_df |>
  pluck(1, "en", "feeds") |>
  filter(name == "station_information") |>
  pull(url)

station_info_df <- station_info_url |>
  jsonlite::fromJSON() |>
  pluck("data", "stations") |>
  as_tibble()

station_info_df
# A tibble: 2,234 × 8
   station_id    short_name   lon capacity   lat rental_uris$ios region_id name 
   <chr>         <chr>      <dbl>    <int> <dbl> <chr>           <chr>     <chr>
 1 66de2b2f-0ac… 4110.10    -74.0       53  40.7 https://bkn.lf… 71        3 St…
 2 66dc0dab-0ac… 7023.04    -74.0      117  40.8 https://bkn.lf… 71        W 59…
 3 181490742305… 3070.04    -74.0       26  40.6 https://bkn.lf… 71        E 16…
 4 00967b8f-1a4… 6225.03    -73.9       20  40.7 https://bkn.lf… 71        41 A…
 5 187541514476… 8819.05    -73.9       25  40.9 https://bkn.lf… 71        Napl…
 6 26217b73-7bb… 8104.10    -73.9       27  40.8 https://bkn.lf… 71        Amst…
 7 633fbc4c-761… 4095.10    -74.0       24  40.7 https://bkn.lf… 71        Van …
 8 6476c5fd-611… 7927.08    -73.9       19  40.8 https://bkn.lf… 71        E 16…
 9 182896662655… 6718.02    -73.9       20  40.8 https://bkn.lf… 71        77 S…
10 a4d041fd-c5f… 7809.13    -73.9       22  40.8 https://bkn.lf… 71        E 13…
# ℹ 2,224 more rows
# ℹ 1 more variable: rental_uris$android <chr>

NYC CitiBikes

We can join the status and information tables by station_id.

stations_df <- station_status_df |>
  left_join(station_info_df, by = "station_id") |>
  relocate(name, capacity, num_bikes_available, lat, lon)

stations_df
# A tibble: 2,234 × 20
   name                      capacity num_bikes_available   lat   lon station_id
   <chr>                        <int>               <int> <dbl> <dbl> <chr>     
 1 Broadway & Ellwood St           19                  17  40.9 -73.9 579e583e-…
 2 60 Ave & Otis Ave               21                  17  40.7 -73.9 185125446…
 3 56 Dr & 59 St                   19                  18  40.7 -73.9 7a1ad7c0-…
 4 Dyckman St & Staff St           19                  12  40.9 -73.9 53ef19c6-…
 5 E 68 St & 3 Ave                 36                   1  40.8 -74.0 66dd3d7f-…
 6 Kent Ave & Division Ave         44                  29  40.7 -74.0 c7604869-…
 7 Madison Ave & E 120 St          30                  19  40.8 -73.9 bacfe3c2-…
 8 Marble Hill Ave & W 225 …       21                   9  40.9 -73.9 186975365…
 9 Rutland Rd & E 45 St            23                  18  40.7 -73.9 cd2d9dab-…
10 Fulton Ave & St Paul's Pl       19                   8  40.8 -73.9 bf880f45-…
# ℹ 2,224 more rows
# ℹ 14 more variables: num_scooters_unavailable <int>,
#   vehicle_types_available <list>, is_installed <int>,
#   num_bikes_disabled <int>, num_docks_disabled <int>, is_returning <int>,
#   num_scooters_available <int>, is_renting <int>, num_ebikes_available <int>,
#   num_docks_available <int>, last_reported <int>, short_name <chr>,
#   rental_uris <df[,2]>, region_id <chr>
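Here the join returns the same 2,234 rows we started with, which is reassuring. When joining two feeds fetched at slightly different moments, it can be worth checking the key first. This is a sketch on made-up data; toy_status, toy_info, and the station names are hypothetical.

```r
library(dplyr)
library(tibble)

# Hypothetical status and info tables whose keys do not fully overlap
toy_status <- tibble(station_id = c("a", "b", "c"),
                     num_bikes_available = c(17L, 1L, 29L))
toy_info <- tibble(station_id = c("a", "b", "d"),
                   name = c("Station A", "Station B", "Station D"),
                   capacity = c(19L, 36L, 44L))

# Status rows with no matching station information
unmatched_status <- toy_status |>
  anti_join(toy_info, by = "station_id")

# Is station_id unique in the info table?
# Duplicated keys would multiply rows in the join.
dup_ids <- toy_info |>
  count(station_id) |>
  filter(n > 1)

# left_join() keeps every status row; unmatched ones get NA for name, capacity
toy_joined <- toy_status |>
  left_join(toy_info, by = "station_id")

unmatched_status
```

In this toy case station "c" has status but no info, so its name and capacity come through as NA after the join.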

NYC CitiBikes

stations_df |>
  ggplot(aes(x = lon,
             y = lat,
             color = num_bikes_available/capacity)) +
  geom_point(size = 0.5) +
  scale_color_viridis_c(option = "plasma",
                        labels = scales::percent_format()) +
  coord_equal() +
  labs(color = "Availability",
       title = "New York CitiBike Stations",
       subtitle = "Current bike availability as a percentage of station capacity") +
  theme_void()
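The color aesthetic divides by capacity inside aes(). I have not checked whether the live feed ever reports a station with a capacity of zero, but if it does, that division yields NaN or Inf. A hedged sketch of computing the ratio ahead of time, on made-up data (toy_stations and pct_available are hypothetical names):

```r
library(dplyr)
library(tibble)

# Hypothetical stations, including one that reports zero capacity
toy_stations <- tibble(
  name = c("Station A", "Station B", "Station C"),
  capacity = c(19L, 0L, 36L),
  num_bikes_available = c(17L, 0L, 1L)
)

# Compute availability explicitly, returning NA where capacity is zero
toy_stations <- toy_stations |>
  mutate(pct_available = if_else(capacity > 0,
                                 num_bikes_available / capacity,
                                 NA_real_))

toy_stations
```

The plot could then map color to pct_available, and the missing values would be drawn in the scale's na.value color rather than silently distorting the scale.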

NYC CitiBikes

Finally, the “last updated” tag is a good old Unix time number: the count of seconds since the start of 1970, UTC.

feeds_df$last_updated[1]
[1] 1742932833
as_datetime(feeds_df$last_updated)[1]
[1] "2025-03-25 20:00:33 UTC"
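Because the stations are in New York, it may be more natural to read these times in local time. A small sketch using the values printed above; the variable names are made up for illustration.

```r
library(lubridate)

# Feed-level timestamp shown above
last_updated <- 1742932833

# Default is UTC; pass tz to display the same instant in local time
utc_time <- as_datetime(last_updated)
ny_time  <- as_datetime(last_updated, tz = "America/New_York")

ny_time
# [1] "2025-03-25 16:00:33 EDT"

# Unix times are plain seconds, so lags are simple subtraction.
# Here: the status feed's last_updated minus the first station's
# last_reported, both taken from the str() output earlier.
status_updated <- 1742932835
first_reported <- 1742932698
lag_seconds <- status_updated - first_reported
lag_seconds
# [1] 137
```

The same call works on a whole column, e.g. inside mutate() on last_reported.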

Hidden APIs

Hidden APIs

Remember, dynamic or client-side content has to come from somewhere.

Example: Forbes