Databases

Soc 690S: Week 11

Kieran Healy

Duke University

April 2025

Load the packages, as always

library(here)      # manage file paths
library(socviz)    # data and some useful functions
library(tidyverse) # your friend and mine
library(gapminder) # inescapable

library(DBI) # DBMS interface layer
library(duckdb) # Local database server

‘Big’ Data is just data, except there’s more of everything

Pig Data

Benny’s dilemma.

Pig Data

Benny has a brother now. He wanted one. And then he got one.

Some Datasets

palmerpenguins::penguins
# A tibble: 344 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>

N = 344

Some Datasets

gapminder::gapminder
# A tibble: 1,704 × 6
   country     continent  year lifeExp      pop gdpPercap
   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
 1 Afghanistan Asia       1952    28.8  8425333      779.
 2 Afghanistan Asia       1957    30.3  9240934      821.
 3 Afghanistan Asia       1962    32.0 10267083      853.
 4 Afghanistan Asia       1967    34.0 11537966      836.
 5 Afghanistan Asia       1972    36.1 13079460      740.
 6 Afghanistan Asia       1977    38.4 14880372      786.
 7 Afghanistan Asia       1982    39.9 12881816      978.
 8 Afghanistan Asia       1987    40.8 13867957      852.
 9 Afghanistan Asia       1992    41.7 16317921      649.
10 Afghanistan Asia       1997    41.8 22227415      635.
# ℹ 1,694 more rows

N = 1,704

Some Datasets

socviz::gss_lon
# A tibble: 62,466 × 25
    year    id ballot age   degree race  sex   siblings kids  bigregion income16
   <dbl> <dbl> <labe> <lab> <fct>  <fct> <fct> <fct>    <fct> <fct>     <fct>   
 1  1972     1 NA     23    Bache… White Fema… 3        0     Midwest   <NA>    
 2  1972     2 NA     70    Lt Hi… White Male  4        4+    Midwest   <NA>    
 3  1972     3 NA     48    High … White Fema… 5        4+    Midwest   <NA>    
 4  1972     4 NA     27    Bache… White Fema… 5        0     Midwest   <NA>    
 5  1972     5 NA     61    High … White Fema… 2        2     Midwest   <NA>    
 6  1972     6 NA     26    High … White Male  1        0     Midwest   <NA>    
 7  1972     7 NA     28    High … White Male  6+       2     Midwest   <NA>    
 8  1972     8 NA     27    Bache… White Male  1        0     Midwest   <NA>    
 9  1972     9 NA     21    High … Black Fema… 2        2     South     <NA>    
10  1972    10 NA     30    High … Black Fema… 6+       4+    South     <NA>    
# ℹ 62,456 more rows
# ℹ 14 more variables: religion <fct>, marital <fct>, padeg <fct>, madeg <fct>,
#   partyid <fct>, polviews <fct>, happy <fct>, partners_rc <fct>, grass <fct>,
#   zodiac <fct>, pres12 <labelled>, wtssall <dbl>, vpsu <dbl>, vstrat <dbl>

N = 62,466

Some Datasets

nycflights13::flights
# A tibble: 336,776 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812            837
 6  2013     1     1      554            558        -4      740            728
 7  2013     1     1      555            600        -5      913            854
 8  2013     1     1      557            600        -3      709            723
 9  2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
# ℹ 336,766 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

N = 336,776

Some Datasets

covdata::stmf
# A tibble: 580,395 × 17
   country_code cname    iso2  continent iso3   year  week sex   split split_sex
   <chr>        <chr>    <chr> <chr>     <chr> <dbl> <dbl> <chr> <dbl>     <dbl>
 1 AUS          Austral… AU    Oceania   AUS    2015     1 m         1         0
 2 AUS          Austral… AU    Oceania   AUS    2015     1 m         1         0
 3 AUS          Austral… AU    Oceania   AUS    2015     1 m         1         0
 4 AUS          Austral… AU    Oceania   AUS    2015     1 m         1         0
 5 AUS          Austral… AU    Oceania   AUS    2015     1 m         1         0
 6 AUS          Austral… AU    Oceania   AUS    2015     1 f         1         0
 7 AUS          Austral… AU    Oceania   AUS    2015     1 f         1         0
 8 AUS          Austral… AU    Oceania   AUS    2015     1 f         1         0
 9 AUS          Austral… AU    Oceania   AUS    2015     1 f         1         0
10 AUS          Austral… AU    Oceania   AUS    2015     1 f         1         0
# ℹ 580,385 more rows
# ℹ 7 more variables: forecast <dbl>, approx_date <date>, age_group <chr>,
#   death_count <dbl>, death_rate <dbl>, deaths_total <dbl>, rate_total <dbl>

N = 580,395

Some Datasets

covdata::apple_mobility
# A tibble: 2,254,515 × 7
   country sub_region subregion_and_city geo_type date       transportation_type
   <chr>   <chr>      <chr>              <chr>    <date>     <chr>              
 1 Albania Total      Total              country… 2020-01-13 driving            
 2 Albania Total      Total              country… 2020-01-14 driving            
 3 Albania Total      Total              country… 2020-01-15 driving            
 4 Albania Total      Total              country… 2020-01-16 driving            
 5 Albania Total      Total              country… 2020-01-17 driving            
 6 Albania Total      Total              country… 2020-01-18 driving            
 7 Albania Total      Total              country… 2020-01-19 driving            
 8 Albania Total      Total              country… 2020-01-20 driving            
 9 Albania Total      Total              country… 2020-01-21 driving            
10 Albania Total      Total              country… 2020-01-22 driving            
# ℹ 2,254,505 more rows
# ℹ 1 more variable: score <dbl>

N = 2,254,515

Some Datasets

apple_health_all
# A tibble: 7,808,311 × 10
   rowid type       sourceName sourceVersion device unit  creationDate startDate
   <int> <chr>      <chr>      <chr>         <chr>  <chr> <chr>        <chr>
 1     1 HKQuantit… Withings   6050301       <<HKD… count 2024-08-31 … 2024-05-…
 2     2 HKQuantit… Withings   6050301       <<HKD… count 2024-08-31 … 2024-05-…
 3     3 HKQuantit… Withings   6050301       <<HKD… count 2024-08-31 … 2024-05-…
 4     4 HKQuantit… Withings   6050301       <<HKD… count 2024-08-31 … 2024-05-…
 5     5 HKQuantit… Withings   6050301       <<HKD… count 2024-08-31 … 2024-05-…
 6     6 HKQuantit… Withings   6050301       <<HKD… count 2024-08-31 … 2024-05-…
 7     7 HKQuantit… Withings   6050301       <<HKD… count 2024-08-31 … 2024-05-…
 8     8 HKQuantit… Withings   6050301       <<HKD… count 2024-08-31 … 2024-05-…
 9     9 HKQuantit… Withings   6050301       <<HKD… count 2024-08-31 … 2024-05-…
10    10 HKQuantit… Withings   6050301       <<HKD… count 2024-08-31 … 2024-05-…
# ℹ 7,808,301 more rows
# ℹ 2 more variables: endDate <chr>, value <chr>

N = 7,808,311. Entirely about me. On my phone.

Some Datasets

## Arcos opioids data
df_db <- tbl(con, "arcos")
## How many rows?
df_db |> count()
#> # Source:   SQL [1 x 1]
#> # Database: DuckDB v1.1.1 [root@Darwin 24.0.0:R 4.4.1//Users/kjhealy/Documents/data/arcos_opioids_test/db/arcos.duckdb]
#>           n
#>       <dbl>
#> 1 178598026

N = 178,598,026

This is the only one that’s even remotely “big” right now—i.e. that needs a database if I want to interact with it properly. Truly big data won’t fit on your laptop.

Pig Data

Benny’s brother screams and screams.

‘Big’ Data

More of everything

  • More observations
  • More patterns
  • More possibilities
  • More mess
  • More cleaning
  • More decisions
  • More problems

But we stay the same

  • Same perception
  • Same cognitive limits
  • Same graphs we know how to understand
  • Same number of things we can reasonably pay attention to
  • Same tendency to spin yarns when we see a pattern

Table Joins

Merging Data

Often we work with a single table of data. Ultimately, a lot of quantitative analysis that we want to do will require a single table (or rather, a single matrix).

Sometimes we want to add additional information to this table. Very often, something from a different level of analysis.

E.g. we have a table of individual-level measures about people, and we have a separate table of information about the county or state they live in.
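
As a sketch of the kind of merge we mean, here are two small made-up tables: one of people, one of the states they live in. The names and values are invented for illustration.

people <- tribble(
  ~name,    ~state, ~income,
  "Ana",    "NC",   52000,
  "Bashir", "NC",   61000,
  "Carla",  "VA",   48000
)

states <- tribble(
  ~state, ~median_income, ~region,
  "NC",   60500,          "South",
  "VA",   80600,          "South"
)

## Attach the state-level information to each person
left_join(people, states, by = "state")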

Merging Data

In the world of spreadsheets, and also a lot of social science, we talk about this kind of problem as one of merging data.

In the world of databases, this is a table join.

Example: Cleaning Up Congress

Cleaning up congress

filenames <- dir(path = here("files", "data", "congress"),
                 pattern = "*.csv",
                 full.names = TRUE)

df <- read_csv(filenames, id = "path",
                name_repair = janitor::make_clean_names) |>
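  ## Pull the congress number (e.g. 79 or 116) out of each file path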
  mutate(congress = stringr::str_extract(path, "_\\d{2,3}_congress"),
         congress = stringr::str_extract(congress, "\\d{2,3}"),
         congress = as.integer(congress)) |>
  relocate(congress)

df |>
  select(congress, last, first, born, death, start, end)
# A tibble: 20,580 × 7
   congress last      first   born       death      start      end       
      <int> <chr>     <chr>   <chr>      <chr>      <chr>      <chr>     
 1       79 Abernethy Thomas  05/16/1903 01/23/1953 01/03/1945 01/03/1953
 2       79 Adams     Sherman 01/08/1899 10/27/1986 01/03/1945 01/03/1947
 3       79 Aiken     George  08/20/1892 11/19/1984 01/03/1945 01/03/1979
 4       79 Allen     Asa     01/05/1891 01/05/1969 01/03/1945 01/03/1953
 5       79 Allen     Leo     10/05/1898 01/19/1973 01/03/1945 01/02/1949
 6       79 Almond    J.      06/15/1898 04/14/1986 02/04/1946 04/17/1948
 7       79 Andersen  Herman  01/27/1897 07/26/1978 01/03/1945 01/03/1963
 8       79 Anderson  Clinton 10/23/1895 11/11/1975 01/03/1941 06/30/1945
 9       79 Anderson  John    03/22/1904 02/09/1981 01/03/1945 01/03/1953
10       79 Andresen  August  10/11/1890 01/14/1958 01/03/1945 01/14/1958
# ℹ 20,570 more rows

Cleaning up congress

date_recodes <- c("born", "death", "start", "end")
df <- df |>
    mutate(across(any_of(date_recodes), mdy))

df |>
 select(congress, last, first, born, death, start, end)
# A tibble: 20,580 × 7
   congress last      first   born       death      start      end       
      <int> <chr>     <chr>   <date>     <date>     <date>     <date>    
 1       79 Abernethy Thomas  1903-05-16 1953-01-23 1945-01-03 1953-01-03
 2       79 Adams     Sherman 1899-01-08 1986-10-27 1945-01-03 1947-01-03
 3       79 Aiken     George  1892-08-20 1984-11-19 1945-01-03 1979-01-03
 4       79 Allen     Asa     1891-01-05 1969-01-05 1945-01-03 1953-01-03
 5       79 Allen     Leo     1898-10-05 1973-01-19 1945-01-03 1949-01-02
 6       79 Almond    J.      1898-06-15 1986-04-14 1946-02-04 1948-04-17
 7       79 Andersen  Herman  1897-01-27 1978-07-26 1945-01-03 1963-01-03
 8       79 Anderson  Clinton 1895-10-23 1975-11-11 1941-01-03 1945-06-30
 9       79 Anderson  John    1904-03-22 1981-02-09 1945-01-03 1953-01-03
10       79 Andresen  August  1890-10-11 1958-01-14 1945-01-03 1958-01-14
# ℹ 20,570 more rows

Cleaning up congress

Next, a new table.

sessions <- tibble(congress = 79L:116L,
                   start_year = seq(1945, 2019, by = 2),
                   end_year = seq(1947, 2021, by = 2)) |>
  mutate(start_year = ymd(paste(start_year, "01", "03", sep = "-")),
         end_year = ymd(paste(end_year, "01", "03", sep = "-")))


sessions
# A tibble: 38 × 3
   congress start_year end_year  
      <int> <date>     <date>    
 1       79 1945-01-03 1947-01-03
 2       80 1947-01-03 1949-01-03
 3       81 1949-01-03 1951-01-03
 4       82 1951-01-03 1953-01-03
 5       83 1953-01-03 1955-01-03
 6       84 1955-01-03 1957-01-03
 7       85 1957-01-03 1959-01-03
 8       86 1959-01-03 1961-01-03
 9       87 1961-01-03 1963-01-03
10       88 1963-01-03 1965-01-03
# ℹ 28 more rows

We’re going to join these tables

The big table:

df |>
  select(congress, last, born)
# A tibble: 20,580 × 3
   congress last      born      
      <int> <chr>     <date>    
 1       79 Abernethy 1903-05-16
 2       79 Adams     1899-01-08
 3       79 Aiken     1892-08-20
 4       79 Allen     1891-01-05
 5       79 Allen     1898-10-05
 6       79 Almond    1898-06-15
 7       79 Andersen  1897-01-27
 8       79 Anderson  1895-10-23
 9       79 Anderson  1904-03-22
10       79 Andresen  1890-10-11
# ℹ 20,570 more rows

The smaller table

sessions
# A tibble: 38 × 3
   congress start_year end_year  
      <int> <date>     <date>    
 1       79 1945-01-03 1947-01-03
 2       80 1947-01-03 1949-01-03
 3       81 1949-01-03 1951-01-03
 4       82 1951-01-03 1953-01-03
 5       83 1953-01-03 1955-01-03
 6       84 1955-01-03 1957-01-03
 7       85 1957-01-03 1959-01-03
 8       86 1959-01-03 1961-01-03
 9       87 1961-01-03 1963-01-03
10       88 1963-01-03 1965-01-03
# ℹ 28 more rows

We’re going to join these tables

We will use left_join(), which is what you want most of the time when you are merging a smaller table of additional information into a larger main one.

df <- left_join(df, sessions) |>
  relocate(start_year:end_year, .after = congress)
Joining with `by = join_by(congress)`
df |>
  select(congress, last, first, start_year, end_year, start, end)
# A tibble: 20,580 × 7
   congress last      first   start_year end_year   start      end       
      <int> <chr>     <chr>   <date>     <date>     <date>     <date>    
 1       79 Abernethy Thomas  1945-01-03 1947-01-03 1945-01-03 1953-01-03
 2       79 Adams     Sherman 1945-01-03 1947-01-03 1945-01-03 1947-01-03
 3       79 Aiken     George  1945-01-03 1947-01-03 1945-01-03 1979-01-03
 4       79 Allen     Asa     1945-01-03 1947-01-03 1945-01-03 1953-01-03
 5       79 Allen     Leo     1945-01-03 1947-01-03 1945-01-03 1949-01-02
 6       79 Almond    J.      1945-01-03 1947-01-03 1946-02-04 1948-04-17
 7       79 Andersen  Herman  1945-01-03 1947-01-03 1945-01-03 1963-01-03
 8       79 Anderson  Clinton 1945-01-03 1947-01-03 1941-01-03 1945-06-30
 9       79 Anderson  John    1945-01-03 1947-01-03 1945-01-03 1953-01-03
10       79 Andresen  August  1945-01-03 1947-01-03 1945-01-03 1958-01-14
# ℹ 20,570 more rows
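
dplyr guessed the key here, as the “Joining with `by = join_by(congress)`” message shows. The same join can be written with the key named explicitly, which silences the message and guards against surprises if column names change:

## Same as above, but with the join key named explicitly
left_join(df, sessions, by = join_by(congress))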

Table joins

Left join, left_join()

All rows from x, and all columns from x and y. Rows in x with no match in y will have NA values in the new columns.

Left join (contd), left_join()

If there are multiple matches between x and y, all combinations of the matches are returned.

Inner join, inner_join()

All rows from x where there are matching values in y, and all columns from x and y.

Full join, full_join()

All rows and all columns from both x and y. Where there are not matching values, returns NA for the one missing.

Semi join, semi_join()

All rows from x where there are matching values in y, keeping just columns from x.

Anti join, anti_join()

All rows from x where there are not matching values in y, keeping just columns from x.
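
A minimal sketch of how these differ, using two small made-up tibbles:

x <- tibble(id = c(1, 2, 3), val_x = c("a", "b", "c"))
y <- tibble(id = c(1, 2, 4), val_y = c("d", "e", "f"))

left_join(x, y, by = "id")  ## 3 rows; id 3 gets NA for val_y
inner_join(x, y, by = "id") ## 2 rows; only ids 1 and 2 match
full_join(x, y, by = "id")  ## 4 rows; NAs fill the gaps on both sides
semi_join(x, y, by = "id")  ## ids 1 and 2, columns from x only
anti_join(x, y, by = "id")  ## id 3 only, columns from x only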

Databases

Databases

When we’re working with data in the social sciences the basic case is a single table that we’re going to do something with, like run a regression or make a plot.

gapminder
# A tibble: 1,704 × 6
   country     continent  year lifeExp      pop gdpPercap
   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
 1 Afghanistan Asia       1952    28.8  8425333      779.
 2 Afghanistan Asia       1957    30.3  9240934      821.
 3 Afghanistan Asia       1962    32.0 10267083      853.
 4 Afghanistan Asia       1967    34.0 11537966      836.
 5 Afghanistan Asia       1972    36.1 13079460      740.
 6 Afghanistan Asia       1977    38.4 14880372      786.
 7 Afghanistan Asia       1982    39.9 12881816      978.
 8 Afghanistan Asia       1987    40.8 13867957      852.
 9 Afghanistan Asia       1992    41.7 16317921      649.
10 Afghanistan Asia       1997    41.8 22227415      635.
# ℹ 1,694 more rows

But the bigger a dataset gets, the more we have to think about whether we really want (or even can have) all of it in memory all the time.

Databases

In addition, much of what we want to do with a specific dataset will involve actually acting on some relatively small subset of it.

gapminder |>
  select(gdpPercap, lifeExp)
# A tibble: 1,704 × 2
   gdpPercap lifeExp
       <dbl>   <dbl>
 1      779.    28.8
 2      821.    30.3
 3      853.    32.0
 4      836.    34.0
 5      740.    36.1
 6      786.    38.4
 7      978.    39.9
 8      852.    40.8
 9      649.    41.7
10      635.    41.8
# ℹ 1,694 more rows

Databases

In addition, much of what we want to do with a specific dataset will involve actually acting on some relatively small subset of it.

gapminder |>
  filter(continent == "Europe",
         year == 1977)
# A tibble: 30 × 6
   country                continent  year lifeExp      pop gdpPercap
   <fct>                  <fct>     <int>   <dbl>    <int>     <dbl>
 1 Albania                Europe     1977    68.9  2509048     3533.
 2 Austria                Europe     1977    72.2  7568430    19749.
 3 Belgium                Europe     1977    72.8  9821800    19118.
 4 Bosnia and Herzegovina Europe     1977    69.9  4086000     3528.
 5 Bulgaria               Europe     1977    70.8  8797022     7612.
 6 Croatia                Europe     1977    70.6  4318673    11305.
 7 Czech Republic         Europe     1977    70.7 10161915    14800.
 8 Denmark                Europe     1977    74.7  5088419    20423.
 9 Finland                Europe     1977    72.5  4738902    15605.
10 France                 Europe     1977    73.8 53165019    18293.
# ℹ 20 more rows

Databases

In addition, much of what we want to do with a specific dataset will involve actually acting on some relatively small subset of it.

gapminder |>
  group_by(continent) |>
  summarize(lifeExp = mean(lifeExp),
            pop = mean(pop),
            gdpPercap = mean(gdpPercap))
# A tibble: 5 × 4
  continent lifeExp       pop gdpPercap
  <fct>       <dbl>     <dbl>     <dbl>
1 Africa       48.9  9916003.     2194.
2 Americas     64.7 24504795.     7136.
3 Asia         60.1 77038722.     7902.
4 Europe       71.9 17169765.    14469.
5 Oceania      74.3  8874672.    18622.

Databases

Efficiently storing and querying really large quantities of data is the realm of the database and of Structured Query Languages.

As with everything in information technology there is a long and interesting story about various efforts to come up with a good theory of data storage and retrieval, and efficient algorithms for it. If you are interested, watch e.g. this lecture from a DBMS course from about twelve minutes in.

Where’s the database?

Local or remote?

On disk or in memory?

The important thing from the database admin’s point of view is that the data is stored efficiently, that we have a means of querying it, and that those queries rely on some search-and-retrieval method that’s really fast.

There’s no free lunch. We want storage methods to be efficient and queries to be fast because the datasets are gonna be gigantic, and accessing them will take time.

Database layouts

A real database is usually not a single giant table. Instead it is more like a collection of related tables that are partially connected through keys, i.e. identifier columns shared between tables. Those keys are efficiently indexed by the database software. Further, tables are stored in a super-clever, tree-like way that makes searching much faster than just going down each row and looking for matches.

From a social science perspective, putting things in different tables might be thought of as a matter of logically organizing entities at different units of observation. Querying tables is a matter of assembling them ad hoc at various units of analysis.

Database layouts

gapminder_xtra <- read_csv(here("files", "data", "gapminder_xtra.csv"))
gapminder_xtra
# A tibble: 1,704 × 13
   country     continent  year lifeExp      pop gdpPercap area_pct pop_pct
   <chr>       <chr>     <dbl>   <dbl>    <dbl>     <dbl>    <dbl>   <dbl>
 1 Afghanistan Asia       1952    28.8  8425333      779.     29.8    59.4
 2 Afghanistan Asia       1957    30.3  9240934      821.     29.8    59.4
 3 Afghanistan Asia       1962    32.0 10267083      853.     29.8    59.4
 4 Afghanistan Asia       1967    34.0 11537966      836.     29.8    59.4
 5 Afghanistan Asia       1972    36.1 13079460      740.     29.8    59.4
 6 Afghanistan Asia       1977    38.4 14880372      786.     29.8    59.4
 7 Afghanistan Asia       1982    39.9 12881816      978.     29.8    59.4
 8 Afghanistan Asia       1987    40.8 13867957      852.     29.8    59.4
 9 Afghanistan Asia       1992    41.7 16317921      649.     29.8    59.4
10 Afghanistan Asia       1997    41.8 22227415      635.     29.8    59.4
# ℹ 1,694 more rows
# ℹ 5 more variables: gm_countries <dbl>, country_fr <chr>, iso2 <chr>,
#   iso3 <chr>, number <dbl>
gapminder_xtra  |> select(gm_countries:number)
# A tibble: 1,704 × 5
   gm_countries country_fr       iso2  iso3  number
          <dbl> <chr>            <chr> <chr>  <dbl>
 1           33 Afghanistan (l') AF    AFG        4
 2           33 Afghanistan (l') AF    AFG        4
 3           33 Afghanistan (l') AF    AFG        4
 4           33 Afghanistan (l') AF    AFG        4
 5           33 Afghanistan (l') AF    AFG        4
 6           33 Afghanistan (l') AF    AFG        4
 7           33 Afghanistan (l') AF    AFG        4
 8           33 Afghanistan (l') AF    AFG        4
 9           33 Afghanistan (l') AF    AFG        4
10           33 Afghanistan (l') AF    AFG        4
# ℹ 1,694 more rows

Again, in social science terms, the redundancies are annoying in part because they mix together different levels or units of observation. From a database point of view they are also bad because they open the door to a variety of errors or anomalies when updating the table, and they make search and querying really inefficient.

Database normalization

A hierarchical set of rules and criteria for ensuring the integrity of data stored across multiple tables and for reducing redundancy in data storage.

Tries to eliminate various sources of error — so-called Insertion, Update, and Deletion anomalies — particularly ones that will pollute, damage, or corrupt things beyond the specific change.

Redundancy and error are minimized by breaking the database up into a series of linked or related tables. Hence the term “relational database”.

Normal Forms

0NF: No duplicate rows!

1NF: Using row order to convey information is not allowed; mixing data types in the same column is not allowed; every table must have a primary key (which may be defined by more than one column); and no “repeating groups”.

2NF: Each non-key attribute must depend on the entire primary key

3NF: Every non-key attribute should depend wholly and only on the key.

Think of these rules in connection with ideas about “tidy data” that we’ve already covered.

Database normalization

gapminder_xtra
# A tibble: 1,704 × 13
   country     continent  year lifeExp      pop gdpPercap area_pct pop_pct
   <chr>       <chr>     <dbl>   <dbl>    <dbl>     <dbl>    <dbl>   <dbl>
 1 Afghanistan Asia       1952    28.8  8425333      779.     29.8    59.4
 2 Afghanistan Asia       1957    30.3  9240934      821.     29.8    59.4
 3 Afghanistan Asia       1962    32.0 10267083      853.     29.8    59.4
 4 Afghanistan Asia       1967    34.0 11537966      836.     29.8    59.4
 5 Afghanistan Asia       1972    36.1 13079460      740.     29.8    59.4
 6 Afghanistan Asia       1977    38.4 14880372      786.     29.8    59.4
 7 Afghanistan Asia       1982    39.9 12881816      978.     29.8    59.4
 8 Afghanistan Asia       1987    40.8 13867957      852.     29.8    59.4
 9 Afghanistan Asia       1992    41.7 16317921      649.     29.8    59.4
10 Afghanistan Asia       1997    41.8 22227415      635.     29.8    59.4
# ℹ 1,694 more rows
# ℹ 5 more variables: gm_countries <dbl>, country_fr <chr>, iso2 <chr>,
#   iso3 <chr>, number <dbl>

Database normalization

gapminder
# A tibble: 1,704 × 6
   country     continent  year lifeExp      pop gdpPercap
   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
 1 Afghanistan Asia       1952    28.8  8425333      779.
 2 Afghanistan Asia       1957    30.3  9240934      821.
 3 Afghanistan Asia       1962    32.0 10267083      853.
 4 Afghanistan Asia       1967    34.0 11537966      836.
 5 Afghanistan Asia       1972    36.1 13079460      740.
 6 Afghanistan Asia       1977    38.4 14880372      786.
 7 Afghanistan Asia       1982    39.9 12881816      978.
 8 Afghanistan Asia       1987    40.8 13867957      852.
 9 Afghanistan Asia       1992    41.7 16317921      649.
10 Afghanistan Asia       1997    41.8 22227415      635.
# ℹ 1,694 more rows

Database normalization

continent_tbl <- read_tsv(here("files", "data", "continent_tab.tsv"))
country_tbl <- read_tsv(here("files", "data", "country_tab.tsv"))
year_tbl <-  read_tsv(here("files", "data", "year_tab.tsv"))

continent_tbl
# A tibble: 5 × 5
  continent_id continent area_pct pop_pct gm_countries
         <dbl> <chr>        <dbl>   <dbl>        <dbl>
1            1 Africa        20.3    17.6           52
2            2 Americas      28.1    13             25
3            3 Asia          29.8    59.4           33
4            4 Europe         6.7     9.4           30
5            5 Oceania        5.7     0.6            2
gapminder
# A tibble: 1,704 × 6
   country     continent  year lifeExp      pop gdpPercap
   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
 1 Afghanistan Asia       1952    28.8  8425333      779.
 2 Afghanistan Asia       1957    30.3  9240934      821.
 3 Afghanistan Asia       1962    32.0 10267083      853.
 4 Afghanistan Asia       1967    34.0 11537966      836.
 5 Afghanistan Asia       1972    36.1 13079460      740.
 6 Afghanistan Asia       1977    38.4 14880372      786.
 7 Afghanistan Asia       1982    39.9 12881816      978.
 8 Afghanistan Asia       1987    40.8 13867957      852.
 9 Afghanistan Asia       1992    41.7 16317921      649.
10 Afghanistan Asia       1997    41.8 22227415      635.
# ℹ 1,694 more rows

Database normalization

continent_tbl
# A tibble: 5 × 5
  continent_id continent area_pct pop_pct gm_countries
         <dbl> <chr>        <dbl>   <dbl>        <dbl>
1            1 Africa        20.3    17.6           52
2            2 Americas      28.1    13             25
3            3 Asia          29.8    59.4           33
4            4 Europe         6.7     9.4           30
5            5 Oceania        5.7     0.6            2
country_tbl
# A tibble: 249 × 8
   country_id continent_id country     iso_country country_fr iso2  iso3  number
        <dbl>        <dbl> <chr>       <chr>       <chr>      <chr> <chr>  <dbl>
 1          1            3 Afghanistan Afghanistan Afghanist… AF    AFG        4
 2          2            4 Albania     Albania     Albanie (… AL    ALB        8
 3          3            1 Algeria     Algeria     Algérie (… DZ    DZA       12
 4          4           NA <NA>        American S… Samoa amé… AS    ASM       16
 5          5           NA <NA>        Andorra     Andorre (… AD    AND       20
 6          6            1 Angola      Angola      Angola (l… AO    AGO       24
 7          7           NA Anguilla    Anguilla    Anguilla   AI    AIA      660
 8          8           NA Antarctica  Antarctica  Antarctiq… AQ    ATA       10
 9          9           NA Antigua an… Antigua an… Antigua-e… AG    ATG       28
10         10            2 Argentina   Argentina   Argentine… AR    ARG       32
# ℹ 239 more rows

Database normalization

country_tbl
# A tibble: 249 × 8
   country_id continent_id country     iso_country country_fr iso2  iso3  number
        <dbl>        <dbl> <chr>       <chr>       <chr>      <chr> <chr>  <dbl>
 1          1            3 Afghanistan Afghanistan Afghanist… AF    AFG        4
 2          2            4 Albania     Albania     Albanie (… AL    ALB        8
 3          3            1 Algeria     Algeria     Algérie (… DZ    DZA       12
 4          4           NA <NA>        American S… Samoa amé… AS    ASM       16
 5          5           NA <NA>        Andorra     Andorre (… AD    AND       20
 6          6            1 Angola      Angola      Angola (l… AO    AGO       24
 7          7           NA Anguilla    Anguilla    Anguilla   AI    AIA      660
 8          8           NA Antarctica  Antarctica  Antarctiq… AQ    ATA       10
 9          9           NA Antigua an… Antigua an… Antigua-e… AG    ATG       28
10         10            2 Argentina   Argentina   Argentine… AR    ARG       32
# ℹ 239 more rows
year_tbl
# A tibble: 1,704 × 5
    year country_id lifeExp      pop gdpPercap
   <dbl>      <dbl>   <dbl>    <dbl>     <dbl>
 1  1952          1    28.8  8425333      779.
 2  1957          1    30.3  9240934      821.
 3  1962          1    32.0 10267083      853.
 4  1967          1    34.0 11537966      836.
 5  1972          1    36.1 13079460      740.
 6  1977          1    38.4 14880372      786.
 7  1982          1    39.9 12881816      978.
 8  1987          1    40.8 13867957      852.
 9  1992          1    41.7 16317921      649.
10  1997          1    41.8 22227415      635.
# ℹ 1,694 more rows
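
As a check that nothing is lost in the normalized layout, we can join the pieces back together on their keys and recover a gapminder-style table, one row per country-year:

## Reassemble the denormalized table from the keyed pieces
year_tbl |>
  left_join(country_tbl, by = "country_id") |>
  left_join(continent_tbl, by = "continent_id") |>
  select(country, continent, year, lifeExp, pop, gdpPercap)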

Talking to databases

The main idea

Ultimately, we query databases with SQL. There are several dialects of it, because there are many database systems and each has its own wrinkles and quirks.

We try to abstract away from some of those quirks by using a DBI (DataBase Interface) layer, which is a generic set of commands for talking to some database. It’s analogous to an API.

We also need to use a package for the DBMS we’re talking to. It translates DBI instructions into the specific dialect the DBMS speaks.

Talking to databases

Some databases are small, and some are far away.

Client-server databases are like websites, serving up responses to queries. The database lives on a machine somewhere in the building, or on campus or whatever.

Cloud DBMSs are like this, too, except the database lives on a machine in someone else’s building.

In-process DBMSs live and run on your laptop. We’ll use one of these, duckdb, for the examples here.

Talking to databases

We need to open a connection to a database before talking to it. Conventionally this is called con.

Once connected, we can ask it questions. Either we use functions or packages designed to translate our R / dplyr syntax into SQL, or we use functions to pass SQL queries on directly.

We try to minimize the amount of time we spend actually making the database do a lot of work.

The key thing is that when working with databases our queries are lazy — they don’t actually do anything on the whole database unless it’s strictly necessary or they’re explicitly told to.

Example: flights

The nice example

Where everything is lovely and clean. Thanks to Grant McDermott for the following example.

duckdb and DBI

# library(DBI)

con <- dbConnect(duckdb::duckdb(), path = ":memory:")

Here we open a connection to an in-memory duckdb database. It’s empty. We’re going to populate it with data from nycflights13.
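
An in-memory database disappears when we disconnect. If we wanted one that persists between sessions, we could point duckdb at a file on disk instead. A sketch, with a made-up file path:

## A persistent, on-disk database instead of an in-memory one
con_disk <- dbConnect(duckdb::duckdb(), dbdir = here("files", "data", "flights.duckdb"))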

duckdb and DBI

copy_to(
  dest = con,
  df = nycflights13::flights,
  name = "flights",
  temporary = FALSE,
  indexes = list(
    c("year", "month", "day"),
    "carrier",
    "tailnum",
    "dest"
    )
  )

Remember, keys and indexes are what make databases fast.

Make a lazy tibble from it

This says “go to con and get the ‘flights’ table in it, and pretend it’s a tibble called flights_db”.

flights_db <- tbl(con, "flights")

flights_db
# Source:   table<flights> [?? x 19]
# Database: DuckDB v1.1.3 [root@Darwin 24.4.0:R 4.4.3/:memory:]
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812            837
 6  2013     1     1      554            558        -4      740            728
 7  2013     1     1      555            600        -5      913            854
 8  2013     1     1      557            600        -3      709            723
 9  2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
# ℹ more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

Run some dplyr-like queries

flights_db |> select(year:day, dep_delay, arr_delay)
# Source:   SQL [?? x 5]
# Database: DuckDB v1.1.3 [root@Darwin 24.4.0:R 4.4.3/:memory:]
    year month   day dep_delay arr_delay
   <int> <int> <int>     <dbl>     <dbl>
 1  2013     1     1         2        11
 2  2013     1     1         4        20
 3  2013     1     1         2        33
 4  2013     1     1        -1       -18
 5  2013     1     1        -6       -25
 6  2013     1     1        -4        12
 7  2013     1     1        -5        19
 8  2013     1     1        -3       -14
 9  2013     1     1        -3        -8
10  2013     1     1        -2         8
# ℹ more rows

Run some dplyr-like queries

flights_db |> filter(dep_delay > 240)
# Source:   SQL [?? x 19]
# Database: DuckDB v1.1.3 [root@Darwin 24.4.0:R 4.4.3/:memory:]
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      848           1835       853     1001           1950
 2  2013     1     1     1815           1325       290     2120           1542
 3  2013     1     1     1842           1422       260     1958           1535
 4  2013     1     1     2115           1700       255     2330           1920
 5  2013     1     1     2205           1720       285       46           2040
 6  2013     1     1     2343           1724       379      314           1938
 7  2013     1     2     1332            904       268     1616           1128
 8  2013     1     2     1412            838       334     1710           1147
 9  2013     1     2     1607           1030       337     2003           1355
10  2013     1     2     2131           1512       379     2340           1741
# ℹ more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

Run some dplyr-like queries

flights_db |>
  group_by(dest) |>
  summarise(mean_dep_delay = mean(dep_delay))
# Source:   SQL [?? x 2]
# Database: DuckDB v1.1.3 [root@Darwin 24.4.0:R 4.4.3/:memory:]
   dest  mean_dep_delay
   <chr>          <dbl>
 1 CLT             9.22
 2 MDW            18.6 
 3 HOU            14.3 
 4 SDF            16.4 
 5 LAS             9.42
 6 PHX            10.4 
 7 IAH            10.8 
 8 SYR            14.4 
 9 CAK            20.8 
10 BDL            17.7 
# ℹ more rows
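
Behind the scenes, dbplyr translates this pipeline into SQL and ships it to duckdb. To see the query it generates (roughly, a SELECT with a GROUP BY on dest), add show_query() to the end:

flights_db |>
  group_by(dest) |>
  summarise(mean_dep_delay = mean(dep_delay)) |>
  show_query()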

Lazy, lazy, lazy

tailnum_delay_db <-
  flights_db |>
  group_by(tailnum) |>
  summarise(
    mean_dep_delay = mean(dep_delay),
    mean_arr_delay = mean(arr_delay),
    n = n()) |>
  filter(n > 100) |>
  arrange(desc(mean_arr_delay))

This doesn’t touch the database.

Lazy, lazy, lazy

Even when we ask to look at it, it just does the absolute minimum required.

tailnum_delay_db
# Source:     SQL [?? x 4]
# Database:   DuckDB v1.1.3 [root@Darwin 24.4.0:R 4.4.3/:memory:]
# Ordered by: desc(mean_arr_delay)
   tailnum mean_dep_delay mean_arr_delay     n
   <chr>            <dbl>          <dbl> <dbl>
 1 N11119            32.6           30.3   148
 2 N16919            32.4           29.9   251
 3 N14998            29.4           27.9   230
 4 N15910            29.3           27.6   280
 5 N13123            29.6           26.0   121
 6 N11192            27.5           25.9   154
 7 N14950            26.2           25.3   219
 8 N21130            27.0           25.0   126
 9 N24128            24.8           24.9   129
10 N22971            26.5           24.7   230
# ℹ more rows

When ready, use collect()

tailnum_delay <-
  tailnum_delay_db |>
  collect()

tailnum_delay
# A tibble: 1,201 × 4
   tailnum mean_dep_delay mean_arr_delay     n
   <chr>            <dbl>          <dbl> <dbl>
 1 N11119            32.6           30.3   148
 2 N16919            32.4           29.9   251
 3 N14998            29.4           27.9   230
 4 N15910            29.3           27.6   280
 5 N13123            29.6           26.0   121
 6 N11192            27.5           25.9   154
 7 N14950            26.2           25.3   219
 8 N21130            27.0           25.0   126
 9 N24128            24.8           24.9   129
10 N22971            26.5           24.7   230
# ℹ 1,191 more rows

Now it exists for realsies.

Joins

Database systems will have more than one table. We query and join them. The idea is that getting the DBMS to do this will be way faster and more memory-efficient than trying to get dplyr to do it.

Joins

## Copy over the "planes" dataset to the same "con" DuckDB connection.
copy_to(
    dest = con,
    df = nycflights13::planes,
    name = "planes",
    temporary = FALSE,
    indexes = "tailnum"
    )

## List tables in our "con" database connection (i.e. now "flights" and "planes")
dbListTables(con)
[1] "flights" "planes" 
## Reference from dplyr
planes_db <-  tbl(con, 'planes')

See what we did there? It’s as if con, the database connection, has a list of tables inside it.
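
Earlier we said we could also pass SQL queries on directly instead of going through dplyr. A minimal sketch using DBI, with a query invented just for illustration:

## Send a raw SQL query straight to the connection
dbGetQuery(con, "SELECT tailnum, year, model FROM planes LIMIT 5")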

Joins

# Still not done for realsies!
left_join(
    flights_db,
    planes_db |> rename(year_built = year),
    by = "tailnum" ## Important: Be specific about the joining column
) |>
    select(year, month, day, dep_time, arr_time, carrier, flight, tailnum,
           year_built, type, model)
# Source:   SQL [?? x 11]
# Database: DuckDB v1.1.3 [root@Darwin 24.4.0:R 4.4.3/:memory:]
    year month   day dep_time arr_time carrier flight tailnum year_built type   
   <int> <int> <int>    <int>    <int> <chr>    <int> <chr>        <int> <chr>  
 1  2013     2    14     1045     1204 UA          67 N16713        1998 Fixed …
 2  2013     2    14     1048     1412 B6         373 N659JB        2007 Fixed …
 3  2013     2    14     1048     1333 UA         764 N665UA        1998 Fixed …
 4  2013     2    14     1051     1203 US        2171 N747UW        2000 Fixed …
 5  2013     2    14     1057     1244 WN         366 N414WN        2001 Fixed …
 6  2013     2    14     1057     1353 UA        1550 N14228        1999 Fixed …
 7  2013     2    14     1058     1211 EV        4694 N14993        2000 Fixed …
 8  2013     2    14     1058     1337 DL        1647 N689DL        1998 Fixed …
 9  2013     2    14     1105     1210 B6          24 N197JB        2006 Fixed …
10  2013     2    14     1105     1302 MQ        4485 N711MQ        1976 Fixed …
# ℹ more rows
# ℹ 1 more variable: model <chr>

Finishing up

Close your connection!

dbDisconnect(con)
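
With duckdb in particular you can also ask it to shut the database instance down as you disconnect; depending on your version of duckdb this may already be the default.

dbDisconnect(con, shutdown = TRUE)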

Example: ARCOS Opioids data

This one is messier

I’m not going to do it on the slides. We’ll try to process a pretty big data file on a machine of modest proportions. This week’s example shows some of the output.