library(here) # manage file paths
library(socviz) # data and some useful functions
library(tidyverse) # your friend and mine
library(gapminder) # inescapable
library(DBI) # DBMS interface layer
library(duckdb) # In-process local database
Soc 690S: Week 11
Duke University
April 2025
Benny’s dilemma.
Benny has a brother now. He wanted one. And then he got one.
# A tibble: 344 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
7 Adelie Torgersen 38.9 17.8 181 3625
8 Adelie Torgersen 39.2 19.6 195 4675
9 Adelie Torgersen 34.1 18.1 193 3475
10 Adelie Torgersen 42 20.2 190 4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>
N = 344
# A tibble: 1,704 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779.
2 Afghanistan Asia 1957 30.3 9240934 821.
3 Afghanistan Asia 1962 32.0 10267083 853.
4 Afghanistan Asia 1967 34.0 11537966 836.
5 Afghanistan Asia 1972 36.1 13079460 740.
6 Afghanistan Asia 1977 38.4 14880372 786.
7 Afghanistan Asia 1982 39.9 12881816 978.
8 Afghanistan Asia 1987 40.8 13867957 852.
9 Afghanistan Asia 1992 41.7 16317921 649.
10 Afghanistan Asia 1997 41.8 22227415 635.
# ℹ 1,694 more rows
N = 1,704
# A tibble: 62,466 × 25
year id ballot age degree race sex siblings kids bigregion income16
<dbl> <dbl> <labe> <lab> <fct> <fct> <fct> <fct> <fct> <fct> <fct>
1 1972 1 NA 23 Bache… White Fema… 3 0 Midwest <NA>
2 1972 2 NA 70 Lt Hi… White Male 4 4+ Midwest <NA>
3 1972 3 NA 48 High … White Fema… 5 4+ Midwest <NA>
4 1972 4 NA 27 Bache… White Fema… 5 0 Midwest <NA>
5 1972 5 NA 61 High … White Fema… 2 2 Midwest <NA>
6 1972 6 NA 26 High … White Male 1 0 Midwest <NA>
7 1972 7 NA 28 High … White Male 6+ 2 Midwest <NA>
8 1972 8 NA 27 Bache… White Male 1 0 Midwest <NA>
9 1972 9 NA 21 High … Black Fema… 2 2 South <NA>
10 1972 10 NA 30 High … Black Fema… 6+ 4+ South <NA>
# ℹ 62,456 more rows
# ℹ 14 more variables: religion <fct>, marital <fct>, padeg <fct>, madeg <fct>,
# partyid <fct>, polviews <fct>, happy <fct>, partners_rc <fct>, grass <fct>,
# zodiac <fct>, pres12 <labelled>, wtssall <dbl>, vpsu <dbl>, vstrat <dbl>
N = 62,466
# A tibble: 336,776 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
2 2013 1 1 533 529 4 850 830
3 2013 1 1 542 540 2 923 850
4 2013 1 1 544 545 -1 1004 1022
5 2013 1 1 554 600 -6 812 837
6 2013 1 1 554 558 -4 740 728
7 2013 1 1 555 600 -5 913 854
8 2013 1 1 557 600 -3 709 723
9 2013 1 1 557 600 -3 838 846
10 2013 1 1 558 600 -2 753 745
# ℹ 336,766 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
N = 336,776
# A tibble: 580,395 × 17
country_code cname iso2 continent iso3 year week sex split split_sex
<chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <chr> <dbl> <dbl>
1 AUS Austral… AU Oceania AUS 2015 1 m 1 0
2 AUS Austral… AU Oceania AUS 2015 1 m 1 0
3 AUS Austral… AU Oceania AUS 2015 1 m 1 0
4 AUS Austral… AU Oceania AUS 2015 1 m 1 0
5 AUS Austral… AU Oceania AUS 2015 1 m 1 0
6 AUS Austral… AU Oceania AUS 2015 1 f 1 0
7 AUS Austral… AU Oceania AUS 2015 1 f 1 0
8 AUS Austral… AU Oceania AUS 2015 1 f 1 0
9 AUS Austral… AU Oceania AUS 2015 1 f 1 0
10 AUS Austral… AU Oceania AUS 2015 1 f 1 0
# ℹ 580,385 more rows
# ℹ 7 more variables: forecast <dbl>, approx_date <date>, age_group <chr>,
# death_count <dbl>, death_rate <dbl>, deaths_total <dbl>, rate_total <dbl>
N = 580,395
# A tibble: 2,254,515 × 7
country sub_region subregion_and_city geo_type date transportation_type
<chr> <chr> <chr> <chr> <date> <chr>
1 Albania Total Total country… 2020-01-13 driving
2 Albania Total Total country… 2020-01-14 driving
3 Albania Total Total country… 2020-01-15 driving
4 Albania Total Total country… 2020-01-16 driving
5 Albania Total Total country… 2020-01-17 driving
6 Albania Total Total country… 2020-01-18 driving
7 Albania Total Total country… 2020-01-19 driving
8 Albania Total Total country… 2020-01-20 driving
9 Albania Total Total country… 2020-01-21 driving
10 Albania Total Total country… 2020-01-22 driving
# ℹ 2,254,505 more rows
# ℹ 1 more variable: score <dbl>
N = 2,254,515
# A tibble: 7,808,311 × 10
rowid type sourceName sourceVersion device unit creationDate startDate
<int> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 1 HKQuantit… Withings 6050301 <<HKD… count 2024-08-31 … 2024-05-…
2 2 HKQuantit… Withings 6050301 <<HKD… count 2024-08-31 … 2024-05-…
3 3 HKQuantit… Withings 6050301 <<HKD… count 2024-08-31 … 2024-05-…
4 4 HKQuantit… Withings 6050301 <<HKD… count 2024-08-31 … 2024-05-…
5 5 HKQuantit… Withings 6050301 <<HKD… count 2024-08-31 … 2024-05-…
6 6 HKQuantit… Withings 6050301 <<HKD… count 2024-08-31 … 2024-05-…
7 7 HKQuantit… Withings 6050301 <<HKD… count 2024-08-31 … 2024-05-…
8 8 HKQuantit… Withings 6050301 <<HKD… count 2024-08-31 … 2024-05-…
9 9 HKQuantit… Withings 6050301 <<HKD… count 2024-08-31 … 2024-05-…
10 10 HKQuantit… Withings 6050301 <<HKD… count 2024-08-31 … 2024-05-…
# ℹ 7,808,301 more rows
# ℹ 2 more variables: endDate <chr>, value <chr>
N = 7,808,311. Entirely about me. On my phone.
#> # Source: SQL [1 x 1]
#> # Database: DuckDB v1.1.1 [root@Darwin 24.0.0:R 4.4.1//Users/kjhealy/Documents/data/arcos_opioids_test/db/arcos.duckdb]
#> n
#> <dbl>
#> 1 178598026
N = 178,598,026
This is the only one that’s even remotely “big” right now, i.e. the only one that needs a database if I want to interact with it properly. Truly big data won’t fit on your laptop.
Benny’s brother screams and screams.
Often we work with a single table of data. Ultimately, a lot of quantitative analysis that we want to do will require a single table (or rather, a single matrix).
Sometimes we want to add additional information to this table, very often something from a different level of analysis.
E.g. we have a table of individual-level measures about people, and we have a separate table of information about the county or state they live in.
In the world of spreadsheets, and also a lot of social science, we talk about this kind of problem as one of merging data.
In the world of databases, this is a table join.
## Vector of paths to the per-Congress CSV files
filenames <- dir(path = here("files", "data", "congress"),
                 pattern = "*.csv",
                 full.names = TRUE)

## Read them all at once, keeping the source path in a column,
## and pull the Congress number out of that path
df <- read_csv(filenames, id = "path",
               name_repair = janitor::make_clean_names) |>
  mutate(congress = stringr::str_extract(path, "_\\d{2,3}_congress"),
         congress = stringr::str_extract(congress, "\\d{2,3}"),
         congress = as.integer(congress)) |>
  relocate(congress)
df |>
select(congress, last, first, born, death, start, end)
# A tibble: 20,580 × 7
congress last first born death start end
<int> <chr> <chr> <chr> <chr> <chr> <chr>
1 79 Abernethy Thomas 05/16/1903 01/23/1953 01/03/1945 01/03/1953
2 79 Adams Sherman 01/08/1899 10/27/1986 01/03/1945 01/03/1947
3 79 Aiken George 08/20/1892 11/19/1984 01/03/1945 01/03/1979
4 79 Allen Asa 01/05/1891 01/05/1969 01/03/1945 01/03/1953
5 79 Allen Leo 10/05/1898 01/19/1973 01/03/1945 01/02/1949
6 79 Almond J. 06/15/1898 04/14/1986 02/04/1946 04/17/1948
7 79 Andersen Herman 01/27/1897 07/26/1978 01/03/1945 01/03/1963
8 79 Anderson Clinton 10/23/1895 11/11/1975 01/03/1941 06/30/1945
9 79 Anderson John 03/22/1904 02/09/1981 01/03/1945 01/03/1953
10 79 Andresen August 10/11/1890 01/14/1958 01/03/1945 01/14/1958
# ℹ 20,570 more rows
date_recodes <- c("born", "death", "start", "end")
df <- df |>
mutate(across(any_of(date_recodes), mdy))
df |>
select(congress, last, first, born, death, start, end)
# A tibble: 20,580 × 7
congress last first born death start end
<int> <chr> <chr> <date> <date> <date> <date>
1 79 Abernethy Thomas 1903-05-16 1953-01-23 1945-01-03 1953-01-03
2 79 Adams Sherman 1899-01-08 1986-10-27 1945-01-03 1947-01-03
3 79 Aiken George 1892-08-20 1984-11-19 1945-01-03 1979-01-03
4 79 Allen Asa 1891-01-05 1969-01-05 1945-01-03 1953-01-03
5 79 Allen Leo 1898-10-05 1973-01-19 1945-01-03 1949-01-02
6 79 Almond J. 1898-06-15 1986-04-14 1946-02-04 1948-04-17
7 79 Andersen Herman 1897-01-27 1978-07-26 1945-01-03 1963-01-03
8 79 Anderson Clinton 1895-10-23 1975-11-11 1941-01-03 1945-06-30
9 79 Anderson John 1904-03-22 1981-02-09 1945-01-03 1953-01-03
10 79 Andresen August 1890-10-11 1958-01-14 1945-01-03 1958-01-14
# ℹ 20,570 more rows
Next, a new table.
sessions <- tibble(congress = 79L:116L,
start_year = seq(1945, 2019, by = 2),
end_year = seq(1947, 2021, by = 2)) |>
mutate(start_year = ymd(paste(start_year, "01", "03", sep = "-")),
end_year = ymd(paste(end_year, "01", "03", sep = "-")))
sessions
# A tibble: 38 × 3
congress start_year end_year
<int> <date> <date>
1 79 1945-01-03 1947-01-03
2 80 1947-01-03 1949-01-03
3 81 1949-01-03 1951-01-03
4 82 1951-01-03 1953-01-03
5 83 1953-01-03 1955-01-03
6 84 1955-01-03 1957-01-03
7 85 1957-01-03 1959-01-03
8 86 1959-01-03 1961-01-03
9 87 1961-01-03 1963-01-03
10 88 1963-01-03 1965-01-03
# ℹ 28 more rows
The big table:
# A tibble: 20,580 × 3
congress last born
<int> <chr> <date>
1 79 Abernethy 1903-05-16
2 79 Adams 1899-01-08
3 79 Aiken 1892-08-20
4 79 Allen 1891-01-05
5 79 Allen 1898-10-05
6 79 Almond 1898-06-15
7 79 Andersen 1897-01-27
8 79 Anderson 1895-10-23
9 79 Anderson 1904-03-22
10 79 Andresen 1890-10-11
# ℹ 20,570 more rows
The smaller table
# A tibble: 38 × 3
congress start_year end_year
<int> <date> <date>
1 79 1945-01-03 1947-01-03
2 80 1947-01-03 1949-01-03
3 81 1949-01-03 1951-01-03
4 82 1951-01-03 1953-01-03
5 83 1953-01-03 1955-01-03
6 84 1955-01-03 1957-01-03
7 85 1957-01-03 1959-01-03
8 86 1959-01-03 1961-01-03
9 87 1961-01-03 1963-01-03
10 88 1963-01-03 1965-01-03
# ℹ 28 more rows
We will use left_join(), which is what you want most of the time when you are looking to merge a smaller table with additional information into a larger main one.
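The join itself isn’t shown as code in this text, but a call along these lines (a reconstruction, based on the output that follows) produces the table below. Because by isn’t specified, dplyr reports the key column it guessed:

df |>
  left_join(sessions) |>
  select(congress, last, first, start_year, end_year, start, end)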
Joining with `by = join_by(congress)`
# A tibble: 20,580 × 7
congress last first start_year end_year start end
<int> <chr> <chr> <date> <date> <date> <date>
1 79 Abernethy Thomas 1945-01-03 1947-01-03 1945-01-03 1953-01-03
2 79 Adams Sherman 1945-01-03 1947-01-03 1945-01-03 1947-01-03
3 79 Aiken George 1945-01-03 1947-01-03 1945-01-03 1979-01-03
4 79 Allen Asa 1945-01-03 1947-01-03 1945-01-03 1953-01-03
5 79 Allen Leo 1945-01-03 1947-01-03 1945-01-03 1949-01-02
6 79 Almond J. 1945-01-03 1947-01-03 1946-02-04 1948-04-17
7 79 Andersen Herman 1945-01-03 1947-01-03 1945-01-03 1963-01-03
8 79 Anderson Clinton 1945-01-03 1947-01-03 1941-01-03 1945-06-30
9 79 Anderson John 1945-01-03 1947-01-03 1945-01-03 1953-01-03
10 79 Andresen August 1945-01-03 1947-01-03 1945-01-03 1958-01-14
# ℹ 20,570 more rows
left_join(): All rows from x, and all columns from x and y. Rows in x with no match in y will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.
inner_join(): All rows from x where there are matching values in y, and all columns from x and y.
full_join(): All rows and all columns from both x and y. Where there are not matching values, returns NA for the one missing.
semi_join(): All rows from x where there are matching values in y, keeping just columns from x.
anti_join(): All rows from x where there are not matching values in y, keeping just columns from x.
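As a quick, hypothetical illustration with the tables above (not from the slides): the filtering joins return rows of df rather than adding columns to it.

## Members whose congress has a match in sessions (here, all of them)
semi_join(df, sessions, by = "congress")

## Members whose congress has no match in sessions (here, none)
anti_join(df, sessions, by = "congress")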
When we’re working with data in the social sciences the basic case is a single table that we’re going to do something with, like run a regression or make a plot.
# A tibble: 1,704 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779.
2 Afghanistan Asia 1957 30.3 9240934 821.
3 Afghanistan Asia 1962 32.0 10267083 853.
4 Afghanistan Asia 1967 34.0 11537966 836.
5 Afghanistan Asia 1972 36.1 13079460 740.
6 Afghanistan Asia 1977 38.4 14880372 786.
7 Afghanistan Asia 1982 39.9 12881816 978.
8 Afghanistan Asia 1987 40.8 13867957 852.
9 Afghanistan Asia 1992 41.7 16317921 649.
10 Afghanistan Asia 1997 41.8 22227415 635.
# ℹ 1,694 more rows
But the bigger a dataset gets, the more we have to think about whether we really want (or even can have) all of it in memory all the time.
In addition, much of what we want to do with a specific dataset will involve actually acting on some relatively small subset of it.
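For instance, a filter() call like this one (reconstructed from the output below) keeps only the European countries in 1977:

gapminder |>
  filter(continent == "Europe", year == 1977)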
# A tibble: 30 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Albania Europe 1977 68.9 2509048 3533.
2 Austria Europe 1977 72.2 7568430 19749.
3 Belgium Europe 1977 72.8 9821800 19118.
4 Bosnia and Herzegovina Europe 1977 69.9 4086000 3528.
5 Bulgaria Europe 1977 70.8 8797022 7612.
6 Croatia Europe 1977 70.6 4318673 11305.
7 Czech Republic Europe 1977 70.7 10161915 14800.
8 Denmark Europe 1977 74.7 5088419 20423.
9 Finland Europe 1977 72.5 4738902 15605.
10 France Europe 1977 73.8 53165019 18293.
# ℹ 20 more rows
gapminder |>
group_by(continent) |>
summarize(lifeExp = mean(lifeExp),
pop = mean(pop),
gdpPercap = mean(gdpPercap))
# A tibble: 5 × 4
continent lifeExp pop gdpPercap
<fct> <dbl> <dbl> <dbl>
1 Africa 48.9 9916003. 2194.
2 Americas 64.7 24504795. 7136.
3 Asia 60.1 77038722. 7902.
4 Europe 71.9 17169765. 14469.
5 Oceania 74.3 8874672. 18622.
Efficiently storing and querying really large quantities of data is the realm of the database and of structured query languages (SQL).
As with everything in information technology, there is a long and interesting story about various efforts to come up with a good theory of data storage and retrieval, and efficient algorithms for it. If you are interested, watch e.g. this lecture from a DBMS course, starting at about twelve minutes in.
Local or remote?
On disk or in memory?
The important thing from the database admin’s point of view is that the data is stored efficiently, that we have a means of querying it, and that those queries rely on some search-and-retrieval method that’s really fast.
There’s no free lunch. We want storage methods to be efficient and queries to be fast because the datasets are gonna be gigantic, and accessing them will take time.
A real database is usually not a single giant table. Instead it is more like a collection of related tables that are partially connected through keys, i.e. identifier columns shared between tables. Those keys are efficiently indexed by the database software. Further, tables are stored in a super-clever, tree-like way that makes searching much faster than just going down each row and looking for matches.
From a social science perspective, putting things in different tables might be thought of as a matter of logically organizing entities at different units of observation. Querying tables is a matter of assembling them ad hoc at various units of analysis.
# A tibble: 1,704 × 13
country continent year lifeExp pop gdpPercap area_pct pop_pct
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779. 29.8 59.4
2 Afghanistan Asia 1957 30.3 9240934 821. 29.8 59.4
3 Afghanistan Asia 1962 32.0 10267083 853. 29.8 59.4
4 Afghanistan Asia 1967 34.0 11537966 836. 29.8 59.4
5 Afghanistan Asia 1972 36.1 13079460 740. 29.8 59.4
6 Afghanistan Asia 1977 38.4 14880372 786. 29.8 59.4
7 Afghanistan Asia 1982 39.9 12881816 978. 29.8 59.4
8 Afghanistan Asia 1987 40.8 13867957 852. 29.8 59.4
9 Afghanistan Asia 1992 41.7 16317921 649. 29.8 59.4
10 Afghanistan Asia 1997 41.8 22227415 635. 29.8 59.4
# ℹ 1,694 more rows
# ℹ 5 more variables: gm_countries <dbl>, country_fr <chr>, iso2 <chr>,
# iso3 <chr>, number <dbl>
# A tibble: 1,704 × 5
gm_countries country_fr iso2 iso3 number
<dbl> <chr> <chr> <chr> <dbl>
1 33 Afghanistan (l') AF AFG 4
2 33 Afghanistan (l') AF AFG 4
3 33 Afghanistan (l') AF AFG 4
4 33 Afghanistan (l') AF AFG 4
5 33 Afghanistan (l') AF AFG 4
6 33 Afghanistan (l') AF AFG 4
7 33 Afghanistan (l') AF AFG 4
8 33 Afghanistan (l') AF AFG 4
9 33 Afghanistan (l') AF AFG 4
10 33 Afghanistan (l') AF AFG 4
# ℹ 1,694 more rows
Again, in social science terms, the redundancies are annoying in part because they apply to different levels or units of observation. From a database point of view they are also bad because they allow the possibility of a variety of errors or anomalies when updating the table, and they make things really inefficient for search and querying.
A hierarchical set of rules and criteria for ensuring the integrity of data stored across multiple tables and for reducing redundancy in data storage.
Tries to eliminate various sources of error — so-called Insertion, Update, and Deletion anomalies — particularly ones that will pollute, damage, or corrupt things beyond the specific change.
Redundancy and error are minimized by breaking the database up into a series of linked or related tables. Hence the term “relational database”
0NF: No duplicate rows!
1NF: Using row order to convey information is not allowed; mixing data types in the same column is not allowed; a table without a primary key is not allowed (though primary keys can be defined by more than one column); no “repeating groups”.
2NF: Each non-key attribute must depend on the entire primary key
3NF: Every non-key attribute should depend wholly and only on the key.
Think of these rules in connection with ideas about “tidy data” that we’ve already covered.
# A tibble: 1,704 × 13
country continent year lifeExp pop gdpPercap area_pct pop_pct
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779. 29.8 59.4
2 Afghanistan Asia 1957 30.3 9240934 821. 29.8 59.4
3 Afghanistan Asia 1962 32.0 10267083 853. 29.8 59.4
4 Afghanistan Asia 1967 34.0 11537966 836. 29.8 59.4
5 Afghanistan Asia 1972 36.1 13079460 740. 29.8 59.4
6 Afghanistan Asia 1977 38.4 14880372 786. 29.8 59.4
7 Afghanistan Asia 1982 39.9 12881816 978. 29.8 59.4
8 Afghanistan Asia 1987 40.8 13867957 852. 29.8 59.4
9 Afghanistan Asia 1992 41.7 16317921 649. 29.8 59.4
10 Afghanistan Asia 1997 41.8 22227415 635. 29.8 59.4
# ℹ 1,694 more rows
# ℹ 5 more variables: gm_countries <dbl>, country_fr <chr>, iso2 <chr>,
# iso3 <chr>, number <dbl>
# A tibble: 1,704 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779.
2 Afghanistan Asia 1957 30.3 9240934 821.
3 Afghanistan Asia 1962 32.0 10267083 853.
4 Afghanistan Asia 1967 34.0 11537966 836.
5 Afghanistan Asia 1972 36.1 13079460 740.
6 Afghanistan Asia 1977 38.4 14880372 786.
7 Afghanistan Asia 1982 39.9 12881816 978.
8 Afghanistan Asia 1987 40.8 13867957 852.
9 Afghanistan Asia 1992 41.7 16317921 649.
10 Afghanistan Asia 1997 41.8 22227415 635.
# ℹ 1,694 more rows
continent_tbl <- read_tsv(here("files", "data", "continent_tab.tsv"))
country_tbl <- read_tsv(here("files", "data", "country_tab.tsv"))
year_tbl <- read_tsv(here("files", "data", "year_tab.tsv"))
continent_tbl
# A tibble: 5 × 5
continent_id continent area_pct pop_pct gm_countries
<dbl> <chr> <dbl> <dbl> <dbl>
1 1 Africa 20.3 17.6 52
2 2 Americas 28.1 13 25
3 3 Asia 29.8 59.4 33
4 4 Europe 6.7 9.4 30
5 5 Oceania 5.7 0.6 2
# A tibble: 1,704 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779.
2 Afghanistan Asia 1957 30.3 9240934 821.
3 Afghanistan Asia 1962 32.0 10267083 853.
4 Afghanistan Asia 1967 34.0 11537966 836.
5 Afghanistan Asia 1972 36.1 13079460 740.
6 Afghanistan Asia 1977 38.4 14880372 786.
7 Afghanistan Asia 1982 39.9 12881816 978.
8 Afghanistan Asia 1987 40.8 13867957 852.
9 Afghanistan Asia 1992 41.7 16317921 649.
10 Afghanistan Asia 1997 41.8 22227415 635.
# ℹ 1,694 more rows
# A tibble: 5 × 5
continent_id continent area_pct pop_pct gm_countries
<dbl> <chr> <dbl> <dbl> <dbl>
1 1 Africa 20.3 17.6 52
2 2 Americas 28.1 13 25
3 3 Asia 29.8 59.4 33
4 4 Europe 6.7 9.4 30
5 5 Oceania 5.7 0.6 2
# A tibble: 249 × 8
country_id continent_id country iso_country country_fr iso2 iso3 number
<dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl>
1 1 3 Afghanistan Afghanistan Afghanist… AF AFG 4
2 2 4 Albania Albania Albanie (… AL ALB 8
3 3 1 Algeria Algeria Algérie (… DZ DZA 12
4 4 NA <NA> American S… Samoa amé… AS ASM 16
5 5 NA <NA> Andorra Andorre (… AD AND 20
6 6 1 Angola Angola Angola (l… AO AGO 24
7 7 NA Anguilla Anguilla Anguilla AI AIA 660
8 8 NA Antarctica Antarctica Antarctiq… AQ ATA 10
9 9 NA Antigua an… Antigua an… Antigua-e… AG ATG 28
10 10 2 Argentina Argentina Argentine… AR ARG 32
# ℹ 239 more rows
# A tibble: 249 × 8
country_id continent_id country iso_country country_fr iso2 iso3 number
<dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl>
1 1 3 Afghanistan Afghanistan Afghanist… AF AFG 4
2 2 4 Albania Albania Albanie (… AL ALB 8
3 3 1 Algeria Algeria Algérie (… DZ DZA 12
4 4 NA <NA> American S… Samoa amé… AS ASM 16
5 5 NA <NA> Andorra Andorre (… AD AND 20
6 6 1 Angola Angola Angola (l… AO AGO 24
7 7 NA Anguilla Anguilla Anguilla AI AIA 660
8 8 NA Antarctica Antarctica Antarctiq… AQ ATA 10
9 9 NA Antigua an… Antigua an… Antigua-e… AG ATG 28
10 10 2 Argentina Argentina Argentine… AR ARG 32
# ℹ 239 more rows
# A tibble: 1,704 × 5
year country_id lifeExp pop gdpPercap
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1952 1 28.8 8425333 779.
2 1957 1 30.3 9240934 821.
3 1962 1 32.0 10267083 853.
4 1967 1 34.0 11537966 836.
5 1972 1 36.1 13079460 740.
6 1977 1 38.4 14880372 786.
7 1982 1 39.9 12881816 978.
8 1987 1 40.8 13867957 852.
9 1992 1 41.7 16317921 649.
10 1997 1 41.8 22227415 635.
# ℹ 1,694 more rows
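With the data split up like this, the original one-table gapminder can be reassembled on demand by joining on the key columns. A sketch, using the tables read in above:

## Rebuild the familiar country-year table from the normalized pieces
year_tbl |>
  left_join(country_tbl, by = "country_id") |>
  left_join(continent_tbl, by = "continent_id") |>
  select(country, continent, year, lifeExp, pop, gdpPercap)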
Ultimately, we query databases with SQL. There are several varieties, because there are a variety of database systems and each has its own wrinkles and quirks.
We try to abstract away from some of those quirks by using a DBI (DataBase Interface) layer, which is a generic set of commands for talking to some database. It’s analogous to an API.
We also need to use a package for the DBMS we’re talking to. It translates DBI instructions into the specific dialect the DBMS speaks.
Some databases are small, and some are far away.
Client-server databases are like websites, serving up responses to queries. The database lives on a machine somewhere in the building, or on campus or whatever.
Cloud DBMSs are like this, too, except the database lives on a machine in someone else’s building.
In-process DBMSs live and run on your laptop. We’ll use one of these, duckdb, for the examples here.
We need to open a connection to a database before talking to it. Conventionally this is called con.
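For duckdb, a minimal sketch of that step looks like this (an in-memory, transient database by default):

## Open a connection to an in-memory DuckDB database
con <- DBI::dbConnect(duckdb::duckdb())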
Once connected, we can ask it questions. Either we use functions or packages designed to translate our R / dplyr syntax into SQL, or we use functions to pass SQL queries on directly.
We try to minimize the amount of time we are actually making the database do a lot of work.
The key thing is that when working with databases our queries are lazy: they don’t actually do anything on the whole database unless it’s strictly necessary or they’re explicitly told to.
flights
Where everything is lovely and clean. Thanks to Grant McDermott for the following example.
Here we open a connection to an in-memory duckdb database. It’s empty. We’re going to populate it with data from nycflights13.
Remember, keys and indexes are what make databases fast.
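The slide’s setup chunk isn’t reproduced in this text; with con opened as above, the standard dplyr/nycflights13 version looks roughly like this (the particular index choices are illustrative):

## Copy the flights data into the database, building some indexes as we go
copy_to(
  dest = con,
  df = nycflights13::flights,
  name = "flights",
  temporary = FALSE,
  indexes = list(c("year", "month", "day"), "carrier", "tailnum", "dest")
)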
This says “go to con and get the ‘flights’ table in it, and pretend it’s a tibble called flights_db”.
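In code that’s a tbl() call (the object name flights_db follows the slide’s usage):

flights_db <- tbl(con, "flights")
flights_db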
# Source: table<flights> [?? x 19]
# Database: DuckDB v1.1.3 [root@Darwin 24.4.0:R 4.4.3/:memory:]
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
2 2013 1 1 533 529 4 850 830
3 2013 1 1 542 540 2 923 850
4 2013 1 1 544 545 -1 1004 1022
5 2013 1 1 554 600 -6 812 837
6 2013 1 1 554 558 -4 740 728
7 2013 1 1 555 600 -5 913 854
8 2013 1 1 557 600 -3 709 723
9 2013 1 1 557 600 -3 838 846
10 2013 1 1 558 600 -2 753 745
# ℹ more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
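The query producing the next bit of output isn’t captured in this text; it’s an ordinary dplyr select() on the lazy table, presumably something like:

flights_db |> select(year:day, dep_delay, arr_delay)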
# Source: SQL [?? x 5]
# Database: DuckDB v1.1.3 [root@Darwin 24.4.0:R 4.4.3/:memory:]
year month day dep_delay arr_delay
<int> <int> <int> <dbl> <dbl>
1 2013 1 1 2 11
2 2013 1 1 4 20
3 2013 1 1 2 33
4 2013 1 1 -1 -18
5 2013 1 1 -6 -25
6 2013 1 1 -4 12
7 2013 1 1 -5 19
8 2013 1 1 -3 -14
9 2013 1 1 -3 -8
10 2013 1 1 -2 8
# ℹ more rows
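Likewise, the next output comes from a filter() on the lazy table; a call like this would produce it:

flights_db |> filter(dep_delay > 240)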
# Source: SQL [?? x 19]
# Database: DuckDB v1.1.3 [root@Darwin 24.4.0:R 4.4.3/:memory:]
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 848 1835 853 1001 1950
2 2013 1 1 1815 1325 290 2120 1542
3 2013 1 1 1842 1422 260 1958 1535
4 2013 1 1 2115 1700 255 2330 1920
5 2013 1 1 2205 1720 285 46 2040
6 2013 1 1 2343 1724 379 314 1938
7 2013 1 2 1332 904 268 1616 1128
8 2013 1 2 1412 838 334 1710 1147
9 2013 1 2 1607 1030 337 2003 1355
10 2013 1 2 2131 1512 379 2340 1741
# ℹ more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
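And the grouped summary below comes from something along these lines; the mean is computed by the database itself (dbplyr translates mean() into SQL’s AVG()):

flights_db |>
  group_by(dest) |>
  summarize(mean_dep_delay = mean(dep_delay, na.rm = TRUE))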
# Source: SQL [?? x 2]
# Database: DuckDB v1.1.3 [root@Darwin 24.4.0:R 4.4.3/:memory:]
dest mean_dep_delay
<chr> <dbl>
1 CLT 9.22
2 MDW 18.6
3 HOU 14.3
4 SDF 16.4
5 LAS 9.42
6 PHX 10.4
7 IAH 10.8
8 SYR 14.4
9 CAK 20.8
10 BDL 17.7
# ℹ more rows
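The next step defines a larger lazy query. This is a reconstruction based on the output further below; the object name tailnum_delay_db is an assumption:

tailnum_delay_db <- flights_db |>
  group_by(tailnum) |>
  summarize(
    mean_dep_delay = mean(dep_delay, na.rm = TRUE),
    mean_arr_delay = mean(arr_delay, na.rm = TRUE),
    n = n()
  ) |>
  arrange(desc(mean_arr_delay)) |>
  filter(n > 100)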
This doesn’t touch the database.
Even when we ask to look at it, it just does the absolute minimum required.
# Source: SQL [?? x 4]
# Database: DuckDB v1.1.3 [root@Darwin 24.4.0:R 4.4.3/:memory:]
# Ordered by: desc(mean_arr_delay)
tailnum mean_dep_delay mean_arr_delay n
<chr> <dbl> <dbl> <dbl>
1 N11119 32.6 30.3 148
2 N16919 32.4 29.9 251
3 N14998 29.4 27.9 230
4 N15910 29.3 27.6 280
5 N13123 29.6 26.0 121
6 N11192 27.5 25.9 154
7 N14950 26.2 25.3 219
8 N21130 27.0 25.0 126
9 N24128 24.8 24.9 129
10 N22971 26.5 24.7 230
# ℹ more rows
collect()
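Calling collect() forces the query to run in full and pulls the result into R as an ordinary tibble (the object name is assumed):

tailnum_delay <- tailnum_delay_db |> collect()
tailnum_delay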
# A tibble: 1,201 × 4
tailnum mean_dep_delay mean_arr_delay n
<chr> <dbl> <dbl> <dbl>
1 N11119 32.6 30.3 148
2 N16919 32.4 29.9 251
3 N14998 29.4 27.9 230
4 N15910 29.3 27.6 280
5 N13123 29.6 26.0 121
6 N11192 27.5 25.9 154
7 N14950 26.2 25.3 219
8 N21130 27.0 25.0 126
9 N24128 24.8 24.9 129
10 N22971 26.5 24.7 230
# ℹ 1,191 more rows
Now it exists for realsies.
Database systems will have more than one table. We query and join them. The idea is that getting the DBMS to do this will be way faster and more memory-efficient than trying to get dplyr to do it.
## Copy over the "planes" dataset to the same "con" DuckDB connection.
copy_to(
dest = con,
df = nycflights13::planes,
name = "planes",
temporary = FALSE,
indexes = "tailnum"
)
## List tables in our "con" database connection (i.e. now "flights" and "planes")
dbListTables(con)
[1] "flights" "planes"
See what we did there? It’s like con, the database connection, has a list of tables in it.
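To join against the new table lazily we also need a reference to it, presumably something like:

planes_db <- tbl(con, "planes")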
# Still not done for realsies!
left_join(
flights_db,
planes_db %>% rename(year_built = year),
by = "tailnum" ## Important: Be specific about the joining column
) |>
select(year, month, day, dep_time, arr_time, carrier, flight, tailnum,
year_built, type, model)
# Source: SQL [?? x 11]
# Database: DuckDB v1.1.3 [root@Darwin 24.4.0:R 4.4.3/:memory:]
year month day dep_time arr_time carrier flight tailnum year_built type
<int> <int> <int> <int> <int> <chr> <int> <chr> <int> <chr>
1 2013 2 14 1045 1204 UA 67 N16713 1998 Fixed …
2 2013 2 14 1048 1412 B6 373 N659JB 2007 Fixed …
3 2013 2 14 1048 1333 UA 764 N665UA 1998 Fixed …
4 2013 2 14 1051 1203 US 2171 N747UW 2000 Fixed …
5 2013 2 14 1057 1244 WN 366 N414WN 2001 Fixed …
6 2013 2 14 1057 1353 UA 1550 N14228 1999 Fixed …
7 2013 2 14 1058 1211 EV 4694 N14993 2000 Fixed …
8 2013 2 14 1058 1337 DL 1647 N689DL 1998 Fixed …
9 2013 2 14 1105 1210 B6 24 N197JB 2006 Fixed …
10 2013 2 14 1105 1302 MQ 4485 N711MQ 1976 Fixed …
# ℹ more rows
# ℹ 1 more variable: model <chr>
Close your connection!
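For a duckdb connection that means something like this (shutdown = TRUE also shuts down the embedded database):

DBI::dbDisconnect(con, shutdown = TRUE)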
I’m not going to do it on the slides. We’ll try to process a pretty big data file on a machine of modest proportions. This week’s example shows some of the output.