Parallel Processing

Kieran Healy

Duke University

March 23, 2024

Load the packages, as always

library(here)      # manage file paths
library(socviz)    # data and some useful functions
library(tidyverse) # your friend and mine

library(furrr) # Also loads `future`

Loading required package: future

Split, Apply, Combine

A lot of analysis has this pattern

We start with a dataset

We split it into pieces, usually according to some feature or categorical variable, or by file or something.

We do something—the same thing—to each of those pieces. That is we apply a function or procedure to the pieces. That procedure returns some result for each piece. The result will be of the same form: a number, a vector of counts, summary statistics, a model, a list, whatever.

Finally we combine those results into a final piece of output, usually a tibble or somesuch.

For example

dplyr is all about this.

gss_sm |> 
  count(bigregion)

# A tibble: 4 × 2
  bigregion     n
  <fct>     <int>
1 Northeast   488
2 Midwest     695
3 South      1052
4 West        632

We split into groups, apply the sum() function within the groups, and combine the results into a new tibble showing the resulting sum per group. The various dplyr functions are oriented to doing this in a way that gives you a consistent set of outputs.

For example: split

We can split, apply, combine in various ways.

Base R has the split() function:

out <- mtcars |> 
  split(mtcars$cyl)
summary(out) # mtcars split into a list of data frames by the `cyl` variable

  Length Class      Mode
4 11     data.frame list
6 11     data.frame list
8 11     data.frame list

For example: split

Tidyverse has group_split():

out <- mtcars |> 
  group_split(cyl)
out

<list_of<
  tbl_df<
    mpg : double
    cyl : double
    disp: double
    hp  : double
    drat: double
    wt  : double
    qsec: double
    vs  : double
    am  : double
    gear: double
    carb: double
  >
>[3]>
[[1]]
# A tibble: 11 × 11
     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1  22.8     4 108      93  3.85  2.32  18.6     1     1     4     1
 2  24.4     4 147.     62  3.69  3.19  20       1     0     4     2
 3  22.8     4 141.     95  3.92  3.15  22.9     1     0     4     2
 4  32.4     4  78.7    66  4.08  2.2   19.5     1     1     4     1
 5  30.4     4  75.7    52  4.93  1.62  18.5     1     1     4     2
 6  33.9     4  71.1    65  4.22  1.84  19.9     1     1     4     1
 7  21.5     4 120.     97  3.7   2.46  20.0     1     0     3     1
 8  27.3     4  79      66  4.08  1.94  18.9     1     1     4     1
 9  26       4 120.     91  4.43  2.14  16.7     0     1     5     2
10  30.4     4  95.1   113  3.77  1.51  16.9     1     1     5     2
11  21.4     4 121     109  4.11  2.78  18.6     1     1     4     2

[[2]]
# A tibble: 7 × 11
    mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1  21       6  160    110  3.9   2.62  16.5     0     1     4     4
2  21       6  160    110  3.9   2.88  17.0     0     1     4     4
3  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
4  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
5  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
6  17.8     6  168.   123  3.92  3.44  18.9     1     0     4     4
7  19.7     6  145    175  3.62  2.77  15.5     0     1     5     6

[[3]]
# A tibble: 14 × 11
     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
 2  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
 3  16.4     8  276.   180  3.07  4.07  17.4     0     0     3     3
 4  17.3     8  276.   180  3.07  3.73  17.6     0     0     3     3
 5  15.2     8  276.   180  3.07  3.78  18       0     0     3     3
 6  10.4     8  472    205  2.93  5.25  18.0     0     0     3     4
 7  10.4     8  460    215  3     5.42  17.8     0     0     3     4
 8  14.7     8  440    230  3.23  5.34  17.4     0     0     3     4
 9  15.5     8  318    150  2.76  3.52  16.9     0     0     3     2
10  15.2     8  304    150  3.15  3.44  17.3     0     0     3     2
11  13.3     8  350    245  3.73  3.84  15.4     0     0     3     4
12  19.2     8  400    175  3.08  3.84  17.0     0     0     3     2
13  15.8     8  351    264  4.22  3.17  14.5     0     1     5     4
14  15       8  301    335  3.54  3.57  14.6     0     1     5     8

For example: apply

The application step is “I want to fit a linear model to each piece”

out <- mtcars |> 
  group_split(cyl) |> 
  map(\(df) lm(mpg ~ wt + hp + gear, data = df))

For example: apply

The application step is “I want to fit a linear model to each piece” and get a summary

mtcars |> 
  group_split(cyl) |> 
  map(\(df) lm(mpg ~ wt + hp + gear, data = df))  |> 
  map(summary) |> 
  map_dbl("r.squared")

[1] 0.7301860 0.6597413 0.4995237

For example: combine

In this case the “combine” step is implicitly at the end: we get a vector of R squareds back, and it’s as long as the number of groups.

mtcars |> 
  group_split(cyl) |> 
  map(\(df) lm(mpg ~ wt + hp + gear, data = df))  |> 
  map(summary) |> 
  map_dbl("r.squared")

[1] 0.7301860 0.6597413 0.4995237

For example: apply

This is also what we’re doing more elegantly (staying within a tibble structure) if we nest() and use broom to get a summary out.

mtcars |>
  group_by(cyl) |> 
  nest() |> 
  mutate(model = map(data, \(df) lm(mpg ~ wt + hp + gear, data = df)), 
         perf = map(model, broom::glance)) |> 
  unnest(perf)

# A tibble: 3 × 15
# Groups:   cyl [3]
    cyl data     model  r.squared adj.r.squared sigma statistic p.value    df
  <dbl> <list>   <list>     <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>
1     6 <tibble> <lm>       0.660         0.319  1.20      1.94  0.300      3
2     4 <tibble> <lm>       0.730         0.615  2.80      6.31  0.0211     3
3     8 <tibble> <lm>       0.500         0.349  2.06      3.33  0.0648     3
# ℹ 6 more variables: logLik <dbl>, AIC <dbl>, BIC <dbl>, deviance <dbl>,
#   df.residual <int>, nobs <int>

How this happens

In each of these cases, the data is processed sequentially or serially. R splits the data according to your instructions, applies the function or procedure to each one in turn, and combines the outputs in order out the other side. Your computer’s processor is handed each piece in turn.

How this happens

For small tasks that’s fine. But for bigger tasks it gets inefficient quickly.

## From Henrik Bengtsson's documentation for future/furr
slow_sum <- function(x) {
  sum <- 0
  for (value in x) {
    Sys.sleep(1.0)  ## one-second slowdown per value
    sum <- sum + value
  }
  sum
}

# This takes > ten seconds to run.
tictoc::tic()
slow_sum(1:10)

[1] 55

tictoc::toc()

10.047 sec elapsed

If this is the sort of task we have to apply to a bunch of things, it’s going to take ages.

That’s Embarrassing

A feature of many split-apply-combine activities is that it does not matter what order the “apply” part happens to the groups. All that matters is that we can combine the results at the end, no matter what order they come back in. E.g. in the mtcars example the model being fit to the 4-cylinder cars group doesn’t depend in any way on the results of the model being fit to the 8-cylinder group.

This sort of situation is embarrassingly parallel.

That’s Embarrassing

When a task is embarrassingly parallel, and the number of pieces or groups is large enough or complex enough, then if we can do them at the same time then we should. There is some overhead—we have to keep track of where each piece was sent and when the results come back in—but if that’s low enough in comparison to doing things serially, then we should parallelize the task.

Multicore Computing

Parallel computing used to mean “I have a cluster of computers at my disposal”. But modern CPUs are constructed in a semi-modular fashion. They have some number of “cores”, each one of which is like a small processor in its own right. In effect you have a computer cluster already.

Some Terms

Socket: The physical connection on your computer that houses the processor. These days, mostly there’s just one.

Core: The part of the processor that actually performs the computation. Most modern processors have multiple cores. Each one can do wholly independent work.

Process: A single instance of a running task or program (R, Slack, Chrome, etc). A core can only run one process at a time. But, cores are fast. And so, they can run many threads

Thread: A piece of a process that can share memory and resources with other threads. If you have enough power you can do something Intel called hyperthreading, which is a way of dividing up a physical core into (usually) two logical cores that work on the same clock cycle and share some resources on the physical core.

Cluster: A collection of things that are capable of hosting cores. Might be a single socket (on your laptop) or an old-fashioned room full of many physical computers that can be made to act as if they were a single machine.

Multicore Computing

Most of the time, even when we are using them, our computers sit around doing PRETTY MUCH NOTHING.

## How many cores do we have?
parallelly::availableCores()

system 
    16

Remember, processor clock cycles are really fast. They’re measured in billions of cycles per second.

We need to put those cores to work!

Future and furrr

These packages make parallel computing way more straightforward than it used to be. In particular, for Tidyverse-centric workflows, furrr provides a set of future_ functions that are drop-in replacements for map and friends. So map() becomes future_map() and so on.

We set things up like this:

# library(furrr) # We did this already
plan(multisession) # Default will use the available resources
## Note difference between multisession and multicore 
## [tl;dr: the latter is faster, but you can't use it on Windows]

Toy Example

# Another slow function (from 
# Grant McDermott this time)
slow_square <- function(x = 1) {
    x_sq <- x^2 
    out <-  tibble(value = x, value_sq = x_sq)
    Sys.sleep(2) # Zzzz
    out
}

tictoc::tic("Serially")
## This is of course way slower than just writing 
## slow_square(1:15) but nvm that for now
serial_out <- map(1:15, slow_square) |> 
  list_rbind()
tictoc::toc()

Serially: 30.088 sec elapsed

Toy Example

tictoc::tic("Parallelized")
parallel_out <-  future_map_dfr(1:15, slow_square)
tictoc::toc()

Parallelized: 4.86 sec elapsed

identical(serial_out, parallel_out)

[1] TRUE

NOAA Data

Data obtained with:

mkdir raw
cd raw
wget --no-parent -r -l inf --wait 5 --random-wait 'https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/'

This tries to be polite with the NOAA: it enforces a wait time and in addition randomizes it to make it variably longer. Note also no boundary on depth of folder recursion. There are a lot of files (>15,000). Doing it this way will take several days in real time (though much much less in actual transfer time of course).

NOAA Temperature Data

Raw data directories:

basename(fs::dir_ls(here::here("avhrr")))

  [1] "198109"             "198110"             "198111"            
  [4] "198112"             "198201"             "198202"            
  [7] "198203"             "198204"             "198205"            
 [10] "198206"             "198207"             "198208"            
 [13] "198209"             "198210"             "198211"            
 [16] "198212"             "198301"             "198302"            
 [19] "198303"             "198304"             "198305"            
 [22] "198306"             "198307"             "198308"            
 [25] "198309"             "198310"             "198311"            
 [28] "198312"             "198401"             "198402"            
 [31] "198403"             "198404"             "198405"            
 [34] "198406"             "198407"             "198408"            
 [37] "198409"             "198410"             "198411"            
 [40] "198412"             "198501"             "198502"            
 [43] "198503"             "198504"             "198505"            
 [46] "198506"             "198507"             "198508"            
 [49] "198509"             "198510"             "198511"            
 [52] "198512"             "198601"             "198602"            
 [55] "198603"             "198604"             "198605"            
 [58] "198606"             "198607"             "198608"            
 [61] "198609"             "198610"             "198611"            
 [64] "198612"             "198701"             "198702"            
 [67] "198703"             "198704"             "198705"            
 [70] "198706"             "198707"             "198708"            
 [73] "198709"             "198710"             "198711"            
 [76] "198712"             "198801"             "198802"            
 [79] "198803"             "198804"             "198805"            
 [82] "198806"             "198807"             "198808"            
 [85] "198809"             "198810"             "198811"            
 [88] "198812"             "198901"             "198902"            
 [91] "198903"             "198904"             "198905"            
 [94] "198906"             "198907"             "198908"            
 [97] "198909"             "198910"             "198911"            
[100] "198912"             "199001"             "199002"            
[103] "199003"             "199004"             "199005"            
[106] "199006"             "199007"             "199008"            
[109] "199009"             "199010"             "199011"            
[112] "199012"             "199101"             "199102"            
[115] "199103"             "199104"             "199105"            
[118] "199106"             "199107"             "199108"            
[121] "199109"             "199110"             "199111"            
[124] "199112"             "199201"             "199202"            
[127] "199203"             "199204"             "199205"            
[130] "199206"             "199207"             "199208"            
[133] "199209"             "199210"             "199211"            
[136] "199212"             "199301"             "199302"            
[139] "199303"             "199304"             "199305"            
[142] "199306"             "199307"             "199308"            
[145] "199309"             "199310"             "199311"            
[148] "199312"             "199401"             "199402"            
[151] "199403"             "199404"             "199405"            
[154] "199406"             "199407"             "199408"            
[157] "199409"             "199410"             "199411"            
[160] "199412"             "199501"             "199502"            
[163] "199503"             "199504"             "199505"            
[166] "199506"             "199507"             "199508"            
[169] "199509"             "199510"             "199511"            
[172] "199512"             "199601"             "199602"            
[175] "199603"             "199604"             "199605"            
[178] "199606"             "199607"             "199608"            
[181] "199609"             "199610"             "199611"            
[184] "199612"             "199701"             "199702"            
[187] "199703"             "199704"             "199705"            
[190] "199706"             "199707"             "199708"            
[193] "199709"             "199710"             "199711"            
[196] "199712"             "199801"             "199802"            
[199] "199803"             "199804"             "199805"            
[202] "199806"             "199807"             "199808"            
[205] "199809"             "199810"             "199811"            
[208] "199812"             "199901"             "199902"            
[211] "199903"             "199904"             "199905"            
[214] "199906"             "199907"             "199908"            
[217] "199909"             "199910"             "199911"            
[220] "199912"             "200001"             "200002"            
[223] "200003"             "200004"             "200005"            
[226] "200006"             "200007"             "200008"            
[229] "200009"             "200010"             "200011"            
[232] "200012"             "200101"             "200102"            
[235] "200103"             "200104"             "200105"            
[238] "200106"             "200107"             "200108"            
[241] "200109"             "200110"             "200111"            
[244] "200112"             "200201"             "200202"            
[247] "200203"             "200204"             "200205"            
[250] "200206"             "200207"             "200208"            
[253] "200209"             "200210"             "200211"            
[256] "200212"             "200301"             "200302"            
[259] "200303"             "200304"             "200305"            
[262] "200306"             "200307"             "200308"            
[265] "200309"             "200310"             "200311"            
[268] "200312"             "200401"             "200402"            
[271] "200403"             "200404"             "200405"            
[274] "200406"             "200407"             "200408"            
[277] "200409"             "200410"             "200411"            
[280] "200412"             "200501"             "200502"            
[283] "200503"             "200504"             "200505"            
[286] "200506"             "200507"             "200508"            
[289] "200509"             "200510"             "200511"            
[292] "200512"             "200601"             "200602"            
[295] "200603"             "200604"             "200605"            
[298] "200606"             "200607"             "200608"            
[301] "200609"             "200610"             "200611"            
[304] "200612"             "200701"             "200702"            
[307] "200703"             "200704"             "200705"            
[310] "200706"             "200707"             "200708"            
[313] "200709"             "200710"             "200711"            
[316] "200712"             "200801"             "200802"            
[319] "200803"             "200804"             "200805"            
[322] "200806"             "200807"             "200808"            
[325] "200809"             "200810"             "200811"            
[328] "200812"             "200901"             "200902"            
[331] "200903"             "200904"             "200905"            
[334] "200906"             "200907"             "200908"            
[337] "200909"             "200910"             "200911"            
[340] "200912"             "201001"             "201002"            
[343] "201003"             "201004"             "201005"            
[346] "201006"             "201007"             "201008"            
[349] "201009"             "201010"             "201011"            
[352] "201012"             "201101"             "201102"            
[355] "201103"             "201104"             "201105"            
[358] "201106"             "201107"             "201108"            
[361] "201109"             "201110"             "201111"            
[364] "201112"             "201201"             "201202"            
[367] "201203"             "201204"             "201205"            
[370] "201206"             "201207"             "201208"            
[373] "201209"             "201210"             "201211"            
[376] "201212"             "201301"             "201302"            
[379] "201303"             "201304"             "201305"            
[382] "201306"             "201307"             "201308"            
[385] "201309"             "201310"             "201311"            
[388] "201312"             "201401"             "201402"            
[391] "201403"             "201404"             "201405"            
[394] "201406"             "201407"             "201408"            
[397] "201409"             "201410"             "201411"            
[400] "201412"             "201501"             "201502"            
[403] "201503"             "201504"             "201505"            
[406] "201506"             "201507"             "201508"            
[409] "201509"             "201510"             "201511"            
[412] "201512"             "201601"             "201602"            
[415] "201603"             "201604"             "201605"            
[418] "201606"             "201607"             "201608"            
[421] "201609"             "201610"             "201611"            
[424] "201612"             "201701"             "201702"            
[427] "201703"             "201704"             "201705"            
[430] "201706"             "201707"             "201708"            
[433] "201709"             "201710"             "201711"            
[436] "201712"             "201801"             "201802"            
[439] "201803"             "201804"             "201805"            
[442] "201806"             "201807"             "201808"            
[445] "201809"             "201810"             "201811"            
[448] "201812"             "201901"             "201902"            
[451] "201903"             "201904"             "201905"            
[454] "201906"             "201907"             "201908"            
[457] "201909"             "201910"             "201911"            
[460] "201912"             "202001"             "202002"            
[463] "202003"             "202004"             "202005"            
[466] "202006"             "202007"             "202008"            
[469] "202009"             "202010"             "202011"            
[472] "202012"             "202101"             "202102"            
[475] "202103"             "202104"             "202105"            
[478] "202106"             "202107"             "202108"            
[481] "202109"             "202110"             "202111"            
[484] "202112"             "202201"             "202202"            
[487] "202203"             "202204"             "202205"            
[490] "202206"             "202207"             "202208"            
[493] "202209"             "202210"             "202211"            
[496] "202212"             "202301"             "202302"            
[499] "202303"             "202304"             "202305"            
[502] "202306"             "202307"             "202308"            
[505] "202309"             "202310"             "202311"            
[508] "202312"             "202401"             "202402"            
[511] "202403"             "index.html"         "index.html?C=D;O=A"
[514] "index.html?C=D;O=D" "index.html?C=M;O=A" "index.html?C=M;O=D"
[517] "index.html?C=N;O=A" "index.html?C=N;O=D" "index.html?C=S;O=A"
[520] "index.html?C=S;O=D"

NOAA Temperature Data

Raw data files, in netCDF (Version 4) format

basename(fs::dir_ls(here::here("avhrr", "202402")))

 [1] "oisst-avhrr-v02r01.20240201.nc" "oisst-avhrr-v02r01.20240202.nc"
 [3] "oisst-avhrr-v02r01.20240203.nc" "oisst-avhrr-v02r01.20240204.nc"
 [5] "oisst-avhrr-v02r01.20240205.nc" "oisst-avhrr-v02r01.20240206.nc"
 [7] "oisst-avhrr-v02r01.20240207.nc" "oisst-avhrr-v02r01.20240208.nc"
 [9] "oisst-avhrr-v02r01.20240209.nc" "oisst-avhrr-v02r01.20240210.nc"
[11] "oisst-avhrr-v02r01.20240211.nc" "oisst-avhrr-v02r01.20240212.nc"
[13] "oisst-avhrr-v02r01.20240213.nc" "oisst-avhrr-v02r01.20240214.nc"
[15] "oisst-avhrr-v02r01.20240215.nc" "oisst-avhrr-v02r01.20240216.nc"
[17] "oisst-avhrr-v02r01.20240217.nc" "oisst-avhrr-v02r01.20240218.nc"
[19] "oisst-avhrr-v02r01.20240219.nc" "oisst-avhrr-v02r01.20240220.nc"
[21] "oisst-avhrr-v02r01.20240221.nc" "oisst-avhrr-v02r01.20240222.nc"
[23] "oisst-avhrr-v02r01.20240223.nc" "oisst-avhrr-v02r01.20240224.nc"
[25] "oisst-avhrr-v02r01.20240225.nc" "oisst-avhrr-v02r01.20240226.nc"
[27] "oisst-avhrr-v02r01.20240227.nc" "oisst-avhrr-v02r01.20240228.nc"
[29] "oisst-avhrr-v02r01.20240229.nc"

Some Prep

## Seasons for plotting
season <-  function(in_date){
  br = yday(as.Date(c("2019-03-01",
                      "2019-06-01",
                      "2019-09-01",
                      "2019-12-01")))
  x = yday(in_date)
  x = cut(x, breaks = c(0, br, 366))
  levels(x) = c("Winter", "Spring", "Summer", "Autumn", "Winter")
  x
}

season_lab <-  tibble(yrday = yday(as.Date(c("2019-03-01",
                                             "2019-06-01",
                                             "2019-09-01",
                                             "2019-12-01"))),
                      lab = c("Spring", "Summer", "Autumn", "Winter"))

NOAA Temperature Data

Raw data files

#install.packages("ncdf4")
#install.packages("terra")
library(terra)

## For the filename processing
## This one gives you an unknown number of chunks each with approx n elements
chunk <- function(x, n) split(x, ceiling(seq_along(x)/n))

## This one gives you n chunks each with an approx equal but unknown number of elements
chunk2 <- function(x, n) split(x, cut(seq_along(x), n, labels = FALSE))

## All the daily .nc files:
all_fnames <- as.character(fs::dir_ls(here("avhrr"), recurse = TRUE, glob = "*.nc"))
chunked_fnames <- chunk(all_fnames, 25)

length(all_fnames)

[1] 15540

length(chunked_fnames)

[1] 622

NOAA Temperature Data

The data is in netCDF (Version 4) format. An interesting self-documenting format. Read one in with the netCDF reader.

tmp <- ncdf4::nc_open(all_fnames[10000])
tmp

File /Users/kjhealy/Documents/courses/socdata.co/avhrr/200901/oisst-avhrr-v02r01.20090120.nc (NC_FORMAT_NETCDF4):

     4 variables (excluding dimension variables):
        short anom[lon,lat,zlev,time]   (Chunking: [1440,720,1,1])  (Compression: shuffle,level 4)
            long_name: Daily sea surface temperature anomalies
            _FillValue: -999
            add_offset: 0
            scale_factor: 0.00999999977648258
            valid_min: -1200
            valid_max: 1200
            units: Celsius
        short err[lon,lat,zlev,time]   (Chunking: [1440,720,1,1])  (Compression: shuffle,level 4)
            long_name: Estimated error standard deviation of analysed_sst
            units: Celsius
            _FillValue: -999
            add_offset: 0
            scale_factor: 0.00999999977648258
            valid_min: 0
            valid_max: 1000
        short ice[lon,lat,zlev,time]   (Chunking: [1440,720,1,1])  (Compression: shuffle,level 4)
            long_name: Sea ice concentration
            units: %
            _FillValue: -999
            add_offset: 0
            scale_factor: 0.00999999977648258
            valid_min: 0
            valid_max: 100
        short sst[lon,lat,zlev,time]   (Chunking: [1440,720,1,1])  (Compression: shuffle,level 4)
            long_name: Daily sea surface temperature
            units: Celsius
            _FillValue: -999
            add_offset: 0
            scale_factor: 0.00999999977648258
            valid_min: -300
            valid_max: 4500

     4 dimensions:
        time  Size:1   *** is unlimited *** 
            long_name: Center time of the day
            units: days since 1978-01-01 12:00:00
        zlev  Size:1 
            long_name: Sea surface height
            units: meters
            actual_range: 0, 0
            positive: down
        lat  Size:720 
            long_name: Latitude
            units: degrees_north
            grids: Uniform grid from -89.875 to 89.875 by 0.25
        lon  Size:1440 
            long_name: Longitude
            units: degrees_east
            grids: Uniform grid from 0.125 to 359.875 by 0.25

    38 global attributes:
        title: NOAA/NCEI 1/4 Degree Daily Optimum Interpolation Sea Surface Temperature (OISST) Analysis, Version 2.1 - Final
        Description: Reynolds, et al.(2007) Daily High-resolution Blended Analyses. Available at ftp://eclipse.ncdc.noaa.gov/pub/OI-daily/daily-sst.pdf  Climatology is based on 1971-2000 OI.v2 SST, Satellite data: Navy  NOAA18 METOP AVHRR, Ice data: NCEP ice
        source: ICOADS, NCEP_GTS, GSFC_ICE, NCEP_ICE, Pathfinder_AVHRR, Navy_AVHRR
        id: oisst-avhrr-v02r01.20090120.nc
        naming_authority: gov.noaa.ncei
        summary: NOAAs 1/4-degree Daily Optimum Interpolation Sea Surface Temperature (OISST) (sometimes referred to as Reynolds SST, which however also refers to earlier products at different resolution), currently available as version v02r01, is created by interpolating and extrapolating SST observations from different sources, resulting in a smoothed complete field. The sources of data are satellite (AVHRR) and in situ platforms (i.e., ships and buoys), and the specific datasets employed may change over time. At the marginal ice zone, sea ice concentrations are used to generate proxy SSTs.  A preliminary version of this file is produced in near-real time (1-day latency), and then replaced with a final version after 2 weeks. Note that this is the AVHRR-ONLY DOISST, available from Oct 1981, but there is a companion DOISST product that includes microwave satellite data, available from June 2002
        cdm_data_type: Grid
        history: Final file created using preliminary as first guess, and 3 days of AVHRR data. Preliminary uses only 1 day of AVHRR data.
        date_modified: 2020-05-08T19:05:13Z
        date_created: 2020-05-08T19:05:13Z
        product_version: Version v02r01
        processing_level: NOAA Level 4
        institution: NOAA/National Centers for Environmental Information
        creator_url: https://www.ncei.noaa.gov/
        creator_email: oisst-help@noaa.gov
        keywords: Earth Science > Oceans > Ocean Temperature > Sea Surface Temperature
        keywords_vocabulary: Global Change Master Directory (GCMD) Earth Science Keywords
        platform: Ships, buoys, Argo floats, MetOp-A, MetOp-B
        platform_vocabulary: Global Change Master Directory (GCMD) Platform Keywords
        instrument: Earth Remote Sensing Instruments > Passive Remote Sensing > Spectrometers/Radiometers > Imaging Spectrometers/Radiometers > AVHRR > Advanced Very High Resolution Radiometer
        instrument_vocabulary: Global Change Master Directory (GCMD) Instrument Keywords
        standard_name_vocabulary: CF Standard Name Table (v40, 25 January 2017)
        geospatial_lat_min: -90
        geospatial_lat_max: 90
        geospatial_lon_min: 0
        geospatial_lon_max: 360
        geospatial_lat_units: degrees_north
        geospatial_lat_resolution: 0.25
        geospatial_lon_units: degrees_east
        geospatial_lon_resolution: 0.25
        time_coverage_start: 2009-01-20T00:00:00Z
        time_coverage_end: 2009-01-20T23:59:59Z
        metadata_link: https://doi.org/10.25921/RE9P-PT57
        ncei_template_version: NCEI_NetCDF_Grid_Template_v2.0
        comment: Data was converted from NetCDF-3 to NetCDF-4 format with metadata updates in November 2017.
        sensor: Thermometer, AVHRR
        Conventions: CF-1.6, ACDD-1.3
        references: Reynolds, et al.(2007) Daily High-Resolution-Blended Analyses for Sea Surface Temperature (available at https://doi.org/10.1175/2007JCLI1824.1). Banzon, et al.(2016) A long-term record of blended satellite and in situ sea-surface temperature for climate monitoring, modeling and environmental studies (available at https://doi.org/10.5194/essd-8-165-2016). Huang et al. (2020) Improvements of the Daily Optimum Interpolation Sea Surface Temperature (DOISST) Version v02r01, submitted.Climatology is based on 1971-2000 OI.v2 SST. Satellite data: Pathfinder AVHRR SST and Navy AVHRR SST. Ice data: NCEP Ice and GSFC Ice.

NOAA Temperature Data

For analysis we are going to use the terra package, which understands many GIS type formats. Fundamentally we have a grid or raster of data that’s 1440 x 720 in size. The data has several layers, each one a specific measure, such as sea-surface temperature, sea ice coverage, and so on.

tmp <- terra::rast(all_fnames[10000])
tmp

class       : SpatRaster 
dimensions  : 720, 1440, 4  (nrow, ncol, nlyr)
resolution  : 0.25, 0.25  (x, y)
extent      : 0, 360, -90, 90  (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat WGS 84 
sources     : oisst-avhrr-v02r01.20090120.nc:anom  
              oisst-avhrr-v02r01.20090120.nc:err  
              oisst-avhrr-v02r01.20090120.nc:ice  
              oisst-avhrr-v02r01.20090120.nc:sst  
varnames    : anom (Daily sea surface temperature anomalies) 
              err (Estimated error standard deviation of analysed_sst) 
              ice (Sea ice concentration) 
              ...
names       : anom_zlev=0, err_zlev=0, ice_zlev=0, sst_zlev=0 
unit        :     Celsius,    Celsius,          %,    Celsius 
time (days) : 2009-01-20

Terra can understand rasters that are aggregated into slabs or “bricks” covering the same area repeatedly, and has methods for aggregating these arrays in different ways.

NOAA Temperature Data

Read one in with terra. Get the sst (Sea Surface Temperature) layer only.

tmp <- terra::rast(all_fnames[10000])["sst"]
tmp

class       : SpatRaster 
dimensions  : 720, 1440, 1  (nrow, ncol, nlyr)
resolution  : 0.25, 0.25  (x, y)
extent      : 0, 360, -90, 90  (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat WGS 84 
source      : oisst-avhrr-v02r01.20090120.nc:sst 
varname     : sst (Daily sea surface temperature) 
name        : sst_zlev=0 
unit        :    Celsius 
time (days) : 2009-01-20

NOAA Temperature Data

What reading a chunk of filenames (with all their layers) does:

tmp2 <- terra::rast(chunked_fnames[[10]])
tmp2

class       : SpatRaster 
dimensions  : 720, 1440, 100  (nrow, ncol, nlyr)
resolution  : 0.25, 0.25  (x, y)
extent      : 0, 360, -90, 90  (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat WGS 84 
sources     : oisst-avhrr-v02r01.19820414.nc:anom  
              oisst-avhrr-v02r01.19820414.nc:err  
              oisst-avhrr-v02r01.19820414.nc:ice  
              ... and 97 more source(s)
varnames    : anom (Daily sea surface temperature anomalies) 
              err (Estimated error standard deviation of analysed_sst) 
              ice (Sea ice concentration) 
              ...
names       : anom_zlev=0, err_zlev=0, ice_zlev=0, sst_zlev=0, anom_zlev=0, err_zlev=0, ... 
unit        :     Celsius,    Celsius,          %,    Celsius,     Celsius,    Celsius, ... 
time (days) : 1982-04-14 to 1982-05-08

NOAA Temperature Data

Write a function to get a file, read in the SST raster, and get its area-weighted mean, for the North Atlantic region only.

process_raster <- function(fnames, crop_area = c(-80, 0, 0, 60), var = "sst") {

  tdf <- terra::rast(as.character(fnames))[var] |>
    terra::rotate() |>   # Fix 0-360 lon
    terra::crop(crop_area) # Manually crop to a defined box. Default is Atlantic lat/lon box

  wts <- terra::cellSize(tdf, unit = "km") # For scaling

  data.frame(
    date = terra::time(tdf),
    # global() calculates a quantity for the whole grid on a particular SpatRaster
    # so we get one weighted mean per file that comes in
    wt_mean_sst = terra::global(tdf, "mean", weights = wts, na.rm=TRUE)$weighted_mean
  )
}

NOAA Temperature Data

Try it on one data file:

process_raster(all_fnames[1000]) |> 
  as_tibble()

# A tibble: 1 × 2
  date       wt_mean_sst
  <date>           <dbl>
1 1984-05-27        20.6

Try it on one chunk of data files:

process_raster(chunked_fnames[[500]]) |> 
  as_tibble()

# A tibble: 25 × 2
   date       wt_mean_sst
   <date>           <dbl>
 1 2015-11-02        22.8
 2 2015-11-03        22.8
 3 2015-11-04        22.8
 4 2015-11-05        22.7
 5 2015-11-06        22.7
 6 2015-11-07        22.7
 7 2015-11-08        22.6
 8 2015-11-09        22.5
 9 2015-11-10        22.5
10 2015-11-11        22.4
# ℹ 15 more rows

NOAA Temperature Data

Do it in parallel for all files:

# library(furrr) # We did this already
# plan(multisession) 

tictoc::tic("Terra Method")
df <- future_map(chunked_fnames, process_raster) |>
  list_rbind() |>
  as_tibble() |> 
  mutate(date = ymd(date),
         year = lubridate::year(date),
         month = lubridate::month(date),
         day = lubridate::day(date),
         yrday = lubridate::yday(date),
         season = season(date))
tictoc::toc()

Terra Method: 29.11 sec elapsed

dim(df)

[1] 15540     7

NOAA Temperature Data

df |> 
  slice_sample(n = 10)

# A tibble: 10 × 7
   date       wt_mean_sst  year month   day yrday season
   <date>           <dbl> <dbl> <dbl> <int> <dbl> <fct> 
 1 1995-07-09        23.2  1995     7     9   190 Summer
 2 2004-12-20        20.8  2004    12    20   355 Winter
 3 2017-04-29        20.5  2017     4    29   119 Spring
 4 2010-04-12        20.1  2010     4    12   102 Spring
 5 1983-03-17        19.4  1983     3    17    76 Spring
 6 2015-10-14        23.6  2015    10    14   287 Autumn
 7 2005-09-23        24.2  2005     9    23   266 Autumn
 8 2021-12-06        21.7  2021    12     6   340 Winter
 9 2006-07-10        23.2  2006     7    10   191 Summer
10 1999-02-22        19.1  1999     2    22    53 Winter

NOAA Temperature Data

Make our cores work even harder

## Seas of the world polygons
seas <- sf::read_sf(here("seas")) 

seas

Simple feature collection with 101 features and 10 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -180 ymin: -85.5625 xmax: 180 ymax: 90
Geodetic CRS:  WGS 84
# A tibble: 101 × 11
   NAME          ID    Longitude Latitude min_X  min_Y max_X  max_Y   area MRGID
   <chr>         <chr>     <dbl>    <dbl> <dbl>  <dbl> <dbl>  <dbl>  <dbl> <dbl>
 1 Rio de La Pl… 33        -56.8   -35.1  -59.8 -36.4  -54.9 -31.5  3.18e4  4325
 2 Bass Strait   62A       146.    -39.5  144.  -41.4  150.  -37.5  1.13e5  4366
 3 Great Austra… 62        133.    -36.7  118.  -43.6  146.  -31.5  1.33e6  4276
 4 Tasman Sea    63        161.    -39.7  147.  -50.9  175.  -30    3.34e6  4365
 5 Mozambique C… 45A        40.9   -19.3   32.4 -26.8   49.2 -10.5  1.39e6  4261
 6 Savu Sea      48o       122.     -9.48 119.  -10.9  125.   -8.21 1.06e5  4343
 7 Timor Sea     48i       128.    -11.2  123.  -15.8  133.   -8.18 4.34e5  4344
 8 Bali Sea      48l       116.     -7.93 114.   -9.00 117.   -7.01 3.99e4  4340
 9 Coral Sea     64        157.    -18.2  141.  -30.0  170.   -6.79 4.13e6  4364
10 Flores Sea    48j       120.     -7.51 117.   -8.74 123.   -5.51 1.03e5  4341
# ℹ 91 more rows
# ℹ 1 more variable: geometry <MULTIPOLYGON [°]>

NOAA Temperature Data

## Rasterize the seas polygons using one of the nc files
## as a reference grid for the rasterization process
## They're all on the same grid so it doesn't matter which one
one_raster <- all_fnames[1]
seas_vect <- terra::vect(seas)
tmp_tdf_seas <- terra::rast(one_raster)["sst"] |>
  terra::rotate()
seas_zonal <- terra::rasterize(seas_vect, tmp_tdf_seas, "NAME")

NOAA Temperature Data

plot(seas_zonal, legend = FALSE)

Pointers and wrapping

We’ll need to apply this grid of seas and oceans repeatedly — once for each .nc file — which means it has to get passed to all our worker cores. But it is stored as a pointer, so we can’t do that directly. We need to wrap it:

# If we don't do this it can't be passed around
# across the processes that future_map() will spawn
seas_zonal_wrapped <- terra::wrap(seas_zonal)

NOAA Temperature Data

Write a function very similar to the other one.

process_raster_zonal <- function(fnames, var = "sst") {

  tdf_seas <- terra::rast(as.character(fnames))[var] |>
    terra::rotate() |>
    terra::zonal(terra::unwrap(seas_zonal_wrapped), mean, na.rm = TRUE)

  colnames(tdf_seas) <- c("ID", paste0("d_", terra::time(terra::rast(as.character(fnames))[var])))
  tdf_seas |>
    pivot_longer(-ID, names_to = "date", values_to = "sst_wt_mean")

}

NOAA Temperature Data

Try it on one record:

process_raster_zonal(all_fnames[10000])

# A tibble: 101 × 3
   ID                     date         sst_wt_mean
   <chr>                  <chr>              <dbl>
 1 Adriatic Sea           d_2009-01-20       13.5 
 2 Aegean Sea             d_2009-01-20       15.9 
 3 Alboran Sea            d_2009-01-20       14.8 
 4 Andaman or Burma Sea   d_2009-01-20       27.1 
 5 Arabian Sea            d_2009-01-20       26.6 
 6 Arafura Sea            d_2009-01-20       29.6 
 7 Arctic Ocean           d_2009-01-20       -1.74
 8 Baffin Bay             d_2009-01-20       -1.58
 9 Balearic (Iberian Sea) d_2009-01-20       14.0 
10 Bali Sea               d_2009-01-20       28.7 
# ℹ 91 more rows

We’ll tidy the date later. You can see we get 101 summary records per day.

NOAA Temperature Data

Try it on one chunk of records:

process_raster_zonal(chunked_fnames[[1]])

# A tibble: 2,525 × 3
   ID           date         sst_wt_mean
   <chr>        <chr>              <dbl>
 1 Adriatic Sea d_1981-09-01        23.0
 2 Adriatic Sea d_1981-09-02        23.1
 3 Adriatic Sea d_1981-09-03        22.9
 4 Adriatic Sea d_1981-09-04        22.9
 5 Adriatic Sea d_1981-09-05        22.5
 6 Adriatic Sea d_1981-09-06        22.4
 7 Adriatic Sea d_1981-09-07        22.4
 8 Adriatic Sea d_1981-09-08        22.5
 9 Adriatic Sea d_1981-09-09        22.6
10 Adriatic Sea d_1981-09-10        22.5
# ℹ 2,515 more rows

NOAA Temperature Data

Now future_map() it:

tictoc::tic("Big op")
seameans_df <- future_map(chunked_fnames, process_raster_zonal) |>
  list_rbind() |>
  mutate(date = str_remove(date, "d_"),
         date = ymd(date),
         year = lubridate::year(date),
         month = lubridate::month(date),
         day = lubridate::day(date),
         yrday = lubridate::yday(date),
         season = season(date)) |>
  rename(sea = ID)
tictoc::toc()

Big op: 48.701 sec elapsed

NOAA Temperature Data

seameans_df

# A tibble: 1,569,540 × 8
   sea          date       sst_wt_mean  year month   day yrday season
   <chr>        <date>           <dbl> <dbl> <dbl> <int> <dbl> <fct> 
 1 Adriatic Sea 1981-09-01        23.0  1981     9     1   244 Summer
 2 Adriatic Sea 1981-09-02        23.1  1981     9     2   245 Autumn
 3 Adriatic Sea 1981-09-03        22.9  1981     9     3   246 Autumn
 4 Adriatic Sea 1981-09-04        22.9  1981     9     4   247 Autumn
 5 Adriatic Sea 1981-09-05        22.5  1981     9     5   248 Autumn
 6 Adriatic Sea 1981-09-06        22.4  1981     9     6   249 Autumn
 7 Adriatic Sea 1981-09-07        22.4  1981     9     7   250 Autumn
 8 Adriatic Sea 1981-09-08        22.5  1981     9     8   251 Autumn
 9 Adriatic Sea 1981-09-09        22.6  1981     9     9   252 Autumn
10 Adriatic Sea 1981-09-10        22.5  1981     9    10   253 Autumn
# ℹ 1,569,530 more rows

Parallel Processing

Load the packages, as always

Split, Apply, Combine

A lot of analysis has this pattern

For example

For example: split

For example: split

For example: apply

For example: apply

For example: combine

For example: apply

How this happens

How this happens

That’s Embarrassing

That’s Embarrassing

Multicore Computing

Some Terms

Multicore Computing

Future and furrr

Toy Example

Toy Example

NOAA Data

NOAA Temperature Data

NOAA Temperature Data

Some Prep

NOAA Temperature Data

NOAA Temperature Data

NOAA Temperature Data

NOAA Temperature Data

NOAA Temperature Data

NOAA Temperature Data

NOAA Temperature Data

Try it on one chunk of data files:

NOAA Temperature Data

NOAA Temperature Data

NOAA Temperature Data

NOAA Temperature Data

NOAA Temperature Data

Pointers and wrapping

NOAA Temperature Data

NOAA Temperature Data

NOAA Temperature Data

NOAA Temperature Data

NOAA Temperature Data

NOAA that’s more like it