Regular Expressions

Kieran Healy

Duke University

February 7, 2024

A brief introduction to regular expressions

Load the packages, as always

library(here)      # manage file paths
library(socviz)    # data and some useful functions

library(tidyverse) # your friend and mine
library(gapminder) # gapminder data
library(stringr) # Loaded automatically but just highlighting it here

Regular Expressions

Or, waiter, there appears to be a language inside my language

stringr is your gateway to regexps

library(stringr) # It's loaded by default with library(tidyverse)

regexps are their own whole world

This book is a thing of beauty.

Searching for patterns

A regular expression is a way of searching for a piece of text, or pattern, inside some larger body of text, called a string.

The simplest sort of search is like the “Find” functionality in a word processor. The pattern is a literal letter, number, punctuation mark, word, or series of words; the text is a document searched one line at a time. The next step up is “Find and Replace”.

Every pattern-searching function in stringr has the same basic form:

str_view(<STRING>, <PATTERN>, [...]) # where [...] means "maybe some options"

Functions that replace as well as detect strings all have this form:

str_replace(<STRING>, <PATTERN>, <REPLACEMENT>)

(If you think about it, <STRING>, <PATTERN> and <REPLACEMENT> above are all kinds of pattern: they are meant to “stand for” all kinds of text, not be taken literally.)
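As a minimal sketch of the replacement form, here is str_replace() next to its sibling str_replace_all(). The vector name fruits is just an illustrative choice, not something defined elsewhere in these slides.

```r
library(stringr)

fruits <- c("apple", "banana", "pear")

# str_replace() changes only the first match in each string
first_only <- str_replace(fruits, "a", "A")

# str_replace_all() changes every match in each string
every_match <- str_replace_all(fruits, "a", "A")

print(first_only)
print(every_match)
```

The difference only shows up in "banana", which is the one string with more than one match.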

Searching for patterns

Here I’ll follow the exposition in Wickham & Grolemund (2017).

x <- c("apple", "banana", "pear")

str_view(x, "an", html=FALSE)

[2] │ b<an><an>a

Searching for patterns

Regular expressions get their real power from wildcards, i.e. tokens that match not just literal strings but also more general and more complex patterns.
The most general pattern-matching token is “Match any character!” It is represented by the period, or . (by default it matches any single character except a newline).
But … if . matches any character, how do you specifically match the literal character .?
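Before answering that question, a quick look at the wildcard in action may help. This sketch uses str_extract(), which is not shown elsewhere in these slides, to pull out the first match in each string.

```r
library(stringr)

x <- c("apple", "banana", "pear")

# ".a." means: any one character, then a literal "a", then any one character
dot_matches <- str_extract(x, ".a.")

print(dot_matches)
```

"apple" yields NA because its only "a" is the first character, so there is nothing in front of it for the leading . to match.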

Escaping

You have to “escape” the period to tell the regex you want to match it exactly, rather than interpret it as meaning “match anything”.
Regexps use the backslash, \, to signal “escape the next character”.
To match a ., you need the regex \.

Hang on, I see a further problem

We use strings to represent regular expressions. \ is also used as an escape symbol in strings. So to create the regular expression \. we need the string \\.

# To create the regular expression, we need \\
dot <- "\\."

# But the expression itself only contains one:
writeLines(dot)

\.

# And this tells R to look for an explicit .
str_view(c("abc", "a.c", "bef"), "a\\.c")

[2] │ <a.c>

But … how do you match a literal \?

x <- "a\\b"
writeLines(x)

a\b

str_view(x, "\\\\") # you need four!

[1] │ a<\>b

This is the price we pay for having to express searches for patterns using a language containing these same characters, which we may also want to search for.
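If the doubled backslashes get hard to read, one option is R's raw string syntax, available from R 4.0.0 onward. The slides do not use it, so treat this as an optional sketch: inside r"(...)" a backslash is taken literally, which removes the string-level layer of escaping.

```r
library(stringr)

# A raw string keeps backslashes as typed, so this is the regex \.
dot_raw <- r"(\.)"
dot_escaped <- "\\."
same_pattern <- identical(dot_raw, dot_escaped)

# Matching a literal backslash: two backslashes in a raw string instead of four
x <- "a\\b"
found <- str_detect(x, r"(\\)")

print(same_pattern)
print(found)
```

The regex itself is unchanged; only the way it is typed into R differs.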

I promise this will pay off

Matching start and end

Use ^ to match the start of a string.

x <- c("apple", "banana", "pear")
str_view(x, "^a")

[1] │ <a>pple

Use $ to match the end of a string.

str_view(x, "a$")

[2] │ banan<a>

Matching start and end

To force a regular expression to match only a complete string, anchor it with both ^ and $.

x <- c("apple pie", "apple", "apple cake")
str_view(x, "apple")

[1] │ <apple> pie
[2] │ <apple>
[3] │ <apple> cake

str_view(x, "^apple$")

[2] │ <apple>

Matching character classes

\d matches any digit.

\s matches any whitespace (e.g. space, tab, newline).

[abc] matches a, b, or c.

[^abc] matches anything except a, b, or c.
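A short sketch of these four classes at work. The input strings are made up for illustration, and str_count(), str_replace_all(), and str_extract() are stringr functions not otherwise used in these slides. Note that \d and \s need a doubled backslash when written as R strings.

```r
library(stringr)

# \d: count the digits in a string
n_digits <- str_count("Room 101, floor 3", "\\d")

# \s: replace every whitespace character (here a space and a tab)
no_space <- str_replace_all("a b\tc", "\\s", "_")

# [abc]: does the string contain an a, b, or c anywhere?
has_abc <- str_detect(c("apple", "kiwi"), "[abc]")

# [^abc]: the first character that is NOT a, b, or c
first_other <- str_extract("cabbage", "[^abc]")

print(n_digits)
print(no_space)
print(has_abc)
print(first_other)
```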

Matching the special characters

Look for a literal character that normally has special meaning in a regex:

str_view(c("abc", "a.c", "a*c", "a c"), "a[.]c")

[2] │ <a.c>

str_view(c("abc", "a.c", "a*c", "a c"), ".[*]c")

[3] │ <a*c>

This works for most (but not all) regex metacharacters: $ . | ? * + ( ) [ {. Unfortunately, a few characters have special meaning even inside a character class and must be handled with backslash escapes. These are ] \ ^ and -
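For the characters that stay special inside a character class, a backslash escape (doubled, because it is written in an R string) does the job. A minimal sketch with made-up inputs:

```r
library(stringr)

x <- c("a-b", "a^b", "a]b", "ab")

# Escape ^ and - inside the class so they are read as literal characters
hyphen_or_caret <- str_detect(x, "a[\\^\\-]b")

# Escape ] so that it does not close the class early
closing_bracket <- str_detect(x, "a[\\]]b")

print(hyphen_or_caret)
print(closing_bracket)
```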

Alternation

Use parentheses to make the precedence of the ‘or’ operator | clear:

str_view(c("groy", "grey", "griy", "gray"), "gr(e|a)y")

[2] │ <grey>
[4] │ <gray>
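To see why the parentheses matter, compare the same alternation without them. In this sketch (the test strings are invented for the purpose), "gre|ay" is read as “gre” or “ay”, not as “gr, then e or a, then y”.

```r
library(stringr)

x <- c("grey", "gray", "gre", "ay")

# Without parentheses, | splits the whole pattern: "gre" OR "ay"
loose <- str_detect(x, "gre|ay")

# With parentheses, | only chooses between "e" and "a"
strict <- str_detect(x, "gr(e|a)y")

print(loose)
print(strict)
```

The loose pattern matches all four strings, including the fragments "gre" and "ay", while the parenthesized one matches only the two real spellings.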

Repeated patterns

? matches the previous token 0 or 1 times
+ matches the previous token 1 or more times
* matches the previous token 0 or more times

x <- "1888 is the longest year in Roman numerals: MDCCCLXXXVIII"
str_view(x, "CC?")

[1] │ 1888 is the longest year in Roman numerals: MD<CC><C>LXXXVIII

str_view(x, "CC+")

[1] │ 1888 is the longest year in Roman numerals: MD<CCC>LXXXVIII

str_view(x, 'C[LX]+')

[1] │ 1888 is the longest year in Roman numerals: MDCC<CLXXX>VIII

Exact numbers of repetitions

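{n} matches the previous token exactly n times
{n,} matches the previous token n or more times
{,m} matches the previous token at most m times (writing it as {0,m} is the safer form, since not every regex engine accepts an omitted lower bound)
{n,m} matches the previous token between n and m times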

str_view(x, "C{2}")

[1] │ 1888 is the longest year in Roman numerals: MD<CC>CLXXXVIII

str_view(x, "C{2,}")

[1] │ 1888 is the longest year in Roman numerals: MD<CCC>LXXXVIII

str_view(x, "C{2,3}")

[1] │ 1888 is the longest year in Roman numerals: MD<CCC>LXXXVIII

By default these are greedy matches. You can make them “lazy”, matching the shortest string possible by putting a ? after them. This is often very useful!

str_view(x, 'C{2,3}?')

[1] │ 1888 is the longest year in Roman numerals: MD<CC>CLXXXVIII

str_view(x, 'C[LX]+?')

[1] │ 1888 is the longest year in Roman numerals: MDCC<CL>XXXVIII
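The greedy/lazy distinction matters most when a pattern has a clear closing delimiter. A small sketch with an invented HTML-like string, using str_extract() to return the first match:

```r
library(stringr)

html <- "<b>bold</b> and <i>italic</i>"

# Greedy: .+ grabs as much as it can, so the match runs to the LAST ">"
greedy <- str_extract(html, "<.+>")

# Lazy: .+? grabs as little as it can, so the match stops at the FIRST ">"
lazy <- str_extract(html, "<.+?>")

print(greedy)
print(lazy)
```

The greedy version swallows the entire string, which is rarely what is wanted when picking out individual tags.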

And finally … backreferences

fruit # built into stringr

 [1] "apple"             "apricot"           "avocado"          
 [4] "banana"            "bell pepper"       "bilberry"         
 [7] "blackberry"        "blackcurrant"      "blood orange"     
[10] "blueberry"         "boysenberry"       "breadfruit"       
[13] "canary melon"      "cantaloupe"        "cherimoya"        
[16] "cherry"            "chili pepper"      "clementine"       
[19] "cloudberry"        "coconut"           "cranberry"        
[22] "cucumber"          "currant"           "damson"           
[25] "date"              "dragonfruit"       "durian"           
[28] "eggplant"          "elderberry"        "feijoa"           
[31] "fig"               "goji berry"        "gooseberry"       
[34] "grape"             "grapefruit"        "guava"            
[37] "honeydew"          "huckleberry"       "jackfruit"        
[40] "jambul"            "jujube"            "kiwi fruit"       
[43] "kumquat"           "lemon"             "lime"             
[46] "loquat"            "lychee"            "mandarine"        
[49] "mango"             "mulberry"          "nectarine"        
[52] "nut"               "olive"             "orange"           
[55] "pamelo"            "papaya"            "passionfruit"     
[58] "peach"             "pear"              "persimmon"        
[61] "physalis"          "pineapple"         "plum"             
[64] "pomegranate"       "pomelo"            "purple mangosteen"
[67] "quince"            "raisin"            "rambutan"         
[70] "raspberry"         "redcurrant"        "rock melon"       
[73] "salal berry"       "satsuma"           "star fruit"       
[76] "strawberry"        "tamarillo"         "tangerine"        
[79] "ugli fruit"        "watermelon"

Grouping and backreferences

Find all fruits that have a repeated pair of letters:

str_view(fruit, "(..)\\1", match = TRUE)

 [4] │ b<anan>a
[20] │ <coco>nut
[22] │ <cucu>mber
[41] │ <juju>be
[56] │ <papa>ya
[73] │ s<alal> berry

Grouping and backreferences

Backreferences and grouping will be very useful for string replacements.
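As one hedged illustration of that claim, here is a sketch that flips “Last, First” names into “First Last”. The names_in vector is invented for the example. In the replacement string, \\1 and \\2 refer back to whatever the first and second parenthesized groups matched.

```r
library(stringr)

names_in <- c("Healy, Kieran", "Wickham, Hadley")

# Group 1 captures the surname, group 2 the first name;
# the replacement writes them back in the opposite order
flipped <- str_replace(names_in, "(\\w+), (\\w+)", "\\2 \\1")

print(flipped)
```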

OK that was a lot

Learning and testing regexps

Practice with a tester like https://regexr.com
Or an app like Patterns
The regex engine or “flavor” used by stringr is ICU (via the stringi package), whose syntax is Perl- or PCRE-like.

Regexps in practice

Example: Politics and Placenames

library(ukelection2019)

ukvote2019

# A tibble: 3,320 × 13
   cid     constituency electorate party_name candidate votes vote_share_percent
   <chr>   <chr>             <int> <chr>      <chr>     <int>              <dbl>
 1 W07000… Aberavon          50747 Labour     Stephen … 17008               53.8
 2 W07000… Aberavon          50747 Conservat… Charlott…  6518               20.6
 3 W07000… Aberavon          50747 The Brexi… Glenda D…  3108                9.8
 4 W07000… Aberavon          50747 Plaid Cym… Nigel Hu…  2711                8.6
 5 W07000… Aberavon          50747 Liberal D… Sheila K…  1072                3.4
 6 W07000… Aberavon          50747 Independe… Captain …   731                2.3
 7 W07000… Aberavon          50747 Green      Giorgia …   450                1.4
 8 W07000… Aberconwy         44699 Conservat… Robin Mi… 14687               46.1
 9 W07000… Aberconwy         44699 Labour     Emily Ow… 12653               39.7
10 W07000… Aberconwy         44699 Plaid Cym… Lisa Goo…  2704                8.5
# ℹ 3,310 more rows
# ℹ 6 more variables: vote_share_change <dbl>, total_votes_cast <int>,
#   vrank <int>, turnout <dbl>, fname <chr>, lname <chr>

Example: Politics and Placenames

library(ukelection2019)

ukvote2019 |>
  group_by(constituency)

# A tibble: 3,320 × 13
# Groups:   constituency [650]
   cid     constituency electorate party_name candidate votes vote_share_percent
   <chr>   <chr>             <int> <chr>      <chr>     <int>              <dbl>
 1 W07000… Aberavon          50747 Labour     Stephen … 17008               53.8
 2 W07000… Aberavon          50747 Conservat… Charlott…  6518               20.6
 3 W07000… Aberavon          50747 The Brexi… Glenda D…  3108                9.8
 4 W07000… Aberavon          50747 Plaid Cym… Nigel Hu…  2711                8.6
 5 W07000… Aberavon          50747 Liberal D… Sheila K…  1072                3.4
 6 W07000… Aberavon          50747 Independe… Captain …   731                2.3
 7 W07000… Aberavon          50747 Green      Giorgia …   450                1.4
 8 W07000… Aberconwy         44699 Conservat… Robin Mi… 14687               46.1
 9 W07000… Aberconwy         44699 Labour     Emily Ow… 12653               39.7
10 W07000… Aberconwy         44699 Plaid Cym… Lisa Goo…  2704                8.5
# ℹ 3,310 more rows
# ℹ 6 more variables: vote_share_change <dbl>, total_votes_cast <int>,
#   vrank <int>, turnout <dbl>, fname <chr>, lname <chr>

Example: Politics and Placenames

library(ukelection2019)

ukvote2019 |>
  group_by(constituency) |>
  slice_max(votes)

# A tibble: 650 × 13
# Groups:   constituency [650]
   cid     constituency electorate party_name candidate votes vote_share_percent
   <chr>   <chr>             <int> <chr>      <chr>     <int>              <dbl>
 1 W07000… Aberavon          50747 Labour     Stephen … 17008               53.8
 2 W07000… Aberconwy         44699 Conservat… Robin Mi… 14687               46.1
 3 S14000… Aberdeen No…      62489 Scottish … Kirsty B… 20205               54  
 4 S14000… Aberdeen So…      65719 Scottish … Stephen … 20388               44.7
 5 S14000… Aberdeenshi…      72640 Conservat… Andrew B… 22752               42.7
 6 S14000… Airdrie & S…      64008 Scottish … Neil Gray 17929               45.1
 7 E14000… Aldershot         72617 Conservat… Leo Doch… 27980               58.4
 8 E14000… Aldridge-Br…      60138 Conservat… Wendy Mo… 27850               70.8
 9 E14000… Altrincham …      73096 Conservat… Graham B… 26311               48  
10 W07000… Alyn & Dees…      62783 Labour     Mark Tami 18271               42.5
# ℹ 640 more rows
# ℹ 6 more variables: vote_share_change <dbl>, total_votes_cast <int>,
#   vrank <int>, turnout <dbl>, fname <chr>, lname <chr>

Example: Politics and Placenames

library(ukelection2019)

ukvote2019 |>
  group_by(constituency) |>
  slice_max(votes) |>
  ungroup()

# A tibble: 650 × 13
   cid     constituency electorate party_name candidate votes vote_share_percent
   <chr>   <chr>             <int> <chr>      <chr>     <int>              <dbl>
 1 W07000… Aberavon          50747 Labour     Stephen … 17008               53.8
 2 W07000… Aberconwy         44699 Conservat… Robin Mi… 14687               46.1
 3 S14000… Aberdeen No…      62489 Scottish … Kirsty B… 20205               54  
 4 S14000… Aberdeen So…      65719 Scottish … Stephen … 20388               44.7
 5 S14000… Aberdeenshi…      72640 Conservat… Andrew B… 22752               42.7
 6 S14000… Airdrie & S…      64008 Scottish … Neil Gray 17929               45.1
 7 E14000… Aldershot         72617 Conservat… Leo Doch… 27980               58.4
 8 E14000… Aldridge-Br…      60138 Conservat… Wendy Mo… 27850               70.8
 9 E14000… Altrincham …      73096 Conservat… Graham B… 26311               48  
10 W07000… Alyn & Dees…      62783 Labour     Mark Tami 18271               42.5
# ℹ 640 more rows
# ℹ 6 more variables: vote_share_change <dbl>, total_votes_cast <int>,
#   vrank <int>, turnout <dbl>, fname <chr>, lname <chr>

Example: Politics and Placenames

library(ukelection2019)

ukvote2019 |>
  group_by(constituency) |>
  slice_max(votes) |>
  ungroup() |>
  select(constituency, party_name)

# A tibble: 650 × 2
   constituency                    party_name             
   <chr>                           <chr>                  
 1 Aberavon                        Labour                 
 2 Aberconwy                       Conservative           
 3 Aberdeen North                  Scottish National Party
 4 Aberdeen South                  Scottish National Party
 5 Aberdeenshire West & Kincardine Conservative           
 6 Airdrie & Shotts                Scottish National Party
 7 Aldershot                       Conservative           
 8 Aldridge-Brownhills             Conservative           
 9 Altrincham & Sale West          Conservative           
10 Alyn & Deeside                  Labour                 
# ℹ 640 more rows

Example: Politics and Placenames

library(ukelection2019)

ukvote2019 |>
  group_by(constituency) |>
  slice_max(votes) |>
  ungroup() |>
  select(constituency, party_name) |>
  mutate(shire = str_detect(constituency, "shire"),
         field = str_detect(constituency, "field"),
         dale = str_detect(constituency, "dale"),
         pool = str_detect(constituency, "pool"),
         ton = str_detect(constituency, "(ton$)|(ton )"),
         wood = str_detect(constituency, "(wood$)|(wood )"),
         saint = str_detect(constituency, "(St )|(Saint)"),
         port = str_detect(constituency, "(Port)|(port)"),
         ford = str_detect(constituency, "(ford$)|(ford )"),
         by = str_detect(constituency, "(by$)|(by )"),
         boro = str_detect(constituency, "(boro$)|(boro )|(borough$)|(borough )"),
         ley = str_detect(constituency, "(ley$)|(ley )|(leigh$)|(leigh )"))

# A tibble: 650 × 14
   constituency party_name shire field dale  pool  ton   wood  saint port  ford 
   <chr>        <chr>      <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl>
 1 Aberavon     Labour     FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 2 Aberconwy    Conservat… FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 3 Aberdeen No… Scottish … FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 4 Aberdeen So… Scottish … FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 5 Aberdeenshi… Conservat… TRUE  FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 6 Airdrie & S… Scottish … FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 7 Aldershot    Conservat… FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 8 Aldridge-Br… Conservat… FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 9 Altrincham … Conservat… FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
10 Alyn & Dees… Labour     FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
# ℹ 640 more rows
# ℹ 3 more variables: by <lgl>, boro <lgl>, ley <lgl>
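The suffix patterns above all follow one template: the suffix at the end of the string, or the suffix followed by a space. A more compact way to write the same test, sketched here as an alternative rather than as the approach the slides take, is to factor out the suffix and alternate only over what follows it. The places vector is a handful of strings typed in by hand for illustration, not rows drawn from ukvote2019.

```r
library(stringr)

# Hand-typed illustrative names (an assumption, not taken from the data)
places <- c("Luton South", "Kensington", "Tonbridge", "Stone")

# The form used in the pipeline: "ton" at the end, or "ton" then a space
original <- str_detect(places, "(ton$)|(ton )")

# Factored form: "ton", followed by either end-of-string or a space
compact <- str_detect(places, "ton($| )")

print(original)
print(identical(original, compact))
```

"Tonbridge" is not matched because the pattern is case sensitive, and "Stone" is not matched because its "ton" is followed by a letter.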

Example: Politics and Placenames

library(ukelection2019)

ukvote2019 |>
  group_by(constituency) |>
  slice_max(votes) |>
  ungroup() |>
  select(constituency, party_name) |>
  mutate(shire = str_detect(constituency, "shire"),
         field = str_detect(constituency, "field"),
         dale = str_detect(constituency, "dale"),
         pool = str_detect(constituency, "pool"),
         ton = str_detect(constituency, "(ton$)|(ton )"),
         wood = str_detect(constituency, "(wood$)|(wood )"),
         saint = str_detect(constituency, "(St )|(Saint)"),
         port = str_detect(constituency, "(Port)|(port)"),
         ford = str_detect(constituency, "(ford$)|(ford )"),
         by = str_detect(constituency, "(by$)|(by )"),
         boro = str_detect(constituency, "(boro$)|(boro )|(borough$)|(borough )"),
         ley = str_detect(constituency, "(ley$)|(ley )|(leigh$)|(leigh )")) |>
  pivot_longer(shire:ley, names_to = "toponym")

# A tibble: 7,800 × 4
   constituency party_name toponym value
   <chr>        <chr>      <chr>   <lgl>
 1 Aberavon     Labour     shire   FALSE
 2 Aberavon     Labour     field   FALSE
 3 Aberavon     Labour     dale    FALSE
 4 Aberavon     Labour     pool    FALSE
 5 Aberavon     Labour     ton     FALSE
 6 Aberavon     Labour     wood    FALSE
 7 Aberavon     Labour     saint   FALSE
 8 Aberavon     Labour     port    FALSE
 9 Aberavon     Labour     ford    FALSE
10 Aberavon     Labour     by      FALSE
# ℹ 7,790 more rows

Example: Politics and Placenames

place_tab <- ukvote2019 |> 
  group_by(constituency) |> 
  slice_max(votes) |> 
  ungroup() |> 
  select(constituency, party_name) |> 
  mutate(shire = str_detect(constituency, "shire"),
         field = str_detect(constituency, "field"),
         dale = str_detect(constituency, "dale"),
         pool = str_detect(constituency, "pool"),
         ton = str_detect(constituency, "(ton$)|(ton )"),
         wood = str_detect(constituency, "(wood$)|(wood )"),
         saint = str_detect(constituency, "(St )|(Saint)"),
         port = str_detect(constituency, "(Port)|(port)"),
         ford = str_detect(constituency, "(ford$)|(ford )"),
         by = str_detect(constituency, "(by$)|(by )"),
         boro = str_detect(constituency, "(boro$)|(boro )|(borough$)|(borough )"),
         ley = str_detect(constituency, "(ley$)|(ley )|(leigh$)|(leigh )")) |> 
  pivot_longer(shire:ley, names_to = "toponym")

Example: Politics and Placenames

place_tab

# A tibble: 7,800 × 4
   constituency party_name toponym value
   <chr>        <chr>      <chr>   <lgl>
 1 Aberavon     Labour     shire   FALSE
 2 Aberavon     Labour     field   FALSE
 3 Aberavon     Labour     dale    FALSE
 4 Aberavon     Labour     pool    FALSE
 5 Aberavon     Labour     ton     FALSE
 6 Aberavon     Labour     wood    FALSE
 7 Aberavon     Labour     saint   FALSE
 8 Aberavon     Labour     port    FALSE
 9 Aberavon     Labour     ford    FALSE
10 Aberavon     Labour     by      FALSE
# ℹ 7,790 more rows

Example: Politics and Placenames

place_tab |>
  group_by(party_name, toponym)

# A tibble: 7,800 × 4
# Groups:   party_name, toponym [120]
   constituency party_name toponym value
   <chr>        <chr>      <chr>   <lgl>
 1 Aberavon     Labour     shire   FALSE
 2 Aberavon     Labour     field   FALSE
 3 Aberavon     Labour     dale    FALSE
 4 Aberavon     Labour     pool    FALSE
 5 Aberavon     Labour     ton     FALSE
 6 Aberavon     Labour     wood    FALSE
 7 Aberavon     Labour     saint   FALSE
 8 Aberavon     Labour     port    FALSE
 9 Aberavon     Labour     ford    FALSE
10 Aberavon     Labour     by      FALSE
# ℹ 7,790 more rows

Example: Politics and Placenames

place_tab |>
  group_by(party_name, toponym) |>
  filter(party_name %in% c("Conservative", "Labour"))

# A tibble: 6,816 × 4
# Groups:   party_name, toponym [24]
   constituency party_name toponym value
   <chr>        <chr>      <chr>   <lgl>
 1 Aberavon     Labour     shire   FALSE
 2 Aberavon     Labour     field   FALSE
 3 Aberavon     Labour     dale    FALSE
 4 Aberavon     Labour     pool    FALSE
 5 Aberavon     Labour     ton     FALSE
 6 Aberavon     Labour     wood    FALSE
 7 Aberavon     Labour     saint   FALSE
 8 Aberavon     Labour     port    FALSE
 9 Aberavon     Labour     ford    FALSE
10 Aberavon     Labour     by      FALSE
# ℹ 6,806 more rows

Example: Politics and Placenames

place_tab |>
  group_by(party_name, toponym) |>
  filter(party_name %in% c("Conservative", "Labour")) |>
  group_by(toponym, party_name)

# A tibble: 6,816 × 4
# Groups:   toponym, party_name [24]
   constituency party_name toponym value
   <chr>        <chr>      <chr>   <lgl>
 1 Aberavon     Labour     shire   FALSE
 2 Aberavon     Labour     field   FALSE
 3 Aberavon     Labour     dale    FALSE
 4 Aberavon     Labour     pool    FALSE
 5 Aberavon     Labour     ton     FALSE
 6 Aberavon     Labour     wood    FALSE
 7 Aberavon     Labour     saint   FALSE
 8 Aberavon     Labour     port    FALSE
 9 Aberavon     Labour     ford    FALSE
10 Aberavon     Labour     by      FALSE
# ℹ 6,806 more rows

Example: Politics and Placenames

place_tab |>
  group_by(party_name, toponym) |>
  filter(party_name %in% c("Conservative", "Labour")) |>
  group_by(toponym, party_name) |>
  summarize(freq = sum(value))

# A tibble: 24 × 3
# Groups:   toponym [12]
   toponym party_name    freq
   <chr>   <chr>        <int>
 1 boro    Conservative     7
 2 boro    Labour           1
 3 by      Conservative     6
 4 by      Labour           2
 5 dale    Conservative     3
 6 dale    Labour           1
 7 field   Conservative    10
 8 field   Labour          10
 9 ford    Conservative    17
10 ford    Labour          12
# ℹ 14 more rows

Example: Politics and Placenames

place_tab |>
  group_by(party_name, toponym) |>
  filter(party_name %in% c("Conservative", "Labour")) |>
  group_by(toponym, party_name) |>
  summarize(freq = sum(value)) |>
  mutate(pct = freq/sum(freq))

# A tibble: 24 × 4
# Groups:   toponym [12]
   toponym party_name    freq   pct
   <chr>   <chr>        <int> <dbl>
 1 boro    Conservative     7 0.875
 2 boro    Labour           1 0.125
 3 by      Conservative     6 0.75 
 4 by      Labour           2 0.25 
 5 dale    Conservative     3 0.75 
 6 dale    Labour           1 0.25 
 7 field   Conservative    10 0.5  
 8 field   Labour          10 0.5  
 9 ford    Conservative    17 0.586
10 ford    Labour          12 0.414
# ℹ 14 more rows
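Why does sum(freq) give a within-toponym total here? By default summarize() peels off the last grouping variable (party_name), so the result is still grouped by toponym, as the "# Groups: toponym [12]" line shows. The sketch below reproduces that step on a toy data frame built from the first four rows of the output above, with the grouping made explicit. The names toy and shares are illustrative.

```r
library(dplyr)

# First four rows of the summarized table shown above
toy <- data.frame(
  toponym    = c("boro", "boro", "by", "by"),
  party_name = c("Conservative", "Labour", "Conservative", "Labour"),
  freq       = c(7L, 1L, 6L, 2L)
)

# Grouped by toponym, sum(freq) is the Conservative + Labour total
# for that toponym, so pct is each party's share of that total
shares <- toy |>
  group_by(toponym) |>
  mutate(pct = freq / sum(freq)) |>
  ungroup()

print(shares)
```

For boro this is 7 / (7 + 1) = 0.875, matching the pct column in the output.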

Example: Politics and Placenames

place_tab |>
  group_by(party_name, toponym) |>
  filter(party_name %in% c("Conservative", "Labour")) |>
  group_by(toponym, party_name) |>
  summarize(freq = sum(value)) |>
  mutate(pct = freq/sum(freq)) |>
  filter(party_name == "Conservative")

# A tibble: 12 × 4
# Groups:   toponym [12]
   toponym party_name    freq   pct
   <chr>   <chr>        <int> <dbl>
 1 boro    Conservative     7 0.875
 2 by      Conservative     6 0.75 
 3 dale    Conservative     3 0.75 
 4 field   Conservative    10 0.5  
 5 ford    Conservative    17 0.586
 6 ley     Conservative    26 0.722
 7 pool    Conservative     2 0.286
 8 port    Conservative     3 0.333
 9 saint   Conservative     3 0.5  
10 shire   Conservative    37 0.974
11 ton     Conservative    37 0.507
12 wood    Conservative     7 0.636

Example: Politics and Placenames

place_tab |>
  group_by(party_name, toponym) |>
  filter(party_name %in% c("Conservative", "Labour")) |>
  group_by(toponym, party_name) |>
  summarize(freq = sum(value)) |>
  mutate(pct = freq/sum(freq)) |>
  filter(party_name == "Conservative") |>
  arrange(desc(pct))

# A tibble: 12 × 4
# Groups:   toponym [12]
   toponym party_name    freq   pct
   <chr>   <chr>        <int> <dbl>
 1 shire   Conservative    37 0.974
 2 boro    Conservative     7 0.875
 3 by      Conservative     6 0.75 
 4 dale    Conservative     3 0.75 
 5 ley     Conservative    26 0.722
 6 wood    Conservative     7 0.636
 7 ford    Conservative    17 0.586
 8 ton     Conservative    37 0.507
 9 field   Conservative    10 0.5  
10 saint   Conservative     3 0.5  
11 port    Conservative     3 0.333
12 pool    Conservative     2 0.286
