# A tibble: 366 × 11
age sex year total elem4 elem8 hs3 hs4 coll3 coll4 median
<chr> <chr> <int> <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 25-34 Male 2016 21845 116 468 1427 6386 6015 7432 NA
2 25-34 Male 2015 21427 166 488 1584 6198 5920 7071 NA
3 25-34 Male 2014 21217 151 512 1611 6323 5910 6710 NA
4 25-34 Male 2013 20816 161 582 1747 6058 5749 6519 NA
5 25-34 Male 2012 20464 161 579 1707 6127 5619 6270 NA
6 25-34 Male 2011 20985 190 657 1791 6444 5750 6151 NA
7 25-34 Male 2010 20689 186 641 1866 6458 5587 5951 NA
8 25-34 Male 2009 20440 184 695 1806 6495 5508 5752 NA
9 25-34 Male 2008 20210 172 714 1874 6356 5277 5816 NA
10 25-34 Male 2007 20024 246 757 1930 6361 5137 5593 NA
# ℹ 356 more rows
Here, a “Level of Schooling Attained” variable is spread across the columns, from elem4 to coll4. We need a key column called “education” with the various levels of schooling, and a corresponding value column containing the counts.
Wide to long with pivot_longer()
We’re going to put the columns elem4:coll4 into a new column, creating a new categorical measure named education. The numbers currently under each column will become a new value column corresponding to that level of education.
edu |> pivot_longer(elem4:coll4, names_to = "education")
# A tibble: 2,196 × 7
age sex year total median education value
<chr> <chr> <int> <int> <dbl> <chr> <dbl>
1 25-34 Male 2016 21845 NA elem4 116
2 25-34 Male 2016 21845 NA elem8 468
3 25-34 Male 2016 21845 NA hs3 1427
4 25-34 Male 2016 21845 NA hs4 6386
5 25-34 Male 2016 21845 NA coll3 6015
6 25-34 Male 2016 21845 NA coll4 7432
7 25-34 Male 2015 21427 NA elem4 166
8 25-34 Male 2015 21427 NA elem8 488
9 25-34 Male 2015 21427 NA hs3 1584
10 25-34 Male 2015 21427 NA hs4 6198
# ℹ 2,186 more rows
We can name the value column whatever we like. Here it’s a count of people, so we call it n.
edu |> pivot_longer(elem4:coll4, names_to = "education", values_to = "n")
# A tibble: 2,196 × 7
age sex year total median education n
<chr> <chr> <int> <int> <dbl> <chr> <dbl>
1 25-34 Male 2016 21845 NA elem4 116
2 25-34 Male 2016 21845 NA elem8 468
3 25-34 Male 2016 21845 NA hs3 1427
4 25-34 Male 2016 21845 NA hs4 6386
5 25-34 Male 2016 21845 NA coll3 6015
6 25-34 Male 2016 21845 NA coll4 7432
7 25-34 Male 2015 21427 NA elem4 166
8 25-34 Male 2015 21427 NA elem8 488
9 25-34 Male 2015 21427 NA hs3 1584
10 25-34 Male 2015 21427 NA hs4 6198
# ℹ 2,186 more rows
It is pickier and more talkative than the Base R version. Use it instead.
Where’s my data? Using here()
If we’re loading a file, it’s coming from somewhere.
If it’s a file on our hard drive somewhere, we will need to interact with the file system. We should try to do this in a way that avoids absolute file paths.
# This is not portable!
df <- read_csv("/Users/kjhealy/Documents/data/misc/project/data/mydata.csv")
We should also do it in a way that is platform independent.
This makes it easier to share your work, move it around, etc. Projects should be self-contained.
The here package and its here() function build paths relative to the top level of your R project.
here() # this path will be different for you
[1] "/Users/kjhealy/Documents/courses/socdata.co"
Where’s the data? Using here()
This seminar’s files all live in an RStudio project.
I want to load files from the data folder, but I also want you to be able to load them. I’m writing this from somewhere deep in the slides folder, but you won’t be there. Also, I’m on a Mac, but you may not be.
So:
## Load the file relative to the path from the top of the project, without separators, etc
organs <- read_csv(file = here("files", "data", "organdonation.csv"))
# A tibble: 22,042 × 14
CTY_CODE CTY_DESC END_USE COMM_DESC value_14 value_15 value_16 value_17
<chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 0000 World Total 00000 Green coffee 5.23e 9 5.12e 9 4.79e 9 5.18e 9
2 0000 World Total 00010 Cocoa beans 1.31e 9 1.43e 9 1.29e 9 1.19e 9
3 0000 World Total 00020 Cane and be… 1.60e 9 1.74e 9 1.77e 9 1.66e 9
4 0000 World Total 00100 Meat produc… 1.21e10 1.28e10 1.07e10 1.10e10
5 0000 World Total 00110 Dairy produ… 1.95e 9 2.14e 9 2.02e 9 1.95e 9
6 0000 World Total 00120 Fruits, fro… 1.46e10 1.58e10 1.71e10 1.83e10
7 0000 World Total 00130 Vegetables 1.09e10 1.13e10 1.25e10 1.28e10
8 0000 World Total 00140 Nuts 2.39e 9 2.80e 9 2.90e 9 3.33e 9
9 0000 World Total 00150 Food oils, … 7.00e 9 6.05e 9 6.22e 9 6.85e 9
10 0000 World Total 00160 Bakery prod… 9.34e 9 9.65e 9 1.07e10 1.11e10
# ℹ 22,032 more rows
# ℹ 6 more variables: value_18 <dbl>, value_19 <dbl>, value_20 <dbl>,
# value_21 <dbl>, value_22 <dbl>, value_23 <dbl>
Let’s transform it to long format.
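One way to do this is sketched below. It uses a two-row stand-in tibble built from the first rows printed above, with only four of the value columns and with figures rounded as displayed. The object names trade and trade_long and the new column name year are placeholders, not names from the original data.

```r
library(tibble)
library(tidyr)

# Stand-in for the first two rows printed above (values rounded as displayed).
trade <- tribble(
  ~CTY_CODE, ~CTY_DESC,     ~END_USE, ~COMM_DESC,     ~value_14, ~value_15, ~value_16, ~value_17,
  "0000",    "World Total", "00000",  "Green coffee", 5.23e9,    5.12e9,    4.79e9,    5.18e9,
  "0000",    "World Total", "00010",  "Cocoa beans",  1.31e9,    1.43e9,    1.29e9,    1.19e9
)

# Gather every value_* column into a pair of columns: one holding the
# old column name, one holding the number that was under it.
# names_prefix strips "value_" so only the two-digit suffix remains.
trade_long <- trade |>
  pivot_longer(
    cols = starts_with("value_"),
    names_to = "year",
    names_prefix = "value_",
    values_to = "value"
  )

trade_long
```

Selecting with starts_with() rather than a range like value_14:value_23 means the call does not depend on the columns being adjacent or in order.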
A Plot’s Components
What we need our code to make
Data represented by visual elements;
like position, length, color, and size;
Each measured on some scale;
Each scale with a labeled guide;
With the plot itself also titled and labeled.
How does ggplot do this?
ggplot’s flow of action
Here’s the whole thing, start to finish. We’ll go through it step by step:
What we start with
Where we’re going
Core steps
Optional steps
Required: tidy data, aesthetic mappings, and a geom.
Let’s go piece by piece
Start with the data
gapminder
# A tibble: 1,704 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779.
2 Afghanistan Asia 1957 30.3 9240934 821.
3 Afghanistan Asia 1962 32.0 10267083 853.
4 Afghanistan Asia 1967 34.0 11537966 836.
5 Afghanistan Asia 1972 36.1 13079460 740.
6 Afghanistan Asia 1977 38.4 14880372 786.
7 Afghanistan Asia 1982 39.9 12881816 978.
8 Afghanistan Asia 1987 40.8 13867957 852.
9 Afghanistan Asia 1992 41.7 16317921 649.
10 Afghanistan Asia 1997 41.8 22227415 635.
# ℹ 1,694 more rows
dim(gapminder)
[1] 1704 6
Create a plot object
Data is the gapminder tibble.
p <- ggplot(data = gapminder)
Map variables to aesthetics
Tell ggplot the variables you want represented by visual elements on the plot
p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))
The mapping=aes(...) call links variables to things you will see on the plot.
x and y represent the quantities determining position on the x and y axes.
Other aesthetic mappings can include, e.g., color, shape, size, and fill.
Mappings do not directly specify the particular colors, shapes, or line styles that will appear on the plot. Rather, they establish which variables in the data will be represented by which visible elements on the plot.
p has data and mappings but no geom
p
This empty plot has no geoms.
Add a geom
p + geom_point()
A scatterplot of Life Expectancy vs GDP
Try a different geom
p + geom_smooth()
Life Expectancy vs GDP, using a smoother.
Build your plots layer by layer
p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))
p + geom_smooth()
Life Expectancy vs GDP, using a smoother.
This process is additive
p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))
p + geom_smooth()
p + geom_smooth() + geom_point()
p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))
p + geom_point() +
  geom_smooth(method = "lm") +
  scale_x_log10(labels = scales::label_dollar()) +
  labs(x = "GDP Per Capita", y = "Life Expectancy in Years",
       title = "Economic Growth and Life Expectancy",
       subtitle = "Data points are country-years",
       caption = "Source: Gapminder.")
Mapping vs Setting your plot’s aesthetics
“Can I change the color of the points?”
p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp, color = "purple"))
## Put in an object for convenience
p_out <- p + geom_point() +
  geom_smooth(method = "loess") +
  scale_x_log10()
What has gone wrong here?
p_out
Try again
p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))
## Put in an object for convenience
p_out <- p + geom_point(color = "purple") +
  geom_smooth(method = "loess") +
  scale_x_log10()
p_out
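The difference between the two attempts can also be checked without looking at the picture. The sketch below uses ggplot2's layer_data() to pull out the values a layer will draw with. It assumes the gapminder package is installed, and the object names p_mapped and p_set are placeholders.

```r
library(ggplot2)
library(gapminder)

# Mapping: inside aes(), "purple" is treated as data, a variable with a
# single value. ggplot builds a discrete color scale for it and adds a legend.
p_mapped <- ggplot(data = gapminder,
                   mapping = aes(x = gdpPercap, y = lifeExp, color = "purple")) +
  geom_point()

# Setting: color is passed to the geom directly, outside aes().
p_set <- ggplot(data = gapminder,
                mapping = aes(x = gdpPercap, y = lifeExp)) +
  geom_point(color = "purple")

# Colors the point layer will actually use in each plot
unique(layer_data(p_mapped)$colour)  # a default palette color, not purple
unique(layer_data(p_set)$colour)     # the color we asked for
```

In short, aes() is for variables in the data. A fixed property such as one color for all points belongs in the geom call.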
Geoms can take many arguments
Here we set color, linewidth, and alpha. Meanwhile x and y are mapped.
We also give non-default values to some other arguments.
p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))
p_out <- p + geom_point(alpha = 0.3) +
  geom_smooth(color = "orange", se = FALSE, linewidth = 8, method = "lm") +
  scale_x_log10()
p_out
alpha for overplotting
p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))
p + geom_point(alpha = 0.3) +
  geom_smooth(method = "lm") +
  scale_x_log10(labels = scales::label_dollar()) +
  labs(x = "GDP Per Capita", y = "Life Expectancy in Years",
       title = "Economic Growth and Life Expectancy",
       subtitle = "Data points are country-years",
       caption = "Source: Gapminder.")
Pay attention to which scales and guides are drawn, and why
Guides and scales reflect aes() mappings
mapping = aes(color = continent, fill = continent)
mapping = aes(color = continent)
Remember: Every mapped variable has a scale.
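A minimal sketch of how where a mapping is placed changes what gets drawn, assuming the gapminder package is installed. The names p_both and p_points are placeholders, and the choice of a loess smoother is only for illustration. layer_data() is used to count the distinct colors the smoother layer ends up with in each plot.

```r
library(ggplot2)
library(gapminder)

# color and fill mapped in ggplot(): every layer inherits them, so the
# smoother is split by continent and the legend reflects both aesthetics.
p_both <- ggplot(data = gapminder,
                 mapping = aes(x = gdpPercap, y = lifeExp,
                               color = continent, fill = continent)) +
  geom_point() +
  geom_smooth(method = "loess") +
  scale_x_log10()

# color mapped only inside geom_point(): the points are colored by
# continent, but the smoother is a single line in its default color.
p_points <- ggplot(data = gapminder,
                   mapping = aes(x = gdpPercap, y = lifeExp)) +
  geom_point(mapping = aes(color = continent)) +
  geom_smooth(method = "loess") +
  scale_x_log10()

# Number of distinct colors used by the smoother (layer 2) in each plot
length(unique(layer_data(p_both, 2)$colour))    # one per continent
length(unique(layer_data(p_points, 2)$colour))  # a single default color
```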
Saving your work
Use ggsave()
## Save the most recent plot
ggsave(filename = "figures/my_figure.png")
## Use here() for more robust file paths
ggsave(filename = here("figures", "my_figure.png"))
## A plot object
p_out <- p + geom_point(mapping = aes(color = log(pop))) +
  scale_x_log10()
ggsave(filename = here("figures", "lifexp_vs_gdp_gradient.pdf"), plot = p_out)
ggsave(here("figures", "lifexp_vs_gdp_gradient.png"), plot = p_out, width = 8, height = 5)