The cookbook also aims to make it easier for people new to R to create graphics. It implements a specific theme (using the ggplot2 R library) and a report template (using the pagedown R library).
All the code below is ready to be copied and pasted. The report template itself, which comes with UNHCR branding, can also easily be reused.
The cookbook covers basic use cases. When dealing with complex survey data sets based on a population sample, where the data must be weighted, we advise using the dedicated package KoboloadeR.
This document has been largely inspired by the BBC cookbook.
1 Graphics with “style”
1.1 A report or a dashboard?
Often analysts start building a dashboard when dealing with a new dataset. This is a correct approach only under certain conditions.
A dashboard is a one-screen group of summary visualizations that facilitates rapid assessment of a system (the best example being a car dashboard, where the driver can monitor speed, heat, fuel, etc.). It displays the most important information needed to achieve one or more objectives, consolidated and arranged so that the information can be monitored at a glance. A report, by contrast, is an ordered presentation of a detailed data set. Reports are usually longer and more detailed, although they may have summary components.
In short, dashboards tend to focus on Key Performance Indicators (KPIs), while reports tend to focus on underlying data. Dashboards communicate specific points, while reports aim to tell a story. Statistical reports are therefore the starting point for any data analysis: this is the format to use to develop a narrative interpretation of existing data in order to build knowledge. Only once the knowledge and interpretation keys for a specific data set have been developed will end users be able to make sense of a dashboard presenting the same kind of data.
1.2 Importance of reproducibility
Reproducibility is key in data analysis:
It can save a lot of time when a very similar analysis has to be done in different operations or at different points in time,
It allows for quickly scanning the analysis workflow and spotting potential errors,
It allows for peer review and can help peers learn from what is presented. Openness and transparency improve trust in results.
The entry point to analysis reproducibility is to have each step documented through scripts, rather than using a point-and-click graphical user interface. R is a very powerful language for this:
It’s open source (i.e. it can be customized to accommodate specific needs) and free to use (i.e. it not only saves financial resources for core humanitarian activities but also avoids long and cumbersome software procurement processes);
It is now an industry standard in data science and data mining (for instance, it is integrated by default in recent versions of Microsoft SQL Server);
It has a strong community, which drives package (i.e. logic & workflow) development and provides plenty of tutorial material and user groups;
It covers the entire analysis workflow: from data import, data tidying and data reshaping to visualization with charts, modeling and generation of reports to communicate results.
This cookbook does not aim to replace the numerous books available on data science, such as R for Data Science. Rather, it is a quick reference for colleagues who aim to build skills in this area.
In November 2018, the first HumanitaRian-useR-group took place. An online Skype group, with over 150 members at the time of writing, is openly accessible, and tutorials based on humanitarian situations are published on a blog.
1.3 Chart to convey “stories”
Effective charts are first of all those that support a message. From the same data set, many different charts can be produced. The best chart is the one that presents, in the most powerful way, the message you want to pass on and the story you want to tell.
Simple rules can help achieve this:
Outline the message: Always state the main conclusion you want to draw in the title of the chart, and use the subtitle to present the data used in the chart. Annotations in the chart can also help explain why the chart is evidence for the message you present. The message resides in the shape of the data. Chart titles should be clear and accurate, and should include the time increment and units.
Do keep the chart as simple as possible. Edward Tufte, a statistician, said “Graphical elegance is often found in simplicity of design and complexity of data.” A common mistake we all make with charts is overdressing them with unnecessary elements. The usual suspects are excess color, graphical clutter and abuse of special effects. Details like these won’t impress anyone but de-cluttering your charts will.
Focus on legibility: Graphs should be designed to highlight trends and patterns and to make exceptions more visible. They can also be designed to reveal relationships among multiple values. For instance, for a bar graph presenting categories, use a horizontal bar graph and arrange the data in descending order, from greatest to least.
Use color to communicate information and not for decoration. Too many colors can confuse and disorient. When designing a graph, color can be both your friend and your enemy. Depending on how we use it, it can either gracefully highlight data and show changes, or create visual overload and confuse the audience. Don’t use more than six colors or six different categories within the same chart: the human brain struggles to process more than this.
Reshape your data first: A good chart strikes a balance between content and message: too much content -> not legible; not enough content -> not precise. Therefore, the content (i.e. the data) should be adjusted to the message. For instance, extra decimal places look impressive and imply accuracy, but they’re often pointless. So, take a step back and round numbers off before plotting. Overstating the numerical precision of your data by showing too many decimal places can make your chart seem accurate, but this specificity is just misleading. Even when you don’t exaggerate the precision of your data, and your numbers are genuinely accurate, overloading your audience with such detail is often useless.
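As a small illustration of the rounding advice above, base R's round() and signif() can trim values before they reach the chart. This is only a sketch: the figures are made up for the example and are not UNHCR data.

```r
# Hypothetical values carrying more precision than a chart needs
values <- c(1234567.891, 98765.4321, 512.0049)

# Keep two significant digits: enough to compare magnitudes
rounded_signif <- signif(values, 2)

# Or round to the nearest thousand by using a negative `digits`
rounded_thousand <- round(values, -3)

print(rounded_signif)
print(rounded_thousand)
```

Which of the two fits best depends on the message: signif() keeps small and large values comparable, while round() with a negative digits argument puts every value on the same unit.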
By using this cookbook, you can benefit from a simple and lean style and focus on your data, message and story-telling, rather than wasting precious time on chart beautification and report design.
2 Set up your environment
We’ll get to how you can put together the various elements of these graphics, but let’s get the admin out of the way first…
2.1 Software & Libraries
You will first need to install R and RStudio. The next step could be to download this repository and re-run the Rmd file line by line, looking at the result of each instruction to familiarize yourself with the code.
A few of the steps in this cookbook - and creating charts in R in general - require certain packages to be installed and loaded. So that you do not have to install and load them one by one, you can use the using function to load them all at once with the following code.
## Getting all necessary packages
using <- function(...) {
  libs <- unlist(list(...))
  # Try to load each package; `req` is FALSE for those that could not be loaded
  req <- unlist(lapply(libs, require, character.only = TRUE))
  need <- libs[req == FALSE]
  if (length(need) > 0) {
    # Install whatever is missing, then load it
    install.packages(need)
    lapply(need, require, character.only = TRUE)
  }
}
## reshape2 is included because reshape2::melt() is used in the data preparation step
using('tidyverse', 'gganimate', 'gghighlight', 'ggpubr', 'dplyr', 'tidyr', 'reshape2', 'gapminder', 'ggplot2', 'ggalt', 'forcats', 'R.utils', 'png', 'grid', 'scales', 'markdown', 'pander', 'ISOcodes', 'wbstats', 'sf', 'rnaturalearth', 'rnaturalearthdata', 'ggspatial', 'unhcrdatapackage')
All graphics in the cookbook are created with the ggplot2 library. Without going into too much detail, the main concept behind this package is called the grammar of graphics. It’s a very powerful abstraction for describing any kind of plot using a limited number of instructions.
The idea is that you can build every graph from the same few components: a data set (data), a set of geometries (geom: visual marks that represent data points, such as bar, line, point, area, polygon, etc.) and a coordinate system (coord). To display data values, variables in the data set are mapped to aesthetic properties of the geom (aes: size, color, and x and y locations, for example). In addition, some plots visualize a transformation of the original variable, which is described by the instruction stat.
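The components listed above can be sketched in a few lines. This is a minimal illustration only: it uses the built-in mtcars data set as a stand-in rather than UNHCR data, and the variable choices are arbitrary.

```r
library(ggplot2)

# data: the built-in mtcars data set (used here only as a stand-in)
p <- ggplot(data = mtcars,
            # aes: map variables to x and y positions and to colour
            mapping = aes(x = wt, y = mpg, colour = factor(cyl))) +
  # geom: draw one point per observation
  geom_point(size = 2) +
  # coord: the default Cartesian system, written out explicitly
  coord_cartesian()
```

Calling print(p) draws the chart. Swapping geom_point() for another geom changes the visual mark while the data and the mappings stay the same.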
Using scripts can appear difficult at the beginning, but with a bit of practice it provides a huge productivity gain compared to point-and-click interactions.
UNHCR style is delivered through the function unhcr_style(). This function essentially modifies certain arguments in the theme function of ggplot2.
unhcr_style <- function() {
  font <- "Lato"
  ggplot2::theme(
    # This sets the font, size, type and colour of text for the chart's title
    plot.title = ggplot2::element_text(family = font, size = 20, face = "bold", color = "#222222"),
    # This sets the font and size of text for the chart's subtitle, as well as setting a margin between the title and the subtitle
    plot.subtitle = ggplot2::element_text(family = font, size = 16, margin = ggplot2::margin(9, 0, 9, 0)),
    # This hides the caption
    plot.caption = ggplot2::element_blank(),
    # This sets the position and alignment of the legend, removes a title and background for it and sets the requirements for any text within the legend. The legend may often need some more manual tweaking when it comes to its exact position based on the plot coordinates.
    legend.position = "top",
    legend.text.align = 0,
    legend.background = ggplot2::element_blank(),
    legend.title = ggplot2::element_blank(),
    legend.key = ggplot2::element_blank(),
    legend.text = ggplot2::element_text(family = font, size = 13, color = "#222222"),
    # This sets the text font, size and colour for the axis text, as well as setting the margins and removing lines and ticks. In some cases, axis lines and axis ticks are things we would want to have in the chart
    axis.title = ggplot2::element_blank(),
    axis.text = ggplot2::element_text(family = font, size = 13, color = "#222222"),
    axis.text.x = ggplot2::element_text(margin = ggplot2::margin(5, b = 10)),
    axis.ticks = ggplot2::element_blank(),
    axis.line = ggplot2::element_blank(),
    # This removes all minor gridlines and adds major y gridlines. In many cases you will want to change this to remove y gridlines and add x gridlines.
    panel.grid.minor = ggplot2::element_blank(),
    panel.grid.major.y = ggplot2::element_line(color = "#cbcbcb"),
    panel.grid.major.x = ggplot2::element_blank(),
    # This sets the panel background as blank, removing the standard grey ggplot background colour from the plot
    panel.background = ggplot2::element_blank(),
    # This sets the strip background for facet-wrapped plots to white, removing the standard grey ggplot background colour, and sets the facet-wrap title to font size 13, left-aligned
    strip.background = ggplot2::element_rect(fill = "white"),
    strip.text = ggplot2::element_text(size = 13, hjust = 0)
  )
}
unhcr_style() has no arguments and is added to the ggplot ‘chain’ after you have created a plot. It sets text size, font and color, axis lines, axis text, margins and many other standard chart components to UNHCR style, which has been formulated based on recommendations and feedback from the design team.
Note that colors for lines in a line chart, or for bars in a bar chart, do not come out of the box from the unhcr_style() function; they need to be explicitly set in your other standard ggplot chart functions.
You can modify these settings for your chart, or add additional theme arguments, by calling the theme function with the arguments you want - but please note that for it to work you must call it after you have called the unhcr_style function. Otherwise unhcr_style() will override it.
The following, for instance, will add some grid lines by adding extra theme arguments to what is included in the unhcr_style() function. There are many similar examples throughout the cookbook.
theme(panel.grid.major.x = element_line(color = "#cbcbcb"),
panel.grid.major.y = element_blank())
A specific statement is used to align the chart, its title, subtitle and source:
ggpubr::ggarrange(left_align(line, c("subtitle", "title")), ncol = 1, nrow = 1)
It uses a specific function, left_align():
#Left align text
left_align <- function(plot_name, pieces){
grob <- ggplot2::ggplotGrob(plot_name)
n <- length(pieces)
grob$layout$l[grob$layout$name %in% pieces] <- 2
return(grob)
}
Another function, format_si(), can also be defined to format numbers on the axes.
## a little helper function to better format numbers
format_si <- function(...) {
  function(x) {
    limits <- c(1e-24, 1e-21, 1e-18, 1e-15, 1e-12,
                1e-9,  1e-6,  1e-3,  1e0,   1e3,
                1e6,   1e9,   1e12,  1e15,  1e18,
                1e21,  1e24)
    # "\u00b5" is the micro sign, which pairs with the 1e-6 threshold
    prefix <- c("y", "z", "a", "f", "p",
                "n", "\u00b5", "m", " ", "k",
                "M", "G", "T", "P", "E",
                "Z", "Y")
    # Vector with array indices according to position in intervals
    i <- findInterval(abs(x), limits)
    # Set prefix to " " for very small values < 1e-24
    i <- ifelse(i == 0, which(limits == 1e0), i)
    paste(format(round(x / limits[i], 1),
                 trim = TRUE, scientific = FALSE, ...),
          prefix[i])
  }
}
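The lookup at the heart of format_si() relies on findInterval(). The sketch below is a simplified stand-in, not the function itself: it assumes only four thresholds and two made-up values, and shows how each value is matched to a threshold and then scaled.

```r
# Simplified sketch of the lookup used by format_si(): only four thresholds
limits <- c(1, 1e3, 1e6, 1e9)
prefix <- c("", "k", "M", "G")

# Made-up values to label
x <- c(1500, 2500000)

# For each value, index of the largest threshold that is <= the value
i <- findInterval(abs(x), limits)

# Scale by that threshold and append the matching prefix
labels <- paste(round(x / limits[i], 1), prefix[i])
print(labels)
```

In a chart, the full function is passed to the labels argument of a continuous scale, as in scale_y_continuous(labels = format_si()).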
2.2 Report template
Markdown is a simple formatting language designed to make authoring content easy for everyone. Rather than writing in complex markup code (e.g. HTML or LaTeX), you write in plain text with formatting cues.
R Markdown lets you weave together narrative text and code to produce elegantly formatted output. Within an R Markdown file, R code chunks can be embedded with the native Markdown syntax for fenced code regions.
An R Markdown document can then be rendered into the final output format (for instance HTML, but also directly into Word or PDF). R Markdown documents contain a metadata section that includes title, author, and date information as well as options for customizing output.
Pagedown is a package that transforms an R Markdown file into an HTML file, directly paginated (with CSS for print) so that it can be saved as PDF. With pagedown, you only need a modern web browser (e.g., Google Chrome) to generate the PDF.
This package requires a recent version of Pandoc (>= 2.2.3). If you use RStudio, we recommend installing the Preview version (>= 1.2.1070), which bundles Pandoc 2.x; otherwise you need to install Pandoc separately.
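For those who prefer a script to the RStudio buttons, the rendering steps described above might look like the sketch below. The file name report.Rmd is hypothetical, the paged HTML format is assumed to be pagedown::html_paged, and the code only runs when the file and both packages are available.

```r
# Hypothetical file name: replace with the path to your own report
rmd_file <- "report.Rmd"

# Only run when the file and the packages are actually available
can_render <- file.exists(rmd_file) &&
  requireNamespace("rmarkdown", quietly = TRUE) &&
  requireNamespace("pagedown", quietly = TRUE)

if (can_render) {
  # Knit the R Markdown file to paginated HTML
  html_file <- rmarkdown::render(rmd_file,
                                 output_format = "pagedown::html_paged")
  # Print the HTML to PDF through a locally installed Chrome or Chromium
  pagedown::chrome_print(html_file)
}
```

If the output format is already declared in the metadata section of the Rmd file, the output_format argument can be left out.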
This template demonstrates some of the basic markdown you’ll need to know to create a UNHCR statistical report with pagedown [@R-pagedown].
If you use the option self_contained: false (see line #22 of this Rmd file), don’t click on the Knit button in RStudio. Instead, use the xaringan [@R-xaringan] RStudio add-in Infinite Moon Reader.
2.3 Prepare some data
This cookbook is based on real UNHCR data.
Let’s now download (and slightly tidy) data from the UNHCR popstat API, together with indicators from the World Bank and some geodata.
time_series <- unhcrdatapackage::end_year_population_totals
reference <- unhcrdatapackage::reference
time_series2 <- reshape2::melt(time_series,
# ID variables - all the variables to keep but not split apart on
id.vars=c("Year", "CountryOriginCode","CountryAsylumCode","CountryOriginName","CountryAsylumName" ),
# The source columns
measure.vars=c("REF","IDP", "ASY","OOC","STA","VDA"),
# Name of the destination column that will identify the original
# column that the measurement came from
variable.name="Population.type",
value.name="Value")
time_series2 <- merge(x = time_series2, by.x="CountryOriginCode", y = reference, by.y= "iso_3", all.x = TRUE)
# Population, GDP & GNP per Capita from WorldBank
wb_data <- wb( indicator = c("SP.POP.TOTL", "NY.GDP.MKTP.CD", "NY.GDP.PCAP.CD", "NY.GNP.PCAP.CD"),
startdate = 1951, enddate = 2017, return_wide = TRUE)
# Renaming variables for further matching
names(wb_data)[1] <- "CountryAsylumCode"
names(wb_data)[2] <- "Year"
## Getting world map for mapping
world <- ne_countries(scale = "small", returnclass = "sf")
centroids <- st_transform(world$geometry, '+init=epsg:3857') %>%
## Reprojected in order to get centroid
st_centroid() %>%
# this is the crs from d, which has no EPSG code:
st_transform(., '+init=epsg:4326') %>%
# since we want the centroids in long lat:
st_geometry()
world_points <- cbind(world, st_coordinates(centroids))
3 Generate different kind of plots
3.1 Line chart
Some key points to consider when designing a line chart are:
The purpose of the line chart is to show a trend. It is often the best solution when the data presents a time series referring to a single value that changes at regular intervals.
Choose the y-axis scale appropriately so that the trend is visible. Too flat obscures the message and too exaggerated overstates the trend. As a rule of thumb, the line should occupy about two-thirds of the chart’s height.
The weight of the fever line should be thick enough to stand out against the grid lines but still thin enough to show the twists and turns of the line. Keep the grid lines thin.
Unlike a bar chart, a fever line does not necessarily require a zero baseline. Some data trends won’t be discernible when starting from a zero baseline.
Avoid labeling at a distance: a legend separated from the line requires readers to do extra work cross-referencing between the key and the line. Label the lines directly. Direct labeling allows the reader to identify the lines quickly and focus on comparing and contrasting the patterns.
Use the legend only when space is tight and the lines intersect extensively. The order of the legend should match the ranking of the end points since they are the most current data points.
Annotations help to clarify the message.
Let’s see what code is required for such a chart:
#Prepare data
line_df <- time_series2 %>%
filter(Population.type == "REF") %>%
group_by(Year) %>%
summarise(Value2 = sum(Value) )
#Make plot
line <- ggplot(line_df, aes(x = Year, y = Value2)) +
geom_line(colour = "#0072bc", size = 1) + # Here we mention that it will be a line chart
# geom_hline(yintercept = 0, size = 1, colour = "#333333") +
unhcr_style() + ## Insert UNHCR Style
scale_y_continuous(label = format_si()) + ## Format axis number
## and the chart labels
labs(title = "More and More refugees",
subtitle = "World wide refugee population 1951-2017",
caption = "UNHCR http://popstats.unhcr.org")
generate this chart:
3.2 Multiple line chart
Some additional points to consider when designing a multiple line chart:
Don’t use dashed lines and shape markers to differentiate each line! You can use solid lines exclusively by limiting the chart to four or fewer lines. Varying weights and shades do the work of differentiating the lines more effectively than distracting patterns and markers.
In a single chart, keep the maximum number of lines to three or possibly four if the lines are not intersecting at many points. The purpose of a multiple-line chart is to compare and contrast different data series.
The most important line should be one color and the other lines should be shades of a second color.
Let’s see what code is required for such a chart:
#Prepare data
multiple_line_df <- time_series2 %>%
filter(Population.type == "REF" & !(is.na(REGION_UN))) %>%
group_by(Year, REGION_UN ) %>%
summarise(Value2 = sum(Value) )
#Make plot
multiple_line <- ggplot(multiple_line_df, aes(x = Year, y = Value2,
colour = REGION_UN)) + # Adding reference to color
geom_line(size = 1) + # Here we mention that it will be a line chart
geom_hline(yintercept = 0, size = 1, colour = "#333333") +
scale_y_continuous( label = format_si()) + ## Format axis number
scale_colour_viridis_d() + ## Add color for each lines based on color-blind friendly palette
unhcr_style() + ## Insert UNHCR Style
## and the chart labels
labs(title = "Refugees Population are not equally spread",
subtitle = "World wide refugee population 1951-2017",
caption = "UNHCR http://popstats.unhcr.org")
generate this chart:
3.3 Bar chart
Some key points to consider when designing bar chart are:
Start at the zero baseline! No exception. Do not clip the axis in order to highlight disparities at the top of the axis. Instead, try recalculating your data as percentages, or try another type of chart;
Ensure numerical axis labels are aligned to the decimal point;
Clearly denote currency or units;
When axis lines are present, it is not necessary to label each data value. However, it can be useful to highlight the final value or other important data points.
Convert your data to rounded, easily digestible values for chart labeling.
Omit axes and baselines when data values are labeled: simpler is better!
When the range of your data crosses natural numerical milestones, such as from millions to billions, set the entire chart in the larger milestone. A chart should never show more than 1,000 million, for example.
Ensure labels fit neatly under the bars in no more than two lines.
Horizontal bar charts help compare long lists of values or categories. They have the advantage of printing long labels without using two lines or printing vertical text, as would be required for vertical bars.
Remember to sort your data before charting so that readers can easily compare.
Labeled values eliminate the need for grid lines, while rounding is done to make the values easy to digest.
Don’t use 3D effect!
Note that by default, R will display your data in alphabetical order, but arranging it by size instead is simple: just wrap reorder() around the x or y variable you want to rearrange, and specify which variable you want to reorder it by.
E.g. x = reorder(Country, Value2). Ascending order is the default, but you can change it to descending by wrapping desc() around the variable you’re ordering by.
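The effect of reorder() can be checked outside a plot by looking at the factor levels it produces. The sketch below uses made-up values and base R only; negating the ordering variable is used here as a stand-in for desc(), which gives the same ordering for numeric values.

```r
# Made-up values, only to show how reorder() changes factor levels
df <- data.frame(Country = c("A", "B", "C"),
                 Value2  = c(30, 10, 20))

# Ascending (default): levels follow increasing Value2
asc_levels <- levels(reorder(df$Country, df$Value2))

# Descending: negate the ordering variable
desc_levels <- levels(reorder(df$Country, -df$Value2))

print(asc_levels)
print(desc_levels)
```

ggplot2 draws categories in the order of these levels, which is why wrapping the variable inside aes() is enough to sort the bars.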
Let’s see what code is required for such a chart:
#Prepare data
bar_df <- time_series2 %>%
filter(Population.type == "REF" & Year == 2016) %>%
group_by( CountryAsylumName, SUBREGION) %>%
summarise(Value2 = sum(Value) ) %>%
arrange(desc(Value2)) %>%
head(10)
#Make plot
bars <- ggplot(bar_df, aes(x = reorder(CountryAsylumName, Value2), ## Reordering CountryAsylumName by Value
y = Value2)) +
geom_bar(stat = "identity",
position = "identity",
fill = "#0072bc") + # here we configure that it will be bar chart
geom_hline(yintercept = 0, size = 1, colour = "#333333") +
coord_flip() + # Add `coord_flip()` to make your vertical bars horizontal:
unhcr_style() + ## Insert UNHCR Style
## and the chart labels
labs(title = "Turkey is by far the biggest Refugee hosting country",
subtitle = "Top 10 Refugee Population per country in 2016", ## the data above is filtered on Year == 2016
caption = "UNHCR http://popstats.unhcr.org") +
scale_y_continuous( label = format_si()) + ## Format axis number
theme(panel.grid.major.x = element_line(color = "#cbcbcb"),
panel.grid.major.y = element_blank()) ### changing grid line that should appear
Generate this chart:
3.4 Stacked bar chart
Let’s see what code is required for such a chart:
#prepare data
df1 <- time_series2 %>%
filter(Population.type == "REF" & Year == 2016 & !(is.na(REGION_UN))) %>%
group_by(Year, CountryAsylumName, CountryAsylumCode, REGION_UN ) %>%
summarise(Value2 = sum(Value) )
df2 <- merge(x = df1, y = wb_data, by = c("CountryAsylumCode" ,"Year"), all.x = TRUE)
df2 <- df2[ !(is.na(df2$REGION_UN)) & !(is.na(df2$NY.GNP.PCAP.CD)) , ]
df2$prop <- df2$Value2 / df2$SP.POP.TOTL
stacked_df1 <- df2 %>%
mutate(CountryClass = cut(NY.GNP.PCAP.CD,
breaks = c(0, 1005, 3955, 12235, 150000),
labels = c("Low-income", "Lower-middle income", "Upper-middle income", "High-income"))) %>%
group_by(REGION_UN, CountryClass) %>%
summarise(Value3 = sum(as.numeric(Value2)))
#create plot
stacked_bars <- ggplot(data = stacked_df1,
aes(x = REGION_UN,
y = Value3,
fill = CountryClass)) +
geom_bar(stat = "identity",
position = "fill") +
unhcr_style() +
scale_y_continuous(labels = scales::percent) +
scale_fill_viridis_d(direction = -1) +
geom_hline(yintercept = 0, size = 1, colour = "#333333") +
labs(title = "High Share of Refugees in Africa are hosted in low-income countries",
subtitle = "% of population by Country classification per Region, 2016", ## the data above is filtered on Year == 2016
caption = "UNHCR http://popstats.unhcr.org - World Bank") +
theme(legend.position = "top",
legend.justification = "left") +
guides(fill = guide_legend(reverse = TRUE))
generate this chart:
This example shows proportions, but you might want to make a stacked bar chart showing number values instead - this is easy to change!
The value passed to the position argument will determine if your stacked chart shows proportions or actual values.
position = "fill" will draw your stacks as proportions, and position = "stack" will draw number values (position = "identity" would draw the bars on top of each other rather than stacking them).
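To make the difference concrete, the sketch below computes by hand the shares that position = "fill" displays, that is, each value divided by the total of its bar. The regions, classes and counts are made up for the illustration.

```r
# Made-up counts for two regions and two income classes
df <- data.frame(region = c("X", "X", "Y", "Y"),
                 class  = c("low", "high", "low", "high"),
                 value  = c(30, 10, 5, 15))

# Share of each class within its region: what position = "fill" displays
df$share <- df$value / ave(df$value, df$region, FUN = sum)

print(df)
```

With position = "stack", the value column itself would be drawn, so region X would reach 40 and region Y would reach 20 on the axis.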
3.5 Grouped bar chart
Making a grouped bar chart is very similar to making a bar chart.
You just need to change position = "identity" to position = "dodge", and map a variable to the fill aesthetic instead of setting a fixed fill colour.
Let’s see what code is required for such a chart:
#Prepare data
grouped_bar_df <- time_series2 %>%
filter(Population.type == "REF") %>%
filter(Year == 2006 | Year == 2016) %>%
group_by( CountryAsylumName, Year) %>%
summarise(Value2 = sum(Value) ) %>%
select(CountryAsylumName, Year, Value2) %>%
spread(Year, Value2) %>%
mutate(gap = `2016` - `2006`) %>%
arrange(desc(gap)) %>%
head(10) %>%
gather(key = Year,
value = Value2,
-CountryAsylumName,
-gap)
#Make plot
grouped_bars <- ggplot(grouped_bar_df,
aes(x = CountryAsylumName,
y = Value2,
fill = as.factor(Year))) +
coord_flip() +
geom_bar(stat = "identity", position = "dodge") +
geom_hline(yintercept = 0, size = 1, colour = "#333333") +
unhcr_style() +
scale_fill_manual(values = c("#0072bc", "#FAAB18")) +
labs(title = "Biggest Increase Population",
subtitle = "10 Biggest change in Refugee Population, 2006-2016",
caption = "UNHCR http://popstats.unhcr.org") +
scale_y_continuous( label = format_si()) + ## Format axis number
theme(panel.grid.major.x = element_line(color = "#cbcbcb"),
panel.grid.major.y = element_blank()) ### changing grid line that should appear
generate this chart:
3.6 Dumbbell chart
A dumbbell plot (also known as a dumbbell chart or connected dot plot) is great for displaying changes between two points in time, between two conditions, or differences between two groups. It is another way of showing a difference.
Let’s see what code is required for such a chart:
library("ggalt")
library("tidyr")
#Prepare data
dumbbell_df <- time_series2 %>%
filter(Population.type == "REF") %>%
filter(Year == 2006 | Year == 2016) %>%
group_by( CountryAsylumName, Year) %>%
summarise(Value2 = sum(Value) ) %>%
select(CountryAsylumName, Year, Value2) %>%
spread(Year, Value2) %>%
mutate(gap = `2006` - `2016`) %>%
arrange(desc(gap)) %>%
head(10)
# Make plot
dumbell <- ggplot(dumbbell_df, aes(x = `2006`, xend = `2016`,
y = reorder(CountryAsylumName, gap),
group = CountryAsylumName)) +
geom_dumbbell(colour = "#dddddd",
size = 3,
colour_x = "#0072bc",
colour_xend = "#FAAB18") +
unhcr_style() +
labs(title = "Where did Refugee Population decreased in the past 10 years?",
subtitle = "Biggest decrease in Refugee Population, 2006-2016",
caption = "UNHCR http://popstats.unhcr.org") +
scale_x_continuous( label = format_si()) + ## Format axis number
theme(panel.grid.major.x = element_line(color = "#cbcbcb"),
panel.grid.major.y = element_blank()) ### changing grid line that should appear
generate this chart:
3.7 Histogram
Histograms are used to plot how often values occur in a continuous variable (numeric, integer) that has been divided into classes, called bins.
A histogram is not a bar chart: histograms are used to show distributions (or the shape) of variables while bar charts are used to compare variables.
Some data sets have a distinct shape. Data hardly ever fall into perfect patterns, so you have to decide which shape the data most closely resemble. The two interpretation keys are the following:
Symmetric bell shape: if you cut it down the middle, the left-hand and right-hand sides resemble mirror images of each other. The middle of the chart is the place where the largest number of occurrences appears: in this case, we have something close to a normal distribution. Kurtosis is a measure of whether the data are heavy-tailed or light-tailed relative to a normal distribution. That is, data sets with high kurtosis tend to have heavy tails, or outliers. Data sets with low kurtosis tend to have light tails, or lack of outliers.
Skewed right or left: looks like a lopsided mound, with a tail going off to the right or left. Skewness is a measure of symmetry, or more precisely, the lack of symmetry. A distribution, or data set, is symmetric if it looks the same to the left and right of the center point.
A uniform distribution would be the extreme case.
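Skewness and kurtosis can also be computed rather than judged by eye. The sketch below uses the simple moment-based definitions (other definitions with small-sample corrections exist) and two made-up samples; the helper names skewness and kurtosis are defined here and do not come from a package.

```r
# Moment-based skewness: 0 for a symmetric sample, > 0 for a right tail
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Moment-based kurtosis: about 3 for a normal distribution
kurtosis <- function(x) {
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2
}

# Made-up samples
symmetric    <- c(1, 2, 3, 4, 5)
right_skewed <- c(1, 1, 2, 2, 3, 10)

print(skewness(symmetric))
print(skewness(right_skewed))
print(kurtosis(symmetric))
```

A kurtosis well below 3, as for the evenly spread sample here, points to light tails, while values well above 3 point to heavy tails or outliers.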
Let’s see what code is required for such a chart:
# Prepare Data
hist_df <- df2 %>%
mutate(ref.per.local = (Value2 / SP.POP.TOTL) * 100) %>%
arrange(desc(Value2)) %>%
head(50)
# Chart
histo <- ggplot(hist_df, aes(ref.per.local)) +
geom_histogram( colour = "white", fill = "#0072bc") +
geom_hline(yintercept = 0, size = 1, colour = "#333333") +
unhcr_style() +
scale_x_continuous(limits = c(0, 20)) +
labs(y = "Count of countries",
title = "Only 2 countries have more than 5 refugees per 100 locals",
subtitle = "Distribution of refugee to local ratio for top 50 refugee hosting countries in 2016",
caption = "UNHCR http://popstats.unhcr.org - World Bank")
generate this chart:
3.8 Scatterplot
Scatter plots are used to check for correlation. A trend line can be added to compare two sets of measures and determine whether, as one set goes up, the other set correspondingly goes up or down, and how strongly.
Scatter plots are also a good way to identify clusters of observations.
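The strength and direction of such a relationship can be quantified alongside the chart. The sketch below uses made-up paired measurements with base R's cor() and lm(); it assumes a linear relationship, which a scatter plot should be used to check first.

```r
# Made-up paired measurements
x      <- c(1, 2, 3, 4, 5, 6)
y_up   <- c(2.1, 3.9, 6.2, 8.1, 9.8, 12.2)  # rises with x
y_down <- rev(y_up)                          # falls as x rises

# Pearson correlation: close to +1 or -1 means a strong linear relationship
r_up   <- cor(x, y_up)
r_down <- cor(x, y_down)

# Intercept and slope of the straight trend line through the points
trend <- coef(lm(y_up ~ x))

print(c(r_up = r_up, r_down = r_down))
print(trend)
```

In ggplot2, the equivalent trend line is added with geom_smooth(method = "lm").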
Let’s see what code is required for such a chart:
## Chart
scatter <- ggplot(df2, aes(y = Value2, x = NY.GDP.MKTP.CD)) +
geom_point(aes(col = REGION_UN)) +
#geom_smooth(method = "loess", se = F) +
unhcr_style() +
scale_x_continuous( label = format_si()) + ## Format axis number
scale_y_continuous( label = format_si(),
limits = c(0, 1000000)) + ## Format axis number
scale_color_viridis_d(direction = -1) +
labs(title = "Refugee hosting is not correlated with Economic Wealth",
subtitle = "Refugee population Vs GDP",
y = "Refugee",
x = "Gross domestic product (GDP)",
caption = "2016 Figures, UNHCR http://popstats.unhcr.org, World bank") +
theme(axis.title = element_text(size = 12))
generate this chart:
3.9 Maps
A map is a graphic representation or scale model of spatial concepts. It is a means for conveying geographic information. Maps are a universal medium for communication, easily understood and appreciated by most people, regardless of language or culture. Maps are not realistic representations of the actual world. All maps are estimations, generalizations, and interpretations of true geographic conditions.
One key rule when creating a map is to match the symbology to the type of value:
- Absolute value: Proportional symbol
- Relative value (ratio): Choropleth
Let’s see what code is required for such a map:
# Merge data with geographic coordinates
world <- merge(x = world , y = df2, by.y = "CountryAsylumCode" , by.x = "iso_a3")
df3 <- merge(x = df2 , y = world_points, by.x = "CountryAsylumCode" , by.y = "iso_a3")
# plot
map <- ggplot(data = world) +
geom_sf(fill = "antiquewhite", colour = "#7f7f7f", size = 0.2) +
coord_sf(xlim = c(-25, 65), ylim = c(25, 75), expand = FALSE) + ## Clipping on Mediterranean Sea
geom_point(data = df3, aes(x = X, y = Y , size = Value2 ),
alpha = 0.6, colour = "red") +
scale_size_area( max_size = 20) +
xlab("") +
ylab("") +
ggtitle("Refugee Distribution") +
unhcr_style() +
theme(panel.grid.major = element_line(color = gray(.5),
linetype = "dashed", size = 0.5),
panel.background = element_rect(fill = "aliceblue"),
axis.text.x = element_blank(),
axis.text.y = element_blank(),
legend.position = "none"
)
generate this chart:
4 Work with small multiples
Small multiple charts are easy to create with ggplot: it’s called faceting.
4.1 Facets
- With four or more data series, an array of individual charts can display a pattern and allows better comparison among all lines.
If you have data that you want to visualize split up by some variable, you need to use facet_wrap or facet_grid.
Add the variable you want to divide by to this line of code: facet_wrap( ~ variable).
An additional argument to facet_wrap, ncol, allows you to specify the number of columns:
#Prepare data
facet <- time_series2 %>%
filter(Population.type == "REF" & !(is.na(REGION_UN))) %>%
group_by(Year, REGION_UN ) %>%
summarise(Value2 = sum(Value) )
#Make plot
facet_plot <- ggplot() +
geom_area(data = facet, aes(x = Year, y = Value2, fill = REGION_UN)) +
scale_fill_viridis_d() + ## Colour-blind friendly palette; the areas map REGION_UN to fill, so a fill scale is needed
facet_wrap( ~ REGION_UN, ncol = 5) +
scale_y_continuous(labels = format_si()) +
unhcr_style() +
geom_hline(yintercept = 0, size = 1, colour = "#333333") +
theme(legend.position = "none",
axis.text.x = element_blank()) +
labs(title = "Africa & Asia are hosting the biggest Refugee population",
subtitle = "Refugee Population growth by continent, 1951-2016")
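The difference between the two faceting functions can be sketched on made-up data: facet_wrap() takes one variable and wraps its panels into rows and columns, while facet_grid() builds a matrix of panels from two variables. The data frame toy and its columns are assumptions for illustration only.

```r
library(ggplot2)

# Made-up data: two regions x two population types over three years
toy <- expand.grid(
  Year = 2015:2017,
  Region = c("Africa", "Asia"),
  Type = c("REF", "IDP"),
  stringsAsFactors = FALSE
)
toy$Value <- seq_len(nrow(toy)) * 1000

# One variable, wrapped into a single column of panels
wrapped <- ggplot(toy, aes(x = Year, y = Value)) +
  geom_line(aes(colour = Type)) +
  facet_wrap(~ Region, ncol = 1)

# Two variables: rows by Type, columns by Region
gridded <- ggplot(toy, aes(x = Year, y = Value)) +
  geom_line() +
  facet_grid(Type ~ Region)
```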
4.2 Free scales
You may have noticed in the chart above that Oceania, with its relatively small population, has disappeared completely.
By default, faceting uses fixed axis scales across the small multiples. It’s always best to use the same y axis scale across small multiples, to avoid misleading readers, but sometimes you may need to set the scales independently for each multiple, which we can do by adding the argument scales = "free".
If you just want to free the scales for one axis, set the argument to "free_x" or "free_y".
#Make plot
facet_plot_free <- ggplot() +
geom_area(data = facet, aes(x = Year, y = Value2, fill = REGION_UN)) +
facet_wrap(~ REGION_UN, scales = "free") +
unhcr_style() +
scale_fill_viridis_d() + ## Colour-blind friendly palette; the areas map REGION_UN to fill, so a fill scale is needed
geom_hline(yintercept = 0, size = 1, colour = "#333333") +
theme(legend.position = "none",
axis.text.x = element_blank(),
axis.text.y = element_blank()) +
labs(title = "It's all relative",
subtitle = "Refugee Population growth by continent, 1951-2016")
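A common compromise is to free only the y axis, so that each panel gets its own vertical range while the years stay aligned across panels. The sketch below uses made-up data (toy_regions is an assumption); the value is passed as the string "free_y" to the scales argument.

```r
library(ggplot2)

# Made-up data: one large and one small region
toy_regions <- data.frame(
  Year = rep(2014:2017, times = 2),
  Region = rep(c("Asia", "Oceania"), each = 4),
  Value = c(4e6, 5e6, 6e6, 7e6, 4e4, 5e4, 6e4, 7e4)
)

# Only the y axis is freed; the x axis stays shared across panels
free_y_plot <- ggplot(toy_regions, aes(x = Year, y = Value)) +
  geom_area() +
  facet_wrap(~ Region, scales = "free_y")
```

When the scales are freed, keeping the y axis labels visible helps readers notice that the panels are no longer directly comparable.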
5 Make changes to the legend
5.1 Remove the legend
Remove the legend where you can - it’s better to label data directly with text annotations.
Use guides(colour = "none") to remove the legend for a specific aesthetic (replace colour with the relevant aesthetic). Older code often uses guides(colour = FALSE), which recent ggplot2 versions deprecate.
multiple_line2 <- multiple_line + guides(colour = "none")
You can also remove all legends in one go using theme(legend.position = "none"):
multiple_line2 <- multiple_line + theme(legend.position = "none")
5.2 Change the position of the legend
The legend’s default position is at the top of your plot. Move it to the left, right or bottom outside the plot with:
multiple_line2 <- multiple_line + theme(legend.position = "bottom")
To be really precise about where we want our legend to go, instead of specifying “right” or “top” to change the general position of where the legend appears in our chart, we can give it specific coordinates.
For example, legend.position = c(0.98, 0.1) will move the legend to the bottom right. For reference, c(0, 0) is bottom left, c(1, 0) is bottom right, c(0, 1) is top left, and so on. Finding the exact position may involve some trial and error.
To check the exact position where the legend appears in your finalized plot, you will have to check the file that is saved out after you run your finalise_plot() function, as the position is relative to the dimensions of the plot.
multiple_line2 <- multiple_line +
theme(legend.position = c(0.1,0.5),
legend.direction = "vertical") +
labs(title = "Refugees Population are not equally spread",
subtitle = "World wide refugee population 1951-2017",
caption = "UNHCR http://popstats.unhcr.org")
To get the legend flush against the left side of your chart, it may be easier to set a negative left margin for the legend using legend.margin. The syntax is margin(top, right, bottom, left).
You’ll have to experiment to find the correct number to set the margin to for your chart - save it out with finalise_plot() and see how it looks.
multiple_line2 <- multiple_line +
theme(legend.margin = margin(0, 0, 0, -200))
5.3 Remove the legend title
Remove the legend title by tweaking your theme(). Don’t forget that for any changes to the theme to work, they must be added after you’ve called unhcr_style()!
multiple_line2 <- multiple_line +
theme(legend.title = element_blank())
5.4 Reverse the order of your legend
Sometimes you need to change the order of your legend for it to match the order of your bars. For this, you need guides:
multiple_line2 <- multiple_line +
guides(fill = guide_legend(reverse = TRUE))
5.5 Rearrange the layout of your legend
If you’ve got many values in your legend, you may need to rearrange the layout for aesthetic reasons.
You can specify the number of rows you want your legend to have as an argument to guides. The below code snippet, for instance, will create a legend with 2 rows:
multiple_line2 <- multiple_line +
theme(legend.direction = "horizontal") +
guides(fill = guide_legend(nrow = 2, byrow = TRUE))
You may need to change fill in the code above to whatever aesthetic your legend is describing, e.g. size, colour, etc.
5.6 Change the appearance of your legend symbols
You can override the default appearance of the legend symbols, without changing the way they appear in the plot, by adding the argument override.aes to guides.
The below will make the size of the legend symbols larger, for instance:
multiple_line2 <- multiple_line +
guides(fill = guide_legend(override.aes = list(size = 2)))
5.7 Add space between your legend labels
The default ggplot legend has almost no space between individual legend items. Not ideal.
You can add space by changing the scale labels manually.
For instance, if you have set the colour of your geoms to be dependent on your data, you will get a legend for the colour, and you can tweak the exact labels to get some extra space in by using the below snippet. Note that scale_colour_manual() also needs a values argument with the colours for your data, so complete the snippet before uncommenting it:
# multiple_line2 <- multiple_line +
# scale_colour_manual(labels = function(x) paste0(" ", x))
If your legend is showing something different, you will need to change the code accordingly. For instance, for fill, you will need scale_fill_manual() instead.
6 Make changes to the axes
6.1 Add/remove gridlines
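Remove the grid lines along an axis by setting the matching theme element to element_blank(): panel.grid.major.y = element_blank() for the y axis or, as below, panel.grid.major.x = element_blank() for the x axis.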
bars2 <- bars +
theme(panel.grid.major.x = element_blank())
6.2 Change the axis text manually
You can change the axis text labels freely with scale_y_continuous or scale_x_continuous:
bars2 <- bars + scale_y_continuous(limits = c(0, 1000000),
breaks = seq(0, 1000000, by = 200000),
labels = c("0", "200k", "400k", "600k", "800k", "1M"))
This will also specify the limits of your plot as well as where you want axis ticks.
6.3 Add thousand separators to your axis labels
You can specify that you want your axis text to have thousand separators with an argument to scale_y_continuous.
There are two ways of doing this, one in base R which is a bit fiddly:
bars2 <- bars + scale_y_continuous(labels = function(x) format(x, big.mark = ",",
scientific = FALSE))
The second way relies on the scales package, but is much more concise:
bars2 <- bars + scale_y_continuous(labels = scales::comma)
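To see what the base R formatter produces, it can be run on a few break values outside of any plot. One detail worth knowing: format() pads all values to a common width by default, so the sketch below adds trim = TRUE, which is an addition to the cookbook snippet rather than part of it. The name fmt and the sample breaks are illustrative assumptions.

```r
# Base R label function with thousand separators; trim = TRUE drops
# the leading spaces that format() would otherwise add as padding
fmt <- function(x) format(x, big.mark = ",", scientific = FALSE, trim = TRUE)

breaks <- c(0, 200000, 1000000)
labels_trimmed <- fmt(breaks)
labels_trimmed
# [1] "0"         "200,000"   "1,000,000"

# Without trim, shorter values are left-padded to the widest one
labels_padded <- format(breaks, big.mark = ",", scientific = FALSE)
```

A function like fmt can be passed directly as the labels argument of scale_y_continuous.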
6.4 Add text to your axis labels
This is also easy to add with an argument to scale_y_continuous:
bars2 <- bars + scale_y_continuous(labels = function(x) paste0(x, " Ref."))
6.5 Change the plot limits
The long way of setting the limits of your plot explicitly is with scale_y_continuous as above. But if you don’t need to specify the breaks or labels the shorthand way of doing it is with xlim or ylim:
bars2 <- bars + ylim(c(0,500000))
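One caveat, stated here as general ggplot2 behaviour rather than something the cookbook covers: limits set through ylim() or a scale drop observations that fall outside the range, while coord_cartesian() only zooms the view. The sketch below contrasts the two on made-up data (toy_bars is an assumption).

```r
library(ggplot2)

# Made-up data: the third bar exceeds the limit used below
toy_bars <- data.frame(
  Country = c("A", "B", "C"),
  Value = c(100000, 300000, 900000)
)

base_plot <- ggplot(toy_bars, aes(x = Country, y = Value)) +
  geom_col()

# Scale limits: the 900,000 bar falls outside the range and is
# removed (with a warning) when the plot is drawn
clipped <- base_plot + ylim(c(0, 500000))

# Coordinate limits: every bar is kept, the view is cut at 500,000
zoomed <- base_plot + coord_cartesian(ylim = c(0, 500000))
```

For bar charts, where a missing bar would be misleading, coord_cartesian() is usually the safer choice.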
6.6 Add axis titles
Our default theme has no axis titles, but you may wish to add them in manually. This is done by modifying theme() - note that you must do this after the call to unhcr_style() or your changes will be overridden:
bars2 <- bars +
theme(axis.title = element_text(size = 18))
6.7 Modify axis titles
If you add in axis titles, they will by default be the column names in your data set. You can change this to anything you want in your call to labs().
For instance, if you wish your x axis title to be “Country” and your y axis title to be “Population”, this would be the format:
bars3 <- bars2 +
labs(x = "Country", y = "Population")
6.8 Add axis ticks
You can add axis tick marks by adding axis.ticks.x or axis.ticks.y to your theme:
multiple_line2 <- multiple_line +
theme(
axis.ticks.x = element_line(colour = "#333333"),
axis.ticks.length = unit(0.26, "cm"))
7 Add annotations
7.1 Insert text within chart
The easiest way to add a text annotation to your plot is using geom_label:
multiple_line2 <- multiple_line +
geom_label(aes(x = 1990, y = 5000000, label = "I'm an annotation!"),
hjust = 0,
vjust = 0.5,
colour = "#555555",
fill = "white",
label.size = NA,
family = "Lato",
size = 6)
The exact positioning of the annotation will depend on the x and y arguments (which is a bit fiddly!) and the text alignment, using hjust and vjust - but more on that below.
Add line breaks where necessary in your label with \n, and set the line height with lineheight.
multiple_line2 <- multiple_line +
geom_label(aes(x = 1990, y = 5000000,
label = "I'm quite a long\nannotation over\nthree rows"),
hjust = 0,
vjust = 0.5,
lineheight = 0.8,
colour = "#555555",
fill = "white",
label.size = NA,
family = "Lato",
size = 6)
Let’s get our direct labels in there!
multiple_line2 <- multiple_line +
theme(legend.position = "none") +
xlim(c(1950, 2028)) +
geom_label(aes(x = 2017, y = 5531693, label = "Africa"),
hjust = 0,
vjust = 0.5,
colour = "Black",
fill = "white",
label.size = NA,
family = "Lato",
size = 6) +
geom_label(aes(x = 2017, y = 693600, label = "America"),
hjust = 0,
vjust = 0.5,
colour = "Black",
fill = "white",
label.size = NA,
family = "Lato",
size = 6) +
geom_label(aes(x = 2017, y = 8608597, label = "Asia"),
hjust = 0,
vjust = 0.5,
colour = "Black",
fill = "white",
label.size = NA,
family = "Lato",
size = 6) +
geom_label(aes(x = 2017, y = 2300833, label = "Europe"),
hjust = 0,
vjust = 0.5,
colour = "Black",
fill = "white",
label.size = NA,
family = "Lato",
size = 6) +
geom_label(aes(x = 2017, y = 53671, label = "Oceania"),
hjust = 0,
vjust = 0.5,
colour = "Black",
fill = "white",
label.size = NA,
family = "Lato",
size = 6)
7.2 Left-align/right-align text
The arguments hjust and vjust dictate horizontal and vertical text alignment. They can have a value between 0 and 1, where 0 is left-justified and 1 is right-justified (or bottom- and top-justified for vertical alignment).
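A quick way to build intuition for these values is to draw the same anchor point with different hjust settings. The sketch below is a minimal illustration on made-up data (align_demo is an assumption): the dashed line marks the shared x position, and each label sits to the right of it, centred on it, or to the left of it depending on hjust.

```r
library(ggplot2)

# Three labels anchored at the same x position, differing only in hjust
align_demo <- data.frame(
  x = 1,
  y = c(3, 2, 1),
  hjust = c(0, 0.5, 1),
  label = c("hjust = 0", "hjust = 0.5", "hjust = 1")
)

align_plot <- ggplot(align_demo, aes(x = x, y = y, label = label)) +
  geom_vline(xintercept = 1, linetype = "dashed") +  # the anchor position
  geom_text(aes(hjust = hjust)) +
  xlim(c(0, 2))
```

With hjust = 0 the text starts at the anchor and runs to the right, which is why the direct labels earlier in this section use it to sit just after the end of each line.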
7.3 Add labels based on your data
The above method for adding annotations to your chart lets you specify the x and y coordinates exactly. This is very useful if we want to add a text annotation in a specific place, but would be very tedious to repeat.
Fortunately, if you want to add labels to all your data points, you can simply set the position based on your data instead.
Let’s say we want to add data labels to our bar chart:
labelled.bars <- bars +
geom_label(aes(x = CountryAsylumName, y = Value2, label = round(Value2, 0)),
hjust = 1,
vjust = 0.5,
colour = "white",
fill = NA,
label.size = NA,
family = "Lato",
size = 6)
The above code automatically adds one text label for each country, without us having to add geom_label once per bar.
(If you’re confused about why we’re setting the x as the countries and y as the refugee population, when the chart appears to be drawing them the other way around, it’s because we’ve flipped the coordinates of the plot using coord_flip().)
7.4 Add left-aligned labels to bar charts
If you’d rather add left-aligned labels for your bars, just set the x argument based on your data, but specify the y argument directly instead, with a numeric value.
The exact value of y will depend on the range of your data.
labelled.bars.v2 <- bars +
geom_label(aes(x = CountryAsylumName,
y = 4,
label = round(Value2, 0)),
hjust = 0,
vjust = 0.5,
colour = "white",
fill = NA,
label.size = NA,
family = "Lato",
size = 6)
7.5 Add a line
Add a line with geom_segment:
multiple_line2 <- multiple_line +
geom_segment(aes(x = 1979, y = 4500000, xend = 1965, yend = 4300000),
colour = "#555555",
size = 4)
The size argument specifies the thickness of the line.
7.6 Add a curved line
For a curved line, use geom_curve instead of geom_segment:
multiple_line2 <- multiple_line + geom_curve(aes(x = 1979, y = 4500000, xend = 1965, yend = 4300000),
colour = "#555555",
curvature = -0.2,
size = 0.5)
The curvature argument sets the amount of curve: 0 is a straight line, negative values give a left-hand curve and positive values give a right-hand curve.
7.7 Add an arrow
Turning a line into an arrow is fairly straightforward: just add the arrow argument to your geom_segment or geom_curve:
multiple_line2 <- multiple_line + geom_curve(aes(x = 1979, y = 4500000, xend = 1965, yend = 4300000),
colour = "#555555",
size = 0.5,
curvature = -0.2,
arrow = arrow(length = unit(0.03, "npc")))
The first argument to unit sets the size of the arrowhead.
7.8 Add a line across the whole plot
The easiest way to add a line across the whole plot is with geom_vline(), for a vertical line, or geom_hline(), for a horizontal one.
Optional additional arguments allow you to specify the size, color and type of line (the default option is a solid one).
multiple_line2 <- multiple_line +
geom_hline(yintercept = 10000000, size = 1, colour = "red", linetype = "dashed")
The line obviously doesn’t add much in this example, but this is useful if you want to highlight something, e.g. a threshold level, or an average value.
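For the average-value case, the intercept can be computed from the data instead of typed in by hand. The following is a minimal sketch on made-up figures; toy_series and average_value are assumptions for illustration and do not come from the cookbook data.

```r
library(ggplot2)

# Made-up yearly totals
toy_series <- data.frame(
  Year = 2010:2017,
  Value = c(2, 3, 3, 4, 6, 7, 8, 9) * 1e6
)

# Reference level derived from the data itself
average_value <- mean(toy_series$Value)

average_plot <- ggplot(toy_series, aes(x = Year, y = Value)) +
  geom_line() +
  geom_hline(yintercept = average_value, colour = "red", linetype = "dashed")
```

Computing the level this way keeps the reference line correct when the underlying data is refreshed.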
It’s also especially useful because our design style - as you may already have noticed from the charts on this page - is to add a vertical or horizontal baseline to our charts. This is the code to use:
multiple_line2 <- multiple_line +
geom_hline(yintercept = 0, size = 1, colour = "#333333")
8 Do something else entirely
8.1 Increase or decrease margins
You can change the margin around almost any element of your plot - the title, subtitles, legend - or the plot itself.
You shouldn’t ordinarily need to change the default margins from the theme but if you do, the syntax is theme(ELEMENT = element_text(margin = margin(0, 5, 10, 0))).
The numbers specify the top, right, bottom, and left margin respectively - but you can also specify directly which margin you want to change. For example, let’s try giving the subtitle an extra-large bottom margin:
bars2 <- bars +
theme(plot.subtitle = element_text(margin = margin(b = 75)))
Hm… maybe not.
8.2 Exporting your plot and x-axis margins
You do need to think about your x-axis margin sizes when you are producing a plot that is beyond the default height, which is 450px. This could be the case for example if you are creating a bar chart with lots of bars and want to make sure there is some breathing space between each bar and labels. If you do leave the margins as they are for plots with a greater height, then you could get a larger gap between the axis and your labels.
Here is a guide that we work to when it comes to the margins and the height of your bar chart (with coord_flip applied to it):
plot height | t | b |
---|---|---|
550px | 5 | 10 |
650px | 7 | 10 |
750px | 10 | 10 |
850px | 14 | 10 |
So what you’d need to do is add this code to your chart if for example you wanted the height of your plot to be 650px instead of 450px.
bar_chart_tall <- bars +
theme(axis.text.x = element_text(margin = margin(t = 7, b = 10)))
#bar_chart_tall
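If you export at several heights, the guide table can be kept as a small lookup rather than retyped each time. This is only a sketch: margin_guide copies the values from the table above, while the helper margins_for_height is a hypothetical name introduced here for illustration.

```r
# Guide values copied from the table above, keyed by plot height in pixels
margin_guide <- data.frame(
  height = c(550, 650, 750, 850),
  t = c(5, 7, 10, 14),
  b = c(10, 10, 10, 10)
)

# Hypothetical helper: return the t and b margins for a given height
margins_for_height <- function(height_px) {
  row <- margin_guide[margin_guide$height == height_px, ]
  if (nrow(row) == 0) stop("No guide value for height ", height_px)
  c(t = row$t, b = row$b)
}

m <- margins_for_height(650)
```

The result can then be passed on, for example as margin(t = m[["t"]], b = m[["b"]]) inside element_text().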
Although it is much less likely, if you do want to do the equivalent for a line chart and export it at a larger than default height, you need to do the same but change your values for t to negative values based on the table above.
8.3 Reorder bars manually
Sometimes you need to order your data in a way that isn’t alphabetical or reordered by size.
To order these correctly you need to set your data’s factor levels before making the plot.
Specify the order you want the categories to be plotted in the levels argument:
dataset$column <- factor(dataset$column, levels = c("18-24","25-64","65+"))
You can also use this to reorder the stacks of a stacked bar chart.
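The effect of setting levels is easiest to see on categories whose natural order differs from alphabetical order. The base R sketch below uses made-up data (severity is an assumption) to compare the default levels with explicit ones.

```r
# Made-up categories with a natural order that is not alphabetical
severity <- c("Medium", "Low", "High", "Low")

# Default: levels are sorted alphabetically
default_levels <- levels(factor(severity))

# Explicit levels fix the order used for plotting
ordered_severity <- factor(severity, levels = c("Low", "Medium", "High"))
explicit_levels <- levels(ordered_severity)
```

The values themselves are unchanged; only the order in which ggplot2 lays out the categories is affected.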
8.4 Colour bars conditionally
You can set aesthetic values like fill, alpha, size conditionally with ifelse().
The syntax is fill = ifelse(logical_condition, fill_if_true, fill_if_false).
highlighted <- ggplot(bar_df,
aes(x = reorder(CountryAsylumName, Value2), y = Value2)) +
geom_bar(stat = "identity", position = "identity",
fill = ifelse(bar_df$CountryAsylumName == "Turkey", "#0072bc", "#CCCCCC")) +
geom_hline(yintercept = 0, size = 1, colour = "#333333") +
unhcr_style() +
coord_flip() +
scale_y_continuous(labels = format_si()) + ## Format axis number
labs(title = "Turkey is by far the biggest Refugee hosting country",
subtitle = "Top 10 Refugee Population per country in 2017") +
theme(panel.grid.major.x = element_line(color = "#cbcbcb"),
panel.grid.major.y = element_blank())
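The reason this works is that ifelse() is vectorised: it returns one colour per row, in the same order as the data. The base R sketch below shows the result on a made-up vector of country names (countries is an assumption for illustration).

```r
# Made-up country names, in the same order as the rows of a data frame
countries <- c("Turkey", "Pakistan", "Uganda")

# One colour per element: highlight colour for the match, grey otherwise
fill_colours <- ifelse(countries == "Turkey", "#0072bc", "#CCCCCC")
fill_colours
# [1] "#0072bc" "#CCCCCC" "#CCCCCC"
```

Because the colours are matched to rows by position, the vector has to be built from the same data frame that is passed to ggplot().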