VISUALISE species AS x, bill_len AS y FROM ggsql:penguins
DRAW boxplot
Box plots
Boxplots are a popular way to summarise the distribution of a single continuous variable, optionally split into groups. Keep in mind that a boxplot hides the actual shape of the data behind that summary, which matters for example when the data is bi- or multi-modal. For every group, a boxplot displays the following 6 things:
- The 25th percentile, or Q1, as the start of the box.
- The 50th percentile, i.e. median or Q2, as a line across the box.
- The 75th percentile, or Q3, as the end of the box. Together with Q1 we can compute the interquartile range: IQR = Q3 - Q1.
- The minimum data value or Q1 - 1.5 * IQR, whichever is larger. This is displayed as the lower whisker.
- The maximum data value or Q3 + 1.5 * IQR, whichever is smaller. This is displayed as the upper whisker.
- Outliers outside the whiskers, if present. These are drawn as individual points.
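The six components listed above can be reproduced by hand. The following is a minimal Python sketch, not ggsql's implementation: the function name boxplot_summary is hypothetical, and the quantile convention (method="inclusive") is an assumption, since the page does not say which convention ggsql uses. Other conventions give slightly different Q1 and Q3 values.

```python
import statistics


def boxplot_summary(values):
    """Compute the six boxplot components for one group of numbers.

    Illustrative only: the quantile method is an assumption, not
    necessarily what ggsql uses internally.
    """
    data = sorted(values)
    # With n=4, statistics.quantiles returns the three quartile cut points.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        # Minimum data value or Q1 - 1.5 * IQR, whichever is larger.
        "lower_whisker": max(data[0], lower_fence),
        # Maximum data value or Q3 + 1.5 * IQR, whichever is smaller.
        "upper_whisker": min(data[-1], upper_fence),
        # Points beyond the fences are drawn individually.
        "outliers": [v for v in data if v < lower_fence or v > upper_fence],
    }


if __name__ == "__main__":
    # Made-up numbers with one obviously extreme value.
    print(boxplot_summary([1, 2, 3, 4, 5, 6, 7, 8, 9, 30]))
```

In this made-up sample the value 30 lies beyond Q3 + 1.5 * IQR, so it is reported as an outlier and the upper whisker stops at the fence instead of at the maximum.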
Code
Explanation
- The VISUALISE ... FROM ggsql:penguins clause loads the built-in penguins dataset.
- species AS x sets a categorical variable to separate different groups.
- bill_len AS y sets the numeric variable to summarise.
- DRAW boxplot gives instructions to draw the boxplot layer.
Variations
Dodging
You can split the groups further by mapping a second categorical variable in addition to the one on the axis. The boxplots within each axis category are then dodged, that is, drawn side by side.
VISUALISE species AS x, bill_len AS y, island AS fill FROM ggsql:penguins
DRAW boxplot
However, dodging can be unproductive or counterintuitive in some cases. For example, if we double-encode groups, such as mapping species to both x and fill in the plot below, dodging looks bad.
VISUALISE species AS x, bill_len AS y, species AS fill FROM ggsql:penguins
DRAW boxplot
We can disable the dodging by setting position => 'identity'.
VISUALISE species AS x, bill_len AS y, species AS fill FROM ggsql:penguins
DRAW boxplot SETTING position => 'identity'
Horizontal
To draw the boxplots horizontally, swap the x and y mappings. The orientation is detected automatically based on which variable is continuous and which is discrete.
VISUALISE bill_len AS x, species AS y, island AS fill FROM ggsql:penguins
DRAW boxplot
With individual datapoints
Because a boxplot is a summary, it may be a good idea to supplement it with the individual datapoints so that you can’t be accused of ‘hiding’ the distribution. The datapoints can be jittered by setting position => 'jitter' on the point layer. When you do this, set outliers => false on the boxplot layer so that the outlier points are not drawn twice across the two layers.
VISUALISE species AS x, bill_len AS y FROM ggsql:penguins
DRAW point SETTING position => 'jitter'
DRAW boxplot SETTING outliers => false