R for Data Science

Data Visualisation

Learnt from R for Data Science - Data visualisation

You can find more examples in the ggplot2 documentation.

Prerequisites

# Install
# install.packages("gridExtra")
# install.packages("tidyverse")
# install.packages("maps")
# install.packages("mapproj")

library(tidyverse)
library(gridExtra)

First steps

The `mpg` data frame

mpg contains observations collected by the US Environmental Protection Agency on 38 models of car.
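
As a quick check of the description above, the following sketch inspects the shape of the dataset and counts its distinct models. It assumes only that ggplot2 (which ships `mpg`) is installed; the variable name `n_models` is ours.

```r
library(ggplot2)  # mpg ships with ggplot2

dim(mpg)                               # number of rows and columns
n_models <- length(unique(mpg$model))  # distinct car models in the data
n_models
names(mpg)                             # the variables
```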

mpg

A tibble: 234 × 11
manufacturer	model	displ	year	cyl	trans	drv	cty	hwy	fl	class
<chr>	<chr>	<dbl>	<int>	<int>	<chr>	<chr>	<int>	<int>	<chr>	<chr>
audi	a4	1.8	1999	4	auto(l5)	f	18	29	p	compact
audi	a4	1.8	1999	4	manual(m5)	f	21	29	p	compact
audi	a4	2.0	2008	4	manual(m6)	f	20	31	p	compact
audi	a4	2.0	2008	4	auto(av)	f	21	30	p	compact
audi	a4	2.8	1999	6	auto(l5)	f	16	26	p	compact
audi	a4	2.8	1999	6	manual(m5)	f	18	26	p	compact
audi	a4	3.1	2008	6	auto(av)	f	18	27	p	compact
audi	a4 quattro	1.8	1999	4	manual(m5)	4	18	26	p	compact
audi	a4 quattro	1.8	1999	4	auto(l5)	4	16	25	p	compact
audi	a4 quattro	2.0	2008	4	manual(m6)	4	20	28	p	compact
audi	a4 quattro	2.0	2008	4	auto(s6)	4	19	27	p	compact
audi	a4 quattro	2.8	1999	6	auto(l5)	4	15	25	p	compact
audi	a4 quattro	2.8	1999	6	manual(m5)	4	17	25	p	compact
audi	a4 quattro	3.1	2008	6	auto(s6)	4	17	25	p	compact
audi	a4 quattro	3.1	2008	6	manual(m6)	4	15	25	p	compact
audi	a6 quattro	2.8	1999	6	auto(l5)	4	15	24	p	midsize
audi	a6 quattro	3.1	2008	6	auto(s6)	4	17	25	p	midsize
audi	a6 quattro	4.2	2008	8	auto(s6)	4	16	23	p	midsize
chevrolet	c1500 suburban 2wd	5.3	2008	8	auto(l4)	r	14	20	r	suv
chevrolet	c1500 suburban 2wd	5.3	2008	8	auto(l4)	r	11	15	e	suv
chevrolet	c1500 suburban 2wd	5.3	2008	8	auto(l4)	r	14	20	r	suv
chevrolet	c1500 suburban 2wd	5.7	1999	8	auto(l4)	r	13	17	r	suv
chevrolet	c1500 suburban 2wd	6.0	2008	8	auto(l4)	r	12	17	r	suv
chevrolet	corvette	5.7	1999	8	manual(m6)	r	16	26	p	2seater
chevrolet	corvette	5.7	1999	8	auto(l4)	r	15	23	p	2seater
chevrolet	corvette	6.2	2008	8	manual(m6)	r	16	26	p	2seater
chevrolet	corvette	6.2	2008	8	auto(s6)	r	15	25	p	2seater
chevrolet	corvette	7.0	2008	8	manual(m6)	r	15	24	p	2seater
chevrolet	k1500 tahoe 4wd	5.3	2008	8	auto(l4)	4	14	19	r	suv
chevrolet	k1500 tahoe 4wd	5.3	2008	8	auto(l4)	4	11	14	e	suv
⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮
toyota	toyota tacoma 4wd	3.4	1999	6	auto(l4)	4	15	19	r	pickup
toyota	toyota tacoma 4wd	4.0	2008	6	manual(m6)	4	15	18	r	pickup
toyota	toyota tacoma 4wd	4.0	2008	6	auto(l5)	4	16	20	r	pickup
volkswagen	gti	2.0	1999	4	manual(m5)	f	21	29	r	compact
volkswagen	gti	2.0	1999	4	auto(l4)	f	19	26	r	compact
volkswagen	gti	2.0	2008	4	manual(m6)	f	21	29	p	compact
volkswagen	gti	2.0	2008	4	auto(s6)	f	22	29	p	compact
volkswagen	gti	2.8	1999	6	manual(m5)	f	17	24	r	compact
volkswagen	jetta	1.9	1999	4	manual(m5)	f	33	44	d	compact
volkswagen	jetta	2.0	1999	4	manual(m5)	f	21	29	r	compact
volkswagen	jetta	2.0	1999	4	auto(l4)	f	19	26	r	compact
volkswagen	jetta	2.0	2008	4	auto(s6)	f	22	29	p	compact
volkswagen	jetta	2.0	2008	4	manual(m6)	f	21	29	p	compact
volkswagen	jetta	2.5	2008	5	auto(s6)	f	21	29	r	compact
volkswagen	jetta	2.5	2008	5	manual(m5)	f	21	29	r	compact
volkswagen	jetta	2.8	1999	6	auto(l4)	f	16	23	r	compact
volkswagen	jetta	2.8	1999	6	manual(m5)	f	17	24	r	compact
volkswagen	new beetle	1.9	1999	4	manual(m5)	f	35	44	d	subcompact
volkswagen	new beetle	1.9	1999	4	auto(l4)	f	29	41	d	subcompact
volkswagen	new beetle	2.0	1999	4	manual(m5)	f	21	29	r	subcompact
volkswagen	new beetle	2.0	1999	4	auto(l4)	f	19	26	r	subcompact
volkswagen	new beetle	2.5	2008	5	manual(m5)	f	20	28	r	subcompact
volkswagen	new beetle	2.5	2008	5	auto(s6)	f	20	29	r	subcompact
volkswagen	passat	1.8	1999	4	manual(m5)	f	21	29	p	midsize
volkswagen	passat	1.8	1999	4	auto(l5)	f	18	29	p	midsize
volkswagen	passat	2.0	2008	4	auto(s6)	f	19	28	p	midsize
volkswagen	passat	2.0	2008	4	manual(m6)	f	21	29	p	midsize
volkswagen	passat	2.8	1999	6	auto(l5)	f	16	26	p	midsize
volkswagen	passat	2.8	1999	6	manual(m5)	f	18	26	p	midsize
volkswagen	passat	3.6	2008	6	auto(s6)	f	17	26	p	midsize

?mpg

mpg                  package:ggplot2                   R Documentation

Fuel economy data from 1999 to 2008 for 38 popular models of cars

Description:

     This dataset contains a subset of the fuel economy data that the
     EPA makes available on <https://fueleconomy.gov/>. It contains
     only models which had a new release every year between 1999 and
     2008 - this was used as a proxy for the popularity of the car.

Usage:

     mpg
     
Format:

     A data frame with 234 rows and 11 variables:

     manufacturer manufacturer name

     model model name

     displ engine displacement, in litres

     year year of manufacture

     cyl number of cylinders

     trans type of transmission

     drv the type of drive train, where f = front-wheel drive, r = rear
          wheel drive, 4 = 4wd

     cty city miles per gallon

     hwy highway miles per gallon

     fl fuel type

     class "type" of car

Creating a ggplot

ggplot(data = mpg) +  # creates an empty graph that uses the mpg dataset
  geom_point(mapping = aes(x = displ, y = hwy, color = class, size = class, alpha = class))
# Each geom function in ggplot2 takes a mapping argument. 
# This defines how variables in your dataset are mapped to visual properties. 
# The mapping argument is always paired with aes(), and the x and y arguments of aes() 
# specify which variables to map to the x and y axes. 
# ggplot2 looks for the mapped variables in the data argument, in this case, mpg.

Warning message:
"Using size for a discrete variable is not advised."
Warning message:
"Using alpha for a discrete variable is not advised."

A graphing template

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy))
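
The call above follows a general pattern: one dataset, one geom function, one set of mappings. As a rough sketch of that pattern (the placeholder names in the comment and the variable `p_cty` are ours, not part of ggplot2), the same shape of code yields a different plot when only the mapping changes:

```r
library(ggplot2)

# Pattern (placeholders, not runnable as written):
#   ggplot(data = <DATA>) +
#     <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))

# One concrete instance: city mileage instead of highway mileage
p_cty <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = cty))
p_cty
```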

Aesthetic mappings

# Left
p1 <- ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, alpha = class))

# Right
p2 <- ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, shape = class))

options(repr.plot.width = 10, repr.plot.height = 5)
grid.arrange(p1, p2, ncol=2)

Warning message:
"Using alpha for a discrete variable is not advised."
Warning message:
"The shape palette can deal with a maximum of 6 discrete values because more
than 6 becomes difficult to discriminate
ℹ you have requested 7 values. Consider specifying shapes manually if you need
  that many have them."
Warning message:
"Removed 62 rows containing missing values (`geom_point()`)."

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")
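
In the call above, the colour is given outside `aes()`, so it sets a fixed value rather than mapping a variable. A minimal sketch of the difference, using the ggplot2 helper `layer_data()` to inspect the colours that end up on the points (the names `p_set` and `p_map` are illustrative):

```r
library(ggplot2)

# Setting: the colour is a fixed value, given outside aes()
p_set <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")

# Mapping: inside aes(), "blue" is treated as a one-level variable,
# so the points get a colour from the discrete colour scale and a legend
p_map <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))

unique(layer_data(p_set)$colour)  # the fixed colour that was set
unique(layer_data(p_map)$colour)  # a colour chosen by the scale
```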


?geom_point

geom_point               package:ggplot2               R Documentation

Points

Description:

     The point geom is used to create scatterplots. The scatterplot is
     most useful for displaying the relationship between two continuous
     variables. It can be used to compare one continuous and one
     categorical variable, or two categorical variables, but a
     variation like 'geom_jitter()', 'geom_count()', or 'geom_bin2d()'
     is usually more appropriate. A _bubblechart_ is a scatterplot with
     a third variable mapped to the size of points.

Usage:

     geom_point(
       mapping = NULL,
       data = NULL,
       stat = "identity",
       position = "identity",
       ...,
       na.rm = FALSE,
       show.legend = NA,
       inherit.aes = TRUE
     )
     
Arguments:

 mapping: Set of aesthetic mappings created by 'aes()'. If specified
          and 'inherit.aes = TRUE' (the default), it is combined with
          the default mapping at the top level of the plot. You must
          supply 'mapping' if there is no plot mapping.

    data: The data to be displayed in this layer. There are three
          options:

          If 'NULL', the default, the data is inherited from the plot
          data as specified in the call to 'ggplot()'.

          A 'data.frame', or other object, will override the plot data.
          All objects will be fortified to produce a data frame. See
          'fortify()' for which variables will be created.

          A 'function' will be called with a single argument, the plot
          data. The return value must be a 'data.frame', and will be
          used as the layer data. A 'function' can be created from a
          'formula' (e.g. '~ head(.x, 10)').

    stat: The statistical transformation to use on the data for this
          layer, either as a 'ggproto' 'Geom' subclass or as a string
          naming the stat stripped of the 'stat_' prefix (e.g.
          '"count"' rather than '"stat_count"')

position: Position adjustment, either as a string naming the adjustment
          (e.g. '"jitter"' to use 'position_jitter'), or the result of
          a call to a position adjustment function. Use the latter if
          you need to change the settings of the adjustment.

     ...: Other arguments passed on to 'layer()'. These are often
          aesthetics, used to set an aesthetic to a fixed value, like
          'colour = "red"' or 'size = 3'. They may also be parameters
          to the paired geom/stat.

   na.rm: If 'FALSE', the default, missing values are removed with a
          warning. If 'TRUE', missing values are silently removed.

show.legend: logical. Should this layer be included in the legends?
          'NA', the default, includes if any aesthetics are mapped.
          'FALSE' never includes, and 'TRUE' always includes. It can
          also be a named logical vector to finely select the
          aesthetics to display.

inherit.aes: If 'FALSE', overrides the default aesthetics, rather than
          combining with them. This is most useful for helper functions
          that define both data and aesthetics and shouldn't inherit
          behaviour from the default plot specification, e.g.
          'borders()'.

Overplotting:

     The biggest potential problem with a scatterplot is overplotting:
     whenever you have more than a few points, points may be plotted on
     top of one another. This can severely distort the visual
     appearance of the plot. There is no one solution to this problem,
     but there are some techniques that can help. You can add
     additional information with 'geom_smooth()', 'geom_quantile()' or
     'geom_density_2d()'. If you have few unique 'x' values,
     'geom_boxplot()' may also be useful.

     Alternatively, you can summarise the number of points at each
     location and display that in some way, using 'geom_count()',
     'geom_hex()', or 'geom_density2d()'.

     Another technique is to make the points transparent (e.g.
     'geom_point(alpha = 0.05)') or very small (e.g. 'geom_point(shape
     = ".")').

Aesthetics:

     'geom_point()' understands the following aesthetics (required
     aesthetics are in bold):

        * *'x'*

        * *'y'*

        * 'alpha'

        * 'colour'

        * 'fill'

        * 'group'

        * 'shape'

        * 'size'

        * 'stroke'

     Learn more about setting these aesthetics in
     'vignette("ggplot2-specs")'.

Examples:

     p <- ggplot(mtcars, aes(wt, mpg))
     p + geom_point()
     
     # Add aesthetic mappings
     p + geom_point(aes(colour = factor(cyl)))
     p + geom_point(aes(shape = factor(cyl)))
     # A "bubblechart":
     p + geom_point(aes(size = qsec))
     
     # Set aesthetics to fixed value
     ggplot(mtcars, aes(wt, mpg)) + geom_point(colour = "red", size = 3)
     
     
     # Varying alpha is useful for large datasets
     d <- ggplot(diamonds, aes(carat, price))
     d + geom_point(alpha = 1/10)
     d + geom_point(alpha = 1/20)
     d + geom_point(alpha = 1/100)
     
     
     # For shapes that have a border (like 21), you can colour the inside and
     # outside separately. Use the stroke aesthetic to modify the width of the
     # border
     ggplot(mtcars, aes(wt, mpg)) +
       geom_point(shape = 21, colour = "black", fill = "white", size = 5, stroke = 5)
     
     
     # You can create interesting shapes by layering multiple points of
     # different sizes
     p <- ggplot(mtcars, aes(mpg, wt, shape = factor(cyl)))
     p +
       geom_point(aes(colour = factor(cyl)), size = 4) +
       geom_point(colour = "grey90", size = 1.5)
     p +
       geom_point(colour = "black", size = 4.5) +
       geom_point(colour = "pink", size = 4) +
       geom_point(aes(shape = factor(cyl)))
     
     # geom_point warns when missing values have been dropped from the data set
     # and not plotted, you can turn this off by setting na.rm = TRUE
     set.seed(1)
     mtcars2 <- transform(mtcars, mpg = ifelse(runif(32) < 0.2, NA, mpg))
     ggplot(mtcars2, aes(wt, mpg)) +
       geom_point()
     ggplot(mtcars2, aes(wt, mpg)) +
       geom_point(na.rm = TRUE)

p <- ggplot(mtcars, aes(mpg, wt, shape = factor(cyl)))
p1 <- p +
  geom_point(aes(colour = factor(cyl)), size = 4) +
  geom_point(colour = "grey90", size = 1.5)
p2 <- p +
  geom_point(colour = "black", size = 4.5) +
  geom_point(colour = "pink", size = 4) +
  geom_point(aes(shape = factor(cyl)))

options(repr.plot.width = 10, repr.plot.height = 5)
grid.arrange(p1, p2, ncol=2)


# Exercise 3.3: colour the points by a logical condition
options(repr.plot.width = 10, repr.plot.height = 5)
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = displ < 5))

To facet your plot by a single variable, use facet_wrap(). The first argument of facet_wrap() should be a formula, which you create with ~ followed by a variable name (here “formula” is the name of a data structure in R, not a synonym for “equation”). The variable that you pass to facet_wrap() should be discrete.

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = class)) + 
  facet_wrap(~ class, nrow = 2)


To facet your plot on the combination of two variables, add facet_grid() to your plot call. The first argument of facet_grid() is also a formula. This time the formula should contain two variable names separated by a ~.

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = class)) + 
  facet_grid(drv ~ cyl)


If you prefer to not facet in the rows or columns dimension, use a . instead of a variable name, e.g. + facet_grid(. ~ cyl).

p1 <- ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = class)) +
  facet_grid(drv ~ .)

p2 <- ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = class)) +
  facet_grid(. ~ cyl)

options(repr.plot.width = 10, repr.plot.height = 5)
grid.arrange(p1, p2, ncol=2)
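
The formula notation is not the only way to name facetting variables. As a sketch, assuming a ggplot2 version that provides `vars()` together with the `rows` and `cols` arguments of `facet_grid()` (3.0.0 or later, to the best of our knowledge), the two plots above could also be written as follows. The names `base`, `p_rows` and `p_cols` are ours.

```r
library(ggplot2)

base <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = class))

p_rows <- base + facet_grid(rows = vars(drv))  # same layout as drv ~ .
p_cols <- base + facet_grid(cols = vars(cyl))  # same layout as . ~ cyl

# Each distinct value of the facetting variable gets one panel
length(unique(layer_data(p_rows)$PANEL))
length(unique(layer_data(p_cols)$PANEL))
```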


Geometric objects

A geom is the geometrical object that a plot uses to represent data. People often describe plots by the type of geom that the plot uses. For example, bar charts use bar geoms, line charts use line geoms, boxplots use boxplot geoms, and so on. To change the geom in your plot, change the geom function that you add to ggplot(). For instance, to make the two plots below, you can use this code:

# left
p1 <- ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy))

# right
p2 <- ggplot(data = mpg) + 
  geom_smooth(mapping = aes(x = displ, y = hwy))

options(repr.plot.width = 10, repr.plot.height = 5)
grid.arrange(p1, p2, ncol=2)

`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

ggplot(data = mpg) + 
  geom_smooth(mapping = aes(x = displ, y = hwy, linetype = drv))

`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

ggplot2 provides over 40 geoms, and extension packages provide even more. See the examples in the ggplot2 documentation, or find more in the cheatsheets.
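
One way to browse the available geoms is to list the exported functions whose names start with `geom_`. This is only a sketch: the count includes aliases (for example `geom_density2d()` and `geom_density_2d()`), so it is not an exact count of distinct geoms, and the variable name `geoms` is ours.

```r
library(ggplot2)

# All objects exported by ggplot2 whose names begin with "geom_"
geoms <- ls("package:ggplot2", pattern = "^geom_")

length(geoms)  # how many geom functions this ggplot2 version exports
head(geoms)    # the first few names
```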

? geom_smooth

geom_smooth              package:ggplot2               R Documentation

Smoothed conditional means

Description:

     Aids the eye in seeing patterns in the presence of overplotting.
     'geom_smooth()' and 'stat_smooth()' are effectively aliases: they
     both use the same arguments. Use 'stat_smooth()' if you want to
     display the results with a non-standard geom.

Usage:

     geom_smooth(
       mapping = NULL,
       data = NULL,
       stat = "smooth",
       position = "identity",
       ...,
       method = NULL,
       formula = NULL,
       se = TRUE,
       na.rm = FALSE,
       orientation = NA,
       show.legend = NA,
       inherit.aes = TRUE
     )
     
     stat_smooth(
       mapping = NULL,
       data = NULL,
       geom = "smooth",
       position = "identity",
       ...,
       method = NULL,
       formula = NULL,
       se = TRUE,
       n = 80,
       span = 0.75,
       fullrange = FALSE,
       level = 0.95,
       method.args = list(),
       na.rm = FALSE,
       orientation = NA,
       show.legend = NA,
       inherit.aes = TRUE
     )
     
Arguments:

 mapping: Set of aesthetic mappings created by 'aes()'. If specified
          and 'inherit.aes = TRUE' (the default), it is combined with
          the default mapping at the top level of the plot. You must
          supply 'mapping' if there is no plot mapping.

    data: The data to be displayed in this layer. There are three
          options:

          If 'NULL', the default, the data is inherited from the plot
          data as specified in the call to 'ggplot()'.

          A 'data.frame', or other object, will override the plot data.
          All objects will be fortified to produce a data frame. See
          'fortify()' for which variables will be created.

          A 'function' will be called with a single argument, the plot
          data. The return value must be a 'data.frame', and will be
          used as the layer data. A 'function' can be created from a
          'formula' (e.g. '~ head(.x, 10)').

position: Position adjustment, either as a string naming the adjustment
          (e.g. '"jitter"' to use 'position_jitter'), or the result of
          a call to a position adjustment function. Use the latter if
          you need to change the settings of the adjustment.

     ...: Other arguments passed on to 'layer()'. These are often
          aesthetics, used to set an aesthetic to a fixed value, like
          'colour = "red"' or 'size = 3'. They may also be parameters
          to the paired geom/stat.

  method: Smoothing method (function) to use, accepts either 'NULL' or
          a character vector, e.g. '"lm"', '"glm"', '"gam"', '"loess"'
          or a function, e.g. 'MASS::rlm' or 'mgcv::gam', 'stats::lm',
          or 'stats::loess'. '"auto"' is also accepted for backwards
          compatibility.  It is equivalent to 'NULL'.

          For 'method = NULL' the smoothing method is chosen based on
          the size of the largest group (across all panels).
          'stats::loess()' is used for less than 1,000 observations;
          otherwise 'mgcv::gam()' is used with 'formula = y ~ s(x, bs =
          "cs")' with 'method = "REML"'. Somewhat anecdotally, 'loess'
          gives a better appearance, but is O(N^2) in memory, so does
          not work for larger datasets.

          If you have fewer than 1,000 observations but want to use the
          same 'gam()' model that 'method = NULL' would use, then set
          method = "gam", formula = y ~ s(x, bs = "cs").

 formula: Formula to use in smoothing function, eg. 'y ~ x', 'y ~
          poly(x, 2)', 'y ~ log(x)'. 'NULL' by default, in which case
          'method = NULL' implies 'formula = y ~ x' when there are
          fewer than 1,000 observations and 'formula = y ~ s(x, bs =
          "cs")' otherwise.

      se: Display confidence interval around smooth? ('TRUE' by
          default, see 'level' to control.)

   na.rm: If 'FALSE', the default, missing values are removed with a
          warning. If 'TRUE', missing values are silently removed.

orientation: The orientation of the layer. The default ('NA')
          automatically determines the orientation from the aesthetic
          mapping. In the rare event that this fails it can be given
          explicitly by setting 'orientation' to either '"x"' or '"y"'.
          See the _Orientation_ section for more detail.

show.legend: logical. Should this layer be included in the legends?
          'NA', the default, includes if any aesthetics are mapped.
          'FALSE' never includes, and 'TRUE' always includes. It can
          also be a named logical vector to finely select the
          aesthetics to display.

inherit.aes: If 'FALSE', overrides the default aesthetics, rather than
          combining with them. This is most useful for helper functions
          that define both data and aesthetics and shouldn't inherit
          behaviour from the default plot specification, e.g.
          'borders()'.

geom, stat: Use to override the default connection between
          'geom_smooth()' and 'stat_smooth()'.

       n: Number of points at which to evaluate smoother.

    span: Controls the amount of smoothing for the default loess
          smoother. Smaller numbers produce wigglier lines, larger
          numbers produce smoother lines. Only used with loess, i.e.
          when 'method = "loess"', or when 'method = NULL' (the
          default) and there are fewer than 1,000 observations.

fullrange: If 'TRUE', the smoothing line gets expanded to the range of
          the plot, potentially beyond the data. This does not extend
          the line into any additional padding created by 'expansion'.

   level: Level of confidence interval to use (0.95 by default).

method.args: List of additional arguments passed on to the modelling
          function defined by 'method'.

Details:

     Calculation is performed by the (currently undocumented)
     'predictdf()' generic and its methods.  For most methods the
     standard error bounds are computed using the 'predict()' method -
     the exceptions are 'loess()', which uses a t-based approximation,
     and 'glm()', where the normal confidence interval is constructed
     on the link scale and then back-transformed to the response scale.

Orientation:

     This geom treats each axis differently and, thus, can thus have
     two orientations. Often the orientation is easy to deduce from a
     combination of the given mappings and the types of positional
     scales in use. Thus, ggplot2 will by default try to guess which
     orientation the layer should have. Under rare circumstances, the
     orientation is ambiguous and guessing may fail. In that case the
     orientation can be specified directly using the 'orientation'
     parameter, which can be either '"x"' or '"y"'. The value gives the
     axis that the geom should run along, '"x"' being the default
     orientation you would expect for the geom.

Aesthetics:

     'geom_smooth()' understands the following aesthetics (required
     aesthetics are in bold):

        * *'x'*

        * *'y'*

        * 'alpha'

        * 'colour'

        * 'fill'

        * 'group'

        * 'linetype'

        * 'linewidth'

        * 'weight'

        * 'ymax'

        * 'ymin'

     Learn more about setting these aesthetics in
     'vignette("ggplot2-specs")'.

Computed variables:

     These are calculated by the 'stat' part of layers and can be
     accessed with delayed evaluation. 'stat_smooth()' provides the
     following variables, some of which depend on the orientation:

        * 'after_stat(y)' _or_ 'after_stat(x)'
          Predicted value.

        * 'after_stat(ymin)' _or_ 'after_stat(xmin)'
          Lower pointwise confidence interval around the mean.

        * 'after_stat(ymax)' _or_ 'after_stat(xmax)'
          Upper pointwise confidence interval around the mean.

        * 'after_stat(se)'
          Standard error.

See Also:

     See individual modelling functions for more details: 'lm()' for
     linear smooths, 'glm()' for generalised linear smooths, and
     'loess()' for local smooths.

Examples:

     ggplot(mpg, aes(displ, hwy)) +
       geom_point() +
       geom_smooth()
     
     # If you need the fitting to be done along the y-axis set the orientation
     ggplot(mpg, aes(displ, hwy)) +
       geom_point() +
       geom_smooth(orientation = "y")
     
     # Use span to control the "wiggliness" of the default loess smoother.
     # The span is the fraction of points used to fit each local regression:
     # small numbers make a wigglier curve, larger numbers make a smoother curve.
     ggplot(mpg, aes(displ, hwy)) +
       geom_point() +
       geom_smooth(span = 0.3)
     
     # Instead of a loess smooth, you can use any other modelling function:
     ggplot(mpg, aes(displ, hwy)) +
       geom_point() +
       geom_smooth(method = lm, se = FALSE)
     
     ggplot(mpg, aes(displ, hwy)) +
       geom_point() +
       geom_smooth(method = lm, formula = y ~ splines::bs(x, 3), se = FALSE)
     
     # Smooths are automatically fit to each group (defined by categorical
     # aesthetics or the group aesthetic) and for each facet.
     
     ggplot(mpg, aes(displ, hwy, colour = class)) +
       geom_point() +
       geom_smooth(se = FALSE, method = lm)
     ggplot(mpg, aes(displ, hwy)) +
       geom_point() +
       geom_smooth(span = 0.8) +
       facet_wrap(~drv)
     
     
     binomial_smooth <- function(...) {
       geom_smooth(method = "glm", method.args = list(family = "binomial"), ...)
     }
     # To fit a logistic regression, you need to coerce the values to
     # a numeric vector lying between 0 and 1.
     ggplot(rpart::kyphosis, aes(Age, Kyphosis)) +
       geom_jitter(height = 0.05) +
       binomial_smooth()
     
     ggplot(rpart::kyphosis, aes(Age, as.numeric(Kyphosis) - 1)) +
       geom_jitter(height = 0.05) +
       binomial_smooth()
     
     ggplot(rpart::kyphosis, aes(Age, as.numeric(Kyphosis) - 1)) +
       geom_jitter(height = 0.05) +
       binomial_smooth(formula = y ~ splines::ns(x, 2))
     
     # But in this case, it's probably better to fit the model yourself
     # so you can exercise more control and see whether or not it's a good model.

binomial_smooth <- function(...) {
    geom_smooth(method = "glm", method.args = list(family = "binomial"), ...)
}
# To fit a logistic regression, you need to coerce the values to
# a numeric vector lying between 0 and 1.
ggplot(rpart::kyphosis, aes(Age, as.numeric(Kyphosis) - 1)) +
    geom_jitter(height = 0.05) +
    binomial_smooth(formula = y ~ splines::ns(x, 2))

options(repr.plot.width = 10, repr.plot.height = 5)


p1 <- ggplot(data = mpg) +
  geom_smooth(mapping = aes(x = displ, y = hwy))

p2 <- ggplot(data = mpg) +
  geom_smooth(mapping = aes(x = displ, y = hwy, group = drv))

p3 <- ggplot(data = mpg) +
  geom_smooth(
    mapping = aes(x = displ, y = hwy, color = drv),
    show.legend = FALSE
  )

options(repr.plot.width = 30, repr.plot.height = 5)
grid.arrange(p1, p2, p3, ncol = 3)

`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

To display multiple geoms in the same plot, add multiple geom functions to ggplot():

p1 <- ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) +
  geom_smooth(mapping = aes(x = displ, y = hwy))    # some duplication in the code here!

p2 <- ggplot(data = mpg, mapping = aes(x = displ, y = cty)) +   # that is better!
  geom_point() + 
  geom_smooth()

options(repr.plot.width = 20, repr.plot.height = 5)
grid.arrange(p1, p2, ncol=2)

`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point(mapping = aes(color = class)) + 
  geom_smooth()

options(repr.plot.width = 10, repr.plot.height = 5)

`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point(mapping = aes(color = class)) + 
  geom_smooth(data = filter(mpg, class == "subcompact"), se = FALSE)

`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) + 
  geom_point() + 
  geom_smooth(se = FALSE)

`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

p1 <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point() + 
  geom_smooth(se = FALSE)

p2 <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point() + 
  geom_smooth(mapping = aes(x = displ, y = hwy, group = drv), se = FALSE)

# Constants such as size and linewidth are set outside aes(); inside aes()
# they would be treated as data and add a spurious legend entry.
p3 <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) + 
  geom_point(size = 4) + 
  geom_smooth(mapping = aes(group = drv), linewidth = 2, se = FALSE)

p4 <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point(mapping = aes(color = drv), size = 4) + 
  geom_smooth(se = FALSE)

p5 <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) + 
  geom_point(size = 4) + 
  geom_smooth(mapping = aes(group = drv, linetype = drv), linewidth = 2, se = FALSE)

p6 <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy, fill = drv)) + 
  geom_point(shape=21, color="white", size=4)  # alpha = factor(cyl)

options(repr.plot.width = 20, repr.plot.height = 15)
grid.arrange(p1, p2, p3, p4, p5, p6, ncol=2, nrow = 3)

`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Statistical transformations

The diamonds dataset comes in ggplot2 and contains information about ~54,000 diamonds, including the price, carat, color, clarity, and cut of each diamond.

Many graphs, like scatterplots, plot the raw values of your dataset. Other graphs, like bar charts, calculate new values to plot:

bar charts, histograms, and frequency polygons bin your data and then plot bin counts, the number of points that fall in each bin.
smoothers fit a model to your data and then plot predictions from the model.
boxplots compute a robust summary of the distribution and then display a specially formatted box.

The algorithm used to calculate new values for a graph is called a stat, short for statistical transformation.

Every geom has a default stat; and every stat has a default geom. This means that you can typically use geoms without worrying about the underlying statistical transformation.

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut))

options(repr.plot.width = 10, repr.plot.height = 5)
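
The pairing of geoms and stats can be checked directly. The sketch below swaps `geom_bar()` for `stat_count()` and compares the counts each layer computes; `layer_data()` is a ggplot2 helper, while the names `p_geom` and `p_stat` are ours.

```r
library(ggplot2)

p_geom <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))     # geom_bar() uses the count stat by default
p_stat <- ggplot(data = diamonds) +
  stat_count(mapping = aes(x = cut))   # stat_count() draws bars by default

# Both layers compute the same bar heights
all.equal(layer_data(p_geom)$count, layer_data(p_stat)$count)
```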


# There are three reasons to use a stat explicitly:
# 1. To override the default stat. geom_bar() uses stat_count() by default;
#    stat = "identity" plots the freq values exactly as given.
demo <- tribble(
  ~cut,         ~freq,
  "Fair",       1610,
  "Good",       4906,
  "Very Good",  12082,
  "Premium",    13791,
  "Ideal",      21551
)

ggplot(data = demo) +
  geom_bar(mapping = aes(x = cut, y = freq), stat = "identity")


# 2. to override the default mapping from transformed variables to aesthetics
ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, y = after_stat(prop), group = 1))
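
To see what `after_stat(prop)` with `group = 1` plots, the proportions can be computed by hand. This is only a sketch with dplyr (loaded as part of the tidyverse); the name `cut_props` is ours.

```r
library(dplyr)
library(ggplot2)  # for the diamonds dataset

cut_props <- diamonds %>%
  count(cut) %>%               # the per-bar counts that stat_count() computes
  mutate(prop = n / sum(n))    # share of all diamonds, as with group = 1

cut_props
```

Without `group = 1`, each cut is its own group, so every proportion would be 1.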


# 3. to draw greater attention to the statistical transformation

ggplot(data = diamonds) + 
  stat_summary(
    mapping = aes(x = cut, y = depth),
    fun.min = min,
    fun.max = max,
    fun = median
  )

ggplot(data = diamonds) + 
  geom_col(mapping = aes(x = cut, y = depth))

options(repr.plot.width = 10, repr.plot.height = 5)

Position Adjustments

There’s one more piece of magic associated with bar charts. You can colour a bar chart using either the colour aesthetic, or, more usefully, fill:

p1 <- ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, colour = cut))
p2 <- ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = cut))

options(repr.plot.width = 20, repr.plot.height = 5)
grid.arrange(p1, p2, ncol=2)

Note what happens if you map the fill aesthetic to another variable, like clarity: the bars are automatically stacked. Each colored rectangle represents a combination of cut and clarity.

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = clarity))
options(repr.plot.width = 10, repr.plot.height = 5)

The stacking is performed automatically by the position adjustment specified by the position argument. If you don’t want a stacked bar chart, you can use one of three other options: “identity”, “dodge” or “fill”.

position = "identity" will place each object exactly where it falls in the context of the graph. This is not very useful for bars, because it overlaps them. To see that overlapping we either need to make the bars slightly transparent by setting alpha to a small value, or completely transparent by setting fill = NA.

p1 <- ggplot(data = diamonds, mapping = aes(x = cut, fill = clarity)) + 
  geom_bar(alpha = 1/5, position = "identity")

p2 <- ggplot(data = diamonds, mapping = aes(x = cut, colour = clarity)) + 
  geom_bar(fill = NA, position = "identity")

options(repr.plot.width = 20, repr.plot.height = 5)
grid.arrange(p1, p2, ncol=2)

position = "fill" works like stacking, but makes each set of stacked bars the same height. This makes it easier to compare proportions across groups.

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = clarity), position = "fill")
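One way to see what position = "fill" does is to inspect the computed layer data: after the adjustment, the top of every stack sits at 1. This is only a sketch; p_fill, ld, and stack_tops are names chosen for illustration.

```r
library(ggplot2)

p_fill <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = clarity), position = "fill")

# ymax is the top edge of each rectangle after the position adjustment
ld <- layer_data(p_fill)

# The tallest rectangle edge within each cut (x position)
stack_tops <- tapply(ld$ymax, as.integer(ld$x), max)
stack_tops
```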

options(repr.plot.width = 10, repr.plot.height = 5)

position = "dodge" places overlapping objects directly beside one another. This makes it easier to compare individual values.

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = clarity), position = "dodge")

There’s one other type of adjustment that’s not useful for bar charts, but it can be very useful for scatterplots. Recall our first scatterplot: the values of hwy and displ are rounded, so the points fall on a grid and many of them overlap. As a result, the plot displays only 126 points even though there are 234 observations in the dataset. This problem is known as overplotting.
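A quick way to gauge the overplotting described above is to count how many distinct (displ, hwy) pairs the data contains and compare that with the number of rows. A sketch assuming dplyr is attached; n_rows and n_spots are illustrative names.

```r
library(dplyr)
library(ggplot2)  # provides the mpg dataset

n_rows  <- nrow(mpg)                        # one point is drawn per row
n_spots <- nrow(distinct(mpg, displ, hwy))  # distinct positions on the plot

c(rows = n_rows, distinct_positions = n_spots)
```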

You can avoid this gridding by setting the position adjustment to “jitter”. position = "jitter" adds a small amount of random noise to each point. This spreads the points out because no two points are likely to receive the same amount of random noise. ggplot2 comes with a shorthand for geom_point(position = "jitter"): geom_jitter().

To learn more about a position adjustment, look up the help page associated with each adjustment: ?position_dodge, ?position_fill, ?position_identity, ?position_jitter, and ?position_stack.

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = class), position = "jitter")
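The same plot can be written with the geom_jitter() shorthand. The sketch below also fixes the random seed through position_jitter(), so the jittered layout is reproducible between renders; the width, height, and seed values are arbitrary choices for illustration.

```r
library(ggplot2)

# Shorthand form: equivalent to geom_point(position = "jitter")
p_short <- ggplot(data = mpg) +
  geom_jitter(mapping = aes(x = displ, y = hwy, color = class))

# Explicit form with a fixed seed, so repeated renders place points identically
p_seeded <- ggplot(data = mpg) +
  geom_point(
    mapping = aes(x = displ, y = hwy, color = class),
    position = position_jitter(width = 0.1, height = 0.5, seed = 42)
  )
```

Larger width and height values spread the points further, at the cost of distorting their true positions more.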

options(repr.plot.width = 10, repr.plot.height = 5)

# Exercise 3.8
p1 <- ggplot(data = mpg, mapping = aes(x = cty, y = hwy, color = class)) + 
  geom_point()

p2 <- ggplot(data = mpg, mapping = aes(x = cty, y = hwy, color = class)) + 
  geom_jitter() # optionally pass width and height (e.g. width = 0.5, height = 0.5) to control the spread

options(repr.plot.width = 20, repr.plot.height = 5)
grid.arrange(p1, p2, ncol=2)

ggplot(data = mpg, mapping = aes(x = cty, y = hwy, color=drv)) + 
  geom_boxplot(mapping = aes(group=drv))

options(repr.plot.width = 10, repr.plot.height = 5)

Coordinate systems

Coordinate systems are probably the most complicated part of ggplot2. The default coordinate system is the Cartesian coordinate system where the x and y positions act independently to determine the location of each point. There are a number of other coordinate systems that are occasionally helpful.

coord_flip() switches the x and y axes. This is useful (for example), if you want horizontal boxplots. It’s also useful for long labels: it’s hard to get them to fit without overlapping on the x-axis.

p1 <- ggplot(data = mpg, mapping = aes(x = class, y = hwy, color=class)) + 
  geom_boxplot()
p2 <- ggplot(data = mpg, mapping = aes(x = class, y = hwy, color=class)) + 
  geom_boxplot() +
  coord_flip()

options(repr.plot.width = 20, repr.plot.height = 5)
grid.arrange(p1, p2, ncol=2)
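In ggplot2 3.3.0 and later, a horizontal boxplot can also be drawn without coord_flip() by mapping the categorical variable to y directly. A minimal sketch; p_horizontal is an illustrative name.

```r
library(ggplot2)

# Mapping class to y makes geom_boxplot() draw the boxes horizontally
p_horizontal <- ggplot(data = mpg, mapping = aes(x = hwy, y = class, color = class)) +
  geom_boxplot()

p_horizontal
```

coord_flip() remains useful for geoms or older ggplot2 versions that do not support this orientation.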

coord_quickmap() sets the aspect ratio correctly for maps. This is very important if you’re plotting spatial data with ggplot2 (a topic that R for Data Science does not have the space to cover).

library(maps)    # provides the "nz" map database read by map_data()
library(mapproj) # required by coord_map()

if (require("maps")) {
    nz <- map_data("nz")

    p1 <- ggplot(nz, aes(long, lat, group = group)) +
        geom_polygon(fill = "white", colour = "black")

    p2 <- ggplot(nz, aes(long, lat, group = group)) +
        geom_polygon(fill = "white", colour = "black") +
        coord_quickmap()

    options(repr.plot.width = 10, repr.plot.height = 5)
    grid.arrange(p1, p2, ncol=2)
}

# map_data() comes from ggplot2, so ggplot2 (or tidyverse) must be attached
# before this cell runs; otherwise R reports: could not find function "map_data".
library(ggplot2)
library(gridExtra)

nz <- map_data("nz")

p1 <- ggplot(nz, aes(long, lat, group = group)) +
    geom_polygon(fill = "white", colour = "black")

p2 <- ggplot(nz, aes(long, lat, group = group)) +
    geom_polygon(fill = "white", colour = "black") +
    coord_map() # needs the mapproj package

options(repr.plot.width = 10, repr.plot.height = 5)
grid.arrange(p1, p2, ncol=2)

coord_polar() uses polar coordinates. Polar coordinates reveal an interesting connection between a bar chart and a Coxcomb chart.

bar <- ggplot(data = diamonds) + 
  geom_bar(
    mapping = aes(x = cut, fill = cut), 
    show.legend = FALSE,
    width = 1
  ) + 
  theme(aspect.ratio = 1) +
  labs(title = "Using labs", x = "polar", y = "bear")

p1 <- bar + coord_flip()
p2 <- bar + coord_polar()

options(repr.plot.width = 10, repr.plot.height = 5)
grid.arrange(p1, p2, ncol=2)
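A related sketch, not part of the original notes: a single stacked bar becomes a pie chart when the angle is mapped to y through coord_polar(theta = "y"). The object names stacked and pie are illustrative.

```r
library(ggplot2)

# A single stacked bar: one x position, segments filled by cut
stacked <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = factor(1), fill = cut), width = 1)

# Mapping the angle to y turns the stacked bar into a pie chart
pie <- stacked + coord_polar(theta = "y")
```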

# Exercise 3.9: What does the plot below tell you about the relationship between city and highway mpg?
# Why is coord_fixed() important? What does geom_abline() do?
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point() + 
  geom_abline() +
  coord_fixed()
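As a starting point for the exercise, the plot above can be written with the defaults spelled out. This sketch relies on the documented defaults of geom_abline() (slope 1, intercept 0 when neither is given) and coord_fixed() (ratio 1); p_explicit and share_above are illustrative names.

```r
library(ggplot2)

p_explicit <- ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0) +  # the reference line y = x
  coord_fixed(ratio = 1)                   # one unit on x is as long as one unit on y

# Share of cars whose highway mileage exceeds their city mileage
share_above <- mean(mpg$hwy > mpg$cty)
share_above
```

With a fixed 1:1 ratio the reference line is drawn at 45 degrees, which makes it easy to judge how far the points sit from hwy = cty.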

Tags: R Data Science Data Visualisation DataViz
