R4DS - Data transformation

Day 6. Data types

By Chí Trung HÀ

C12. Logical Vectors

library(tidyverse)
library(nycflights13)
library(dplyr)

12.2. Comparisions

x <- c(1, 2, 3, 5, 7, 11, 13)
x * 2

df <- tibble(x)
df |> 
  mutate(y = x * 2)

<ol class=list-inline><li>2</li><li>4</li><li>6</li><li>10</li><li>14</li><li>22</li><li>26</li></ol>

A tibble: 7 x 2
xy
<dbl><dbl>
1 2
2 4
3 6
510
714
1122
1326
flights |> 
  mutate(
    daytime = dep_time > 600 & dep_time < 2000,
    approx_ontime = abs(arr_delay) < 20,
    .keep = "used"
  )
A tibble: 336776 x 4
dep_timearr_delaydaytimeapprox_ontime
<int><dbl><lgl><lgl>
517 11FALSE TRUE
533 20FALSEFALSE
542 33FALSEFALSE
544-18FALSE TRUE
554-25FALSEFALSE
554 12FALSE TRUE
555 19FALSE TRUE
557-14FALSE TRUE
557 -8FALSE TRUE
558 8FALSE TRUE
558 -2FALSE TRUE
558 -3FALSE TRUE
558 7FALSE TRUE
558-14FALSE TRUE
559 31FALSEFALSE
559 -4FALSE TRUE
559 -8FALSE TRUE
600 -7FALSE TRUE
600 12FALSE TRUE
601 -6 TRUE TRUE
602 -8 TRUE TRUE
602 16 TRUE TRUE
606-12 TRUE TRUE
606 -8 TRUE TRUE
607-17 TRUE TRUE
608 32 TRUEFALSE
611 14 TRUE TRUE
613 4 TRUE TRUE
615-21 TRUEFALSE
615 -9 TRUE TRUE
............
2123-24FALSEFALSE
2127 -9FALSE TRUE
2128-31FALSEFALSE
2129 -2FALSE TRUE
2131-30FALSEFALSE
2140-30FALSEFALSE
2142 11FALSE TRUE
2145-25FALSEFALSE
2147 3FALSE TRUE
2149-23FALSEFALSE
2150-16FALSE TRUE
2159194FALSEFALSE
2203 8FALSE TRUE
2207 7FALSE TRUE
2211 57FALSEFALSE
2231-21FALSEFALSE
2233 42FALSEFALSE
2235130FALSEFALSE
2237 -8FALSE TRUE
2240-17FALSE TRUE
2240-20FALSEFALSE
2241-16FALSE TRUE
2307 1FALSE TRUE
2349-25FALSEFALSE
NA NA NA NA
NA NA NA NA
NA NA NA NA
NA NA NA NA
NA NA NA NA
NA NA NA NA
flights |> 
  mutate(
    daytime = dep_time > 600 & dep_time < 2000,
    approx_ontime = abs(arr_delay) < 20,
  ) |> 
  filter(daytime & approx_ontime)
A tibble: 172286 x 21
yearmonthdaydep_timesched_dep_timedep_delayarr_timesched_arr_timearr_delaycarrier...tailnumorigindestair_timedistancehourminutetime_hourdaytimeapprox_ontime
<int><int><int><int><int><dbl><int><int><dbl><chr>...<chr><chr><chr><dbl><dbl><dbl><dbl><dttm><lgl><lgl>
201311601600 1 844 850 -6B6...N644JBEWRPBI14710236 02013-01-01 06:00:00TRUETRUE
201311602610-8 812 820 -8DL...N971DLLGAMSP17010206102013-01-01 06:00:00TRUETRUE
201311602605-3 821 805 16MQ...N730MQLGADTW105 5026 52013-01-01 06:00:00TRUETRUE
201311606610-4 858 910-12AA...N633AAEWRMIA15210856102013-01-01 06:00:00TRUETRUE
201311606610-4 837 845 -8DL...N3739PJFKATL128 7606102013-01-01 06:00:00TRUETRUE
201311607607 0 858 915-17UA...N53442EWRMIA15710856 72013-01-01 06:00:00TRUETRUE
20131161160011 945 931 14UA...N532UAJFKSFO36625866 02013-01-01 06:00:00TRUETRUE
201311613610 3 925 921 4B6...N635JBJFKRSW17510746102013-01-01 06:00:00TRUETRUE
201311615615 0 833 842 -9DL...N326NBEWRATL120 7466152013-01-01 06:00:00TRUETRUE
201311622630-810171014 3US...N807AWEWRPHX34221336302013-01-01 06:00:00TRUETRUE
20131162361013 920 915 5AA...N3EMAALGAMIA15310966102013-01-01 06:00:00TRUETRUE
201311623627-4 933 932 1UA...N459UALGAIAH22914166272013-01-01 06:00:00TRUETRUE
201311624630-6 840 830 10MQ...N518MQLGAMSP16610206302013-01-01 06:00:00TRUETRUE
201311627630-310181018 0US...N535UWJFKPHX33021536302013-01-01 06:00:00TRUETRUE
201311628630-211371140 -3AA...N3BAAAJFKSJU19215986302013-01-01 06:00:00TRUETRUE
201311629630-1 824 810 14AA...N3CYAALGAORD140 7336302013-01-01 06:00:00TRUETRUE
201311629630-1 721 740-19WN...N273WNLGABWI 40 1856302013-01-01 06:00:00TRUETRUE
201311629630-1 824 833 -9US...N426USEWRCLT 91 5296302013-01-01 06:00:00TRUETRUE
20131163260824 740 728 12EV...N13553EWRIAD 52 2126 82013-01-01 06:00:00TRUETRUE
201311637645-8 930 935 -5B6...N709JBLGAMCO144 9506452013-01-01 06:00:00TRUETRUE
201311639640-1 739 749-10B6...N805JBJFKBOS 41 1876402013-01-01 06:00:00TRUETRUE
201311643646-3 922 940-18UA...N497UAEWRPBI14610236462013-01-01 06:00:00TRUETRUE
201311643645-2 837 848-11US...N178USEWRCLT 91 5296452013-01-01 06:00:00TRUETRUE
201311644636 8 931 940 -9UA...N75435EWRFLL15110656362013-01-01 06:00:00TRUETRUE
201311645647-2 815 810 5B6...N796JBJFKBUF 63 3016472013-01-01 06:00:00TRUETRUE
201311646645 1 910 916 -6UA...N569UALGADEN24316206452013-01-01 06:00:00TRUETRUE
201311646645 110231030 -7UA...N38727EWRSNA38024346452013-01-01 06:00:00TRUETRUE
201311651655-4 936 942 -6B6...N558JBJFKLAS32322486552013-01-01 06:00:00TRUETRUE
201311652655-3 932 921 11B6...N178JBJFKMSY19111826552013-01-01 06:00:00TRUETRUE
201311655655 010211030 -9DL...N3763DJFKSLC29419906552013-01-01 06:00:00TRUETRUE
............................................................
201393019081911 -321072110 -3EV...N16987EWRCLT 74 52919112013-09-30 19:00:00TRUETRUE
201393019091905 420552057 -29E...N937XJJFKORD108 74019 52013-09-30 19:00:00TRUETRUE
201393019091901 822122211 1UA...N513UAEWRSFO307256519 12013-09-30 19:00:00TRUETRUE
201393019151900 1521382151-13B6...N563JBJFKMCO120 94419 02013-09-30 19:00:00TRUETRUE
201393019151900 1520252015 10WN...N719SWLGAMKE104 73819 02013-09-30 19:00:00TRUETRUE
201393019171920 -320532055 -29E...N600LRLGABUF 50 29219202013-09-30 19:00:00TRUETRUE
201393019251930 -520332049-16EV...N825ASLGAIAD 43 22919302013-09-30 19:00:00TRUETRUE
201393019261930 -421462201-15DL...N303DQEWRATL 95 74619302013-09-30 19:00:00TRUETRUE
201393019271930 -320542056 -2EV...N13956EWRMKE110 72519302013-09-30 19:00:00TRUETRUE
201393019281932 -421372143 -6EV...N18556EWRTYS 82 63119322013-09-30 19:00:00TRUETRUE
201393019291934 -521462136 10EV...N18557EWRCHS 92 62819342013-09-30 19:00:00TRUETRUE
201393019301930 022172231-14DL...N365NBLGATPA133101019302013-09-30 19:00:00TRUETRUE
201393019301940-1021202125 -5MQ...N839MQJFKRDU 70 42719402013-09-30 19:00:00TRUETRUE
201393019311905 2620492033 16EV...N748EVLGAROC 41 25419 52013-09-30 19:00:00TRUETRUE
201393019321935 -322212232-11DL...N927DALGAMCO132 95019352013-09-30 19:00:00TRUETRUE
201393019341930 422362250-14AA...N3EXAAJFKSEA340242219302013-09-30 19:00:00TRUETRUE
201393019341935 -122382250-12AA...N3JFAALGAMIA150109619352013-09-30 19:00:00TRUETRUE
201393019351935 022112229-18B6...N589JBJFKTPA132100519352013-09-30 19:00:00TRUETRUE
201393019351930 522302233 -3UA...N38446EWRTPA132 99719302013-09-30 19:00:00TRUETRUE
201393019361910 2622042203 1DL...N905DLJFKMCO126 94419102013-09-30 19:00:00TRUETRUE
201393019391940 -120412100-19EV...N829ASJFKIAD 42 22819402013-09-30 19:00:00TRUETRUE
201393019391950-1121292140-11MQ...N735MQLGACMH 73 47919502013-09-30 19:00:00TRUETRUE
201393019401900 4022282232 -4DL...N723TWJFKSFO323258619 02013-09-30 19:00:00TRUETRUE
201393019441950 -622082215 -7MQ...N507MQLGAATL100 76219502013-09-30 19:00:00TRUETRUE
201393019532000 -721192132-13UA...N853UALGAORD107 73320 02013-09-30 20:00:00TRUETRUE
201393019552000 -522192230-11DL...N992DLLGAATL 99 76220 02013-09-30 20:00:00TRUETRUE
201393019551935 2021412159-189E...N928XJJFKCVG 86 58919352013-09-30 19:00:00TRUETRUE
201393019551942 1323002250 10B6...N623JBLGAFLL141107619422013-09-30 19:00:00TRUETRUE
201393019552000 -521122114 -2US...N957UWLGABOS 35 18420 02013-09-30 20:00:00TRUETRUE
201393019582005 -721192130-11MQ...N511MQEWRORD102 71920 52013-09-30 20:00:00TRUETRUE

Floating point comparision

x <- c(1 / 49 * 49, sqrt(2) ^ 2)
x

<ol class=list-inline><li>1</li><li>2</li></ol>

x == c(1, 2)

<ol class=list-inline><li>FALSE</li><li>FALSE</li></ol>

print(x, digits = 16)
[1] 0.9999999999999999 2.0000000000000004
near(x, c(1, 2))

<ol class=list-inline><li>TRUE</li><li>TRUE</li></ol>

Missing values

NA > 5
10 == NA

NA == NA

<NA>

<NA>

<NA>

# so this code doesn't work
flights |> 
  filter(dep_time == NA)
A tibble: 0 x 19
yearmonthdaydep_timesched_dep_timedep_delayarr_timesched_arr_timearr_delaycarrierflighttailnumorigindestair_timedistancehourminutetime_hour
<int><int><int><int><int><dbl><int><int><dbl><chr><int><chr><chr><chr><dbl><dbl><dbl><dbl><dttm>

is.na()

flights |> 
  filter(is.na(dep_time))
A tibble: 8255 x 19
yearmonthdaydep_timesched_dep_timedep_delayarr_timesched_arr_timearr_delaycarrierflighttailnumorigindestair_timedistancehourminutetime_hour
<int><int><int><int><int><dbl><int><int><dbl><chr><int><chr><chr><chr><dbl><dbl><dbl><dbl><dttm>
201311NA1630NANA1815NAEV4308N18120EWRRDUNA 41616302013-01-01 16:00:00
201311NA1935NANA2240NAAA 791N3EHAALGADFWNA138919352013-01-01 19:00:00
201311NA1500NANA1825NAAA1925N3EVAALGAMIANA109615 02013-01-01 15:00:00
201311NA 600NANA 901NAB6 125N618JBJFKFLLNA1069 6 02013-01-01 06:00:00
201312NA1540NANA1747NAEV4352N10575EWRCVGNA 56915402013-01-02 15:00:00
201312NA1620NANA1746NAEV4406N13949EWRPITNA 31916202013-01-02 16:00:00
201312NA1355NANA1459NAEV4434N10575EWRMHTNA 20913552013-01-02 13:00:00
201312NA1420NANA1644NAEV4935N759EVEWRATLNA 74614202013-01-02 14:00:00
201312NA1321NANA1536NAEV3849N13550EWRINDNA 64513212013-01-02 13:00:00
201312NA1545NANA1910NAAA 133NA JFKLAXNA247515452013-01-02 15:00:00
201312NA1330NANA1640NAAA 753N3FBAALGADFWNA138913302013-01-02 13:00:00
201312NA1601NANA1735NAUA 623NA EWRORDNA 71916 12013-01-02 16:00:00
201313NA 645NANA 757NAEV4241N14972EWRDCANA 199 6452013-01-03 06:00:00
201313NA1030NANA1210NAAA 321N487AALGAORDNA 73310302013-01-03 10:00:00
201313NA1125NANA1305NAAA 327N3AMAALGAORDNA 73311252013-01-03 11:00:00
201313NA 835NANA1150NAAA 717N3GXAALGADFWNA1389 8352013-01-03 08:00:00
201313NA 920NANA1245NAAA 721N201AALGADFWNA1389 9202013-01-03 09:00:00
201313NA1020NANA1330NAAA 731N3FVAALGADFWNA138910202013-01-03 10:00:00
201313NA1220NANA1415NAAA1757N573AALGASTLNA 88812202013-01-03 12:00:00
201313NA 630NANA 830NAMQ4599N500MQLGAMSPNA1020 6302013-01-03 06:00:00
201313NA 857NANA1209NAUA 714NA EWRMIANA1085 8572013-01-03 08:00:00
201313NA 645NANA 952NAUA 719NA EWRDFWNA1372 6452013-01-03 06:00:00
201314NA 845NANA1015NA9E3405NA JFKDCANA 213 8452013-01-04 08:00:00
201314NA1830NANA2044NA9E3716NA EWRDTWNA 48818302013-01-04 18:00:00
201314NA 920NANA1245NAAA 721N541AALGADFWNA1389 9202013-01-04 09:00:00
201314NA1245NANA1550NAAA 745N3BGAALGADFWNA138912452013-01-04 12:00:00
201314NA1430NANA1735NAAA 883N200AAEWRDFWNA137214302013-01-04 14:00:00
201314NA1530NANA1725NAAA2223N569AALGASTLNA 88815302013-01-04 15:00:00
201315NA1400NANA1518NAEV5712N827ASJFKIADNA 22814 02013-01-05 14:00:00
201315NA 840NANA1001NA9E3422NA JFKBOSNA 187 8402013-01-05 08:00:00
.........................................................
2013924NA1625NANA1750NAMQ3622N524MQLGABNANA 76416252013-09-24 16:00:00
2013925NA1259NANA1507NAEV5207N615QXLGACLTNA 54412592013-09-25 12:00:00
2013925NA 845NANA1018NAEV5286N615QXLGABTVNA 258 8452013-09-25 08:00:00
2013925NA1755NANA1932NAEV5287N722EVLGAMSNNA 81217552013-09-25 17:00:00
2013925NA 600NANA 716NAEV5716N877ASJFKIADNA 228 6 02013-09-25 06:00:00
2013925NA 836NANA 944NAB62280N258JBEWRBOSNA 200 8362013-09-25 08:00:00
2013925NA1300NANA1409NAUS2148NA LGABOSNA 18413 02013-09-25 13:00:00
2013925NA1900NANA2014NAUS2160NA LGABOSNA 18419 02013-09-25 19:00:00
2013925NA1300NANA1450NAMQ3388N817MQLGACMHNA 47913 02013-09-25 13:00:00
2013925NA1655NANA1840NAMQ3411N735MQLGARDUNA 43116552013-09-25 16:00:00
2013925NA1559NANA1719NAMQ3748N530MQEWRORDNA 71915592013-09-25 15:00:00
2013926NA 915NANA1141NAEV5109N748EVLGACHSNA 641 9152013-09-26 09:00:00
2013926NA1400NANA1512NAUS2183NA LGADCANA 21414 02013-09-26 14:00:00
2013926NA1240NANA1525NAWN4720N691WNEWRHOUNA141112402013-09-26 12:00:00
2013927NA 600NANA 730NAAA 301N584AALGAORDNA 733 6 02013-09-27 06:00:00
2013927NA2100NANA2211NAUS2164NA LGABOSNA 18421 02013-09-27 21:00:00
2013927NA1329NANA1444NAMQ3760N505MQEWRORDNA 71913292013-09-27 13:00:00
2013927NA1600NANA1739NAUA 269NA LGAORDNA 73316 02013-09-27 16:00:00
2013928NA1803NANA1927NAEV5563N724EVLGABTVNA 25818 32013-09-28 18:00:00
2013928NA 910NANA1220NAAA 1N320AAJFKLAXNA2475 9102013-09-28 09:00:00
2013928NA1635NANA1827NAUS 581NA EWRCLTNA 52916352013-09-28 16:00:00
2013929NA2054NANA2302NAEV4536N13988EWRCVGNA 56920542013-09-29 20:00:00
2013929NA1830NANA2010NAMQ3134N508MQEWRORDNA 71918302013-09-29 18:00:00
2013929NA 700NANA 833NAUA 331NA LGAORDNA 733 7 02013-09-29 07:00:00
2013930NA1842NANA2019NAEV5274N740EVLGABNANA 76418422013-09-30 18:00:00
2013930NA1455NANA1634NA9E3393NA JFKDCANA 21314552013-09-30 14:00:00
2013930NA2200NANA2312NA9E3525NA LGASYRNA 19822 02013-09-30 22:00:00
2013930NA1210NANA1330NAMQ3461N535MQLGABNANA 76412102013-09-30 12:00:00
2013930NA1159NANA1344NAMQ3572N511MQLGACLENA 41911592013-09-30 11:00:00
2013930NA 840NANA1020NAMQ3531N839MQLGARDUNA 431 8402013-09-30 08:00:00
flights |> 
  filter(month == 1, day == 1) |> 
  arrange(desc(is.na(dep_time)), dep_time)
A tibble: 842 x 19
yearmonthdaydep_timesched_dep_timedep_delayarr_timesched_arr_timearr_delaycarrierflighttailnumorigindestair_timedistancehourminutetime_hour
<int><int><int><int><int><dbl><int><int><dbl><chr><int><chr><chr><chr><dbl><dbl><dbl><dbl><dttm>
201311 NA1630NA NA1815 NAEV4308N18120EWRRDU NA 41616302013-01-01 16:00:00
201311 NA1935NA NA2240 NAAA 791N3EHAALGADFW NA138919352013-01-01 19:00:00
201311 NA1500NA NA1825 NAAA1925N3EVAALGAMIA NA109615 02013-01-01 15:00:00
201311 NA 600NA NA 901 NAB6 125N618JBJFKFLL NA1069 6 02013-01-01 06:00:00
201311517 515 2 830 819 11UA1545N14228EWRIAH2271400 5152013-01-01 05:00:00
201311533 529 4 850 830 20UA1714N24211LGAIAH2271416 5292013-01-01 05:00:00
201311542 540 2 923 850 33AA1141N619AAJFKMIA1601089 5402013-01-01 05:00:00
201311544 545-110041022-18B6 725N804JBJFKBQN1831576 5452013-01-01 05:00:00
201311554 600-6 812 837-25DL 461N668DNLGAATL116 762 6 02013-01-01 06:00:00
201311554 558-4 740 728 12UA1696N39463EWRORD150 719 5582013-01-01 05:00:00
201311555 600-5 913 854 19B6 507N516JBEWRFLL1581065 6 02013-01-01 06:00:00
201311557 600-3 709 723-14EV5708N829ASLGAIAD 53 229 6 02013-01-01 06:00:00
201311557 600-3 838 846 -8B6 79N593JBJFKMCO140 944 6 02013-01-01 06:00:00
201311558 600-2 753 745 8AA 301N3ALAALGAORD138 733 6 02013-01-01 06:00:00
201311558 600-2 849 851 -2B6 49N793JBJFKPBI1491028 6 02013-01-01 06:00:00
201311558 600-2 853 856 -3B6 71N657JBJFKTPA1581005 6 02013-01-01 06:00:00
201311558 600-2 924 917 7UA 194N29129JFKLAX3452475 6 02013-01-01 06:00:00
201311558 600-2 923 937-14UA1124N53441EWRSFO3612565 6 02013-01-01 06:00:00
201311559 600-1 941 910 31AA 707N3DUAALGADFW2571389 6 02013-01-01 06:00:00
201311559 559 0 702 706 -4B61806N708JBJFKBOS 44 187 5592013-01-01 05:00:00
201311559 600-1 854 902 -8UA1187N76515EWRLAS3372227 6 02013-01-01 06:00:00
201311600 600 0 851 858 -7B6 371N595JBLGAFLL1521076 6 02013-01-01 06:00:00
201311600 600 0 837 825 12MQ4650N542MQLGAATL134 762 6 02013-01-01 06:00:00
201311601 600 1 844 850 -6B6 343N644JBEWRPBI1471023 6 02013-01-01 06:00:00
201311602 610-8 812 820 -8DL1919N971DLLGAMSP1701020 6102013-01-01 06:00:00
201311602 605-3 821 805 16MQ4401N730MQLGADTW105 502 6 52013-01-01 06:00:00
201311606 610-4 858 910-12AA1895N633AAEWRMIA1521085 6102013-01-01 06:00:00
201311606 610-4 837 845 -8DL1743N3739PJFKATL128 760 6102013-01-01 06:00:00
201311607 607 0 858 915-17UA1077N53442EWRMIA1571085 6 72013-01-01 06:00:00
201311608 600 8 807 735 32MQ3768N9EAMQEWRORD139 719 6 02013-01-01 06:00:00
.........................................................
20131121282125 322432240 3MQ4449N810MQJFKDCA 54 21321252013-01-01 21:00:00
20131121292120 923422351 -9B6 97N625JBJFKDEN223162621202013-01-01 21:00:00
20131121342045 49 202352 28UA1106N27733EWRFLL152106520452013-01-01 20:00:00
20131121362145 -9 25 39-14B6 515N198JBEWRFLL154106521452013-01-01 21:00:00
20131121402135 5 210 224-14B6 701N284JBJFKSJU189159821352013-01-01 21:00:00
20131121572155 2 43 41 2B6 43N537JBJFKMCO140 94421552013-01-01 21:00:00
20131121582200 -222542307-13EV4103N14998EWRBWI 36 16922 02013-01-01 22:00:00
20131122051720285 462040246AA1999N5DNAAEWRMIA146108517202013-01-01 17:00:00
20131122092145 24 58 37 21B6 35N608JBJFKPBI143102821452013-01-01 21:00:00
20131122092155 1424002337 23B61109N216JBJFKRDU 86 42721552013-01-01 21:00:00
20131122112145 2623392311 28B6 104N228JBJFKBUF 64 30121452013-01-01 21:00:00
20131122172229-12 249 315-26B6 713N547JBJFKSJU191159822292013-01-01 22:00:00
20131122172130 47 140 27 73B6 21N516JBJFKTPA163100521302013-01-01 21:00:00
2013112221200014123312124127EV4462N13566EWRBUF 56 28220 02013-01-01 20:00:00
20131122242200 2423242316 8EV4206N16561EWRPWM 47 28422 02013-01-01 22:00:00
20131122292159 30 149 100 49B6 11N531JBJFKFLL153106921592013-01-01 21:00:00
20131122402245 -523402356-16B6 608N279JBJFKPWM 44 27322452013-01-01 22:00:00
20131122502255 -523522359 -7B61018N521JBJFKBOS 37 18722552013-01-01 22:00:00
20131123022200 6223422253 49EV4276N13903EWRBDL 24 11622 02013-01-01 22:00:00
20131123062245 21 28 5 23B6 30N281JBJFKROC 59 26422452013-01-01 22:00:00
20131123072245 22 322357 35B6 128N178JBJFKBTV 59 26622452013-01-01 22:00:00
20131123102255 15 24 15 9B6 112N646JBJFKBUF 57 30122552013-01-01 22:00:00
20131123122000192 212110191EV4312N13958EWRDCA 44 19920 02013-01-01 20:00:00
20131123232200 83 222313 69EV4257N13538EWRBTV 44 26622 02013-01-01 22:00:00
20131123262130116 131 18 73B6 199N594JBJFKLAS290224821302013-01-01 21:00:00
20131123272250 37 322359 33B6 22N639JBJFKSYR 45 20922502013-01-01 22:00:00
20131123431724379 3141938456EV4321N21197EWRMCI222109217242013-01-01 17:00:00
20131123532359 -6 425 445-20B6 739N591JBJFKPSE195161723592013-01-01 23:00:00
20131123532359 -6 418 442-24B6 707N794JBJFKSJU185159823592013-01-01 23:00:00
20131123562359 -3 425 437-12B6 727N588JBJFKBQN186157623592013-01-01 23:00:00

Exercises:

  1. Use mutate(), is.na(), and count() together to describe how the missing values in dep_time, sched_dep_time and dep_delay are connected.

12.3 Boolean Algebra

!, &, xor, |

df <- tibble(x = c(TRUE, FALSE, NA))

df |> 
  mutate(
    and = x & NA,
    or = x | NA
  )
A tibble: 3 x 3
xandor
<lgl><lgl><lgl>
TRUE NATRUE
FALSEFALSE NA
NA NA NA
flights |> 
  filter(month %in% c(11, 12))
A tibble: 55403 x 19
yearmonthdaydep_timesched_dep_timedep_delayarr_timesched_arr_timearr_delaycarrierflighttailnumorigindestair_timedistancehourminutetime_hour
<int><int><int><int><int><dbl><int><int><dbl><chr><int><chr><chr><chr><dbl><dbl><dbl><dbl><dttm>
2013111 52359 6352 345 7B6 745N568JBJFKPSE205161723592013-11-01 23:00:00
2013111 3522501051232356 87B61816N353JBJFKSYR 36 20922502013-11-01 22:00:00
2013111455 500 -5641 651-10US1895N192UWEWRCLT 88 529 5 02013-11-01 05:00:00
2013111539 545 -6856 827 29UA1714N38727LGAIAH2291416 5452013-11-01 05:00:00
2013111542 545 -3831 855-24AA2243N5CLAAJFKMIA1471089 5452013-11-01 05:00:00
2013111549 600-11912 923-11UA 303N595UAJFKSFO3592586 6 02013-11-01 06:00:00
2013111550 600-10705 659 6US2167N748UWLGADCA 57 214 6 02013-11-01 06:00:00
2013111554 600 -6659 701 -2US2134N742PSLGABOS 40 184 6 02013-11-01 06:00:00
2013111554 600 -6826 827 -1DL 563N912DELGAATL126 762 6 02013-11-01 06:00:00
2013111554 600 -6749 751 -2DL 731N315NBLGADTW 93 502 6 02013-11-01 06:00:00
2013111555 600 -5847 854 -7B6 605N640JBEWRFLL1491065 6 02013-11-01 06:00:00
2013111555 600 -5839 846 -7B6 583N661JBJFKMCO136 944 6 02013-11-01 06:00:00
2013111555 600 -5929 943-14B61403N746JBJFKSJU1961598 6 02013-11-01 06:00:00
2013111556 600 -4834 851-17UA 407N834UAEWRTPA142 997 6 02013-11-01 06:00:00
2013111558 600 -2727 730 -3UA 279N459UAEWRORD118 719 6 02013-11-01 06:00:00
2013111558 600 -2650 658 -8US1909N950UWLGAPHL 38 96 6 02013-11-01 06:00:00
2013111558 600 -2914 905 9AA1175N3GHAALGAMIA1551096 6 02013-11-01 06:00:00
2013111558 600 -2720 715 5WN 464N390SWEWRMDW117 711 6 02013-11-01 06:00:00
2013111559 600 -1756 730 26AA 301N4YHAALGAORD119 733 6 02013-11-01 06:00:00
2013111600 600 0709 716 -7EV5716N820ASJFKIAD 49 228 6 02013-11-01 06:00:00
2013111600 600 0725 721 4UA1198N23707LGAORD121 733 6 02013-11-01 06:00:00
2013111601 600 1853 856 -3B6 371N570JBLGAFLL1501076 6 02013-11-01 06:00:00
2013111601 610 -9803 813-10DL1919N922DLLGAMSP1531020 6102013-11-01 06:00:00
2013111602 600 2843 815 28FL 345N353ATLGAATL131 762 6 02013-11-01 06:00:00
2013111603 600 3717 711 6EV4533N11109EWRBUF 53 282 6 02013-11-01 06:00:00
2013111604 610 -6855 855 0AA1103N3FTAALGADFW2051389 6102013-11-01 06:00:00
2013111606 615 -9746 750 -4MQ3525N834MQLGARDU 74 431 6152013-11-01 06:00:00
2013111606 615 -9807 817-10US1899N199UWJFKCLT 93 541 6152013-11-01 06:00:00
2013111606 610 -4752 745 7WN2609N440LVLGASTL142 888 6102013-11-01 06:00:00
2013111607 611 -4857 912-15B6 601N597JBJFKFLL1491069 6112013-11-01 06:00:00
.........................................................
2013123121552039 76 2532355 NAB61205N627JBJFKPDX NA245420392013-12-31 20:00:00
2013123121552150 5 110 51 19B61901N729JBJFKFLL164106921502013-12-31 21:00:00
2013123121592155 4 55 46 9B62053N593JBJFKPBI155102821552013-12-31 21:00:00
2013123122062110 56 442339 65B6 775N184JBJFKMSY195118221102013-12-31 21:00:00
2013123122112159 12 100 45 15B61183N715JBJFKMCO148 94421592013-12-31 21:00:00
2013123122182219 -1 315 304 11B61203N625JBJFKSJU202159822192013-12-31 22:00:00
2013123122352245-1023512355 -4B6 234N355JBJFKBTV 49 26622452013-12-31 22:00:00
2013123122452250 -523592356 3B61816N318JBJFKSYR 51 20922502013-12-31 22:00:00
2013123123102255 15 72356 11B6 718N279JBJFKBOS 40 18722552013-12-31 22:00:00
2013123123212250 31 46 8 38B62002N179JBJFKBUF 66 30122502013-12-31 22:00:00
2013123123282330 -2 412 409 3B61389N651JBEWRSJU198160823302013-12-31 23:00:00
2013123123322245 47 58 3 55B6 486N334JBJFKROC 60 26422452013-12-31 22:00:00
2013123123552359 -4 430 440-10B61503N509JBJFKSJU195159823592013-12-31 23:00:00
2013123123562359 -3 436 445 -9B6 745N665JBJFKPSE200161723592013-12-31 23:00:00
20131231 NA1520 NA NA1705 NAAA 341N568AALGAORD NA 73315202013-12-31 15:00:00
20131231 NA2025 NA NA2205 NAAA 371N482AALGAORD NA 73320252013-12-31 20:00:00
20131231 NA1932 NA NA2305 NAB6 161N516JBJFKSMF NA252119322013-12-31 19:00:00
20131231 NA1505 NA NA1725 NAEV4181N24103EWRMCI NA109215 52013-12-31 15:00:00
20131231 NA1000 NA NA1252 NAUA1124NA EWREGE NA172510 02013-12-31 10:00:00
20131231 NA 840 NA NA1205 NAUA1151NA EWRSEA NA2402 8402013-12-31 08:00:00
20131231 NA 754 NA NA1118 NAUA1455NA EWRLAX NA2454 7542013-12-31 07:00:00
20131231 NA2000 NA NA2146 NAUA1482NA EWRORD NA 71920 02013-12-31 20:00:00
20131231 NA1500 NA NA1817 NAUA1483NA EWRAUS NA150415 02013-12-31 15:00:00
20131231 NA1430 NA NA1750 NAUA1493NA EWRLAX NA245414302013-12-31 14:00:00
20131231 NA 855 NA NA1142 NAUA1506NA EWRJAC NA1874 8552013-12-31 08:00:00
20131231 NA 705 NA NA 931 NAUA1729NA EWRDEN NA1605 7 52013-12-31 07:00:00
20131231 NA 825 NA NA1029 NAUS1831NA JFKCLT NA 541 8252013-12-31 08:00:00
20131231 NA1615 NA NA1800 NAMQ3301N844MQLGARDU NA 43116152013-12-31 16:00:00
20131231 NA 600 NA NA 735 NAUA 219NA EWRORD NA 719 6 02013-12-31 06:00:00
20131231 NA 830 NA NA1154 NAUA 443NA JFKLAX NA2475 8302013-12-31 08:00:00

Exercises

  1. Find all flights where arr_delay is missing but dep_delay is not. Find all flights where neither arr_time nor sched_arr_time are missing, but arr_delay is.
  2. How many flights have a missing dep_time? What other variables are missing in these rows? What might these rows represent?
  3. Assuming that a missing dep_time implies that a flight is cancelled, look at the number of cancelled flights per day. Is there a pattern? Is there a connection between the proportion of cancelled flights and the average delay of non-cancelled flights?

12.4 Summarizes

Logical summarizes

Nummeric summaries of logical vectors

Logical subsetting

Exercises

12.5 Conditional transformations

if_else()

case_when()

Compatible types

Exercises

12.6 Summary

C13. Numbers

13.2 Making numbers

13.3 Counts

13.4 Numeric transformations

13.5 General transformations

13.6 Numeric summaries

13.7 Summary

C14. Strings

14.2 Creating a string

14.3 Creating many strings from data

14.4 Extracting data from strings

14.5 Letters

14.6 Non-English text

14.7 Summary

C15 Regular Expressions

library(tidyverse)
library(babynames)

15.2 Pattern basics

15.3 Key functions

15.4 Pattern details

15.5 Pattern control

15.6 Practice

15.7 Regular expressions in other places

15.8 Summary

C16. Factors

16.2 Factor basics

month_levels <- c(
  "Jan", "Feb", "Mar", "Apr", "May", "Jun", 
  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)
x1 <- c("Dec", "Apr", "Jan", "Mar")

y1 <- factor(x1, levels = month_levels)
y1

sort(y1)

<ol class=list-inline><li>Dec</li><li>Apr</li><li>Jan</li><li>Mar</li></ol>

<summary style=display:list-item;cursor:pointer> Levels: </summary> <ol class=list-inline>
  • 'Jan'
  • 'Feb'
  • 'Mar'
  • 'Apr'
  • 'May'
  • 'Jun'
  • 'Jul'
  • 'Aug'
  • 'Sep'
  • 'Oct'
  • 'Nov'
  • 'Dec'
  • </ol>

    <ol class=list-inline><li>Jan</li><li>Mar</li><li>Apr</li><li>Dec</li></ol>

    <summary style=display:list-item;cursor:pointer> Levels: </summary> <ol class=list-inline>
  • 'Jan'
  • 'Feb'
  • 'Mar'
  • 'Apr'
  • 'May'
  • 'Jun'
  • 'Jul'
  • 'Aug'
  • 'Sep'
  • 'Oct'
  • 'Nov'
  • 'Dec'
  • </ol>
    x2 <- c("Dec", "Apr", "Jam", "Mar")
    y2 <- factor(x2, levels = month_levels)
    y2
    
    sort(y2)
    

    <ol class=list-inline><li>Dec</li><li>Apr</li><li><NA></li><li>Mar</li></ol>

    <summary style=display:list-item;cursor:pointer> Levels: </summary> <ol class=list-inline>
  • 'Jan'
  • 'Feb'
  • 'Mar'
  • 'Apr'
  • 'May'
  • 'Jun'
  • 'Jul'
  • 'Aug'
  • 'Sep'
  • 'Oct'
  • 'Nov'
  • 'Dec'
  • </ol>

    <ol class=list-inline><li>Mar</li><li>Apr</li><li>Dec</li></ol>

    <summary style=display:list-item;cursor:pointer> Levels: </summary> <ol class=list-inline>
  • 'Jan'
  • 'Feb'
  • 'Mar'
  • 'Apr'
  • 'May'
  • 'Jun'
  • 'Jul'
  • 'Aug'
  • 'Sep'
  • 'Oct'
  • 'Nov'
  • 'Dec'
  • </ol>
    levels(y2)
    

    <ol class=list-inline><li>‘Jan’</li><li>‘Feb’</li><li>‘Mar’</li><li>‘Apr’</li><li>‘May’</li><li>‘Jun’</li><li>‘Jul’</li><li>‘Aug’</li><li>‘Sep’</li><li>‘Oct’</li><li>‘Nov’</li><li>‘Dec’</li></ol>

    library(readr)
    csv <- "
    month,value
    Jan,12
    Feb,56
    Mar,12"
    
    df <- read_csv(csv, col_types = cols(month = col_factor(month_levels)))
    df$month
    

    <ol class=list-inline><li>Jan</li><li>Feb</li><li>Mar</li></ol>

    <summary style=display:list-item;cursor:pointer> Levels: </summary> <ol class=list-inline>
  • 'Jan'
  • 'Feb'
  • 'Mar'
  • 'Apr'
  • 'May'
  • 'Jun'
  • 'Jul'
  • 'Aug'
  • 'Sep'
  • 'Oct'
  • 'Nov'
  • 'Dec'
  • </ol>

    16.3 General Social Survey

    library(tidyverse)
    library(gridExtra)
    gss_cat
    
    A tibble: 21483 x 9
    yearmaritalageracerincomepartyidreligdenomtvhours
    <int><fct><int><fct><fct><fct><fct><fct><int>
    2000Never married26White$8000 to 9999 Ind,near rep Protestant Southern baptist 12
    2000Divorced 48White$8000 to 9999 Not str republicanProtestant Baptist-dk which NA
    2000Widowed 67WhiteNot applicableIndependent Protestant No denomination 2
    2000Never married39WhiteNot applicableInd,near rep Orthodox-christianNot applicable 4
    2000Divorced 25WhiteNot applicableNot str democrat None Not applicable 1
    2000Married 25White$20000 - 24999Strong democrat Protestant Southern baptist NA
    2000Never married36White$25000 or moreNot str republicanChristian Not applicable 3
    2000Divorced 44White$7000 to 7999 Ind,near dem Protestant Lutheran-mo synodNA
    2000Married 44White$25000 or moreNot str democrat Protestant Other 0
    2000Married 47White$25000 or moreStrong republican Protestant Southern baptist 3
    2000Married 53White$25000 or moreNot str democrat Protestant Other 2
    2000Married 52White$25000 or moreInd,near rep None Not applicable NA
    2000Married 52White$25000 or moreStrong democrat Protestant Southern baptist 1
    2000Married 51White$25000 or moreStrong republican Protestant United methodist NA
    2000Divorced 52White$25000 or moreInd,near dem None Not applicable 1
    2000Married 40Black$25000 or moreStrong democrat Protestant Baptist-dk which 7
    2000Widowed 77WhiteNot applicableStrong republican Jewish Not applicable NA
    2000Never married44White$25000 or moreIndependent None Not applicable 3
    2000Married 40White$10000 - 14999Not str democrat Catholic Not applicable 3
    2000Married 45BlackNot applicableIndependent Protestant United methodist NA
    2000Married 48White$25000 or moreInd,near dem Catholic Not applicable 1
    2000Married 49WhiteRefused Strong republican Protestant United methodist 2
    2000Never married19WhiteNot applicableIndependent None Not applicable 2
    2000Widowed 54White$25000 or moreInd,near rep Christian Not applicable 1
    2000Widowed 82WhiteNot applicableNot str democrat Protestant Other 3
    2000Widowed 83WhiteNot applicableStrong democrat Protestant Episcopal NA
    2000Widowed 89WhiteNot applicableNot str democrat Protestant Other lutheran 4
    2000Widowed 88WhiteNot applicableStrong republican Protestant Afr meth ep zion NA
    2000Divorced 72WhiteNot applicableStrong democrat Protestant Southern baptist 7
    2000Widowed 82WhiteNot applicableIndependent Protestant Am bapt ch in usaNA
    ...........................
    2014Divorced 38White$3000 to 3999 Not str republicanProtestantOther 1
    2014Widowed 46White$25000 or moreStrong democrat None Not applicable 2
    2014Married 49WhiteNot applicableInd,near rep ProtestantOther 6
    2014Never married34White$25000 or moreIndependent ProtestantUnited methodist 2
    2014Married 54WhiteNot applicableIndependent ProtestantOther NA
    2014Married 34White$15000 - 19999Ind,near dem Buddhism Not applicable 1
    2014Married 69WhiteNot applicableInd,near dem Jewish Not applicable 3
    2014Divorced 36WhiteNot applicableIndependent None Not applicable 0
    2014Married 65White$25000 or moreNot str democrat None Not applicable 2
    2014Married 48White$20000 - 24999Strong democrat ProtestantOther 0
    2014Married 38White$10000 - 14999Not str democrat ProtestantNo denomination 2
    2014Never married30White$4000 to 4999 Ind,near dem None Not applicable 2
    2014Married 48White$8000 to 9999 Not str republicanCatholic Not applicable 0
    2014Divorced 49White$25000 or moreInd,near rep Other Not applicable 2
    2014Married 54White$25000 or moreInd,near dem ProtestantOther NA
    2014Married 49White$25000 or moreNot str republicanCatholic Not applicable NA
    2014Married 53White$25000 or moreNot str democrat None Not applicable 0
    2014Married 52White$25000 or moreNot str democrat None Not applicable 1
    2014Widowed 82WhiteNot applicableStrong democrat ProtestantOther 2
    2014Married 63WhiteNot applicableInd,near dem No answer No answer 2
    2014Divorced 54White$25000 or moreInd,near rep Catholic Not applicable 3
    2014Married 62White$25000 or moreInd,near rep ProtestantOther NA
    2014Never married40White$1000 to 2999 Not str republicanNone Not applicable 2
    2014Married 33WhiteNot applicableIndependent Christian No denomination 0
    2014Widowed 75WhiteDon't know Strong republican ProtestantBaptist-dk which 4
    2014Widowed 89WhiteNot applicableNot str republicanProtestantUnited methodist 3
    2014Divorced 56White$25000 or moreIndependent None Not applicable 4
    2014Never married24White$10000 - 14999Ind,near dem None Not applicable 4
    2014Never married27White$25000 or moreNot str democrat Catholic Not applicable NA
    2014Widowed 71White$20000 - 24999Ind,near rep ProtestantOther 2
    gss_cat |>
      count(race)
    
    
    A tibble: 3 x 2
    racen
    <fct><int>
    Other 1959
    Black 3129
    White16395

    Exercises

    1. Explore the distribution of rincome (reported income). What makes the default bar chart hard to understand? How could you improve the plot?

    2. What is the most common relig in this survey? What’s the most common partyid?

    3. Which relig does denom (denomination) apply to? How can you find out with a table? How can you find out with a visualization?

    16.4 Modifying factor order

    relig_summary <- gss_cat |>
      group_by(relig) |>
      summarize(
        tvhours = mean(tvhours, na.rm = TRUE),
        n = n()
      )
    
    ggplot(relig_summary, aes(x = tvhours, y = relig)) + 
      geom_point()
    

    png

    ggplot(relig_summary, aes(x = tvhours, y = fct_reorder(relig, tvhours))) +
      geom_point()
    

    png

    relig_summary |>
      mutate(
        relig = fct_reorder(relig, tvhours)
      ) |>
      ggplot(aes(x = tvhours, y = relig)) +
      geom_point()
    

    png

    rincome_summary <- gss_cat |>
      group_by(rincome) |>
      summarize(
        age = mean(age, na.rm = TRUE),
        n = n()
      )
    
    ggplot(rincome_summary, aes(x = age, y = fct_reorder(rincome, age))) + 
      geom_point()
    

    png

    ggplot(rincome_summary, aes(x = age, y = fct_relevel(rincome, "Not applicable"))) +
      geom_point()
    

    png

    by_age <- gss_cat |>
      filter(!is.na(age)) |> 
      count(age, marital) |>
      group_by(age) |>
      mutate(
        prop = n / sum(n)
      )
    
    p1 <- ggplot(by_age, aes(x = age, y = prop, color = marital)) +
      geom_line(linewidth = 1) + 
      scale_color_brewer(palette = "Set1")
    
    p2 <- ggplot(by_age, aes(x = age, y = prop, color = fct_reorder2(marital, age, prop))) +
      geom_line(linewidth = 1) +
      scale_color_brewer(palette = "Set1") + 
      labs(color = "marital") 
    
    options(repr.plot.width = 20, repr.plot.height = 5)
    grid.arrange(p1, p2, ncol=2)
    
    

    png

    gss_cat |>
      mutate(marital = marital |> fct_infreq() |> fct_rev()) |>
      ggplot(aes(x = marital)) +
      geom_bar()
      options(repr.plot.width = 10, repr.plot.height = 5)
    
    

    png

    1. There are some suspiciously high numbers in tvhours. Is the mean a good summary?

    2. For each factor in gss_cat identify whether the order of the levels is arbitrary or principled.

    3. Why did moving “Not applicable” to the front of the levels move it to the bottom of the plot?

    16.5 Modifying factor levels

    gss_cat |> count(partyid)
    
    A tibble: 10 x 2
    partyidn
    <fct><int>
    No answer 154
    Don't know 1
    Other party 393
    Strong republican 2314
    Not str republican3032
    Ind,near rep 1791
    Independent 4119
    Ind,near dem 2499
    Not str democrat 3690
    Strong democrat 3490
    gss_cat |>
      mutate(
        partyid = fct_recode(partyid,
          "Republican, strong"    = "Strong republican",
          "Republican, weak"      = "Not str republican",
          "Independent, near rep" = "Ind,near rep",
          "Independent, near dem" = "Ind,near dem",
          "Democrat, weak"        = "Not str democrat",
          "Democrat, strong"      = "Strong democrat"
        )
      ) |>
      count(partyid)
    
    A tibble: 10 x 2
    partyidn
    <fct><int>
    No answer 154
    Don't know 1
    Other party 393
    Republican, strong 2314
    Republican, weak 3032
    Independent, near rep1791
    Independent 4119
    Independent, near dem2499
    Democrat, weak 3690
    Democrat, strong 3490
    gss_cat |>
      mutate(
        partyid = fct_recode(partyid,
          "Republican, strong"    = "Strong republican",
          "Republican, weak"      = "Not str republican",
          "Independent, near rep" = "Ind,near rep",
          "Independent, near dem" = "Ind,near dem",
          "Democrat, weak"        = "Not str democrat",
          "Democrat, strong"      = "Strong democrat",
          "Other"                 = "No answer",
          "Other"                 = "Don't know",
          "Other"                 = "Other party"
        )
      )
    
    A tibble: 21483 x 9
    yearmaritalageracerincomepartyidreligdenomtvhours
    <int><fct><int><fct><fct><fct><fct><fct><int>
    2000Never married26White$8000 to 9999 Independent, near repProtestant Southern baptist 12
    2000Divorced 48White$8000 to 9999 Republican, weak Protestant Baptist-dk which NA
    2000Widowed 67WhiteNot applicableIndependent Protestant No denomination 2
    2000Never married39WhiteNot applicableIndependent, near repOrthodox-christianNot applicable 4
    2000Divorced 25WhiteNot applicableDemocrat, weak None Not applicable 1
    2000Married 25White$20000 - 24999Democrat, strong Protestant Southern baptist NA
    2000Never married36White$25000 or moreRepublican, weak Christian Not applicable 3
    2000Divorced 44White$7000 to 7999 Independent, near demProtestant Lutheran-mo synodNA
    2000Married 44White$25000 or moreDemocrat, weak Protestant Other 0
    2000Married 47White$25000 or moreRepublican, strong Protestant Southern baptist 3
    2000Married 53White$25000 or moreDemocrat, weak Protestant Other 2
    2000Married 52White$25000 or moreIndependent, near repNone Not applicable NA
    2000Married 52White$25000 or moreDemocrat, strong Protestant Southern baptist 1
    2000Married 51White$25000 or moreRepublican, strong Protestant United methodist NA
    2000Divorced 52White$25000 or moreIndependent, near demNone Not applicable 1
    2000Married 40Black$25000 or moreDemocrat, strong Protestant Baptist-dk which 7
    2000Widowed 77WhiteNot applicableRepublican, strong Jewish Not applicable NA
    2000Never married44White$25000 or moreIndependent None Not applicable 3
    2000Married 40White$10000 - 14999Democrat, weak Catholic Not applicable 3
    2000Married 45BlackNot applicableIndependent Protestant United methodist NA
    2000Married 48White$25000 or moreIndependent, near demCatholic Not applicable 1
    2000Married 49WhiteRefused Republican, strong Protestant United methodist 2
    2000Never married19WhiteNot applicableIndependent None Not applicable 2
    2000Widowed 54White$25000 or moreIndependent, near repChristian Not applicable 1
    2000Widowed 82WhiteNot applicableDemocrat, weak Protestant Other 3
    2000Widowed 83WhiteNot applicableDemocrat, strong Protestant Episcopal NA
    2000Widowed 89WhiteNot applicableDemocrat, weak Protestant Other lutheran 4
    2000Widowed 88WhiteNot applicableRepublican, strong Protestant Afr meth ep zion NA
    2000Divorced 72WhiteNot applicableDemocrat, strong Protestant Southern baptist 7
    2000Widowed 82WhiteNot applicableIndependent Protestant Am bapt ch in usaNA
    ...........................
    2014Divorced 38White$3000 to 3999 Republican, weak ProtestantOther 1
    2014Widowed 46White$25000 or moreDemocrat, strong None Not applicable 2
    2014Married 49WhiteNot applicableIndependent, near repProtestantOther 6
    2014Never married34White$25000 or moreIndependent ProtestantUnited methodist 2
    2014Married 54WhiteNot applicableIndependent ProtestantOther NA
    2014Married 34White$15000 - 19999Independent, near demBuddhism Not applicable 1
    2014Married 69WhiteNot applicableIndependent, near demJewish Not applicable 3
    2014Divorced 36WhiteNot applicableIndependent None Not applicable 0
    2014Married 65White$25000 or moreDemocrat, weak None Not applicable 2
    2014Married 48White$20000 - 24999Democrat, strong ProtestantOther 0
    2014Married 38White$10000 - 14999Democrat, weak ProtestantNo denomination 2
    2014Never married30White$4000 to 4999 Independent, near demNone Not applicable 2
    2014Married 48White$8000 to 9999 Republican, weak Catholic Not applicable 0
    2014Divorced 49White$25000 or moreIndependent, near repOther Not applicable 2
    2014Married 54White$25000 or moreIndependent, near demProtestantOther NA
    2014Married 49White$25000 or moreRepublican, weak Catholic Not applicable NA
    2014Married 53White$25000 or moreDemocrat, weak None Not applicable 0
    2014Married 52White$25000 or moreDemocrat, weak None Not applicable 1
    2014Widowed 82WhiteNot applicableDemocrat, strong ProtestantOther 2
    2014Married 63WhiteNot applicableIndependent, near demNo answer No answer 2
    2014Divorced 54White$25000 or moreIndependent, near repCatholic Not applicable 3
    2014Married 62White$25000 or moreIndependent, near repProtestantOther NA
    2014Never married40White$1000 to 2999 Republican, weak None Not applicable 2
    2014Married 33WhiteNot applicableIndependent Christian No denomination 0
    2014Widowed 75WhiteDon't know Republican, strong ProtestantBaptist-dk which 4
    2014Widowed 89WhiteNot applicableRepublican, weak ProtestantUnited methodist 3
    2014Divorced 56White$25000 or moreIndependent None Not applicable 4
    2014Never married24White$10000 - 14999Independent, near demNone Not applicable 4
    2014Never married27White$25000 or moreDemocrat, weak Catholic Not applicable NA
    2014Widowed 71White$20000 - 24999Independent, near repProtestantOther 2
    gss_cat |>
      mutate(
        partyid = fct_collapse(partyid,
          "other" = c("No answer", "Don't know", "Other party"),
          "rep" = c("Strong republican", "Not str republican"),
          "ind" = c("Ind,near rep", "Independent", "Ind,near dem"),
          "dem" = c("Not str democrat", "Strong democrat")
        )
      ) |>
      count(partyid)
    
    A tibble: 4 x 2
    partyidn
    <fct><int>
    other 548
    rep 5346
    ind 8409
    dem 7180
    gss_cat |>
      mutate(relig = fct_lump_lowfreq(relig)) |>
      count(relig)
    
    A tibble: 2 x 2
    relign
    <fct><int>
    Protestant10846
    Other 10637
    gss_cat |>
      mutate(relig = fct_lump_n(relig, n = 10)) |>
      count(relig, sort = TRUE)
    
    A tibble: 10 x 2
    relign
    <fct><int>
    Protestant 10846
    Catholic 5124
    None 3523
    Christian 689
    Other 458
    Jewish 388
    Buddhism 147
    Inter-nondenominational 109
    Moslem/islam 104
    Orthodox-christian 95

    Exercises

    1. How have the proportions of people identifying as Democrat, Republican, and Independent changed over time?

    2. How could you collapse rincome into a small set of categories?

    3. Notice there are 9 groups (excluding other) in the fct_lump example above. Why not 10? (Hint: type ?fct_lump, and find the default for the argument other_level is “Other”.)

    16.6 Ordered factors

    ordered(c("a", "b", "c"))
    

    <ol class=list-inline><li>a</li><li>b</li><li>c</li></ol>

    <summary style=display:list-item;cursor:pointer> Levels: </summary> <ol class=list-inline>
  • 'a'
  • 'b'
  • 'c'
  • </ol>

    16.7 Summary

    C17. Dates and Times

    library(tidyverse)
    library(nycflights13)
    

    17.2 Creating date/times

    Year:

    • %Y 4 digit year
    • %y 2 digit year

    Month:

    • %m Number
    • %b Abbreviated name
    • %B Full name

    Day:

    • %d One or two digits
    • %e Two digits

    Time:

    • %H 24-hour hour
    • %I 12-hour hour
    • %p AM/PM
    • %M Minutes
    • %S Seconds
    • %OS Seconds with decimal component
    • %Z Time zone name
    • %z Offset from UTC

    Others:

    • %. Skip one non-digit
    • %* Skip any number of non-digits
    today()
    now()
    
    [1] "2024-03-12 11:34:12 AEST"
    

    During import

    csv <- "
      date
      01/02/15
    "
    
    read_csv(csv, col_types = cols(date = col_date("%m/%d/%y")))
    read_csv(csv, col_types = cols(date = col_date("%d/%m/%y")))
    read_csv(csv, col_types = cols(date = col_date("%y/%m/%d")))
    
    A spec_tbl_df: 1 x 1
    date
    <date>
    2015-01-02
    A spec_tbl_df: 1 x 1
    date
    <date>
    2015-02-01
    A spec_tbl_df: 1 x 1
    date
    <date>
    2001-02-15

    From strings

    ymd("2017-01-31")
    mdy("January 31st, 2017")
    dmy("31-Jan-2017")
    
    ymd("2017-01-31", tz = "UTC")
    
    
    [1] "2017-01-31 UTC"
    

    From individual components

    • make_date()
    • make_datetime()
    flights |> 
      select(year, month, day, hour, minute) |> 
      mutate(departure = make_datetime(year, month, day, hour, minute))
    
    A tibble: 336776 x 6
    yearmonthdayhourminutedeparture
    <int><int><int><dbl><dbl><dttm>
    2013115152013-01-01 05:15:00
    2013115292013-01-01 05:29:00
    2013115402013-01-01 05:40:00
    2013115452013-01-01 05:45:00
    2013116 02013-01-01 06:00:00
    2013115582013-01-01 05:58:00
    2013116 02013-01-01 06:00:00
    2013116 02013-01-01 06:00:00
    2013116 02013-01-01 06:00:00
    2013116 02013-01-01 06:00:00
    2013116 02013-01-01 06:00:00
    2013116 02013-01-01 06:00:00
    2013116 02013-01-01 06:00:00
    2013116 02013-01-01 06:00:00
    2013116 02013-01-01 06:00:00
    2013115592013-01-01 05:59:00
    2013116 02013-01-01 06:00:00
    2013116 02013-01-01 06:00:00
    2013116 02013-01-01 06:00:00
    2013116 02013-01-01 06:00:00
    2013116102013-01-01 06:10:00
    2013116 52013-01-01 06:05:00
    2013116102013-01-01 06:10:00
    2013116102013-01-01 06:10:00
    2013116 72013-01-01 06:07:00
    2013116 02013-01-01 06:00:00
    2013116 02013-01-01 06:00:00
    2013116102013-01-01 06:10:00
    2013116152013-01-01 06:15:00
    2013116152013-01-01 06:15:00
    ..................
    201393021252013-09-30 21:25:00
    201393021292013-09-30 21:29:00
    201393021302013-09-30 21:30:00
    201393020592013-09-30 20:59:00
    201393021402013-09-30 21:40:00
    201393021402013-09-30 21:40:00
    201393021292013-09-30 21:29:00
    201393021452013-09-30 21:45:00
    201393021372013-09-30 21:37:00
    201393021562013-09-30 21:56:00
    201393021592013-09-30 21:59:00
    201393018452013-09-30 18:45:00
    201393022 52013-09-30 22:05:00
    201393021402013-09-30 21:40:00
    201393020592013-09-30 20:59:00
    201393022452013-09-30 22:45:00
    201393021132013-09-30 21:13:00
    201393020 12013-09-30 20:01:00
    201393022452013-09-30 22:45:00
    201393022452013-09-30 22:45:00
    201393022502013-09-30 22:50:00
    201393022462013-09-30 22:46:00
    201393022552013-09-30 22:55:00
    201393023592013-09-30 23:59:00
    201393018422013-09-30 18:42:00
    201393014552013-09-30 14:55:00
    201393022 02013-09-30 22:00:00
    201393012102013-09-30 12:10:00
    201393011592013-09-30 11:59:00
    2013930 8402013-09-30 08:40:00
    make_datetime_100 <- function(year, month, day, time) {
      make_datetime(year, month, day, time %/% 100, time %% 100)
    }
    
    flights_dt <- flights |> 
      filter(!is.na(dep_time), !is.na(arr_time)) |> 
      mutate(
        dep_time = make_datetime_100(year, month, day, dep_time),
        arr_time = make_datetime_100(year, month, day, arr_time),
        sched_dep_time = make_datetime_100(year, month, day, sched_dep_time),
        sched_arr_time = make_datetime_100(year, month, day, sched_arr_time)
      ) |> 
      select(origin, dest, ends_with("delay"), ends_with("time"))
    
    flights_dt
    
    A tibble: 328063 x 9
    origindestdep_delayarr_delaydep_timesched_dep_timearr_timesched_arr_timeair_time
    <chr><chr><dbl><dbl><dttm><dttm><dttm><dttm><dbl>
    EWRIAH 2 112013-01-01 05:17:002013-01-01 05:15:002013-01-01 08:30:002013-01-01 08:19:00227
    LGAIAH 4 202013-01-01 05:33:002013-01-01 05:29:002013-01-01 08:50:002013-01-01 08:30:00227
    JFKMIA 2 332013-01-01 05:42:002013-01-01 05:40:002013-01-01 09:23:002013-01-01 08:50:00160
    JFKBQN-1-182013-01-01 05:44:002013-01-01 05:45:002013-01-01 10:04:002013-01-01 10:22:00183
    LGAATL-6-252013-01-01 05:54:002013-01-01 06:00:002013-01-01 08:12:002013-01-01 08:37:00116
    EWRORD-4 122013-01-01 05:54:002013-01-01 05:58:002013-01-01 07:40:002013-01-01 07:28:00150
    EWRFLL-5 192013-01-01 05:55:002013-01-01 06:00:002013-01-01 09:13:002013-01-01 08:54:00158
    LGAIAD-3-142013-01-01 05:57:002013-01-01 06:00:002013-01-01 07:09:002013-01-01 07:23:00 53
    JFKMCO-3 -82013-01-01 05:57:002013-01-01 06:00:002013-01-01 08:38:002013-01-01 08:46:00140
    LGAORD-2 82013-01-01 05:58:002013-01-01 06:00:002013-01-01 07:53:002013-01-01 07:45:00138
    JFKPBI-2 -22013-01-01 05:58:002013-01-01 06:00:002013-01-01 08:49:002013-01-01 08:51:00149
    JFKTPA-2 -32013-01-01 05:58:002013-01-01 06:00:002013-01-01 08:53:002013-01-01 08:56:00158
    JFKLAX-2 72013-01-01 05:58:002013-01-01 06:00:002013-01-01 09:24:002013-01-01 09:17:00345
    EWRSFO-2-142013-01-01 05:58:002013-01-01 06:00:002013-01-01 09:23:002013-01-01 09:37:00361
    LGADFW-1 312013-01-01 05:59:002013-01-01 06:00:002013-01-01 09:41:002013-01-01 09:10:00257
    JFKBOS 0 -42013-01-01 05:59:002013-01-01 05:59:002013-01-01 07:02:002013-01-01 07:06:00 44
    EWRLAS-1 -82013-01-01 05:59:002013-01-01 06:00:002013-01-01 08:54:002013-01-01 09:02:00337
    LGAFLL 0 -72013-01-01 06:00:002013-01-01 06:00:002013-01-01 08:51:002013-01-01 08:58:00152
    LGAATL 0 122013-01-01 06:00:002013-01-01 06:00:002013-01-01 08:37:002013-01-01 08:25:00134
    EWRPBI 1 -62013-01-01 06:01:002013-01-01 06:00:002013-01-01 08:44:002013-01-01 08:50:00147
    LGAMSP-8 -82013-01-01 06:02:002013-01-01 06:10:002013-01-01 08:12:002013-01-01 08:20:00170
    LGADTW-3 162013-01-01 06:02:002013-01-01 06:05:002013-01-01 08:21:002013-01-01 08:05:00105
    EWRMIA-4-122013-01-01 06:06:002013-01-01 06:10:002013-01-01 08:58:002013-01-01 09:10:00152
    JFKATL-4 -82013-01-01 06:06:002013-01-01 06:10:002013-01-01 08:37:002013-01-01 08:45:00128
    EWRMIA 0-172013-01-01 06:07:002013-01-01 06:07:002013-01-01 08:58:002013-01-01 09:15:00157
    EWRORD 8 322013-01-01 06:08:002013-01-01 06:00:002013-01-01 08:07:002013-01-01 07:35:00139
    JFKSFO11 142013-01-01 06:11:002013-01-01 06:00:002013-01-01 09:45:002013-01-01 09:31:00366
    JFKRSW 3 42013-01-01 06:13:002013-01-01 06:10:002013-01-01 09:25:002013-01-01 09:21:00175
    JFKSJU 0-212013-01-01 06:15:002013-01-01 06:15:002013-01-01 10:39:002013-01-01 11:00:00182
    EWRATL 0 -92013-01-01 06:15:002013-01-01 06:15:002013-01-01 08:33:002013-01-01 08:42:00120
    ...........................
    LGADTW 5 -92013-09-30 21:15:002013-09-30 21:10:002013-09-30 22:46:002013-09-30 22:55:00 72
    EWRSDF -8-202013-09-30 21:16:002013-09-30 21:24:002013-09-30 23:17:002013-09-30 23:37:00 94
    EWRMCI 74 582013-09-30 21:19:002013-09-30 20:05:002013-09-30 23:10:002013-09-30 22:12:00147
    JFKJAX -1-242013-09-30 21:19:002013-09-30 21:20:002013-09-30 23:30:002013-09-30 23:54:00113
    JFKLAX 21-252013-09-30 21:21:002013-09-30 21:00:002013-09-30 23:49:002013-09-30 00:14:00296
    EWRDCA -5-112013-09-30 21:22:002013-09-30 21:27:002013-09-30 22:26:002013-09-30 22:37:00 35
    LGACHO -2-242013-09-30 21:23:002013-09-30 21:25:002013-09-30 22:23:002013-09-30 22:47:00 45
    EWRCLT -2 -92013-09-30 21:27:002013-09-30 21:29:002013-09-30 23:14:002013-09-30 23:23:00 72
    JFKDEN -2-312013-09-30 21:28:002013-09-30 21:30:002013-09-30 23:28:002013-09-30 23:59:00213
    LGARIC 30 -22013-09-30 21:29:002013-09-30 20:59:002013-09-30 22:30:002013-09-30 22:32:00 45
    JFKDCA -9-302013-09-30 21:31:002013-09-30 21:40:002013-09-30 22:25:002013-09-30 22:55:00 36
    JFKLAX 0-302013-09-30 21:40:002013-09-30 21:40:002013-09-30 00:10:002013-09-30 00:40:00298
    EWRPWM 13 112013-09-30 21:42:002013-09-30 21:29:002013-09-30 22:50:002013-09-30 22:39:00 47
    JFKSJU 0-252013-09-30 21:45:002013-09-30 21:45:002013-09-30 01:15:002013-09-30 01:40:00192
    LGAFLL 10 32013-09-30 21:47:002013-09-30 21:37:002013-09-30 00:30:002013-09-30 00:27:00139
    EWRBOS -7-232013-09-30 21:49:002013-09-30 21:56:002013-09-30 22:45:002013-09-30 23:08:00 37
    EWRMHT -9-162013-09-30 21:50:002013-09-30 21:59:002013-09-30 22:50:002013-09-30 23:06:00 39
    JFKBUF1941942013-09-30 21:59:002013-09-30 18:45:002013-09-30 23:44:002013-09-30 20:30:00 50
    LGABGR -2 82013-09-30 22:03:002013-09-30 22:05:002013-09-30 23:39:002013-09-30 23:31:00 61
    LGABNA 27 72013-09-30 22:07:002013-09-30 21:40:002013-09-30 22:57:002013-09-30 22:50:00 97
    EWRSTL 72 572013-09-30 22:11:002013-09-30 20:59:002013-09-30 23:39:002013-09-30 22:42:00120
    JFKPWM-14-212013-09-30 22:31:002013-09-30 22:45:002013-09-30 23:35:002013-09-30 23:56:00 48
    EWRSFO 80 422013-09-30 22:33:002013-09-30 21:13:002013-09-30 01:12:002013-09-30 00:30:00318
    JFKMCO1541302013-09-30 22:35:002013-09-30 20:01:002013-09-30 00:59:002013-09-30 22:49:00123
    JFKBTV -8 -82013-09-30 22:37:002013-09-30 22:45:002013-09-30 23:45:002013-09-30 23:53:00 43
    JFKSYR -5-172013-09-30 22:40:002013-09-30 22:45:002013-09-30 23:34:002013-09-30 23:51:00 41
    JFKBUF-10-202013-09-30 22:40:002013-09-30 22:50:002013-09-30 23:47:002013-09-30 00:07:00 52
    JFKROC -5-162013-09-30 22:41:002013-09-30 22:46:002013-09-30 23:45:002013-09-30 00:01:00 47
    JFKBOS 12 12013-09-30 23:07:002013-09-30 22:55:002013-09-30 23:59:002013-09-30 23:58:00 33
    JFKPSE-10-252013-09-30 23:49:002013-09-30 23:59:002013-09-30 03:25:002013-09-30 03:50:00196
    flights_dt |> 
      ggplot(aes(x = dep_time)) + 
      geom_freqpoly(binwidth = 86400) # 86400 seconds = 1 day
    

    png

    # or within a single day
    flights_dt |> 
      filter(dep_time < ymd(20130102)) |> 
      ggplot(aes(x = dep_time)) + 
      geom_freqpoly(binwidth = 600) # 600 s = 10 minutes
    

    png

    From other types

    • as_date()
    • as_datetime()
    as_datetime(today())
    
    as_date(now())
    
    [1] "2024-03-12 UTC"
    
    as_datetime(86300)
    
    as_date(365 * 10 + 2)
    
    [1] "1970-01-01 23:58:20 UTC"
    

    Exercises

    today(tzone = "UTC")
    
    1. For each of the following date-times, show how you’d parse it using a readr column specification and a lubridate function.
    d1 <- "January 1, 2010"
    d2 <- "2015-Mar-07"
    d3 <- "06-Jun-2017"
    d4 <- c("August 19 (2015)", "July 1 (2015)")
    d5 <- "12/30/14" # Dec 30, 2014
    t1 <- "1705"
    t2 <- "11:15:10.12 PM"
    
    
    mdy(d1)
    as_date(d2)
    dmy(d3)
    mdy(d4)
    mdy(d5)
    
    as_datetime(as.numeric(t1))
    hms(t2)
    
    
    

    <ol class=list-inline><li></li><li></li></ol>

    [1] "1970-01-01 00:28:25 UTC"
    

    11H 15M 10.12S

    17.3 Date-time components

    datetime <- ymd_hms(now())      # "2026-07-08 12:34:56"
    
    year(datetime)
    #> 2026
    month(datetime)
    #>  7
    mday(datetime)
    #>  8
    yday(datetime)
    #> 189
    wday(datetime)
    #> 4
    

    2024

    3

    12

    72

    3

    flights_dt |> 
      mutate(wday = wday(dep_time, label = TRUE)) |> 
      ggplot(aes(x = wday)) +
      geom_bar()
    

    png

    flights_dt |> 
      mutate(minute = minute(dep_time)) |> 
      group_by(minute) |> 
      summarize(
        avg_delay = mean(dep_delay, na.rm = TRUE),
        n = n()
      ) |> 
      ggplot(aes(x = minute, y = avg_delay)) +
      geom_line()
    

    png

    sched_dep <- flights_dt |> 
      mutate(minute = minute(sched_dep_time)) |> 
      group_by(minute) |> 
      summarize(
        avg_delay = mean(arr_delay, na.rm = TRUE),
        n = n()
      )
    
    ggplot(sched_dep, aes(x = minute, y = avg_delay)) +
      geom_line()
    

    png

    flights_dt |> 
      mutate(dep_hour = hms::as_hms(dep_time - floor_date(dep_time, "day"))) |> 
      ggplot(aes(x = dep_hour)) +
      geom_freqpoly(binwidth = 60 * 30)
    

    png

    Modify components

    datetime <- ymd_hms("2026-07-08 12:34:56")
    datetime
    year(datetime) <- 2030
    datetime
    month(datetime) <- 01
    datetime
    hour(datetime) <- hour(datetime) + 1
    datetime
    
    update(datetime, year = 2030, month = 2, mday = 2, hour = 2)
    datetime
    
    [1] "2026-07-08 12:34:56 UTC"
    
    
    
    [1] "2030-07-08 12:34:56 UTC"
    
    
    
    [1] "2030-01-08 12:34:56 UTC"
    
    
    
    [1] "2030-01-08 13:34:56 UTC"
    
    
    
    [1] "2030-02-02 02:34:56 UTC"
    
    
    
    [1] "2030-01-08 13:34:56 UTC"
    

    Exercises

    1. How does the distribution of flight times within a day change over the course of the year?
    2. Compare dep_time, sched_dep_time and dep_delay. Are they consistent? Explain your findings.
    3. Compare air_time with the duration between the departure and arrival. Explain your findings. (Hint: consider the location of the airport.)
    4. How does the average delay time change over the course of a day? Should you use dep_time or sched_dep_time? Why?
    5. On what day of the week should you leave if you want to minimise the chance of a delay?
    6. What makes the distribution of diamonds$carat and flights$sched_dep_time similar?
    7. Confirm our hypothesis that the early departures of flights in minutes 20-30 and 50-60 are caused by scheduled flights that leave early. Hint: create a binary variable that tells you whether or not a flight was delayed.

    17.4 Time spans

    • Durations, which represent an exact number of seconds.
    • Periods, which represent human units like weeks and months.
    • Intervals, which represent a starting and ending point.

    Duration

    # How old is Hadley?
    h_age <- today() - ymd("1979-05-16")
    h_age
    
    Time difference of 16372 days
    
    as.duration(h_age)
    

    1414540800s (~44.82 years)

    dseconds(15)
    dminutes(10)
    dhours(c(12, 24))
    ddays(0:5)
    dweeks(3)
    dyears(1)
    

    15s

    600s (~10 minutes)

    <ol class=list-inline><li>43200s (~12 hours)</li><li>86400s (~1 days)</li></ol>

    <ol class=list-inline><li>0s</li><li>86400s (~1 days)</li><li>172800s (~2 days)</li><li>259200s (~3 days)</li><li>345600s (~4 days)</li><li>432000s (~5 days)</li></ol>

    1814400s (~3 weeks)

    31557600s (~1 years)

    # you can add and multiply durations:
    2 * dyears(1)
    dyears(1) + dweeks(12) + dhours(15)
    ## add and subtract durations to and from days
    
    tomorrow <- today() + ddays(1)
    last_year <- today() - dyears(1)
    

    63115200s (~2 years)

    38869200s (~1.23 years)

    one_am <- ymd_hms("2026-03-08 01:00:00", tz = "America/New_York")
    
    one_am
    # be careful with time zone (on that day timezone has changed)
    one_am + ddays(1)
    
    [1] "2026-03-08 01:00:00 EST"
    
    
    
    [1] "2026-03-09 02:00:00 EDT"
    

    Periods

    one_am <- ymd_hms("2026-03-08 01:00:00", tz = "America/New_York")
    hours(c(12, 24))
    
    days(7)
    months(1:6)
    

    <ol class=list-inline><li>12H 0M 0S</li><li>24H 0M 0S</li></ol>

    7d 0H 0M 0S

    <ol class=list-inline><li>1m 0d 0H 0M 0S</li><li>2m 0d 0H 0M 0S</li><li>3m 0d 0H 0M 0S</li><li>4m 0d 0H 0M 0S</li><li>5m 0d 0H 0M 0S</li><li>6m 0d 0H 0M 0S</li></ol>

    10 * (months(6) + days(1))
    days(50) + hours(25) + minutes(2)
    

    60m 10d 0H 0M 0S

    50d 25H 2M 0S

    # A leap year
    ymd("2024-01-01") + dyears(1)
    ymd("2024-01-01") + years(1)
    
    
    # Daylight saving time
    one_am + ddays(1)
    one_am + days(1)
    
    
    [1] "2024-12-31 06:00:00 UTC"
    
    [1] "2026-03-09 02:00:00 EDT"
    
    
    
    [1] "2026-03-09 01:00:00 EDT"
    
    flights_dt |> 
      filter(arr_time < dep_time) 
    
    A tibble: 10633 x 9
    origindestdep_delayarr_delaydep_timesched_dep_timearr_timesched_arr_timeair_time
    <chr><chr><dbl><dbl><dttm><dttm><dttm><dttm><dbl>
    EWRBQN 9 -42013-01-01 19:29:002013-01-01 19:20:002013-01-01 00:03:002013-01-01 00:07:00192
    JFKDFW 59 NA2013-01-01 19:39:002013-01-01 18:40:002013-01-01 00:29:002013-01-01 21:51:00 NA
    EWRTPA -2 92013-01-01 20:58:002013-01-01 21:00:002013-01-01 00:08:002013-01-01 23:59:00159
    EWRSJU -6-122013-01-01 21:02:002013-01-01 21:08:002013-01-01 01:46:002013-01-01 01:58:00199
    EWRSFO 11-142013-01-01 21:08:002013-01-01 20:57:002013-01-01 00:25:002013-01-01 00:39:00354
    LGAFLL-10 -22013-01-01 21:20:002013-01-01 21:30:002013-01-01 00:16:002013-01-01 00:18:00160
    EWRMCO 41 432013-01-01 21:21:002013-01-01 20:40:002013-01-01 00:06:002013-01-01 23:23:00143
    JFKLAX -7-242013-01-01 21:28:002013-01-01 21:35:002013-01-01 00:26:002013-01-01 00:50:00338
    EWRFLL 49 282013-01-01 21:34:002013-01-01 20:45:002013-01-01 00:20:002013-01-01 23:52:00152
    EWRFLL -9-142013-01-01 21:36:002013-01-01 21:45:002013-01-01 00:25:002013-01-01 00:39:00154
    JFKSJU 5-142013-01-01 21:40:002013-01-01 21:35:002013-01-01 02:10:002013-01-01 02:24:00189
    JFKMCO 2 22013-01-01 21:57:002013-01-01 21:55:002013-01-01 00:43:002013-01-01 00:41:00140
    EWRMIA2852462013-01-01 22:05:002013-01-01 17:20:002013-01-01 00:46:002013-01-01 20:40:00146
    JFKPBI 24 212013-01-01 22:09:002013-01-01 21:45:002013-01-01 00:58:002013-01-01 00:37:00143
    JFKSJU-12-262013-01-01 22:17:002013-01-01 22:29:002013-01-01 02:49:002013-01-01 03:15:00191
    JFKTPA 47 732013-01-01 22:17:002013-01-01 21:30:002013-01-01 01:40:002013-01-01 00:27:00163
    JFKFLL 30 492013-01-01 22:29:002013-01-01 21:59:002013-01-01 01:49:002013-01-01 01:00:00153
    JFKROC 21 232013-01-01 23:06:002013-01-01 22:45:002013-01-01 00:28:002013-01-01 00:05:00 59
    JFKBTV 22 352013-01-01 23:07:002013-01-01 22:45:002013-01-01 00:32:002013-01-01 23:57:00 59
    JFKBUF 15 92013-01-01 23:10:002013-01-01 22:55:002013-01-01 00:24:002013-01-01 00:15:00 57
    EWRDCA1921912013-01-01 23:12:002013-01-01 20:00:002013-01-01 00:21:002013-01-01 21:10:00 44
    EWRBTV 83 692013-01-01 23:23:002013-01-01 22:00:002013-01-01 00:22:002013-01-01 23:13:00 44
    JFKLAS116 732013-01-01 23:26:002013-01-01 21:30:002013-01-01 01:31:002013-01-01 00:18:00290
    JFKSYR 37 332013-01-01 23:27:002013-01-01 22:50:002013-01-01 00:32:002013-01-01 23:59:00 45
    EWRMCI3794562013-01-01 23:43:002013-01-01 17:24:002013-01-01 03:14:002013-01-01 19:38:00222
    JFKPSE -6-202013-01-01 23:53:002013-01-01 23:59:002013-01-01 04:25:002013-01-01 04:45:00195
    JFKSJU -6-242013-01-01 23:53:002013-01-01 23:59:002013-01-01 04:18:002013-01-01 04:42:00185
    JFKBQN -3-122013-01-01 23:56:002013-01-01 23:59:002013-01-01 04:25:002013-01-01 04:37:00186
    JFKAUS 5 322013-01-02 20:30:002013-01-02 20:25:002013-01-02 00:13:002013-01-02 23:41:00259
    EWRAUS 2 172013-01-02 20:30:002013-01-02 20:28:002013-01-02 00:08:002013-01-02 23:51:00252
    ...........................
    JFKSJU 32 252013-09-28 22:17:002013-09-28 21:45:002013-09-28 02:05:002013-09-28 01:40:00213
    JFKBTV 10 152013-09-28 22:55:002013-09-28 22:45:002013-09-28 00:08:002013-09-28 23:53:00 48
    JFKPSE -8 172013-09-28 23:51:002013-09-28 23:59:002013-09-28 04:07:002013-09-28 03:50:00218
    JFKSJU -9-212013-09-29 20:36:002013-09-29 20:45:002013-09-29 00:32:002013-09-29 00:53:00212
    JFKPDX -1 282013-09-29 20:54:002013-09-29 20:55:002013-09-29 00:27:002013-09-29 23:59:00346
    JFKLAX -2 32013-09-29 20:58:002013-09-29 21:00:002013-09-29 00:17:002013-09-29 00:14:00326
    JFKAUS 5 32013-09-29 20:59:002013-09-29 20:54:002013-09-29 00:01:002013-09-29 23:58:00202
    JFKSEA -3-112013-09-29 21:02:002013-09-29 21:05:002013-09-29 00:06:002013-09-29 00:17:00336
    EWRBQN 37 212013-09-29 21:32:002013-09-29 20:55:002013-09-29 01:17:002013-09-29 00:56:00206
    JFKLAX -5-342013-09-29 21:35:002013-09-29 21:40:002013-09-29 00:06:002013-09-29 00:40:00295
    JFKSJU 9 112013-09-29 21:54:002013-09-29 21:45:002013-09-29 01:51:002013-09-29 01:40:00208
    LGAMCO 64 392013-09-29 22:04:002013-09-29 21:00:002013-09-29 00:23:002013-09-29 23:44:00117
    JFKDEN 39 32013-09-29 22:09:002013-09-29 21:30:002013-09-29 00:02:002013-09-29 23:59:00208
    LGACVG 73 542013-09-29 22:23:002013-09-29 21:10:002013-09-29 00:13:002013-09-29 23:19:00 87
    LGAFLL 47 202013-09-29 22:24:002013-09-29 21:37:002013-09-29 00:47:002013-09-29 00:27:00128
    EWRMSP1311102013-09-29 22:45:002013-09-29 20:34:002013-09-29 00:27:002013-09-29 22:37:00141
    JFKLAX1731412013-09-29 22:53:002013-09-29 20:00:002013-09-29 01:21:002013-09-29 23:00:00303
    JFKDCA 94 762013-09-29 23:14:002013-09-29 21:40:002013-09-29 00:11:002013-09-29 22:55:00 39
    JFKPWM 33 192013-09-29 23:18:002013-09-29 22:45:002013-09-29 00:15:002013-09-29 23:56:00 44
    LGARDU1441242013-09-29 23:24:002013-09-29 21:00:002013-09-29 00:39:002013-09-29 22:35:00 61
    LGAFLL2251832013-09-29 23:27:002013-09-29 19:42:002013-09-29 01:53:002013-09-29 22:50:00129
    JFKPSE -3 -92013-09-29 23:56:002013-09-29 23:59:002013-09-29 03:41:002013-09-29 03:50:00204
    JFKSJU 5-332013-09-30 20:50:002013-09-30 20:45:002013-09-30 00:20:002013-09-30 00:53:00188
    JFKPDX 15 522013-09-30 21:10:002013-09-30 20:55:002013-09-30 00:51:002013-09-30 23:59:00361
    JFKLAX 0-302013-09-30 21:40:002013-09-30 21:40:002013-09-30 00:10:002013-09-30 00:40:00298
    JFKSJU 0-252013-09-30 21:45:002013-09-30 21:45:002013-09-30 01:15:002013-09-30 01:40:00192
    LGAFLL 10 32013-09-30 21:47:002013-09-30 21:37:002013-09-30 00:30:002013-09-30 00:27:00139
    EWRSFO 80 422013-09-30 22:33:002013-09-30 21:13:002013-09-30 01:12:002013-09-30 00:30:00318
    JFKMCO1541302013-09-30 22:35:002013-09-30 20:01:002013-09-30 00:59:002013-09-30 22:49:00123
    JFKPSE-10-252013-09-30 23:49:002013-09-30 23:59:002013-09-30 03:25:002013-09-30 03:50:00196
    flights_dt <- flights_dt |> 
      mutate(
        overnight = arr_time < dep_time,
        arr_time = arr_time + days(overnight),
        sched_arr_time = sched_arr_time + days(overnight)
      )
    
    flights_dt |> 
      filter(arr_time < dep_time) 
    
    A tibble: 0 x 10
    origindestdep_delayarr_delaydep_timesched_dep_timearr_timesched_arr_timeair_timeovernight
    <chr><chr><dbl><dbl><dttm><dttm><dttm><dttm><dbl><lgl>

    Intervals

    dyears(1) / ddays(365) # 1 dyear is 365.25days
    years(1) / days(1)
    

    1.00068493150685

    365.25

    y2023 <- ymd("2023-01-01") %--% ymd("2024-01-01")
    y2024 <- ymd("2024-01-01") %--% ymd("2025-01-01")
    
    y2023 / days(1)
    y2024 / days(1)
    

    365

    366

    Exercises

    1. Explain days(!overnight) and days(overnight) to someone who has just started learning R. What is the key fact you need to know?

    2. Create a vector of dates giving the first day of every month in 2015. Create a vector of dates giving the first day of every month in the current year.

    3. Write a function that given your birthday (as a date), returns how old you are in years.

    4. Why can’t (today() %--% (today() + years(1))) / months(1) work?

    (today() %--% (today() + years(1))) / months(1)
    

    12

    17.5 Time zones

    Sys.timezone()      # {area}/{location}: {continent}/{city} or {ocian}/{city}
    

    ‘Australia/Brisbane’

    length(OlsonNames())
    head(OlsonNames())
    

    597

    <ol class=list-inline><li>‘Africa/Abidjan’</li><li>‘Africa/Accra’</li><li>‘Africa/Addis_Ababa’</li><li>‘Africa/Algiers’</li><li>‘Africa/Asmara’</li><li>‘Africa/Asmera’</li></ol>

    x1 <- ymd_hms("2024-06-01 12:00:00", tz = "America/New_York")
    x1
    
    x2 <- ymd_hms("2024-06-01 18:00:00", tz = "Europe/Copenhagen")
    x2
    
    x3 <- ymd_hms("2024-06-02 04:00:00", tz = "Pacific/Auckland")
    x3
    
    x1 - x2
    #> Time difference of 0 secs
    x1 - x3
    #> Time difference of 0 secs
    
    [1] "2024-06-01 12:00:00 EDT"
    
    
    
    [1] "2024-06-01 18:00:00 CEST"
    
    
    
    [1] "2024-06-02 04:00:00 NZST"
    
    
    
    Time difference of 0 secs
    
    
    
    Time difference of 0 secs
    

    You can change the time zone in two ways:

    • Keep the instant in time the same, and change how it’s displayed. Use this when the instant is correct, but you want a more natural display.
    • Change the underlying instant in time. Use this when you have an instant that has been labelled with the incorrect time zone, and you need to fix it.
    x4 <- c(x1, x2, x3)
    x4
    x4a <- with_tz(x4, tzone = "Australia/Lord_Howe")
    x4a
    
    x4a - x4
    
    [1] "2024-06-01 12:00:00 EDT" "2024-06-01 12:00:00 EDT"
    [3] "2024-06-01 12:00:00 EDT"
    
    
    
    [1] "2024-06-02 02:30:00 +1030" "2024-06-02 02:30:00 +1030"
    [3] "2024-06-02 02:30:00 +1030"
    
    
    
    Time differences in secs
    [1] 0 0 0
    
    x4b <- force_tz(x4, tzone = "Australia/Lord_Howe")
    x4b
    
    x4b - x4
    
    [1] "2024-06-01 12:00:00 +1030" "2024-06-01 12:00:00 +1030"
    [3] "2024-06-01 12:00:00 +1030"
    
    
    
    Time differences in hours
    [1] -14.5 -14.5 -14.5
    

    C18. Missing Values

    library(tidyverse)
    

    18.2 Explicit missing values

    • Last observation carried forward (repeated)
    • Fixed values
    treatment <- tribble(
      ~person,           ~treatment, ~response,
      "Derrick Whitmore", 1,         7,
      NA,                 2,         10,
      NA,                 3,         NA,
      "Katherine Burke",  1,         4
    )
    
    treatment |>
      fill(everything())        # method: last observation carried forward!
    
    
    A tibble: 4 x 3
    persontreatmentresponse
    <chr><dbl><dbl>
    Derrick Whitmore1 7
    Derrick Whitmore210
    Derrick Whitmore310
    Katherine Burke 1 4
    x <- c(1, 4, 5, 7, NA)
    coalesce(x, 0)
    

    <ol class=list-inline><li>1</li><li>4</li><li>5</li><li>7</li><li>0</li></ol>

    18.3 Implicit missing values

    An explicit missing value is the presence of an absence.

    An implicit missing value is the absence of a presence.

    stocks <- tibble(
      year  = c(2020, 2020, 2020, 2020, 2021, 2021, 2021),
      qtr   = c(   1,    2,    3,    4,    2,    3,    4),
      price = c(1.88, 0.59, 0.35,   NA, 0.92, 0.17, 2.66)
    )
    

    Pivoting

    Making data wider can make implicit missing values explicit because every combination of the rows and new columns must have some value.

    stocks |>
      pivot_wider(
        names_from = qtr, 
        values_from = price
      )
    
    A tibble: 2 x 5
    year1234
    <dbl><dbl><dbl><dbl><dbl>
    20201.880.590.35 NA
    2021 NA0.920.172.66

    Complete

    tidyr::complete() allows you to generate explicit missing values by providing a set of variables that define the combination of rows that should exist.

    stocks |>
      complete(year, qtr)
    
    A tibble: 8 x 3
    yearqtrprice
    <dbl><dbl><dbl>
    202011.88
    202020.59
    202030.35
    20204 NA
    20211 NA
    202120.92
    202130.17
    202142.66
    stocks |>
      complete(year = 2019:2021, qtr)
    
    A tibble: 12 x 3
    yearqtrprice
    <dbl><dbl><dbl>
    20191 NA
    20192 NA
    20193 NA
    20194 NA
    202011.88
    202020.59
    202030.35
    20204 NA
    20211 NA
    202120.92
    202130.17
    202142.66

    Joins

    In some cases, the complete set of observations can’t be generated by a simple combination of variables. In that case, you can do manually what complete() does for you: create a data frame that contains all the rows that should exist (using whatever combination of techniques you need), then combine it with your original dataset with dplyr::full_join().

    dplyr::anti_join(x, y) is a particularly useful tool , it selects only the rows in x that don’t have a match in y

    flights |> 
      distinct(faa = dest) |> 
      anti_join(airports)
    
    Joining with `by = join_by(faa)`
    
    A tibble: 4 x 1
    faa
    <chr>
    BQN
    SJU
    STT
    PSE
    flights |> 
      distinct(tailnum) |> 
      anti_join(planes)
    
    Joining with `by = join_by(tailnum)`
    
    A tibble: 722 x 1
    tailnum
    <chr>
    N3ALAA
    N3DUAA
    N542MQ
    N730MQ
    N9EAMQ
    N532UA
    N3EMAA
    N518MQ
    N3BAAA
    N3CYAA
    N426US
    N3GKAA
    N4WNAA
    N5FMAA
    N722MQ
    N3EKAA
    N3ETAA
    N541AA
    N4WRAA
    N4WPAA
    N508MQ
    N3HMAA
    N828MQ
    N3GEAA
    N739MQ
    N531MQ
    N527JB
    N846MQ
    N3GVAA
    N4YCAA
    ...
    N7BAAA
    N7BVAA
    N626MQ
    N675MQ
    N580AA
    N717MQ
    N738MQ
    N720MQ
    N7ASAA
    N328AT
    N735MQ
    N5EDAA
    N5DJAA
    N7ALAA
    N721MQ
    N7BGAA
    N5ESAA
    N456UW
    N838MQ
    N442US
    N502SW
    N451UW
    N7BKAA
    N800MQ
    N7CAAA
    N823MQ
    N5FCAA
    N5ERAA
    N654MQ
    N647MQ

    Exercises

    1. Can you find any relationship between the carrier and the rows that appear to be missing from planes?

    18.4 Factors and empty groups

    health <- tibble(
      name   = c("Ikaia", "Oletta", "Leriah", "Dashay", "Tresaun"),
      smoker = factor(c("no", "no", "no", "no", "no"), levels = c("yes", "no")),
      age    = c(34, 88, 75, 47, 56),
    )
    
    health |> count(smoker)
    
    A tibble: 1 x 2
    smokern
    <fct><int>
    no5
    # We can request count() to keep all the groups, even those not seen in the data by using .drop = FALSE:
    health |> count(smoker, .drop = FALSE)
    
    A tibble: 2 x 2
    smokern
    <fct><int>
    yes0
    no 5
    p1 <- ggplot(health, aes(x = smoker)) +
      geom_bar() +
      scale_x_discrete()
    
    p2 <- ggplot(health, aes(x = smoker)) +
      geom_bar() +
      scale_x_discrete(drop = FALSE)
    
    options(repr.plot.width = 10, repr.plot.height = 5)
    grid.arrange(p1, p2, ncol=2)
    
    

    png

    health |> 
      group_by(smoker, .drop = FALSE) |> 
      summarize(
        n = n(),
        mean_age = mean(age),
        min_age = min(age),
        max_age = max(age),
        sd_age = sd(age)
      )
    
    Warning message:
    "There were 2 warnings in `summarize()`.
    The first warning was:
    i In argument: `min_age = min(age)`.
    i In group 1: `smoker = yes`.
    Caused by warning in `min()`:
    ! no non-missing arguments to min; returning Inf
    i Run `dplyr::last_dplyr_warnings()` to see the 1 remaining warning."
    
    A tibble: 2 x 6
    smokernmean_agemin_agemax_agesd_age
    <fct><int><dbl><dbl><dbl><dbl>
    yes0NaNInf-Inf NA
    no 5 60 34 8821.62175
    health |> 
      group_by(smoker) |> 
      summarize(
        n = n(),
        mean_age = mean(age),
        min_age = min(age),
        max_age = max(age),
        sd_age = sd(age)
      ) |> 
      complete(smoker)
    
    A tibble: 2 x 6
    smokernmean_agemin_agemax_agesd_age
    <fct><int><dbl><dbl><dbl><dbl>
    yesNANANANA NA
    no 560348821.62175

    C19. Joins

    Typically you have many data frames, and you must join them together to answer the questions that you’re interested in. This chapter will introduce you to two important types of joins:

    1. Mutating joins, which add new variables to one data frame from matching observations in another.
    2. Filtering joins, which filter observations from one data frame based on whether or not they match an observation in another.
    library(tidyverse)
    library(nycflights13)
    

    19.2 Keys

    Primary and foreign keys

    • A primary key is a variable or set of variables that uniquely identifies each observation. When more than one variable is needed, the key is called a compound key.
    • A foreign key is a variable (or set of variables) that corresponds to a primary key in another table.

    For examples:

    • carrier and faa, tailnum are primary key of corresponding tables airlines, airports, planes
    • origin and time_hour are the compound primary key of the table weather
    • flights$tailnum is a foreign key that corresponds to the primary key planes$tailnum.
    • flights$carrier is a foreign key that corresponds to the primary key airlines$carrier.
    • flights$origin is a foreign key that corresponds to the primary key airports$faa.
    • flights$dest is a foreign key that corresponds to the primary key airports$faa.
    • flights$origin-flights$time_hour is a compound foreign key that corresponds to the compound primary key weather$origin-weather$time_hour.
    airlines
    
    A tibble: 16 x 2
    carriername
    <chr><chr>
    9EEndeavor Air Inc.
    AAAmerican Airlines Inc.
    ASAlaska Airlines Inc.
    B6JetBlue Airways
    DLDelta Air Lines Inc.
    EVExpressJet Airlines Inc.
    F9Frontier Airlines Inc.
    FLAirTran Airways Corporation
    HAHawaiian Airlines Inc.
    MQEnvoy Air
    OOSkyWest Airlines Inc.
    UAUnited Air Lines Inc.
    USUS Airways Inc.
    VXVirgin America
    WNSouthwest Airlines Co.
    YVMesa Airlines Inc.
    airports
    
    A tibble: 1458 x 8
    faanamelatlonalttzdsttzone
    <chr><chr><dbl><dbl><dbl><dbl><chr><chr>
    04GLansdowne Airport 41.13047 -80.619581044-5AAmerica/New_York
    06AMoton Field Municipal Airport 32.46057 -85.68003 264-6AAmerica/Chicago
    06CSchaumburg Regional 41.98934 -88.10124 801-6AAmerica/Chicago
    06NRandall Airport 41.43191 -74.39156 523-5AAmerica/New_York
    09JJekyll Island Airport 31.07447 -81.42778 11-5AAmerica/New_York
    0A9Elizabethton Municipal Airport 36.37122 -82.173421593-5AAmerica/New_York
    0G6Williams County Airport 41.46731 -84.50678 730-5AAmerica/New_York
    0G7Finger Lakes Regional Airport 42.88356 -76.78123 492-5AAmerica/New_York
    0P2Shoestring Aviation Airfield 39.79482 -76.647191000-5UAmerica/New_York
    0S9Jefferson County Intl 48.05381-122.81064 108-8AAmerica/Los_Angeles
    0W3Harford County Airport 39.56684 -76.20240 409-5AAmerica/New_York
    10CGalt Field Airport 42.40289 -88.37511 875-6UAmerica/Chicago
    17GPort Bucyrus-Crawford County Airport40.78156 -82.974811003-5AAmerica/New_York
    19AJackson County Airport 34.17586 -83.56160 951-5UAmerica/New_York
    1A3Martin Campbell Field Airport 35.01581 -84.346831789-5AAmerica/New_York
    1B9Mansfield Municipal 42.00013 -71.19677 122-5AAmerica/New_York
    1C9Frazier Lake Airpark 54.01333-124.76833 152-8AAmerica/Vancouver
    1CSClow International Airport 41.69597 -88.12923 670-6UAmerica/Chicago
    1G3Kent State Airport 41.15139 -81.415111134-5AAmerica/New_York
    1G4Grand Canyon West Airport 35.89990-113.815674813-7AAmerica/Phoenix
    1H2Effingham Memorial Airport 39.07000 -88.53400 585-6AAmerica/Chicago
    1OHFortman Airport 40.55533 -84.38662 885-5UAmerica/New_York
    1RLPoint Roberts Airpark 48.97972-123.07889 10-8AAmerica/Los_Angeles
    23MClarke CO 32.05170 -88.44340 320-6AAmerica/Chicago
    24CLowell City Airport 42.95392 -85.34391 681-5AAmerica/New_York
    24JSuwannee County Airport 30.30013 -83.02469 104-5AAmerica/New_York
    25DForest Lake Airport 45.24775 -92.99439 925-6AAmerica/Chicago
    29DGrove City Airport 41.14603 -80.167751371-5AAmerica/New_York
    2A0Mark Anton Airport 35.48625 -84.93108 718-5AAmerica/New_York
    2B2Plum Island Airport 42.79536 -70.83944 11-5AAmerica/New_York
    ........................
    X59Valkaria Municipal 27.96086 -80.55833 26-5AAmerica/New_York
    XFLFlagler County Airport 29.28210 -81.12120 33-5AAmerica/New_York
    XNANW Arkansas Regional 36.28187 -94.306811287-6AAmerica/Chicago
    XZKAmherst Amtrak Station AMM 42.37500 -72.51139 258-5AAmerica/New_York
    Y51Municipal Airport 43.57936 -90.896471292-6AAmerica/Chicago
    Y72Bloyer Field 43.97622 -90.48061 966-6AAmerica/Chicago
    YAKYakutat 59.30120-139.39370 33-9ANA
    YIPWillow Run 42.23793 -83.53041 716-5AAmerica/New_York
    YKMYakima Air Terminal McAllister Field46.56820-120.544001095-8AAmerica/Los_Angeles
    YKNChan Gurney 42.87110 -97.396901200-6AAmerica/Chicago
    YNGYoungstown Warren Rgnl 41.26074 -80.679101196-5AAmerica/New_York
    YUMYuma Mcas Yuma Intl 32.65658-114.60598 216-7NAmerica/Phoenix
    Z84Clear 64.30120-149.12014 552-9AAmerica/Anchorage
    ZBPPenn Station 39.30722 -76.61556 66-5AAmerica/New_York
    ZFVPhiladelphia 30th St Station 39.95570 -75.18200 0-5AAmerica/New_York
    ZPHMunicipal Airport 28.22806 -82.15583 90-5AAmerica/New_York
    ZRAAtlantic City Rail Terminal 39.36650 -74.44200 8-5AAmerica/New_York
    ZRDTrain Station 37.53430 -77.42945 26-5AAmerica/New_York
    ZRPNewark Penn Station 40.73472 -74.16417 0-5AAmerica/New_York
    ZRTHartford Union Station 41.76888 -72.68150 0-5AAmerica/New_York
    ZRZNew Carrollton Rail Station 38.94800 -76.87190 39-5AAmerica/New_York
    ZSFSpringfield Amtrak Station 42.10600 -72.59305 65-5AAmerica/New_York
    ZSYScottsdale Airport 33.62289-111.910531519-7AAmerica/Phoenix
    ZTFStamford Amtrak Station 41.04694 -73.54149 0-5AAmerica/New_York
    ZTYBoston Back Bay Station 42.34780 -71.07500 20-5AAmerica/New_York
    ZUNBlack Rock 35.08323-108.791786454-7AAmerica/Denver
    ZVENew Haven Rail Station 41.29867 -72.92599 7-5AAmerica/New_York
    ZWIWilmington Amtrak Station 39.73667 -75.55167 0-5AAmerica/New_York
    ZWUWashington Union Station 38.89746 -77.00643 76-5AAmerica/New_York
    ZYPPenn Station 40.75050 -73.99350 35-5AAmerica/New_York
    planes
    
    A tibble: 3322 x 9
    tailnumyeartypemanufacturermodelenginesseatsspeedengine
    <chr><int><chr><chr><chr><int><int><int><chr>
    N101562004Fixed wing multi engineEMBRAER EMB-145XR2 55NATurbo-fan
    N102UW1998Fixed wing multi engineAIRBUS INDUSTRIEA320-214 2182NATurbo-fan
    N103US1999Fixed wing multi engineAIRBUS INDUSTRIEA320-214 2182NATurbo-fan
    N104UW1999Fixed wing multi engineAIRBUS INDUSTRIEA320-214 2182NATurbo-fan
    N105752002Fixed wing multi engineEMBRAER EMB-145LR2 55NATurbo-fan
    N105UW1999Fixed wing multi engineAIRBUS INDUSTRIEA320-214 2182NATurbo-fan
    N107US1999Fixed wing multi engineAIRBUS INDUSTRIEA320-214 2182NATurbo-fan
    N108UW1999Fixed wing multi engineAIRBUS INDUSTRIEA320-214 2182NATurbo-fan
    N109UW1999Fixed wing multi engineAIRBUS INDUSTRIEA320-214 2182NATurbo-fan
    N110UW1999Fixed wing multi engineAIRBUS INDUSTRIEA320-214 2182NATurbo-fan
    N111062002Fixed wing multi engineEMBRAER EMB-145XR2 55NATurbo-fan
    N111072002Fixed wing multi engineEMBRAER EMB-145XR2 55NATurbo-fan
    N111092002Fixed wing multi engineEMBRAER EMB-145XR2 55NATurbo-fan
    N111132002Fixed wing multi engineEMBRAER EMB-145XR2 55NATurbo-fan
    N111192002Fixed wing multi engineEMBRAER EMB-145XR2 55NATurbo-fan
    N111212003Fixed wing multi engineEMBRAER EMB-145XR2 55NATurbo-fan
    N111272003Fixed wing multi engineEMBRAER EMB-145XR2 55NATurbo-fan
    N111372003Fixed wing multi engineEMBRAER EMB-145XR2 55NATurbo-fan
    N111402003Fixed wing multi engineEMBRAER EMB-145XR2 55NATurbo-fan
    N111502003Fixed wing multi engineEMBRAER EMB-145XR2 55NATurbo-fan
    N111552004Fixed wing multi engineEMBRAER EMB-145XR2 55NATurbo-fan
    N111642004Fixed wing multi engineEMBRAER EMB-145XR2 55NATurbo-fan
    N111652004Fixed wing multi engineEMBRAER EMB-145XR2 55NATurbo-fan
    N111762004Fixed wing multi engineEMBRAER EMB-145XR2 55NATurbo-fan
    N111812005Fixed wing multi engineEMBRAER EMB-145XR2 55NATurbo-fan
    N111842005Fixed wing multi engineEMBRAER EMB-145XR2 55NATurbo-fan
    N111872005Fixed wing multi engineEMBRAER EMB-145XR2 55NATurbo-fan
    N111892005Fixed wing multi engineEMBRAER EMB-145XR2 55NATurbo-fan
    N111912005Fixed wing multi engineEMBRAER EMB-145XR2 55NATurbo-fan
    N111922005Fixed wing multi engineEMBRAER EMB-145XR2 55NATurbo-fan
    ...........................
    N984DL1991Fixed wing multi engineMCDONNELL DOUGLAS AIRCRAFT COMD-88 2142NATurbo-fan
    N985AT2001Fixed wing multi engineBOEING 717-2002100NATurbo-fan
    N985DL1991Fixed wing multi engineMCDONNELL DOUGLAS AIRCRAFT COMD-88 2142NATurbo-fan
    N986AT2001Fixed wing multi engineBOEING 717-2002100NATurbo-fan
    N986DL1991Fixed wing multi engineMCDONNELL DOUGLAS AIRCRAFT COMD-88 2142NATurbo-fan
    N987AT2001Fixed wing multi engineBOEING 717-2002100NATurbo-fan
    N987DL1991Fixed wing multi engineMCDONNELL DOUGLAS AIRCRAFT COMD-88 2142NATurbo-fan
    N988AT2001Fixed wing multi engineBOEING 717-2002100NATurbo-fan
    N988DL1991Fixed wing multi engineMCDONNELL DOUGLAS AIRCRAFT COMD-88 2142NATurbo-fan
    N989AT2001Fixed wing multi engineBOEING 717-2002100NATurbo-fan
    N989DL1991Fixed wing multi engineMCDONNELL DOUGLAS AIRCRAFT COMD-88 2142NATurbo-fan
    N990AT2001Fixed wing multi engineBOEING 717-2002100NATurbo-fan
    N990DL1991Fixed wing multi engineMCDONNELL DOUGLAS AIRCRAFT COMD-88 2142NATurbo-fan
    N991AT NAFixed wing multi engineBOEING 717-2002100NATurbo-fan
    N991DL1991Fixed wing multi engineMCDONNELL DOUGLAS AIRCRAFT COMD-88 2142NATurbo-fan
    N992AT2002Fixed wing multi engineBOEING 717-2002100NATurbo-fan
    N992DL1991Fixed wing multi engineMCDONNELL DOUGLAS AIRCRAFT COMD-88 2142NATurbo-fan
    N993AT2002Fixed wing multi engineBOEING 717-2002100NATurbo-fan
    N993DL1991Fixed wing multi engineMCDONNELL DOUGLAS AIRCRAFT COMD-88 2142NATurbo-fan
    N994AT2002Fixed wing multi engineBOEING 717-2002100NATurbo-fan
    N994DL1991Fixed wing multi engineMCDONNELL DOUGLAS CORPORATIONMD-88 2142NATurbo-jet
    N995AT2002Fixed wing multi engineBOEING 717-2002100NATurbo-fan
    N995DL1991Fixed wing multi engineMCDONNELL DOUGLAS AIRCRAFT COMD-88 2142NATurbo-fan
    N996AT2002Fixed wing multi engineBOEING 717-2002100NATurbo-fan
    N996DL1991Fixed wing multi engineMCDONNELL DOUGLAS AIRCRAFT COMD-88 2142NATurbo-fan
    N997AT2002Fixed wing multi engineBOEING 717-2002100NATurbo-fan
    N997DL1992Fixed wing multi engineMCDONNELL DOUGLAS AIRCRAFT COMD-88 2142NATurbo-fan
    N998AT2002Fixed wing multi engineBOEING 717-2002100NATurbo-fan
    N998DL1992Fixed wing multi engineMCDONNELL DOUGLAS CORPORATIONMD-88 2142NATurbo-jet
    N999DN1992Fixed wing multi engineMCDONNELL DOUGLAS CORPORATIONMD-88 2142NATurbo-jet

    The data relationships{width=75%}

    Checking primary keys

    planes |> 
      count(tailnum) |> 
      filter(n > 1)
    
    weather |> 
      count(time_hour, origin) |> 
      filter(n > 1)
    
    A tibble: 0 x 2
    tailnumn
    <chr><int>
    A tibble: 0 x 3
    time_houroriginn
    <dttm><chr><int>
    planes |> 
      filter(is.na(tailnum))
    
    weather |> 
      filter(is.na(time_hour) | is.na(origin))
    
    A tibble: 0 x 9
    tailnumyeartypemanufacturermodelenginesseatsspeedengine
    <chr><int><chr><chr><chr><int><int><int><chr>
    A tibble: 0 x 15
    originyearmonthdayhourtempdewphumidwind_dirwind_speedwind_gustprecippressurevisibtime_hour
    <chr><int><int><int><int><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dttm>

    Surrogate keys

    flights |> 
      count(time_hour, carrier, flight) |> 
      filter(n > 1)
    
    A tibble: 0 x 4
    time_hourcarrierflightn
    <dttm><chr><int><int>
    flights2 <- flights |> 
      mutate(id = row_number(), .before = 1)
    flights2
    
    A tibble: 336776 x 20
    idyearmonthdaydep_timesched_dep_timedep_delayarr_timesched_arr_timearr_delaycarrierflighttailnumorigindestair_timedistancehourminutetime_hour
    <int><int><int><int><int><int><dbl><int><int><dbl><chr><int><chr><chr><chr><dbl><dbl><dbl><dbl><dttm>
    1201311517515 2 830 819 11UA1545N14228EWRIAH22714005152013-01-01 05:00:00
    2201311533529 4 850 830 20UA1714N24211LGAIAH22714165292013-01-01 05:00:00
    3201311542540 2 923 850 33AA1141N619AAJFKMIA16010895402013-01-01 05:00:00
    4201311544545-110041022-18B6 725N804JBJFKBQN18315765452013-01-01 05:00:00
    5201311554600-6 812 837-25DL 461N668DNLGAATL116 7626 02013-01-01 06:00:00
    6201311554558-4 740 728 12UA1696N39463EWRORD150 7195582013-01-01 05:00:00
    7201311555600-5 913 854 19B6 507N516JBEWRFLL15810656 02013-01-01 06:00:00
    8201311557600-3 709 723-14EV5708N829ASLGAIAD 53 2296 02013-01-01 06:00:00
    9201311557600-3 838 846 -8B6 79N593JBJFKMCO140 9446 02013-01-01 06:00:00
    10201311558600-2 753 745 8AA 301N3ALAALGAORD138 7336 02013-01-01 06:00:00
    11201311558600-2 849 851 -2B6 49N793JBJFKPBI14910286 02013-01-01 06:00:00
    12201311558600-2 853 856 -3B6 71N657JBJFKTPA15810056 02013-01-01 06:00:00
    13201311558600-2 924 917 7UA 194N29129JFKLAX34524756 02013-01-01 06:00:00
    14201311558600-2 923 937-14UA1124N53441EWRSFO36125656 02013-01-01 06:00:00
    15201311559600-1 941 910 31AA 707N3DUAALGADFW25713896 02013-01-01 06:00:00
    16201311559559 0 702 706 -4B61806N708JBJFKBOS 44 1875592013-01-01 05:00:00
    17201311559600-1 854 902 -8UA1187N76515EWRLAS33722276 02013-01-01 06:00:00
    18201311600600 0 851 858 -7B6 371N595JBLGAFLL15210766 02013-01-01 06:00:00
    19201311600600 0 837 825 12MQ4650N542MQLGAATL134 7626 02013-01-01 06:00:00
    20201311601600 1 844 850 -6B6 343N644JBEWRPBI14710236 02013-01-01 06:00:00
    21201311602610-8 812 820 -8DL1919N971DLLGAMSP17010206102013-01-01 06:00:00
    22201311602605-3 821 805 16MQ4401N730MQLGADTW105 5026 52013-01-01 06:00:00
    23201311606610-4 858 910-12AA1895N633AAEWRMIA15210856102013-01-01 06:00:00
    24201311606610-4 837 845 -8DL1743N3739PJFKATL128 7606102013-01-01 06:00:00
    25201311607607 0 858 915-17UA1077N53442EWRMIA15710856 72013-01-01 06:00:00
    26201311608600 8 807 735 32MQ3768N9EAMQEWRORD139 7196 02013-01-01 06:00:00
    2720131161160011 945 931 14UA 303N532UAJFKSFO36625866 02013-01-01 06:00:00
    28201311613610 3 925 921 4B6 135N635JBJFKRSW17510746102013-01-01 06:00:00
    29201311615615 010391100-21B6 709N794JBJFKSJU18215986152013-01-01 06:00:00
    30201311615615 0 833 842 -9DL 575N326NBEWRATL120 7466152013-01-01 06:00:00
    ............................................................
    336747201393021232125 -222232247-24EV5489N712EVLGACHO 45 30521252013-09-30 21:00:00
    336748201393021272129 -223142323 -9EV3833N16546EWRCLT 72 52921292013-09-30 21:00:00
    336749201393021282130 -223282359-31B6 97N807JBJFKDEN213162621302013-09-30 21:00:00
    336750201393021292059 3022302232 -2EV5048N751EVLGARIC 45 29220592013-09-30 20:00:00
    336751201393021312140 -922252255-30MQ3621N807MQJFKDCA 36 21321402013-09-30 21:00:00
    336752201393021402140 0 10 40-30AA 185N335AAJFKLAX298247521402013-09-30 21:00:00
    336753201393021422129 1322502239 11EV4509N12957EWRPWM 47 28421292013-09-30 21:00:00
    336754201393021452145 0 115 140-25B61103N633JBJFKSJU192159821452013-09-30 21:00:00
    336755201393021472137 10 30 27 3B61371N627JBLGAFLL139107621372013-09-30 21:00:00
    336756201393021492156 -722452308-23UA 523N813UAEWRBOS 37 20021562013-09-30 21:00:00
    336757201393021502159 -922502306-16EV3842N10575EWRMHT 39 20921592013-09-30 21:00:00
    336758201393021591845194234420301949E3320N906XJJFKBUF 50 30118452013-09-30 18:00:00
    336759201393022032205 -223392331 8EV5311N722EVLGABGR 61 37822 52013-09-30 22:00:00
    336760201393022072140 2722572250 7MQ3660N532MQLGABNA 97 76421402013-09-30 21:00:00
    336761201393022112059 7223392242 57EV4672N12145EWRSTL120 87220592013-09-30 20:00:00
    336762201393022312245-1423352356-21B6 108N193JBJFKPWM 48 27322452013-09-30 22:00:00
    336763201393022332113 80 112 30 42UA 471N578UAEWRSFO318256521132013-09-30 21:00:00
    336764201393022352001154 592249130B61083N804JBJFKMCO123 94420 12013-09-30 20:00:00
    336765201393022372245 -823452353 -8B6 234N318JBJFKBTV 43 26622452013-09-30 22:00:00
    336766201393022402245 -523342351-17B61816N354JBJFKSYR 41 20922452013-09-30 22:00:00
    336767201393022402250-102347 7-20B62002N281JBJFKBUF 52 30122502013-09-30 22:00:00
    336768201393022412246 -52345 1-16B6 486N346JBJFKROC 47 26422462013-09-30 22:00:00
    336769201393023072255 1223592358 1B6 718N565JBJFKBOS 33 18722552013-09-30 22:00:00
    336770201393023492359-10 325 350-25B6 745N516JBJFKPSE196161723592013-09-30 23:00:00
    3367712013930 NA1842 NA NA2019 NAEV5274N740EVLGABNA NA 76418422013-09-30 18:00:00
    3367722013930 NA1455 NA NA1634 NA9E3393NA JFKDCA NA 21314552013-09-30 14:00:00
    3367732013930 NA2200 NA NA2312 NA9E3525NA LGASYR NA 19822 02013-09-30 22:00:00
    3367742013930 NA1210 NA NA1330 NAMQ3461N535MQLGABNA NA 76412102013-09-30 12:00:00
    3367752013930 NA1159 NA NA1344 NAMQ3572N511MQLGACLE NA 41911592013-09-30 11:00:00
    3367762013930 NA 840 NA NA1020 NAMQ3531N839MQLGARDU NA 431 8402013-09-30 08:00:00

    Exercises

    1. We forgot to draw the relationship between weather and airports in Figure 19.1. What is the relationship and how should it appear in the diagram?

    2. weather only contains information for the three origin airports in NYC. If it contained weather records for all airports in the USA, what additional connection would it make to flights?

    3. The year, month, day, hour, and origin variables almost form a compound key for weather, but there’s one hour that has duplicate observations. Can you figure out what’s special about that hour?

    4. We know that some days of the year are special and fewer people than usual fly on them (e.g., Christmas eve and Christmas day). How might you represent that data as a data frame? What would be the primary key? How would it connect to the existing data frames?

    5. Draw a diagram illustrating the connections between the Batting, People, and Salaries data frames in the Lahman package. Draw another diagram that shows the relationship between People, Managers, AwardsManagers. How would you characterize the relationship between the Batting, Pitching, and Fielding data frames?

    19.3 Basic joins

    dplyr provides six join functions:

    • left_join(), to add in additional metadata.
    • inner_join(),
    • right_join(),
    • full_join(),
    • semi_join(),
    • anti_join()

    Mutating joins:

    A mutating join allows you to combine variables from two data frames: it first matches observations by their keys, then copies across variables from one data frame to the other. inner_join(), right_join(), full_join() have the same interface as left_join(). The difference is which rows they keep: left join keeps all the rows in x, the right join keeps all rows in y, the full join keeps all rows in either x or y, and the inner join only keeps rows that occur in both x and y.

    flights2 <- flights |> 
      select(year, time_hour, origin, dest, tailnum, carrier)
    flights2
    
    A tibble: 336776 x 6
    yeartime_hourorigindesttailnumcarrier
    <int><dttm><chr><chr><chr><chr>
    20132013-01-01 05:00:00EWRIAHN14228UA
    20132013-01-01 05:00:00LGAIAHN24211UA
    20132013-01-01 05:00:00JFKMIAN619AAAA
    20132013-01-01 05:00:00JFKBQNN804JBB6
    20132013-01-01 06:00:00LGAATLN668DNDL
    20132013-01-01 05:00:00EWRORDN39463UA
    20132013-01-01 06:00:00EWRFLLN516JBB6
    20132013-01-01 06:00:00LGAIADN829ASEV
    20132013-01-01 06:00:00JFKMCON593JBB6
    20132013-01-01 06:00:00LGAORDN3ALAAAA
    20132013-01-01 06:00:00JFKPBIN793JBB6
    20132013-01-01 06:00:00JFKTPAN657JBB6
    20132013-01-01 06:00:00JFKLAXN29129UA
    20132013-01-01 06:00:00EWRSFON53441UA
    20132013-01-01 06:00:00LGADFWN3DUAAAA
    20132013-01-01 05:00:00JFKBOSN708JBB6
    20132013-01-01 06:00:00EWRLASN76515UA
    20132013-01-01 06:00:00LGAFLLN595JBB6
    20132013-01-01 06:00:00LGAATLN542MQMQ
    20132013-01-01 06:00:00EWRPBIN644JBB6
    20132013-01-01 06:00:00LGAMSPN971DLDL
    20132013-01-01 06:00:00LGADTWN730MQMQ
    20132013-01-01 06:00:00EWRMIAN633AAAA
    20132013-01-01 06:00:00JFKATLN3739PDL
    20132013-01-01 06:00:00EWRMIAN53442UA
    20132013-01-01 06:00:00EWRORDN9EAMQMQ
    20132013-01-01 06:00:00JFKSFON532UAUA
    20132013-01-01 06:00:00JFKRSWN635JBB6
    20132013-01-01 06:00:00JFKSJUN794JBB6
    20132013-01-01 06:00:00EWRATLN326NBDL
    ..................
    20132013-09-30 21:00:00LGACHON712EVEV
    20132013-09-30 21:00:00EWRCLTN16546EV
    20132013-09-30 21:00:00JFKDENN807JBB6
    20132013-09-30 20:00:00LGARICN751EVEV
    20132013-09-30 21:00:00JFKDCAN807MQMQ
    20132013-09-30 21:00:00JFKLAXN335AAAA
    20132013-09-30 21:00:00EWRPWMN12957EV
    20132013-09-30 21:00:00JFKSJUN633JBB6
    20132013-09-30 21:00:00LGAFLLN627JBB6
    20132013-09-30 21:00:00EWRBOSN813UAUA
    20132013-09-30 21:00:00EWRMHTN10575EV
    20132013-09-30 18:00:00JFKBUFN906XJ9E
    20132013-09-30 22:00:00LGABGRN722EVEV
    20132013-09-30 21:00:00LGABNAN532MQMQ
    20132013-09-30 20:00:00EWRSTLN12145EV
    20132013-09-30 22:00:00JFKPWMN193JBB6
    20132013-09-30 21:00:00EWRSFON578UAUA
    20132013-09-30 20:00:00JFKMCON804JBB6
    20132013-09-30 22:00:00JFKBTVN318JBB6
    20132013-09-30 22:00:00JFKSYRN354JBB6
    20132013-09-30 22:00:00JFKBUFN281JBB6
    20132013-09-30 22:00:00JFKROCN346JBB6
    20132013-09-30 22:00:00JFKBOSN565JBB6
    20132013-09-30 23:00:00JFKPSEN516JBB6
    20132013-09-30 18:00:00LGABNAN740EVEV
    20132013-09-30 14:00:00JFKDCANA 9E
    20132013-09-30 22:00:00LGASYRNA 9E
    20132013-09-30 12:00:00LGABNAN535MQMQ
    20132013-09-30 11:00:00LGACLEN511MQMQ
    20132013-09-30 08:00:00LGARDUN839MQMQ
    # add the full airline name to the flights2 data
    flights2 |>
      left_join(airlines)
    
    Joining with `by = join_by(carrier)`
    
    A tibble: 336776 x 7
    yeartime_hourorigindesttailnumcarriername
    <int><dttm><chr><chr><chr><chr><chr>
    20132013-01-01 05:00:00EWRIAHN14228UAUnited Air Lines Inc.
    20132013-01-01 05:00:00LGAIAHN24211UAUnited Air Lines Inc.
    20132013-01-01 05:00:00JFKMIAN619AAAAAmerican Airlines Inc.
    20132013-01-01 05:00:00JFKBQNN804JBB6JetBlue Airways
    20132013-01-01 06:00:00LGAATLN668DNDLDelta Air Lines Inc.
    20132013-01-01 05:00:00EWRORDN39463UAUnited Air Lines Inc.
    20132013-01-01 06:00:00EWRFLLN516JBB6JetBlue Airways
    20132013-01-01 06:00:00LGAIADN829ASEVExpressJet Airlines Inc.
    20132013-01-01 06:00:00JFKMCON593JBB6JetBlue Airways
    20132013-01-01 06:00:00LGAORDN3ALAAAAAmerican Airlines Inc.
    20132013-01-01 06:00:00JFKPBIN793JBB6JetBlue Airways
    20132013-01-01 06:00:00JFKTPAN657JBB6JetBlue Airways
    20132013-01-01 06:00:00JFKLAXN29129UAUnited Air Lines Inc.
    20132013-01-01 06:00:00EWRSFON53441UAUnited Air Lines Inc.
    20132013-01-01 06:00:00LGADFWN3DUAAAAAmerican Airlines Inc.
    20132013-01-01 05:00:00JFKBOSN708JBB6JetBlue Airways
    20132013-01-01 06:00:00EWRLASN76515UAUnited Air Lines Inc.
    20132013-01-01 06:00:00LGAFLLN595JBB6JetBlue Airways
    20132013-01-01 06:00:00LGAATLN542MQMQEnvoy Air
    20132013-01-01 06:00:00EWRPBIN644JBB6JetBlue Airways
    20132013-01-01 06:00:00LGAMSPN971DLDLDelta Air Lines Inc.
    20132013-01-01 06:00:00LGADTWN730MQMQEnvoy Air
    20132013-01-01 06:00:00EWRMIAN633AAAAAmerican Airlines Inc.
    20132013-01-01 06:00:00JFKATLN3739PDLDelta Air Lines Inc.
    20132013-01-01 06:00:00EWRMIAN53442UAUnited Air Lines Inc.
    20132013-01-01 06:00:00EWRORDN9EAMQMQEnvoy Air
    20132013-01-01 06:00:00JFKSFON532UAUAUnited Air Lines Inc.
    20132013-01-01 06:00:00JFKRSWN635JBB6JetBlue Airways
    20132013-01-01 06:00:00JFKSJUN794JBB6JetBlue Airways
    20132013-01-01 06:00:00EWRATLN326NBDLDelta Air Lines Inc.
    .....................
    20132013-09-30 21:00:00LGACHON712EVEVExpressJet Airlines Inc.
    20132013-09-30 21:00:00EWRCLTN16546EVExpressJet Airlines Inc.
    20132013-09-30 21:00:00JFKDENN807JBB6JetBlue Airways
    20132013-09-30 20:00:00LGARICN751EVEVExpressJet Airlines Inc.
    20132013-09-30 21:00:00JFKDCAN807MQMQEnvoy Air
    20132013-09-30 21:00:00JFKLAXN335AAAAAmerican Airlines Inc.
    20132013-09-30 21:00:00EWRPWMN12957EVExpressJet Airlines Inc.
    20132013-09-30 21:00:00JFKSJUN633JBB6JetBlue Airways
    20132013-09-30 21:00:00LGAFLLN627JBB6JetBlue Airways
    20132013-09-30 21:00:00EWRBOSN813UAUAUnited Air Lines Inc.
    20132013-09-30 21:00:00EWRMHTN10575EVExpressJet Airlines Inc.
    20132013-09-30 18:00:00JFKBUFN906XJ9EEndeavor Air Inc.
    20132013-09-30 22:00:00LGABGRN722EVEVExpressJet Airlines Inc.
    20132013-09-30 21:00:00LGABNAN532MQMQEnvoy Air
    20132013-09-30 20:00:00EWRSTLN12145EVExpressJet Airlines Inc.
    20132013-09-30 22:00:00JFKPWMN193JBB6JetBlue Airways
    20132013-09-30 21:00:00EWRSFON578UAUAUnited Air Lines Inc.
    20132013-09-30 20:00:00JFKMCON804JBB6JetBlue Airways
    20132013-09-30 22:00:00JFKBTVN318JBB6JetBlue Airways
    20132013-09-30 22:00:00JFKSYRN354JBB6JetBlue Airways
    20132013-09-30 22:00:00JFKBUFN281JBB6JetBlue Airways
    20132013-09-30 22:00:00JFKROCN346JBB6JetBlue Airways
    20132013-09-30 22:00:00JFKBOSN565JBB6JetBlue Airways
    20132013-09-30 23:00:00JFKPSEN516JBB6JetBlue Airways
    20132013-09-30 18:00:00LGABNAN740EVEVExpressJet Airlines Inc.
    20132013-09-30 14:00:00JFKDCANA 9EEndeavor Air Inc.
    20132013-09-30 22:00:00LGASYRNA 9EEndeavor Air Inc.
    20132013-09-30 12:00:00LGABNAN535MQMQEnvoy Air
    20132013-09-30 11:00:00LGACLEN511MQMQEnvoy Air
    20132013-09-30 08:00:00LGARDUN839MQMQEnvoy Air
    # find out the temperature and wind speed when each plane departed:
    flights2 |> 
      left_join(weather |> select(origin, time_hour, temp, wind_speed))
    
    Joining with `by = join_by(time_hour, origin)`
    
    A tibble: 336776 x 8
    yeartime_hourorigindesttailnumcarriertempwind_speed
    <int><dttm><chr><chr><chr><chr><dbl><dbl>
    20132013-01-01 05:00:00EWRIAHN14228UA39.0212.65858
    20132013-01-01 05:00:00LGAIAHN24211UA39.9214.96014
    20132013-01-01 05:00:00JFKMIAN619AAAA39.0214.96014
    20132013-01-01 05:00:00JFKBQNN804JBB639.0214.96014
    20132013-01-01 06:00:00LGAATLN668DNDL39.9216.11092
    20132013-01-01 05:00:00EWRORDN39463UA39.0212.65858
    20132013-01-01 06:00:00EWRFLLN516JBB637.9411.50780
    20132013-01-01 06:00:00LGAIADN829ASEV39.9216.11092
    20132013-01-01 06:00:00JFKMCON593JBB637.9413.80936
    20132013-01-01 06:00:00LGAORDN3ALAAAA39.9216.11092
    20132013-01-01 06:00:00JFKPBIN793JBB637.9413.80936
    20132013-01-01 06:00:00JFKTPAN657JBB637.9413.80936
    20132013-01-01 06:00:00JFKLAXN29129UA37.9413.80936
    20132013-01-01 06:00:00EWRSFON53441UA37.9411.50780
    20132013-01-01 06:00:00LGADFWN3DUAAAA39.9216.11092
    20132013-01-01 05:00:00JFKBOSN708JBB639.0214.96014
    20132013-01-01 06:00:00EWRLASN76515UA37.9411.50780
    20132013-01-01 06:00:00LGAFLLN595JBB639.9216.11092
    20132013-01-01 06:00:00LGAATLN542MQMQ39.9216.11092
    20132013-01-01 06:00:00EWRPBIN644JBB637.9411.50780
    20132013-01-01 06:00:00LGAMSPN971DLDL39.9216.11092
    20132013-01-01 06:00:00LGADTWN730MQMQ39.9216.11092
    20132013-01-01 06:00:00EWRMIAN633AAAA37.9411.50780
    20132013-01-01 06:00:00JFKATLN3739PDL37.9413.80936
    20132013-01-01 06:00:00EWRMIAN53442UA37.9411.50780
    20132013-01-01 06:00:00EWRORDN9EAMQMQ37.9411.50780
    20132013-01-01 06:00:00JFKSFON532UAUA37.9413.80936
    20132013-01-01 06:00:00JFKRSWN635JBB637.9413.80936
    20132013-01-01 06:00:00JFKSJUN794JBB637.9413.80936
    20132013-01-01 06:00:00EWRATLN326NBDL37.9411.50780
    ........................
    20132013-09-30 21:00:00LGACHON712EVEV64.94 8.05546
    20132013-09-30 21:00:00EWRCLTN16546EV62.96 3.45234
    20132013-09-30 21:00:00JFKDENN807JBB662.06 9.20624
    20132013-09-30 20:00:00LGARICN751EVEV64.94 6.90468
    20132013-09-30 21:00:00JFKDCAN807MQMQ62.06 9.20624
    20132013-09-30 21:00:00JFKLAXN335AAAA62.06 9.20624
    20132013-09-30 21:00:00EWRPWMN12957EV62.96 3.45234
    20132013-09-30 21:00:00JFKSJUN633JBB662.06 9.20624
    20132013-09-30 21:00:00LGAFLLN627JBB664.94 8.05546
    20132013-09-30 21:00:00EWRBOSN813UAUA62.96 3.45234
    20132013-09-30 21:00:00EWRMHTN10575EV62.96 3.45234
    20132013-09-30 18:00:00JFKBUFN906XJ9E64.04 6.90468
    20132013-09-30 22:00:00LGABGRN722EVEV64.94 6.90468
    20132013-09-30 21:00:00LGABNAN532MQMQ64.94 8.05546
    20132013-09-30 20:00:00EWRSTLN12145EV64.94 3.45234
    20132013-09-30 22:00:00JFKPWMN193JBB660.98 9.20624
    20132013-09-30 21:00:00EWRSFON578UAUA62.96 3.45234
    20132013-09-30 20:00:00JFKMCON804JBB662.06 8.05546
    20132013-09-30 22:00:00JFKBTVN318JBB660.98 9.20624
    20132013-09-30 22:00:00JFKSYRN354JBB660.98 9.20624
    20132013-09-30 22:00:00JFKBUFN281JBB660.98 9.20624
    20132013-09-30 22:00:00JFKROCN346JBB660.98 9.20624
    20132013-09-30 22:00:00JFKBOSN565JBB660.98 9.20624
    20132013-09-30 23:00:00JFKPSEN516JBB660.08 9.20624
    20132013-09-30 18:00:00LGABNAN740EVEV66.92 9.20624
    20132013-09-30 14:00:00JFKDCANA 9E68.0011.50780
    20132013-09-30 22:00:00LGASYRNA 9E64.94 6.90468
    20132013-09-30 12:00:00LGABNAN535MQMQ69.08 5.75390
    20132013-09-30 11:00:00LGACLEN511MQMQ66.92 8.05546
    20132013-09-30 08:00:00LGARDUN839MQMQ60.98 5.75390
    # what size of plane was flying:
    flights2 |> 
      left_join(planes |> select(tailnum, type, engines, seats))
    
    Joining with `by = join_by(tailnum)`
    
    A tibble: 336776 x 9
    yeartime_hourorigindesttailnumcarriertypeenginesseats
    <int><dttm><chr><chr><chr><chr><chr><int><int>
    20132013-01-01 05:00:00EWRIAHN14228UAFixed wing multi engine 2149
    20132013-01-01 05:00:00LGAIAHN24211UAFixed wing multi engine 2149
    20132013-01-01 05:00:00JFKMIAN619AAAAFixed wing multi engine 2178
    20132013-01-01 05:00:00JFKBQNN804JBB6Fixed wing multi engine 2200
    20132013-01-01 06:00:00LGAATLN668DNDLFixed wing multi engine 2178
    20132013-01-01 05:00:00EWRORDN39463UAFixed wing multi engine 2191
    20132013-01-01 06:00:00EWRFLLN516JBB6Fixed wing multi engine 2200
    20132013-01-01 06:00:00LGAIADN829ASEVFixed wing multi engine 2 55
    20132013-01-01 06:00:00JFKMCON593JBB6Fixed wing multi engine 2200
    20132013-01-01 06:00:00LGAORDN3ALAAAANA NA NA
    20132013-01-01 06:00:00JFKPBIN793JBB6Fixed wing multi engine 2200
    20132013-01-01 06:00:00JFKTPAN657JBB6Fixed wing multi engine 2200
    20132013-01-01 06:00:00JFKLAXN29129UAFixed wing multi engine 2178
    20132013-01-01 06:00:00EWRSFON53441UAFixed wing multi engine 2191
    20132013-01-01 06:00:00LGADFWN3DUAAAANA NA NA
    20132013-01-01 05:00:00JFKBOSN708JBB6Fixed wing multi engine 2200
    20132013-01-01 06:00:00EWRLASN76515UAFixed wing multi engine 2149
    20132013-01-01 06:00:00LGAFLLN595JBB6Fixed wing multi engine 2200
    20132013-01-01 06:00:00LGAATLN542MQMQNA NA NA
    20132013-01-01 06:00:00EWRPBIN644JBB6Fixed wing multi engine 2200
    20132013-01-01 06:00:00LGAMSPN971DLDLFixed wing multi engine 2142
    20132013-01-01 06:00:00LGADTWN730MQMQNA NA NA
    20132013-01-01 06:00:00EWRMIAN633AAAAFixed wing multi engine 2178
    20132013-01-01 06:00:00JFKATLN3739PDLFixed wing multi engine 2189
    20132013-01-01 06:00:00EWRMIAN53442UAFixed wing multi engine 2191
    20132013-01-01 06:00:00EWRORDN9EAMQMQNA NA NA
    20132013-01-01 06:00:00JFKSFON532UAUANA NA NA
    20132013-01-01 06:00:00JFKRSWN635JBB6Fixed wing multi engine 2200
    20132013-01-01 06:00:00JFKSJUN794JBB6Fixed wing multi engine 2200
    20132013-01-01 06:00:00EWRATLN326NBDLFixed wing multi engine 2145
    ...........................
    20132013-09-30 21:00:00LGACHON712EVEVFixed wing multi engine 2 80
    20132013-09-30 21:00:00EWRCLTN16546EVFixed wing multi engine 2 55
    20132013-09-30 21:00:00JFKDENN807JBB6Fixed wing multi engine 2200
    20132013-09-30 20:00:00LGARICN751EVEVFixed wing multi engine 2 80
    20132013-09-30 21:00:00JFKDCAN807MQMQNA NA NA
    20132013-09-30 21:00:00JFKLAXN335AAAAFixed wing multi engine 2255
    20132013-09-30 21:00:00EWRPWMN12957EVFixed wing multi engine 2 55
    20132013-09-30 21:00:00JFKSJUN633JBB6Fixed wing multi engine 2200
    20132013-09-30 21:00:00LGAFLLN627JBB6Fixed wing multi engine 2200
    20132013-09-30 21:00:00EWRBOSN813UAUAFixed wing multi engine 2179
    20132013-09-30 21:00:00EWRMHTN10575EVFixed wing multi engine 2 55
    20132013-09-30 18:00:00JFKBUFN906XJ9EFixed wing multi engine 2 95
    20132013-09-30 22:00:00LGABGRN722EVEVFixed wing multi engine 2 80
    20132013-09-30 21:00:00LGABNAN532MQMQNA NA NA
    20132013-09-30 20:00:00EWRSTLN12145EVFixed wing multi engine 2 55
    20132013-09-30 22:00:00JFKPWMN193JBB6Fixed wing multi engine 2 20
    20132013-09-30 21:00:00EWRSFON578UAUAFixed wing multi engine 2178
    20132013-09-30 20:00:00JFKMCON804JBB6Fixed wing multi engine 2200
    20132013-09-30 22:00:00JFKBTVN318JBB6Fixed wing multi engine 2 20
    20132013-09-30 22:00:00JFKSYRN354JBB6Fixed wing multi engine 2 20
    20132013-09-30 22:00:00JFKBUFN281JBB6Fixed wing multi engine 2 20
    20132013-09-30 22:00:00JFKROCN346JBB6Fixed wing multi engine 2 20
    20132013-09-30 22:00:00JFKBOSN565JBB6Fixed wing multi engine 2200
    20132013-09-30 23:00:00JFKPSEN516JBB6Fixed wing multi engine 2200
    20132013-09-30 18:00:00LGABNAN740EVEVFixed wing multi engine 2 80
    20132013-09-30 14:00:00JFKDCANA 9ENA NA NA
    20132013-09-30 22:00:00LGASYRNA 9ENA NA NA
    20132013-09-30 12:00:00LGABNAN535MQMQNA NA NA
    20132013-09-30 11:00:00LGACLEN511MQMQNA NA NA
    20132013-09-30 08:00:00LGARDUN839MQMQNA NA NA
    # When left_join() fails to find a match for a row in x, it fills in the new variables with missing values. 
    flights2 |> 
      filter(tailnum == "N535MQ") |> 
      left_join(planes |> select(tailnum, type, engines, seats))
    
    Joining with `by = join_by(tailnum)`
    
    A tibble: 264 x 9
    yeartime_hourorigindesttailnumcarriertypeenginesseats
    <int><dttm><chr><chr><chr><chr><chr><int><int>
    20132013-01-02 16:00:00LGABNAN535MQMQNANANA
    20132013-01-03 06:00:00LGACLTN535MQMQNANANA
    20132013-01-03 11:00:00LGAMSPN535MQMQNANANA
    20132013-01-03 18:00:00LGAMSPN535MQMQNANANA
    20132013-01-11 16:00:00LGAATLN535MQMQNANANA
    20132013-01-12 12:00:00LGABNAN535MQMQNANANA
    20132013-01-12 18:00:00LGACLEN535MQMQNANANA
    20132013-01-13 15:00:00LGACLTN535MQMQNANANA
    20132013-01-13 20:00:00LGAATLN535MQMQNANANA
    20132013-01-14 08:00:00LGAATLN535MQMQNANANA
    20132013-01-14 20:00:00LGAATLN535MQMQNANANA
    20132013-01-18 11:00:00JFKDCAN535MQMQNANANA
    20132013-01-19 15:00:00JFKDCAN535MQMQNANANA
    20132013-01-19 19:00:00JFKCMHN535MQMQNANANA
    20132013-01-20 08:00:00LGAATLN535MQMQNANANA
    20132013-01-20 14:00:00LGAMSPN535MQMQNANANA
    20132013-01-20 21:00:00LGABNAN535MQMQNANANA
    20132013-01-21 09:00:00LGAATLN535MQMQNANANA
    20132013-01-23 18:00:00EWRORDN535MQMQNANANA
    20132013-01-25 06:00:00LGAMSPN535MQMQNANANA
    20132013-01-26 08:00:00LGAATLN535MQMQNANANA
    20132013-01-26 15:00:00LGACLTN535MQMQNANANA
    20132013-01-27 11:00:00LGAMSPN535MQMQNANANA
    20132013-01-28 09:00:00LGACLTN535MQMQNANANA
    20132013-01-28 14:00:00LGAATLN535MQMQNANANA
    20132013-01-31 12:00:00EWRORDN535MQMQNANANA
    20132013-10-01 09:00:00LGACLTN535MQMQNANANA
    20132013-10-01 14:00:00LGAMSPN535MQMQNANANA
    20132013-10-01 21:00:00LGABNAN535MQMQNANANA
    20132013-10-02 15:00:00LGACLTN535MQMQNANANA
    ...........................
    20132013-08-30 10:00:00EWRORDN535MQMQNANANA
    20132013-08-30 15:00:00EWRORDN535MQMQNANANA
    20132013-08-31 13:00:00EWRORDN535MQMQNANANA
    20132013-09-03 13:00:00EWRORDN535MQMQNANANA
    20132013-09-05 15:00:00EWRORDN535MQMQNANANA
    20132013-09-13 13:00:00EWRORDN535MQMQNANANA
    20132013-09-18 10:00:00LGAATLN535MQMQNANANA
    20132013-09-18 16:00:00LGABNAN535MQMQNANANA
    20132013-09-19 06:00:00LGAATLN535MQMQNANANA
    20132013-09-19 12:00:00LGAMSPN535MQMQNANANA
    20132013-09-20 13:00:00LGACLEN535MQMQNANANA
    20132013-09-20 18:00:00LGACLEN535MQMQNANANA
    20132013-09-21 11:00:00LGAMSPN535MQMQNANANA
    20132013-09-23 11:00:00JFKDCAN535MQMQNANANA
    20132013-09-25 06:00:00EWRORDN535MQMQNANANA
    20132013-09-25 20:00:00JFKCMHN535MQMQNANANA
    20132013-09-26 11:00:00JFKDCAN535MQMQNANANA
    20132013-09-26 15:00:00JFKDCAN535MQMQNANANA
    20132013-09-27 11:00:00LGACLEN535MQMQNANANA
    20132013-09-27 15:00:00LGARDUN535MQMQNANANA
    20132013-09-27 21:00:00LGACLTN535MQMQNANANA
    20132013-09-28 11:00:00LGACLEN535MQMQNANANA
    20132013-09-29 08:00:00LGADTWN535MQMQNANANA
    20132013-09-29 13:00:00LGARDUN535MQMQNANANA
    20132013-09-29 17:00:00LGARDUN535MQMQNANANA
    20132013-09-29 21:00:00LGABNAN535MQMQNANANA
    20132013-09-30 11:00:00LGARDUN535MQMQNANANA
    20132013-09-30 15:00:00LGARDUN535MQMQNANANA
    20132013-09-30 21:00:00LGARDUN535MQMQNANANA
    20132013-09-30 12:00:00LGABNAN535MQMQNANANA

    Specifying join keys

    By default, left_join() will use all variables that appear in both data frames as the join key, the so called natural join. This is a useful heuristic, but it doesn’t always work, for example, both dataframes have the same column’s name (but with different meanings)

    flights2 |> 
      left_join(airports, join_by(origin == faa))
    
    A tibble: 336776 x 13
    yeartime_hourorigindesttailnumcarriernamelatlonalttzdsttzone
    <int><dttm><chr><chr><chr><chr><chr><dbl><dbl><dbl><dbl><chr><chr>
    20132013-01-01 05:00:00EWRIAHN14228UANewark Liberty Intl40.69250-74.1686718-5AAmerica/New_York
    20132013-01-01 05:00:00LGAIAHN24211UALa Guardia 40.77725-73.8726122-5AAmerica/New_York
    20132013-01-01 05:00:00JFKMIAN619AAAAJohn F Kennedy Intl40.63975-73.7789313-5AAmerica/New_York
    20132013-01-01 05:00:00JFKBQNN804JBB6John F Kennedy Intl40.63975-73.7789313-5AAmerica/New_York
    20132013-01-01 06:00:00LGAATLN668DNDLLa Guardia 40.77725-73.8726122-5AAmerica/New_York
    20132013-01-01 05:00:00EWRORDN39463UANewark Liberty Intl40.69250-74.1686718-5AAmerica/New_York
    20132013-01-01 06:00:00EWRFLLN516JBB6Newark Liberty Intl40.69250-74.1686718-5AAmerica/New_York
    20132013-01-01 06:00:00LGAIADN829ASEVLa Guardia 40.77725-73.8726122-5AAmerica/New_York
    20132013-01-01 06:00:00JFKMCON593JBB6John F Kennedy Intl40.63975-73.7789313-5AAmerica/New_York
    20132013-01-01 06:00:00LGAORDN3ALAAAALa Guardia 40.77725-73.8726122-5AAmerica/New_York
    20132013-01-01 06:00:00JFKPBIN793JBB6John F Kennedy Intl40.63975-73.7789313-5AAmerica/New_York
    20132013-01-01 06:00:00JFKTPAN657JBB6John F Kennedy Intl40.63975-73.7789313-5AAmerica/New_York
    20132013-01-01 06:00:00JFKLAXN29129UAJohn F Kennedy Intl40.63975-73.7789313-5AAmerica/New_York
    20132013-01-01 06:00:00EWRSFON53441UANewark Liberty Intl40.69250-74.1686718-5AAmerica/New_York
    20132013-01-01 06:00:00LGADFWN3DUAAAALa Guardia 40.77725-73.8726122-5AAmerica/New_York
    20132013-01-01 05:00:00JFKBOSN708JBB6John F Kennedy Intl40.63975-73.7789313-5AAmerica/New_York
    20132013-01-01 06:00:00EWRLASN76515UANewark Liberty Intl40.69250-74.1686718-5AAmerica/New_York
    20132013-01-01 06:00:00LGAFLLN595JBB6La Guardia 40.77725-73.8726122-5AAmerica/New_York
    20132013-01-01 06:00:00LGAATLN542MQMQLa Guardia 40.77725-73.8726122-5AAmerica/New_York
    20132013-01-01 06:00:00EWRPBIN644JBB6Newark Liberty Intl40.69250-74.1686718-5AAmerica/New_York
    20132013-01-01 06:00:00LGAMSPN971DLDLLa Guardia 40.77725-73.8726122-5AAmerica/New_York
    20132013-01-01 06:00:00LGADTWN730MQMQLa Guardia 40.77725-73.8726122-5AAmerica/New_York
    20132013-01-01 06:00:00EWRMIAN633AAAANewark Liberty Intl40.69250-74.1686718-5AAmerica/New_York
    20132013-01-01 06:00:00JFKATLN3739PDLJohn F Kennedy Intl40.63975-73.7789313-5AAmerica/New_York
    20132013-01-01 06:00:00EWRMIAN53442UANewark Liberty Intl40.69250-74.1686718-5AAmerica/New_York
    20132013-01-01 06:00:00EWRORDN9EAMQMQNewark Liberty Intl40.69250-74.1686718-5AAmerica/New_York
    20132013-01-01 06:00:00JFKSFON532UAUAJohn F Kennedy Intl40.63975-73.7789313-5AAmerica/New_York
    20132013-01-01 06:00:00JFKRSWN635JBB6John F Kennedy Intl40.63975-73.7789313-5AAmerica/New_York
    20132013-01-01 06:00:00JFKSJUN794JBB6John F Kennedy Intl40.63975-73.7789313-5AAmerica/New_York
    20132013-01-01 06:00:00EWRATLN326NBDLNewark Liberty Intl40.69250-74.1686718-5AAmerica/New_York
    .......................................
    20132013-09-30 21:00:00LGACHON712EVEVLa Guardia 40.77725-73.8726122-5AAmerica/New_York
    20132013-09-30 21:00:00EWRCLTN16546EVNewark Liberty Intl40.69250-74.1686718-5AAmerica/New_York
    20132013-09-30 21:00:00JFKDENN807JBB6John F Kennedy Intl40.63975-73.7789313-5AAmerica/New_York
    20132013-09-30 20:00:00LGARICN751EVEVLa Guardia 40.77725-73.8726122-5AAmerica/New_York
    20132013-09-30 21:00:00JFKDCAN807MQMQJohn F Kennedy Intl40.63975-73.7789313-5AAmerica/New_York
    20132013-09-30 21:00:00JFKLAXN335AAAAJohn F Kennedy Intl40.63975-73.7789313-5AAmerica/New_York
    20132013-09-30 21:00:00EWRPWMN12957EVNewark Liberty Intl40.69250-74.1686718-5AAmerica/New_York
    20132013-09-30 21:00:00JFKSJUN633JBB6John F Kennedy Intl40.63975-73.7789313-5AAmerica/New_York
    20132013-09-30 21:00:00LGAFLLN627JBB6La Guardia 40.77725-73.8726122-5AAmerica/New_York
    20132013-09-30 21:00:00EWRBOSN813UAUANewark Liberty Intl40.69250-74.1686718-5AAmerica/New_York
    20132013-09-30 21:00:00EWRMHTN10575EVNewark Liberty Intl40.69250-74.1686718-5AAmerica/New_York
    20132013-09-30 18:00:00JFKBUFN906XJ9EJohn F Kennedy Intl40.63975-73.7789313-5AAmerica/New_York
    20132013-09-30 22:00:00LGABGRN722EVEVLa Guardia 40.77725-73.8726122-5AAmerica/New_York
    20132013-09-30 21:00:00LGABNAN532MQMQLa Guardia 40.77725-73.8726122-5AAmerica/New_York
    20132013-09-30 20:00:00EWRSTLN12145EVNewark Liberty Intl40.69250-74.1686718-5AAmerica/New_York
    20132013-09-30 22:00:00JFKPWMN193JBB6John F Kennedy Intl40.63975-73.7789313-5AAmerica/New_York
    20132013-09-30 21:00:00EWRSFON578UAUANewark Liberty Intl40.69250-74.1686718-5AAmerica/New_York
    20132013-09-30 20:00:00JFKMCON804JBB6John F Kennedy Intl40.63975-73.7789313-5AAmerica/New_York
    20132013-09-30 22:00:00JFKBTVN318JBB6John F Kennedy Intl40.63975-73.7789313-5AAmerica/New_York
    20132013-09-30 22:00:00JFKSYRN354JBB6John F Kennedy Intl40.63975-73.7789313-5AAmerica/New_York
    20132013-09-30 22:00:00JFKBUFN281JBB6John F Kennedy Intl40.63975-73.7789313-5AAmerica/New_York
    20132013-09-30 22:00:00JFKROCN346JBB6John F Kennedy Intl40.63975-73.7789313-5AAmerica/New_York
    20132013-09-30 22:00:00JFKBOSN565JBB6John F Kennedy Intl40.63975-73.7789313-5AAmerica/New_York
    20132013-09-30 23:00:00JFKPSEN516JBB6John F Kennedy Intl40.63975-73.7789313-5AAmerica/New_York
    20132013-09-30 18:00:00LGABNAN740EVEVLa Guardia 40.77725-73.8726122-5AAmerica/New_York
    20132013-09-30 14:00:00JFKDCANA 9EJohn F Kennedy Intl40.63975-73.7789313-5AAmerica/New_York
    20132013-09-30 22:00:00LGASYRNA 9ELa Guardia 40.77725-73.8726122-5AAmerica/New_York
    20132013-09-30 12:00:00LGABNAN535MQMQLa Guardia 40.77725-73.8726122-5AAmerica/New_York
    20132013-09-30 11:00:00LGACLEN511MQMQLa Guardia 40.77725-73.8726122-5AAmerica/New_York
    20132013-09-30 08:00:00LGARDUN839MQMQLa Guardia 40.77725-73.8726122-5AAmerica/New_York

    Filtering joins

    The primary action of a filtering join is to filter the rows.

    • semi_join() keep all rows in x that have a match in y.
    • anti_join() return all rows in x that don’t have a match in y. They’re useful for finding missing values that are implicit in the data.
    airports |> 
      semi_join(flights2, join_by(faa == origin))
    
    A tibble: 3 x 8
    faanamelatlonalttzdsttzone
    <chr><chr><dbl><dbl><dbl><dbl><chr><chr>
    EWRNewark Liberty Intl40.69250-74.1686718-5AAmerica/New_York
    JFKJohn F Kennedy Intl40.63975-73.7789313-5AAmerica/New_York
    LGALa Guardia 40.77725-73.8726122-5AAmerica/New_York
    airports |> 
      semi_join(flights2, join_by(faa == dest))
    
    A tibble: 101 x 8
    faanamelatlonalttzdsttzone
    <chr><chr><dbl><dbl><dbl><dbl><chr><chr>
    ABQAlbuquerque International Sunport 35.04022-106.609195355-7AAmerica/Denver
    ACKNantucket Mem 41.25305 -70.06018 48-5AAmerica/New_York
    ALBAlbany Intl 42.74827 -73.80169 285-5AAmerica/New_York
    ANCTed Stevens Anchorage Intl 61.17436-149.99636 152-9AAmerica/Anchorage
    ATLHartsfield Jackson Atlanta Intl 33.63672 -84.428071026-5AAmerica/New_York
    AUSAustin Bergstrom Intl 30.19453 -97.66989 542-6AAmerica/Chicago
    AVLAsheville Regional Airport 35.43619 -82.541812165-5AAmerica/New_York
    BDLBradley Intl 41.93889 -72.68322 173-5AAmerica/New_York
    BGRBangor Intl 44.80744 -68.82814 192-5AAmerica/New_York
    BHMBirmingham Intl 33.56294 -86.75355 644-6AAmerica/Chicago
    BNANashville Intl 36.12447 -86.67819 599-6AAmerica/Chicago
    BOSGeneral Edward Lawrence Logan Intl42.36435 -71.00518 19-5AAmerica/New_York
    BTVBurlington Intl 44.47186 -73.15328 335-5AAmerica/New_York
    BUFBuffalo Niagara Intl 42.94053 -78.73217 724-5AAmerica/New_York
    BURBob Hope 34.20067-118.35867 778-8AAmerica/Los_Angeles
    BWIBaltimore Washington Intl 39.17536 -76.66833 146-5AAmerica/New_York
    BZNGallatin Field 45.77764-111.160154500-7AAmerica/Denver
    CAEColumbia Metropolitan 33.93883 -81.11953 236-5AAmerica/New_York
    CAKAkron Canton Regional Airport 40.91608 -81.442191228-5AAmerica/New_York
    CHOCharlottesville-Albemarle 38.13864 -78.45286 639-5AAmerica/New_York
    CHSCharleston Afb Intl 32.89865 -80.04053 45-5AAmerica/New_York
    CLECleveland Hopkins Intl 41.41169 -81.84979 791-5AAmerica/New_York
    CLTCharlotte Douglas Intl 35.21400 -80.94314 748-5AAmerica/New_York
    CMHPort Columbus Intl 39.99797 -82.89189 815-5AAmerica/New_York
    CRWYeager 38.37315 -81.59319 981-5AAmerica/New_York
    CVGCincinnati Northern Kentucky Intl 39.04884 -84.66782 896-5AAmerica/New_York
    DAYJames M Cox Dayton Intl 39.90237 -84.219371009-5AAmerica/New_York
    DCARonald Reagan Washington Natl 38.85208 -77.03772 15-5AAmerica/New_York
    DENDenver Intl 39.86166-104.673185431-7AAmerica/Denver
    DFWDallas Fort Worth Intl 32.89683 -97.03800 607-6AAmerica/Chicago
    ........................
    PDXPortland Intl 45.58872-122.59750 30-8AAmerica/Los_Angeles
    PHLPhiladelphia Intl 39.87194 -75.24114 36-5AAmerica/New_York
    PHXPhoenix Sky Harbor Intl 33.43428-112.011581135-7NAmerica/Phoenix
    PITPittsburgh Intl 40.49147 -80.232871204-5AAmerica/New_York
    PSPPalm Springs Intl 33.82967-116.50669 477-8AAmerica/Los_Angeles
    PVDTheodore Francis Green State 41.73258 -71.42038 55-5AAmerica/New_York
    PWMPortland Intl Jetport 43.64616 -70.30928 77-5AAmerica/New_York
    RDURaleigh Durham Intl 35.87764 -78.78747 435-5AAmerica/New_York
    RICRichmond Intl 37.50517 -77.31967 167-5AAmerica/New_York
    ROCGreater Rochester Intl 43.11887 -77.67239 559-5AAmerica/New_York
    RSWSouthwest Florida Intl 26.53617 -81.75517 30-5AAmerica/New_York
    SANSan Diego Intl 32.73356-117.18967 17-8AAmerica/Los_Angeles
    SATSan Antonio Intl 29.53369 -98.46978 809-6AAmerica/Chicago
    SAVSavannah Hilton Head Intl 32.12758 -81.20214 51-5AAmerica/New_York
    SBNSouth Bend Rgnl 41.70866 -86.31725 799-5AAmerica/New_York
    SDFLouisville International Airport38.17409 -85.73650 501-5AAmerica/New_York
    SEASeattle Tacoma Intl 47.44900-122.30931 433-8AAmerica/Los_Angeles
    SFOSan Francisco Intl 37.61897-122.37489 13-8AAmerica/Los_Angeles
    SJCNorman Y Mineta San Jose Intl 37.36260-121.92902 62-8AAmerica/Los_Angeles
    SLCSalt Lake City Intl 40.78839-111.977774227-7AAmerica/Denver
    SMFSacramento Intl 38.69542-121.59078 27-8AAmerica/Los_Angeles
    SNAJohn Wayne Arpt Orange Co 33.67567-117.86822 56-8AAmerica/Los_Angeles
    SRQSarasota Bradenton Intl 27.39544 -82.55439 30-5AAmerica/New_York
    STLLambert St Louis Intl 38.74870 -90.37003 618-6AAmerica/Chicago
    SYRSyracuse Hancock Intl 43.11119 -76.10631 421-5AAmerica/New_York
    TPATampa Intl 27.97547 -82.53325 26-5AAmerica/New_York
    TULTulsa Intl 36.19839 -95.88811 677-6AAmerica/Chicago
    TVCCherry Capital Airport 44.74144 -85.58223 624-5AAmerica/New_York
    TYSMc Ghee Tyson 35.81097 -83.99403 981-5AAmerica/New_York
    XNANW Arkansas Regional 36.28187 -94.306811287-6AAmerica/Chicago
    # find rows that are missing from airports by looking for flights that don’t have a matching destination airport.
    flights2 |> 
      anti_join(airports, join_by(dest == faa)) |> 
      distinct(dest)
    
    A tibble: 4 x 1
    dest
    <chr>
    BQN
    SJU
    STT
    PSE
    #  find which tailnums are missing from planes
    flights2 |>
      anti_join(planes, join_by(tailnum)) |> 
      distinct(tailnum)
    
    A tibble: 722 x 1
    tailnum
    <chr>
    N3ALAA
    N3DUAA
    N542MQ
    N730MQ
    N9EAMQ
    N532UA
    N3EMAA
    N518MQ
    N3BAAA
    N3CYAA
    N426US
    N3GKAA
    N4WNAA
    N5FMAA
    N722MQ
    N3EKAA
    N3ETAA
    N541AA
    N4WRAA
    N4WPAA
    N508MQ
    N3HMAA
    N828MQ
    N3GEAA
    N739MQ
    N531MQ
    N527JB
    N846MQ
    N3GVAA
    N4YCAA
    ...
    N7BAAA
    N7BVAA
    N626MQ
    N675MQ
    N580AA
    N717MQ
    N738MQ
    N720MQ
    N7ASAA
    N328AT
    N735MQ
    N5EDAA
    N5DJAA
    N7ALAA
    N721MQ
    N7BGAA
    N5ESAA
    N456UW
    N838MQ
    N442US
    N502SW
    N451UW
    N7BKAA
    N800MQ
    N7CAAA
    N823MQ
    N5FCAA
    N5ERAA
    N654MQ
    N647MQ

    Exercises

    1. Find the 48 hours (over the course of the whole year) that have the worst delays. Cross-reference it with the weather data. Can you see any patterns?

    2. Imagine you’ve found the top 10 most popular destinations using this code:

    top_dest <- flights2 |>
      count(dest, sort = TRUE) |>
      head(10)
    

    How can you find all flights to those destinations?

    1. Does every departing flight have corresponding weather data for that hour?

    2. What do the tail numbers that don’t have a matching record in planes have in common? (Hint: one variable explains ~90% of the problems.)

    3. Add a column to planes that lists every carrier that has flown that plane. You might expect that there’s an implicit relationship between plane and airline, because each plane is flown by a single airline. Confirm or reject this hypothesis using the tools you’ve learned in previous chapters.

    4. Add the latitude and the longitude of the origin and destination airport to flights. Is it easier to rename the columns before or after the join?

    5. Compute the average delay by destination, then join on the airports data frame so you can show the spatial distribution of delays. Here’s an easy way to draw a map of the United States:
      airports |>
        semi_join(flights, join_by(faa == dest)) |>
        ggplot(aes(x = lon, y = lat)) +
       borders("state") +
       geom_point() +
       coord_quickmap()
      

      You might want to use the size or color of the points to display the average delay for each airport.

    6. What happened on June 13 2013? Draw a map of the delays, and then use Google to cross-reference with the weather.

    19.4 How do joins work?

    Inner join{width=30%} left join{width=30%} right join{width=30%}

    full join{width=30%} Venn diagram{width=30%}

    x <- tribble(
      ~key, ~val_x,
         1, "x1",
         2, "x2",
         3, "x3"
    )
    y <- tribble(
      ~key, ~val_y,
         1, "y1",
         2, "y2",
         4, "y3"
    )
    
    x |>
    inner_join(y)
    
    Joining with `by = join_by(key)`
    
    A tibble: 2 x 3
    keyval_xval_y
    <dbl><chr><chr>
    1x1y1
    2x2y2
    x |>
    left_join(y)
    
    Joining with `by = join_by(key)`
    
    A tibble: 3 x 3
    keyval_xval_y
    <dbl><chr><chr>
    1x1y1
    2x2y2
    3x3NA
    x |> 
    right_join(y)
    
    Joining with `by = join_by(key)`
    
    A tibble: 3 x 3
    keyval_xval_y
    <dbl><chr><chr>
    1x1y1
    2x2y2
    4NAy3
    full_join(x,y)
    
    Joining with `by = join_by(key)`
    
    A tibble: 4 x 3
    keyval_xval_y
    <dbl><chr><chr>
    1x1y1
    2x2y2
    3x3NA
    4NAy3

    Row matching

    Inner join{width=30%}

    one particularly dangerous case which can cause a combinatorial explosion of rows.

    Filterring joins

    semi join{width=30%} anti join{width=30%}

    • In a semi-join it only matters that there is a match; otherwise values in y don’t affect the output.
    • An anti-join is the inverse of a semi-join, dropping rows from x that have a match in y.
    semi_join(x, y)
    anti_join(x, y)
    
    Joining with `by = join_by(key)`
    
    A tibble: 2 x 2
    keyval_x
    <dbl><chr>
    1x1
    2x2
    Joining with `by = join_by(key)`
    
    A tibble: 1 x 2
    keyval_x
    <dbl><chr>
    3x3

    19.5 Non-equi joins

    • Cross joins match every pair of rows.
    • Inequality joins use <, <=, >, and >= instead of ==.
    • Rolling joins are similar to inequality joins but only find the closest match.
    • Overlap joins are a special type of inequality join designed to work with ranges.
    # equi join (where the rows match if the x key equals the y key)
    x |> 
    inner_join(y, join_by(key == key), keep = TRUE)
    
    A tibble: 2 x 4
    key.xval_xkey.yval_y
    <dbl><chr><dbl><chr>
    1x11y1
    2x22y2
    # non-equi join
    x |> 
    inner_join(y, join_by(key >= key), keep = TRUE)
    
    A tibble: 5 x 4
    key.xval_xkey.yval_y
    <dbl><chr><dbl><chr>
    1x11y1
    2x21y1
    2x22y2
    3x31y1
    3x32y2

    Cross join{width=30%} Inequality join{width=30%} Rolling join{width=30%}

    cross_join(x,y)
    
    A tibble: 9 x 4
    key.xval_xkey.yval_y
    <dbl><chr><dbl><chr>
    1x11y1
    1x12y2
    1x14y3
    2x21y1
    2x22y2
    2x24y3
    3x31y1
    3x32y2
    3x34y3
    inner_join(x, y, join_by(key < key))
    
    A tibble: 4 x 4
    key.xval_xkey.yval_y
    <dbl><chr><dbl><chr>
    1x12y2
    1x14y3
    2x24y3
    3x34y3
    left_join(x, y, join_by(closest(key >= key)))
    
    A tibble: 3 x 4
    key.xval_xkey.yval_y
    <dbl><chr><dbl><chr>
    1x11y1
    2x22y2
    3x32y2
    anti_join(x, y, join_by(closest(key >= key)))
    
    A tibble: 0 x 2
    keyval_x
    <dbl><chr>

    Overlap join

    parties <- tibble(
      q = 1:4,
      party = ymd(c("2022-01-10", "2022-04-04", "2022-07-11", "2022-10-03")),
      start = ymd(c("2022-01-01", "2022-04-04", "2022-07-11", "2022-10-03")),
      end = ymd(c("2022-04-03", "2022-07-11", "2022-10-02", "2022-12-31"))
    )
    parties
    
    A tibble: 4 x 4
    qpartystartend
    <int><date><date><date>
    12022-01-102022-01-012022-04-03
    22022-04-042022-04-042022-07-11
    32022-07-112022-07-112022-10-02
    42022-10-032022-10-032022-12-31
    parties |> 
      inner_join(parties, join_by(overlaps(start, end, start, end), q < q)) |> 
      select(start.x, end.x, start.y, end.y)
    
    A tibble: 1 x 4
    start.xend.xstart.yend.y
    <date><date><date><date>
    2022-04-042022-07-112022-07-112022-10-02
    parties <- tibble(
      q = 1:4,
      party = ymd(c("2022-01-10", "2022-04-04", "2022-07-11", "2022-10-03")),
      start = ymd(c("2022-01-01", "2022-04-04", "2022-07-11", "2022-10-03")),
      end = ymd(c("2022-04-03", "2022-07-10", "2022-10-02", "2022-12-31"))
    )
    
    set.seed(123)
    employees <- tibble(
      name = sample(babynames::babynames$name, 100),
      birthday = ymd("2022-01-01") + (sample(365, 100, replace = TRUE) - 1)
    )
    employees
    
    A tibble: 100 x 2
    namebirthday
    <chr><date>
    Kemba 2022-01-22
    Orean 2022-06-26
    Kirstyn 2022-02-11
    Amparo 2022-11-11
    Belen 2022-03-25
    Rayshaun 2022-01-11
    Brazil 2022-05-01
    Chaston 2022-10-29
    Reyn 2022-03-26
    Ogechi 2022-12-31
    Raylin 2022-07-13
    Elisha 2022-04-17
    Francico 2022-03-18
    Theoplis 2022-07-17
    Ashea 2022-09-06
    Angela 2022-07-19
    Essie 2022-06-09
    Allyn 2022-09-07
    Tanita 2022-10-19
    Sherriann2022-01-16
    Raven 2022-02-02
    Glenys 2022-02-09
    Li 2022-01-10
    Miana 2022-07-19
    Sewell 2022-05-05
    Wayland 2022-09-22
    Nadean 2022-09-20
    Gregoria 2022-07-05
    Sumiye 2022-03-02
    Norvell 2022-09-09
    ......
    Liora 2022-07-31
    Andrina 2022-08-10
    Gay 2022-06-08
    Dustyn 2022-12-15
    Nasiya 2022-05-25
    Belma 2022-02-26
    Thena 2022-05-28
    Cooper 2022-06-12
    Adele 2022-08-26
    Atiana 2022-06-10
    Jamil 2022-03-07
    Nalani 2022-01-04
    Nathalie 2022-11-26
    Duriel 2022-08-13
    Yesenia 2022-04-27
    Sherlie 2022-01-25
    Brooksley 2022-05-16
    Kristi 2022-02-24
    Alethea 2022-08-05
    Teandre 2022-03-26
    Mohamedali2022-02-14
    Audray 2022-05-26
    Elta 2022-06-19
    Suzannah 2022-05-14
    Cleve 2022-07-18
    Promise 2022-12-27
    Ian 2022-06-25
    Shelva 2022-09-02
    Aedan 2022-11-05
    Dorianna 2022-04-14
    employees |> 
      inner_join(parties, join_by(between(birthday, start, end)), unmatched = "error")
    
    A tibble: 100 x 6
    namebirthdayqpartystartend
    <chr><date><int><date><date><date>
    Kemba 2022-01-2212022-01-102022-01-012022-04-03
    Orean 2022-06-2622022-04-042022-04-042022-07-10
    Kirstyn 2022-02-1112022-01-102022-01-012022-04-03
    Amparo 2022-11-1142022-10-032022-10-032022-12-31
    Belen 2022-03-2512022-01-102022-01-012022-04-03
    Rayshaun 2022-01-1112022-01-102022-01-012022-04-03
    Brazil 2022-05-0122022-04-042022-04-042022-07-10
    Chaston 2022-10-2942022-10-032022-10-032022-12-31
    Reyn 2022-03-2612022-01-102022-01-012022-04-03
    Ogechi 2022-12-3142022-10-032022-10-032022-12-31
    Raylin 2022-07-1332022-07-112022-07-112022-10-02
    Elisha 2022-04-1722022-04-042022-04-042022-07-10
    Francico 2022-03-1812022-01-102022-01-012022-04-03
    Theoplis 2022-07-1732022-07-112022-07-112022-10-02
    Ashea 2022-09-0632022-07-112022-07-112022-10-02
    Angela 2022-07-1932022-07-112022-07-112022-10-02
    Essie 2022-06-0922022-04-042022-04-042022-07-10
    Allyn 2022-09-0732022-07-112022-07-112022-10-02
    Tanita 2022-10-1942022-10-032022-10-032022-12-31
    Sherriann2022-01-1612022-01-102022-01-012022-04-03
    Raven 2022-02-0212022-01-102022-01-012022-04-03
    Glenys 2022-02-0912022-01-102022-01-012022-04-03
    Li 2022-01-1012022-01-102022-01-012022-04-03
    Miana 2022-07-1932022-07-112022-07-112022-10-02
    Sewell 2022-05-0522022-04-042022-04-042022-07-10
    Wayland 2022-09-2232022-07-112022-07-112022-10-02
    Nadean 2022-09-2032022-07-112022-07-112022-10-02
    Gregoria 2022-07-0522022-04-042022-04-042022-07-10
    Sumiye 2022-03-0212022-01-102022-01-012022-04-03
    Norvell 2022-09-0932022-07-112022-07-112022-10-02
    ..................
    Liora 2022-07-3132022-07-112022-07-112022-10-02
    Andrina 2022-08-1032022-07-112022-07-112022-10-02
    Gay 2022-06-0822022-04-042022-04-042022-07-10
    Dustyn 2022-12-1542022-10-032022-10-032022-12-31
    Nasiya 2022-05-2522022-04-042022-04-042022-07-10
    Belma 2022-02-2612022-01-102022-01-012022-04-03
    Thena 2022-05-2822022-04-042022-04-042022-07-10
    Cooper 2022-06-1222022-04-042022-04-042022-07-10
    Adele 2022-08-2632022-07-112022-07-112022-10-02
    Atiana 2022-06-1022022-04-042022-04-042022-07-10
    Jamil 2022-03-0712022-01-102022-01-012022-04-03
    Nalani 2022-01-0412022-01-102022-01-012022-04-03
    Nathalie 2022-11-2642022-10-032022-10-032022-12-31
    Duriel 2022-08-1332022-07-112022-07-112022-10-02
    Yesenia 2022-04-2722022-04-042022-04-042022-07-10
    Sherlie 2022-01-2512022-01-102022-01-012022-04-03
    Brooksley 2022-05-1622022-04-042022-04-042022-07-10
    Kristi 2022-02-2412022-01-102022-01-012022-04-03
    Alethea 2022-08-0532022-07-112022-07-112022-10-02
    Teandre 2022-03-2612022-01-102022-01-012022-04-03
    Mohamedali2022-02-1412022-01-102022-01-012022-04-03
    Audray 2022-05-2622022-04-042022-04-042022-07-10
    Elta 2022-06-1922022-04-042022-04-042022-07-10
    Suzannah 2022-05-1422022-04-042022-04-042022-07-10
    Cleve 2022-07-1832022-07-112022-07-112022-10-02
    Promise 2022-12-2742022-10-032022-10-032022-12-31
    Ian 2022-06-2522022-04-042022-04-042022-07-10
    Shelva 2022-09-0232022-07-112022-07-112022-10-02
    Aedan 2022-11-0542022-10-032022-10-032022-12-31
    Dorianna 2022-04-1422022-04-042022-04-042022-07-10

    Exercises

    x |> full_join(y, join_by(key == key))
    
    A tibble: 4 x 3
    keyval_xval_y
    <dbl><chr><chr>
    1x1y1
    2x2y2
    3x3NA
    4NAy3
    x |> full_join(y, join_by(key == key), keep = TRUE)
    
    A tibble: 4 x 4
    key.xval_xkey.yval_y
    <dbl><chr><dbl><chr>
    1x1 1y1
    2x2 2y2
    3x3NANA
    NANA 4y3
    1. When finding if any party period overlapped with another party period we used q < q in the join_by()? Why? What happens if you remove this inequality?`
    Share: Twitter Facebook LinkedIn