Reference for ‘dplyr’ package
The following notes are based on Kevin Markham’s two part tutorial on dplyr.
(Nov 2015 – minor adjustments)
Intro by DataSchool - Part 1
Note: There is a 40-minute video tutorial on YouTube that walks through this document in detail.
Why use dplyr?
- Great for data exploration and transformation
- Intuitive to write and easy to read, especially when using the “chaining” syntax
- Fast on data frames
dplyr functionality
- Five basic verbs:
filter
,select
,arrange
,mutate
,summarise
(plusgroup_by
) - Can work with data stored in databases and data tables
- Joins: inner join, left join, semi-join, anti-join
- Window functions for calculating ranking, offsets, and more
Loading dplyr and an example dataset
- dplyr will mask a few base functions
- If you also use plyr, load plyr first
hflights
is flights departing from two Houston airports in 2011
# load packages
suppressMessages(library(dplyr))
## Warning: package 'dplyr' was built under R version 3.2.2
library(hflights)
## Warning: package 'hflights' was built under R version 3.2.2
# explore data
data(hflights)
head(hflights)
## Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier
## 5424 2011 1 1 6 1400 1500 AA
## 5425 2011 1 2 7 1401 1501 AA
## 5426 2011 1 3 1 1352 1502 AA
## 5427 2011 1 4 2 1403 1513 AA
## 5428 2011 1 5 3 1405 1507 AA
## 5429 2011 1 6 4 1359 1503 AA
## FlightNum TailNum ActualElapsedTime AirTime ArrDelay DepDelay Origin
## 5424 428 N576AA 60 40 -10 0 IAH
## 5425 428 N557AA 60 45 -9 1 IAH
## 5426 428 N541AA 70 48 -8 -8 IAH
## 5427 428 N403AA 70 39 3 3 IAH
## 5428 428 N492AA 62 44 -3 5 IAH
## 5429 428 N262AA 64 45 -7 -1 IAH
## Dest Distance TaxiIn TaxiOut Cancelled CancellationCode Diverted
## 5424 DFW 224 7 13 0 0
## 5425 DFW 224 6 9 0 0
## 5426 DFW 224 5 17 0 0
## 5427 DFW 224 9 22 0 0
## 5428 DFW 224 9 9 0 0
## 5429 DFW 224 6 13 0 0
tbl_df
creates a “local data frame”- Local data frame is simply a wrapper for a data frame that prints nicely
# convert to local data frame
flights <- tbl_df(hflights)
# printing only shows 10 rows and as many columns as can fit on your screen
flights
## Source: local data frame [227,496 x 21]
##
## Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
## 1 2011 1 1 6 1400 1500 AA 428
## 2 2011 1 2 7 1401 1501 AA 428
## 3 2011 1 3 1 1352 1502 AA 428
## 4 2011 1 4 2 1403 1513 AA 428
## 5 2011 1 5 3 1405 1507 AA 428
## 6 2011 1 6 4 1359 1503 AA 428
## 7 2011 1 7 5 1359 1509 AA 428
## 8 2011 1 8 6 1355 1454 AA 428
## 9 2011 1 9 7 1443 1554 AA 428
## 10 2011 1 10 1 1443 1553 AA 428
## .. ... ... ... ... ... ... ... ...
## Variables not shown: TailNum (chr), ActualElapsedTime (int), AirTime
## (int), ArrDelay (int), DepDelay (int), Origin (chr), Dest (chr),
## Distance (int), TaxiIn (int), TaxiOut (int), Cancelled (int),
## CancellationCode (chr), Diverted (int)
# you can specify that you want to see more rows
print(flights, n=20)
# convert to a normal data frame to see all of the columns
data.frame(head(flights))
filter: Keep rows matching criteria
- Base R approach to filtering forces you to repeat the data frame’s name
- dplyr approach is simpler to write and read
- Command structure (for all dplyr verbs):
- first argument is a data frame
- return value is a data frame
- nothing is modified in place
- Note:
dplyr
generally does not preserve row names
# base R approach to view all flights on January 1
flights[flights$Month==1 & flights$DayofMonth==1, ]
# dplyr approach
# note: you can use comma or ampersand to represent AND condition
filter(flights, Month==1, DayofMonth==1)
## Source: local data frame [552 x 21]
##
## Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
## 1 2011 1 1 6 1400 1500 AA 428
## 2 2011 1 1 6 728 840 AA 460
## 3 2011 1 1 6 1631 1736 AA 1121
## 4 2011 1 1 6 1756 2112 AA 1294
## 5 2011 1 1 6 1012 1347 AA 1700
## 6 2011 1 1 6 1211 1325 AA 1820
## 7 2011 1 1 6 557 906 AA 1994
## 8 2011 1 1 6 1824 2106 AS 731
## 9 2011 1 1 6 654 1124 B6 620
## 10 2011 1 1 6 1639 2110 B6 622
## .. ... ... ... ... ... ... ... ...
## Variables not shown: TailNum (chr), ActualElapsedTime (int), AirTime
## (int), ArrDelay (int), DepDelay (int), Origin (chr), Dest (chr),
## Distance (int), TaxiIn (int), TaxiOut (int), Cancelled (int),
## CancellationCode (chr), Diverted (int)
# use pipe for OR condition
filter(flights, UniqueCarrier=="AA" | UniqueCarrier=="UA")
## Source: local data frame [5,316 x 21]
##
## Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
## 1 2011 1 1 6 1400 1500 AA 428
## 2 2011 1 2 7 1401 1501 AA 428
## 3 2011 1 3 1 1352 1502 AA 428
## 4 2011 1 4 2 1403 1513 AA 428
## 5 2011 1 5 3 1405 1507 AA 428
## 6 2011 1 6 4 1359 1503 AA 428
## 7 2011 1 7 5 1359 1509 AA 428
## 8 2011 1 8 6 1355 1454 AA 428
## 9 2011 1 9 7 1443 1554 AA 428
## 10 2011 1 10 1 1443 1553 AA 428
## .. ... ... ... ... ... ... ... ...
## Variables not shown: TailNum (chr), ActualElapsedTime (int), AirTime
## (int), ArrDelay (int), DepDelay (int), Origin (chr), Dest (chr),
## Distance (int), TaxiIn (int), TaxiOut (int), Cancelled (int),
## CancellationCode (chr), Diverted (int)
# you can also use %in% operator
filter(flights, UniqueCarrier %in% c("AA", "UA"))
select: Pick columns by name
- Base R approach is awkward to type and to read
- dplyr approach uses similar syntax to
filter
- Like a SELECT in SQL
# base R approach to select DepTime, ArrTime, and FlightNum columns
flights[, c("DepTime", "ArrTime", "FlightNum")]
# dplyr approach
select(flights, DepTime, ArrTime, FlightNum)
## Source: local data frame [227,496 x 3]
##
## DepTime ArrTime FlightNum
## 1 1400 1500 428
## 2 1401 1501 428
## 3 1352 1502 428
## 4 1403 1513 428
## 5 1405 1507 428
## 6 1359 1503 428
## 7 1359 1509 428
## 8 1355 1454 428
## 9 1443 1554 428
## 10 1443 1553 428
## .. ... ... ...
# use colon to select multiple contiguous columns, and use `contains` to match columns by name
# note: `starts_with`, `ends_with`, and `matches` (for regular expressions) can also be used to match columns by name
select(flights, Year:DayofMonth, contains("Taxi"), contains("Delay"))
## Source: local data frame [227,496 x 7]
##
## Year Month DayofMonth TaxiIn TaxiOut ArrDelay DepDelay
## 1 2011 1 1 7 13 -10 0
## 2 2011 1 2 6 9 -9 1
## 3 2011 1 3 5 17 -8 -8
## 4 2011 1 4 9 22 3 3
## 5 2011 1 5 9 9 -3 5
## 6 2011 1 6 6 13 -7 -1
## 7 2011 1 7 12 15 -1 -1
## 8 2011 1 8 7 12 -16 -5
## 9 2011 1 9 8 22 44 43
## 10 2011 1 10 6 19 43 43
## .. ... ... ... ... ... ... ...
“Chaining” or “Pipelining”
- Usual way to perform multiple operations in one line is by nesting
- Can write commands in a natural order by using the
%>%
infix operator (which can be pronounced as “then”)
# nesting method to select UniqueCarrier and DepDelay columns and filter for delays over 60 minutes
filter(select(flights, UniqueCarrier, DepDelay), DepDelay > 60)
# chaining method
flights %>%
select(UniqueCarrier, DepDelay) %>%
filter(DepDelay > 60)
## Source: local data frame [10,242 x 2]
##
## UniqueCarrier DepDelay
## 1 AA 90
## 2 AA 67
## 3 AA 74
## 4 AA 125
## 5 AA 82
## 6 AA 99
## 7 AA 70
## 8 AA 61
## 9 AA 74
## 10 AS 73
## .. ... ...
- Chaining increases readability significantly when there are many commands
- Operator is automatically imported from the magrittr package
- Can be used to replace nesting in R commands outside of
dplyr
# create two vectors and calculate Euclidian distance between them
x1 <- 1:5; x2 <- 2:6
sqrt(sum((x1-x2)^2))
# chaining method
(x1-x2)^2 %>% sum() %>% sqrt()
## [1] 2.236068
arrange: Reorder rows
# base R approach to select UniqueCarrier and DepDelay columns and sort by DepDelay
flights[order(flights$DepDelay), c("UniqueCarrier", "DepDelay")]
# dplyr approach, default is ascending order
flights %>%
select(UniqueCarrier, DepDelay) %>%
arrange(DepDelay)
## Source: local data frame [227,496 x 2]
##
## UniqueCarrier DepDelay
## 1 OO -33
## 2 MQ -23
## 3 XE -19
## 4 XE -19
## 5 CO -18
## 6 EV -18
## 7 XE -17
## 8 CO -17
## 9 XE -17
## 10 MQ -17
## .. ... ...
# use `desc` for descending
flights %>%
select(UniqueCarrier, DepDelay) %>%
arrange(desc(DepDelay))
mutate: Add new variables
- Create new variables that are functions of existing variables
# base R approach to create a new variable Speed (in mph)
flights$Speed <- flights$Distance / flights$AirTime*60
flights[, c("Distance", "AirTime", "Speed")]
# dplyr approach (prints the new variable but does not store it)
flights %>%
select(Distance, AirTime) %>%
mutate(Speed = Distance/AirTime*60)
## Source: local data frame [227,496 x 3]
##
## Distance AirTime Speed
## 1 224 40 336.0000
## 2 224 45 298.6667
## 3 224 48 280.0000
## 4 224 39 344.6154
## 5 224 44 305.4545
## 6 224 45 298.6667
## 7 224 43 312.5581
## 8 224 40 336.0000
## 9 224 41 327.8049
## 10 224 45 298.6667
## .. ... ... ...
# store the new variable
flights <- flights %>% mutate(Speed = Distance/AirTime*60)
summarise: Reduce variables to values
- Primarily useful with data that has been grouped by one or more variables
group_by
creates the groups that will be operated onsummarise
uses the provided aggregation function to summarise each group
# base R approaches to calculate the average arrival delay to each destination
head(with(flights, tapply(ArrDelay, Dest, mean, na.rm=TRUE)))
head(aggregate(ArrDelay ~ Dest, flights, mean))
# dplyr approach: create a table grouped by Dest, and then summarise each group by taking the mean of ArrDelay
flights %>%
group_by(Dest) %>%
summarise(avg_delay = mean(ArrDelay, na.rm=TRUE))
## Source: local data frame [116 x 2]
##
## Dest avg_delay
## 1 ABQ 7.226259
## 2 AEX 5.839437
## 3 AGS 4.000000
## 4 AMA 6.840095
## 5 ANC 26.080645
## 6 ASE 6.794643
## 7 ATL 8.233251
## 8 AUS 7.448718
## 9 AVL 9.973988
## 10 BFL -13.198807
## .. ... ...
summarise_each
allows you to apply the same summary function to multiple columns at once- Note:
mutate_each
is also available
# for each carrier, calculate the percentage of flights cancelled or diverted
flights %>%
group_by(UniqueCarrier) %>%
summarise_each(funs(mean), Cancelled, Diverted)
## Source: local data frame [15 x 3]
##
## UniqueCarrier Cancelled Diverted
## 1 AA 0.018495684 0.001849568
## 2 AS 0.000000000 0.002739726
## 3 B6 0.025899281 0.005755396
## 4 CO 0.006782614 0.002627370
## 5 DL 0.015903067 0.003029156
## 6 EV 0.034482759 0.003176044
## 7 F9 0.007159905 0.000000000
## 8 FL 0.009817672 0.003272557
## 9 MQ 0.029044750 0.001936317
## 10 OO 0.013946828 0.003486707
## 11 UA 0.016409266 0.002413127
## 12 US 0.011268986 0.001469868
## 13 WN 0.015504047 0.002293629
## 14 XE 0.015495599 0.003449550
## 15 YV 0.012658228 0.000000000
# for each carrier, calculate the minimum and maximum arrival and departure delays
flights %>%
group_by(UniqueCarrier) %>%
summarise_each(funs(min(., na.rm=TRUE), max(., na.rm=TRUE)), matches("Delay"))
## Source: local data frame [15 x 5]
##
## UniqueCarrier ArrDelay_min DepDelay_min ArrDelay_max DepDelay_max
## 1 AA -39 -15 978 970
## 2 AS -43 -15 183 172
## 3 B6 -44 -14 335 310
## 4 CO -55 -18 957 981
## 5 DL -32 -17 701 730
## 6 EV -40 -18 469 479
## 7 F9 -24 -15 277 275
## 8 FL -30 -14 500 507
## 9 MQ -38 -23 918 931
## 10 OO -57 -33 380 360
## 11 UA -47 -11 861 869
## 12 US -42 -17 433 425
## 13 WN -44 -10 499 548
## 14 XE -70 -19 634 628
## 15 YV -32 -11 72 54
- Helper function
n()
counts the number of rows in a group - Helper function
n_distinct(vector)
counts the number of unique items in that vector
# for each day of the year, count the total number of flights and sort in descending order
flights %>%
group_by(Month, DayofMonth) %>%
summarise(flight_count = n()) %>%
arrange(desc(flight_count))
## Source: local data frame [365 x 3]
## Groups: Month
##
## Month DayofMonth flight_count
## 1 1 3 702
## 2 1 2 678
## 3 1 20 663
## 4 1 27 663
## 5 1 13 662
## 6 1 7 661
## 7 1 14 661
## 8 1 21 661
## 9 1 28 661
## 10 1 6 660
## .. ... ... ...
# rewrite more simply with the `tally` function
flights %>%
group_by(Month, DayofMonth) %>%
tally(sort = TRUE)
## Source: local data frame [365 x 3]
## Groups: Month
##
## Month DayofMonth n
## 1 1 3 702
## 2 1 2 678
## 3 1 20 663
## 4 1 27 663
## 5 1 13 662
## 6 1 7 661
## 7 1 14 661
## 8 1 21 661
## 9 1 28 661
## 10 1 6 660
## .. ... ... ...
# for each destination, count the total number of flights and the number of distinct planes that flew there
flights %>%
group_by(Dest) %>%
summarise(flight_count = n(), plane_count = n_distinct(TailNum))
## Source: local data frame [116 x 3]
##
## Dest flight_count plane_count
## 1 ABQ 2812 716
## 2 AEX 724 215
## 3 AGS 1 1
## 4 AMA 1297 158
## 5 ANC 125 38
## 6 ASE 125 60
## 7 ATL 7886 983
## 8 AUS 5022 1015
## 9 AVL 350 142
## 10 BFL 504 70
## .. ... ... ...
- Grouping can sometimes be useful without summarising
# for each destination, show the number of cancelled and not cancelled flights
flights %>%
group_by(Dest) %>%
select(Cancelled) %>%
table() %>%
head()
## Cancelled
## Dest 0 1
## ABQ 2787 25
## AEX 712 12
## AGS 1 0
## AMA 1265 32
## ANC 125 0
## ASE 120 5
Window Functions
- Aggregation function (like
mean
) takes n inputs and returns 1 value - Window function takes n inputs and returns n values
- Includes ranking and ordering functions (like
min_rank
), offset functions (lead
andlag
), and cumulative aggregates (likecummean
).
- Includes ranking and ordering functions (like
# for each carrier, calculate which two days of the year they had their longest departure delays
# note: smallest (not largest) value is ranked as 1, so you have to use `desc` to rank by largest value
flights %>%
group_by(UniqueCarrier) %>%
select(Month, DayofMonth, DepDelay) %>%
filter(min_rank(desc(DepDelay)) <= 2) %>%
arrange(UniqueCarrier, desc(DepDelay))
# rewrite more simply with the `top_n` function
flights %>%
group_by(UniqueCarrier) %>%
select(Month, DayofMonth, DepDelay) %>%
top_n(2) %>%
arrange(UniqueCarrier, desc(DepDelay))
## Selecting by DepDelay
## Source: local data frame [30 x 4]
## Groups: UniqueCarrier
##
## UniqueCarrier Month DayofMonth DepDelay
## 1 AA 12 12 970
## 2 AA 11 19 677
## 3 AS 2 28 172
## 4 AS 7 6 138
## 5 B6 10 29 310
## 6 B6 8 19 283
## 7 CO 8 1 981
## 8 CO 1 20 780
## 9 DL 10 25 730
## 10 DL 4 5 497
## .. ... ... ... ...
# for each month, calculate the number of flights and the change from the previous month
flights %>%
group_by(Month) %>%
summarise(flight_count = n()) %>%
mutate(change = flight_count - lag(flight_count))
## Source: local data frame [12 x 3]
##
## Month flight_count change
## 1 1 18910 NA
## 2 2 17128 -1782
## 3 3 19470 2342
## 4 4 18593 -877
## 5 5 19172 579
## 6 6 19600 428
## 7 7 20548 948
## 8 8 20176 -372
## 9 9 18065 -2111
## 10 10 18696 631
## 11 11 18021 -675
## 12 12 19117 1096
# rewrite more simply with the `tally` function
flights %>%
group_by(Month) %>%
tally() %>%
mutate(change = n - lag(n))
## Source: local data frame [12 x 3]
##
## Month n change
## 1 1 18910 NA
## 2 2 17128 -1782
## 3 3 19470 2342
## 4 4 18593 -877
## 5 5 19172 579
## 6 6 19600 428
## 7 7 20548 948
## 8 8 20176 -372
## 9 9 18065 -2111
## 10 10 18696 631
## 11 11 18021 -675
## 12 12 19117 1096
Other Useful Convenience Functions
# randomly sample a fixed number of rows, without replacement
flights %>% sample_n(5)
## Source: local data frame [5 x 22]
##
## Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
## 1 2011 10 9 7 1457 1910 EV 5412
## 2 2011 2 14 1 1529 1620 XE 2936
## 3 2011 8 28 7 1021 1413 XE 2568
## 4 2011 3 20 7 1512 1619 XE 2308
## 5 2011 6 10 5 1025 1146 XE 2928
## Variables not shown: TailNum (chr), ActualElapsedTime (int), AirTime
## (int), ArrDelay (int), DepDelay (int), Origin (chr), Dest (chr),
## Distance (int), TaxiIn (int), TaxiOut (int), Cancelled (int),
## CancellationCode (chr), Diverted (int), Speed (dbl)
# randomly sample a fraction of rows, with replacement
flights %>% sample_frac(0.25, replace=TRUE)
## Source: local data frame [56,874 x 22]
##
## Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
## 1 2011 11 11 5 1004 1148 XE 4234
## 2 2011 8 18 4 508 759 FL 294
## 3 2011 6 22 3 1326 1433 CO 379
## 4 2011 12 14 3 1440 1541 CO 1083
## 5 2011 9 16 5 1301 1520 OO 2018
## 6 2011 10 24 1 1711 1807 WN 3869
## 7 2011 8 11 4 1914 2004 XE 2418
## 8 2011 9 28 3 1706 1905 XE 2404
## 9 2011 9 14 3 1320 1443 WN 913
## 10 2011 10 14 5 1009 1138 XE 4419
## .. ... ... ... ... ... ... ... ...
## Variables not shown: TailNum (chr), ActualElapsedTime (int), AirTime
## (int), ArrDelay (int), DepDelay (int), Origin (chr), Dest (chr),
## Distance (int), TaxiIn (int), TaxiOut (int), Cancelled (int),
## CancellationCode (chr), Diverted (int), Speed (dbl)
# base R approach to view the structure of an object
str(flights)
## Classes 'tbl_df', 'tbl' and 'data.frame': 227496 obs. of 22 variables:
## $ Year : int 2011 2011 2011 2011 2011 2011 2011 2011 2011 2011 ...
## $ Month : int 1 1 1 1 1 1 1 1 1 1 ...
## $ DayofMonth : int 1 2 3 4 5 6 7 8 9 10 ...
## $ DayOfWeek : int 6 7 1 2 3 4 5 6 7 1 ...
## $ DepTime : int 1400 1401 1352 1403 1405 1359 1359 1355 1443 1443 ...
## $ ArrTime : int 1500 1501 1502 1513 1507 1503 1509 1454 1554 1553 ...
## $ UniqueCarrier : chr "AA" "AA" "AA" "AA" ...
## $ FlightNum : int 428 428 428 428 428 428 428 428 428 428 ...
## $ TailNum : chr "N576AA" "N557AA" "N541AA" "N403AA" ...
## $ ActualElapsedTime: int 60 60 70 70 62 64 70 59 71 70 ...
## $ AirTime : int 40 45 48 39 44 45 43 40 41 45 ...
## $ ArrDelay : int -10 -9 -8 3 -3 -7 -1 -16 44 43 ...
## $ DepDelay : int 0 1 -8 3 5 -1 -1 -5 43 43 ...
## $ Origin : chr "IAH" "IAH" "IAH" "IAH" ...
## $ Dest : chr "DFW" "DFW" "DFW" "DFW" ...
## $ Distance : int 224 224 224 224 224 224 224 224 224 224 ...
## $ TaxiIn : int 7 6 5 9 9 6 12 7 8 6 ...
## $ TaxiOut : int 13 9 17 22 9 13 15 12 22 19 ...
## $ Cancelled : int 0 0 0 0 0 0 0 0 0 0 ...
## $ CancellationCode : chr "" "" "" "" ...
## $ Diverted : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Speed : num 336 299 280 345 305 ...
# dplyr approach: better formatting, and adapts to your screen width
glimpse(flights)
## Observations: 227496
## Variables:
## $ Year (int) 2011, 2011, 2011, 2011, 2011, 2011, 2011, 20...
## $ Month (int) 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ DayofMonth (int) 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1...
## $ DayOfWeek (int) 6, 7, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6,...
## $ DepTime (int) 1400, 1401, 1352, 1403, 1405, 1359, 1359, 13...
## $ ArrTime (int) 1500, 1501, 1502, 1513, 1507, 1503, 1509, 14...
## $ UniqueCarrier (chr) "AA", "AA", "AA", "AA", "AA", "AA", "AA", "A...
## $ FlightNum (int) 428, 428, 428, 428, 428, 428, 428, 428, 428,...
## $ TailNum (chr) "N576AA", "N557AA", "N541AA", "N403AA", "N49...
## $ ActualElapsedTime (int) 60, 60, 70, 70, 62, 64, 70, 59, 71, 70, 70, ...
## $ AirTime (int) 40, 45, 48, 39, 44, 45, 43, 40, 41, 45, 42, ...
## $ ArrDelay (int) -10, -9, -8, 3, -3, -7, -1, -16, 44, 43, 29,...
## $ DepDelay (int) 0, 1, -8, 3, 5, -1, -1, -5, 43, 43, 29, 19, ...
## $ Origin (chr) "IAH", "IAH", "IAH", "IAH", "IAH", "IAH", "I...
## $ Dest (chr) "DFW", "DFW", "DFW", "DFW", "DFW", "DFW", "D...
## $ Distance (int) 224, 224, 224, 224, 224, 224, 224, 224, 224,...
## $ TaxiIn (int) 7, 6, 5, 9, 9, 6, 12, 7, 8, 6, 8, 4, 6, 5, 6...
## $ TaxiOut (int) 13, 9, 17, 22, 9, 13, 15, 12, 22, 19, 20, 11...
## $ Cancelled (int) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ CancellationCode (chr) "", "", "", "", "", "", "", "", "", "", "", ...
## $ Diverted (int) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Speed (dbl) 336.0000, 298.6667, 280.0000, 344.6154, 305....
Intro by DataSchool - Part 2
Since Part 1, there have been two significant updates to dplyr (0.3 and 0.4), introducing a ton of new features.
This document (created in March 2015) covers the most useful new features in 0.3 and 0.4, as well as other functionality not covered last time. The new video tutorial walks through the code below in detail.
Loading dplyr and the nycflights13 dataset
Hadley Wickham has rewritten the dplyr vignettes to use the nycflights13 package instead, and so we are also using nycflights13 for the sake of consistency.
# remove flights data if you just finished my previous tutorial
rm(flights)
# load packages
suppressMessages(library(dplyr))
library(nycflights13)
## Warning: package 'nycflights13' was built under R version 3.2.2
# print the flights dataset from nycflights13
flights
## Source: local data frame [336,776 x 16]
##
## year month day dep_time dep_delay arr_time arr_delay carrier tailnum
## 1 2013 1 1 517 2 830 11 UA N14228
## 2 2013 1 1 533 4 850 20 UA N24211
## 3 2013 1 1 542 2 923 33 AA N619AA
## 4 2013 1 1 544 -1 1004 -18 B6 N804JB
## 5 2013 1 1 554 -6 812 -25 DL N668DN
## 6 2013 1 1 554 -4 740 12 UA N39463
## 7 2013 1 1 555 -5 913 19 B6 N516JB
## 8 2013 1 1 557 -3 709 -14 EV N829AS
## 9 2013 1 1 557 -3 838 -8 B6 N593JB
## 10 2013 1 1 558 -2 753 8 AA N3ALAA
## .. ... ... ... ... ... ... ... ... ...
## Variables not shown: flight (int), origin (chr), dest (chr), air_time
## (dbl), distance (dbl), hour (dbl), minute (dbl)
Choosing columns: select, rename
# besides just using select() to pick columns...
flights %>% select(carrier, flight)
## Source: local data frame [336,776 x 2]
##
## carrier flight
## 1 UA 1545
## 2 UA 1714
## 3 AA 1141
## 4 B6 725
## 5 DL 461
## 6 UA 1696
## 7 B6 507
## 8 EV 5708
## 9 B6 79
## 10 AA 301
## .. ... ...
# ...you can use the minus sign to hide columns
flights %>% select(-month, -day)
## Source: local data frame [336,776 x 14]
##
## year dep_time dep_delay arr_time arr_delay carrier tailnum flight
## 1 2013 517 2 830 11 UA N14228 1545
## 2 2013 533 4 850 20 UA N24211 1714
## 3 2013 542 2 923 33 AA N619AA 1141
## 4 2013 544 -1 1004 -18 B6 N804JB 725
## 5 2013 554 -6 812 -25 DL N668DN 461
## 6 2013 554 -4 740 12 UA N39463 1696
## 7 2013 555 -5 913 19 B6 N516JB 507
## 8 2013 557 -3 709 -14 EV N829AS 5708
## 9 2013 557 -3 838 -8 B6 N593JB 79
## 10 2013 558 -2 753 8 AA N3ALAA 301
## .. ... ... ... ... ... ... ... ...
## Variables not shown: origin (chr), dest (chr), air_time (dbl), distance
## (dbl), hour (dbl), minute (dbl)
# hide a range of columns
flights %>% select(-(dep_time:arr_delay))
# hide any column with a matching name
flights %>% select(-contains("time"))
# pick columns using a character vector of column names
cols <- c("carrier", "flight", "tailnum")
flights %>% select(one_of(cols))
## Source: local data frame [336,776 x 3]
##
## carrier flight tailnum
## 1 UA 1545 N14228
## 2 UA 1714 N24211
## 3 AA 1141 N619AA
## 4 B6 725 N804JB
## 5 DL 461 N668DN
## 6 UA 1696 N39463
## 7 B6 507 N516JB
## 8 EV 5708 N829AS
## 9 B6 79 N593JB
## 10 AA 301 N3ALAA
## .. ... ... ...
# select() can be used to rename columns, though all columns not mentioned are dropped
flights %>% select(tail = tailnum)
## Source: local data frame [336,776 x 1]
##
## tail
## 1 N14228
## 2 N24211
## 3 N619AA
## 4 N804JB
## 5 N668DN
## 6 N39463
## 7 N516JB
## 8 N829AS
## 9 N593JB
## 10 N3ALAA
## .. ...
# rename() does the same thing, except all columns not mentioned are kept
flights %>% rename(tail = tailnum)
## Source: local data frame [336,776 x 16]
##
## year month day dep_time dep_delay arr_time arr_delay carrier tail
## 1 2013 1 1 517 2 830 11 UA N14228
## 2 2013 1 1 533 4 850 20 UA N24211
## 3 2013 1 1 542 2 923 33 AA N619AA
## 4 2013 1 1 544 -1 1004 -18 B6 N804JB
## 5 2013 1 1 554 -6 812 -25 DL N668DN
## 6 2013 1 1 554 -4 740 12 UA N39463
## 7 2013 1 1 555 -5 913 19 B6 N516JB
## 8 2013 1 1 557 -3 709 -14 EV N829AS
## 9 2013 1 1 557 -3 838 -8 B6 N593JB
## 10 2013 1 1 558 -2 753 8 AA N3ALAA
## .. ... ... ... ... ... ... ... ... ...
## Variables not shown: flight (int), origin (chr), dest (chr), air_time
## (dbl), distance (dbl), hour (dbl), minute (dbl)
Choosing rows: filter, between, slice, sample_n, top_n, distinct
# filter() supports the use of multiple conditions
flights %>% filter(dep_time >= 600, dep_time <= 605)
## Source: local data frame [2,460 x 16]
##
## year month day dep_time dep_delay arr_time arr_delay carrier tailnum
## 1 2013 1 1 600 0 851 -7 B6 N595JB
## 2 2013 1 1 600 0 837 12 MQ N542MQ
## 3 2013 1 1 601 1 844 -6 B6 N644JB
## 4 2013 1 1 602 -8 812 -8 DL N971DL
## 5 2013 1 1 602 -3 821 16 MQ N730MQ
## 6 2013 1 2 600 0 814 25 EV N13914
## 7 2013 1 2 600 -5 751 -27 EV N760EV
## 8 2013 1 2 600 0 819 4 9E N8946A
## 9 2013 1 2 600 0 846 0 B6 N529JB
## 10 2013 1 2 600 0 737 12 WN N8311Q
## .. ... ... ... ... ... ... ... ... ...
## Variables not shown: flight (int), origin (chr), dest (chr), air_time
## (dbl), distance (dbl), hour (dbl), minute (dbl)
# between() is a concise alternative for determing if numeric values fall in a range
flights %>% filter(between(dep_time, 600, 605)) # inclusive of endpoints
# side note: is.na() can also be useful when filtering
flights %>% filter(!is.na(dep_time))
# slice() filters rows by position
flights %>% slice(1000:1005) # inclusive of endpoints
## Source: local data frame [6 x 16]
##
## year month day dep_time dep_delay arr_time arr_delay carrier tailnum
## 1 2013 1 2 809 -1 950 2 B6 N304JB
## 2 2013 1 2 810 10 1008 -6 DL N358NW
## 3 2013 1 2 811 -4 1100 4 DL N328NW
## 4 2013 1 2 811 -4 1126 -5 DL N305DQ
## 5 2013 1 2 811 -9 944 -11 MQ N509MQ
## 6 2013 1 2 815 0 1109 -19 DL N335NW
## Variables not shown: flight (int), origin (chr), dest (chr), air_time
## (dbl), distance (dbl), hour (dbl), minute (dbl)
# keep the first three rows within each group
flights %>% group_by(month, day) %>% slice(1:3)
## Source: local data frame [1,095 x 16]
## Groups: month, day
##
## year month day dep_time dep_delay arr_time arr_delay carrier tailnum
## 1 2013 1 1 517 2 830 11 UA N14228
## 2 2013 1 1 533 4 850 20 UA N24211
## 3 2013 1 1 542 2 923 33 AA N619AA
## 4 2013 1 2 42 43 518 36 B6 N580JB
## 5 2013 1 2 126 156 233 154 B6 N636JB
## 6 2013 1 2 458 -2 703 13 US N162UW
## 7 2013 1 3 32 33 504 22 B6 N763JB
## 8 2013 1 3 50 185 203 172 B6 N329JB
## 9 2013 1 3 235 156 700 143 B6 N618JB
## 10 2013 1 4 25 26 505 23 B6 N554JB
## .. ... ... ... ... ... ... ... ... ...
## Variables not shown: flight (int), origin (chr), dest (chr), air_time
## (dbl), distance (dbl), hour (dbl), minute (dbl)
# sample three rows from each group
flights %>% group_by(month, day) %>% sample_n(3)
## Source: local data frame [1,095 x 16]
## Groups: month, day
##
## year month day dep_time dep_delay arr_time arr_delay carrier tailnum
## 1 2013 1 1 559 0 702 -4 B6 N708JB
## 2 2013 1 1 1911 1 2050 -5 MQ N737MQ
## 3 2013 1 1 820 0 1249 -40 DL N900PC
## 4 2013 1 2 1020 -3 1242 -10 FL N997AT
## 5 2013 1 2 1004 19 1251 -7 B6 N706JB
## 6 2013 1 2 1335 13 1414 -2 EV N21537
## 7 2013 1 3 1856 -4 1959 -19 EV N834AS
## 8 2013 1 3 1142 2 1455 10 AA N3BNAA
## 9 2013 1 3 851 111 1206 111 AA N3EWAA
## 10 2013 1 4 1808 18 2103 18 B6 N588JB
## .. ... ... ... ... ... ... ... ... ...
## Variables not shown: flight (int), origin (chr), dest (chr), air_time
## (dbl), distance (dbl), hour (dbl), minute (dbl)
# keep three rows from each group with the top dep_delay
flights %>% group_by(month, day) %>% top_n(3, dep_delay)
## Source: local data frame [1,108 x 16]
## Groups: month, day
##
## year month day dep_time dep_delay arr_time arr_delay carrier tailnum
## 1 2013 1 1 848 853 1001 851 MQ N942MQ
## 2 2013 1 1 1815 290 2120 338 EV N17185
## 3 2013 1 1 2343 379 314 456 EV N21197
## 4 2013 1 2 1412 334 1710 323 UA N474UA
## 5 2013 1 2 1607 337 2003 368 AA N324AA
## 6 2013 1 2 2131 379 2340 359 UA N593UA
## 7 2013 1 3 2008 268 2339 270 DL N338NW
## 8 2013 1 3 2012 252 2314 257 B6 N558JB
## 9 2013 1 3 2056 291 2239 285 9E N928XJ
## 10 2013 1 4 2058 208 2 172 B6 N523JB
## .. ... ... ... ... ... ... ... ... ...
## Variables not shown: flight (int), origin (chr), dest (chr), air_time
## (dbl), distance (dbl), hour (dbl), minute (dbl)
# also sort by dep_delay within each group
flights %>% group_by(month, day) %>% top_n(3, dep_delay) %>% arrange(desc(dep_delay))
## Source: local data frame [1,108 x 16]
## Groups: month, day
##
## year month day dep_time dep_delay arr_time arr_delay carrier tailnum
## 1 2013 1 1 848 853 1001 851 MQ N942MQ
## 2 2013 1 1 2343 379 314 456 EV N21197
## 3 2013 1 1 1815 290 2120 338 EV N17185
## 4 2013 1 2 2131 379 2340 359 UA N593UA
## 5 2013 1 2 1607 337 2003 368 AA N324AA
## 6 2013 1 2 1412 334 1710 323 UA N474UA
## 7 2013 1 3 2056 291 2239 285 9E N928XJ
## 8 2013 1 3 2008 268 2339 270 DL N338NW
## 9 2013 1 3 2012 252 2314 257 B6 N558JB
## 10 2013 1 4 2123 288 2332 276 EV N29917
## .. ... ... ... ... ... ... ... ... ...
## Variables not shown: flight (int), origin (chr), dest (chr), air_time
## (dbl), distance (dbl), hour (dbl), minute (dbl)
# unique rows can be identified using unique() from base R
flights %>% select(origin, dest) %>% unique()
## Source: local data frame [224 x 2]
##
## origin dest
## 1 EWR IAH
## 2 LGA IAH
## 3 JFK MIA
## 4 JFK BQN
## 5 LGA ATL
## 6 EWR ORD
## 7 EWR FLL
## 8 LGA IAD
## 9 JFK MCO
## 10 LGA ORD
## .. ... ...
# dplyr provides an alternative that is more "efficient"
flights %>% select(origin, dest) %>% distinct()
# side note: when chaining, you don't have to include the parentheses if there are no arguments
flights %>% select(origin, dest) %>% distinct
Adding new variables: mutate, transmute, add_rownames
# mutate() creates a new variable (and keeps all existing variables)
flights %>% mutate(speed = distance/air_time*60)
## Source: local data frame [336,776 x 17]
##
## year month day dep_time dep_delay arr_time arr_delay carrier tailnum
## 1 2013 1 1 517 2 830 11 UA N14228
## 2 2013 1 1 533 4 850 20 UA N24211
## 3 2013 1 1 542 2 923 33 AA N619AA
## 4 2013 1 1 544 -1 1004 -18 B6 N804JB
## 5 2013 1 1 554 -6 812 -25 DL N668DN
## 6 2013 1 1 554 -4 740 12 UA N39463
## 7 2013 1 1 555 -5 913 19 B6 N516JB
## 8 2013 1 1 557 -3 709 -14 EV N829AS
## 9 2013 1 1 557 -3 838 -8 B6 N593JB
## 10 2013 1 1 558 -2 753 8 AA N3ALAA
## .. ... ... ... ... ... ... ... ... ...
## Variables not shown: flight (int), origin (chr), dest (chr), air_time
## (dbl), distance (dbl), hour (dbl), minute (dbl), speed (dbl)
# transmute() only keeps the new variables
flights %>% transmute(speed = distance/air_time*60)
## Source: local data frame [336,776 x 1]
##
## speed
## 1 370.0441
## 2 374.2731
## 3 408.3750
## 4 516.7213
## 5 394.1379
## 6 287.6000
## 7 404.4304
## 8 259.2453
## 9 404.5714
## 10 318.6957
## .. ...
# example data frame with row names
mtcars %>% head()
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
# add_rownames() turns row names into an explicit variable
mtcars %>% add_rownames("model") %>% head()
## Source: local data frame [6 x 12]
##
## model mpg cyl disp hp drat wt qsec vs am gear carb
## 1 Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## 2 Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## 3 Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## 4 Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## 5 Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## 6 Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
# side note: dplyr no longer prints row names (ever) for local data frames
mtcars %>% tbl_df()
## Source: local data frame [32 x 11]
##
## mpg cyl disp hp drat wt qsec vs am gear carb
## 1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## 2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## 3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## 4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## 5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## 6 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## 7 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## 8 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## 9 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## 10 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## .. ... ... ... ... ... ... ... .. .. ... ...
Grouping and counting: summarise, tally, count, group_size, n_groups, ungroup
# summarise() can be used to count the number of rows in each group
flights %>% group_by(month) %>% summarise(cnt = n())
## Source: local data frame [12 x 2]
##
## month cnt
## 1 1 27004
## 2 2 24951
## 3 3 28834
## 4 4 28330
## 5 5 28796
## 6 6 28243
## 7 7 29425
## 8 8 29327
## 9 9 27574
## 10 10 28889
## 11 11 27268
## 12 12 28135
# tally() and count() can do this more concisely
flights %>% group_by(month) %>% tally()
flights %>% count(month)
# you can sort by the count
flights %>% group_by(month) %>% summarise(cnt = n()) %>% arrange(desc(cnt))
## Source: local data frame [12 x 2]
##
## month cnt
## 1 7 29425
## 2 8 29327
## 3 10 28889
## 4 3 28834
## 5 5 28796
## 6 4 28330
## 7 6 28243
## 8 12 28135
## 9 9 27574
## 10 11 27268
## 11 1 27004
## 12 2 24951
# tally() and count() have a sort parameter for this purpose
flights %>% group_by(month) %>% tally(sort=TRUE)
flights %>% count(month, sort=TRUE)
# you can sum over a specific variable instead of simply counting rows
flights %>% group_by(month) %>% summarise(dist = sum(distance))
## Source: local data frame [12 x 2]
##
## month dist
## 1 1 27188805
## 2 2 24975509
## 3 3 29179636
## 4 4 29427294
## 5 5 29974128
## 6 6 29856388
## 7 7 31149199
## 8 8 31149334
## 9 9 28711426
## 10 10 30012086
## 11 11 28639718
## 12 12 29954084
# tally() and count() have a wt parameter for this purpose
flights %>% group_by(month) %>% tally(wt = distance)
flights %>% count(month, wt = distance)
# group_size() returns the counts as a vector
flights %>% group_by(month) %>% group_size()
## [1] 27004 24951 28834 28330 28796 28243 29425 29327 27574 28889 27268
## [12] 28135
# n_groups() simply reports the number of groups
flights %>% group_by(month) %>% n_groups()
## [1] 12
# group by two variables, summarise, arrange (output is possibly confusing)
flights %>% group_by(month, day) %>% summarise(cnt = n()) %>% arrange(desc(cnt)) %>% print(n = 40)
## Source: local data frame [365 x 3]
## Groups: month
##
## month day cnt
## 1 1 2 943
## 2 1 7 933
## 3 1 10 932
## 4 1 11 930
## 5 1 14 928
## 6 1 31 928
## 7 1 17 927
## 8 1 24 925
## 9 1 18 924
## 10 1 28 923
## 11 1 25 922
## 12 1 4 915
## 13 1 3 914
## 14 1 21 912
## 15 1 9 902
## 16 1 16 901
## 17 1 30 900
## 18 1 8 899
## 19 1 23 897
## 20 1 15 894
## 21 1 22 890
## 22 1 29 890
## 23 1 1 842
## 24 1 6 832
## 25 1 13 828
## 26 1 27 823
## 27 1 20 786
## 28 1 5 720
## 29 1 12 690
## 30 1 26 680
## 31 1 19 674
## 32 2 28 964
## 33 2 21 961
## 34 2 25 961
## 35 2 22 957
## 36 2 14 956
## 37 2 15 954
## 38 2 20 949
## 39 2 18 948
## 40 2 27 945
## .. ... ... ...
# ungroup() before arranging to arrange across all groups
flights %>% group_by(month, day) %>% summarise(cnt = n()) %>% ungroup() %>% arrange(desc(cnt))
## Source: local data frame [365 x 3]
##
## month day cnt
## 1 11 27 1014
## 2 7 11 1006
## 3 7 8 1004
## 4 7 10 1004
## 5 12 2 1004
## 6 7 18 1003
## 7 7 25 1003
## 8 7 12 1002
## 9 7 9 1001
## 10 7 17 1001
## .. ... ... ...
Creating data frames: data_frame
data_frame()
is a better way than data.frame()
for creating data frames. Benefits of data_frame()
:
- You can use previously defined columns to compute new columns.
- It never coerces column types.
- It never munges column names.
- It never adds row names.
- It only recycles length 1 input.
- It returns a local data frame (a tbl_df).
# data_frame() example
data_frame(a = 1:6, b = a*2, c = 'string', 'd+e' = 1) %>% glimpse()
## Observations: 6
## Variables:
## $ a (int) 1, 2, 3, 4, 5, 6
## $ b (dbl) 2, 4, 6, 8, 10, 12
## $ c (chr) "string", "string", "string", "string", "string", "string"
## $ d+e (dbl) 1, 1, 1, 1, 1, 1
# data.frame() example
data.frame(a = 1:6, c = 'string', 'd+e' = 1) %>% glimpse()
## Observations: 6
## Variables:
## $ a (int) 1, 2, 3, 4, 5, 6
## $ c (fctr) string, string, string, string, string, string
## $ d.e (dbl) 1, 1, 1, 1, 1, 1
Joining (merging) tables: left_join, right_join, inner_join, full_join, semi_join, anti_join
# create two simple data frames
(a <- data_frame(color = c("green","yellow","red"), num = 1:3))
## Source: local data frame [3 x 2]
##
## color num
## 1 green 1
## 2 yellow 2
## 3 red 3
(b <- data_frame(color = c("green","yellow","pink"), size = c("S","M","L")))
## Source: local data frame [3 x 2]
##
## color size
## 1 green S
## 2 yellow M
## 3 pink L
# only include observations found in both "a" and "b" (automatically joins on variables that appear in both tables)
inner_join(a, b)
## Joining by: "color"
## Source: local data frame [2 x 3]
##
## color num size
## 1 green 1 S
## 2 yellow 2 M
# include observations found in either "a" or "b"
full_join(a, b)
## Joining by: "color"
## Source: local data frame [4 x 3]
##
## color num size
## 1 green 1 S
## 2 yellow 2 M
## 3 red 3 NA
## 4 pink NA L
# include all observations found in "a"
left_join(a, b)
## Joining by: "color"
## Source: local data frame [3 x 3]
##
## color num size
## 1 green 1 S
## 2 yellow 2 M
## 3 red 3 NA
# include all observations found in "b"
right_join(a, b)
## Joining by: "color"
## Source: local data frame [3 x 3]
##
## color num size
## 1 green 1 S
## 2 yellow 2 M
## 3 pink NA L
# right_join(a, b) is identical to left_join(b, a) except for column ordering
left_join(b, a)
## Joining by: "color"
## Source: local data frame [3 x 3]
##
## color size num
## 1 green S 1
## 2 yellow M 2
## 3 pink L NA
# filter "a" to only show observations that match "b"
semi_join(a, b)
## Joining by: "color"
## Source: local data frame [2 x 2]
##
## color num
## 1 green 1
## 2 yellow 2
# filter "a" to only show observations that don't match "b"
anti_join(a, b)
## Joining by: "color"
## Source: local data frame [1 x 2]
##
## color num
## 1 red 3
# sometimes matching variables don't have identical names
b <- b %>% rename(col = color)
# specify that the join should occur by matching "color" in "a" with "col" in "b"
inner_join(a, b, by=c("color" = "col"))
## Source: local data frame [2 x 3]
##
## color num size
## 1 green 1 S
## 2 yellow 2 M
Viewing more output: print, View
# specify that you want to see more rows
flights %>% print(n = 15)
## Source: local data frame [336,776 x 16]
##
## year month day dep_time dep_delay arr_time arr_delay carrier tailnum
## 1 2013 1 1 517 2 830 11 UA N14228
## 2 2013 1 1 533 4 850 20 UA N24211
## 3 2013 1 1 542 2 923 33 AA N619AA
## 4 2013 1 1 544 -1 1004 -18 B6 N804JB
## 5 2013 1 1 554 -6 812 -25 DL N668DN
## 6 2013 1 1 554 -4 740 12 UA N39463
## 7 2013 1 1 555 -5 913 19 B6 N516JB
## 8 2013 1 1 557 -3 709 -14 EV N829AS
## 9 2013 1 1 557 -3 838 -8 B6 N593JB
## 10 2013 1 1 558 -2 753 8 AA N3ALAA
## 11 2013 1 1 558 -2 849 -2 B6 N793JB
## 12 2013 1 1 558 -2 853 -3 B6 N657JB
## 13 2013 1 1 558 -2 924 7 UA N29129
## 14 2013 1 1 558 -2 923 -14 UA N53441
## 15 2013 1 1 559 -1 941 31 AA N3DUAA
## .. ... ... ... ... ... ... ... ... ...
## Variables not shown: flight (int), origin (chr), dest (chr), air_time
## (dbl), distance (dbl), hour (dbl), minute (dbl)
# specify that you want to see ALL rows (don't run this!)
flights %>% print(n = Inf)
# specify that you want to see all columns
flights %>% print(width = Inf)
## Source: local data frame [336,776 x 16]
##
## year month day dep_time dep_delay arr_time arr_delay carrier tailnum
## 1 2013 1 1 517 2 830 11 UA N14228
## 2 2013 1 1 533 4 850 20 UA N24211
## 3 2013 1 1 542 2 923 33 AA N619AA
## 4 2013 1 1 544 -1 1004 -18 B6 N804JB
## 5 2013 1 1 554 -6 812 -25 DL N668DN
## 6 2013 1 1 554 -4 740 12 UA N39463
## 7 2013 1 1 555 -5 913 19 B6 N516JB
## 8 2013 1 1 557 -3 709 -14 EV N829AS
## 9 2013 1 1 557 -3 838 -8 B6 N593JB
## 10 2013 1 1 558 -2 753 8 AA N3ALAA
## .. ... ... ... ... ... ... ... ... ...
## flight origin dest air_time distance hour minute
## 1 1545 EWR IAH 227 1400 5 17
## 2 1714 LGA IAH 227 1416 5 33
## 3 1141 JFK MIA 160 1089 5 42
## 4 725 JFK BQN 183 1576 5 44
## 5 461 LGA ATL 116 762 5 54
## 6 1696 EWR ORD 150 719 5 54
## 7 507 EWR FLL 158 1065 5 55
## 8 5708 LGA IAD 53 229 5 57
## 9 79 JFK MCO 140 944 5 57
## 10 301 LGA ORD 138 733 5 58
## .. ... ... ... ... ... ... ...
# show up to 1000 rows and all columns
flights %>% View()
# set option to see all columns and fewer rows
options(dplyr.width = Inf, dplyr.print_min = 6)
# reset options (or just close R)
options(dplyr.width = NULL, dplyr.print_min = 10)
Connecting to Databases
- dplyr can connect to a database as if the data was loaded into a data frame
- Use the same syntax for local data frames and databases
- Only generates SELECT statements
- Currently supports SQLite, PostgreSQL/Redshift, MySQL/MariaDB, BigQuery, MonetDB
- Example below is based upon an SQLite database containing the hflights data
- Instructions for creating this database are in the databases vignette
# connect to an SQLite database containing the hflights data
my_db <- src_sqlite("my_db.sqlite3")
# connect to the "hflights" table in that database
flights_tbl <- tbl(my_db, "hflights")
# example query with our data frame
flights %>%
select(UniqueCarrier, DepDelay) %>%
arrange(desc(DepDelay))
# identical query using the database
flights_tbl %>%
select(UniqueCarrier, DepDelay) %>%
arrange(desc(DepDelay))
- You can write the SQL commands yourself
- dplyr can tell you the SQL it plans to run and the query execution plan
# send SQL commands to the database
tbl(my_db, sql("SELECT * FROM hflights LIMIT 100"))
# ask dplyr for the SQL commands
flights_tbl %>%
select(UniqueCarrier, DepDelay) %>%
arrange(desc(DepDelay)) %>%
explain()
Resources
- Official dplyr reference manual and vignettes on CRAN: vignettes are well-written and cover many aspects of dplyr
- Two-table vignette covering joins and set operations
- RStudio’s Data Wrangling Cheat Sheet for dplyr and tidyr
- dplyr tutorial by Hadley Wickham at the useR! 2014 conference: excellent, in-depth tutorial with lots of example code (Dropbox link includes slides, code files, and data files)
- dplyr GitHub repo and list of releases