Docsity
Docsity

Prepare for your exams
Prepare for your exams

Study with the several resources on Docsity


Earn points to download
Earn points to download

Earn points by helping other students or get them with a premium plan


Guidelines and tips
Guidelines and tips

Data Transformation with Dplyr Cheat Sheet, Cheat Sheet of Data Structures and Algorithms

Cheat sheet on Data Transformation with Dplyr: Manipulate Cases and Variables, Vector Functions, Row Names.

Typology: Cheat Sheet

2019/2020

Uploaded on 10/09/2020

nicoth
nicoth 🇺🇸

4.3

(20)

262 documents

1 / 2

Toggle sidebar

This page cannot be seen from the preview

Don't miss anything!

bg1
w
Summarise Cases
group_by(.data, ..., add =
FALSE)
Returns copy of table !
grouped by …
g_iris <- group_by(iris, Species)
ungroup(x, …)
Returns ungrouped copy !
of table.
ungroup(g_iris)
w
w
w
w
w
w
w
Use group_by() to create a "grouped" copy of a table. !
dplyr functions will manipulate each "group" separately and
then combine the results.
mtcars %>%
group_by(cyl) %>%
summarise(avg = mean(mpg))
These apply summary functions to columns to create a new
table of summary statistics. Summary functions take vectors as
input and return one value (see back).
VARIATIONS
summarise_all() - Apply funs to every column.
summarise_at() - Apply funs to specific columns.
summarise_if() - Apply funs to all cols of one type.
w
w
w
w
w
w
summarise(.data, …)!
Compute table of summaries. !
summarise(mtcars, avg = mean(mpg))
count(x, ..., wt = NULL, sort = FALSE)!
Count number of rows in each group defined
by the variables in … Also tally().!
count(iris, Species)
RStudio® is a trademark of RStudio, Inc. • CC BY SA RStudio • info@rstudio.com • 844-448-1212 • rstudio.com • Learn more with browseVignettes(package = c("dplyr", "tibble")) • dplyr 0.7.0 • tibble 1.2.0 • Updated: 2019-08
Each observation, or
case, is in its own row
Each variable is in
its own column
&
dplyr functions work with pipes and expect tidy data. In tidy data:
pipes
x %>% f(y)
becomes f(x, y) filter(.data, …) Extract rows that meet logical
criteria. filter(iris, Sepal.Length > 7)
distinct(.data, ..., .keep_all = FALSE) Remove
rows with duplicate values. !
distinct(iris, Species)
sample_frac(tbl, size = 1, replace = FALSE,
weight = NULL, .env = parent.frame()) Randomly
select fraction of rows. !
sample_frac(iris, 0.5, replace = TRUE)
sample_n(tbl, size, replace = FALSE, weight =
NULL, .env = parent.frame()) Randomly select
size rows. sample_n(iris, 10, replace = TRUE)
slice(.data, …) Select rows by position.
slice(iris, 10:15)
top_n(x, n, wt) Select and order top n entries (by
group if grouped data). top_n(iris, 5, Sepal.Width)
Row functions return a subset of rows as a new table.
See ?base::Logic and ?Comparison for help.
>
>=
!is.na()
!
&
<
<=
is.na()
%in%
|
xor()
arrange(.data, …) Order rows by values of a
column or columns (low to high), use with
desc() to order from high to low.
arrange(mtcars, mpg)
arrange(mtcars, desc(mpg))
add_row(.data, ..., .before = NULL, .after = NULL)
Add one or more rows to a table.
add_row(faithful, eruptions = 1, waiting = 1)
Group Cases
Manipulate Cases
EXTRACT VARIABLES
ADD CASES
ARRANGE CASES
Logical and boolean operators to use with filter()
Column functions return a set of columns as a new vector or table.
contains(match)
ends_with(match)
matches(match)
:, e.g. mpg:cyl
-, e.g, -Species
num_range(prefix, range)
one_of()
starts_with(match)
pull(.data, var = -1) Extract column values as
a vector. Choose by name or index.
pull(iris, Sepal.Length)
Manipulate Variables
Use these helpers with select (),
e.g. select(iris, starts_with("Sepal"))
These apply vectorized functions to columns. Vectorized funs take
vectors as input and return vectors of the same length as output
(see back).
mutate(.data, …) !
Compute new column(s).
mutate(mtcars, gpm = 1/mpg)
transmute(.data, …)!
Compute new column(s), drop others.
transmute(mtcars, gpm = 1/mpg)
mutate_all(.tbl, .funs, …) Apply funs to every
column. Use with funs(). Also mutate_if().!
mutate_all(faithful, funs(log(.), log2(.)))
mutate_if(iris, is.numeric, funs(log(.)))
mutate_at(.tbl, .cols, .funs, …) Apply funs to
specific columns. Use with funs(), vars() and
the helper functions for select().!
mutate_at(iris, vars( -Species), funs(log(.)))
add_column(.data, ..., .before = NULL, .after =
NULL) Add new column(s). Also add_count(),
add_tally(). add_column(mtcars, new = 1:32)
rename(.data, …) Rename columns.!
rename(iris, Length = Sepal.Length)
MAKE NEW VARIABLES
EXTRACT CASES
w
w
w
w
w
w
w
w
w
w
w
w
w
w
w
w
w
w
w
w
w
w
w
w
w
w
w
w
w
w
w
w
w
w
w
w
w
w
w
w
w
w
w
w
w
w
w
w
w
w
w
w
w
w
w
w
w
w
w
w
w
w
w
w
w
summary function
vectorized function
Data Transformation with dplyr : : CHEAT SHEET
A
B
C
A
B
C
select(.data, …)
Extract columns as a table. Also select_if().
select(iris, Sepal.Length, Species)
w
w
w
w
pf2

Partial preview of the text

Download Data Transformation with Dplyr Cheat Sheet and more Cheat Sheet Data Structures and Algorithms in PDF only on Docsity!

w

Summarise Cases

group_by( .data, ..., add = FALSE ) Returns copy of table grouped by … g_iris <- group_by(iris, Species) ungroup( x, … ) Returns ungrouped copy of table. ungroup(g_iris)

wwwwww

w

Use group_by() to create a "grouped" copy of a table. dplyr functions will manipulate each "group" separately and then combine the results. mtcars %>% group_by(cyl) %>% summarise(avg = mean(mpg)) These apply summary functions to columns to create a new table of summary statistics. Summary functions take vectors as input and return one value (see back). VARIATIONS summarise_all() - Apply funs to every column. summarise_at() - Apply funs to specific columns. summarise_if() - Apply funs to all cols of one type.

www

www

summarise (.data, …) Compute table of summaries. summarise(mtcars, avg = mean(mpg)) count (x, ..., wt = NULL, sort = FALSE) Count number of rows in each group defined by the variables in … Also tally (). count(iris, Species) RStudio® is a trademark of RStudio, Inc. • CC BY SA RStudio • info@rstudio.com • 844-448-1212 • rstudio.com • Learn more with browseVignettes(package = c("dplyr", "tibble")) • dplyr 0.7.0 • tibble 1.2.0 • Updated: 2019- Each observation , or case , is in its own row Each variable is in its own column

dplyr functions work with pipes and expect tidy data. In tidy data: pipes x %>% f(y) becomes f(x, y) filter( .data, … ) Extract rows that meet logical criteria. filter(iris, Sepal.Length > 7) distinct( .data, ..., .keep_all = FALSE ) Remove rows with duplicate values. distinct(iris, Species) sample_frac( tbl, size = 1, replace = FALSE, weight = NULL, .env = parent.frame() ) Randomly select fraction of rows. sample_frac(iris, 0.5, replace = TRUE) sample_n( tbl, size, replace = FALSE, weight = NULL, .env = parent.frame() ) Randomly select size rows. sample_n(iris, 10, replace = TRUE) slice( .data, … ) Select rows by position. slice(iris, 10:15) top_n( x, n, wt ) Select and order top n entries (by group if grouped data). top_n(iris, 5, Sepal.Width) Row functions return a subset of rows as a new table. See ?base::Logic and ?Comparison for help.

= !is.na()! & < <= is.na() %in% | xor() arrange( .data, … ) Order rows by values of a column or columns (low to high), use with desc() to order from high to low. arrange(mtcars, mpg) arrange(mtcars, desc(mpg)) add_row(. data, ..., .before = NULL, .after = NULL ) Add one or more rows to a table. add_row(faithful, eruptions = 1, waiting = 1)

Group Cases

Manipulate Cases

EXTRACT VARIABLES

ADD CASES

ARRANGE CASES

Logical and boolean operators to use with filter() Column functions return a set of columns as a new vector or table. contains( match ) ends_with( match ) matches( match ) : , e.g. mpg:cyl

- , e.g, -Species num_range( prefix, range ) one_of() starts_with( match ) pull( .data, var = -1 ) Extract column values as a vector. Choose by name or index. pull(iris, Sepal.Length)

Manipulate Variables

Use these helpers with select (), e.g. select(iris, starts_with("Sepal")) These apply vectorized functions to columns. Vectorized funs take vectors as input and return vectors of the same length as output (see back). mutate( .data, … ) Compute new column(s). mutate(mtcars, gpm = 1/mpg) transmute( .data, … ) Compute new column(s), drop others. transmute(mtcars, gpm = 1/mpg) mutate_all( .tbl, .funs, … ) Apply funs to every column. Use with funs(). Also mutate_if(). mutate_all(faithful, funs(log(.), log2(.))) mutate_if(iris, is.numeric, funs(log(.))) mutate_at( .tbl, .cols, .funs, … ) Apply funs to specific columns. Use with funs() , vars() and the helper functions for select(). mutate_at(iris, vars( -Species), funs(log(.))) add_column( .data, ..., .before = NULL, .after = NULL ) Add new column(s). Also add_count() , add_tally(). add_column(mtcars, new = 1:32) rename( .data, … ) Rename columns. rename(iris, Length = Sepal.Length)

MAKE NEW VARIABLES

EXTRACT CASES

wwwwww

wwwwww

wwwwww

wwwwww

wwwwww

wwwwww

wwww

wwwww

wwwwww

www

wwww

w

wwwwww

summary function vectorized function

Data Transformation with dplyr : : CHEAT SHEET

A B C A B C select( .data, … ) Extract columns as a table. Also select_if(). select(iris, Sepal.Length, Species)

wwww

dplyr

OFFSETS

dplyr:: lag() - Offset elements by 1 dplyr:: lead() - Offset elements by - CUMULATIVE AGGREGATES dplyr:: cumall() - Cumulative all() dplyr:: cumany() - Cumulative any() cummax() - Cumulative max() dplyr:: cummean() - Cumulative mean() cummin() - Cumulative min() cumprod() - Cumulative prod() cumsum() - Cumulative sum() RANKINGS dplyr:: cume_dist() - Proportion of all values <= dplyr:: dense_rank() - rank w ties = min, no gaps dplyr:: min_rank() - rank with ties = min dplyr:: ntile() - bins into n bins dplyr:: percent_rank() - min_rank scaled to [0,1] dplyr:: row_number() - rank with ties = "first" *MATH +, - , , /, ^, %/%, %% - arithmetic ops log(), log2(), log10() - logs <, <=, >, >=, !=, == - logical comparisons dplyr:: between() - x >= left & x <= right dplyr:: near() - safe == for floating point numbers MISC dplyr:: case_when() - multi-case if_else() iris %>% mutate(Species = case_when( Species == "versicolor" ~ "versi", Species == "virginica" ~ "virgi", TRUE ~ Species ) ) dplyr:: coalesce() - first non-NA values by element across a set of vectors dplyr:: if_else() - element-wise if() + else() dplyr:: na_if() - replace specific values with NA pmax() - element-wise max() pmin() - element-wise min() dplyr:: recode() - Vectorized switch() dplyr:: recode_factor() - Vectorized switch() for factors mutate() and transmute() apply vectorized functions to columns to create new columns. Vectorized functions take vectors as input and return vectors of the same length as output.

Vector Functions

TO USE WITH MUTATE ()

vectorized function

Summary Functions

TO USE WITH SUMMARISE ()

summarise() applies summary functions to columns to create a new table. Summary functions take vectors as input and return single values as output. COUNTS dplyr:: n() - number of values/rows dplyr:: n_distinct() - # of uniques sum(!is.na()) - # of non-NA’s LOCATION mean() - mean, also mean(!is.na()) median() - median LOGICALS mean() - Proportion of TRUE’s sum() - # of TRUE’s POSITION/ORDER dplyr:: first() - first value dplyr:: last() - last value dplyr:: nth() - value in nth location of vector RANK quantile() - nth quantile min() - minimum value max() - maximum value SPREAD IQR() - Inter-Quartile Range mad() - median absolute deviation sd() - standard deviation var() - variance

Row Names

Tidy data does not use rownames, which store a variable outside of the columns. To work with the rownames, first move them into a column. RStudio® is a trademark of RStudio, Inc. • CC BY SA RStudio • info@rstudio.com • 844-448-1212 • rstudio.com • Learn more with browseVignettes(package = c("dplyr", "tibble")) • dplyr 0.7.0 • tibble 1.2.0 • Updated: 2019- rownames_to_column() Move row names into col. a <- rownames_to_column(iris, var = "C") column_to_rownames() Move col in row names. column_to_rownames(a, var = "C") summary function C A B Also has_rownames() , remove_rownames()

Combine Tables

COMBINE VARIABLES COMBINE CASES

Use bind_cols() to paste tables beside each other as they are. bind_cols(…) Returns tables placed side by side as a single table. BE SURE THAT ROWS ALIGN. Use a " Mutating Join " to join one table to columns from another, matching values with the rows that they correspond to. Each join retains a different combination of values from the tables. left_join( x, y, by = NULL, copy=FALSE, suffix=c(“.x”,“.y”),… ) Join matching values from y to x. right_join( x, y, by = NULL, copy = FALSE, suffix=c(“.x”,“.y”),… ) Join matching values from x to y. inner_join( x, y, by = NULL, copy = FALSE, suffix=c(“.x”,“.y”),… ) Join data. Retain only rows with matches. full_join( x, y, by = NULL, copy=FALSE, suffix=c(“.x”,“.y”),… ) Join data. Retain all values, all rows. Use by = c("col1", "col2", …) to specify one or more common columns to match on. left_join(x, y, by = "A") Use a named vector, by = c("col1" = "col2") , to match on columns that have different names in each table. left_join(x, y, by = c("C" = "D")) Use suffix to specify the suffix to give to unmatched columns that have the same name in both tables. left_join(x, y, by = c("C" = "D"), suffix = c("1", "2")) Use bind_rows() to paste tables below each other as they are. bind_rows( …, .id = NULL ) Returns tables one on top of the other as a single table. Set .id to a column name to add a column of the original table names (as pictured) intersect(x, y, …) Rows that appear in both x and y. setdiff(x, y, …) Rows that appear in x but not y. union(x, y, …) Rows that appear in x or y. (Duplicates removed). union_all() retains duplicates. Use a " Filtering Join " to filter one table against the rows of another. semi_join( x, y, by = NULL, … ) Return rows of x that have a match in y. USEFUL TO SEE WHAT WILL BE JOINED. anti_join( x, y, by = NULL, … ) Return rows of x that do not have a match in y. USEFUL TO SEE WHAT WILL NOT BE JOINED. Use setequal() to test whether two data sets contain the exact same rows (in any order). EXTRACT ROWS A B 1 a t 2 b u 3 c v 1 a t 2 b u 3 c v A B 1 a t 2 b u 3 c v A B C 1 a t 2 b u 3 c v x y A B C a t 1 b u 2 c v 3 A B D a t 3 b u 2 d w 1

A B C a t 1 b u 2 c v 3 A B D a t 3 b u 2 d w 1 A B C D a t 1 3 b u 2 2 c v 3 NA A B C D a t 1 3 b u 2 2 d w NA 1 A B C D a t 1 3 b u 2 2 A B C D a t 1 3 b u 2 2 c v 3 NA d w NA^1 A B.x C B.y D a t 1 t 3 b u 2 u 2 c v 3 NA^ NA A.x B.x C A.y B.y a t 1 d w b u 2 b u c v 3 a t A1 B1 C A2 B a t 1 d w b u 2 b u c v 3 a t x y A B C a t 1 b u 2 c v 3 A B C C v 3

+^ d^ w^4

DF A B C x a t 1 x b u 2 x c v 3 z c v 3 z d w 4 A B C c v 3 A B C a t 1 b u 2 c v 3 d w 4 A B C a t 1 b u 2 x y A B C a t 1 b u 2 c v 3 A B D a t 3 b u 2 d w 1

A B C c v 3 A B C a t 1 b u 2

dplyr