

Study with the several resources on Docsity
Earn points by helping other students or get them with a premium plan
Prepare for your exams
Study with the several resources on Docsity
Earn points to download
Earn points by helping other students or get them with a premium plan
Community
Ask the community for help and clear up your study doubts
Discover the best universities in your country according to Docsity users
Free resources
Download our free guides on studying techniques, anxiety management strategies, and thesis advice from Docsity tutors
How to create a data table by a cheat sheet.
Typology: Cheat Sheet
1 / 2
This page cannot be seen from the preview
Don't miss anything!
CC BY SA Erik Petrovski •Updated: 2018-
data.table is an extremely fast and memory efficient package for transforming data in R. It works by converting R’s native data frame objects into data.tables with new and enhanced functionality. The basics of working with data.tables are:
Take data.table dt , subset rows using i , and manipulate columns with j , grouped according to by.
data.tables are also data frames – functions that work with data frames therefore also work with data.tables.
data.table( a = c(1, 2), b = c("a", "b") ) – create a data.table from scratch. Analogous to data.frame().
setDT( df ) * or as.data.table( df ) – convert a data frame or a list to a data.table.
dt[ 1:2 , ] – subset rows based on row numbers.
dt[ a > 5 , ] – subset rows based on values in one or more columns.
LOGICAL OPERATORS TO USE IN i
< <= is.na() %in% | %like%
= !is.na()! & %between%
dt[, c(2) ] – extract column(s) by number. Prefix column numbers with “-” to drop.
b c b c dt[, .(b, c) ] – extract column(s) by name.
a 6
a 2 6 5
dt[, .(x = sum(a)) ] – create a data.table with new columns based on the summarized values of rows.
Summary functions like mean(), median(), min(), max(), etc. may be used to summarize rows.
dt[, .(c = sum(b)), by = a ] – summarize rows within groups.
dt[, c := sum(b), by = a ] – create a new column and compute rows within groups.
dt[, .SD[1], by = a ] – extract first row of groups.
dt[, .SD[.N], by = a ] – extract last row of groups.
c 3 3
dt[, c := 1 + 2 ] – compute a column based on an expression.
setorder( dt, a, - b ) – reorder a data.table according to specified columns. Prefix column names with “ - ” for descending order.
a b 1 2 1 1 2 2
a b 1 2 2 2 1 1
a 2 1
dt[ a == 1, c := 1 + 2 ] – compute a column based on an expression but only for a subset of rows.
a x
a c 2 NA 1 3
a a a (^) dt[, j, by = .(a) ] – group rows by values in specified column(s).
dt[, j, keyby = .(a) ] – group and simultaneously sort rows according to values in specified column(s).
dt[ … ][ … ] – perform a sequence of data.table operations by chaining multiple “[]”.
data.table’s functions prefixed with “set” and the operator “:=” work without “<-” to alter data without making copies in memory. E.g. the more efficient “setDT(df)” is analogous to “df <- as.data.table(df)”.
c d 1 2 1 2
dt[, :=
(c = 1 , d = 2) ] – compute multiple columns based on separate expressions.
c (^) dt[, c := NULL ] – delete a column.
b
b 1 2
dt[, b := as.integer(b) ] – convert the type of a column using as.integer(), as.numeric(), as.character(), as.Date(), etc..
CC BY SA Erik Petrovski •• Updated: 2018-
.SD
Combine data.tables
dt_a[dt_b, on = .(b = y) ] – join two data.tables based on rows with equal values. You can leave out “on” if keys are already set.
a b 1 c 2 a 3 b
x y 3 b 2 c 1 a
a b x 3 b 3 1 c 2 2 a 1
=
dt_a[dt_b, on = .(b = y, c > z) ] – join two data.tables based on rows with equal and unequal values.
a b c 1 c 7 2 a 5 3 b 6
x y z 3 b 4 2 c 5 1 a 8
a b c x 3 b 4 3 1 c 5 2 NA a 8 1
=
dt_a[dt_b, on = .(id = id, date = date), roll = TRUE ] – by default, a rolling join matches rows, according to id columns, but only keeps the most recent preceding match with the left table, according to date columns. Use “roll = -Inf” to reverse direction.
rbind( dt_a, dt_b ) – combine rows of two data.tables.
a b a^ b
cbind( dt_a, dt_b ) – combine columns of two data.tables.
a b x y
a b x y
=
a id date 1 A 01-01- 2 A 01-01- 3 A 01-01- 1 B 01-01- 2 B 01-01-
b id date 1 A 01-01- 1 B 01-01-
a id date b 2 A 01-01-2013 1 2 B 01-01-2013 1
setkey( dt, a, b ) – set keys in a data.table to enable faster repeated lookups in specified column(s) using “dt[.(value), ]” or for merging without specifying merging columns “dt_a[dt_b]”.
Reshape a data.table
dcast( dt, id ~ y, value.var = c("a", "b") )
id y a b A X 1 3 A Z 2 4 B X 1 3 B Z 2 4
Refer to a S ubset of the D ata with .SD.
dt[, lapply(.SD, as.character), .SDcols = c( " a " , " b " ) ] – convert the type of designated columns.
setnames( dt, c("a", "b"), c("x", "y") ) – rename column(s).
a b^ x^ y
id a_X a_Z b_X b_Z A 1 2 3 4 B 1 2 3 4
melt( dt, id.vars = c("id"), measure = patterns("^a", "^b"), variable.name = "y", value.name = c("a", "b") )
Reshape a data.table from long to wide format.
id y a b A 1 1 3 B 1 1 3 A 2 2 4 B 2 2 4
id a_X a_Z b_X b_Z A 1 2 3 4 B 1 2 3 4
Reshape a data.table from wide to long format.
Sequential rows
dt[, .SD[which.max(a)], by = b ] – within groups, extract rows with the maximum value in a specified column. Also works with which.min() and which(). Similar to “.SD[.N]” and “.SD[1]”.
dt A data.table. id ~ y Formula with a LHS: id column(s) containing id(s) for multiple entries. And a RHS: column(s) with value(s) to spread in column headers. value.var Column(s) containing values to fill into cells.
dt A data.table. id.vars Id column(s) with id(s) for multiple entries. measure Column(s) containing values to fill into cells (often in pattern form). variable.name, value.name
Name(s) of new column(s) for variables and values derived from old headers.
unique( dt, by = c("a", "b") ) – extract unique rows based on columns specified in “by”. Leave out “by” to use all columns.
a b 1 2 2 2
a b 1 2 2 2 1 2
uniqueN( dt, by = c("a", "b") ) – return the number of unique rows based on columns specified in “by”.
fread & fwrite
fread( "file.csv" ) – read data from a flat file such as .csv or .tsv into R.
fread( "file.csv", select = c("a", "b") ) – read specified column(s) from a flat file into R.
fwrite( dt, file = "file.csv" ) – write data to a flat file from R.
dt[, c := 1:.N, by = b ] – within groups, compute a column with sequential row IDs.
a b c 1 a 1 2 a 2 3 b 1
a b 1 a 2 a 3 b
dt[, c := shift(a, 1), by = b ] – within groups, duplicate a column with rows lagged by specified amount.
dt[, c := shift(a, 1, type = "lead"), by = b ] – within groups, duplicate a column with rows leading by specified amount.
a b c 1 a NA 2 a 1 3 b NA 4 b 3 5 b 4
a b 1 a 2 a 3 b 4 b 5 b