Docsity
Docsity

Prepare for your exams
Prepare for your exams

Study with the several resources on Docsity


Earn points to download
Earn points to download

Earn points by helping other students or get them with a premium plan


Guidelines and tips
Guidelines and tips

Data Transformation with Data Table Cheat Sheet, Cheat Sheet of Data Structures and Algorithms

How to create a data table by a cheat sheet.

Typology: Cheat Sheet

2020/2021

Uploaded on 04/26/2021

jeny
jeny 🇺🇸

4.6

(14)

251 documents

1 / 2

Toggle sidebar

This page cannot be seen from the preview

Don't miss anything!

bg1
Basics
CC BY SA Erik Petrovski •Updated: 2018-09
Data Transformation with data.table
::
CHEAT SHEET
Manipulate columns with j
Functions for data.tables
data.table is an extremely fast and memory efficient package
for transforming data in R. It works by converting R’s native
data frame objects into data.tables with new and enhanced
functionality. The basics of working with data.tables are:
dt[i, j, by]
Take data.table dt,
subset rows using i,
and manipulate columns with j,
grouped according to by.
data.tables are also data frames functions that work with data
frames therefore also work with data.tables.
data.table(a = c(1, 2), b = c("a", "b"))create a data.table from
scratch. Analogous to data.frame().
setDT(df)* or as.data.table(df)convert a data frame or a list to
a data.table.
Create a data.table
dt[1:2, ] subset rows based on row numbers.
dt[a > 5, ] subset rows based on values in
one or more columns.
Subset rows using i
LOGICAL OPERATORS TO USE IN i
<<= is.na() %in% |%like%
>>= !is.na() ! & %between%
dt[, c(2)] extract column(s) by number. Prefix
column numbers with “-” to drop.
dt[, .(b, c)] extract column(s) by name.
b c
b c
a
6
a
2
6
5
EXTRACT
dt[, .(x = sum(a))] create a data.table with new
columns based on the summarized values of rows.
Summary functions like mean(), median(), min(),
max(), etc. may be used to summarize rows.
dt[, .(c = sum(b)), by = a]–summarize rows within groups.
dt[, c := sum(b), by = a] create a new column and compute rows
within groups.
dt[, .SD[1], by = a] extract first row of groups.
dt[, .SD[.N], by = a] extract last row of groups.
COMMON GROUPED OPERATIONS
COMPUTE COLUMNS*
c
3
3
dt[, c := 1 + 2] compute a column based on an
expression.
setorder(dt, a, -b)reorder a data.table
according to specified columns. Prefix
column names with “-” for descending
order.
a b
1 2
1 1
2 2
a b
1 2
2 2
1 1
REORDER
a
2
1
dt[a == 1, c := 1 + 2] compute a column based
on an expression but only for a subset of rows.
x
a
SUMMARIZE
a c
2NA
1 3
Group according to by
a
aadt[, j, by = .(a)] group rows by
values in specified column(s).
dt[, j, keyby = .(a)] group and
simultaneously sort rows according
to values in specified column(s).
Chaining
dt[][] perform a sequence of data.table operations by
chaining multiple “[]”.
*SET FUNCTIONS AND :=
data.table’s functions prefixed with “set” and the operator “:=”
work without “<-” to alter data without making copies in
memory. E.g. the more efficient “setDT(df)” is analogous to
df <- as.data.table(df)”.
c d
1 2
1 2
dt[, `:=`(c = 1 , d = 2)] compute multiple
columns based on separate expressions.
DELETE COLUMN
cdt[, c := NULL]delete a column.
CONVERT COLUMN TYPE
b
1.5
2.6
b
1
2
dt[, b := as.integer(b)] convert the type of a
column using as.integer(), as.numeric(),
as.character(), as.Date(), etc..
pf2

Partial preview of the text

Download Data Transformation with Data Table Cheat Sheet and more Cheat Sheet Data Structures and Algorithms in PDF only on Docsity!

Basics

CC BY SA Erik Petrovski •Updated: 2018-

Data Transformation with data.table : :^ CHEAT SHEET

Manipulate columns with j

Functions for data.tables

data.table is an extremely fast and memory efficient package for transforming data in R. It works by converting R’s native data frame objects into data.tables with new and enhanced functionality. The basics of working with data.tables are:

dt[i, j, by]

Take data.table dt , subset rows using i , and manipulate columns with j , grouped according to by.

data.tables are also data frames – functions that work with data frames therefore also work with data.tables.

data.table( a = c(1, 2), b = c("a", "b") ) – create a data.table from scratch. Analogous to data.frame().

setDT( df ) * or as.data.table( df ) – convert a data frame or a list to a data.table.

Create a data.table

dt[ 1:2 , ] – subset rows based on row numbers.

dt[ a > 5 , ] – subset rows based on values in one or more columns.

Subset rows using i

LOGICAL OPERATORS TO USE IN i

< <= is.na() %in% | %like%

= !is.na()! & %between%

dt[, c(2) ] – extract column(s) by number. Prefix column numbers with “-” to drop.

b c b c dt[, .(b, c) ] – extract column(s) by name.

a 6

a 2 6 5

EXTRACT

dt[, .(x = sum(a)) ] – create a data.table with new columns based on the summarized values of rows.

Summary functions like mean(), median(), min(), max(), etc. may be used to summarize rows.

dt[, .(c = sum(b)), by = a ] – summarize rows within groups.

dt[, c := sum(b), by = a ] – create a new column and compute rows within groups.

dt[, .SD[1], by = a ] – extract first row of groups.

dt[, .SD[.N], by = a ] – extract last row of groups.

COMMON GROUPED OPERATIONS

COMPUTE COLUMNS*

c 3 3

dt[, c := 1 + 2 ] – compute a column based on an expression.

setorder( dt, a, - b ) – reorder a data.table according to specified columns. Prefix column names with “ - ” for descending order.

a b 1 2 1 1 2 2

a b 1 2 2 2 1 1

REORDER

a 2 1

dt[ a == 1, c := 1 + 2 ] – compute a column based on an expression but only for a subset of rows.

a x

SUMMARIZE

a c 2 NA 1 3

Group according to by

a a a (^) dt[, j, by = .(a) ] – group rows by values in specified column(s).

dt[, j, keyby = .(a) ] – group and simultaneously sort rows according to values in specified column(s).

Chaining

dt[][] – perform a sequence of data.table operations by chaining multiple “[]”.

* SET FUNCTIONS AND :=

data.table’s functions prefixed with “set” and the operator “:=” work without “<-” to alter data without making copies in memory. E.g. the more efficient “setDT(df)” is analogous to “df <- as.data.table(df)”.

c d 1 2 1 2

dt[, :=(c = 1 , d = 2) ] – compute multiple columns based on separate expressions.

DELETE COLUMN

c (^) dt[, c := NULL ] – delete a column.

CONVERT COLUMN TYPE

b

b 1 2

dt[, b := as.integer(b) ] – convert the type of a column using as.integer(), as.numeric(), as.character(), as.Date(), etc..

CC BY SA Erik Petrovski •• Updated: 2018-

BIND

.SD

Combine data.tables

dt_a[dt_b, on = .(b = y) ] – join two data.tables based on rows with equal values. You can leave out “on” if keys are already set.

a b 1 c 2 a 3 b

JOIN

ROLLING JOIN

x y 3 b 2 c 1 a

a b x 3 b 3 1 c 2 2 a 1

=

dt_a[dt_b, on = .(b = y, c > z) ] – join two data.tables based on rows with equal and unequal values.

a b c 1 c 7 2 a 5 3 b 6

x y z 3 b 4 2 c 5 1 a 8

a b c x 3 b 4 3 1 c 5 2 NA a 8 1

=

dt_a[dt_b, on = .(id = id, date = date), roll = TRUE ] – by default, a rolling join matches rows, according to id columns, but only keeps the most recent preceding match with the left table, according to date columns. Use “roll = -Inf” to reverse direction.

rbind( dt_a, dt_b ) – combine rows of two data.tables.

a b a^ b

a b

cbind( dt_a, dt_b ) – combine columns of two data.tables.

a b x y

a b x y

=

a id date 1 A 01-01- 2 A 01-01- 3 A 01-01- 1 B 01-01- 2 B 01-01-

  • =

b id date 1 A 01-01- 1 B 01-01-

a id date b 2 A 01-01-2013 1 2 B 01-01-2013 1

setkey( dt, a, b ) – set keys in a data.table to enable faster repeated lookups in specified column(s) using “dt[.(value), ]” or for merging without specifying merging columns “dt_a[dt_b]”.

SET KEYS

Reshape a data.table

RESHAPE TO WIDE FORMAT

RESHAPE TO LONG FORMAT

dcast( dt, id ~ y, value.var = c("a", "b") )

id y a b A X 1 3 A Z 2 4 B X 1 3 B Z 2 4

MULTIPLE COLUMN TYPE CONVERSION

Refer to a S ubset of the D ata with .SD.

dt[, lapply(.SD, as.character), .SDcols = c( " a " , " b " ) ] – convert the type of designated columns.

setnames( dt, c("a", "b"), c("x", "y") ) – rename column(s).

a b^ x^ y

RENAME COLUMNS

id a_X a_Z b_X b_Z A 1 2 3 4 B 1 2 3 4

melt( dt, id.vars = c("id"), measure = patterns("^a", "^b"), variable.name = "y", value.name = c("a", "b") )

Reshape a data.table from long to wide format.

id y a b A 1 1 3 B 1 1 3 A 2 2 4 B 2 2 4

id a_X a_Z b_X b_Z A 1 2 3 4 B 1 2 3 4

Reshape a data.table from wide to long format.

Sequential rows

GROUP OPTIMA

dt[, .SD[which.max(a)], by = b ] – within groups, extract rows with the maximum value in a specified column. Also works with which.min() and which(). Similar to “.SD[.N]” and “.SD[1]”.

dt A data.table. id ~ y Formula with a LHS: id column(s) containing id(s) for multiple entries. And a RHS: column(s) with value(s) to spread in column headers. value.var Column(s) containing values to fill into cells.

dt A data.table. id.vars Id column(s) with id(s) for multiple entries. measure Column(s) containing values to fill into cells (often in pattern form). variable.name, value.name

Name(s) of new column(s) for variables and values derived from old headers.

unique( dt, by = c("a", "b") ) – extract unique rows based on columns specified in “by”. Leave out “by” to use all columns.

a b 1 2 2 2

a b 1 2 2 2 1 2

UNIQUE ROWS

uniqueN( dt, by = c("a", "b") ) – return the number of unique rows based on columns specified in “by”.

fread & fwrite

IMPORT

fread( "file.csv" ) – read data from a flat file such as .csv or .tsv into R.

fread( "file.csv", select = c("a", "b") ) – read specified column(s) from a flat file into R.

EXPORT

fwrite( dt, file = "file.csv" ) – write data to a flat file from R.

dt[, c := 1:.N, by = b ] – within groups, compute a column with sequential row IDs.

a b c 1 a 1 2 a 2 3 b 1

a b 1 a 2 a 3 b

ROW IDS

dt[, c := shift(a, 1), by = b ] – within groups, duplicate a column with rows lagged by specified amount.

dt[, c := shift(a, 1, type = "lead"), by = b ] – within groups, duplicate a column with rows leading by specified amount.

a b c 1 a NA 2 a 1 3 b NA 4 b 3 5 b 4

a b 1 a 2 a 3 b 4 b 5 b

LAG & LEAD