














An introduction to R programming, focusing on data manipulation and analysis using R. It covers the basics of R syntax, data structures, and functions, as well as the installation and use of packages. The document also includes examples and exercises to help readers practice their skills.
This notebook is a selection from “A (very) short introduction to R” by Paul Torfs & Claudia Brauer and “R for Data Science” by Hadley Wickham and Garrett Grolemund.
What is R?
R is a powerful language and environment for statistical computing and graphics. It is a free software (so-called “GNU”) project which is similar to the commercial S language and environment, developed at Bell Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues. R can be considered a different implementation of S or, in language terms, a different dialect of S. The main advantages of R are that it is freeware and that a lot of help is available online. It is quite similar to other programming packages such as MATLAB (not freeware), but more user-friendly than programming languages such as C++ or Fortran. You can use R as it is, but for educational purposes we prefer to use R in combination with the RStudio interface (also freeware), which has an organized layout and several extra options.
The R language came into use quite a bit after S had been developed. One key limitation of the S language was that it was only available in a commercial package, S-PLUS. In 1991, R was created by Ross Ihaka and Robert Gentleman in the Department of Statistics at the University of Auckland. In 1993 the first announcement of R was made to the public. In 1995, Martin Mächler made an important contribution by convincing Ross and Robert to use the GNU General Public License to make R free software. This was critical because it made the source code for the entire R system accessible to anyone who wanted to tinker with it (more on free software later). In 1996, public mailing lists were created (the R-help and R-devel lists) and in 1997 the R Core Group was formed, containing some people associated with S and S-PLUS. Currently, the core group controls the source code for R and is the only group able to check in changes to the main R source tree. Finally, in 2000 R version 1.0.0 was released to the public.
Limitations of R
No programming language or statistical analysis system is perfect, and R certainly has a number of drawbacks. For starters, R is essentially based on almost 50-year-old technology, going back to the original S system developed at Bell Labs. There was originally little built-in support for dynamic or 3-D graphics (but things have improved greatly since the “old days”). Another commonly cited limitation of R is that objects must generally be stored in physical memory. This is in part due to the scoping rules of the language, but R is generally more of a memory hog than other statistical packages. However, there have been a number of advancements to deal with this, both in the R core and in a number of packages developed by contributors. Also, computing power and capacity have continued to grow over time, and the amount of physical memory that can be installed on even a consumer-level laptop is substantial. While we will likely never have enough physical memory on a computer to handle the increasingly large datasets that are being generated, the situation has gotten quite a bit easier over time.
At a higher level one “limitation” of R is that its functionality is based on consumer demand and (voluntary) user contributions. If no one feels like implementing your favorite method, then it’s your job to implement it (or you need to pay someone to do it). The capabilities of the R system generally reflect the interests of the R user community. As the community has ballooned in size over the past 10 years, the capabilities have similarly increased.
2 Getting started
To install R on your computer (legally for free!), go to the home website of R, http://www.r-project.org/, and do the following (assuming you work on a Windows computer):
It is also possible to run R and RStudio from a USB stick instead of installing them. This could be useful when you don’t have administrator rights on your computer. See our separate note “How to use portable versions of R and RStudio” for help on this topic.
After finishing this setup, you should see an “R” icon on your desktop. Clicking on this starts up the standard interface. We recommend, however, using the RStudio interface. To install RStudio, go to http://www.rstudio.org/ and do the following (assuming you work on a Windows computer):
The RStudio interface consists of several windows (see Figure 1).
There are many more packages available on the R website. If you want to install and use a package (for example, the package called “dplyr”) you should:
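The usual pattern is to install a package once per computer and then load it in every session in which you need it. The sketch below illustrates this; the helper name ensure_package is hypothetical (it is not part of R), while install.packages, requireNamespace and library are standard R functions. Installing requires an internet connection.

```r
# Hypothetical helper (not part of R): install a package only when it is missing.
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)  # downloads from CRAN, needs an internet connection
  }
  requireNamespace(pkg, quietly = TRUE)  # TRUE when the package can be loaded
}

# "stats" ships with every R installation, so nothing is downloaded here.
ok <- ensure_package("stats")
print(ok)

# For dplyr the same steps would be:
# ensure_package("dplyr")   # or simply install.packages("dplyr"), once
# library(dplyr)            # load the package, once per R session
```

Checking first whether the package is already present avoids downloading it again every time the script runs.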
3 Some first examples of R commands
This section is dedicated to basic R use and syntax. Please do read this section to understand simple R syntax and possible data formats and types. It is highly recommended to complete some of the swirl() exercises.
R can be used as a calculator. You can just type your equation in the command window (after the “>”) or in the editor:
10^2 + 36
and R will give the answer:
[1] 136
Practice
Compute the difference between 2014 and the year you started at this university and divide this by the difference between 2014 and the year you were born. Multiply this with 100 to get the percentage of your life you have spent at this university. Use brackets if you need them. If you use brackets and forget to add the closing bracket, the “>” on the command line changes into a “+”. The “+” can also mean that R is still busy with some heavy computation. If you want R to quit what it was doing and give back the “>”, press ESC (see the reference list on the last page).
# this is a comment
You can also give numbers a name. By doing so, they become so-called variables which can be used later. For example, you can type in the command window:
a <- 4
You can see that a appears in the workspace window, which means that R now remembers what a is. You can also ask R what a is (just type a followed by ENTER in the command window):
a
# or do calculations with a:
a * 5
# You can also assign a new value to a using the old one.
a <- a + 10
a
To remove all variables from R’s memory, type
rm(list = ls())
or click “clear all” in the workspace window. You can see that RStudio then empties the workspace window. If you only want to remove the variable a, you can type rm(a), or change list to grid in the workspace window and check the variable/object you want to remove.
Like in many other programs, R organizes numbers in scalars (a single number, 0-dimensional), vectors (a row of numbers, also called arrays, 1-dimensional) and matrices (like a table, 2-dimensional). The a you defined before was a scalar. To define a vector with the numbers 3, 4 and 5, you need the function c, which is short for concatenate (paste together).
b <- c(3, 4, 5)
Matrices and other 2-dimensional structures will be introduced in Section 5.
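To make the difference between a scalar and a vector concrete, here is a small sketch. It reuses the names a and b from the examples above; longer is an illustrative name introduced only for this example.

```r
a <- 4            # a scalar: in R this is simply a vector of length 1
b <- c(3, 4, 5)   # a vector of three numbers

length(a)         # 1
length(b)         # 3

# c() can also paste existing vectors and single numbers together
longer <- c(b, 6, 7)
longer            # 3 4 5 6 7

# Arithmetic on a vector is applied to every element
b * 2             # 6 8 10
```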
How would you use what you learned to install more than one R package in a single command?
# write your code:
Swirl Practice
In the R console run the following commands:
library(swirl)
swirl()
Select 1 for ‘R Programming’, then select 1 for ‘Basic Building Blocks’. At the end of the exercise select 0 to exit. Hint: the R console displays the exercise completion percentage.
If you would like to compute the mean of all the elements in the vector b from the example above, you could type
mean(b)
4 Scripts
R is an interpreter that uses a command line based environment. This means that you have to type commands, rather than use the mouse and menus. This has the advantage that your tasks can be automated.
You can store your commands in files, so-called scripts. These scripts typically have file names with the extension .R, e.g. foo.R. You can open an editor window to edit these files by clicking File and New or Open file...
You can run (send to the console window) part of the code by selecting lines and pressing CTRL+ENTER or clicking Run in the editor window. If you do not select anything, R will run the line your cursor is on. You can always run the whole script with the console command source, so e.g. for the script in the file foo.R you type: source("foo.R")
You can also click Run all in the editor window or press CTRL+SHIFT+S to run the whole script at once.
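As a rough illustration of what source does, the sketch below writes a tiny script to a temporary file and then runs it. The file location (a temporary directory) and the variable names x and y are assumptions made for this example only.

```r
# Write a two-line script to a temporary file called foo.R
script_path <- file.path(tempdir(), "foo.R")
writeLines(c("x <- 10^2 + 36",
             "y <- x / 2"),
           script_path)

# Run every line of the script, as if it had been typed in the console
source(script_path)

x   # 136
y   # 68
```

After source has finished, the variables created by the script are available in the workspace, just as if the commands had been typed by hand.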
Make a file called firstscript.R containing R code that generates 100 random numbers and plots them, and run this script several times.
5 Data structures
If you are unfamiliar with R, it makes sense to just retype the commands listed in this section. Maybe you will not need all these structures in the beginning, but it is always good to have at least a first glimpse of the terminology and possible applications.
Vectors were already introduced, but they can do more:
vec1 <- c(1, 4, 6, 8, 10)
vec1
vec1[5]
vec1[3] <- 12
vec1
vec2 <- seq(from = 0, to = 1, by = 0.25)
vec2
sum(vec1)
vec1 + vec2
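Indexing with square brackets is not limited to a single position. The sketch below shows a few more ways to select elements; the vectors v and w are illustrative names that mirror vec1 and vec2.

```r
v <- c(1, 4, 6, 8, 10)

v[2:4]          # elements 2 to 4: 4 6 8
v[c(1, 5)]      # first and last element: 1 10
v[-1]           # everything except the first element: 4 6 8 10
v[v > 5]        # only the elements larger than 5: 6 8 10

# Addition works element by element on vectors of the same length
w <- seq(from = 0, to = 1, by = 0.25)   # 0.00 0.25 0.50 0.75 1.00
v + w           # 1.00 4.25 6.50 8.75 11.00
```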
Swirl Practice
In the R console run the following commands:
library(swirl)
swirl()
Select 1 for ‘R Programming’, then select 4 for ‘Vectors’. At the end of the exercise select 0 to exit.
Practice (optional)
Make a script file which constructs three random normal vectors of length 100. Call these vectors x1, x2 and x3. Make a data frame called t with three columns (called a, b and c) containing respectively x1, x1+x2 and x1+x2+x3. Call the following functions for this data frame: plot(t) and sd(t). Can you understand the results? Rerun this script a few times.
# write your code:
Swirl Practice (highly Recommended)
In the R console run the following commands:
library(swirl)
swirl()
Select 1 for ‘R Programming’, then select 7 for ‘Matrices and Data Frames’. At the end of the exercise select 0 to exit.
Another basic structure in R is a list. The main advantage of lists is that the “columns” (they’re not really ordered in columns any more, but are more a collection of vectors) don’t have to be of the same length, unlike matrices and data frames.
L <- list(one = 1, two = c(1, 2), five = seq(0, 1, length.out = 5))
L
L$one
L$two
L$five
names(L)
L$five + 10
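The point that list elements need not share a length (or even a type) can be checked directly. This is a small sketch built on the list L from above; the element name label is an assumption added for illustration.

```r
L <- list(one = 1, two = c(1, 2), five = seq(0, 1, length.out = 5))

lengths(L)        # number of values in each element: 1, 2 and 5
L[["two"]]        # same as L$two: 1 2

# A list may also mix types, which a matrix cannot do
L$label <- "text element"
names(L)          # "one" "two" "five" "label"
str(L)            # compact overview of the whole list
```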
6 Graphics
Plotting is an important statistical activity. So it should not come as a surprise that R has many plotting facilities. The following lines show a simple plot:
plot(rnorm(100), type = "l", col = "gold")
# ToDo
One hundred random numbers are plotted by connecting the points with lines in a gold color (the symbol between quotes after type= is the letter l, which stands for lines, not the number 1).
Another very simple example is the classical statistical histogram plot, generated by the simple command:
hist(rnorm(100))
# Try to find out, either by experimenting or by using the help, what the meaning is of rgb, the last ar
Swirl Practice (optional)
In the R console run the following commands:
library(swirl)
swirl()
Select 1 for ‘R Programming’, then select 15 for ‘Base Graphics’. At the end of the exercise select 0 to exit.
7 Reading and writing data files
There are many ways to write data from within the R environment to files, and to read data from files.
The following lines illustrate the essentials:
d <- data.frame(a = c(3, 4, 5), b = c(12, 43, 54))
d
write.table(d, file = "tst0.txt", row.names = FALSE)
d2 <- read.table(file = "tst0.txt", header = TRUE)
d2
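To convince yourself that nothing is lost between writing and reading, you can compare the two data frames. The sketch below repeats the round trip with a temporary file instead of tst0.txt (an assumption made so that the working directory stays untouched); the name path is illustrative.

```r
d <- data.frame(a = c(3, 4, 5), b = c(12, 43, 54))

# Use a temporary file so the working directory is left untouched
path <- tempfile(fileext = ".txt")
write.table(d, file = path, row.names = FALSE)

# header = TRUE tells R that the first line holds the column names
d2 <- read.table(file = path, header = TRUE)

names(d2)      # "a" "b"
all(d2 == d)   # TRUE: every value survived the round trip
```

Without header = TRUE the column names would be read as a first row of data.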
Practice
Make a file called tst1.txt in Notepad with the following content
a g x
1 2 3
4 5 6
7 8 9
10 11 12
and store it in your working directory. Write a script to read it, to multiply the column called g by 5 and to store it as tst2.txt.
# write your code:
8 Not available data
Practice (optional)
Compute the mean of the square root of a vector of 100 random numbers. What happens?
# write your code:
When you work with real data, you will encounter missing values because instrumentation failed or because you didn’t want to measure on the weekend. When a data point is not available, you write NA instead of a number.
j <- c(1, 2, NA)
Computing statistics of incomplete data sets is strictly speaking not possible. Maybe the largest value occurred during the weekend when you didn’t measure. Therefore, R will say that it doesn’t know what the largest value of j is:
max(j)
If you don’t mind the missing data and want to compute the statistics anyway, you can add the argument na.rm=TRUE (Should I remove the NAs? Yes!).
max(j, na.rm = TRUE)
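Other summary functions treat NA in the same way as max. The following sketch uses the same vector j and also shows how to find and drop missing values with is.na, a standard R function.

```r
j <- c(1, 2, NA)

is.na(j)                 # FALSE FALSE TRUE: which values are missing
sum(is.na(j))            # 1 missing value

mean(j)                  # NA, because one value is unknown
mean(j, na.rm = TRUE)    # 1.5, the mean of the available values

j[!is.na(j)]             # drop the missing values explicitly: 1 2
```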
The if-statement is used when certain computations should only be done when a certain condition is met (and maybe something else should be done when the condition is not met). An example:
w <- 3
if (w < 5) {
  d <- 2    # illustrative value
} else {
  d <- 10   # illustrative value
}
d
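When more than two cases have to be distinguished, conditions can be chained with else if. This is a minimal sketch; the variable names temp and label and the thresholds are made up for illustration.

```r
temp <- 12   # made-up temperature in degrees Celsius

if (temp < 0) {
  label <- "freezing"
} else if (temp < 20) {
  label <- "mild"
} else {
  label <- "warm"
}

label   # "mild"
```

The conditions are checked from top to bottom and only the first branch whose condition is met is executed.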
If you want to model a time series, you usually do the computations for one time step and then for the next and the next, etc. Because nobody wants to type the same commands over and over again, these computations are automated in for-loops. In a for-loop you specify what has to be done and how many times. To tell “how many times”, you specify a so-called counter. An example:
h <- seq(from = 1, to = 8)
s <- c()
for (i in 2:10) {
  s[i] <- h[i] * 10
}
# s[1] is never filled and h has only 8 elements, so s[1], s[9] and s[10] are NA
s
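The time-series idea described above, where each step builds on the previous one, could look like the sketch below. The input values and the names rain and storage are made up for illustration.

```r
rain <- c(2, 0, 5, 1, 0, 3)           # made-up input for six time steps
storage <- numeric(length(rain))      # pre-allocate the result vector with zeros

storage[1] <- rain[1]
for (i in 2:length(rain)) {
  # each time step builds on the result of the previous one
  storage[i] <- storage[i - 1] + rain[i]
}

storage   # 2 2 7 8 8 11
```

Creating the result vector at its final length before the loop starts avoids the NA gaps that appear when the counter runs past the data.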
Practice (optional)
Make a vector from 1 to 100. Make a for-loop which runs through the whole vector. Multiply the elements which are smaller than 5 or larger than 90 by 10 and the other elements by 0.1.
# write your code:
Functions you program yourself work in the same way as pre-programmed R functions.
fun1 <- function(arg1, arg2) {
  w <- arg1^2
  return(arg2 + w)
}
fun1(arg1 = 3, arg2 = 5)
fun1(3, 5)
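Arguments can be given default values, so that they may be left out when the function is called. The function fun2 below is a hypothetical variation on fun1 written for illustration.

```r
# Hypothetical function: squares arg1 and adds arg2, which defaults to 0
fun2 <- function(arg1, arg2 = 0) {
  arg1^2 + arg2   # the value of the last expression is returned
}

fun2(3)                    # 9, arg2 falls back to its default
fun2(3, 5)                 # 14
fun2(arg2 = 5, arg1 = 3)   # 14, named arguments may come in any order
```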
11 Some useful references
This is a collection of very useful R base functions
This is a subset of the functions explained in the R reference card.
Data creation
Fitting
Plotting
Plotting parameters
These can be added as arguments to plot, lines, image, etc. For help see par.
* type: “l”=lines, “p”=points, etc.
* col: color - “blue”, “red”, etc.
* lty: line type - 1=solid, 2=dashed, etc.
* pch: point type - 1=circle, 2=triangle, etc.
* main: title - character string
* xlab and ylab: axis labels - character string
* xlim and ylim: range of axes - e.g. c(1,10)
* log: logarithmic axis - “x”, “y” or “xy”
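Several of these parameters can be combined in a single call. The sketch below uses made-up random data; the titles, labels and axis range are arbitrary choices for illustration.

```r
set.seed(1)              # make the random numbers reproducible
x <- rnorm(100)

plot(x,
     type = "l",         # connect the points with lines
     col  = "blue",      # line color
     lty  = 2,           # dashed line
     main = "100 random numbers",
     xlab = "index",
     ylab = "value",
     ylim = c(-4, 4))    # fix the range of the y-axis
```

Changing one argument at a time and rerunning the call is a quick way to see what each parameter does.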
Programming
In the RStudio IDE, press Alt+Shift+K to see all keyboard shortcuts.
Tidy Data with R Bootcamp Introduction
Data science is an exciting and growing discipline that allows you to turn raw data into understanding, insight, and knowledge. The goal of “Explore and Tidy Your Data” is to help you learn the most important tools in R that will allow you to do data science. After the bootcamp, you’ll be able to use the best parts of R to tackle a wide variety of data science challenges in which preparing the data for analysis takes additional, repetitive steps.
The goal is to give you a solid foundation in the most important data manipulation R tools. Our model of the tools needed in a typical data science project looks something like this: