










An introduction to the role of statistics in data science, from the COMP6235 Foundations of Data Science course taught by Markus Brede at the University of Southampton. It covers the importance of statistics in data science, including describing data, comparing data sets, and figuring out what is special about given data. It also introduces the R package as a tool for statistical analysis and provides resources for learning R. The material is intended to give students a basic understanding of statistics and to familiarise them with R, without requiring them to memorise every technique.

Some background
● Why do you need statistics in a Foundations of Data Science course? Several aims of data science are facilitated by statistics:
● Describing given data: We will discuss distributions, mean/median values, variance, and co-dependencies between data sets.
● Figuring out what is special about given data: This needs comparison to reference models, which are often at the basis of statistics, e.g. are data more/less varied than expected?
● Comparing data: Is one set of data different from another?
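
As a rough illustration of "describing" and "comparing" data, here is a minimal sketch. The course itself uses R; Python is used here only for illustration, and the two data sets are invented for the example rather than taken from the course material.

```python
import statistics

# Two small made-up data sets (assumptions for illustration only).
set_a = [2, 4, 4, 5, 7, 9]
set_b = [4, 5, 5, 5, 6, 6]

def describe(data):
    """Return basic descriptive statistics for a list of numbers."""
    return {
        "mean": statistics.mean(data),
        "median": statistics.median(data),
        # Sample variance: squared differences from the mean, divided by n - 1.
        "variance": statistics.variance(data),
    }

summary_a = describe(set_a)
summary_b = describe(set_b)

print("set_a:", summary_a)
print("set_b:", summary_b)

# A first, informal comparison: which set is more spread out?
more_varied = "set_a" if summary_a["variance"] > summary_b["variance"] else "set_b"
print("more varied:", more_varied)
```

Both sets have the same mean, yet they differ in spread, which is why a single summary number is rarely enough to describe data.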

Aims of the stats part
We want to achieve the following:
● Give you some idea of what stats is about and familiarise you with some basic tools you might find useful for data science (but this is only a starting point!)
● Familiarise you with the R package, a very common stats package
● We don't need you to have memorised every technique, but we want you to know what you need to look up and why.
● The approach of the lectures is very experimental and hands-on; the aim is to enable you to use basic stats.

Where to get module information
● I have previously taught a stats module for PhD students. The old course information can be found here if you are keen to learn more: http://users.ecs.soton.ac.uk/mb8/stats/stats.html
● The current slides and coursework information are available here: http://users.ecs.soton.ac.uk/mb8/datascience/datascience.html or linked from the course wiki: https://secure.ecs.soton.ac.uk/noteswiki/w/COMP6235/1718#Slides

Online material for R
R is actually a full-featured programming language. We will mostly be using it as an advanced statistical calculator. In getting to know R better, there's a lot of online help:
● The R-project home page: http://www.r-project.org/
● Jason Noble's crash course: http://users.ecs.soton.ac.uk/jn2/simulation/introToR.html
● The "Quick-R" guide: http://www.statmethods.net/stats/index.html
● Intro to plotting in R: http://www.people.carleton.edu/~lchihara/Splus/RPlot.pdf
● The ggplot2 package for potentially nicer looking graphs: http://had.co.nz/ggplot2/

Online materials by other people
We are also going to use various online supplements. There are some very high quality stats materials on the net, and I won't pretend I can do better than all of them.
● Khan Academy: http://www.khanacademy.org/
● David M. Lane's online course notes: http://davidmlane.com/hyperstat/index.html and http://onlinestatbook.com/2/index.html

An aside
If anyone needs to build their enthusiasm for statistics and its uses, try listening to the amazing Hans Rosling for a while: http://www.gapminder.org/videos/the-joy-of-stats/

What is statistics all about?
● There is an important distinction between the related areas of statistics and probability.
● Probability says "I've got a data generating process" (e.g., throwing two dice and adding the result), "now tell me what sorts of outcomes I can expect from this data generating process."
● Statistics is the inverse. Statistics says "I have some outcomes" (i.e., data), "now what can I infer about the process that generated them?"
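
The two directions can be sketched with the two-dice example from the bullets above. This is only an illustrative sketch: the course uses R, whereas this is Python, and the seed and the number of rolls are arbitrary assumptions.

```python
from collections import Counter
from fractions import Fraction
from itertools import product
import random

# Probability: the data-generating process is known (two fair six-sided
# dice, summed), so the distribution of outcomes can be enumerated exactly.
counts = Counter(a + b for a, b in product(range(1, 7), repeat=2))
exact = {total: Fraction(count, 36) for total, count in sorted(counts.items())}

# Statistics: we only get to see outcomes (here, simulated rolls) and use
# them to estimate something about the process that generated them.
rng = random.Random(0)   # arbitrary fixed seed, for reproducibility only
n_rolls = 10_000         # arbitrary sample size
rolls = [rng.randint(1, 6) + rng.randint(1, 6) for _ in range(n_rolls)]
observed = Counter(rolls)
estimated = {total: observed[total] / n_rolls for total in exact}

for total in exact:
    print(f"sum={total:2d}  exact={float(exact[total]):.3f}  "
          f"estimated={estimated[total]:.3f}")
```

The `exact` column is what probability gives you from the known process; the `estimated` column is what you can recover from data alone, and it only approximates the exact values.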

What is statistics all about?
● We reason that the real world result either is or isn't close enough to the hypothetical one to make us suspect that the hypothetical world's data-generating process is actually a good description of the real one.
● Once you "get" that inferential strategy, all of statistics starts to make a lot more sense.
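
One way to make this inferential strategy concrete is a small simulation. The sketch below is an illustration under stated assumptions, not the method used in the course: the "real world" result (62 heads in 100 coin flips) is invented, the hypothetical world is a fair coin, and the code is Python although the course uses R.

```python
import random

# Invented "real world" result (an assumption for illustration only).
n_flips = 100
observed_heads = 62

# Hypothetical world: the coin is fair. Simulate that data-generating
# process many times to see what results it typically produces.
rng = random.Random(1)   # arbitrary fixed seed
n_worlds = 5_000         # arbitrary number of simulated experiments
simulated = [
    sum(rng.random() < 0.5 for _ in range(n_flips))
    for _ in range(n_worlds)
]

# How often does the hypothetical world give a result at least as far
# from the expected 50 heads as the real-world result?
expected = n_flips / 2
as_extreme = sum(
    abs(heads - expected) >= abs(observed_heads - expected)
    for heads in simulated
)
fraction = as_extreme / n_worlds
print(f"fraction of fair-coin worlds at least as extreme: {fraction:.3f}")
```

If that fraction is small, the fair-coin world rarely produces anything like the real result, which makes it a doubtful description of the real process. If it is large, the real result is "close enough" to the hypothetical one.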

What is statistics all about?
● Remember that some of statistics is convention: e.g., why are we so interested in the squared differences from the mean?
● Not everything is exact; there's often more than one way to do things.
● The statistical tests we favour might have been different if we'd had a different history of statistical development.
● There's a pragmatic rather than a pure spirit about statistical thinking.

Topics we will cover
● The R package
● Probability distributions and how to describe them (measures of central tendency and spread)
● Sampling and the central limit theorem
● The normal distribution, confidence intervals
● Correlation coefficients, R-squared values, and what they mean

Assessment
There is a stats coursework worth 15% of the marks of the overall module.
● I will give you a data set (see the course web page for details).
● I will ask you a couple of questions about the data set.
● You are supposed to handle the data set using R and write a report that answers the questions.
● The coursework is due on Nov 17 at 12:00 (noon); feedback will be given by email in December (around Dec 8).