Statistics in Foundations of Data Science: Understanding Data and Making Inferences with R, Summaries of Statistics

An introduction to the role of statistics in data science, from the COMP6235 Foundations of Data Science course taught by Markus Brede at the University of Southampton. It covers the importance of statistics in data science, including data description, comparison, and figuring out what is special about given data. It also introduces the R package as a tool for statistical analysis and provides resources for learning R. The material is intended to give students a basic understanding of statistics and familiarize them with R, without requiring memorization of every technique.

Typology: Summaries

2021/2022

Uploaded on 09/27/2022

Statistics part of COMP6235
Opening lecture
Markus Brede, mb8@ecs.soton.ac.uk
(some material used here was developed by Jason Noble)

Some background
● Why do you need statistics in a Foundations of Data Science course? Several aims of data science are facilitated by statistics:
  ● Describing given data: we will discuss distributions, mean/median values, variance, and co-dependencies between data sets.
  ● Figuring out what is special about given data: this needs comparison to reference models, which are often at the basis of statistics, e.g. are data more or less varied than expected?
  ● Comparing data: is one set of data different from another?

Aims of the stats part
We want you to achieve the following:
● Give you some idea of what stats is about and familiarise you with some basic tools you might find useful for data science (but this is only a starting point!)
● Familiarise you with the R package, a very common stats package
● We don't need you to have memorized every technique, but we want you to know what you need to look up and why.
● The approach of the lectures is very experimental and hands-on; the aim is to enable you to use basic stats.

Where to get module information
● I have previously taught a stats module for PhD students. If you are keen to learn more, the old course information can be found here: http://users.ecs.soton.ac.uk/mb8/stats/stats.html
● The current slides and coursework information are available here: http://users.ecs.soton.ac.uk/mb8/datascience/datascience.html and are also linked from the course wiki: https://secure.ecs.soton.ac.uk/noteswiki/w/COMP6235/1718#Slides

Online material for R
R is actually a full-featured programming language. We will mostly be using it as an advanced statistical calculator. In getting to know R better, there's a lot of online help:
● The R-project home page: http://www.r-project.org/
● Jason Noble's crash course: http://users.ecs.soton.ac.uk/jn2/simulation/introToR.html
● The "Quick-R" guide: http://www.statmethods.net/stats/index.html
● Intro to plotting in R: http://www.people.carleton.edu/~lchihara/Splus/RPlot.pdf
● The ggplot2 package for potentially nicer looking graphs: http://had.co.nz/ggplot2/
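To give a flavour of R as a "statistical calculator", here is a minimal sketch using only base R functions. The vectors `x` and `y` are made-up numbers for illustration and are not taken from the course material.

```r
# Made-up measurements, purely for illustration.
x <- c(2, 4, 4, 5, 7, 9)
y <- c(1, 3, 5, 4, 8, 9)

mean(x)      # arithmetic mean
median(x)    # middle value of the sorted data
var(x)       # sample variance (divides by n - 1)
sd(x)        # sample standard deviation, i.e. sqrt(var(x))
cor(x, y)    # Pearson correlation between the two vectors

# The sample variance written out by hand:
# sum of squared differences from the mean, divided by n - 1.
manual_var <- sum((x - mean(x))^2) / (length(x) - 1)
all.equal(manual_var, var(x))
```

Typing these lines at the R prompt prints each result directly, which is how R is typically used interactively.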

Online materials by other people
We are also going to use various online supplements. There are some very high quality stats materials on the net, and I won't pretend I can do better than all of them.
● Khan Academy: http://www.khanacademy.org/
● David M. Lane's online course notes: http://davidmlane.com/hyperstat/index.html and http://onlinestatbook.com/2/index.html

An aside
If anyone needs to build their enthusiasm for statistics and its uses, try listening to the amazing Hans Rosling for a while: http://www.gapminder.org/videos/the-joy-of-stats/

What is statistics all about?
● Important distinction between the related areas of statistics and probability.
● Probability says "I've got a data generating process" (e.g., throwing two dice and adding the result), "now tell me what sorts of outcomes I can expect from this data generating process."
● Statistics is the inverse. Statistics says "I have some outcomes" (i.e., data), "now what can I infer about the process that generated them?"
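The probability direction can be sketched in R for the two-dice example. Assuming both dice are fair and six-sided (an assumption the slide does not state explicitly), every pair of faces is equally likely, so the distribution of the total can be enumerated exactly:

```r
# Known data-generating process: two fair six-sided dice, totals added.
# (Fairness is an assumption made for this illustration.)
sums <- outer(1:6, 1:6, "+")          # 6 x 6 matrix of all 36 equally likely totals
dist <- table(sums) / length(sums)    # probability of each total from 2 to 12

print(round(dist, 3))                 # 7 is the most likely total
```

Statistics runs the other way: starting from a column of observed totals, one asks whether a process like this one could plausibly have produced them.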

What is statistics all about?
● We reason that the real world result either is or isn't close enough to the hypothetical one to make us suspect that the hypothetical world's data-generating process is actually a good description of the real one.
● Once you "get" that inferential strategy, all of statistics starts to make a lot more sense.
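One way to make this inferential strategy concrete is by simulation. The sketch below is only an illustration, not necessarily the method used in the course: the "observed" mean of 7.9, the sample size of 50 throws, and the helper `simulate_mean` are all invented for the example. It builds a hypothetical world of two fair dice and checks how often that world produces a result as far from its expected mean of 7 as the observed one.

```r
set.seed(1)  # fixed seed so the sketch is reproducible

# Invented "real world" result: mean total over 50 throws of two dice.
observed_mean <- 7.9
n_throws <- 50

# Hypothetical world: two fair dice thrown n times, returning the mean total.
simulate_mean <- function(n) {
  mean(sample(1:6, n, replace = TRUE) + sample(1:6, n, replace = TRUE))
}

# Repeat the hypothetical experiment many times.
null_means <- replicate(10000, simulate_mean(n_throws))

# Fraction of hypothetical results at least as far from 7 as the observed one.
p_extreme <- mean(abs(null_means - 7) >= abs(observed_mean - 7))
p_extreme
```

A small fraction suggests the fair-dice world is a poor description of the process behind the observed data; a large fraction means the data give no reason to doubt it.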

What is statistics all about?
● Remember that some of statistics is convention: e.g., why are we so interested in the squared differences from the mean?
● Not everything is exact; there's often more than one way to do things.
● The statistical tests we favour might have been different if we'd had a different history of statistical development.
● There's a pragmatic rather than a pure spirit about statistical thinking.

Topics we will cover
● The R package
● Probability distributions and how to describe them (measures of central tendency and spread)
● Sampling and the central limit theorem
● The normal distribution, confidence intervals
● Correlation coefficients, R-squared values, and what they mean

Assessment
There is a stats coursework worth 15% of the marks of the overall module.
● I will give you a data set (cf. course web page for details).
● I will ask you a couple of questions about the data set.
● You are supposed to handle the data set using R and write a report that answers the questions.
● The coursework is due on Nov 17, 12 (noon); feedback by email in December (around Dec 8).