Applied Statistics Projects and Data Analysis, Assignments of Sports Law

Detailed project descriptions and tasks for students in an applied statistics course. The projects involve data preprocessing, visualization, descriptive statistics, and modeling using various methods. The datasets include house prices, student performance, diets, flights, salaries, insurance, and supermarket sales. The goal is to gain insights and draw conclusions from the data.

Typology: Assignments

2021/2022

Uploaded on 04/13/2024

phuc-phan-3

Course: Applied Statistics
Projects
Bui Anh Tuan
June 7, 2023
Overview
Due: Session 12.
Details
The class will be divided into groups of 5 to 6 students. Each group will be assigned
a topic to study and present in class. The objective of this assessment is to encourage
students to do research in groups and communicate their results in an oral presentation.
Presentations should be created using PowerPoint and should address:
1. Overview of the dataset and why we would investigate this topic.
2. Basic insights from the data using plots and descriptive statistics.
3. Models and results.
4. Conclusion.
Presentations should generally not exceed 15 minutes, to allow time for questions and
discussion.
Marking criteria and standards
The presenters will be evaluated by the lecturer (50%) as well as the rest of the class (50%)
based on the following criteria:
i. Content: Is the presentation clear and focused? Does it cover all important content of
the assigned topic?
ii. Preparation: How well prepared is this group? How good are the slides and supporting
materials? How well does this group know their materials?
iii. Presentation and Communication: How well organized is the presentation? How
effectively does this group present, interact and involve the rest of the class? Does this
group use time effectively?
iv. Addressing questions: How effectively does this group deal with questions and
comments?
v. Interest and Creativity: How interesting and creative is this group's presentation?

Dataset:

The file "houseprice.csv" contains house sale prices for King County, which includes Seattle. It covers homes sold between May 2014 and May 2015. Besides the sale prices, the dataset provides details of each house that are helpful for determining its price. Use this dataset to build a regression model to predict the house price.

Main variables are:

• price: price of the house
• floors: number of floors
• condition: rating from 1 to 5 (from worst to best)
• view: rating from 0 to 4 (from worst to best)
• sqft_above: area of the house
• sqft_living: living area (includes land around the house)
• sqft_basement: area of the basement
• bedrooms: number of bedrooms

Tasks:

Part 1. Data Preprocessing

  1. Import data: houseprice.csv
  2. Data cleaning: remove all observations containing "NA" (missing data).
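
A minimal sketch of the import and cleaning steps above, using only Python's standard library. The inline sample, its four columns, and the choice to treat both empty cells and the literal string NA as missing are assumptions for illustration; the real houseprice.csv has more columns and rows.

```python
import csv
import io

# Hypothetical sample standing in for houseprice.csv (values are made up).
SAMPLE_CSV = """price,floors,condition,bedrooms
221900,1,3,3
538000,2,NA,3
180000,1,3,
604000,1,5,4
"""

# Markers treated as missing data (an assumption, adjust to the real file).
MISSING = {"", "NA"}


def load_complete_rows(handle):
    """Read CSV rows as dicts and keep only rows with no missing value."""
    reader = csv.DictReader(handle)
    return [
        row for row in reader
        if all(isinstance(value, str) and value.strip() not in MISSING
               for value in row.values())
    ]


rows = load_complete_rows(io.StringIO(SAMPLE_CSV))
print(len(rows), "complete rows kept")
```

With the real file, the same function could be called as `with open("houseprice.csv", newline="") as f: rows = load_complete_rows(f)`.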

Part 2. Visualization and Descriptive Statistics

  1. Data visualization: choose some suitable plots (boxplot, scatter plot, ...) and try to get some basic insights into the data.
  2. Descriptive statistics of the price: mean, median, ... Any insights?
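
One possible way to compute the descriptive statistics asked for above, sketched with Python's `statistics` module. The price values are hypothetical and not taken from houseprice.csv; the `describe` helper is an illustrative name.

```python
import statistics

# Hypothetical prices for illustration only, not taken from houseprice.csv.
prices = [221900, 538000, 180000, 604000, 400000, 1225000, 257500]


def describe(values):
    """Return common descriptive statistics for a list of numbers."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "q1": q1,
        "q3": q3,
        "max": max(values),
    }


summary = describe(prices)
for name, value in summary.items():
    print(f"{name:>6}: {value:,.1f}")
```

In this made-up sample the mean lies above the median, which is the usual sign of a right-skewed distribution: a few expensive houses pull the mean up.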

Part 3. Models and Analyzing data

  1. Build a linear regression model to evaluate which factors affect the price of a house.
  2. Given the details of a chosen house, predict its price.
  3. Any other interesting insights based on the data? (Choose your own methods.)
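
The regression and prediction tasks above can be sketched as an ordinary least-squares fit solved through the normal equations. This is an illustration in plain Python, not the method the course prescribes. The six houses are synthetic: their prices were generated from the formula in the comment so that the fit can be checked, and the choice of sqft_living and bedrooms as predictors is an assumption.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]  # fails if the matrix is singular
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_ols(features, target):
    """Least-squares coefficients [intercept, b1, b2, ...]."""
    x = [[1.0] + list(row) for row in features]  # prepend intercept column
    k = len(x[0])
    xtx = [[sum(r[i] * r[j] for r in x) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(x, target)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict(coefs, row):
    """Predicted target for one observation given fitted coefficients."""
    return coefs[0] + sum(c * v for c, v in zip(coefs[1:], row))


# Synthetic data: price = 50000 + 200 * sqft_living + 10000 * bedrooms.
houses = [(1000, 2), (1500, 3), (2000, 3), (2500, 4), (1200, 2), (1800, 4)]
prices = [50000 + 200 * sqft + 10000 * beds for sqft, beds in houses]

coefs = fit_ols(houses, prices)
print("intercept, sqft_living, bedrooms:", [round(c, 2) for c in coefs])
print("predicted price for (1600 sqft, 3 bedrooms):",
      round(predict(coefs, (1600, 3))))
```

Real data will not lie exactly on a plane, so the coefficients should be read together with a measure of fit such as R-squared and with residual plots.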

Dataset:

The dataset "Diet.csv" contains information on 78 people who each undertook one of three diets. There is background information such as age, gender (Female=0, Male=1) and height. The aim of the study was to see which diet was best for losing weight. It was also thought that the best diet might differ between males and females, so the independent variables are diet and gender.

Main variables are:

• Person: index of the participant
• gender: gender of the participant (Female=0, Male=1)
• Age: age of the participant
• Height: height of the participant
• pre:weight: weight before the diet
• Diet: type of diet (1, 2 or 3)
• weight6weeks: weight after 6 weeks on the chosen diet

Tasks:

Part 1. Data Preprocessing

  1. Import data: Diet.csv
  2. Data cleaning: remove all observations containing "NA" (missing data).

Part 2. Visualization and Descriptive Statistics

  1. Data visualization: choose some suitable plots (boxplot, scatter plot, ...) and try to get some basic insights into the data.
  2. Descriptive statistics of the variables: mean, median, ... Any insights?

Part 3. Models and Analyzing data

  1. Use one-factor ANOVA to see which diet was best for losing weight.
  2. You may divide the dataset into two subsets, one for males and one for females, to see whether the best diet differs between them.
  3. Any other interesting insights based on the data? (Choose your own methods.)
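
A hedged sketch of the one-factor ANOVA in the tasks above, applied to weight loss (weight before the diet minus weight after 6 weeks). The twelve records are invented for illustration, and the tuple layout is an assumption rather than the layout of Diet.csv. The code computes only the F statistic and its degrees of freedom; a p-value would additionally need the F distribution, which the Python standard library does not provide.

```python
from collections import defaultdict
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-factor ANOVA."""
    all_values = [v for g in groups for v in g]
    grand = mean(all_values)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


def weight_loss_by_diet(records):
    """Map each diet to its list of weight losses (before minus after)."""
    losses = defaultdict(list)
    for _gender, diet, pre_weight, weight6weeks in records:
        losses[diet].append(pre_weight - weight6weeks)
    return dict(losses)


# Hypothetical records: (gender, diet, weight before, weight after 6 weeks),
# with gender coded Female=0, Male=1 as in the dataset description.
records = [
    (0, 1, 60, 58.0), (0, 1, 65, 62.5), (0, 2, 62, 59.5), (0, 2, 70, 68.5),
    (0, 3, 58, 53.0), (0, 3, 66, 61.5), (1, 1, 80, 77.0), (1, 1, 85, 81.5),
    (1, 2, 78, 75.0), (1, 2, 90, 86.0), (1, 3, 82, 78.0), (1, 3, 88, 83.5),
]


def report(label, subset):
    """Print the F statistic and the mean loss per diet for a subset."""
    by_diet = weight_loss_by_diet(subset)
    f_stat, df_b, df_w = one_way_anova_f(list(by_diet.values()))
    means = {diet: round(mean(v), 2) for diet, v in sorted(by_diet.items())}
    print(f"{label}: F({df_b}, {df_w}) = {f_stat:.2f}, mean loss = {means}")


report("all", records)
report("female", [r for r in records if r[0] == 0])
report("male", [r for r in records if r[0] == 1])
```

The F statistic only indicates whether the diets differ at all; deciding which diet is best still means comparing the group means, ideally with a follow-up pairwise comparison.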

Dataset:

The dataset "flights.csv" contains information about all flights that departed from the two major airports of the Pacific Northwest (PNW), SEA in Seattle and PDX in Portland, in 2014: 162,049 flights in total. The main goal of the project is to use this dataset to find the major factors that cause flights to be delayed.

Main variables are:

• year, month, day: date of departure
• carrier: two-letter carrier abbreviation. See airlines to get the name.
• origin, dest: origin and destination. See airports for additional metadata.
• dep_delay, arr_delay: departure and arrival delays, in minutes. Negative times represent early departures/arrivals.
• dep_time, arr_time: actual departure and arrival times (format HHMM or HMM), in the local time zone.
• distance: distance between airports, in miles.
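
The HHMM or HMM encoding of the departure and arrival times in the list above is easy to misread as a plain number. A small sketch of one way to convert such a value to minutes after midnight follows; the function name is hypothetical, and accepting 2400 as midnight is a precaution rather than something the dataset description states.

```python
def hhmm_to_minutes(value):
    """Convert a clock time written as HHMM or HMM to minutes after midnight.

    Examples of the encoding: 517 means 05:17 and 1430 means 14:30.
    """
    number = int(value)
    hours, minutes = divmod(number, 100)
    # 2400 is accepted as midnight (an assumption about how data may be coded).
    valid = (0 <= hours < 24 and 0 <= minutes < 60) or number == 2400
    if not valid:
        raise ValueError(f"not a valid HHMM time: {value!r}")
    return hours * 60 + minutes


for raw in (5, 517, "1430"):
    print(raw, "->", hhmm_to_minutes(raw), "minutes after midnight")
```

Because the times are local, subtracting a converted departure time from a converted arrival time does not give the flight duration when origin and destination lie in different time zones.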

Tasks:

Part 1. Data Preprocessing

  1. Import data: flights.csv
  2. Data cleaning: remove all observations containing "NA" (missing data).

Part 2. Visualization and Descriptive Statistics

  1. Data visualization: choose some suitable plots (boxplot, scatter plot, ...) and try to get some basic insights into the data.
  2. Descriptive statistics of arr_delay: mean, median, ... Any insights?
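
Delay data usually contain a few very large values, so it can help to see the numbers a boxplot is built from. The sketch below computes the quartiles, the whisker ends, and the points beyond the common 1.5 × IQR fences; the delays are hypothetical, and the "inclusive" quantile method is one of several conventions.

```python
import statistics


def boxplot_stats(values):
    """Quartiles, whisker ends, and points outside the 1.5 * IQR fences."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [v for v in data if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": inside[0],    # smallest value within the fences
        "whisker_high": inside[-1],  # largest value within the fences
        "outliers": [v for v in data if v < low_fence or v > high_fence],
    }


# Hypothetical arrival delays in minutes (negative = early arrival).
arr_delay = [-12, -8, -5, -3, 0, 2, 4, 7, 15, 180]

stats = boxplot_stats(arr_delay)
for name, value in stats.items():
    print(f"{name:>12}: {value}")
```

A single long delay such as the 180-minute value here barely moves the median but strongly affects the mean, which is worth keeping in mind when comparing the two.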

Part 3. Models and Analyzing data

  1. Use one-factor ANOVA to evaluate the differences in delay time between airlines.
  2. Based on your analysis, which carrier(s) tend to be delayed more than the others?
  3. Any other interesting insights based on the data? (Choose your own methods.)
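
To answer which carriers tend to be delayed more, the ANOVA result is typically read alongside a per-carrier summary. A minimal sketch of such a summary follows; the carrier codes and delays are made up for illustration.

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical (carrier, arr_delay) pairs; codes and values are illustrative.
flights = [
    ("AS", -5), ("AS", 3), ("AS", 10),
    ("UA", 25), ("UA", 40), ("UA", -2),
    ("WN", 0), ("WN", 12),
]


def delay_by_carrier(records):
    """Return (carrier, mean, median, n) tuples, largest mean delay first."""
    groups = defaultdict(list)
    for carrier, delay in records:
        groups[carrier].append(delay)
    table = [
        (carrier, mean(delays), median(delays), len(delays))
        for carrier, delays in groups.items()
    ]
    return sorted(table, key=lambda row: row[1], reverse=True)


ranking = delay_by_carrier(flights)
for carrier, avg, med, n in ranking:
    print(f"{carrier}: mean {avg:6.2f} min, median {med:6.2f} min, n = {n}")
```

Carriers with very few flights can top such a ranking by chance, so the group sizes are reported next to the means.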

Dataset:

The dataset "insurance.csv" consists of 1338 records of insurance contracts. The aim of this project is to build a model to predict insurance costs.

Main variables are:

• age: age of the primary beneficiary
• sex: gender of the insurance contractor (female, male)
• bmi: body mass index (kg/m^2), an objective index of body weight computed as weight divided by the square of height; it indicates whether weight is high or low relative to height. Ideally 18.5 to 24.
• children: number of children covered by health insurance (number of dependents)
• smoker: smoking or non-smoking
• region: the beneficiary's residential area in the US (northeast, southeast, southwest, northwest)
• charges: individual medical costs billed by health insurance

Tasks:

Part 1. Data Preprocessing

  1. Import data: insurance.csv
  2. Data cleaning: remove all observations containing "NA" (missing data), if any.

Part 2. Visualization and Descriptive Statistics

  1. Data visualization: choose some suitable plots (boxplot, scatter plot, ...) and try to get some basic insights into the data.
  2. Descriptive statistics of the variables: mean, median, ... Any insights?

Part 3. Models and Analyzing data

  1. Build a linear regression model to evaluate which factors affect the insurance charges.
  2. Give an example of a contractor and then predict the insurance charge.
  3. Any other interesting insights based on the data? (Choose your own methods.)
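
Several predictors in this dataset (sex, smoker, region) are categorical, so a linear regression needs them as numeric columns first. The sketch below shows one common approach, 0/1 indicator (dummy) variables with one baseline level dropped per variable. The four contractors are invented, the "yes"/"no" coding of smoker is an assumption, and the baseline choices are arbitrary.

```python
# Hypothetical contractors; the "yes"/"no" coding of smoker is an assumption.
contractors = [
    {"age": 19, "sex": "female", "bmi": 27.9, "children": 0,
     "smoker": "yes", "region": "southwest"},
    {"age": 33, "sex": "male", "bmi": 22.7, "children": 0,
     "smoker": "no", "region": "northwest"},
    {"age": 45, "sex": "female", "bmi": 30.1, "children": 2,
     "smoker": "no", "region": "southeast"},
    {"age": 52, "sex": "male", "bmi": 25.3, "children": 1,
     "smoker": "yes", "region": "northeast"},
]


def dummy_encode(rows, column, baseline):
    """Return (names, encoded rows) of 0/1 indicators, baseline level dropped."""
    levels = sorted({row[column] for row in rows} - {baseline})
    names = [f"{column}_{level}" for level in levels]
    encoded = [
        [1 if row[column] == level else 0 for level in levels] for row in rows
    ]
    return names, encoded


def design_matrix(rows):
    """Combine numeric columns with dummy-coded categorical columns."""
    numeric = ["age", "bmi", "children"]
    names = list(numeric)
    matrix = [[row[c] for c in numeric] for row in rows]
    categorical = [("sex", "female"), ("smoker", "no"), ("region", "northeast")]
    for column, baseline in categorical:
        new_names, encoded = dummy_encode(rows, column, baseline)
        names += new_names
        matrix = [m + e for m, e in zip(matrix, encoded)]
    return names, matrix


names, matrix = design_matrix(contractors)
print(names)
for row in matrix:
    print(row)
```

Each dummy coefficient in the fitted model is then read relative to its baseline, for example the expected difference in charges between smokers and non-smokers with the other variables held fixed.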

Dataset:

The dataset "supermarket_sales.csv" contains the historical sales of a supermarket company, recorded in 3 different branches over 3 months. The aim of this project is to investigate customer satisfaction, based on the ratings, in the different branches.

Main variables are:

• Invoice id: computer-generated sales slip invoice identification number
• Branch: branch of the supercenter (3 branches, identified by A, B and C)
• Customer type: type of customer, recorded as Members for customers using a member card and Normal for customers without one
• Product line: general item categorization groups (Electronic accessories, Fashion accessories, Food and beverages, Health and beauty, Home and lifestyle, Sports and travel)
• Unit price: price of each product in US dollars
• Quantity: number of products purchased by the customer
• Total: total price including tax
• Rating: customer satisfaction rating of their overall shopping experience (on a scale of 1 to 10)

Tasks:

Part 1. Data Preprocessing

  1. Import data: supermarket_sales.csv
  2. Data cleaning: remove all observations containing "NA" (missing data), if any.

Part 2. Visualization and Descriptive Statistics

  1. Data visualization: choose some suitable plots (boxplot, scatter plot, ...) and try to get some basic insights into the data.
  2. Descriptive statistics of the variables: mean, median, ... Any insights?

Part 3. Models and Analyzing data

  1. Use one-factor ANOVA to evaluate the differences in customer satisfaction (based on ratings) between the 3 branches.
  2. Based on your analysis, which branch tends to have higher customer satisfaction?
  3. Any other interesting insights based on the data? (Choose your own methods.)
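
The classical ANOVA p-value comes from the F distribution. As an alternative that needs only the standard library, the sketch below estimates a p-value with a permutation test on the same F statistic: ratings are shuffled across branches many times and the observed F is compared with the shuffled ones. This is a swapped-in technique shown for illustration, not the procedure the task prescribes, and the ratings are invented.

```python
import random
from statistics import mean


def f_statistic(groups):
    """One-factor ANOVA F statistic for a list of groups."""
    values = [v for g in groups for v in g]
    grand = mean(values)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between, df_within = len(groups) - 1, len(values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


def permutation_p_value(groups, n_permutations=2000, seed=0):
    """Return (observed F, permutation p-value) for the given groups."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    observed = f_statistic(groups)
    pooled = [v for g in groups for v in g]
    sizes = [len(g) for g in groups]
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        shuffled, start = [], 0
        for size in sizes:  # cut the shuffled pool back into same-size groups
            shuffled.append(pooled[start:start + size])
            start += size
        if f_statistic(shuffled) >= observed:
            extreme += 1
    # Add one to numerator and denominator so the estimate is never zero.
    return observed, (extreme + 1) / (n_permutations + 1)


# Hypothetical ratings (scale 1 to 10) for branches A, B and C.
ratings = {
    "A": [7.0, 6.5, 8.1, 5.9, 7.4, 6.8],
    "B": [6.9, 7.2, 6.1, 7.8, 6.4, 7.0],
    "C": [7.3, 6.6, 7.9, 6.2, 7.1, 6.7],
}

f_obs, p_value = permutation_p_value(list(ratings.values()))
print(f"F = {f_obs:.3f}, permutation p-value = {p_value:.3f}")
```

A large p-value means the observed differences between branch means are compatible with chance; only with a small p-value is it meaningful to go on and ask which branch scores higher.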

Dataset:

The dataset "OnlineNewsPopularity.xlsx" summarizes a heterogeneous set of features about articles published by Mashable over a period of two years. The goal is to predict the number of shares in social networks (popularity).

Main variables are:

• n_tokens_title: number of words in the title
• n_tokens_content: number of words in the content
• num_hrefs: number of links
• num_imgs: number of images
• num_videos: number of videos
• data_channel
• weekday
• global_subjectivity: text subjectivity
• global_rate_positive_words: rate of positive words in the content
• global_rate_negative_words: rate of negative words in the content
• shares: number of shares (target)

Tasks:

Part 1. Data Preprocessing

  1. Import data: OnlineNewsPopularity.xlsx
  2. Data cleaning: remove all observations containing "NA" (missing data), if any.

Part 2. Visualization and Descriptive Statistics

  1. Data visualization: choose some suitable plots (boxplot, scatter plot, ...) and try to get some basic insights into the data.
  2. Descriptive statistics of the number of shares: mean, median, ... Any insights?

Part 3. Models and Analyzing data

  1. Build a linear regression model to evaluate which factors affect the number of shares.
  2. Give an example of an article and then predict its number of shares.
  3. Any other interesting insights based on the data? (Choose your own methods.)
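
Before fitting the regression, a quick screen of how strongly each numeric feature is linearly associated with the number of shares can guide which factors to look at first. The sketch below ranks features by the absolute Pearson correlation with shares. The six articles are invented, only three of the listed features are used, and a correlation screen looks at one feature at a time, so it does not replace the regression model.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)  # fails if either sequence is constant


def rank_predictors(table, target):
    """Sort features by absolute correlation with the target, strongest first."""
    scores = {
        name: pearson(column, table[target])
        for name, column in table.items()
        if name != target
    }
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Hypothetical articles, one list per column (values are made up).
articles = {
    "n_tokens_title": [12, 9, 10, 13, 8, 11],
    "num_imgs": [1, 3, 0, 10, 2, 5],
    "num_videos": [0, 1, 0, 2, 0, 1],
    "shares": [593, 1200, 711, 5600, 900, 2300],
}

for name, r in rank_predictors(articles, "shares"):
    print(f"{name:>15}: r = {r:+.3f}")
```

Share counts are typically dominated by a few very popular articles, so it may be worth repeating the screen on a log-transformed target before drawing conclusions.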
