






Detailed project descriptions and tasks for students in an applied statistics course. The projects involve data preprocessing, visualization, descriptive statistics, and modeling using various methods. The datasets include house prices, student performance, diets, flights, salaries, insurance, and supermarket sales. The goal is to gain insights and draw conclusions from the data.
Due: Session 12.
The class will be divided into groups. Each group of 5 to 6 students will be assigned a topic to study and present in class. The objective of this assessment is to encourage students to do research in groups and to communicate their results in an oral presentation. The presentation should be created using PowerPoint and should address:
Presentations should generally not exceed 15 minutes, to allow time for questions and discussion.
The presenters will be evaluated by the lecturer (50%) and by the rest of the class (50%) based on the following criteria:
i. Content: Is the presentation clear and focused? Does it cover all the important content of the assigned topic?
ii. Preparation: How well prepared is the group? How good are the slides and supporting materials? How well does the group know its material?
iii. Presentation and Communication: How well organized is the presentation? How effectively does the group present, interact with and involve the rest of the class? Does the group use its time effectively?
iv. Addressing questions: How effectively does the group deal with questions and comments?
v. Interest and Creativity: How interesting and creative is the group's presentation?
The file "houseprice.csv" contains house sale prices for King County, which includes Seattle. It includes homes sold between May 2014 and May 2015. Besides the house prices, the dataset also provides details of the houses that are helpful for determining the house price. Use this dataset to build a regression model to predict the house price.
Main variables are:
• price: price of the house
• floors: number of floors
• condition: rating from 1 to 5 (from worst to best)
• view: rating from 0 to 4 (from worst to best)
• sqft_above: area of the house
• sqft_living: living area (includes land around the house)
• sqft_basement: area of the basement
• bedrooms: number of bedrooms
Part 1. Data Preprocessing
Part 2. Visualization and Descriptive Statistics
Part 3. Models and Analyzing data
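As a starting point for the modeling part, the sketch below fits a least-squares line with a single predictor. It is only an illustration under stated assumptions: the column name sqft_living and the five data points are made up for the example and are not taken from houseprice.csv, and the helper names fit_simple_regression and r_squared are hypothetical. A full model would use several predictors and a statistics package.

```python
from statistics import mean


def fit_simple_regression(x, y):
    """Least-squares fit of y = intercept + slope * x for one predictor."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x must not be constant")
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def r_squared(x, y, intercept, slope):
    """Share of the variance of y explained by the fitted line."""
    y_bar = mean(y)
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


# Made-up illustrative values, NOT taken from houseprice.csv.
sqft_living = [1000, 1500, 2000, 2500, 3000]
price = [200000, 290000, 410000, 500000, 600000]

intercept, slope = fit_simple_regression(sqft_living, price)
print(f"price ~ {intercept:.0f} + {slope:.1f} * sqft_living")
print(f"R^2 = {r_squared(sqft_living, price, intercept, slope):.3f}")
```

The slope is read as the estimated change in price per additional square foot of living area, holding nothing else fixed since no other variable is in the model.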
The dataset "Diet.csv" contains information on 78 people who undertook one of three diets. There is background information such as age, gender (Female=0, Male=1) and height. The aim of the study was to see which diet was best for losing weight. It was also thought that the best diets for males and females may be different, so the independent variables are diet and gender.
Main variables are:
• Person: index of the participant
• gender: gender of the participant (Female=0, Male=1)
• Age:
• Height:
• pre.weight: weight before the diet
• Diet: type of diet (1, 2 or 3)
• weight6weeks: weight after 6 weeks on the chosen diet
Part 1. Data Preprocessing
Part 2. Visualization and Descriptive Statistics
Part 3. Models and Analyzing data
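Since diet and gender are the two independent variables, a natural first summary is the mean weight loss in each diet-by-gender group. The sketch below is a minimal illustration: the rows are made up, the key "pre.weight" is an assumption about how the header is spelled in Diet.csv, and mean_loss_by_group is a hypothetical helper name.

```python
from collections import defaultdict
from statistics import mean


def mean_loss_by_group(rows):
    """Return {(diet, gender): mean weight loss} for a list of row dicts."""
    losses = defaultdict(list)
    for row in rows:
        # Weight loss = weight before the diet minus weight after 6 weeks.
        loss = row["pre.weight"] - row["weight6weeks"]
        losses[(row["Diet"], row["gender"])].append(loss)
    return {group: mean(values) for group, values in sorted(losses.items())}


# Made-up rows, NOT taken from Diet.csv (gender: 0 = female, 1 = male).
rows = [
    {"gender": 0, "Diet": 1, "pre.weight": 60.0, "weight6weeks": 57.0},
    {"gender": 0, "Diet": 1, "pre.weight": 70.0, "weight6weeks": 65.0},
    {"gender": 1, "Diet": 1, "pre.weight": 80.0, "weight6weeks": 78.0},
    {"gender": 0, "Diet": 2, "pre.weight": 65.0, "weight6weeks": 61.0},
    {"gender": 1, "Diet": 3, "pre.weight": 90.0, "weight6weeks": 84.0},
]

for (diet, gender), loss in mean_loss_by_group(rows).items():
    label = "male" if gender == 1 else "female"
    print(f"Diet {diet}, {label}: mean weight loss {loss:.1f}")
```

Comparing these group means is only descriptive; whether the differences are statistically meaningful is a question for the modeling part.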
The dataset "flights.csv" contains information about all flights that departed from the two major airports of the Pacific Northwest (PNW), SEA in Seattle and PDX in Portland, in 2014: 162,049 flights in total. The main goal of the project is to use this dataset to identify the major factors that cause flights to be delayed.
Main variables are:
• year, month, day: date of departure
• carrier: two-letter carrier abbreviation. See airlines to get the name.
• origin, dest: origin and destination. See airports for additional metadata.
• dep_delay, arr_delay: departure and arrival delays, in minutes. Negative times represent early departures/arrivals.
• dep_time, arr_time: actual departure and arrival times (format HHMM or HMM), in the local time zone.
• distance: distance between airports, in miles.
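Because the departure and arrival times are stored as HHMM or HMM numbers, they cannot be subtracted or averaged directly. The sketch below converts such a value to minutes after midnight and flags delayed flights. The function names are hypothetical, the 15-minute threshold is an assumption chosen for illustration, and treating 2400 as midnight at the end of the day is also an assumption about how the file encodes that case.

```python
def hhmm_to_minutes(value):
    """Convert a clock time stored as HHMM or HMM (e.g. 1425 or 905)
    to minutes after midnight."""
    total = int(value)
    if total == 2400:  # assumed encoding of midnight at the end of the day
        return 24 * 60
    hours, minutes = divmod(total, 100)
    if not (0 <= hours <= 23 and 0 <= minutes <= 59):
        raise ValueError(f"not a valid HHMM time: {value!r}")
    return hours * 60 + minutes


def is_delayed(dep_delay, threshold=15):
    """Flag a flight as delayed when its departure delay (in minutes)
    exceeds the threshold. Negative delays are early departures."""
    return dep_delay > threshold


for dep_time in (905, 1425, 5):
    print(dep_time, "->", hhmm_to_minutes(dep_time), "minutes after midnight")
print(is_delayed(-3), is_delayed(20))
```

Working in minutes after midnight makes it easy to group flights by hour of day when looking for factors related to delays.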
Part 1. Data Preprocessing
Part 2. Visualization and Descriptive Statistics
Part 3. Models and Analyzing data
The dataset "insurance.csv" consists of 1338 records of insurance contracts. The aim of this project is to build a model to predict the insurance costs.
Main variables are:
• age: age of the primary beneficiary
• sex: gender of the insurance contractor (female, male)
• bmi: body mass index, an objective index of body weight relative to height (kg/m^2), computed as weight divided by height squared; ideally 18.5 to 24.9
• children: number of children covered by health insurance / number of dependents
• smoker: whether the beneficiary smokes
• region: the beneficiary's residential area in the US (northeast, southeast, southwest, northwest)
• charges: individual medical costs billed by health insurance
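Several of these variables (sex, smoker, region) are categorical, so they need to be turned into numbers before they can enter a regression model. One common approach is dummy coding, sketched below. The helper name dummy_code is hypothetical, and the level labels "yes" and "no" for smoker are assumptions about how insurance.csv records that variable.

```python
def dummy_code(values, baseline):
    """Return one 0/1 indicator column per level other than `baseline`.

    The baseline level is left out so that the indicators are not
    redundant with the intercept of a regression model.
    """
    if baseline not in values:
        raise ValueError(f"baseline {baseline!r} does not occur in the data")
    levels = sorted(set(values) - {baseline})
    return {level: [1 if v == level else 0 for v in values] for level in levels}


# Made-up values, NOT taken from insurance.csv.
smoker = ["yes", "no", "no", "yes"]
region = ["southwest", "southeast", "northwest", "northeast"]

print(dummy_code(smoker, baseline="no"))
print(dummy_code(region, baseline="northeast"))
```

With this coding, each coefficient is interpreted as the difference in expected charges between that level and the baseline level.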
Part 1. Data Preprocessing
Part 2. Visualization and Descriptive Statistics
Part 3. Models and Analyzing data
The dataset "supermarket sales.csv" contains the historical sales of a supermarket company, recorded at 3 different branches over 3 months. The aim of this project is to investigate customer satisfaction based on the ratings in the different branches.
Main variables are:
• Invoice id: computer-generated sales slip invoice identification number
• Branch: branch of the supercenter (3 branches are available, identified by A, B and C)
• Customer type: type of customer, recorded as Members for customers using a member card and Normal for those without a member card
• Product line: general item categorization groups (Electronic accessories, Fashion accessories, Food and beverages, Health and beauty, Home and lifestyle, Sports and travel)
• Unit price: price of each product in US dollars
• Quantity: number of products purchased by the customer
• Total: total price including tax
• Rating: customer satisfaction rating of the overall shopping experience (on a scale of 1 to 10)
Part 1. Data Preprocessing
Part 2. Visualization and Descriptive Statistics
Part 3. Models and Analyzing data
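One possible way to compare ratings across the three branches is a one-way analysis of variance. The sketch below computes only the F statistic, under the assumption that this is the method chosen; the ratings are made up, and one_way_anova_f is a hypothetical helper name. A p-value would additionally require the F distribution with (k - 1, n - k) degrees of freedom from a statistics package.

```python
from statistics import mean


def one_way_anova_f(groups):
    """F statistic of a one-way ANOVA for a list of groups of observations."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    if k < 2 or n <= k:
        raise ValueError("need at least 2 groups and more observations than groups")
    grand_mean = sum(sum(g) for g in groups) / n
    # Between-group sum of squares: spread of the group means.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: spread inside each group.
    ss_within = sum(sum((x - mean(g)) ** 2 for x in g) for g in groups)
    if ss_within == 0:
        raise ValueError("no variation within groups")
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Made-up ratings, NOT taken from the supermarket sales file.
ratings = {"A": [6, 7, 8], "B": [7, 8, 9], "C": [8, 9, 10]}

f_stat = one_way_anova_f(list(ratings.values()))
print(f"F = {f_stat:.2f} with {len(ratings) - 1} and "
      f"{sum(len(r) for r in ratings.values()) - len(ratings)} degrees of freedom")
```

A large F value indicates that the branch means differ by more than the variation within branches would suggest.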
The dataset "OnlineNewsPopularity.xlsx" summarizes a heterogeneous set of features about articles published by Mashable over a period of two years. The goal is to predict the number of shares in social networks (popularity).
Main variables are:
• n_tokens_title: number of words in the title
• n_tokens_content: number of words in the content
• num_hrefs: number of links
• num_imgs: number of images
• num_videos: number of videos
• data_channel
• weekday
• global_subjectivity: text subjectivity
• global_rate_positive_words: rate of positive words in the content
• global_rate_negative_words: rate of negative words in the content
• shares: number of shares (target)
Part 1. Data Preprocessing
Part 2. Visualization and Descriptive Statistics
Part 3. Models and Analyzing data
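Before fitting a model, it can help to check how strongly each feature is related to the target. The sketch below computes the Pearson correlation between one feature and the log of the share count. The log transform is an assumption that is only useful if the share counts turn out to be strongly right-skewed, the data points are made up, and pearson_r is a hypothetical helper name.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


# Made-up values, NOT taken from OnlineNewsPopularity.xlsx.
num_imgs = [0, 1, 2, 5, 10]
shares = [500, 900, 1200, 4000, 30000]

# log1p(s) = log(1 + s) compresses very large share counts.
log_shares = [math.log1p(s) for s in shares]

print(f"r(num_imgs, shares)      = {pearson_r(num_imgs, shares):.3f}")
print(f"r(num_imgs, log(shares)) = {pearson_r(num_imgs, log_shares):.3f}")
```

A correlation only measures linear association with one feature at a time, so it is a screening step rather than a replacement for the predictive model.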