
Data Science: Applications, Techniques, and Challenges, Study Guides, Projects, Research of Advanced Data Analysis

Basic statistics, visualization and techniques of data analysis

Typology: Study Guides, Projects, Research

2022/2023

Uploaded on 07/21/2023

anirudh-ani-1


ASSIGNMENT

EXPLORATORY DATA ANALYSIS

(DADS302)

SET – I

2. Explain the terms "mean," "median," and "mode" using specific examples.

"Mean," "median," and "mode" are three commonly used measures of central tendency in statistics.

1. Mean: The mean is the sum of all the values in a dataset divided by the number of values; it is also known as the average. For example, for the dataset of five numbers 2, 4, 6, 8, and 10:

Mean = (2 + 4 + 6 + 8 + 10) / 5 = 30 / 5 = 6

Therefore, the mean of this dataset is 6.

2. Median: The median is the middle value of a dataset when it is arranged in order from least to greatest. If the dataset has an even number of values, the median is the average of the two middle values. For example, for the dataset of seven numbers 4, 6, 8, 10, 12, 14, and 16, the values are already in ascending order, and the middle (fourth) value is 10. Therefore, the median of this dataset is 10.

3. Mode: The mode is the value that appears most frequently in a dataset. A dataset can have one or more modes, or no mode at all if every value appears with equal frequency. For example, in the dataset of ten numbers 2, 4, 6, 6, 8, 8, 8, 10, 12, and 14, the value 8 appears three times, more often than any other value, so the mode is 8.

In conclusion, mean, median, and mode are three measures of central tendency used to describe and summarize datasets: the mean is the average of all values, the median is the middle value when the data is arranged in order, and the mode is the most frequently occurring value.

3. Discuss various techniques used for data visualization.

Data visualization is the process of representing data in graphical or pictorial form to make it easier to understand and interpret. Effective data visualization helps to identify patterns, trends, and insights within the data, enabling better decision-making. The following techniques are commonly used:
- Line charts: Line charts are commonly used to visualize time-series data, showing how a variable changes over time. The x-axis represents time and the y-axis the value of the variable being measured. Line charts are useful for identifying trends and patterns over time.

- Bar charts: Bar charts visualize categorical data by comparing the values of different categories. The x-axis represents the categories and the y-axis the value of the variable being measured. Bar charts are useful for comparing categories and identifying the most significant ones.

- Pie charts: Pie charts visualize data that can be expressed as a percentage or proportion. Each category is represented by a slice of the pie, and the size of each slice represents its share of the whole. Pie charts are useful for showing the relative sizes of different categories.

- Scatter plots: Scatter plots visualize the relationship between two variables. Each point represents one observation, positioned according to the values of the two variables being measured. Scatter plots are useful for identifying correlations between variables and spotting outliers.

- Heat maps: Heat maps visualize data in a two-dimensional grid, showing how the value of a variable changes across two dimensions. The color of each cell encodes the value of the variable. Heat maps are useful for identifying patterns and trends in data that varies across two dimensions.

- Tree maps: Tree maps visualize hierarchical data, showing how different categories relate to each other. Each category is represented by a rectangle whose area represents the value of the variable being measured. Tree maps are useful for showing the relative sizes of categories within a hierarchical structure.

- Word clouds: Word clouds visualize text data by showing the frequency of words in a text: each word is drawn at a font size proportional to how often it occurs. Word clouds are useful for identifying the most frequently occurring words in a text.

- Box plots: Box plots visualize the distribution of data around the median. The median is drawn as a line, the box spans the middle 50% of the data (the interquartile range), and the whiskers show the spread of the remaining data. Box plots are useful for identifying outliers and comparing distributions.
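As a concrete illustration of the box-plot elements described above, the following sketch (assuming NumPy is available; the dataset is made up for illustration) computes the statistics a box plot actually draws, including the 1.5 × IQR fences commonly used to flag outliers:

```python
import numpy as np

data = np.array([2, 4, 6, 8, 10, 12, 14, 100])

# Quartiles: the box spans q1..q3, with a line at the median
q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1                      # height of the box

# Whiskers extend to the most extreme points inside these fences;
# anything beyond them is drawn as an individual outlier point
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = data[(data < lower_fence) | (data > upper_fence)]

print(q1, median, q3)   # 5.5 9.0 12.5
print(outliers)         # [100]
```

Note how the single extreme value 100 barely moves the quartiles, which is exactly why box plots are robust for spotting outliers.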

4. What is feature selection? Discuss any two feature selection techniques used to get optimal feature combinations.

Feature selection is the process of selecting a subset of relevant features from a larger set of features in a dataset. Its main objective is to improve the accuracy and efficiency of machine learning models by reducing the dimensionality of the dataset. There are several feature selection techniques available; two of the most commonly used are:

(A) Recursive Feature Elimination (RFE): RFE is a backward feature selection technique that recursively eliminates the least important features from the dataset until the desired number of features remains. The algorithm starts by training a machine learning model on the full dataset and ranking the importance of each feature by its contribution to the model. The least important feature is then eliminated, and the model is retrained on the reduced dataset. This process repeats until the desired number of features is reached. RFE is a popular feature selection technique because it is simple and effective: it can be used with any machine learning model that provides importance scores, and it can handle both linear and non-linear models. However, it can be computationally expensive for large datasets and does not always find the optimal feature subset.

(B) Principal Component Analysis (PCA): PCA is a dimensionality reduction technique that can also be used for feature selection. PCA transforms the original dataset into a new set of features that are linear combinations of the original features. The new features, called principal components, are selected based on their ability to explain the maximum amount of variance in the data. PCA is useful for feature selection because it can reduce the dimensionality of the dataset without losing too much information.
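The RFE loop described in (A) can be sketched as a hand-rolled version that uses least-squares coefficient magnitudes as the importance score. This is a simplification for illustration; library implementations (such as scikit-learn's `RFE`) accept any estimator, and the function and variable names here are made up:

```python
import numpy as np

def rfe(X, y, n_keep):
    """Recursively drop the feature with the smallest |coefficient|
    in a least-squares linear fit until n_keep features remain."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_keep:
        # Refit the model on the surviving features each round
        coefs, *_ = np.linalg.lstsq(X[:, remaining], y, rcond=None)
        weakest = remaining[int(np.argmin(np.abs(coefs)))]
        remaining.remove(weakest)   # eliminate least important feature
    return sorted(remaining)

# Synthetic data: only features 0 and 2 actually influence y
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] + 2.0 * X[:, 2] + 0.01 * rng.normal(size=200)

selected = rfe(X, y, n_keep=2)
print(selected)   # [0, 2]
```

Because the model is retrained after every elimination, the ranking adapts as features are removed, which is the key difference between RFE and a single one-shot importance ranking.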
However, PCA is best suited for datasets with highly correlated features and may not work well with datasets that have a mix of correlated and uncorrelated features.

5. Differentiate between:

1. Two Tail Test and One Tail Test
2. Type I Error and Type II Error

A. Two Tail Test and One Tail Test:

A hypothesis test is used to determine whether a statistical hypothesis about a population is supported by the sample data. There are two types of hypothesis tests based on the directionality of the hypothesis:

- Two-tailed test: A two-tailed test is used when the null hypothesis can be rejected in either direction. It applies when the hypothesis is that there is a difference between two groups but the direction of the difference is not specified. The p-value in a two-tailed test is divided equally between both tails of the distribution.

- One-tailed test: A one-tailed test is used when the null hypothesis can only be rejected in one direction. It applies when the hypothesis is that there is a difference between two groups and the direction of the difference is specified. The p-value in a one-tailed test is calculated for only one tail of the distribution.

B. Type I Error and Type II Error:

Type I and Type II errors are the two types of errors that can occur in hypothesis testing.

- Type I error: A Type I error occurs when the null hypothesis is rejected even though it is actually true. It is also known as a false positive or an alpha error. The probability of a Type I error is denoted by alpha (α) and is usually set at 0.05 or 0.01.

- Type II error: A Type II error occurs when the null hypothesis is not rejected even though it is actually false. It is also known as a false negative or a beta error. The probability of a Type II error is denoted by beta (β) and depends on the sample size, the effect size, and the level of significance.

In summary, hypothesis testing is used to determine whether a statistical hypothesis about a population is supported by the sample data. Two-tailed and one-tailed tests differ in the directionality of the hypothesis, while Type I and Type II errors are the two kinds of mistakes a test can make. Understanding these distinctions is essential for conducting hypothesis tests and interpreting their results.
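The relationship between one-tailed and two-tailed p-values can be verified numerically. The sketch below uses only Python's standard library (`statistics.NormalDist`) with an illustrative z statistic of 1.96, the familiar 5% two-tailed critical value:

```python
from statistics import NormalDist

z = 1.96                       # observed z statistic (illustrative)
cdf = NormalDist().cdf(z)      # P(Z <= z) under the standard normal

p_one_tailed = 1 - cdf         # probability mass in the upper tail only
p_two_tailed = 2 * (1 - cdf)   # mass divided equally between both tails

print(round(p_one_tailed, 3))  # 0.025
print(round(p_two_tailed, 3))  # 0.05
```

The same z statistic that just reaches significance at α = 0.05 in a two-tailed test is significant at α = 0.025 in a one-tailed test, which is why the choice of tails must be made before looking at the data.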

6. What is high-dimensional data? Explain one method used for high-dimensional data representation.

High-dimensional data refers to datasets with a large number of features or variables relative to the number of samples. High-dimensional data is common in many fields, including computer vision, bioinformatics, and finance. The high dimensionality poses a challenge for data analysis and visualization, as traditional methods may not be suitable for such datasets.

One popular method for high-dimensional data representation is Principal Component Analysis (PCA). PCA is a dimensionality reduction technique that transforms the original dataset into a new set of features that are linear combinations of the original features. The new features, called principal components, are selected based on their ability to explain the maximum amount of variance in the data. PCA can be used for both visualization and data analysis.

PCA works by finding the direction of maximum variance in the data and projecting the data onto that direction. The first principal component is the direction of maximum variance; each subsequent principal component is orthogonal to the previous components and captures the remaining variance in the data. The resulting principal components can represent the high-dimensional data in a lower-dimensional space while preserving most of the information.
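The PCA procedure described above can be sketched directly from its definition: center the data, take the eigendecomposition of the covariance matrix, and project onto the top eigenvectors. This is a minimal NumPy version, assuming NumPy is available, with synthetic data that genuinely lives near a 2-D subspace of a 5-D space:

```python
import numpy as np

def pca(X, n_components):
    """Project X onto its top principal components.
    Returns the projected data and the fraction of variance explained."""
    Xc = X - X.mean(axis=0)                 # center each feature
    cov = np.cov(Xc, rowvar=False)          # feature covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]       # sort by variance, descending
    components = eigvecs[:, order[:n_components]]
    projected = Xc @ components             # lower-dimensional representation
    explained = eigvals[order[:n_components]].sum() / eigvals.sum()
    return projected, explained

# 5-D data that really lies on a 2-D subspace plus a little noise
rng = np.random.default_rng(1)
base = rng.normal(size=(300, 2))
X = base @ rng.normal(size=(2, 5)) + 0.01 * rng.normal(size=(300, 5))

Z, explained = pca(X, n_components=2)
print(Z.shape)   # (300, 2)
```

Because the data is essentially two-dimensional, the first two principal components recover nearly all of its variance, illustrating how PCA can shrink a high-dimensional dataset while preserving most of the information.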