Docsity
FIT1043 - Introduction to Data Science: Concepts, Techniques, and Applications, Exercises of Financial Economics

A comprehensive overview of key concepts and techniques in data science, covering data wrangling, analysis, presentation, and machine learning. It also covers data types, machine learning models, and data management principles, along with big data concepts (including the 4 V's and metadata types), NoSQL database models, and distributed processing techniques. Intended for students and professionals who want to understand the fundamentals of data science.

Typology: Exercises

2024/2025

Available from 02/26/2025

patrick-maina-2
FIT1043 - Introduction to Data Science
Drew Conway's Data Science Venn Diagram ✔✔Data Science involves hacking skills,
substantive expertise and math & statistics knowledge
Danger Zone in Drew Conway's Venn diagram ✔✔Having hacking skills and substantive
expertise but no math & statistics knowledge leads to trial and error experiments and
possibly bad judgements
Standard Value Chain ✔✔The tasks involved in data science
Collection ✔✔Getting the data
Wrangling ✔✔Involves data preprocessing, preparation, cleaning and transformation to
get the data into a usable state for analysis
Analysis ✔✔Discovery, learning, and visualization of data
Presentation ✔✔Presenting the data to argue that the results are significant and useful
Engineering ✔✔Storage and computational resources across full data lifecycle
Governance ✔✔Overall management of data across the full lifecycle
Operationalization ✔✔Putting the results to work to gain benefits/value from it
Data Scientist ✔✔Addresses the data science process to extract meaning/value from the
data
Chief Data Scientist ✔✔Addresses data management, data engineering and data science
goals
Inner join ✔✔Keep only rows whose key appears in both dataframes
Left join ✔✔Keep all rows in the left dataframe, plus matching data from the right
Right join ✔✔Keep all rows in the right dataframe, plus matching data from the left
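
As a rough illustration of the three join cards above, the sketch below uses the pandas `merge` function. The `students` and `marks` tables and all their values are invented for this example and are not part of the course material.

```python
import pandas as pd

# Hypothetical tables, invented for illustration only.
students = pd.DataFrame({"id": [1, 2, 3], "name": ["Ana", "Ben", "Cy"]})
marks = pd.DataFrame({"id": [2, 3, 4], "mark": [70, 85, 90]})

# Inner join: only ids present in both tables survive.
inner = pd.merge(students, marks, on="id", how="inner")

# Left join: every row of `students` is kept; a missing mark becomes NaN.
left = pd.merge(students, marks, on="id", how="left")

# Right join: every row of `marks` is kept; a missing name becomes NaN.
right = pd.merge(students, marks, on="id", how="right")

print(inner)
print(left)
print(right)
```

Rows without a partner in the other table are filled with missing values (NaN), which is why the left and right results can contain gaps.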

Union (full outer join) ✔✔Keep all rows from both dataframes and sort the join keys lexicographically
Categorical-nominal data ✔✔Discrete number of values with no inherent ordering (e.g. gender)
Categorical-ordinal data ✔✔Discrete number of states with an inherent ordering (e.g. year level)
Numeric-discrete data ✔✔Numeric and enumerable; you can count it (e.g. integers)
Numeric-continuous data ✔✔Numeric but not enumerable (e.g. height measurements, which can take any decimal value)
Machine learning is useful when: ✔✔human expertise is not available; humans cannot explain their expertise (as a set of rules); and/or humans are expensive to use for the work
Interpretability issue ✔✔Proper documentation (e.g. a data dictionary) is needed to understand and use the data
Data format issue ✔✔Different data sources sometimes use different data formats, which makes the data hard to integrate and manipulate
Inconsistent and faulty data ✔✔Mistyped data, inconsistent entries or unrelated data can make data hard to work with
Duplicate data may not necessarily be entered word-for-word the same, but: ✔✔it conveys the same information
Open data ✔✔Data that is publicly available and machine-readable
Predictive model ✔✔A model that makes a prediction based on a set of features of an object
The different kinds of prediction a predictive model can make are: ✔✔A class (binary or categorical), a real value (aka regression), a vector of real values
Machine learning ✔✔Getting computers to perform a task without using explicit instructions, relying on patterns and inference instead
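
The nominal versus ordinal distinction in the cards above can be sketched with pandas categoricals. This is only one possible illustration; the `year_level` and `eye_colour` series and their values are invented for the example.

```python
import pandas as pd

# Ordinal: categories carry a meaningful order (hypothetical sample values).
year_level = pd.Series(
    pd.Categorical(
        ["Year 2", "Year 1", "Year 3", "Year 1"],
        categories=["Year 1", "Year 2", "Year 3"],
        ordered=True,
    )
)

# Nominal: categories are just labels with no order (hypothetical sample values).
eye_colour = pd.Series(pd.Categorical(["brown", "blue", "green", "brown"]))

# Ordinal data supports order-based operations such as max and comparisons.
highest_level = year_level.max()
above_year_1 = int((year_level > "Year 1").sum())

# Nominal data supports counting and equality, but asking for an order fails.
brown_count = int((eye_colour == "brown").sum())
try:
    eye_colour.max()
    nominal_has_order = True
except TypeError:
    nominal_has_order = False

print(highest_level, above_year_1, brown_count, nominal_has_order)
```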

True Positive ✔✔When both the predicted and the actual class are positive
True Negative ✔✔When both the predicted and the actual class are negative
False Positive (Type I Error) ✔✔When the predicted class is positive but the actual class is negative
False Negative (Type II Error) ✔✔When the predicted class is negative but the actual class is positive
Accuracy ✔✔How often is the prediction correct?
Sensitivity (recall) ✔✔When the actual value is positive, how often is the prediction correct?
Specificity ✔✔When the actual value is negative, how often is the prediction correct?
False positive rate ✔✔When the actual value is negative, how often is the prediction incorrect?
Precision ✔✔When a positive value is predicted, how often is this prediction correct?
Decision Trees ✔✔Used to predict binary or categorical outcomes by grouping inputs based on common features
Regression Trees ✔✔Used to predict continuous values by grouping inputs based on their value
Recursive Partitioning ✔✔Method to build regression and decision trees by dividing the feature space into regions so that similar instances are grouped together
Random Forest ✔✔An ensemble of decision trees
Clustering ✔✔Grouping a set of data points into different subgroups based on similarity
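
The evaluation cards above translate directly into ratios of the four confusion-matrix counts. The sketch below is a minimal illustration; the function name `classification_metrics` and the counts passed to it are made up for the example.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the metrics from the cards above, given confusion-matrix counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,          # correct predictions overall
        "sensitivity": tp / (tp + fn),          # correct among actual positives
        "specificity": tn / (tn + fp),          # correct among actual negatives
        "false_positive_rate": fp / (fp + tn),  # incorrect among actual negatives
        "precision": tp / (tp + fp),            # correct among predicted positives
    }


# Hypothetical counts, invented for illustration.
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Note that the false positive rate is always 1 minus the specificity, since both are computed over the actual negatives.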

K-Means Clustering ✔✔Find a certain number (k) of subgroups in a dataset by assigning data points to the nearest centroid and then adjusting the centroid; these two steps are repeated until there is no more change
Feature Scaling ✔✔Method used to normalize the range of independent variables
Big Data ✔✔Data sets with sizes beyond the ability of commonly used software tools
The 4 V's of Big Data ✔✔Volume (Scale of data), Variety (Different forms of data), Velocity (Analysis of streaming data), Veracity (Uncertainty of data)
Metadata ✔✔Structured information that describes, explains, locates or otherwise makes it easier to retrieve, use or manage an information resource
Types of metadata ✔✔Descriptive (Describes content for identification and retrieval), Structural (Documents relationships and links), Administrative (Helps to manage information)
Moore's Law ✔✔Number of transistors per chip, and thus computing power, doubles every 2 years (starting from 1975)
Koomey's Law ✔✔Amount of energy needed for a given computation halves roughly every 1.5 years
Bell's Law ✔✔A new, cheaper computer class forms roughly each decade, establishing a new industry
Zimmermann's Law ✔✔Surveillance is constantly increasing, and privacy is constantly decreasing
Information Silo ✔✔An isolated information system that can't interact with other systems
NoSQL ✔✔A new generation of database management systems that is not based on the traditional relational database model
Object Database ✔✔A NoSQL database model that stores data as objects, which can be grouped into classes and defined by attributes and methods
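
The two repeated steps in the K-Means card (assign each point to the nearest centroid, then adjust the centroids) can be sketched in plain Python. This is a minimal illustration, assuming 2-D points and caller-supplied starting centroids; the function name `k_means` and the sample points are invented for the example.

```python
import math


def k_means(points, centroids, max_iter=100):
    """Plain k-means on tuples of coordinates, starting from the given centroids."""
    centroids = list(centroids)
    clusters = []
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: math.dist(p, centroids[i]),
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        new_centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster
            else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:  # no more change: stop iterating
            break
        centroids = new_centroids
    return centroids, clusters


# Hypothetical data: two visually separate groups of points.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = k_means(points, centroids=[(0, 0), (10, 10)])
print(centroids)
print(clusters)
```

In practice the starting centroids are usually chosen at random, so different runs can settle on different groupings.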

Hadoop ✔✔An open-source Java implementation of map-reduce to partition computation across clusters
Spark ✔✔An open-source cluster-computing engine that generalizes map-reduce and can run on Hadoop infrastructure
Deep learning ✔✔A subfield of machine learning concerned with learning representations of data
Data Management ✔✔The development, execution and supervision of plans, policies, programs and practices that control, protect, deliver and enhance the value of data and information assets
Privacy ✔✔Having control over how one shares oneself with others
Confidentiality ✔✔Information privacy; having control over how one's information is shared
Security ✔✔Protection of data, preventing it from being improperly used
Implicit Data ✔✔Data that is not explicitly stored but can be inferred with reasonable precision from available data
Compliance ✔✔The process of ensuring you meet regulations
Audit ✔✔Validation of compliance
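
The map-reduce idea behind the Hadoop card can be illustrated with a word count that runs in a single Python process. This is a conceptual sketch only: it does not use the Hadoop or Spark APIs, and the function names and the sample `chunks` are invented for the example. In a real cluster each chunk would be mapped on a different machine.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


# Hypothetical input, split into chunks as a cluster would partition it.
chunks = ["big data big ideas", "data science"]

mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(counts)
```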