FIT1043 - Introduction to Data Science: Concepts, Techniques, and Applications, Exercises of Financial Economics

Glasgow School of Art (GSA), Financial Economics

A comprehensive overview of key concepts and techniques in data science, covering topics such as data wrangling, analysis, presentation, and machine learning. It explores various data types, machine learning models, and data management principles. The document also covers big data concepts, including the 4 V's and metadata types, and examines NoSQL database models and distributed processing techniques. Valuable for students and professionals seeking to understand the fundamentals of data science.

Typology: Exercises

2024/2025

Available from 02/26/2025

patrick-maina-2 🇬🇧

299 documents


FIT1043 - Introduction to Data Science

Drew Conway's Data Science Venn Diagram ✔✔Data Science involves hacking skills, substantive expertise and math & statistics knowledge

Danger Zone in Drew Conway's Venn diagram ✔✔Having hacking skills and substantive expertise but no math & statistics knowledge leads to trial and error experiments and possibly bad judgements

Standard Value Chain ✔✔The tasks involved in data science

Collection ✔✔Getting the data

Wrangling ✔✔Involves data preprocessing, preparation, cleaning and transformation to get the data into a usable state for analysis

Analysis ✔✔Discovery, learning, and visualization of data

Presentation ✔✔Presenting the data to argue that the results are significant and useful

Engineering ✔✔Storage and computational resources across full data lifecycle

Governance ✔✔Overall management of data across the full lifecycle

Operationalization ✔✔Putting the results to work to gain benefits/value from them

Data Scientist ✔✔Addresses the data science process to extract meaning/value from the data

Chief Data Scientist ✔✔Addresses data management, data engineering and data science goals

Inner join ✔✔Only rows in common to both dataframes

Left join ✔✔Keep all rows in left dataframe

Right join ✔✔Keep all rows in right dataframe
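The join cards can be illustrated with a minimal plain-Python sketch. It is only an illustration under stated assumptions: rows are dicts, each key appears at most once per table, and the `join` helper and the `students`/`marks` tables are hypothetical names invented here. In practice a dataframe library would do this work; the `outer` branch corresponds to the union (full outer join) card.

```python
def join(left, right, key, how="inner"):
    """Join two non-empty lists of row dicts on `key`.

    Assumes each key value is unique within a table (an assumption of
    this sketch). Columns with no matching row are filled with None.
    """
    left_cols = [c for c in left[0] if c != key]
    right_cols = [c for c in right[0] if c != key]
    left_by_key = {row[key]: row for row in left}
    right_by_key = {row[key]: row for row in right}

    if how == "inner":    # only keys common to both tables
        keys = [k for k in left_by_key if k in right_by_key]
    elif how == "left":   # keep every row of the left table
        keys = list(left_by_key)
    elif how == "right":  # keep every row of the right table
        keys = list(right_by_key)
    elif how == "outer":  # union of keys, sorted
        keys = sorted(set(left_by_key) | set(right_by_key))
    else:
        raise ValueError(f"unknown join type: {how}")

    result = []
    for k in keys:
        row = {key: k}
        for c in left_cols:
            row[c] = left_by_key.get(k, {}).get(c)
        for c in right_cols:
            row[c] = right_by_key.get(k, {}).get(c)
        result.append(row)
    return result


# Hypothetical example tables.
students = [{"id": 1, "name": "Ana"}, {"id": 2, "name": "Ben"}, {"id": 3, "name": "Cy"}]
marks = [{"id": 2, "mark": 71}, {"id": 3, "mark": 58}, {"id": 4, "mark": 90}]

for how in ("inner", "left", "right", "outer"):
    print(how, [row["id"] for row in join(students, marks, "id", how)])
```

Only ids 2 and 3 appear in both tables, so they are the only rows an inner join keeps; the left and right joins add the unmatched row from their own side with the missing column set to `None`.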


Union (full outer join) ✔✔Keep all rows from both dataframes, with the join keys sorted lexicographically

Categorical-nominal data ✔✔Discrete number of values with no inherent ordering (e.g. gender)

Categorical-ordinal data ✔✔Discrete number of states with an ordering (e.g. year level)

Numeric-discrete data ✔✔Numeric and enumerable; you can count it (e.g. integers)

Numeric-continuous data ✔✔Numeric but not enumerable (e.g. height measurements, as they can go into decimals)

Machine learning is useful when: ✔✔human expertise is not available; humans cannot explain their expertise (as a set of rules); and/or humans are expensive to use for the work

Interpretability issue ✔✔Proper documentation is needed to understand and use the data (e.g. data dictionary)

Data format issue ✔✔Different data sources sometimes use different data formats, which makes the data hard to integrate and manipulate

Inconsistent and faulty data ✔✔Mistyped data, inconsistent entries or unrelated data can make data hard to work with

Duplicate data may not necessarily be entered word-for-word the same, but: ✔✔it conveys the same information

Open data ✔✔Data that is publicly available and machine-readable

Predictive model ✔✔A model that makes a prediction based on a set of features of an object

The different outputs that a predictive model can produce are: ✔✔A class (binary or categorical), a real value (aka regression), a vector of real values

Machine learning ✔✔Getting computers to perform a task without using explicit instructions, relying on patterns and inference instead

True Positive ✔✔When both the predicted and actual class are positive

True Negative ✔✔When both the predicted and actual class are negative

False Positive (Type I Error) ✔✔When the predicted class is positive but the actual class is negative

False Negative (Type II Error) ✔✔When the predicted class is negative but the actual class is positive

Accuracy ✔✔How often is the prediction correct?

Sensitivity (recall) ✔✔When the actual value is positive, how often is the prediction correct?

Specificity ✔✔When the actual value is negative, how often is the prediction correct?

False positive rate ✔✔When the actual value is negative, how often is the prediction incorrect?

Precision ✔✔When a positive value is predicted, how often is this prediction correct?

Decision Trees ✔✔Used to predict binary or categorical outcomes by grouping inputs based on common features

Regression Trees ✔✔Used to predict continuous values by grouping inputs based on their value

Recursive Partitioning ✔✔Method to build regression and decision trees by dividing feature space into regions so that similar instances are grouped together

Random Forest ✔✔An ensemble of decision trees

Clustering ✔✔Grouping a set of data points into different subgroups based on similarity
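The confusion-matrix cards above translate directly into ratios of the four counts. The sketch below is one way to compute them in plain Python; the function names and the ten example labels are made up for illustration, and 1 stands for the positive class.

```python
def confusion_counts(actual, predicted):
    """Count true/false positives and negatives (1 = positive, 0 = negative)."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            tp += 1
        elif a == 0 and p == 0:
            tn += 1
        elif a == 0 and p == 1:
            fp += 1  # Type I error
        else:
            fn += 1  # Type II error
    return tp, tn, fp, fn


def metrics(tp, tn, fp, fn):
    """Rates defined on the cards; assumes no denominator is zero."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),          # recall: among actual positives
        "specificity": tn / (tn + fp),          # among actual negatives
        "false_positive_rate": fp / (tn + fp),  # equals 1 - specificity
        "precision": tp / (tp + fp),            # among predicted positives
    }


# Hypothetical labels: 4 actual positives, 6 actual negatives.
actual    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp, tn, fp, fn = confusion_counts(actual, predicted)
print(tp, tn, fp, fn)  # 3 5 1 1
print(metrics(tp, tn, fp, fn))
```

With these labels, accuracy is 8/10 = 0.8, sensitivity and precision are both 3/4, and specificity is 5/6, so the false positive rate is 1/6.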

K-Means Clustering ✔✔Find a certain number (k) of subgroups in a dataset by assigning data points to the nearest centroid and then adjusting the centroid; these two steps are repeated until there is no more change

Feature Scaling ✔✔Method used to normalize the range of independent variables

Big Data ✔✔Data sets with sizes beyond the ability of commonly used software tools

The 4 V's of Big Data ✔✔Volume (Scale of data), Variety (Different forms of data), Velocity (Analysis of streaming data), Veracity (Uncertainty of data)

Metadata ✔✔Structured information that describes, explains, locates or otherwise makes it easier to retrieve, use or manage an information resource

Types of metadata ✔✔Descriptive (Describes content for identification and retrieval), Structural (Documents relationships and links), Administrative (Helps to manage information)

Moore's Law ✔✔Number of transistors per chip, and thus computing power, doubles every 2 years (starting from 1975)

Koomey's Law ✔✔Amount of energy needed for a given computation halves roughly every 1.5 years

Bell's Law ✔✔A new, cheaper computer class forms roughly each decade, establishing a new industry

Zimmermann's Law ✔✔Surveillance is constantly increasing, and privacy is constantly decreasing

Information Silo ✔✔An isolated information system that can't interact with other systems

NoSQL ✔✔A new generation of database management systems that is not based on the traditional relational database model

Object Database ✔✔A NoSQL database model that stores data as objects, which can be grouped into classes and defined by attributes and methods
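The two repeated steps on the K-Means card (assign each point to its nearest centroid, then move each centroid) can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: the starting centroids are passed in by hand rather than chosen randomly, and the six points are invented for the example.

```python
import math


def kmeans(points, centroids, max_iter=100):
    """Basic k-means on tuples of floats; k is the number of starting centroids."""
    centroids = [tuple(c) for c in centroids]
    assignment = None
    for _ in range(max_iter):
        # Step 1: assign every point to the index of its nearest centroid.
        new_assignment = [
            min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster, so the centroids are stable
        assignment = new_assignment
        # Step 2: move each centroid to the mean of the points assigned to it.
        for i in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == i]
            if members:  # leave a centroid in place if it has no points
                centroids[i] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return centroids, assignment


# Hypothetical data: two visibly separate groups of 2-D points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, assignment = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(assignment)  # [0, 0, 0, 1, 1, 1]
print(centroids)
```

Because plain k-means relies on Euclidean distance, features measured on very different ranges are usually scaled first, which is where the Feature Scaling card comes in.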

Hadoop ✔✔An open-source Java implementation of map-reduce to partition computation across clusters

Spark ✔✔An open-source implementation of map-reduce built on Hadoop infrastructure

Deep learning ✔✔A machine learning subfield concerned with learning representations of data

Data Management ✔✔The development, execution and supervision of plans, policies, programs and practices that control, protect, deliver and enhance the value of data and information assets

Privacy ✔✔Having control over how one shares oneself with others

Confidentiality ✔✔Information privacy; having control over how one's information is shared

Security ✔✔Protection of data, preventing it from being improperly used

Implicit Data ✔✔Data not explicitly stored but can be inferred with reasonable precision from available data

Compliance ✔✔Process of ensuring you meet regulations

Audit ✔✔Validation of compliance
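The map-reduce model named on the Hadoop and Spark cards can be illustrated with a single-process word count. This sketch does not use the Hadoop or Spark APIs and distributes nothing; the function names and the two example documents are hypothetical, and the code only shows the map, shuffle and reduce phases that a cluster framework would spread across machines.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    # On a cluster, each document (or block of one) would be mapped on a
    # different node; here the map calls simply run one after another.
    mapped = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, vs) for k, vs in shuffle(mapped).items())


documents = ["big data big value", "data science"]
print(word_count(documents))
# {'big': 2, 'data': 2, 'value': 1, 'science': 1}
```

Each map call depends only on its own document and each reduce call only on its own key, which is what allows the computation to be partitioned across a cluster.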
