



An overview of key concepts and techniques in data science, covering data wrangling, analysis, presentation, and machine learning. It explores data types, machine learning models, and data management principles, then turns to big data concepts, including the 4 V's and metadata types, and to NoSQL database models and distributed processing techniques.
Typology: Exercises
FIT1043 - Introduction to Data Science

Drew Conway's Data Science Venn Diagram ✔✔Data science involves hacking skills, substantive expertise, and math & statistics knowledge
Danger Zone in Drew Conway's Venn diagram ✔✔Having hacking skills and substantive expertise but no math & statistics knowledge, which leads to trial-and-error experiments and possibly bad judgements
Standard Value Chain ✔✔The tasks involved in data science
Collection ✔✔Getting the data
Wrangling ✔✔Data preprocessing, preparation, cleaning, and transformation to get the data into a usable state for analysis
Analysis ✔✔Discovery, learning, and visualization of data
Presentation ✔✔Presenting the results to argue that they are significant and useful
Engineering ✔✔Storage and computational resources across the full data lifecycle
Governance ✔✔Overall management of data across the full lifecycle
Operationalization ✔✔Putting the results to work to gain benefits/value from them
Data Scientist ✔✔Addresses the data science process to extract meaning/value from the data
Chief Data Scientist ✔✔Addresses data management, data engineering, and data science goals
Inner join ✔✔Keep only the rows whose keys appear in both dataframes
Left join ✔✔Keep all rows in the left dataframe
Right join ✔✔Keep all rows in the right dataframe
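The join types in these cards (inner, left, right, and the full outer join) can be tried out with a short sketch. It assumes pandas is available and uses `pandas.merge` with its `how` parameter; the two dataframes and their column names (`id`, `name`, `score`) are made up for illustration.

```python
import pandas as pd

# Hypothetical example data; column names and values are illustrative only.
left = pd.DataFrame({"id": [1, 2, 3], "name": ["Ann", "Bob", "Cy"]})
right = pd.DataFrame({"id": [2, 3, 4], "score": [70, 85, 90]})

# Inner join: only the keys present in both dataframes (2 and 3).
inner = pd.merge(left, right, on="id", how="inner")

# Left join: every row of `left`; missing scores become NaN.
left_join = pd.merge(left, right, on="id", how="left")

# Right join: every row of `right`; missing names become NaN.
right_join = pd.merge(left, right, on="id", how="right")

# Full outer join: the union of the keys from both dataframes.
outer = pd.merge(left, right, on="id", how="outer")

for label, frame in [("inner", inner), ("left", left_join),
                     ("right", right_join), ("outer", outer)]:
    print(label, frame["id"].tolist())
```

Rows that exist on only one side are kept by the left, right, and outer joins, with the missing columns filled with NaN.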
Union (full outer join) ✔✔Keep all rows from both dataframes and sort the join keys lexicographically
Categorical-nominal data ✔✔A discrete number of values with no inherent ordering (e.g. gender)
Categorical-ordinal data ✔✔A discrete number of states with an ordering (e.g. year level)
Numeric-discrete data ✔✔Numeric and enumerable; you can count it (e.g. integers)
Numeric-continuous data ✔✔Numeric but not enumerable (e.g. height measurements, which can take decimal values)
Machine learning is useful when: ✔✔human expertise is not available; humans cannot explain their expertise (as a set of rules); and/or humans are expensive to use for the work
Interpretability issue ✔✔Proper documentation is needed to understand and use the data (e.g. a data dictionary)
Data format issue ✔✔Different data sources sometimes use different data formats, which makes the data hard to integrate and manipulate
Inconsistent and faulty data ✔✔Mistyped data, inconsistent entries, or unrelated data can make data hard to work with
Duplicate data may not necessarily be entered word-for-word the same, but: ✔✔it conveys the same information
Open data ✔✔Data that is publicly available and machine-readable
Predictive model ✔✔A model that makes a prediction based on a set of features of an object
The different kinds of prediction a predictive model can make are: ✔✔A class, binary or categorical (classification); a real value (regression); a vector of real values
Machine learning ✔✔Getting computers to perform a task without using explicit instructions, relying on patterns and inference instead
True Positive ✔✔When both the predicted and the actual class are positive
True Negative ✔✔When both the predicted and the actual class are negative
False Positive (Type I Error) ✔✔When the predicted class is positive but the actual class is negative
False Negative (Type II Error) ✔✔When the predicted class is negative but the actual class is positive
Accuracy ✔✔Over all cases, how often is the prediction correct?
Sensitivity (recall) ✔✔When the actual value is positive, how often is the prediction correct?
Specificity ✔✔When the actual value is negative, how often is the prediction correct?
False positive rate ✔✔When the actual value is negative, how often is the prediction incorrect?
Precision ✔✔When a positive value is predicted, how often is this prediction correct?
Decision Trees ✔✔Used to predict binary or categorical outcomes by grouping inputs based on common features
Regression Trees ✔✔Used to predict continuous values by grouping inputs based on their value
Recursive Partitioning ✔✔Method to build regression and decision trees by dividing the feature space into regions so that similar instances are grouped together
Random Forest ✔✔An ensemble of decision trees
Clustering ✔✔Grouping a set of data points into different subgroups based on similarity
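The evaluation measures defined in the cards above (accuracy, sensitivity, specificity, false positive rate, precision) all follow from the four confusion-matrix counts. The sketch below is one way to compute them in plain Python; it assumes binary labels coded as 1 (positive) and 0 (negative), and the function names and example labels are hypothetical.

```python
def confusion_counts(actual, predicted):
    """Count TP, TN, FP, FN for binary labels (1 = positive, 0 = negative)."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if a and p:
            tp += 1          # predicted positive, actually positive
        elif not a and not p:
            tn += 1          # predicted negative, actually negative
        elif p:
            fp += 1          # predicted positive, actually negative (Type I)
        else:
            fn += 1          # predicted negative, actually positive (Type II)
    return tp, tn, fp, fn


def metrics(tp, tn, fp, fn):
    """Derive the measures; assumes every denominator is non-zero."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),           # recall, true positive rate
        "specificity": tn / (tn + fp),           # true negative rate
        "false_positive_rate": fp / (tn + fp),   # equals 1 - specificity
        "precision": tp / (tp + fp),
    }


# Hypothetical labels: 4 actual positives, 6 actual negatives.
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

counts = confusion_counts(actual, predicted)
results = metrics(*counts)
print(counts)
print(results)
```

Note that specificity and the false positive rate share the same denominator (all actual negatives), so they always sum to 1.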
K-Means Clustering ✔✔Find a certain number (k) of subgroups in a dataset by assigning each data point to the nearest centroid and then moving each centroid to the mean of its assigned points; these two steps are repeated until there is no more change
Feature Scaling ✔✔Method used to normalize the range of independent variables
Big Data ✔✔Data sets with sizes beyond the ability of commonly used software tools to capture, manage, and process
The 4 V's of Big Data ✔✔Volume (scale of data), Variety (different forms of data), Velocity (analysis of streaming data), Veracity (uncertainty of data)
Metadata ✔✔Structured information that describes, explains, locates, or otherwise makes it easier to retrieve, use, or manage an information resource
Types of metadata ✔✔Descriptive (describes content for identification and retrieval), Structural (documents relationships and links), Administrative (helps to manage information)
Moore's Law ✔✔The number of transistors per chip, and thus computing power, doubles roughly every 2 years (the rate Moore revised his prediction to in 1975)
Koomey's Law ✔✔The amount of energy needed for a given computation halves roughly every 1.5 years
Bell's Law ✔✔A new, cheaper class of computers forms roughly each decade, establishing a new industry
Zimmermann's Law ✔✔Surveillance is constantly increasing, and privacy is constantly decreasing
Information Silo ✔✔An isolated information system that can't interact with other systems
NoSQL ✔✔A new generation of database management systems that are not based on the traditional relational database model
Object Database ✔✔A NoSQL database model that stores data as objects, which can be grouped into classes and defined by attributes and methods
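The two repeated steps in the K-Means Clustering card above (assign each point to the nearest centroid, then adjust the centroids) can be sketched in plain Python. This is a minimal illustration, not a production implementation: it assumes the caller supplies the initial centroids, uses Euclidean distance, and the sample points are made up.

```python
import math


def kmeans(points, centroids, max_iter=100):
    """Minimal k-means; `centroids` holds the caller-chosen starting positions."""
    centroids = [tuple(c) for c in centroids]
    clusters = [[] for _ in centroids]
    for _ in range(max_iter):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)

        # Update step: move each centroid to the mean of its assigned points.
        new_centroids = []
        for old, members in zip(centroids, clusters):
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid

        # Stop once the centroids no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters


# Hypothetical 2-D points forming two visibly separate groups.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0),
          (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, clusters = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(centroids)
print(clusters)
```

Because the distance is computed on raw values, a variable with a much larger range would dominate the assignment step, which is one reason feature scaling is usually applied before clustering.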
Hadoop ✔✔An open-source Java implementation of map-reduce that partitions computation across clusters of machines
Spark ✔✔An open-source distributed processing engine that generalizes map-reduce with in-memory computation and can run on Hadoop infrastructure
Deep learning ✔✔A subfield of machine learning concerned with learning representations of data
Data Management ✔✔The development, execution, and supervision of plans, policies, programs, and practices that control, protect, deliver, and enhance the value of data and information assets
Privacy ✔✔Having control over how one shares oneself with others
Confidentiality ✔✔Information privacy; having control over how one's information is shared
Security ✔✔Protection of data, preventing it from being improperly used
Implicit Data ✔✔Data that is not explicitly stored but can be inferred with reasonable precision from available data
Compliance ✔✔The process of ensuring you meet regulations
Audit ✔✔Validation of compliance
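The map-reduce model named in the Hadoop and Spark cards above can be illustrated with a single-machine word count. This sketch only mimics the three phases (map, shuffle, reduce) in plain Python; it does not use Hadoop or Spark, and the function names and sample documents are hypothetical.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


# Hypothetical input; on a real cluster each document could sit on a different node.
documents = ["big data big value", "data science"]

mapped = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(counts)
```

The map calls are independent of each other, and so are the reduce calls for different keys, which is what lets a framework spread them across the machines of a cluster.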