

Python for Data Analysis

by Wes McKinney

Copyright © 2013 Wes McKinney. All rights reserved. Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://my.safaribooksonline.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Editors: Julie Steele and Meghan Blanchette

Production Editor: Melanie Yarbrough

Copyeditor: Teresa Exley

Proofreader: BIM Publishing Services

Indexer: BIM Publishing Services

Cover Designer: Karen Montgomery

Interior Designer: David Futato

Illustrator: Rebecca Demarest

October 2012: First Edition.

Revision History for the First Edition:

2012-10-05 First release

See http://oreilly.com/catalog/errata.csp?isbn=9781449319793 for release details.

Nutshell Handbook, the Nutshell Handbook logo, and the O’Reilly logo are registered trademarks of O’Reilly Media, Inc. Python for Data Analysis, the cover image of a golden-tailed tree shrew, and related trade dress are trademarks of O’Reilly Media, Inc.

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and O’Reilly Media, Inc., was aware of a trademark claim, the designations have been printed in caps or initial caps.

While every precaution has been taken in the preparation of this book, the publisher and author assume no responsibility for errors or omissions, or for damages resulting from the use of the information contained herein.

ISBN: 978-1-449-31979-3

[LSI]

1349356084

Table of Contents

  • Preface
    1. Preliminaries
    • What Is This Book About?
    • Why Python for Data Analysis?
      • Python as Glue
      • Solving the “Two-Language” Problem
      • Why Not Python?
    • Essential Python Libraries
      • NumPy
      • pandas
      • matplotlib
      • IPython
      • SciPy
    • Installation and Setup
      • Windows
      • Apple OS X
      • GNU/Linux
      • Python 2 and Python 3
      • Integrated Development Environments (IDEs)
    • Community and Conferences
    • Navigating This Book
      • Code Examples
      • Data for Examples
      • Import Conventions
      • Jargon
    • Acknowledgements
    2. Introductory Examples
    • 1.usa.gov data from bit.ly
      • Counting Time Zones in Pure Python
      • Counting Time Zones with pandas
    • MovieLens 1M Data Set
      • Measuring rating disagreement
    • US Baby Names 1880-2010
      • Analyzing Naming Trends
    • Conclusions and The Path Ahead
    3. IPython: An Interactive Computing and Development Environment
    • IPython Basics
      • Tab Completion
      • Introspection
      • The %run Command
      • Executing Code from the Clipboard
      • Keyboard Shortcuts
      • Exceptions and Tracebacks
      • Magic Commands
      • Qt-based Rich GUI Console
      • Matplotlib Integration and Pylab Mode
    • Using the Command History
      • Searching and Reusing the Command History
      • Input and Output Variables
      • Logging the Input and Output
    • Interacting with the Operating System
      • Shell Commands and Aliases
      • Directory Bookmark System
    • Software Development Tools
      • Interactive Debugger
      • Timing Code: %time and %timeit
      • Basic Profiling: %prun and %run -p
      • Profiling a Function Line-by-Line
    • IPython HTML Notebook
    • Tips for Productive Code Development Using IPython
      • Reloading Module Dependencies
      • Code Design Tips
    • Advanced IPython Features
      • Making Your Own Classes IPython-friendly
      • Profiles and Configuration
    • Credits
    4. NumPy Basics: Arrays and Vectorized Computation
    • The NumPy ndarray: A Multidimensional Array Object
      • Creating ndarrays
      • Data Types for ndarrays
      • Operations between Arrays and Scalars
      • Basic Indexing and Slicing
      • Boolean Indexing
      • Fancy Indexing
      • Transposing Arrays and Swapping Axes
    • Universal Functions: Fast Element-wise Array Functions
    • Data Processing Using Arrays
      • Expressing Conditional Logic as Array Operations
      • Mathematical and Statistical Methods
      • Methods for Boolean Arrays
      • Sorting
      • Unique and Other Set Logic
    • File Input and Output with Arrays
      • Storing Arrays on Disk in Binary Format
      • Saving and Loading Text Files
    • Linear Algebra
    • Random Number Generation
    • Example: Random Walks
      • Simulating Many Random Walks at Once
    5. Getting Started with pandas
    • Introduction to pandas Data Structures
      • Series
      • DataFrame
      • Index Objects
    • Essential Functionality
      • Reindexing
      • Dropping entries from an axis
      • Indexing, selection, and filtering
      • Arithmetic and data alignment
      • Function application and mapping
      • Sorting and ranking
      • Axis indexes with duplicate values
    • Summarizing and Computing Descriptive Statistics
      • Correlation and Covariance
      • Unique Values, Value Counts, and Membership
    • Handling Missing Data
      • Filtering Out Missing Data
      • Filling in Missing Data
    • Hierarchical Indexing
      • Reordering and Sorting Levels
      • Summary Statistics by Level
      • Using a DataFrame’s Columns
    • Other pandas Topics
      • Integer Indexing
      • Panel Data
    6. Data Loading, Storage, and File Formats
    • Reading and Writing Data in Text Format
      • Reading Text Files in Pieces
      • Writing Data Out to Text Format
      • Manually Working with Delimited Formats
      • JSON Data
      • XML and HTML: Web Scraping
    • Binary Data Formats
      • Using HDF5 Format
      • Reading Microsoft Excel Files
    • Interacting with HTML and Web APIs
    • Interacting with Databases
      • Storing and Loading Data in MongoDB
    7. Data Wrangling: Clean, Transform, Merge, Reshape
    • Combining and Merging Data Sets
      • Database-style DataFrame Merges
      • Merging on Index
      • Concatenating Along an Axis
      • Combining Data with Overlap
    • Reshaping and Pivoting
      • Reshaping with Hierarchical Indexing
      • Pivoting “long” to “wide” Format
    • Data Transformation
      • Removing Duplicates
      • Transforming Data Using a Function or Mapping
      • Replacing Values
      • Renaming Axis Indexes
      • Discretization and Binning
      • Detecting and Filtering Outliers
      • Permutation and Random Sampling
      • Computing Indicator/Dummy Variables
    • String Manipulation
      • String Object Methods
      • Regular expressions
      • Vectorized string functions in pandas
    • Example: USDA Food Database
    8. Plotting and Visualization
    • A Brief matplotlib API Primer
      • Figures and Subplots
      • Colors, Markers, and Line Styles
      • Ticks, Labels, and Legends
      • Annotations and Drawing on a Subplot
      • Saving Plots to File
      • matplotlib Configuration
    • Plotting Functions in pandas
      • Line Plots
      • Bar Plots
      • Histograms and Density Plots
      • Scatter Plots
    • Plotting Maps: Visualizing Haiti Earthquake Crisis Data
    • Python Visualization Tool Ecosystem
      • Chaco
      • mayavi
      • Other Packages
      • The Future of Visualization Tools?
    9. Data Aggregation and Group Operations
    • GroupBy Mechanics
      • Iterating Over Groups
      • Selecting a Column or Subset of Columns
      • Grouping with Dicts and Series
      • Grouping with Functions
      • Grouping by Index Levels
    • Data Aggregation
      • Column-wise and Multiple Function Application
      • Returning Aggregated Data in “unindexed” Form
    • Group-wise Operations and Transformations
      • Apply: General split-apply-combine
      • Quantile and Bucket Analysis
      • Example: Filling Missing Values with Group-specific Values
      • Example: Random Sampling and Permutation
      • Example: Group Weighted Average and Correlation
      • Example: Group-wise Linear Regression
    • Pivot Tables and Cross-Tabulation
      • Cross-Tabulations: Crosstab
    • Example: 2012 Federal Election Commission Database
      • Donation Statistics by Occupation and Employer
      • Bucketing Donation Amounts
      • Donation Statistics by State
    10. Time Series
    • Date and Time Data Types and Tools
      • Converting between string and datetime
    • Time Series Basics
      • Indexing, Selection, Subsetting
      • Time Series with Duplicate Indices
    • Date Ranges, Frequencies, and Shifting
      • Generating Date Ranges
      • Frequencies and Date Offsets
      • Shifting (Leading and Lagging) Data
    • Time Zone Handling
      • Localization and Conversion
      • Operations with Time Zone−aware Timestamp Objects
      • Operations between Different Time Zones
    • Periods and Period Arithmetic
      • Period Frequency Conversion
      • Quarterly Period Frequencies
      • Converting Timestamps to Periods (and Back)
      • Creating a PeriodIndex from Arrays
    • Resampling and Frequency Conversion
      • Downsampling
      • Upsampling and Interpolation
      • Resampling with Periods
    • Time Series Plotting
    • Moving Window Functions
      • Exponentially-weighted functions
      • Binary Moving Window Functions
      • User-Defined Moving Window Functions
    • Performance and Memory Usage Notes
    11. Financial and Economic Data Applications
    • Data Munging Topics
      • Time Series and Cross-Section Alignment
      • Operations with Time Series of Different Frequencies
      • Time of Day and “as of” Data Selection
      • Splicing Together Data Sources
      • Return Indexes and Cumulative Returns
    • Group Transforms and Analysis
      • Group Factor Exposures
      • Decile and Quartile Analysis
    • More Example Applications
      • Signal Frontier Analysis
      • Future Contract Rolling
      • Rolling Correlation and Linear Regression
    12. Advanced NumPy
    • ndarray Object Internals
      • NumPy dtype Hierarchy
    • Advanced Array Manipulation
      • Reshaping Arrays
      • C versus Fortran Order
      • Concatenating and Splitting Arrays
      • Repeating Elements: Tile and Repeat
      • Fancy Indexing Equivalents: Take and Put
    • Broadcasting
      • Broadcasting Over Other Axes
      • Setting Array Values by Broadcasting
    • Advanced ufunc Usage
      • ufunc Instance Methods
      • Custom ufuncs
    • Structured and Record Arrays
      • Nested dtypes and Multidimensional Fields
      • Why Use Structured Arrays?
      • Structured Array Manipulations: numpy.lib.recfunctions
    • More About Sorting
      • Indirect Sorts: argsort and lexsort
      • Alternate Sort Algorithms
      • numpy.searchsorted: Finding elements in a Sorted Array
    • NumPy Matrix Class
    • Advanced Array Input and Output
      • Memory-mapped Files
      • HDF5 and Other Array Storage Options
    • Performance Tips
      • The Importance of Contiguous Memory
      • Other Speed Options: Cython, f2py, C
  • Appendix: Python Language Essentials
  • Index

Preface

The scientific Python ecosystem of open source libraries has grown substantially over the last 10 years. By late 2011, I had long felt that the lack of centralized learning resources for data analysis and statistical applications was a stumbling block for new Python programmers engaged in such work. Key projects for data analysis (especially NumPy, IPython, matplotlib, and pandas) had also matured enough that a book written about them would likely not go out-of-date very quickly. Thus, I mustered the nerve to embark on this writing project. This is the book that I wish existed when I started using Python for data analysis in 2007. I hope you find it useful and are able to apply these tools productively in your work.

Conventions Used in This Book

The following typographical conventions are used in this book:

Italic
    Indicates new terms, URLs, email addresses, filenames, and file extensions.

Constant width
    Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.

Constant width bold
    Shows commands or other text that should be typed literally by the user.

Constant width italic
    Shows text that should be replaced with user-supplied values or by values determined by context.

This icon signifies a tip, suggestion, or general note.


This icon indicates a warning or caution.

Using Code Examples

This book is here to help you get your job done. In general, you may use the code in this book in your programs and documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing a CD-ROM of examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product’s documentation does require permission.

We appreciate, but do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: “Python for Data Analysis by William Wesley McKinney (O’Reilly). Copyright 2012 William McKinney, 978-1-449-31979-3.”

If you feel your use of code examples falls outside fair use or the permission given above, feel free to contact us at permissions@oreilly.com.

Safari® Books Online

Safari Books Online (www.safaribooksonline.com) is an on-demand digital library that delivers expert content in both book and video form from the world’s leading authors in technology and business.

Technology professionals, software developers, web designers, and business and creative professionals use Safari Books Online as their primary resource for research, problem solving, learning, and certification training.

Safari Books Online offers a range of product mixes and pricing programs for organizations, government agencies, and individuals. Subscribers have access to thousands of books, training videos, and prepublication manuscripts in one fully searchable database from publishers like O’Reilly Media, Prentice Hall Professional, Addison-Wesley Professional, Microsoft Press, Sams, Que, Peachpit Press, Focal Press, Cisco Press, John Wiley & Sons, Syngress, Morgan Kaufmann, IBM Redbooks, Packt, Adobe Press, FT Press, Apress, Manning, New Riders, McGraw-Hill, Jones & Bartlett, Course Technology, and dozens more. For more information about Safari Books Online, please visit us online.


CHAPTER 1

Preliminaries

What Is This Book About?

This book is concerned with the nuts and bolts of manipulating, processing, cleaning, and crunching data in Python. It is also a practical, modern introduction to scientific computing in Python, tailored for data-intensive applications. This is a book about the parts of the Python language and libraries you’ll need to effectively solve a broad set of data analysis problems. This book is not an exposition on analytical methods using Python as the implementation language.

When I say “data”, what am I referring to exactly? The primary focus is on structured data, a deliberately vague term that encompasses many different common forms of data, such as:

  • Multidimensional arrays (matrices)
  • Tabular or spreadsheet-like data in which each column may be a different type (string, numeric, date, or otherwise). This includes most kinds of data commonly stored in relational databases or tab- or comma-delimited text files
  • Multiple tables of data interrelated by key columns (what would be primary or foreign keys for a SQL user)
  • Evenly or unevenly spaced time series

This is by no means a complete list. Even though it may not always be obvious, a large percentage of data sets can be transformed into a structured form that is more suitable for analysis and modeling. If not, it may be possible to extract features from a data set into a structured form. As an example, a collection of news articles could be processed into a word frequency table which could then be used to perform sentiment analysis.
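The news-article example above can be made concrete with a minimal sketch. The two article texts, the `word_counts` helper, and the simple regular-expression tokenizer are all assumptions made for illustration; a real pipeline would likely use a more careful tokenizer and a much larger corpus.

```python
import re
from collections import Counter

# Hypothetical article texts, invented for this illustration.
articles = {
    "a1": "Markets rally as investors cheer strong earnings",
    "a2": "Investors worry as markets fall on weak earnings",
}

def word_counts(text):
    """Lower-case the text, split it into words, and count each word."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)

# One Counter per article: unstructured text becomes word -> frequency.
table = {doc_id: word_counts(text) for doc_id, text in articles.items()}

# A shared, sorted vocabulary gives every article the same columns.
vocabulary = sorted(set().union(*table.values()))

# Each row is one article; each column is the count of one vocabulary word.
rows = {doc_id: [counts[w] for w in vocabulary]
        for doc_id, counts in table.items()}

print(vocabulary)
for doc_id, row in rows.items():
    print(doc_id, row)
```

The resulting rows form a small structured table (articles by words) of the kind that later analysis steps could consume.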

Most users of spreadsheet programs like Microsoft Excel, perhaps the most widely used data analysis tool in the world, will not be strangers to these kinds of data.


ideas to be part of a larger production system written in, say, Java, C#, or C++. What people are increasingly finding is that Python is a suitable language not only for doing research and prototyping but also for building the production systems. I believe that more and more companies will go down this path, as there are often significant organizational benefits to having both scientists and technologists using the same set of programmatic tools.

Why Not Python?

While Python is an excellent environment for building computationally intensive scientific applications and building most kinds of general purpose systems, there are a number of uses for which Python may be less suitable.

As Python is an interpreted programming language, in general most Python code will run substantially slower than code written in a compiled language like Java or C++. As programmer time is typically more valuable than CPU time, many are happy to make this tradeoff. However, in an application with very low latency requirements (for example, a high frequency trading system), the time spent programming in a lower-level, lower-productivity language like C++ to achieve the maximum possible performance might be time well spent.

Python is not an ideal language for highly concurrent, multithreaded applications, particularly applications with many CPU-bound threads. The reason for this is that it has what is known as the global interpreter lock (GIL), a mechanism which prevents the interpreter from executing more than one Python bytecode instruction at a time. The technical reasons why the GIL exists are beyond the scope of this book, but as of this writing it does not seem likely that the GIL will disappear anytime soon. While it is true that in many big data processing applications, a cluster of computers may be required to process a data set in a reasonable amount of time, there are still situations where a single-process, multithreaded system is desirable.

This is not to say that Python cannot execute truly multithreaded, parallel code; that code just cannot be executed in a single Python process. As an example, the Cython project features easy integration with OpenMP, a C framework for parallel computing, in order to parallelize loops and thus significantly speed up numerical algorithms.
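The effect of the GIL on CPU-bound threads can be observed with a small experiment. The sketch below is an illustration rather than a benchmark: the `count_primes` task and the workload sizes are invented, it assumes Python 3.6 or later (the book itself predates this), and the timings depend on the machine and interpreter build. On a standard CPython build with the GIL, the threaded run is typically no faster than the serial one.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def count_primes(limit):
    """CPU-bound toy task: count primes below `limit` by trial division."""
    count = 0
    for n in range(2, limit):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count

# Four identical chunks of pure-Python, CPU-bound work (sizes are arbitrary).
limits = [20000] * 4

# Run the chunks one after another in a single thread.
start = time.perf_counter()
serial = [count_primes(x) for x in limits]
serial_seconds = time.perf_counter() - start

# Run the same chunks in four threads inside one Python process.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    threaded = list(pool.map(count_primes, limits))
threaded_seconds = time.perf_counter() - start

# Both approaches compute the same answers; only the elapsed time may differ.
print(f"serial: {serial_seconds:.3f}s, threaded: {threaded_seconds:.3f}s")
```

Threads remain useful for I/O-bound work, where the interpreter releases the GIL while waiting; for CPU-bound work, separate processes or compiled extensions are the usual alternatives.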

Essential Python Libraries

For those who are less familiar with the scientific Python ecosystem and the libraries used throughout the book, I present the following overview of each library.


NumPy

NumPy, short for Numerical Python, is the foundational package for scientific computing in Python. The majority of this book will be based on NumPy and libraries built on top of NumPy. It provides, among other things:

  • A fast and efficient multidimensional array object ndarray
  • Functions for performing element-wise computations with arrays or mathematical operations between arrays
  • Tools for reading and writing array-based data sets to disk
  • Linear algebra operations, Fourier transform, and random number generation
  • Tools for connecting C, C++, and Fortran code to Python

Beyond the fast array-processing capabilities that NumPy adds to Python, one of its primary purposes with regard to data analysis is to serve as the main container for data passed between algorithms. For numerical data, NumPy arrays are a much more efficient way of storing and manipulating data than the built-in Python data structures. Also, libraries written in a lower-level language, such as C or Fortran, can operate on the data stored in a NumPy array without copying any data.
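A few of the capabilities listed above can be sketched in a handful of lines. The array values and variable names are made up for illustration, and the `@` matrix-multiplication operator assumes Python 3.5 or later with a reasonably recent NumPy.

```python
import numpy as np

# A small 2-D ndarray; the values are invented for this example.
data = np.array([[1.0, 2.0, 3.0],
                 [4.0, 5.0, 6.0]])

scaled = data * 10              # element-wise multiplication by a scalar
summed = data + data            # element-wise addition of two arrays
col_means = data.mean(axis=0)   # mean of each column
product = data @ data.T         # matrix product: (2x3) times (3x2) gives 2x2

print(data.shape, data.dtype)
print(scaled)
print(col_means)
print(product)
```

None of these operations requires an explicit Python loop; the work is carried out on whole arrays at once.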

pandas

pandas provides rich data structures and functions designed to make working with structured data fast, easy, and expressive. It is, as you will see, one of the critical ingredients enabling Python to be a powerful and productive data analysis environment. The primary object in pandas that will be used in this book is the DataFrame, a two-dimensional tabular, column-oriented data structure with both row and column labels:

    >>> frame
        total_bill   tip     sex  smoker  day    time  size
     1       16.99  1.01  Female      No  Sun  Dinner     2
     2       10.34  1.66    Male      No  Sun  Dinner     3
     3       21.01   3.5    Male      No  Sun  Dinner     3
     4       23.68  3.31    Male      No  Sun  Dinner     2
     5       24.59  3.61  Female      No  Sun  Dinner     4
     6       25.29  4.71    Male      No  Sun  Dinner     4
     7        8.77     2    Male      No  Sun  Dinner     2
     8       26.88  3.12    Male      No  Sun  Dinner     4
     9       15.04  1.96    Male      No  Sun  Dinner     2
    10       14.78  3.23    Male      No  Sun  Dinner     2

pandas combines the high-performance array-computing features of NumPy with the flexible data manipulation capabilities of spreadsheets and relational databases (such as SQL). It provides sophisticated indexing functionality to make it easy to reshape, slice and dice, perform aggregations, and select subsets of data. pandas is the primary tool that we will use in this book.
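As a rough sketch of what selecting subsets and performing aggregations look like in practice, the snippet below rebuilds the first four rows of the table above (omitting a few columns for brevity) and applies three common operations. The variable names and the derived `tip_pct` column are assumptions added for illustration, and the calls reflect current pandas rather than the version available when the book was written.

```python
import pandas as pd

# First four rows of the tips table shown above, with some columns omitted.
frame = pd.DataFrame(
    {
        "total_bill": [16.99, 10.34, 21.01, 23.68],
        "tip": [1.01, 1.66, 3.50, 3.31],
        "sex": ["Female", "Male", "Male", "Male"],
        "size": [2, 3, 3, 2],
    },
    index=[1, 2, 3, 4],
)

# Select a subset of rows with a boolean condition.
big_parties = frame[frame["size"] >= 3]

# Aggregate: mean tip within each group of the "sex" column.
mean_tip_by_sex = frame.groupby("sex")["tip"].mean()

# Derive a new column from existing ones (hypothetical column name).
frame["tip_pct"] = frame["tip"] / frame["total_bill"]

print(big_parties)
print(mean_tip_by_sex)
print(frame)
```

The row labels (1 through 4) and the column labels travel with the data through each operation, which is what makes this style of selection and grouping convenient.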
