Prepare for your exams
Get points
Guidelines and tips

Prepare for your exams

Study with the several resources on Docsity

Earn points to download

Earn points by helping other students or get them with a premium plan

Guidelines and tips

Sell on Docsity

Log in Sign up

Prepare for your exams

Study with the several resources on Docsity

Find documents

Prepare for your exams with the study notes shared by other students like you on Docsity

Search Store documents

The best documents sold by students who completed their studies

Search through all study resources

Docsity AINEW

Summarize your documents, ask them questions, convert them into quizzes and concept maps

Explore questions

Clear up your doubts by reading the answers to questions asked by your fellow students

Earn points to download

Earn points by helping other students or get them with a premium plan

Share documents

20 Points

For each uploaded document

Answer questions

5 Points

For each given answer (max 1 per day)

All the ways to get free points

Get points immediately

Choose a premium plan with all the points you need

Study Opportunities

Choose your next study program

Get in touch with the best universities in the world. Search through thousands of universities and official partners

Community

Ask the community

Ask the community for help and clear up your study doubts

University Rankings

Discover the best universities in your country according to Docsity users

Free resources

Our save-the-student-ebooks!

Download our free guides on studying techniques, anxiety management strategies, and thesis advice from Docsity tutors

From our blog

Exams and Study

Go to the blog

Data visualization using matplotlib, Summaries of Data Representation and Algorithm Design

Jawaharlal Nehru University Data Representation and Algorithm Design

In this how to visualize the data using matplotlib will be there.

Typology: Summaries

2021/2022

Available from 01/17/2022

desu-harshith 🇮🇳

3 documents

1 / 13

This page cannot be seen from the preview

Don't miss anything!

UNIT-V

Data Aggregation and Group Operations

GroupBy Mechanics( Iterating Over Groups, Selecting a Column or Subset of Columns, Grouping with Dicts

and Series, Grouping with Functions, Grouping by Index Levels) Data Aggregation(Column-wise and

Multiple Function Application, Returning Aggregated Data in “unindexed” Form), Group-wise Operations

and Transformations(Apply: General split-apply-combine, Quantile and Bucket Analysis, Example: Filling

Missing Values with Group-specific Values)

----------------------------------------------------------------------------------------------------------------------------------------------------------

Categorizing a data set and applying a function to each group, whether an aggregation or transformation, is

often a critical component of a data analysis workflow. After loading, merging, and preparing a data set, a

familiar task is to compute group statistics or possibly pivot tables for reporting or visualization purposes.

Pandas provide a flexible and high-performance GroupBy facility, enabling you to slice and dice, and

summarize data sets in a natural way.

1. GroupBy Mechanics

 In the first stage of the process, data contained in a pandas object, whether a Series, DataFrame, or

otherwise, is split into groups based on one or more keys that you provide. The splitting is

performed on a particular axis of an object. For example, a DataFrame can be grouped on its rows

(axis=0) or its columns (axis=1).

 Then a function is applied to each group, producing a new value.

 Finally, the results of all those function applications are combined into a result object.

The form of the resulting object will usually depend on what’s being done to the data.

Partial preview of the text

Download Data visualization using matplotlib and more Summaries Data Representation and Algorithm Design in PDF only on Docsity!

UNIT-V

Data Aggregation and Group Operations

GroupBy Mechanics( Iterating Over Groups, Selecting a Column or Subset of Columns, Grouping with Dicts and Series, Grouping with Functions, Grouping by Index Levels) Data Aggregation(Column-wise and Multiple Function Application, Returning Aggregated Data in “unindexed” Form), Group-wise Operations and Transformations(Apply: General split-apply-combine, Quantile and Bucket Analysis, Example: Filling Missing Values with Group-specific Values)

Categorizing a data set and applying a function to each group, whether an aggregation or transformation, is often a critical component of a data analysis workflow. After loading, merging, and preparing a data set, a familiar task is to compute group statistics or possibly pivot tables for reporting or visualization purposes. Pandas provide a flexible and high-performance GroupBy facility, enabling you to slice and dice, and summarize data sets in a natural way.

GroupBy Mechanics

 In the first stage of the process, data contained in a pandas object, whether a Series, DataFrame, or otherwise, is split into groups based on one or more keys that you provide. The splitting is performed on a particular axis of an object. For example, a DataFrame can be grouped on its rows (axis=0) or its columns (axis=1).  Then a function is applied to each group, producing a new value.  Finally, the results of all those function applications are combined into a result object.

The form of the resulting object will usually depend on what’s being done to the data.

Example-1: Using DataFrame Data

import pandas as pd

import numpy as np

df = pd.DataFrame({'key1' : ['a', 'a', 'b', 'b', 'a'], 'key2' : ['one', 'two', 'one', 'two', 'one'],

'data1' : np.random.randn(5),

'data2' : np.random.randn(5)})

print(df)

Output:-

Suppose you wanted to compute the mean of the data1 column using the groups labels from key

grouped = df['data1'].groupby(df['key1'])

print(grouped)

Output

<pandas.core.groupby.generic.SeriesGroupBy object at 0x000001ECA4A11F10>

This grouped variable is now a GroupBy object. This object has all of the information needed to then apply some operation to each of the groups.

For example, to compute group means we can call the GroupBy’s mean method:

We can also perform grouping using multiple keys also

In this case, we grouped the data using two keys, and the resulting Series now has a hierarchical index consisting of the unique pairs of keys.

1.2 Selecting a Column or Subset of Columns

Indexing a GroupBy object created from a DataFrame with a column name or array of column names has the effect of selecting those columns for aggregation. This means that:

Especially for large data sets, it may be desirable to aggregate only a few columns. For example, in the above data set, to compute means for just the data2 column and get the result as a DataFrame, we could write:

1.3 Grouping with Dicts and Series

Grouping information may exist in a form other than an array. Let’s consider another example DataFrame

Now, suppose I have a group correspondence for the columns and want to sum together the columns by group

mapping = {'a': 'red', 'b': 'red', 'c': 'blue', 'd': 'blue', 'e': 'red', 'f' : 'orange'}

Now, you could easily construct an array from this dict to pass to GroupBy, but instead we can just pass the dict:

by_column = people.groupby(mapping, axis=1)

by_column.sum()

The same functionality holds for Series, which can be viewed as a fixed size mapping. When I used Series as group keys in the above examples, pandas does, in fact, inspect each Series to ensure that its index is aligned with the axis it’s grouping

1.5 Grouping by Index Levels

A final convenience for hierarchically-indexed data sets is the ability to aggregate using one of the levels of an axis index. To do this, pass the level number or name using the level keyword:

2. Data Aggregation

Many common aggregation functions have optimized implementations that compute the statistics on the dataset in place

For link to CSV file Used in Code, click here

import pandas as pd

df = pd.read_csv("nba.csv")

df[:10]

Aggregation works with only numeric type columns.

2.1 Column-wise and Multiple Function Application

As you’ve seen above, aggregating a Series or all of the columns of a DataFrame is a matter of using aggregate with the desired function or calling a method like mean or std. However, you may want to aggregate using a different function depending on the column or multiple functions at once

df.aggregate(['sum', 'min'])

In Pandas, we can also apply different aggregation functions across different columns. For that, we need to pass a dictionary with key containing the column names and values containing the list of aggregation functions for any specific column.

importing pandas package

import pandas as pd

making data frame from csv file

df = pd.read_csv("nba.csv")

We are going to find aggregation for these columns

df.aggregate({"Number":['sum', 'min'],

"Age":['max', 'min'],

"Weight":['min', 'sum'],

"Salary":['sum']})

Group-wise Operations and Transformations

Aggregation is only one kind of group operation. It is a special case in the more general class of data transformations; that is, it accepts functions that reduce a one-dimensional array to a scalar value.

3.1 Apply: General split-apply-combine

There are two major differences between the transform and apply groupby methods.

 Input:

o apply implicitly passes all the columns for each group as a DataFrame to the custom function.

o while transform passes each column for each group individually as a Series to the custom function.

 Output:

o The custom function passed to apply can return a scalar, or a Series or DataFrame (or numpy array or even list).

o The custom function passed to transform must return a sequence (a one dimensional Series, array or list) the same length as the group.

So, transform works on just one Series at a time and apply works on the entire DataFrame at once.

Examples

Let's create some sample data and inspect the groups so that you can see what I am talking about:

import pandas as pd

import numpy as np

df1 = pd.DataFrame({'State':['Texas', 'Texas', 'Florida', 'Florida'],

'a':[4,5,1,3], 'b':[6,10,3,11]})

print(df1)

Output:-

State a b

0 Texas 4 6

1 Texas 5 10

2 Florida 1 3

3 Florida 3 11

Let's create a simple custom function

transform is only allowed to work with a single Series at a time. It is impossible for it to act on two columns at the same time. So, if we try and subtract column a from b inside of our custom function we would get an error with transform. See below:

def subtract_two(x):

return x['a'] - x['b']

df1.groupby('State').transform(subtract_two)

OUTPUT:-

KeyError: ('a', 'occurred at index a')

We get a KeyError as pandas is attempting to find the Series index a which does not exist. You can complete this operation with apply as it has the entire DataFrame:

df1.groupby('State').apply(subtract_two)

output:-

State

Florida 2 -

3 -

Texas 0 -

1 -

dtype: int

3.2 Quantile and Bucket Analysis

As you may know, pandas has some tools, in particular cut and qcut, for slicing data up into buckets with bins of your choosing or by sample quantiles. Combining these functions with groupby, it becomes very simple to perform bucket or quantile analysis on a data set. Consider a simple random data set and an equal- length bucket categorization using cut:

The Factor object returned by cut can be passed directly to groupby. So we could compute a set of statistics for the data2 column like so:

These were equal-length buckets; to compute equal-size buckets based on samplequantiles, use qcut. I’ll pass labels=False to just get quantile numbers.

We can fill the NA values using the group means like so:

Data visualization using matplotlib, Summaries of Data Representation and Algorithm Design

Related documents

Partial preview of the text

Download Data visualization using matplotlib and more Summaries Data Representation and Algorithm Design in PDF only on Docsity!

UNIT-V

2. Data Aggregation

importing pandas package

making data frame from csv file

We are going to find aggregation for these columns