







Study with the several resources on Docsity
Earn points by helping other students or get them with a premium plan
Prepare for your exams
Study with the several resources on Docsity
Earn points to download
Earn points by helping other students or get them with a premium plan
Community
Ask the community for help and clear up your study doubts
Discover the best universities in your country according to Docsity users
Free resources
Download our free guides on studying techniques, anxiety management strategies, and thesis advice from Docsity tutors
In this how to visualize the data using matplotlib will be there.
Typology: Summaries
1 / 13
This page cannot be seen from the preview
Don't miss anything!
Data Aggregation and Group Operations
GroupBy Mechanics( Iterating Over Groups, Selecting a Column or Subset of Columns, Grouping with Dicts and Series, Grouping with Functions, Grouping by Index Levels) Data Aggregation(Column-wise and Multiple Function Application, Returning Aggregated Data in “unindexed” Form), Group-wise Operations and Transformations(Apply: General split-apply-combine, Quantile and Bucket Analysis, Example: Filling Missing Values with Group-specific Values)
Categorizing a data set and applying a function to each group, whether an aggregation or transformation, is often a critical component of a data analysis workflow. After loading, merging, and preparing a data set, a familiar task is to compute group statistics or possibly pivot tables for reporting or visualization purposes. Pandas provide a flexible and high-performance GroupBy facility, enabling you to slice and dice, and summarize data sets in a natural way.
In the first stage of the process, data contained in a pandas object, whether a Series, DataFrame, or otherwise, is split into groups based on one or more keys that you provide. The splitting is performed on a particular axis of an object. For example, a DataFrame can be grouped on its rows (axis=0) or its columns (axis=1). Then a function is applied to each group, producing a new value. Finally, the results of all those function applications are combined into a result object.
The form of the resulting object will usually depend on what’s being done to the data.
Example-1: Using DataFrame Data
import pandas as pd
import numpy as np
df = pd.DataFrame({'key1' : ['a', 'a', 'b', 'b', 'a'], 'key2' : ['one', 'two', 'one', 'two', 'one'],
'data1' : np.random.randn(5),
'data2' : np.random.randn(5)})
print(df)
Output:-
Suppose you wanted to compute the mean of the data1 column using the groups labels from key
grouped = df['data1'].groupby(df['key1'])
print(grouped)
Output
<pandas.core.groupby.generic.SeriesGroupBy object at 0x000001ECA4A11F10>
This grouped variable is now a GroupBy object. This object has all of the information needed to then apply some operation to each of the groups.
For example, to compute group means we can call the GroupBy’s mean method:
We can also perform grouping using multiple keys also
In this case, we grouped the data using two keys, and the resulting Series now has a hierarchical index consisting of the unique pairs of keys.
1.2 Selecting a Column or Subset of Columns
Indexing a GroupBy object created from a DataFrame with a column name or array of column names has the effect of selecting those columns for aggregation. This means that:
Especially for large data sets, it may be desirable to aggregate only a few columns. For example, in the above data set, to compute means for just the data2 column and get the result as a DataFrame, we could write:
1.3 Grouping with Dicts and Series
Grouping information may exist in a form other than an array. Let’s consider another example DataFrame
Now, suppose I have a group correspondence for the columns and want to sum together the columns by group
mapping = {'a': 'red', 'b': 'red', 'c': 'blue', 'd': 'blue', 'e': 'red', 'f' : 'orange'}
Now, you could easily construct an array from this dict to pass to GroupBy, but instead we can just pass the dict:
by_column = people.groupby(mapping, axis=1)
by_column.sum()
The same functionality holds for Series, which can be viewed as a fixed size mapping. When I used Series as group keys in the above examples, pandas does, in fact, inspect each Series to ensure that its index is aligned with the axis it’s grouping
1.5 Grouping by Index Levels
A final convenience for hierarchically-indexed data sets is the ability to aggregate using one of the levels of an axis index. To do this, pass the level number or name using the level keyword:
Many common aggregation functions have optimized implementations that compute the statistics on the dataset in place
For link to CSV file Used in Code, click here
import pandas as pd
df = pd.read_csv("nba.csv")
df[:10]
Aggregation works with only numeric type columns.
2.1 Column-wise and Multiple Function Application
As you’ve seen above, aggregating a Series or all of the columns of a DataFrame is a matter of using aggregate with the desired function or calling a method like mean or std. However, you may want to aggregate using a different function depending on the column or multiple functions at once
df.aggregate(['sum', 'min'])
In Pandas, we can also apply different aggregation functions across different columns. For that, we need to pass a dictionary with key containing the column names and values containing the list of aggregation functions for any specific column.
import pandas as pd
df = pd.read_csv("nba.csv")
df.aggregate({"Number":['sum', 'min'],
"Age":['max', 'min'],
"Weight":['min', 'sum'],
"Salary":['sum']})
Aggregation is only one kind of group operation. It is a special case in the more general class of data transformations; that is, it accepts functions that reduce a one-dimensional array to a scalar value.
3.1 Apply: General split-apply-combine
There are two major differences between the transform and apply groupby methods.
Input:
o apply implicitly passes all the columns for each group as a DataFrame to the custom function.
o while transform passes each column for each group individually as a Series to the custom function.
Output:
o The custom function passed to apply can return a scalar, or a Series or DataFrame (or numpy array or even list).
o The custom function passed to transform must return a sequence (a one dimensional Series, array or list) the same length as the group.
So, transform works on just one Series at a time and apply works on the entire DataFrame at once.
Examples
Let's create some sample data and inspect the groups so that you can see what I am talking about:
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'State':['Texas', 'Texas', 'Florida', 'Florida'],
'a':[4,5,1,3], 'b':[6,10,3,11]})
print(df1)
Output:-
State a b
0 Texas 4 6
1 Texas 5 10
2 Florida 1 3
3 Florida 3 11
Let's create a simple custom function
transform is only allowed to work with a single Series at a time. It is impossible for it to act on two columns at the same time. So, if we try and subtract column a from b inside of our custom function we would get an error with transform. See below:
def subtract_two(x):
return x['a'] - x['b']
df1.groupby('State').transform(subtract_two)
OUTPUT:-
KeyError: ('a', 'occurred at index a')
We get a KeyError as pandas is attempting to find the Series index a which does not exist. You can complete this operation with apply as it has the entire DataFrame:
df1.groupby('State').apply(subtract_two)
output:-
State
Florida 2 -
3 -
Texas 0 -
1 -
dtype: int
3.2 Quantile and Bucket Analysis
As you may know, pandas has some tools, in particular cut and qcut, for slicing data up into buckets with bins of your choosing or by sample quantiles. Combining these functions with groupby, it becomes very simple to perform bucket or quantile analysis on a data set. Consider a simple random data set and an equal- length bucket categorization using cut:
The Factor object returned by cut can be passed directly to groupby. So we could compute a set of statistics for the data2 column like so:
These were equal-length buckets; to compute equal-size buckets based on samplequantiles, use qcut. I’ll pass labels=False to just get quantile numbers.
We can fill the NA values using the group means like so: