Getting Started with Pandas – Lesson 3

David Suárez
September 16, 2021
Data Science, Software Architecture Sonar
Pandas

Introduction

This is the third post in our data science training series on Pandas. In this article we summarize the Pandas functions used for iteration, mapping, grouping and sorting. These functions let us transform the data and extract useful information and insights.

Iteration, Maps, Grouping and Sorting

The ‘Wine Quality Dataset’, published in 2009 by Cortez et al. and available from the UCI Machine Learning Repository, is a well-known dataset containing wine quality information. It includes the physicochemical properties of red and white wines together with a quality score.

Before we start, we use the Pandas head function to preview the first rows of the dataset that the examples in this post follow.
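
As a rough sketch of that preview step, the snippet below builds a two-row stand-in for the dataset and calls `head()` on it. The values are the first two rows that appear in the outputs later in this post; building the DataFrame inline is an assumption made so the example runs on its own, whereas in practice the data would be loaded from the downloaded file.

```python
import pandas as pd

# Stand-in for the wine dataset: the first two rows as they appear in the
# outputs later in this post (assumption: the real data is loaded from a file).
wines = pd.DataFrame(
    {
        "fixed acidity": [7.4, 7.8],
        "volatile acidity": [0.70, 0.88],
        "citric acid": [0.0, 0.0],
        "residual sugar": [1.9, 2.6],
        "chlorides": [0.076, 0.098],
        "free sulfur dioxide": [11.0, 25.0],
        "total sulfur dioxide": [34.0, 67.0],
        "density": [0.9978, 0.9968],
        "pH": [3.51, 3.20],
        "sulphates": [0.56, 0.68],
        "alcohol": [9.4, 9.8],
        "quality": [5, 5],
    }
)

# head() returns the first n rows (5 by default) as a new DataFrame
preview = wines.head()
print(preview)
```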

Iteration

We start with the functions for iterating through a dataset, which are useful when we need to process it element by element or row by row.

The behavior of basic iteration over Pandas objects depends on the type. When iterating over a Series, it is regarded as array-like, and basic iteration produces the values. Other data structures, like DataFrame, follow the dict-like convention of iterating over the keys of the object.
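
The difference between the two conventions can be sketched as follows. The `alcohol` Series and the small `frame` DataFrame are hypothetical stand-ins, not part of the wine dataset.

```python
import pandas as pd

# Hypothetical Series holding a few alcohol readings
alcohol = pd.Series([9.4, 9.8, 9.8], name="alcohol")

# Iterating a Series yields its values, like a list or an array
series_items = [value for value in alcohol]

# Iterating a DataFrame yields its column labels, like the keys of a dict
frame = pd.DataFrame({"alcohol": [9.4, 9.8], "quality": [5, 5]})
frame_items = [label for label in frame]

print(series_items)  # [9.4, 9.8, 9.8]
print(frame_items)   # ['alcohol', 'quality']
```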

If we iterate over a DataFrame, we get the column names.

for element in wines:
    print(element)


fixed acidity           
volatile acidity       
citric acid              
residual sugar       
chlorides               
free sulfur dioxide     
total sulfur dioxide    
density    
pH           
sulphates
alcohol
quality

To iterate over the columns or rows of a DataFrame, we can use the following functions:

Items

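Consistent with the dict-like interface, items() iterates through key-value pairs. The older iteritems() did the same, but it was deprecated in Pandas 1.5 and removed in Pandas 2.0, so new code should use items():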

+ Series: (index, scalar value) pairs

+ DataFrame: (column, Series) pairs

for key, value in wines.items():
    print(key)
    print(value)
    
    
fixed acidity
0        7.4
1        7.8
2        7.8
3       11.2
4        7.4
        ... 
1594     6.2
1595     5.9
1596     6.3
1597     5.9
1598     6.0
Name: fixed acidity, Length: 1599, dtype: float64
volatile acidity
0       0.700
1       0.880
2       0.760
3       0.280
4       0.700
        ...
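
The two kinds of pairs listed above can be seen side by side in a minimal sketch. The two-row `wines` DataFrame here is an assumption standing in for the full dataset.

```python
import pandas as pd

# Two-row stand-in for the wine dataset (assumption for illustration)
wines = pd.DataFrame({"density": [0.9978, 0.9968], "quality": [5, 5]})

# On a Series, items() yields (index, scalar value) pairs
series_pairs = list(wines["density"].items())

# On a DataFrame, items() yields (column name, Series) pairs
column_names = [name for name, column in wines.items()]

print(series_pairs)   # [(0, 0.9978), (1, 0.9968)]
print(column_names)   # ['density', 'quality']
```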

Iterrows

The iterrows() method lets you iterate through the rows of a DataFrame as Series objects. It returns an iterator yielding each index value along with a Series containing the data in that row:

for row_index, row in wines.iterrows():
    print(row_index, row, sep="\n")
    
0
fixed acidity            7.4000
volatile acidity         0.7000
citric acid              0.0000
residual sugar           1.9000
chlorides                0.0760
free sulfur dioxide     11.0000
total sulfur dioxide    34.0000
density                  0.9978
pH                       3.5100
sulphates                0.5600
alcohol                  9.4000
quality                  5.0000
Name: 0, dtype: float64
1
fixed acidity            7.8000
volatile acidity         0.8800
citric acid              0.0000
residual sugar           2.6000
chlorides                0.0980
free sulfur dioxide     25.0000
total sulfur dioxide    67.0000
density                  0.9968
pH                       3.2000
sulphates                0.6800
alcohol                  9.8000
quality                  5.0000
Name: 1, dtype: float64
2
fixed acidity            7.800
volatile acidity         0.760
citric acid              0.040
residual sugar           2.300
...
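
Note that `quality` is printed as 5.0000 in the output above. Because each row is returned as a single Series, its values share one dtype, so integers are upcast to floats when the row also holds floats. A minimal sketch of this effect, using a two-row stand-in DataFrame (an assumption for illustration):

```python
import pandas as pd

# Two-row stand-in for the wine dataset (assumption for illustration)
wines = pd.DataFrame({"density": [0.9978, 0.9968], "quality": [5, 5]})

# In the DataFrame, quality is stored as an integer column
column_dtype = wines["quality"].dtype

# Each row produced by iterrows() is one Series with a single dtype,
# so the integer quality value is upcast to float alongside density
row_dtypes = [row.dtype for row_index, row in wines.iterrows()]

print(column_dtype)
print(row_dtypes)
```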

Itertuples

The itertuples() method will return an iterator yielding a namedtuple for each row in the DataFrame. The first element of the tuple will be the row’s corresponding index value, while the remaining values are the row values.

for row in wines.itertuples():
    print(row)    

Pandas(Index=0, _1=7.4, _2=0.7, _3=0.0, _4=1.9, chlorides=0.076, _6=11.0, _7=34.0, density=0.9978, pH=3.51, sulphates=0.56, alcohol=9.4, quality=5)
Pandas(Index=1, _1=7.8, _2=0.88, _3=0.0, _4=2.6, chlorides=0.098, _6=25.0, _7=67.0, density=0.9968, pH=3.2, sulphates=0.68, alcohol=9.8, quality=5)
...
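
In the output above, columns whose names are not valid Python identifiers (such as `fixed acidity`) appear as positional names like `_1`, while the others keep their names. The sketch below shows how the fields can be accessed, using a small stand-in DataFrame (an assumption for illustration); `index` and `name` are optional parameters of `itertuples()`.

```python
import pandas as pd

# Small stand-in for the wine dataset (assumption for illustration)
wines = pd.DataFrame(
    {"fixed acidity": [7.4, 7.8], "density": [0.9978, 0.9968], "quality": [5, 5]}
)

for row in wines.itertuples():
    # "fixed acidity" contains a space, so it is exposed as _1;
    # valid names such as density and quality are kept as attributes
    print(row.Index, row._1, row.density, row.quality)

# The field names of the generated namedtuple
field_names = next(wines.itertuples())._fields

# index=False drops the index, name=None yields plain tuples
plain_rows = list(wines.itertuples(index=False, name=None))
print(plain_rows)  # [(7.4, 0.9978, 5), (7.8, 0.9968, 5)]
```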

Conclusion

The Pandas library provides three functions that make iterating over a dataset easier. They are:

items() (formerly iteritems()): This function iterates over the dataset column by column, yielding each column name together with its Series. It is useful when we want to inspect the values one column at a time, since we can stop at the column we need instead of iterating over all of them.

iterrows(): This function iterates over the dataset row by row, yielding each index value together with the row as a Series. It is useful when we want to walk through complete rows, for example to search for a specific value and stop once it is found, without traversing the whole dataset.

itertuples(): This function iterates over the dataset row by row, yielding each row as a namedtuple. It is useful when we need to iterate over complete rows but want the output in tuple form.

Maps

We continue with the two most important functions for mapping a Series or DataFrame.

Map

The Pandas map() function maps each value of a Series to another value using a dictionary, a function or another Series. It is a convenient way to move the values of a Series from one domain to another, as it lets us apply the same transformation to every row of a given column in a dataset.

For example, we can transform the series obtained from the `density` column by executing a function that multiplies each of its values by 100.

wines['density'].map(lambda x: x * 100)

0       99.780
1       99.680
2       99.700
3       99.800
4       99.780
         ...  
1594    99.490
1595    99.512
1596    99.574
1597    99.547
1598    99.549
Name: density, Length: 1599, dtype: float64
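
Besides a function, `map()` also accepts a dictionary. The sketch below uses a hypothetical `labels` dictionary and a short stand-in Series of quality scores; neither comes from the original dataset.

```python
import pandas as pd

# Short stand-in Series of quality scores (assumption for illustration)
quality = pd.Series([3, 5, 7, 8], name="quality")

# Hypothetical labels for a few of the scores
labels = {3: "low", 5: "medium", 7: "high"}

# Each value is looked up in the dictionary;
# values with no matching key (8 here) become NaN
quality_labels = quality.map(labels)
print(quality_labels)
```

When every value must be translated, it is worth checking the result for missing values, since unmatched keys do not raise an error.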

Apply

Arbitrary functions can be applied along the axes of a DataFrame using the apply() method, which, like the descriptive statistics methods, takes an optional axis argument.

For example, we can divide each value of the `density` column by 100, which would undo the scaling shown above, without having to extract the Series from the DataFrame, since apply() works with DataFrames.

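def divide_by_100(row):
    # With axis='columns', apply() passes each row to the function as a Series
    row = row.copy()  # work on a copy rather than mutating the input row
    row['density'] = row['density'] / 100
    return row

# Returns a new DataFrame; the original DataFrame is left unchanged
wines.apply(divide_by_100, axis='columns')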
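
The effect of the axis argument can be sketched on a two-row stand-in DataFrame (an assumption for illustration). With the default `axis=0` the function receives each column; with `axis='columns'` it receives each row.

```python
import pandas as pd

# Two-row stand-in for the wine dataset (assumption for illustration)
wines = pd.DataFrame({"density": [0.9978, 0.9968], "alcohol": [9.4, 9.8]})

# Default axis=0 ("index"): the function receives each column as a Series
column_max = wines.apply(lambda column: column.max())

# axis="columns" (or 1): the function receives each row as a Series
row_sum = wines.apply(lambda row: row["density"] + row["alcohol"], axis="columns")

print(column_max)  # one value per column
print(row_sum)     # one value per row
```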

Grouping

In the abstract, grouping means providing a mapping of labels to group names. For example, to create a GroupBy object grouped by `quality` and count the rows in each group, you can do the following:

wines.groupby(["quality"]).quality.count()

quality
3     10
4     53
5    681
6    638
7    199
8     18
Name: quality, dtype: int64

You can also create the GroupBy object and apply a custom function. In this case we group by `quality` and `alcohol` and obtain the row with the highest density from each group:

wines.groupby(['quality', 'alcohol']).apply(lambda df: df.loc[df.density.idxmax()])
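
A similar result can be reached without a custom function, which is a different formulation from the one used above: ask each group for the index label of its highest density with `idxmax()`, then select those rows. This sketch groups by `quality` only and uses a four-row stand-in DataFrame with made-up values (both assumptions for illustration).

```python
import pandas as pd

# Four-row stand-in with hypothetical values (assumption for illustration)
wines = pd.DataFrame(
    {
        "quality": [5, 5, 6, 6],
        "alcohol": [9.4, 9.8, 10.5, 11.0],
        "density": [0.9978, 0.9968, 0.9980, 0.9955],
    }
)

# For each quality group, the index label of the row with the highest density
densest_index = wines.groupby("quality")["density"].idxmax()

# Use those labels to pull the complete rows out of the original DataFrame
densest_rows = wines.loc[densest_index]
print(densest_rows)
```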

Finally, one of the most useful grouping functions in data analysis is the aggregation function, agg().

In this case we group by `quality` and obtain the minimum and maximum value of `alcohol` for each group.

wines.groupby(['quality']).alcohol.agg(['min', 'max'])
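
Aggregations can also be given explicit result names through keyword arguments. The sketch below uses a four-row stand-in DataFrame with made-up values, and the names `lowest`, `highest` and `average` are hypothetical labels chosen for illustration.

```python
import pandas as pd

# Four-row stand-in with hypothetical values (assumption for illustration)
wines = pd.DataFrame(
    {"quality": [5, 5, 6, 6], "alcohol": [9.4, 9.8, 10.5, 11.0]}
)

# Each keyword becomes a column in the result,
# and its value names the aggregation to compute
alcohol_summary = wines.groupby("quality")["alcohol"].agg(
    lowest="min", highest="max", average="mean"
)
print(alcohol_summary)
```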

Sorting

In this case we use a different dataset to explain the sorting functionality in Pandas clearly. First, let us look at the small example dataset that we are going to manipulate, which we will call `unsorted_df`: