This is the third post in our data science training series on Pandas. In this article we summarize the Pandas functions used for iteration, mapping, grouping and sorting. These functions let us transform the data and extract useful information and insights.
Iteration, Maps, Grouping and Sorting
The ‘Wine Quality Dataset’, published in 2009 by Cortez et al. and available from the UCI Machine Learning Repository, is a well-known dataset containing wine quality information. It includes the physicochemical properties of red and white wines together with a quality score.
Before we start, let's look at the first rows of the dataset used throughout the examples, using the pandas head() function.
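As a minimal sketch of that step, the snippet below builds a tiny hand-typed sample instead of loading the real file. The variable name `wines` matches the later examples, but the three rows and the choice of columns are assumptions made only for illustration.

```python
import pandas as pd

# Hypothetical three-row sample standing in for the full wine dataset.
wines = pd.DataFrame({
    "fixed acidity": [7.4, 7.8, 7.8],
    "density": [0.9978, 0.9968, 0.9970],
    "alcohol": [9.4, 9.8, 9.8],
    "quality": [5, 5, 5],
})

# head(n) returns the first n rows; n defaults to 5 when omitted.
first_rows = wines.head(2)
print(first_rows)
```

With the real dataset, calling `wines.head()` without arguments would show the first five rows.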
We start with the functions for iterating through a dataset, which are useful when we need to process it row by row or column by column.
The behavior of basic iteration over Pandas objects depends on the type. When iterating over a Series, it is regarded as array-like, and basic iteration produces the values. A DataFrame follows the dict-like convention of iterating over the keys of the object, that is, its column labels. (Panel, which followed the same convention, was removed in pandas 0.25.)
If we iterate over a DataFrame, we get the column names.
    for element in wines:
        print(element)

    fixed acidity
    volatile acidity
    citric acid
    residual sugar
    chlorides
    free sulfur dioxide
    total sulfur dioxide
    density
    pH
    sulphates
    alcohol
    quality
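To contrast the two behaviors described above, here is a small sketch on a hypothetical two-column sample (the values are illustrative, not the full dataset). Iterating over the DataFrame yields column labels, while iterating over one of its Series yields the values.

```python
import pandas as pd

# Hypothetical two-row sample used only for this illustration.
wines = pd.DataFrame({"density": [0.9978, 0.9968], "quality": [5, 5]})

# Iterating over a DataFrame is dict-like: we get the column labels.
column_labels = [label for label in wines]

# Iterating over a Series is array-like: we get the values.
density_values = [value for value in wines["density"]]

print(column_labels)
print(density_values)
```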
To iterate over the columns or rows of a DataFrame, we can use the following methods:
Consistent with the dict-like interface, items() iterates through key-value pairs (iteritems() was an older alias that was removed in pandas 2.0):
+ Series: (index, scalar value) pairs
+ DataFrame: (column, Series) pairs
    for key, value in wines.items():
        print(key)
        print(value)

    fixed acidity
    0        7.4
    1        7.8
    2        7.8
    3       11.2
    4        7.4
            ...
    1594     6.2
    1595     5.9
    1596     6.3
    1597     5.9
    1598     6.0
    Name: fixed acidity, Length: 1599, dtype: float64
    volatile acidity
    0       0.700
    1       0.880
    2       0.760
    3       0.280
    4       0.700
    ...
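The example above covers the DataFrame case. For the Series case, a minimal sketch with a hypothetical three-value Series shows the (index, scalar value) pairs that items() yields.

```python
import pandas as pd

# Hypothetical Series with the default integer index 0, 1, 2.
density = pd.Series([0.9978, 0.9968, 0.9970], name="density")

# On a Series, items() yields (index, scalar value) pairs.
pairs = list(density.items())

for index, value in pairs:
    print(index, value)
```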
iterrows() allows you to iterate through the rows of a DataFrame as Series objects. It returns an iterator yielding each index value along with a Series containing the data in each row:
    for row_index, row in wines.iterrows():
        print(row_index, row, sep="\n")

    0
    fixed acidity            7.4000
    volatile acidity         0.7000
    citric acid              0.0000
    residual sugar           1.9000
    chlorides                0.0760
    free sulfur dioxide     11.0000
    total sulfur dioxide    34.0000
    density                  0.9978
    pH                       3.5100
    sulphates                0.5600
    alcohol                  9.4000
    quality                  5.0000
    Name: 0, dtype: float64
    1
    fixed acidity            7.8000
    volatile acidity         0.8800
    citric acid              0.0000
    residual sugar           2.6000
    chlorides                0.0980
    free sulfur dioxide     25.0000
    total sulfur dioxide    67.0000
    density                  0.9968
    pH                       3.2000
    sulphates                0.6800
    alcohol                  9.8000
    quality                  5.0000
    Name: 1, dtype: float64
    2
    fixed acidity            7.800
    volatile acidity         0.760
    citric acid              0.040
    residual sugar           2.300
    ...
The itertuples() method will return an iterator yielding a namedtuple for each row in the DataFrame. The first element of the tuple will be the row’s corresponding index value, while the remaining values are the row values.
    for row in wines.itertuples():
        print(row)

    Pandas(Index=0, _1=7.4, _2=0.7, _3=0.0, _4=1.9, chlorides=0.076, _6=11.0, _7=34.0, density=0.9978, pH=3.51, sulphates=0.56, alcohol=9.4, quality=5)
    Pandas(Index=1, _1=7.8, _2=0.88, _3=0.0, _4=2.6, chlorides=0.098, _6=25.0, _7=67.0, density=0.9968, pH=3.2, sulphates=0.68, alcohol=9.8, quality=5)
    ...
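The `_1`, `_2` fields in that output appear because column names such as `fixed acidity` are not valid Python identifiers, so they are replaced by positional names. The sketch below, on a hypothetical two-row sample, shows how to read those fields and how the `index` and `name` parameters of itertuples() produce plain tuples instead.

```python
import pandas as pd

# Hypothetical sample; "fixed acidity" contains a space on purpose.
wines = pd.DataFrame({
    "fixed acidity": [7.4, 7.8],
    "pH": [3.51, 3.20],
    "quality": [5, 5],
})

for row in wines.itertuples():
    # "fixed acidity" sits at position 1 (after Index), so it becomes _1.
    print(row.Index, row._1, row.pH, row.quality)

# index=False drops the index field; name=None yields regular tuples.
plain_rows = list(wines.itertuples(index=False, name=None))
print(plain_rows)
```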
In summary, Pandas provides three methods that make iterating over a dataset easier:
+ items() (iteritems() in older versions): iterates over the DataFrame column by column, yielding each column label together with its Series. It is useful when we want to process one column at a time.
+ iterrows(): iterates over the DataFrame row by row, yielding each index value together with the row as a Series. It is useful when we need whole rows, for example to look for a row with a specific value.
+ itertuples(): iterates over the DataFrame row by row, yielding each row as a namedtuple. It is useful when we need whole rows in tuple form.
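One practical difference between the two row-wise methods is worth a short sketch. Because iterrows() packs each row into a single Series, mixed integer and float columns are upcast to a common dtype, whereas itertuples() keeps each value's own type. The two-row sample below is hypothetical.

```python
import pandas as pd

# Hypothetical sample mixing a float column and an integer column.
wines = pd.DataFrame({"alcohol": [9.4, 9.8], "quality": [5, 5]})

# iterrows(): the row is one Series, so the integer is upcast to float.
_, first_row = next(wines.iterrows())
quality_from_iterrows = first_row["quality"]

# itertuples(): each field keeps the type of its own column.
first_tuple = next(wines.itertuples(index=False))
quality_from_itertuples = first_tuple.quality

print(type(quality_from_iterrows), type(quality_from_itertuples))
```

If the original dtypes matter, itertuples() is therefore usually the safer choice of the two.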
We continue with the two most important functions for mapping a Series or DataFrame.
The Pandas map() function maps each value of a Series to another value using a dictionary, a function or another Series. It is a convenient way to translate the values of a Series from one domain to another, since it lets us transform every row of a given column in a dataset.
For example, we can transform the Series obtained from the `density` column by applying a function that multiplies each of its values by 100:
    wines['density'].map(lambda x: x * 100)

    0       99.780
    1       99.680
    2       99.700
    3       99.800
    4       99.780
             ...
    1594    99.490
    1595    99.512
    1596    99.574
    1597    99.547
    1598    99.549
    Name: density, Length: 1599, dtype: float64
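map() also accepts a dictionary, as mentioned above. The following sketch maps quality scores to labels. The label names and the score-to-label assignment are hypothetical, chosen only to illustrate the call. Values that have no key in the dictionary become NaN.

```python
import pandas as pd

# Hypothetical quality scores; 7 is deliberately missing from the mapping.
quality = pd.Series([3, 5, 8, 7], name="quality")

# Hypothetical labels, invented for this illustration.
labels = {3: "low", 5: "medium", 8: "high"}

# Each value is looked up in the dictionary; unmatched values become NaN.
quality_label = quality.map(labels)
print(quality_label)
```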
Arbitrary functions can be applied along the axes of a DataFrame using the apply() method, which, like the descriptive statistics methods, takes an optional axis argument:
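For example, we can divide the `density` of every row by 100, which undoes the multiplication above when its result has been stored back in the column. Note that apply() returns a new DataFrame rather than modifying the original:

    def divide_by_100(row):
        # With axis='columns' the function receives one row (a Series) at a time
        row["density"] = row["density"] / 100
        return row

    wines.apply(divide_by_100, axis='columns')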
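To make the role of the axis argument more concrete, here is a minimal sketch on a hypothetical two-row sample. With the default axis the function receives one column at a time, and with `axis="columns"` it receives one row at a time. The derived quantities (a per-column range and a free-to-total sulfur dioxide ratio) are illustrative choices, not part of the original analysis.

```python
import pandas as pd

# Hypothetical sample with two numeric columns.
wines = pd.DataFrame({
    "free sulfur dioxide": [11.0, 25.0],
    "total sulfur dioxide": [34.0, 67.0],
})

# Default axis (axis="index"): the function is called once per column.
column_ranges = wines.apply(lambda col: col.max() - col.min())

# axis="columns": the function is called once per row.
free_share = wines.apply(
    lambda row: row["free sulfur dioxide"] / row["total sulfur dioxide"],
    axis="columns",
)

print(column_ranges)
print(free_share)
```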
The abstract definition of grouping is to provide a mapping of labels to group names. To create a GroupBy object grouping by `quality` you may do the following:
    wines.groupby(["quality"]).quality.count()

    quality
    3     10
    4     53
    5    681
    6    638
    7    199
    8     18
    Name: quality, dtype: int64
You can also create the GroupBy object and apply a custom function. In this case we group by `quality` and `alcohol` and obtain the row with the highest density in each group:

    wines.groupby(['quality', 'alcohol']).apply(lambda df: df.loc[df.density.idxmax()])
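As an alternative sketch (not the approach used above), the same kind of "densest row per group" selection can be written without apply() by asking each group for the index label of its maximum and then selecting those rows. The four-row sample is hypothetical, and for brevity it groups by `quality` only.

```python
import pandas as pd

# Hypothetical sample with two quality groups.
wines = pd.DataFrame({
    "quality": [5, 5, 6, 6],
    "alcohol": [9.4, 9.8, 9.8, 10.5],
    "density": [0.9978, 0.9968, 0.9980, 0.9970],
})

# For each quality group, find the index label of the highest density.
densest_idx = wines.groupby("quality")["density"].idxmax()

# Select the full rows for those labels.
densest = wines.loc[densest_idx]
print(densest)
```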
Finally, within the grouping section, one of the most useful functions in data analysis is the aggregation function.
In this case we are going to group by `quality` and we are going to obtain the maximum and minimum value of `alcohol` for each group.
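A minimal sketch of that aggregation, using a hypothetical four-row sample in place of the full dataset, could look like this. Passing a list of function names to agg() produces one output column per function.

```python
import pandas as pd

# Hypothetical sample with two quality groups.
wines = pd.DataFrame({
    "quality": [5, 5, 6, 6],
    "alcohol": [9.4, 9.8, 9.8, 10.5],
})

# Minimum and maximum alcohol content within each quality group.
alcohol_range = wines.groupby("quality")["alcohol"].agg(["min", "max"])
print(alcohol_range)
```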
To explain the sorting functionality in Pandas clearly, we switch to a different, smaller dataset. We first look at the example dataset we are going to manipulate, which we call `unsorted_df`:
+ Sort by index
+ Sort by index descending order
+ Sort by columns
+ Sort by values
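The four operations listed above can be sketched as follows. The contents of `unsorted_df` here are an assumption (a tiny frame with a shuffled index and shuffled column labels), since the original example data is not shown.

```python
import pandas as pd

# Hypothetical unsorted frame: shuffled index and column labels.
unsorted_df = pd.DataFrame(
    {"col2": [3, 1, 2], "col1": [10, 30, 20]},
    index=[2, 0, 1],
)

# Sort by index (ascending by default).
by_index = unsorted_df.sort_index()

# Sort by index in descending order.
by_index_desc = unsorted_df.sort_index(ascending=False)

# Sort by columns: axis=1 orders the column labels.
by_columns = unsorted_df.sort_index(axis=1)

# Sort by values of a given column.
by_values = unsorted_df.sort_values(by="col1")

print(by_index, by_index_desc, by_columns, by_values, sep="\n\n")
```

All four calls return new DataFrames and leave `unsorted_df` unchanged.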
Training your abilities
If you want to take your Data Science skills further, we have created a course that you can download for free here.
In the next chapter, we will take a deep dive into the functions used to treat missing data.