Introduction
This is the second post in our Pandas training series. In this article we summarize the functions Pandas offers for indexing, selection, and filtering.
Indexing, Selecting and Filtering
Before we start, let's look at the head of the teaching dataset we will use throughout the examples. It is a well-known dataset that contains wine information.
As an introduction, we will cover a few functions that are very useful for getting a broad view of the state of our dataset.
Getting info
Info
We start with the info() function, which reports the number and names of the columns, the number of non-null elements in each column, and the data type of each column.
df.info()
Wines Dataset:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 fixed acidity 1599 non-null float64
1 volatile acidity 1599 non-null float64
2 citric acid 1599 non-null float64
3 residual sugar 1599 non-null float64
4 chlorides 1599 non-null float64
5 free sulfur dioxide 1599 non-null float64
6 total sulfur dioxide 1599 non-null float64
7 density 1599 non-null float64
8 pH 1599 non-null float64
9 sulphates 1599 non-null float64
10 alcohol 1599 non-null float64
11 quality 1599 non-null int64
dtypes: float64(11), int64(1)
memory usage: 150.0 KB
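As a small complement to the output above, here is a minimal sketch of how the same information can be read programmatically. The `toy` DataFrame and its values are made up for illustration and are not rows from the wine dataset.

```python
import io

import pandas as pd

# Hypothetical toy frame (illustrative values, one missing entry on purpose).
toy = pd.DataFrame({
    "alcohol": [9.4, 9.8, None, 10.5],
    "pH": [3.51, 3.20, 3.26, 3.16],
    "quality": [5, 5, 6, 7],
})

# info() prints its report and returns None; pass buf= to capture the text.
buffer = io.StringIO()
toy.info(buf=buffer)
report = buffer.getvalue()
print(report)

# The same facts are available as regular objects.
non_null_counts = toy.count()   # non-null values per column
n_rows, n_cols = toy.shape      # (rows, columns)
```

Because info() only prints, count() and shape are usually more convenient when the numbers are needed later in the code.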
Dtypes
We continue with the dtypes attribute, which shows only the data type of each column.
df.dtypes
Wines Dataset:
fixed acidity float64
volatile acidity float64
citric acid float64
residual sugar float64
chlorides float64
free sulfur dioxide float64
total sulfur dioxide float64
density float64
pH float64
sulphates float64
alcohol float64
quality int64
dtype: object
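The result of dtypes is itself a Series (hence the trailing "dtype: object"), so it can be indexed by column name. The sketch below also uses select_dtypes() to pick columns by type; the `toy` frame and its `label` column are hypothetical.

```python
import pandas as pd

# Hypothetical toy frame with one float, one integer and one text column.
toy = pd.DataFrame({
    "alcohol": [9.4, 9.8, 10.5],
    "quality": [5, 5, 7],
    "label": ["low", "low", "high"],
})

dtypes = toy.dtypes                  # Series: index = column names, values = dtypes
alcohol_dtype = dtypes["alcohol"]    # dtype of a single column

# Select columns by data type.
float_cols = toy.select_dtypes(include="float64").columns.tolist()
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
print(float_cols, numeric_cols)
```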
Describe
The describe() function computes summary statistics for each numeric column (count, mean, standard deviation, minimum, quartiles, and maximum), which helps us understand the distribution of our dataset.
df.describe()
Wines Dataset:
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
count 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000
mean 8.319637 0.527821 0.270976 2.538806 0.087467 15.874922 46.467792 0.996747 3.311113 0.658149 10.422983 5.636023
std 1.741096 0.179060 0.194801 1.409928 0.047065 10.460157 32.895324 0.001887 0.154386 0.169507 1.065668 0.807569
min 4.600000 0.120000 0.000000 0.900000 0.012000 1.000000 6.000000 0.990070 2.740000 0.330000 8.400000 3.000000
25% 7.100000 0.390000 0.090000 1.900000 0.070000 7.000000 22.000000 0.995600 3.210000 0.550000 9.500000 5.000000
50% 7.900000 0.520000 0.260000 2.200000 0.079000 14.000000 38.000000 0.996750 3.310000 0.620000 10.200000 6.000000
75% 9.200000 0.640000 0.420000 2.600000 0.090000 21.000000 62.000000 0.997835 3.400000 0.730000 11.100000 6.000000
max 15.900000 1.580000 1.000000 15.500000 0.611000 72.000000 289.000000 1.003690 4.010000 2.000000 14.900000 8.000000
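The result of describe() is a regular DataFrame, so individual statistics can be picked out of it with `loc`. A minimal sketch on a hypothetical toy frame (the values are made up so the arithmetic is easy to check by hand):

```python
import pandas as pd

# Hypothetical toy frame with easy-to-verify numbers.
toy = pd.DataFrame({
    "alcohol": [9.0, 10.0, 11.0, 12.0],
    "quality": [5, 5, 6, 8],
})

summary = toy.describe()

# Rows of the summary are statistics, columns are the original columns.
mean_quality = summary.loc["mean", "quality"]     # (5 + 5 + 6 + 8) / 4
median_alcohol = summary.loc["50%", "alcohol"]    # midpoint of 10.0 and 11.0

# Other percentiles can be requested explicitly.
custom = toy.describe(percentiles=[0.1, 0.9])
print(summary)
```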
Indexing and Selection
Here we take a deep dive into the two main pandas indexers used for indexing and selection: `iloc` and `loc`.
+ .loc is primarily label based, but may also be used with a boolean array. .loc will raise KeyError when the items are not found. Allowed inputs are:
– A single label, e.g. 5 or 'a' (note that 5 is interpreted as a label of the index, not as an integer position along the index).
– A list or array of labels, e.g. ['a', 'b', 'c'].
– A slice object with labels, e.g. 'a':'f' (note that, contrary to usual Python slices, both the start and the stop are included when present in the index).
– A boolean array (any NA values will be treated as False).
– A callable function with one argument (the calling Series or DataFrame) and that returns valid output for indexing (one of the above).
+ .iloc is primarily integer position based (from 0 to length-1 of the axis), but may also be used with a boolean array. .iloc will raise IndexError if a requested indexer is out of bounds, except for slice indexers, which allow out-of-bounds indexing (this conforms with Python/NumPy slice semantics). Allowed inputs are:
– An integer e.g. 5.
– A list or array of integers [4, 3, 0].
– A slice object with ints 1:7.
– A boolean array (any NA values will be treated as False).
– A callable function with one argument (the calling Series or DataFrame) and that returns valid output for indexing (one of the above).
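The rules above are easiest to see on an index whose labels do not match the positions. The following is a minimal sketch with a hypothetical Series `s`; it illustrates the label/position difference, the two error types, out-of-bounds slices, and a callable indexer.

```python
import pandas as pd

# Integer labels that deliberately do NOT match the positions 0, 1, 2.
s = pd.Series([10, 20, 30], index=[5, 6, 7])

by_label = s.loc[5]       # label 5    -> 10
by_position = s.iloc[0]   # position 0 -> 10

# .loc raises KeyError when the label does not exist.
try:
    s.loc[0]
    loc_error = None
except KeyError:
    loc_error = "KeyError"

# .iloc raises IndexError when a single position is out of bounds.
try:
    s.iloc[5]
    iloc_error = None
except IndexError:
    iloc_error = "IndexError"

# Slices are the exception: out-of-bounds ends are simply clipped.
clipped = s.iloc[1:100]

# A callable receives the Series and returns a valid indexer (here a boolean mask).
large = s.loc[lambda x: x > 15]
```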
There is no better way to understand how these indexers work than through examples, so here is a range of examples showing the different ways to use them.
`iloc` examples
+ Get First Row
df.iloc[0]
+ Get first column
df.iloc[:, 0]
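+ Get the first column of the first row (returned as a one-element Series; df.iloc[0, 0] returns the scalar value instead)
df.iloc[0:1, 0]
+ Get the rows at positions 3 and 4 (the end position, 5, is excluded)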
df.iloc[3:5]
+ Get rows 3, 7, 10
df.iloc[[3, 7, 10]]
+ Get last five rows
df.iloc[-5:]
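To make the return types of these calls concrete, here is a minimal sketch that runs the same kinds of `iloc` selections on a hypothetical six-row `toy` frame (the values are illustrative, not taken from the wine dataset).

```python
import pandas as pd

# Hypothetical toy frame with a default RangeIndex (0..5).
toy = pd.DataFrame({
    "alcohol": [9.4, 9.8, 9.8, 10.0, 9.4, 10.5],
    "quality": [5, 5, 5, 6, 5, 7],
})

first_row = toy.iloc[0]             # Series indexed by column name
first_col = toy.iloc[:, 0]          # Series indexed by row label
cell = toy.iloc[0, 0]               # scalar value
cell_as_series = toy.iloc[0:1, 0]   # one-element Series

middle = toy.iloc[3:5]              # positions 3 and 4; 5 is excluded
picked = toy.iloc[[0, 2, 5]]        # arbitrary positions, in the given order
last_two = toy.iloc[-2:]            # negative positions count from the end
```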
`loc` examples
+ Get the value of column 'quality' in the first row (the row labelled 0)
df.loc[0, 'quality']
+ Get all rows from columns ‘quality’, ‘sulphates’, ‘alcohol’
df.loc[:, ['quality', 'sulphates', 'alcohol']]
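+ Get the rows from the one labelled 'litres' onward, for columns 'quality' through 'alcohol'. This assumes a DataFrame (here df1) whose index contains the label 'litres' and in which 'quality' comes before 'alcohol'; in the wine dataset above the column order is reversed, so the slice there would be 'alcohol':'quality'.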
df1.loc['litres':, 'quality':'alcohol']
+ Get the rows labelled 3 through 5, both endpoints included (unlike iloc, which excludes the end position)
df.loc[3:5]
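The difference between label slices and position slices can be checked with a short sketch. The `toy` frame, its string labels 'a' to 'd', and its values are hypothetical.

```python
import pandas as pd

# Hypothetical toy frame with string row labels.
toy = pd.DataFrame(
    {
        "sulphates": [0.56, 0.68, 0.65, 0.58],
        "alcohol": [9.4, 9.8, 9.8, 10.0],
        "quality": [5, 5, 5, 6],
    },
    index=["a", "b", "c", "d"],
)

# Label slices include BOTH endpoints and follow the row/column order.
from_b = toy.loc["b":, "alcohol":"quality"]

# With a default integer index, the same-looking slice differs between loc and iloc.
numbered = toy.reset_index(drop=True)
by_label = numbered.loc[1:3]      # labels 1, 2, 3 -> three rows
by_position = numbered.iloc[1:3]  # positions 1, 2 -> two rows
```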
Filtering
One of the things that helps us the most when working with data is being able to filter it according to certain conditions. For this, the `loc` indexer accepts boolean conditions, as shown below (in these examples the wine DataFrame is named `wines`):
+ Get all wines whose quality is greater than 6
wines.loc[wines.quality > 6]
+ Get all wines whose quality is greater than 5 and less than 8
wines.loc[(wines.quality > 5) & (wines.quality < 8)]
+ Get all wines whose quality is equal to 5 or equal to 7
wines.loc[(wines.quality == 5) | (wines.quality == 7)]
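The conditions above are ordinary boolean Series, so they can also be built with helper methods. The sketch below runs on a hypothetical five-row `wines` frame (illustrative values) and shows between(), isin(), and query() as possible alternatives to the combined comparisons.

```python
import pandas as pd

# Hypothetical toy frame standing in for the wine dataset.
wines = pd.DataFrame({
    "alcohol": [9.0, 9.5, 10.2, 11.0, 12.5],
    "quality": [3, 5, 6, 7, 8],
})

# A comparison produces a boolean mask with one entry per row.
mask = wines["quality"] > 6
high = wines.loc[mask]

# Each condition needs its own parentheses because & and | bind
# more tightly than the comparison operators.
mid = wines.loc[(wines["quality"] > 5) & (wines["quality"] < 8)]

# For integer quality, "> 5 and < 8" is the inclusive range 6..7.
mid_between = wines.loc[wines["quality"].between(6, 7)]

# isin() replaces a chain of == conditions joined with |.
five_or_seven = wines.loc[wines["quality"].isin([5, 7])]

# query() expresses the same filter as a string.
mid_query = wines.query("quality > 5 and quality < 8")
```

Which form to use is mostly a matter of readability; all of them return a filtered DataFrame that keeps the original row labels.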
Training your abilities
If you want to take your Data Science skills further, we have created a course that you can download for free here.
In the next chapter, we will take a deep dive into the functions we use to iterate, map, group and sort.