Getting Started with Pandas – Lesson 2


Introduction

This is the second post in our Pandas training series. In this article we summarize the main tools that Pandas offers for indexing, selection and filtering.

Indexing, Selecting and Filtering

Before we start, let's look at the head of the teaching dataset that we will use throughout the examples. It is a well-known dataset that contains wine information.
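As a minimal sketch of how such a preview can be produced: the article does not say where the file is loaded from, so the snippet below builds a small stand-in DataFrame instead. The values are made up for illustration; only the column names come from the wine dataset.

```python
import pandas as pd

# Toy stand-in for the wine dataset: values are invented,
# only the column names match the article's dataset.
df = pd.DataFrame(
    {
        "fixed acidity": [7.0, 8.0, 7.5, 11.0, 7.2, 6.9],
        "alcohol": [9.5, 10.0, 9.8, 11.2, 9.4, 10.5],
        "quality": [5, 5, 6, 7, 5, 6],
    }
)

# head() returns the first five rows by default.
print(df.head())

# Pass a number to get a different amount of rows.
print(df.head(3))
```

With the real dataset, df would typically come from a loader such as pd.read_csv, using whatever path and separator your copy of the file needs.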


As an introduction, we will go over a few functions that are very useful for getting an overview of the state of our dataset.

Getting info

Info

We start with the info() function, which reports the number and names of the columns, the number of non-null elements, and the data type of each column.

df.info()

Wines Dataset: 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         1599 non-null   float64
 1   volatile acidity      1599 non-null   float64
 2   citric acid           1599 non-null   float64
 3   residual sugar        1599 non-null   float64
 4   chlorides             1599 non-null   float64
 5   free sulfur dioxide   1599 non-null   float64
 6   total sulfur dioxide  1599 non-null   float64
 7   density               1599 non-null   float64
 8   pH                    1599 non-null   float64
 9   sulphates             1599 non-null   float64
 10  alcohol               1599 non-null   float64
 11  quality               1599 non-null   int64  
dtypes: float64(11), int64(1)
memory usage: 150.0 KB
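To see what the non-null count tells us, here is a small sketch on a hypothetical three-row frame with one missing value (df_small and its values are invented for illustration). Since info() prints its report and returns None, the sketch passes a text buffer through the buf parameter to capture the output.

```python
import io

import pandas as pd

# Hypothetical frame with one missing value in "alcohol".
df_small = pd.DataFrame({"alcohol": [9.4, None, 10.2], "quality": [5, 6, 7]})

# info() writes to stdout by default; buf redirects the report.
buffer = io.StringIO()
df_small.info(buf=buffer)
report = buffer.getvalue()

# "alcohol" is reported with 2 non-null values, "quality" with 3.
print(report)
```

A non-null count lower than the number of entries is a quick hint that a column contains missing data.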

Dtypes

We continue with the dtypes attribute, which shows only the data type of each column.

df.dtypes

Wines Dataset: 

fixed acidity           float64
volatile acidity        float64
citric acid             float64
residual sugar          float64
chlorides               float64
free sulfur dioxide     float64
total sulfur dioxide    float64
density                 float64
pH                      float64
sulphates               float64
alcohol                 float64
quality                   int64
dtype: object
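Because dtypes is itself a Series indexed by column name, it can be queried like any other Series. The sketch below uses a hypothetical two-column frame (df_small is not part of the article's dataset) to read the type of one column and to list the float columns.

```python
import pandas as pd

# Hypothetical two-column frame used only for this illustration.
df_small = pd.DataFrame({"alcohol": [9.4, 9.8, 10.2], "quality": [5, 6, 7]})

types = df_small.dtypes

# Look up the data type of a single column by its name.
print(types["alcohol"])

# Keep only the columns whose data type is float64.
float_cols = types[types == "float64"].index.tolist()
print(float_cols)
```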

Describe

The describe() function computes summary statistics (count, mean, standard deviation, minimum, quartiles and maximum) for each numeric column, which helps us understand the distribution of our dataset.

df.describe()


Wines Dataset: 

statistic	fixed acidity	volatile acidity	citric acid	residual sugar	chlorides	free sulfur dioxide	total sulfur dioxide	density	pH	sulphates	alcohol	quality
count	1599.000000	1599.000000	1599.000000	1599.000000	1599.000000	1599.000000	1599.000000	1599.000000	1599.000000	1599.000000	1599.000000	1599.000000
mean	8.319637	0.527821	0.270976	2.538806	0.087467	15.874922	46.467792	0.996747	3.311113	0.658149	10.422983	5.636023
std	1.741096	0.179060	0.194801	1.409928	0.047065	10.460157	32.895324	0.001887	0.154386	0.169507	1.065668	0.807569
min	4.600000	0.120000	0.000000	0.900000	0.012000	1.000000	6.000000	0.990070	2.740000	0.330000	8.400000	3.000000
25%	7.100000	0.390000	0.090000	1.900000	0.070000	7.000000	22.000000	0.995600	3.210000	0.550000	9.500000	5.000000
50%	7.900000	0.520000	0.260000	2.200000	0.079000	14.000000	38.000000	0.996750	3.310000	0.620000	10.200000	6.000000
75%	9.200000	0.640000	0.420000	2.600000	0.090000	21.000000	62.000000	0.997835	3.400000	0.730000	11.100000	6.000000
max	15.900000	1.580000	1.000000	15.500000	0.611000	72.000000	289.000000	1.003690	4.010000	2.000000	14.900000	8.000000
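The result of describe() is a regular DataFrame, so individual statistics can be read back from it. Here is a minimal sketch on five invented quality scores (the frame named scores is hypothetical and much smaller than the real dataset), small enough to check the numbers by hand.

```python
import pandas as pd

# Five made-up quality scores, small enough to verify manually.
scores = pd.DataFrame({"quality": [3, 5, 5, 6, 8]})

summary = scores.describe()
print(summary)

# Each statistic is a row of the result, addressed by its label.
print(summary["quality"]["mean"])  # 5.4
print(summary["quality"]["50%"])   # 5.0 (the median)
print(summary["quality"]["max"])   # 8.0
```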

Indexing and Selection

Here we take a closer look at the two main Pandas indexers for indexing and selection: `iloc` and `loc`.


+ .loc is primarily label based, but may also be used with a boolean array. .loc will raise KeyError when the items are not found. Allowed inputs are:

    – A single label, e.g. 5 or ‘a’ (Note that 5 is interpreted as a label of the index. This use is not an integer position along the index.).

    – A list or array of labels [‘a’, ‘b’, ‘c’].

    – A slice object with labels ‘a’:‘f’ (Note that, contrary to usual Python slices, both the start and the stop are included when present in the index!).

    – A boolean array (any NA values will be treated as False).

    – A callable function with one argument (the calling Series or DataFrame) and that returns valid output for indexing (one of the above).

+ .iloc is primarily integer position based (from 0 to length-1 of the axis), but may also be used with a boolean array. .iloc will raise IndexError if a requested indexer is out-of-bounds, except slice indexers, which allow out-of-bounds indexing (this conforms with Python/NumPy slice semantics). Allowed inputs are:

  – An integer e.g. 5.

  – A list or array of integers [4, 3, 0].

  – A slice object with ints 1:7.

  – A boolean array (any NA values will be treated as False).

  – A callable function with one argument (the calling Series or DataFrame) and that returns valid output for indexing (one of the above).
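The difference between labels and positions is easiest to see when the index is not the default 0, 1, 2, ... The following sketch uses a hypothetical frame (the name frame and its values are assumptions for illustration) whose integer labels do not match the row positions, and walks through the behaviours listed above.

```python
import pandas as pd

# Hypothetical frame: the labels 10, 20, 30 differ from positions 0, 1, 2.
frame = pd.DataFrame({"quality": [5, 6, 7]}, index=[10, 20, 30])

print(frame.loc[10, "quality"])  # label 10 -> 5
print(frame.iloc[0, 0])          # position 0 -> 5

# .loc raises KeyError when the label is not in the index.
try:
    frame.loc[0]
except KeyError:
    print("label 0 is not in the index")

# .iloc raises IndexError for an out-of-bounds position...
try:
    frame.iloc[5]
except IndexError:
    print("position 5 is out of bounds")

# ...but an out-of-bounds slice is simply cut at the end of the axis.
tail = frame.iloc[1:100]
print(len(tail))  # 2

# Both indexers accept a callable that receives the frame
# and returns a valid indexer, here a boolean Series.
above_five = frame.loc[lambda d: d["quality"] > 5]
print(above_five)
```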

There is no better way to understand how these indexers work than through examples, so here are several that show the different ways to use them.


`iloc` examples

+ Get First Row

df.iloc[0]

+ Get first column

df.iloc[:, 0]

+ Get first column of the first row (returned as a one-element Series; use df.iloc[0, 0] to get the scalar value)

df.iloc[0:1, 0]

+ Get rows at positions 3 and 4 (the stop position 5 is excluded)

df.iloc[3:5]

+ Get rows 3, 7, 10

df.iloc[[3, 7, 10]]

+ Get last five rows

df.iloc[-5:]
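To make the return types of these calls concrete, here is a short sketch on a hypothetical six-row frame (the name toy and its values are invented for illustration, not taken from the wine dataset).

```python
import pandas as pd

# Hypothetical six-row frame with a default 0..5 index.
toy = pd.DataFrame(
    {
        "alcohol": [9.4, 9.8, 10.2, 11.0, 9.5, 10.5],
        "quality": [5, 5, 6, 7, 5, 6],
    }
)

first_row = toy.iloc[0]      # a Series holding the first row
first_col = toy.iloc[:, 0]   # a Series holding the first column
scalar = toy.iloc[0, 0]      # a single value: 9.4
one_elem = toy.iloc[0:1, 0]  # a one-element Series, not a scalar
rows_3_4 = toy.iloc[3:5]     # positions 3 and 4; position 5 is excluded
picked = toy.iloc[[0, 2, 5]] # an explicit list of positions
last_two = toy.iloc[-2:]     # negative positions count from the end

print(rows_3_4)
print(last_two)
```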

`loc` examples

+ Get the value of column ‘quality’ in the row labelled 0

df.loc[0, 'quality']

+ Get all rows from columns ‘quality’, ‘sulphates’, ‘alcohol’

df.loc[:, ['quality', 'sulphates', 'alcohol']]

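+ Get from the row labelled ‘litres’ onwards, columns ‘quality’ to ‘alcohol’ (df1 stands for a DataFrame indexed by string labels; label slices follow the order of the index and of the columns, so ‘quality’ must come before ‘alcohol’ in df1)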

df1.loc['litres':, 'quality':'alcohol']

+ Get rows labelled 3 to 5 (unlike iloc, both endpoints are included)

df.loc[3:5]
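The inclusive endpoints are the detail that most often surprises people, so here is a small sketch comparing both indexers on the same slice. The frame toy, its values and the string labels a to f are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical six-row frame with a default 0..5 index.
toy = pd.DataFrame(
    {
        "sulphates": [0.56, 0.68, 0.65, 0.58, 0.56, 0.75],
        "alcohol": [9.4, 9.8, 10.2, 11.0, 9.5, 10.5],
        "quality": [5, 5, 6, 7, 5, 6],
    }
)

by_label = toy.loc[3:5]      # labels 3, 4 and 5: three rows
by_position = toy.iloc[3:5]  # positions 3 and 4: two rows
print(len(by_label), len(by_position))  # 3 2

# Column label slices are inclusive as well and follow column order.
cols = toy.loc[:, "sulphates":"alcohol"]
print(list(cols.columns))  # ['sulphates', 'alcohol']

# The same frame with string labels instead of integers.
named = toy.copy()
named.index = list("abcdef")

# From the row labelled 'd' onwards, columns 'alcohol' to 'quality'.
subset = named.loc["d":, "alcohol":"quality"]
print(subset)
```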

Filtering

One of the things that helps us the most when working with data is being able to filter it according to certain conditions. For this, the `loc` indexer accepts these conditions as boolean arrays, in the following way (in these examples the DataFrame is named wines):

+ Get all wines whose quality is greater than 6

wines.loc[wines.quality > 6]

+ Get all wines whose quality is greater than 5 and less than 8

wines.loc[(wines.quality > 5) & (wines.quality < 8)]

+ Get all wines whose quality is equal to 5 or equal to 7

wines.loc[(wines.quality == 5) | (wines.quality == 7)]
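What makes these filters work is that a comparison on a column produces a boolean Series, which `loc` then uses as a mask. The sketch below runs the three filters on a hypothetical five-row frame (the values are invented for illustration) so the selected rows can be checked by eye.

```python
import pandas as pd

# Hypothetical five-row stand-in for the wines DataFrame.
wines = pd.DataFrame(
    {
        "alcohol": [9.0, 9.4, 10.2, 11.0, 12.5],
        "quality": [3, 5, 6, 7, 8],
    }
)

# The condition alone is a boolean Series with one entry per row.
mask = wines.quality > 6
print(mask.tolist())  # [False, False, False, True, True]

high = wines.loc[mask]

# Each condition needs its own parentheses because & and | bind
# more tightly than the comparison operators. Python's "and"/"or"
# do not work element-wise on a Series.
middle = wines.loc[(wines.quality > 5) & (wines.quality < 8)]
either = wines.loc[(wines.quality == 5) | (wines.quality == 7)]

print(high)
print(middle)
print(either)
```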

Training your abilities

If you want to take your Data Science skills further, we have created a course that you can download for free here.

In the next chapter, we will take a deep dive into the functions we use to iterate, map, group and sort.
