Getting Started with Pandas – Lesson 3

Introduction

This is the third post in our data science training series with Pandas. In this article we summarize the Pandas functions used for iteration, mapping, grouping and sorting. These functions let us transform the data to extract useful information and insights.

Iteration, Maps, Grouping and Sorting

The ‘Wine Quality Dataset’, compiled by Cortez et al. in 2009 and available from the UCI Machine Learning Repository, is a well-known dataset of wine quality information. It includes the physicochemical properties of red and white wines together with a quality score.

Before we start, let's use the pandas head function to look at the first rows of the dataset that we will use throughout the examples.
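
As a minimal sketch, the snippet below builds a tiny `wines` DataFrame from the two rows that appear in the outputs later in this article and calls head on it. The file name in the commented-out `read_csv` line is an assumption about where you saved the CSV; the UCI file uses `;` as its separator.

```python
import pandas as pd

# Two rows copied from the outputs printed later in this article.
# With the full CSV you would load it instead, for example:
# wines = pd.read_csv("winequality-red.csv", sep=";")  # assumed local file name
wines = pd.DataFrame(
    {
        "fixed acidity": [7.4, 7.8],
        "volatile acidity": [0.70, 0.88],
        "citric acid": [0.0, 0.0],
        "residual sugar": [1.9, 2.6],
        "chlorides": [0.076, 0.098],
        "free sulfur dioxide": [11.0, 25.0],
        "total sulfur dioxide": [34.0, 67.0],
        "density": [0.9978, 0.9968],
        "pH": [3.51, 3.20],
        "sulphates": [0.56, 0.68],
        "alcohol": [9.4, 9.8],
        "quality": [5, 5],
    }
)

# head(n) returns the first n rows (5 when n is omitted)
first_rows = wines.head(1)
print(first_rows)
```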

Iteration

We start with the functions for iterating through a dataset. They are useful when we need to process the data element by element, for example row by row.

The behavior of basic iteration over Pandas objects depends on the type. When iterating over a Series, it is regarded as array-like, and basic iteration produces the values. A DataFrame follows the dict-like convention of iterating over the keys of the object, which are its column labels.

If we iterate over a DataFrame, we get the column names.

for element in wines:
    print(element)


fixed acidity           
volatile acidity       
citric acid              
residual sugar       
chlorides               
free sulfur dioxide     
total sulfur dioxide    
density    
pH           
sulphates
alcohol
quality
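
To make the contrast concrete, here is a small sketch that iterates over a Series and over a one-column DataFrame built from it. The variable names are illustrative; the two alcohol values are taken from the rows shown in this article.

```python
import pandas as pd

alcohol = pd.Series([9.4, 9.8], name="alcohol")

# Iterating over a Series yields its values
values = [value for value in alcohol]

# Iterating over a DataFrame yields its column labels
frame = alcohol.to_frame()
keys = [key for key in frame]

print(values)
print(keys)
```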

To iterate over the columns or rows of a DataFrame, we can use the following functions:

Items

Consistent with the dict-like interface, items() iterates through key-value pairs (the older alias iteritems() was deprecated in pandas 1.5 and removed in pandas 2.0):

+ Series: (index, scalar value) pairs

+ DataFrame: (column, Series) pairs

for key, value in wines.items():
    print(key)
    print(value)

fixed acidity
0        7.4
1        7.8
2        7.8
3       11.2
4        7.4
        ... 
1594     6.2
1595     5.9
1596     6.3
1597     5.9
1598     6.0
Name: fixed acidity, Length: 1599, dtype: float64
volatile acidity
0       0.700
1       0.880
2       0.760
3       0.280
4       0.700
        ...
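
The example above covers the DataFrame case. For the Series case, a short sketch (using two density values from this article) shows the (index, scalar value) pairs:

```python
import pandas as pd

density = pd.Series([0.9978, 0.9968], name="density")

# On a Series, items() yields (index, scalar value) pairs
pairs = list(density.items())
for index, value in pairs:
    print(index, value)
```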

Iterrows

iterrows() lets you iterate through the rows of a DataFrame as Series objects. It returns an iterator yielding each index value along with a Series containing the data in that row:

for row_index, row in wines.iterrows():
    print(row_index, row, sep="\n")

0
fixed acidity            7.4000
volatile acidity         0.7000
citric acid              0.0000
residual sugar           1.9000
chlorides                0.0760
free sulfur dioxide     11.0000
total sulfur dioxide    34.0000
density                  0.9978
pH                       3.5100
sulphates                0.5600
alcohol                  9.4000
quality                  5.0000
Name: 0, dtype: float64
1
fixed acidity            7.8000
volatile acidity         0.8800
citric acid              0.0000
residual sugar           2.6000
chlorides                0.0980
free sulfur dioxide     25.0000
total sulfur dioxide    67.0000
density                  0.9968
pH                       3.2000
sulphates                0.6800
alcohol                  9.8000
quality                  5.0000
Name: 1, dtype: float64
2
fixed acidity            7.800
volatile acidity         0.760
citric acid              0.040
residual sugar           2.300
...
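
Note that `quality` is printed as 5.0000 above. Because iterrows() returns each row as a single Series, the values of a row are converted to one common dtype. The sketch below, on a two-column sample with illustrative variable names, shows an integer column turning into floats inside the row:

```python
import pandas as pd

sample = pd.DataFrame({"alcohol": [9.4, 9.8], "quality": [5, 5]})

# The column itself holds integers
column_kind = sample["quality"].dtype.kind

# The row Series mixes a float and an integer column, so it is upcast to float
row_index, first_row = next(sample.iterrows())
row_kind = first_row.dtype.kind

print(column_kind, row_kind)
print(first_row["quality"])
```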

Itertuples

The itertuples() method returns an iterator yielding a namedtuple for each row in the DataFrame. The first element of the tuple is the row’s index value, while the remaining elements are the row values. Column names that are not valid Python identifiers, such as `fixed acidity`, are replaced with positional names like `_1`.

for row in wines.itertuples():
    print(row)    

Pandas(Index=0, _1=7.4, _2=0.7, _3=0.0, _4=1.9, chlorides=0.076, _6=11.0, _7=34.0, density=0.9978, pH=3.51, sulphates=0.56, alcohol=9.4, quality=5)
Pandas(Index=1, _1=7.8, _2=0.88, _3=0.0, _4=2.6, chlorides=0.098, _6=25.0, _7=67.0, density=0.9968, pH=3.2, sulphates=0.68, alcohol=9.8, quality=5)
...
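
As a sketch of how the fields of each namedtuple can be read, the snippet below uses a three-column sample with values from this article. The `index=False, name=None` call is an optional variant that yields plain tuples without the index.

```python
import pandas as pd

sample = pd.DataFrame(
    {"fixed acidity": [7.4, 7.8], "pH": [3.51, 3.2], "quality": [5, 5]}
)

rows = list(sample.itertuples())
first = rows[0]

# Valid identifiers are available as attributes
print(first.Index, first.pH, first.quality)

# "fixed acidity" contains a space, so it is exposed by position as _1
print(first._1)

# Plain tuples without the index
plain = list(sample.itertuples(index=False, name=None))
print(plain)
```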

Conclusion

Pandas provides three functions that make iterating over a dataset easier:

items(): iterates over the DataFrame column by column, yielding each column name together with its Series. It is useful when we want to process one whole column at a time. In older code it appears as iteritems(), which was removed in pandas 2.0.

iterrows(): iterates over the DataFrame row by row, yielding each index value together with the row as a Series. It is useful when we need to inspect complete rows, for example to look for a row that holds a specific value.

itertuples(): iterates over the DataFrame row by row, yielding each row as a namedtuple. It is useful when we need complete rows in tuple form.

Maps

We continue with the two most important functions for mapping a Series or a DataFrame.

Map

The Pandas map() function maps each value of a Series to another value using a dictionary, a function or another Series. It is a convenient way to transform the values of a Series from one domain to another, for example to apply an operation to every row of a given column in a dataset.

For example, we can transform the Series obtained from the `density` column by applying a function that multiplies each of its values by 100.

wines['density'].map(lambda x: x * 100)

0       99.780
1       99.680
2       99.700
3       99.800
4       99.780
         ...  
1594    99.490
1595    99.512
1596    99.574
1597    99.547
1598    99.549
Name: density, Length: 1599, dtype: float64
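
map() also accepts a dictionary. In the sketch below the label names are hypothetical and only serve to illustrate the lookup; values that are missing from the dictionary become NaN.

```python
import pandas as pd

quality = pd.Series([5, 7, 3, 8], name="quality")

# Hypothetical labels, chosen only for this illustration
labels = {3: "low", 5: "medium", 7: "high"}

# Each value is looked up in the dictionary; 8 has no entry, so it maps to NaN
mapped = quality.map(labels)
print(mapped)
```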

Apply

Arbitrary functions can be applied along the axes of a DataFrame using the apply() method, which, like the descriptive statistics methods, takes an optional axis argument.

For example, we can divide each value of the `density` column by 100 (the inverse of the operation above) without having to extract the Series from the DataFrame, since apply() works directly on DataFrames.

def divide_by_100(row):
    # With axis='columns' the function receives one row (a Series) at a time
    row['density'] = row['density'] / 100
    return row

wines.apply(divide_by_100, axis='columns')
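
To show what the axis argument changes, here is a minimal sketch on a two-column sample (values taken from this article, variable names illustrative). With the default axis the function receives each column; with `axis="columns"` it receives each row.

```python
import pandas as pd

sample = pd.DataFrame({"density": [0.9978, 0.9968], "alcohol": [9.4, 9.8]})

# Default axis (0 or "index"): the function is called once per column
col_max = sample.apply(lambda col: col.max())

# axis="columns" (or 1): the function is called once per row
ratio = sample.apply(lambda row: row["alcohol"] / row["density"], axis="columns")

print(col_max)
print(ratio)
```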

Grouping

Grouping splits the data into groups according to a mapping of labels to group names. To create a GroupBy object grouped by `quality` and count the rows in each group, you can do the following:

wines.groupby(["quality"]).quality.count()

quality
3     10
4     53
5    681
6    638
7    199
8     18
Name: quality, dtype: int64
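
The GroupBy object can also be kept in a variable and inspected before any computation. The sketch below uses a small made-up sample, not the real dataset, so the counts differ from the output above.

```python
import pandas as pd

# Illustrative values only, not taken from the real dataset
sample = pd.DataFrame(
    {"quality": [5, 5, 6, 7, 6], "alcohol": [9.4, 9.8, 10.5, 11.2, 10.0]}
)

grouped = sample.groupby("quality")

# Number of groups and the number of rows in each one
n_groups = grouped.ngroups
counts = grouped["alcohol"].count()

print(n_groups)
print(counts)
```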

You can also apply a custom function to the GroupBy object. In this case we group by `quality` and `alcohol` and obtain, for each group, the row with the highest density:

wines.groupby(['quality', 'alcohol']).apply(lambda df: df.loc[df.density.idxmax()])
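
A possible alternative, sketched here on a small made-up sample grouped by `quality` only, is to ask each group for the index label of its densest row with idxmax() and then select those rows with loc. This is not the approach used above, just an equivalent way to reach the same rows without apply().

```python
import pandas as pd

# Illustrative values only, not taken from the real dataset
sample = pd.DataFrame(
    {
        "quality": [5, 5, 6, 6],
        "alcohol": [9.4, 9.8, 10.5, 10.0],
        "density": [0.9978, 0.9968, 0.9970, 0.9980],
    }
)

# Index label of the row with the highest density inside each quality group
densest_labels = sample.groupby("quality")["density"].idxmax()

# Select those rows from the original DataFrame
densest_rows = sample.loc[densest_labels]
print(densest_rows)
```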

Finally, within the grouping section, one of the most useful functions in data analysis is the aggregation function.

In this case we are going to group by `quality` and we are going to obtain the maximum and minimum value of `alcohol` for each group.

wines.groupby(['quality']).alcohol.agg(['min', 'max'])
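
If you prefer to choose the names of the result columns, pandas also supports named aggregation. The sketch below uses a small made-up sample, and the output column names `min_alcohol` and `max_alcohol` are our own choice.

```python
import pandas as pd

# Illustrative values only, not taken from the real dataset
sample = pd.DataFrame(
    {"quality": [5, 5, 6, 6], "alcohol": [9.4, 9.8, 10.5, 11.2]}
)

# Each keyword names an output column: (input column, aggregation)
summary = sample.groupby("quality").agg(
    min_alcohol=("alcohol", "min"),
    max_alcohol=("alcohol", "max"),
)
print(summary)
```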

Sorting

In this case we are going to use a different dataset to clearly explain the sorting functionality within Pandas. It is a small example dataset that we will call `unsorted_df`.
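
A minimal sketch of such a dataset follows. The values, the index labels and the columns `three` and `one` are hypothetical; only the column name `two` is taken from the examples in this section.

```python
import pandas as pd

# Hypothetical data: rows and columns are deliberately out of order
unsorted_df = pd.DataFrame(
    {
        "three": [3.0, 1.0, 2.0, 4.0],
        "two": [0.5, 2.5, 1.5, 3.5],
        "one": [10, 30, 20, 40],
    },
    index=["a", "d", "c", "b"],
)
print(unsorted_df)
```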

+ Sort by index

unsorted_df.sort_index()

+ Sort by index in descending order

unsorted_df.sort_index(ascending=False)

+ Sort by column labels

unsorted_df.sort_index(axis=1)

+ Sort by values

unsorted_df.sort_values(by="two")
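
The four calls above can be tried together on a small frame. In this sketch the data is hypothetical (only the column name `two` comes from the examples), and the comments state the order each call produces for that data. Each method returns a new DataFrame and leaves `unsorted_df` unchanged.

```python
import pandas as pd

# Hypothetical data: rows and columns are deliberately out of order
unsorted_df = pd.DataFrame(
    {
        "three": [3.0, 1.0, 2.0, 4.0],
        "two": [0.5, 2.5, 1.5, 3.5],
        "one": [10, 30, 20, 40],
    },
    index=["a", "d", "c", "b"],
)

by_index = unsorted_df.sort_index()                      # rows: a, b, c, d
by_index_desc = unsorted_df.sort_index(ascending=False)  # rows: d, c, b, a
by_columns = unsorted_df.sort_index(axis=1)              # columns: one, three, two
by_two = unsorted_df.sort_values(by="two")               # rows: a, c, d, b

print(by_index, by_index_desc, by_columns, by_two, sep="\n\n")
```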

Training your abilities

If you want to take your Data Science skills further, we have created a course that you can download for free here.

In the next chapter, we will take a deep dive into the functions used to handle missing data.
