Getting Started with NumPy – Lesson 1

Introduction

NumPy is a third-party Python library for numerical computing, optimized for working with one-dimensional and multi-dimensional arrays. Its core type is the n-dimensional array, called ndarray. The library also contains many routines for statistical analysis.

Creating, Getting Info, Selecting and Util Functions

The Wine Quality dataset, published in 2009 by Cortez et al. and available from the UCI Machine Learning Repository, is a well-known dataset that contains wine quality information. It includes data about the physicochemical properties of red and white wines, together with a quality score.

Before we start, let's look at the first rows of a small sample of the dataset. In the examples that follow, wines_df is a 2-dimensional NumPy array holding this data.

[Image: first rows of the example dataset]
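The article does not show how wines_df is loaded. The following is a minimal sketch of one possible approach, assuming the data is stored as semicolon-separated text with a header row. The in-memory excerpt, its column selection, and the name `sample` are illustrative assumptions, not the original loading code.

```python
import io

import numpy as np

# Hypothetical three-row excerpt in a semicolon-separated layout.
# Only four columns are kept to make the example short.
csv_text = """fixed acidity;volatile acidity;citric acid;quality
7.4;0.7;0.0;5
7.8;0.88;0.0;5
7.8;0.76;0.04;5
"""

# skip_header=1 drops the header row so every remaining value parses as a float.
# For a real file, a path string could be passed instead of the StringIO object.
sample = np.genfromtxt(io.StringIO(csv_text), delimiter=";", skip_header=1)

print(sample.shape)  # (3, 4)
print(sample.dtype)  # float64
```

Because every column is numeric, the result is a plain float64 array, which matches the dtype reported for wines_df later in this lesson.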

Creating

In NumPy you can create arrays in several ways. We will look at the most common ones and at those that are most useful for data processing.

One-dimensional array from a list:

import numpy as np
values = [1, 2, 3]  # named "values" so the built-in "list" is not shadowed
uni_numpy_array = np.array(values)

array([1, 2, 3])

Multidimensional array from a list of lists:

nested_values = [[1, 2, 3], [4, 5, 6]]  # each inner list becomes one row
multi_numpy_array = np.array(nested_values)

array([[1, 2, 3],
       [4, 5, 6]])

Multidimensional array where all values are zeros:

zeros_array = np.zeros((3, 4))

array([[0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.]])

Multidimensional array filled with random values (uniformly distributed in [0, 1)):

random_array = np.random.rand(3, 4)

array([[0.98195491, 0.34964712, 0.13426036, 0.55065786],
       [0.4180283 , 0.36018953, 0.44374156, 0.4366695 ],
       [0.69893273, 0.01089244, 0.4297768 , 0.6985924 ]])
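As a complement to the creation examples above, here is a short sketch of a few other constructors that tend to be handy. The seed value and the variable names are arbitrary choices for illustration.

```python
import numpy as np

# A seeded generator makes the "random" values repeatable between runs.
rng = np.random.default_rng(seed=42)
random_array = rng.random((3, 4))  # uniform floats in [0, 1), shape (3, 4)

ones_array = np.ones((2, 3))       # 2x3 array filled with 1.0
range_array = np.arange(0, 10, 2)  # evenly spaced integers: 0, 2, 4, 6, 8

print(random_array.shape)  # (3, 4)
print(ones_array.sum())    # 6.0
print(range_array)         # [0 2 4 6 8]
```

Note that `rng.random` takes the shape as a single tuple, while `np.random.rand` takes each dimension as a separate argument.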

Getting Info

There are several attributes and functions that help us extract information from the data. We will go through them one by one, with examples of how they work and why they are useful.

Get array dimensions:

For this we use the `shape` attribute (it is an attribute, not a function, so it takes no parentheses). It returns a tuple with the size of each dimension, which for a 2-dimensional array is (rows, columns).

wines_df.shape

(1599, 12)
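A small sketch of how `shape` relates to two other attributes, `ndim` and `size`. The array `sample` is a hypothetical stand-in with 5 rows and 12 columns, not the real wine data.

```python
import numpy as np

# Hypothetical stand-in for the wine array: 5 rows, 12 columns.
sample = np.zeros((5, 12))

# shape is a tuple, so it can be unpacked directly.
rows, columns = sample.shape

print(rows, columns)  # 5 12
print(sample.ndim)    # 2  (number of dimensions)
print(sample.size)    # 60 (total number of elements, rows * columns)
```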

Get data type:

NumPy has several data types, most of which map to Python data types such as float and str. The most important ones are:

1. float – numeric floating point data.

2. int – integer data.

3. string – character data.

4. object – Python objects.

In this case we will use the `dtype` attribute that returns the data type of the array.

wines_df.dtype

dtype('float64')
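To illustrate how `dtype` behaves, the sketch below inspects and converts the type of a small array. The values and variable names are made up for the example.

```python
import numpy as np

# Hypothetical quality scores stored as floats.
quality = np.array([5.0, 5.0, 6.0])
print(quality.dtype)      # float64

# astype returns a new array with the requested type; the original is unchanged.
quality_int = quality.astype(np.int64)
print(quality_int.dtype)  # int64
print(quality_int)        # [5 5 6]

# An array holds a single dtype, so mixing ints and floats promotes to float.
mixed = np.array([1, 2.5])
print(mixed.dtype)        # float64
```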

Selecting

Use the syntax array[i, j] to retrieve the element at row index i and column index j of an array. Indices start at 0.

To retrieve multiple elements, use the syntax array[(row_values), (column_values)], where row_values and column_values are tuples of the same size. The result contains the elements at the paired (row, column) positions.
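The two forms of indexing described above can be sketched on a small array. The array `grid` and its values are invented for this illustration.

```python
import numpy as np

grid = np.array([[10, 11, 12],
                 [20, 21, 22],
                 [30, 31, 32]])

# Single element: row index 1, column index 2.
single = grid[1, 2]
print(single)  # 22

# Multiple elements: rows (0, 2) are paired with columns (1, 0),
# so this picks grid[0, 1] and grid[2, 0].
picked = grid[(0, 2), (1, 0)]
print(picked)  # [11 30]
```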

Now we are going to show different examples of how to select elements within an array:

Get first row:

first_row = wines_df[:1]

array([[ 7.4   ,  0.7   ,  0.    ,  1.9   ,  0.076 , 11.    , 34.    ,
         0.9978,  3.51  ,  0.56  ,  9.4   ,  5.    ]])

Select the second element from the third row:

second_third = wines_df[2, 1:2]

array([0.76])

Select the first three items from the fourth column:

first_three_items = wines_df[:3, 3]

array([1.9, 2.6, 2.3])

Select the entire fourth column:

fourth_column = wines_df[:, 3]

array([1.9, 2.6, 2.3, ..., 2.3, 2. , 3.6])

Util Functions

NumPy offers a very large number of mathematical functions, so we will summarize, through several examples, the ones that data scientists are most likely to use.

Sum up the whole 12th column (index 11):

twelfth_column_sum = wines_df[:, 11].sum()

9012.0

Sum up all the columns:

all_columns_sum = wines_df.sum(axis=0)

array([13303.1    ,   843.985  ,   433.29   ,  4059.55   ,   139.859  ,
       25384.     , 74302.     ,  1593.79794,  5294.47   ,  1052.38   ,
       16666.35   ,  9012.     ])
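The effect of the `axis` argument is easier to see on a tiny array. This sketch uses made-up values.

```python
import numpy as np

data = np.array([[1.0, 2.0, 3.0],
                 [4.0, 5.0, 6.0]])

column_sums = data.sum(axis=0)  # collapse the rows: one sum per column
row_sums = data.sum(axis=1)     # collapse the columns: one sum per row
total = data.sum()              # no axis: sum of every element

print(column_sums)  # [5. 7. 9.]
print(row_sums)     # [ 6. 15.]
print(total)        # 21.0
```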

Mean of the first row:

first_row_mean = wines_df[:1].mean()

6.211983333333333

Return a boolean array in which each position is True if the value in the 12th column (index 11) is greater than 5, and False otherwise:

bool_array = wines_df[:,11] > 5

array([False, False, False, ...,  True, False,  True])
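A boolean array like this is typically used as a mask to filter rows. The following sketch shows the idea on a hypothetical three-column sample whose last column plays the role of the quality score.

```python
import numpy as np

# Hypothetical sample: the last column stands in for the quality score.
sample = np.array([[7.4, 0.70, 5.0],
                   [7.8, 0.88, 5.0],
                   [11.2, 0.28, 6.0]])

mask = sample[:, 2] > 5      # one boolean per row
high_quality = sample[mask]  # keeps only the rows where the mask is True
count = mask.sum()           # True counts as 1, so this counts matching rows

print(mask)                # [False False  True]
print(high_quality.shape)  # (1, 3)
print(count)               # 1
```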

Get the transpose of the wines matrix:

transposed = np.transpose(wines_df)
transposed.shape

(12, 1599)
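For reference, the `.T` attribute is a shorthand that gives the same result as `np.transpose` on a 2-dimensional array. The small matrix below is an illustrative example.

```python
import numpy as np

matrix = np.arange(6).reshape(2, 3)  # [[0 1 2], [3 4 5]]

transposed = matrix.T  # shorthand for np.transpose(matrix)

print(transposed.shape)                                   # (3, 2)
print(np.array_equal(transposed, np.transpose(matrix)))   # True
# Element (i, j) of the transpose is element (j, i) of the original.
print(transposed[0, 1] == matrix[1, 0])                   # True
```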

Get the flattened array of wines:

flattened = wines_df.ravel()
flattened.shape

(19188,)
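One detail worth knowing is that `ravel` returns a view of the original data when it can, while the related `flatten` method always returns a copy. A minimal sketch with an invented array:

```python
import numpy as np

matrix = np.arange(6).reshape(2, 3)

raveled = matrix.ravel()      # a view here, because the data is contiguous
flattened = matrix.flatten()  # always an independent copy

print(np.shares_memory(matrix, raveled))    # True
print(np.shares_memory(matrix, flattened))  # False

# Writing through the view changes the original array as well.
raveled[0] = 99
print(matrix[0, 0])  # 99
print(flattened[0])  # 0 (the copy is unaffected)
```

If the flattened result is going to be modified and the original must stay intact, `flatten` (or an explicit `.copy()`) is the safer choice.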

Turn the second row of wines (index 1), which has 12 values, into a 2-dimensional array with 3 rows and 4 columns:

wines_df[1:2].reshape((3,4))

array([[ 7.8   ,  0.88  ,  0.    ,  2.6   ],
       [ 0.098 , 25.    , 67.    ,  0.9968],
       [ 3.2   ,  0.68  ,  9.8   ,  5.    ]])
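Reshaping only works when the total number of elements is preserved. The sketch below shows this with a made-up 12-element array, including the `-1` placeholder that lets NumPy infer one dimension.

```python
import numpy as np

row = np.arange(12.0)  # 12 elements: 0.0 to 11.0

as_3x4 = row.reshape((3, 4))    # 3 * 4 == 12, so this works
as_auto = row.reshape((-1, 6))  # -1 lets NumPy infer the row count (2)

print(as_3x4.shape)   # (3, 4)
print(as_auto.shape)  # (2, 6)

# 5 * 3 == 15 does not match the 12 available elements.
try:
    row.reshape((5, 3))
    reshape_failed = False
except ValueError as error:
    reshape_failed = True
    print("reshape failed:", error)
```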

Training your abilities

If you want to take your Data Science skills further, we have created a course that you can download for free here.
