Practical Introduction to Jupyter Notebooks— 1. Installation, Setup, and Data Handling
Jupyter is basically the home of any Data Scientist: we can do amazingly simple things in it, and also some rather complex things.
In this four-part guide we'll explore these possibilities using a practical approach. Our journey will be short but highly insightful:
- Installation, Setup, and Data Handling
- Large Data Manipulation
- Feature Extraction
- Sequential Modeling
The complete source code for all four parts of this series can be found on my GitHub page.
As you can see, our travels will cover a lot of ground, but we'll do it in a very short period of time. And as this is a practical introduction, every new area of exploration will come with an example project to help us get familiar with the landscape.
Our area of study today will be the installation and setup of Jupyter, and at the very end of our exploration we shall engage in the interesting and common task of building an Iris Flower Classification model. Let's get started!
Installation
Before we can start our exploration, we need to land on Jupyter, and we'll do this by installing it on our machine. Installing Jupyter can be done in a number of ways, but the easiest is using the Conda setup. You can head over to the website here, and while you marvel at the landscape, let me give you some insights about it.
As you can see, the landscape is daunting. This is because Conda comes in two sizes: the really big Anaconda and the really small Miniconda. The choice of which one to explore is yours and yours alone; all I can do is guide you.
Miniconda is less than 100 MB in size; it's a small package that contains only the minimum requirements to explore Jupyter and run Python.
Anaconda, on the other hand, is like a massive rattlesnake: it is closer to 1 GB in size and contains many (if not all) of the Data Science packages you'll ever need during your exploration.
Whichever you choose will be valid for our exploration, but since this is your first time on Jupyter, I suggest the smaller Miniconda, since it will give us more of a chance to manually set up our Python environment and learn about every little biome along the way. So go ahead and download whichever one you want.
Once you've made your choice, both Anaconda and Miniconda offer multiple download options for Mac and Linux users. Be sure to install whichever one you're used to.
Creating A New Notebook
Now that we've successfully arrived, let's start our exploration by creating a new notebook. To do this, create a new empty folder to house our notebooks and run the following command inside it:
jupyter notebook
This will open up your web browser and display the following page.
And voilà, we're finally on Jupyter! Let's start moving around and creating some new files.
To do this, simply click the New button to the top right and select Python 3 among the options.
This will open a new tab and create a new file called Untitled in your project folder. Let's call this file "Iris Flower Classification", as this is what we shall be using it for. To rename the file, click the name (Untitled) at the top of the notebook; an input will appear for you to insert the new name. Once you're done, click Rename.
Understanding the Cells
Because this is a new environment for you, it might be a bit daunting. But fear not, I’m here to show you around. Because this is a four part series we shall have a lot of opportunity to get familiar with the other areas of our environment but for now I will explain the only thing you need to know to understand what we’re doing.
CELLS: There is a box in the center of your screen with "In [ ]:" to its left; this is a Notebook Cell. Just like in Excel, cells in Jupyter contain individualized data. However, in Jupyter we only have a single column of cells, one coming after the other, not a grid. These cells will be our main means of organization, as we shall split different areas of our study into different cells.
Starting Our Exploration
With that out of the way, we can start our exploration. Think of notebooks in the same way as you think of blogs or publications: they allow us to write detailed analyses of our thought process, but also to run Python code in the same environment.
In this first cell, let's create a header for our notebook by converting the cell into a markdown cell. To do this, select the cell, go to the dropdown in the top bar where it reads "Code", click it, and select "Markdown".
Markdown in Jupyter is used to guide your readers through the computations you are doing. It allows users to follow along with your work and is ignored by the Python interpreter.
Now that our cell is marked as markdown, paste the following text into it.
# Iris Flower Classification
We shall start with a very simple Iris Flower Classification problem. This will allow us to learn how to import data into Jupyter and run and train Python models, without getting overly complex.
Markdown is a display language we use to format text in Jupyter. In particular, this markdown has a header, "Iris Flower Classification", created by prefixing the text with '#'.
In markdown you can create headers using the '#' symbol before your text. Double-click the cell to edit it, and try adding more '#'s before the header. As you can see, the more '#'s you add, the smaller the header becomes. In the image below, the number in each header represents the number of '#'s used.
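For quick reference, here is a sketch of the header levels you can try in the cell (the text is just placeholder):

```markdown
# Largest header (one '#')
## Second-level header
### Third-level header
#### Fourth-level header
```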
Now if you press Shift+Enter you will see that it changes the appearance of the cell and makes it look cleaner.
Shift+Enter is the shortcut in Jupyter to run a cell. Normally this means executing the Python code in the cell, but since ours is a markdown cell, it simply displays our markdown with the formatting we assigned to it.
Importing Our Libraries
Every explorer needs some tools, so let's get ourselves equipped with the libraries we shall need. As a matter of preference, I tend to import all my libraries in the very first cell of my notebooks, and I teach the same to my students.
To do this, add a new cell by pressing Shift+Enter, and in the cell just below our markdown add this code:
#import important libraries
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
Run this cell by pressing Shift+Enter. Here you might be greeted with an error saying a module was not found.
This is one of the main things we signed up for when we installed Miniconda instead of Anaconda. For those that installed Anaconda, this error won't appear, but please read through this section nonetheless, as you might be faced with this problem in the future as well.
So, when we try to use a library in our code and it isn't installed, we get this error. Fixing it is easy: head over to the folder where your notebook is located and open a command line at that location. Then install the missing library by running the following command:
pip install package_name
Here you need to replace package_name with the name of the package we are trying to install. You can install all the packages we need for this project in a single command:
pip install pandas matplotlib numpy scikit-learn
Now back in our notebook we can see that we can run the cell without getting any errors.
Importing Our Data
We need a landscape to explore. Start by adding the following markdown to explain to the reader what you are doing.
### Data Importation
We then import our data; this is simply a collection of information about our flowers, as shown.
Next we need to download some sample data to use from here. Download this data, give it a favourable name (I called mine "Iris Data.csv"), and save it in the same folder as our Jupyter notebook so we can access it.
Then let's add a new cell to import our data and print it out, using this code:
iris_df = pd.read_csv("Iris Data.csv")
print (iris_df)
This code reads our data into a pandas DataFrame and prints it. Run this cell to see the output.
The print command outputs a small section of the iris data. As you can see, the data contains sepal_length, sepal_width, petal_length, petal_width, and species. The first four fields will be our inputs, and species will be our output.
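Before moving on, it can help to peek at the shape and summary statistics of the frame. A minimal sketch, using a tiny hand-built stand-in with the same columns as our CSV (in the notebook you would simply call these methods on the iris_df you just loaded):

```python
import pandas as pd

# In the notebook: iris_df = pd.read_csv("Iris Data.csv")
# Here we build a tiny stand-in frame with the same columns.
iris_df = pd.DataFrame({
    "sepal_length": [5.1, 7.0, 6.3],
    "sepal_width":  [3.5, 3.2, 3.3],
    "petal_length": [1.4, 4.7, 6.0],
    "petal_width":  [0.2, 1.4, 2.5],
    "species": ["Iris-setosa", "Iris-versicolor", "Iris-virginica"],
})

print(iris_df.shape)       # (number of rows, number of columns)
print(iris_df.head())      # the first few rows, to eyeball the columns
print(iris_df.describe())  # summary statistics for the numeric columns
```

These three methods (shape, head, describe) are usually my first stop with any new dataset.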
Understanding Our Data
In general, I tell my students that if they don't understand their data, they shouldn't expect the AIs they build to understand it either. While this isn't entirely true, in more advanced cases we are often required to engineer data features to help our models form relationships, and we can't engineer those features if we don't understand the data ourselves.
Luckily, Jupyter makes this process very easy. Create a new cell and run the following code:
#Now, the inputs into our learning model will be the character of the flowers, these are: {sepal_length,
# sepal_width, petal_length and petal_width}
#Our outputs will be the species of the flower as follows
print("Target Labels", iris_df["species"].unique())
This code outputs all the unique values our species (dependent variable) can take on. We can see from running this cell that we have three species: 'Iris-setosa', 'Iris-versicolor', and 'Iris-virginica'.
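Besides listing the unique labels, it is worth checking how balanced the classes are; in the full Iris dataset each species appears 50 times. A small sketch using scikit-learn's built-in copy of the data so it is self-contained (note its label names omit the "Iris-" prefix; in the notebook you would just run iris_df["species"].value_counts()):

```python
import pandas as pd
from sklearn.datasets import load_iris

# Built-in copy of the Iris data, used here only for illustration
data = load_iris()
species = pd.Series(data.target_names[data.target], name="species")

# Count how many samples each species has
print(species.value_counts())
```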
Now let's try to work out how our model might learn to distinguish these species based on the features we have. To do this, create a new markdown cell and add this information to it.
Now, let's plot some graphs of our data to try and understand it better.
Then add a code cell and run this code.
# We shall start with a very simple 2D scatter plot of our data showing the sepal_length and sepal_width as well as the
# species of each plot
import plotly.express as px
fig = px.scatter(iris_df, x="sepal_width", y="sepal_length", color="species")
fig.show()
Without moving on too quickly, let's analyze the code. In particular, we are using a new library called Plotly, and as you can see we import it right before we need it.
While this isn’t the model that I prefer to teach, I wanted to give you an idea of what it looks like. My preference is that we import all libraries we shall use in our code in the first cell of the notebook. If you prefer to do things like this as well, move the import statement to the first code cell of the notebook and be sure to run the cell to make sure the library is imported. Then return to this cell and run it again.
The graph is a scatter plot of sepal_width against sepal_length, with the species of each sample shown by its color.
We can see from the above that the data clusters into two groups, and one group includes only Iris-setosa. It becomes clear that the model can learn to classify inputs with higher sepal_width and lower sepal_length as setosa.
However, for the other two species (versicolor and virginica) we might need a different plot to separate them. Let's try plotting a 3D graph that also includes petal_length to understand these two species further.
Run the following code in a new code cell to plot the 3D graph.
fig3d = px.scatter_3d(iris_df, x="sepal_width", y="sepal_length", z="petal_length", color="species")
fig3d.show()
Taking a look at our new plot, we can see that despite having roughly the same sepal_width and sepal_length, versicolor and virginica are very different when it comes to petal_length. Aside from a few mix-ups, virginica and versicolor can efficiently be separated using their petal length. With that in mind, let's create our analysis AI and get to work.
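You can back up this visual impression with numbers by comparing the average petal_length per species. A sketch using scikit-learn's built-in copy of the dataset (its column names differ slightly from our CSV; in the notebook you would group iris_df by "species" instead):

```python
import pandas as pd
from sklearn.datasets import load_iris

# Built-in copy of the Iris data, loaded as a DataFrame for illustration
data = load_iris(as_frame=True)
df = data.frame
df["species"] = data.target_names[data.target]

# Mean petal length per species; the three means are clearly separated
print(df.groupby("species")["petal length (cm)"].mean())
```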
Training Our Model
Based on what we have observed after sufficient analysis of our data, we can tell that k-nearest neighbors is a strong candidate for our task.
Without going into too much detail, this algorithm places each sample at a position in a multidimensional space according to its properties; it then assumes that the identity of a sample is whatever the majority of its nearest neighbors are. We can adjust how many neighbors it considers, and this is what the k in k-nearest neighbors stands for.
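To make the idea concrete, here is a minimal 1-nearest-neighbor sketch in plain Python, on hypothetical toy data and with no scikit-learn; it only illustrates the "closest point wins" intuition:

```python
import math

def nearest_neighbor_predict(train_points, train_labels, query):
    """Label the query with the label of the closest training point."""
    distances = [math.dist(p, query) for p in train_points]
    return train_labels[distances.index(min(distances))]

# Toy 2D data: two well-separated clusters, "A" and "B"
points = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (5.3, 4.9)]
labels = ["A", "A", "B", "B"]

print(nearest_neighbor_predict(points, labels, (1.1, 0.9)))  # "A"
print(nearest_neighbor_predict(points, labels, (5.1, 5.0)))  # "B"
```

Scikit-learn's KNeighborsClassifier, which we use below, does the same thing far more efficiently and with a configurable k.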
Create a new cell and run the following code to split the data frame into X (input) and Y (output) groups.
X = iris_df.drop("species", axis=1)
Y = iris_df["species"]
Next let’s create our training and testing splits
from sklearn.model_selection import train_test_split
x_train, x_test,y_train,y_test = train_test_split(X,Y,test_size=0.2,
random_state=0)
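A quick way to sanity-check the split is to print the resulting shapes: the full Iris dataset has 150 rows, so with test_size=0.2 we expect 120 training rows and 30 test rows. A self-contained sketch using scikit-learn's built-in copy of the data (in the notebook you would just print x_train.shape and x_test.shape):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Built-in copy of the Iris data, used here only for illustration
X, Y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.2, random_state=0
)

print(x_train.shape, x_test.shape)  # (120, 4) (30, 4)
```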
As you can see, this cell again uses a new library, and it is up to you to decide whether you want to import it here or in the first code cell of the notebook. Next, let's import the nearest neighbors classifier and fit it.
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(x_train, y_train)
Here we are importing the classifier and training it on our training set. Once trained, let's test it and print out its performance.
from sklearn.metrics import accuracy_score
y_train_pred = knn.predict(x_train) #our prediction on the training data
y_test_pred = knn.predict(x_test) # our prediction on the test data
y_train_pred_accuracy = accuracy_score(y_train, y_train_pred)
y_test_pred_accuracy = accuracy_score(y_test, y_test_pred)
print("Training set accuracy: ", y_train_pred_accuracy)
print("Testing set accuracy: ", y_test_pred_accuracy)
If you run this final cell, you will get an output showing how well the model performs on both test and training sets.
As you can see, the model scores a whopping 100% on both splits, which simply shows that the task of classifying flowers from premeasured characteristics is a perfect fit for the KNN model.
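One caveat worth knowing: with n_neighbors=1 a perfect training score is expected by construction, since every training point is its own nearest neighbor, so the test score is the more meaningful number. If you want to experiment, you could loop over a few values of k and compare test accuracy; a sketch using scikit-learn's built-in copy of the data so it runs on its own:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Built-in copy of the Iris data, for illustration
X, Y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.2, random_state=0
)

# Compare test accuracy for a few choices of k
for k in (1, 3, 5, 7):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(x_train, y_train)
    print(k, knn.score(x_test, y_test))
```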
Thanks a bunch for following along! Please be sure to clap for and share this post, and reach out with any inquiries. In the next post we shall proceed to explore the more complicated problem of Large Data Manipulation: we shall build an algorithm to recognize handwritten digits. See you then!