Practical Introduction to Jupyter Notebooks — 3. Large Feature Extraction
Alright, welcome back to our Jupyter explorations. It’s another day here, and we have a new task at hand. In the previous part we learnt about large data transformation: we took some extremely high-dimensional data and reduced it to a dimensionality that made sense for training. Today we have data that is not just in the wrong dimensions; it is in a completely wrong format.
The complete source code and data for all four parts of this series can be found on my GitHub page.
We’re going to find a way to convert this data into a format that’s suitable for our AI to analyze. Let’s get started with our Spam Email Classification.
# As usual, we start by importing all our needed libraries.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
Data Importation
By this time, I reckon you’re starting to get the hang of things. We’re going to start by importing our data and doing some pre-training analysis on it.
In this case, the data we’re importing is a single CSV file called “sec-data.csv”, so we shall need to split it into training and test sets ourselves.
mail_df = pd.read_csv("sec-data.csv") #Load the data
Let’s go ahead and print some details about our data.
print("Data shape:", mail_df.shape, "\n")
print("Data summary:", "\n", mail_df)
Alright, perfect. We have 5572 samples, and each sample has two columns: the message text and a category that says whether it is legitimate mail (ham) or spam. Since we’re at risk of having null entries in our data, let’s check for that, and if we do have some of those, we can remove them.
mail_df.info() #info() prints its own summary, so wrapping it in print() would only add a stray "None"
From the looks of it our data is spotless, and we do not have any null entries.
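Had the summary shown missing values, removing them would only take a couple of lines. Here is a minimal sketch on a small stand-in frame; toy_df and its rows are made up for illustration, and only the column names mirror our dataset.

```python
import pandas as pd

# Hypothetical stand-in data with two deliberately missing entries.
toy_df = pd.DataFrame({
    "Category": ["ham", "spam", "ham", None],
    "Message": ["See you at 5", "WIN a free prize now", None, "Call me"],
})

# Count the missing values in each column.
null_counts = toy_df.isnull().sum()
print(null_counts)

# Drop every row that is missing either the label or the text.
clean_df = toy_df.dropna(subset=["Category", "Message"]).reset_index(drop=True)
print(clean_df.shape)
```

Dropping rows is the simplest option for text data like this, since there is no sensible way to fill in a missing message.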
Now we need to modify our data to make it trainable. Currently our label is either “spam” or “ham”, but the ML models we use work on numbers, not on raw text. In these sorts of situations it is therefore common to convert each text label into a number that stands for one of the possible classes.
Let’s do this in this dataset by replacing the ‘Category’ feature with 0 or 1 depending on whether it is spam or ham respectively.
mail_df.loc[mail_df['Category'] == 'spam', 'Category'] = 0 #spam becomes 0
mail_df.loc[mail_df['Category'] == 'ham', 'Category'] = 1 #ham becomes 1
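The same encoding can also be written as a single dictionary lookup. This is an alternative sketch, not the approach used in the notebook; toy_df and label_map are names made up for this example.

```python
import pandas as pd

# Hypothetical stand-in for the 'Category' column.
toy_df = pd.DataFrame({"Category": ["ham", "spam", "ham"]})

# Same convention as above: spam -> 0, ham -> 1.
label_map = {"spam": 0, "ham": 1}
encoded = toy_df["Category"].map(label_map)

# map() leaves NaN for any label missing from the dictionary,
# so check for that before casting to integers.
assert not encoded.isnull().any()
toy_df["Category"] = encoded.astype("int")
print(toy_df)
```

Keeping the mapping in one dictionary makes the convention easy to look up later when reading predictions.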
Let’s check what our data frame looks like now.
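One way to do that, sketched here on a small stand-in frame (toy_df and its rows are invented for illustration), is to print the first rows and count how many samples fall into each class.

```python
import pandas as pd

# Hypothetical frame that already uses the numeric labels (0 = spam, 1 = ham).
toy_df = pd.DataFrame({
    "Category": [1, 0, 1, 1],
    "Message": [
        "Are we still on for lunch?",
        "You have won a free cruise",
        "Running late, sorry",
        "Call me when you land",
    ],
})

# First few rows, to confirm the labels are now numbers.
print(toy_df.head())

# Number of samples per class, to see how balanced the data is.
class_counts = toy_df["Category"].value_counts()
print(class_counts)
```

The class counts are worth a glance because a heavily unbalanced dataset changes how an accuracy number should be read.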
Perfect, now we can create our data splits.
X = mail_df['Message']
Y = mail_df['Category']
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=3)
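If one class is much rarer than the other, a plain random split can leave the test set with very few examples of it. The sketch below shows the optional stratify argument of train_test_split, which keeps the class proportions the same in both splits. The notebook itself does not use it, and the toy series here are invented for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: 5 spam (0) and 20 ham (1) messages.
X_toy = pd.Series([f"message {i}" for i in range(25)])
y_toy = pd.Series([0] * 5 + [1] * 20)

# stratify=y_toy asks for the same spam/ham ratio in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=3, stratify=y_toy
)

print("train spam/ham:", (y_tr == 0).sum(), (y_tr == 1).sum())
print("test spam/ham:", (y_te == 0).sum(), (y_te == 1).sum())
```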
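Alright, now that we’ve split our data, it’s time to vectorize it. By this, we mean converting the message text, which is not numerical, into numerical feature vectors. We use TF-IDF for this, which weights each word by how often it appears in a message and how rare it is across all messages. Note that the vectorizer is fitted on the training messages only; the test messages are just transformed with the vocabulary learnt from training.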
features = TfidfVectorizer(min_df=1, stop_words='english', lowercase=True)
X_train_features = features.fit_transform(X_train)
X_test_features = features.transform(X_test)
Y_train = Y_train.astype('int')
Y_test = Y_test.astype('int')
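To see what the vectorizer actually produces, here is a minimal sketch on a handful of made-up messages (train_texts and test_texts are invented for this example). It uses the same TfidfVectorizer settings as above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical messages, only for illustration.
train_texts = [
    "Free prize waiting for you",
    "Are you free for lunch",
    "Claim your prize",
]
test_texts = ["Free lunch prize", "Totally unseen words"]

vec = TfidfVectorizer(min_df=1, stop_words='english', lowercase=True)
train_matrix = vec.fit_transform(train_texts)  # learns the vocabulary
test_matrix = vec.transform(test_texts)        # reuses that vocabulary

# One column per vocabulary word; stop words such as "you" are dropped.
print(sorted(vec.vocabulary_))
print(train_matrix.shape, test_matrix.shape)

# A message made only of words never seen in training becomes an all-zero row.
print(test_matrix[1].nnz)
```

Both matrices have the same number of columns, which is what lets a model trained on one be applied to the other.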
Now we can go ahead and create our model and start to train it on our data.
model = LogisticRegression()
model.fit(X_train_features, Y_train)
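Once a model like this is fitted, scoring a brand-new message means passing it through the same vectorizer first. The sketch below bundles both steps with make_pipeline so that this happens automatically. It is an illustrative variant rather than the notebook's code, and the toy messages and labels are invented.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical training data: 0 = spam, 1 = ham.
toy_messages = [
    "Win a free prize now",
    "Claim your free cash reward",
    "Urgent: prize waiting, call now",
    "Are we meeting for lunch",
    "See you at the office tomorrow",
    "Can you send me the notes",
]
toy_labels = [0, 0, 0, 1, 1, 1]

# The pipeline vectorizes the raw text and then fits the classifier.
clf = make_pipeline(
    TfidfVectorizer(stop_words="english", lowercase=True),
    LogisticRegression(),
)
clf.fit(toy_messages, toy_labels)

# Raw text goes straight in; the pipeline applies the fitted vectorizer.
new_message = ["Claim your free prize today"]
predicted = clf.predict(new_message)
probabilities = clf.predict_proba(new_message)  # columns ordered as [spam, ham]
print(predicted, probabilities)
```

With only six training messages the prediction itself means little; the point is that the vectorizer and the model travel together.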
Now that we have a fully trained model, we can test it to see just how well it performs on the test dataset. To do this we shall use accuracy_score, which is part of sklearn.
y_test_prediction = model.predict(X_test_features)
accuracy = accuracy_score(Y_test, y_test_prediction)
print(accuracy)
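Accuracy alone can flatter a model when one class dominates the data. The sketch below uses made-up labels (y_true and y_pred are invented, not results from our model) to show how a confusion matrix exposes what a single accuracy number hides.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels: 0 = spam, 1 = ham. Eight ham and two spam messages.
y_true = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]

# A lazy "model" that calls everything ham.
y_pred = [1] * 10

acc = accuracy_score(y_true, y_pred)
print(acc)  # 0.8, even though no spam was caught

# Rows are the true class, columns the predicted class, in the order [0, 1].
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
print(cm)
# [[0 2]
#  [0 8]]
```

The top row shows that both spam messages were predicted as ham, which the 0.8 accuracy does not reveal on its own.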
There we go. With that we can see that our model performs really well even on unseen data. Alright, that’s it for our third portion. In our next part we’re going to tackle sequential data modeling.
We’ll build a model that, given a training set, can generate similar text. See you soon!