Getting Started with Machine Learning: An Introduction with Python

Dr. Joyjit Chatterjee
6 min read · Dec 24, 2019

This article is part of a series of articles aimed to foster love for Machine Learning/Deep Learning!

Article 1: A Very Brief — Introduction to Deep Learning https://medium.com/@joyjitece/a-beginners-guide-to-deep-learning-d5746c816ea8

Article 2: You are reading it! (Getting Started with Machine Learning: An Introduction with Python). Note: this article assumes a minimal working knowledge of Python (data types, importing libraries, reading/writing files etc.). If you aren’t familiar, check out this amazing website, https://www.programiz.com/python-programming, which teaches you Python in the simplest of ways.

Python, Python, Python! We keep hearing about Python in today’s data science community. Thanks to its simplicity and the multitude of libraries and packages that let you accomplish a range of tasks in just a few lines of code, this programming language has rightly earned the popularity it deserves. If you are new to the Machine Learning community, I welcome you to a journey full of passion and success! This article introduces a complete novice to the basics of using Python for Machine Learning; further articles in this Medium series will go into more detailed complexities.

Java, C, C++, MATLAB, R: these can of course be used for Machine Learning, but why would we really use them when Python gives us in-built methods and such a wonderful developer community behind the most amazing packages? And yes, it is all free and open source!

Why Python for Machine Learning?

NumPy: a variety of methods for manipulating and performing operations on arrays, matrices, vectors and more.
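For instance, here is a minimal sketch of the kind of operations NumPy gives you (the array values below are purely illustrative):

import numpy as np #NumPy is conventionally imported as np
a = np.array([1, 2, 3]) #A 1-D array (vector)
M = np.array([[1, 2], [3, 4]]) #A 2-D array (matrix)
print(a * 2) #Element-wise multiplication of the vector
print(a.mean()) #Mean of the vector
print(M.T) #Transpose of the matrix
print(M @ M) #Matrix multiplication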

Pandas: Want to read CSV files, text files or other formats as your dataset? Why worry when Pandas is here. Pandas supports a variety of file formats, and with this amazing library you can even visualise your data, perform arithmetic operations on your dataframe (by convention, the object holding a read file is called a dataframe in Pandas), and so on.
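As a quick sketch (the column names and values below are made up purely for illustration), a dataframe supports arithmetic and summaries directly:

import pandas as pd
df_toy = pd.DataFrame({'height_cm': [150, 160, 170], 'weight_kg': [50, 60, 70]}) #A toy dataframe built from a dictionary
df_toy['bmi'] = df_toy['weight_kg'] / (df_toy['height_cm'] / 100) ** 2 #Arithmetic on whole columns at once
print(df_toy.describe()) #Quick summary statistics
df_toy.plot(kind='scatter', x='height_cm', y='weight_kg') #Simple visualisation (needs matplotlib installed)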

Scikit-learn: all you need for any sort of machine learning. You can find everything ranging from decision trees and other classifiers to regression and clustering models. Supervised or unsupervised learning, name it and you have it here.

TensorFlow/Keras/PyTorch: if you want to dive (deep) into the world of Deep Learning, Python supports arguably the most popular libraries used by major organisations for applying AI in deployable systems. Google, Amazon, Facebook AI, Twitter, Netflix: all of them use some form of TensorFlow/Keras/PyTorch behind the movie recommendations they generate, friend suggestions on Facebook, Twitter trends, Amazon product recommendations and spam filters (which used to rely only on regular expressions, but things are becoming smarter and everything now uses AI in some form, cool isn’t it?).
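Just to give a flavour, here is a minimal (untrained) Keras sketch of a tiny feed-forward network; the layer sizes, number of classes and input shape are arbitrary placeholders for illustration, not a recommendation:

from tensorflow import keras
model = keras.Sequential([
    keras.layers.Dense(32, activation='relu', input_shape=(10,)), #Hidden layer; 10 input features assumed for illustration
    keras.layers.Dense(3, activation='softmax') #Output layer for 3 hypothetical classes
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
model.summary() #Prints the architecture of the network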

NLTK: The Natural Language Toolkit (NLTK) package consists of an array of methods for performing Natural Language Processing (NLP) tasks in the most convenient manner you could ever imagine. Stop words, stemming and lemmatization, support for a variety of languages (English, French, German, anything else?), again, name it and you have it. If you are doing any sort of NLP, I can vouch for the fact that you are going to need the NLTK library one way or another.
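For example, here is a minimal sketch of stop-word removal and stemming with NLTK (the sentence is just an illustration, and the corpora need a one-time download):

import nltk
nltk.download('punkt') #Tokenizer models (one-time download)
nltk.download('stopwords') #Stop-word lists (one-time download)
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
text = "Machine learning models are learning from data" #An illustrative sentence
tokens = word_tokenize(text.lower()) #Split the sentence into word tokens
filtered = [w for w in tokens if w not in set(stopwords.words('english'))] #Remove stop words
print([PorterStemmer().stem(w) for w in filtered]) #Reduce the remaining words to their stems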

The very first thing: you have a dataset and you want to load (read) it. How do you go about it?

import pandas as pd #Import keyword is used to import a specific library in Python
df = pd.read_csv('Enter the path of your CSV file here') #The read_csv() method reads the file at the given path into a dataframe
print(df.head()) #Show the first few instances in the dataframe

This snippet simply imports the Pandas library as pd; you then specify the path to your CSV file (or any other readable file), which is read into a dataframe (the convention in Pandas). Now, any time you want to perform operations on your original CSV data, you simply use df; no need to worry about the original file path any more!
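Before going further, a few quick sanity checks on df are handy (these are just common inspection calls, not required steps):

print(df.shape) #(number of rows, number of columns)
print(df.columns) #Names of the columns in the dataframe
print(df.describe()) #Summary statistics for the numeric columns
print(df.isnull().sum()) #Count of missing values per column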

Now that we have read the file successfully into the dataframe, the next step is splitting our dataframe into features and labels. Assume we are handling a supervised machine learning task, in which certain columns of the dataframe specify the attributes of your task and one column (the last one, for example) holds the labels. So, imagine you are doing a simple classification problem and want to assign your data to the respective labels, given the features/predictors/attributes.

To train any Machine Learning model, we first need to split our dataset into features and labels. Assume X holds the features and y the labels:

X = df.iloc[:,:-1] #.iloc selects all rows and every column except the last; these are the features
y = df.iloc[:,-1] #y stores the values of the last column; these are the labels

Suppose we now want to classify the data into different classes and, given an unseen instance of features, predict which class/label it corresponds to. The key here is to first split our data into training and test sets. The training set is the partition of our data used to teach our model, and the test set is the unseen data on which we do inference.

Why split data into train-test?

Because the key aim of any ML task is to make predictions on new, unseen data. If you feed all of your data into the model for training, how would you know whether the model does its job correctly? And, for that matter, what would you predict on? You need something to validate your model's performance: its accuracy, error metrics and so on.

from sklearn.model_selection import train_test_split #train_test_split can be used to split/partition your dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30) #This splits your original data into 2 partitions: training data (70% of original) and test data (30% of original)

All set! Now comes the task of choosing our model. For the sake of simplicity in this article, let us take the simplest possible neural network: the Multi Layer Perceptron (MLP)! What it does is simply perform forward and backward propagation to minimise the loss during training. (To read more about neural networks, check out my article here: https://medium.com/@joyjitece/a-beginners-guide-to-deep-learning-d5746c816ea8)

Now, we want to use this network (model) to classify our dataset. What are we waiting for? Let us train the model and teach it how to learn!

from sklearn.neural_network import MLPClassifier #Import the MLP model from scikit-learn
mlp_network = MLPClassifier() #mlp_network now holds an MLPClassifier instance, so all we need to do is work with mlp_network and teach it to deal with our data!

#Let us train (fit) the model! Fitting the model means teaching it on our training dataset. It will automatically learn the relationships between the features and which group of features most likely leads to a particular label (output class)
mlp_network.fit(X_train, y_train) #Notice we used the training data, not the test data!

#Now make predictions on the unseen test data
model_predictions = mlp_network.predict(X_test)

#Let us now calculate the model accuracy. We already know the correct labels for our test (unseen) data, right? We can now compare them with model_predictions
from sklearn.metrics import accuracy_score
acc_result = accuracy_score(y_test, model_predictions)
#The accuracy_score() method takes in the true values and the predicted values and evaluates the accuracy score for the model!
print(acc_result)
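If you want to go beyond plain accuracy, scikit-learn also provides richer summaries; as an optional follow-up (reusing the same mlp_network, y_test and model_predictions from above), you could print a confusion matrix and a per-class report:

from sklearn.metrics import confusion_matrix, classification_report
print(confusion_matrix(y_test, model_predictions)) #Shows how often each true class is predicted as each other class
print(classification_report(y_test, model_predictions)) #Precision, recall and F1-score for each class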

And voila! Here we are: we have just used Python, my favourite (and probably the favourite of billions of people across the globe) programming language, to train the most basic neural network model on our own CSV file (which can be anything; feed in your own data, and try something other than the conventional Fisher Iris dataset too, as I am sure you would enjoy it!).

I hope to keep continuing this series with further posts in the future! If you would like to connect with me on LinkedIn, reach out at http://linkedin.com/in/joyjitchatterjee/

