Preface – Hands-On Machine Learning with scikit-learn and Scientific Python Toolkits


You have already seen Harvard Business Review describe data science as the sexiest job of the 21st century. You have been watching terms such as machine learning and artificial intelligence pop up around you in the news all the time. You aspire to join this league of machine learning data scientists soon. Or maybe you are already in the field but want to take your career to the next level. You want to learn more about the underlying statistical and mathematical theory, and apply this new knowledge using the most commonly used tool among practitioners, scikit-learn.

This book is here for you. It begins with an explanation of machine learning concepts and fundamentals and strikes a balance between theoretical concepts and their applications. Each chapter covers a different set of algorithms and shows you how to use them to solve real-life problems. You'll also learn various key supervised and unsupervised machine learning algorithms using practical examples. Whether it is an instance-based learning algorithm, Bayesian estimation, a deep neural network, a tree-based ensemble, or a recommendation system, you'll gain a thorough understanding of its theory and learn when to apply it to real-life problems.

This book will not stop at scikit-learn, but will help you add even more tools to your toolbox. You will augment scikit-learn with other tools such as pandas, Matplotlib, imbalanced-learn, and scikit-surprise. By the end of this book, you will be able to orchestrate these tools together to take a data-driven approach to providing end-to-end machine learning solutions.

Who this book is for

This book is for machine learning data scientists who want to master the theoretical and practical sides of machine learning algorithms and understand how to use them to solve real-life problems. Working knowledge of Python and a basic understanding of the underlying mathematical and statistical concepts are required. Nevertheless, this book will walk you through the new concepts to cater to both new and experienced data scientists.

What this book covers

Chapter 1, Introduction to Machine Learning, will introduce you to the different machine learning paradigms, using examples from industry. You will also learn how to use data to evaluate the models you build.

Chapter 2, Making Decisions with Trees, will explain how decision trees work and teach you how to use them for classification as well as regression. You will also learn how to derive business rules from the trees you build.

Chapter 3, Making Decisions with Linear Equations, will introduce you to linear regression. After understanding its modus operandi, we will learn about related models such as ridge, lasso, and logistic regression. This chapter will also pave the way toward understanding neural networks later on in this book.

Chapter 4, Preparing Your Data, will cover how to deal with missing data using the impute functionality. We will then use scikit-learn, as well as an external library called categorical-encoding, to prepare the categorical data for the algorithms that we are going to use later on in the book.

Chapter 5, Image Processing with Nearest Neighbors, will explain the k-Nearest Neighbors algorithms and their hyperparameters. We will also learn how to prepare images for the nearest neighbors classifier.

Chapter 6, Classifying Text Using Naive Bayes, will teach you how to convert textual data into numbers and use machine learning algorithms to classify it. We will also learn about techniques to deal with synonyms and high data dimensionality.

Chapter 7, Neural Networks – Here Comes the Deep Learning, will dive into how to use neural networks for classification and regression. We will also learn about data scaling since it is a requirement for quicker convergence.

Chapter 8, Ensembles – When One Model Is Not Enough, will cover how to reduce the bias or variance of algorithms by combining them into an ensemble. We will also learn about the different ensemble methods, from bagging to boosting, and when to use each of them.

Chapter 9, The Y is as Important as the X, will teach you how to build multilabel classifiers. We will also learn how to enforce dependencies between your model outputs and make a classifier's probabilities more reliable with calibration.

Chapter 10, Imbalanced Learning – Not Even 1% Win the Lottery, will introduce the use of an imbalanced learning helper library and explore different ways for over- and under-sampling. We will also learn how to use the sampling methods with the ensemble models.

Chapter 11, Clustering – Making Sense of Unlabeled Data, will cover clustering as an unsupervised learning algorithm for making sense of unlabeled data.

Chapter 12, Anomaly Detection – Finding Outliers in Data, will explore the different types of anomaly detection algorithms.

Chapter 13, Recommender Systems – Get to Know Their Taste, will teach you how to build a recommendation system and deploy it in production.

To get the most out of this book

You will need Python 3.x installed on your computer. It is a good practice to set up a virtual environment to install the required libraries into. It's up to you whether you wish to use Python's venv module, the virtual environment provided by Anaconda, or any other option you like. I'll be using pip to install the libraries needed in the book, but once more, it is up to you whether you prefer to use conda or any other alternatives.
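As a quick sanity check of the setup described above, the following minimal sketch reports whether the interpreter is a Python 3 release and whether it is running inside a venv-style virtual environment. The helper names `python_is_3x` and `in_virtual_environment` are hypothetical and chosen only for illustration. The check relies on `sys.prefix` differing from `sys.base_prefix`, which holds for environments created with Python's venv module; an Anaconda environment may not be detected this way.

```python
import sys


def python_is_3x(version_info=sys.version_info):
    """Return True when the given version tuple is a Python 3 release."""
    return version_info[0] == 3


def in_virtual_environment():
    """Return True when running inside a venv-style virtual environment.

    Inside such an environment, sys.prefix points at the environment,
    while sys.base_prefix still points at the base installation.
    """
    return sys.prefix != sys.base_prefix


print("Python 3.x:", python_is_3x())
print("Inside a virtual environment:", in_virtual_environment())
```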

In Chapter 1, Introduction to Machine Learning, I will explain the essential libraries you need to install to get started. I will show you how to install them using the same versions tested here, so we are both on the same page throughout the rest of this book. Whenever we need to install any additional libraries in the later chapters, I will also explain how to set them up.
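If you want to compare what is installed in your environment against the versions used in the book, a small sketch along the following lines may help. It assumes Python 3.8 or later, where `importlib.metadata` is part of the standard library. The package list is taken from the tools named in this preface, and the helper `installed_versions` is a hypothetical name used only for illustration; the exact versions to match are given in Chapter 1, not here.

```python
from importlib import metadata

# Distribution names of the tools mentioned in this preface (an assumption:
# these are the names as published on PyPI).
PACKAGES = [
    "scikit-learn",
    "pandas",
    "matplotlib",
    "imbalanced-learn",
    "scikit-surprise",
]


def installed_versions(packages):
    """Map each package name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(PACKAGES).items():
    print(f"{name}: {version or 'not installed'}")
```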

I used Jupyter Notebooks to run the code in this book and display the accompanying graphs. I recommend that you also go to the Project Jupyter site and install Jupyter Notebook or JupyterLab. This setup is usually recommended when running experimental code. It helps you cut your code into pieces, iterate on each part separately, and display the resulting graphs alongside the code. When it comes to writing production code, you may use your favorite integrated development environment (IDE) instead.

In addition to the software needed, you will sometimes need to download additional datasets. I will provide links to the required datasets when needed, and give step-by-step explanations on how to download and preprocess them.

I wrote the entire book and ran its code on a MacBook Pro with 16 GB RAM. I expect the code here to run on any other operating system, whether it is Microsoft Windows or any one of the different Linux distributions. It is more common for machine learning algorithms to hit a memory limitation before hitting a CPU bottleneck. Nevertheless, for most of the code and the datasets used here, I would expect computers with less memory than mine to still work fine.
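To get a rough feel for the memory point above, the following back-of-the-envelope sketch estimates the size of a dense numeric table. It assumes 8 bytes per value (a 64-bit float), and the function names are hypothetical. Real memory usage is typically higher, since many algorithms create copies or intermediate arrays, so treat the result as a lower bound rather than a guarantee.

```python
def dense_array_bytes(n_rows, n_cols, bytes_per_value=8):
    """Bytes needed to hold a dense n_rows x n_cols table of numbers."""
    return n_rows * n_cols * bytes_per_value


def fits_in_memory(n_rows, n_cols, ram_gib, bytes_per_value=8):
    """Return True when the raw table alone fits in ram_gib GiB of RAM."""
    return dense_array_bytes(n_rows, n_cols, bytes_per_value) <= ram_gib * 1024**3


# One million rows by one hundred 64-bit columns: 800,000,000 bytes.
size = dense_array_bytes(1_000_000, 100)
print(f"{size:,} bytes, about {size / 1024**3:.2f} GiB")
print("Fits in 16 GiB:", fits_in_memory(1_000_000, 100, ram_gib=16))
```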

If you are using the digital version of this book, we advise you to type the code yourself or access the code via the GitHub repository (link available in the next section). Doing so will help you avoid any potential errors related to the copying and pasting of code.

Download the example code files

You can download the example code files for this book from your account at If you purchased this book elsewhere, you can visit and register to have the files emailed directly to you.

You can download the code files by following these steps:

  1. Log in or register at
  2. Select the Support tab.
  3. Click on Code Downloads.
  4. Enter the name of the book in the Search box and follow the onscreen instructions.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

  • WinRAR/7-Zip for Windows
  • Zipeg/iZip/UnRarX for Mac
  • 7-Zip/PeaZip for Linux

The code bundle for the book is also hosted on GitHub at In case there's an update to the code, it will be updated on the existing GitHub repository.

We also have other code bundles from our rich catalog of books and videos available at Check them out!

Download the color images

We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here:

Conventions used

There are a number of text conventions used throughout this book.

CodeInText: Indicates code words in the text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: "We are going to use its fit_transform variable and a transform method."

A block of code is set as follows:

import numpy as np
import scipy as sp
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

Any command-line input or output is written as follows:

$ pip install jupyter
$ pip install matplotlib

Bold: Indicates a new term, an important word, or words that you see onscreen. For example, words in menus or dialog boxes appear in the text like this. Here is an example: "One-hot encoding is recommended for linear models and K-Nearest Neighbor (KNN) algorithms."

Warnings or important notes appear like this.
Tips and tricks appear like this.

Get in touch

Feedback from our readers is always welcome.

General feedback: If you have questions about any aspect of this book, mention the book title in the subject of your message and email us

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report it to us by selecting your book, clicking on the Errata Submission Form link, and entering the details.

Piracy: If you come across any illegal copies of our works in any form on the Internet, we would be grateful if you would provide us with the location address or website name. Please contact us at copyright@packt.com with a link to the material.

If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please


Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!

For more information about Packt, please visit