jithin pradeep Cognitive Research Scientist | AI and Mixed reality Enthusiast

Representing Categorical values in Machine learning

Let's start with an example of a categorical value. Everyone is probably familiar with the Iris dataset: its species value ["Iris Setosa", "Iris Versicolour", "Iris Virginica"] is a categorical variable. It is difficult to work with categorical values directly. Whether we use a complex machine learning algorithm or a simple statistical test in SPSS, we need to encode the information numerically, for example with an integer encoding scheme.

Two approaches are discussed below, simple integer encoding and one hot encoding, using the species values of the Iris dataset as the example.

One hot encoding is a binary representation technique used with categorical data; it is also called the one-of-K scheme. Most machine learning algorithms cannot work directly with categorical data, and this is true for both input and output variables. In the workflow used here, we first convert the categorical values to numbers (simple integer encoding) and then convert those numbers to binary vectors. The question then is: why do we need to convert the numbers again to binary vectors? Why not use the numbers directly to represent the data, since machine learning algorithms do work with numbers?

The answer is that we could indeed use an integer encoding scheme directly and rescale the values wherever required. This works for data with a natural ordinal relationship between the categories, and hence between the numbers representing them, for example temperature {cold, warm, hot} or speed {low, medium, high}. A problem can arise when there is no ordinal relationship; in that case the integer representation might hurt the performance of the machine learning model. Here we are instead interested in a vector that holds one probability-like value per class for each data record, which represents the categories without imposing an order on them. When one hot encoding is used for the output variable, it may also offer a more nuanced set of predictions than a single label.
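For ordered categories, the integer mapping can be written out explicitly so that the numeric order matches the category order. Below is a minimal sketch in plain Python; the names temperatureOrder and readings, and the sample values, are hypothetical and not part of the Iris demo.

```python
# Ordered categories: the position in this list defines the integer code.
temperatureOrder = ['cold', 'warm', 'hot']
temperatureToInt = {label: index for index, label in enumerate(temperatureOrder)}

# Hypothetical sample readings, used only for illustration.
readings = ['warm', 'cold', 'hot', 'warm']
encodedReadings = [temperatureToInt[label] for label in readings]

print(encodedReadings)  # [1, 0, 2, 1]
```

Here comparisons such as cold < warm < hot carry over to the integers 0 < 1 < 2, which is exactly the property that is missing for unordered categories.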

As an example of the problem with simple integer encoding on categorical data without an ordinal relationship, consider the Iris dataset again. There is no ordering among the species classes. With simple integer encoding, Iris Setosa is encoded as 0, Iris Versicolour as 1 and Iris Virginica as 2. A model could then infer the ordering Iris Setosa < Iris Versicolour < Iris Virginica from these values. Things get stranger if we take the average of two classes: avg(0, 2) = 1, which is equivalent to saying that the average of Iris Setosa and Iris Virginica is Iris Versicolour. The example is primitive, but the conclusion is that such a representation can lead us to misleading patterns that never existed in the data in the first place.
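The averaging argument can be checked in a few lines. The sketch below uses numpy and hard-coded codes for the three species (the variable names are illustrative); it compares the average of two integer codes with the average of the corresponding one hot vectors.

```python
import numpy as np

classNames = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']

# Simple integer encoding: setosa -> 0, versicolor -> 1, virginica -> 2
setosaCode, virginicaCode = 0, 2
integerAverage = (setosaCode + virginicaCode) / 2
# The average lands exactly on the code for Iris-versicolor.
print(integerAverage, classNames[int(integerAverage)])

# One hot encoding: each class gets its own position in the vector.
setosaOneHot = np.array([1.0, 0.0, 0.0])
virginicaOneHot = np.array([0.0, 0.0, 1.0])
oneHotAverage = (setosaOneHot + virginicaOneHot) / 2
# The average keeps weight on setosa and virginica only; versicolor stays at 0.
print(oneHotAverage)
```

With one hot vectors the average reads as "half setosa, half virginica" rather than as a third, unrelated class.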

So how do we implement one hot encoding? I personally prefer using my own code, but the scikit-learn and Keras packages both provide functions for the encoding scheme. Nowadays I use sklearn.preprocessing (LabelEncoder() and OneHotEncoder(); there are many more functions to explore) for my day-to-day data preprocessing tasks, and I have used Keras (to_categorical()) as well.
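A hand-written version can be quite small. The following is only a sketch of one possible approach using numpy, not the author's actual code; the function name oneHotEncode and the sample labels are made up for illustration.

```python
import numpy as np

def oneHotEncode(labels):
    """Return (classes, matrix) where matrix[i] is the one hot vector of labels[i]."""
    # np.unique returns the sorted unique classes and, with return_inverse,
    # the integer code of every label with respect to that sorted list.
    classes, integerCodes = np.unique(labels, return_inverse=True)
    # Row k of the identity matrix is the one hot vector for class k.
    matrix = np.eye(len(classes))[integerCodes]
    return classes, matrix

classes, encoded = oneHotEncode(
    ['Iris-virginica', 'Iris-setosa', 'Iris-versicolor', 'Iris-setosa'])
print(classes)
print(encoded)
```

Indexing the identity matrix with the integer codes combines the integer encoding step and the binary vector step in one function.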

Implementing One Hot Encoding in Python

In [2]:

import pandas as pd
import os
import numpy as np

# Fetching the program path
pgmPath = os.getcwd()
# Loading the Iris dataset for the purpose of the demo
# (os.path.join builds the path portably and avoids backslash-escape issues)
datasetPath = os.path.join(pgmPath, 'Dataset', 'irisdataset.csv')
datasetColumns = ['sepal length', 'sepal width', 'petal length', 'petal width', 'class']
featureColumns = ['sepal length', 'sepal width', 'petal length', 'petal width']
classColumn = ['class']

# Loading the dataset into a pandas dataframe
# (assumes the CSV has a header row that includes a 'class' column)
irisDF = pd.read_csv(datasetPath)

With scikit-learn, LabelEncoder provides the simple integer encoding scheme and OneHotEncoder provides one hot encoding. In one hot encoding, the index position corresponding to the record's class is set to 1 and the positions corresponding to all other classes are set to 0. Hence, if the categorical variable has three classes, the binary vector has length 3, i.e. three elements.

In [3]:

from numpy import array
from numpy import argmax
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

In [4]:

# List unique values in the irisDF['class'] column
irisDF['class'].unique()

array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], dtype=object)

Simple integer encoding using LabelEncoder() (for more details, see the scikit-learn documentation for LabelEncoder):

In [7]:

labelEncoderObj = LabelEncoder()
integerEncoding = labelEncoderObj.fit_transform(irisDF['class'].unique())
print(integerEncoding)

[0 1 2]

One hot encoding using OneHotEncoder() (for more details, see the scikit-learn documentation for OneHotEncoder):

In [8]:

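# sparse_output=False returns a dense array (scikit-learn 1.2+; older releases used sparse=False)
oneHotEncoderObj = OneHotEncoder(sparse_output=False)
# Reshaping integerEncoding to shape (len(integerEncoding), 1)
integerEncoding = integerEncoding.reshape(len(integerEncoding), 1)
oneHotEncoding = oneHotEncoderObj.fit_transform(integerEncoding)
print(oneHotEncoding)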

[[ 1.  0.  0.]
 [ 0.  1.  0.]
 [ 0.  0.  1.]]
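
Since scikit-learn 0.20, OneHotEncoder can also work on string categories directly, so the intermediate LabelEncoder step is optional. Below is a minimal sketch of that alternative; the speciesColumn array is a hard-coded stand-in for irisDF['class'].unique(), and directEncoderObj is an illustrative name.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Stand-in for the unique values of irisDF['class'], as a single-column 2D array.
speciesColumn = np.array(
    ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']).reshape(-1, 1)

directEncoderObj = OneHotEncoder()
# fit_transform returns a sparse matrix by default; toarray() makes it dense.
directEncoding = directEncoderObj.fit_transform(speciesColumn).toarray()

print(directEncoderObj.categories_[0])
print(directEncoding)

# The encoder can also map the vectors back to the original strings.
print(directEncoderObj.inverse_transform(directEncoding).ravel())
```

Calling toarray() on the default sparse result avoids depending on the name of the dense-output parameter, which has changed between scikit-learn releases.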

Now let's see how to decode the one hot encoding to retrieve the categorical class values.

In [10]:

for itemEncoded in oneHotEncoding :
    inverted = labelEncoderObj.inverse_transform([argmax(itemEncoded)])
    print("Class Label {0} and One hot Encoding {1}".format(inverted, itemEncoded))

Class Label ['Iris-setosa'] and One hot Encoding [ 1.  0.  0.]
Class Label ['Iris-versicolor'] and One hot Encoding [ 0.  1.  0.]
Class Label ['Iris-virginica'] and One hot Encoding [ 0.  0.  1.]
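
The same argmax idea works on a whole matrix at once, and also on probability-like vectors such as a model might output. Below is a minimal sketch using only numpy; the classLabels array and the predicted values are made up for illustration.

```python
import numpy as np

classLabels = np.array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'])

# Made-up probability-like rows, one per record; each row sums to 1.
predicted = np.array([[0.8, 0.1, 0.1],
                      [0.2, 0.3, 0.5],
                      [0.1, 0.7, 0.2]])

# argmax along axis 1 picks the most likely class index for every row.
decodedLabels = classLabels[predicted.argmax(axis=1)]
print(decodedLabels)
```

For a strict one hot row, argmax simply returns the position of the single 1, so this decoding agrees with the loop above.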