Tutoring for Classification Trees, Entropy and Information Gain

Graduatetutor.com provides tutoring for classification trees. We first help you understand entropy and information gain, which form the foundation of classification trees.

Classification is derived from “classes”. Classification is the process of sorting data into segments, groups, categories or classes – the reason for the terminology ‘classification’! Examples of where classification is used include sorting out emails into spam vs. not spam, decisions on approving or rejecting a mortgage application, hire or fire, accept or reject a job offer, etc.

Types of Classification: Binary, Multi-Class & Multi-Label

Classification into two classes is called binary classification. Examples of binary classification include hire or fire, spam or not spam. Classification into more than two classes is called multi-class classification. Examples of multi-class classification include sorting images by type of fruit, identifying the type of customer, genre of music, etc. There is also multi-label classification, where one item can carry several labels at once; for example, a movie can be about a person, a culture and a time period.

There are various methods used in classification. Some of them include judgement, case-based methods, decision trees, k-nearest neighbors, artificial neural networks and naive Bayes. Each of these has its advantages and disadvantages.

Entropy and Information Gain

Entropy and information gain are two terms used in arriving at classes through a classification tree. The concept of entropy in information theory was introduced in 1948 by Claude Shannon in “A Mathematical Theory of Communication”, well before the term data science was conceived!

What is Entropy?

The Oxford dictionary defines Entropy as

a thermodynamic quantity representing the unavailability of a system’s thermal energy for conversion into mechanical work, often interpreted as the degree of disorder or randomness in the system. For example: “the second law of thermodynamics says that entropy always increases with time”
lack of order or predictability; gradual decline into disorder. For example: “a marketplace where entropy reigns supreme”
a logarithmic measure of the rate of transfer of information in a particular message or language.

The third definition is the one that is relevant in statistical analysis, data science and information theory! Wikipedia does a better job of defining entropy. According to Wikipedia, in information theory the entropy of a random variable is the average level of “information”, “surprise”, or “uncertainty” inherent in the variable’s possible outcomes.

Another way to think about entropy is as a measure of disorder. I have found it useful to use the analogy of water forms to explain entropy. Water in the form of ice can be said to have low entropy as the particles inside move slowly. Water in the form of a liquid can be said to have a moderate level of entropy as the particles inside move at a moderate speed. Water in the form of vapor can be said to have high entropy as the particles inside move at the highest speed. Here entropy loosely indicates ‘movement’.

How is Entropy Measured?

Claude Shannon, in “A Mathematical Theory of Communication”, quantified entropy as

Entropy = – Σ P(x) * log2(P(x)), where the sum runs over every possible outcome x

Why does Claude Shannon use a logarithm in this measure of entropy? In his paper he gives three reasons: 1) it is more useful in practice, 2) it is closer to our intuition of what the measure should be, and 3) it is mathematically more suitable. My interpretation of his reasoning is that the logarithm makes the measure of entropy more linear. The base of the logarithm can change. Most practical applications use a logarithm to the base 2, which gives entropy in bits.

And in English: entropy is the negative of the sum, over every outcome, of the probability of that outcome multiplied by the logarithm of that probability.
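The formula above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation; the function name `entropy` and the example probabilities are made up for this sketch. Outcomes with zero probability are skipped, following the usual convention that they contribute nothing to the sum.

```python
import math

def entropy(probabilities):
    """Shannon entropy, in bits, of a list of outcome probabilities."""
    # Skip p == 0: such outcomes are conventionally taken to contribute 0.
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# Illustrative distributions (not from any real data set)
fair_coin = entropy([0.5, 0.5])     # two equally likely outcomes
biased_coin = entropy([0.9, 0.1])   # one outcome dominates

print(fair_coin)    # 1.0
print(biased_coin)  # smaller than the fair coin's entropy
```

The fair coin is the most uncertain two-outcome case, so it has the highest entropy; the more lopsided the probabilities, the lower the entropy.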

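A higher value of entropy means that there is a higher level of disorder. Viewed from another angle, it indicates a lower level of purity. From an interpretation standpoint, when we measure the entropy of the variable we are trying to predict within a group of records, a higher level of disorder means that the group gives us less help in classifying the data. With two classes and a logarithm to the base 2, entropy ranges from 0 to 1; with k classes the maximum is log2(k). In the two-class case, an entropy of 1 indicates that the classes are evenly mixed, so the group provides no useful information to aid with classification, whereas an entropy of 0 indicates that the group is pure and therefore very useful.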
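To see purity and the range of entropy in action, here is a small sketch that computes entropy straight from a column of class labels. The helper name `label_entropy` and the label values are hypothetical. It also shows that the upper limit of 1 applies to a two-class column measured with base 2; a column with more classes can go higher.

```python
import math
from collections import Counter

def label_entropy(labels):
    """Entropy, in bits, of the class labels in a column."""
    total = len(labels)
    counts = Counter(labels)  # how often each class appears
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

pure = label_entropy(["fraud", "fraud", "fraud", "fraud"])   # one class only
mixed = label_entropy(["fraud", "ok", "fraud", "ok"])        # 50/50 split
four_way = label_entropy(["a", "b", "c", "d"])               # four equal classes

print(pure == 0)   # True: a pure column has zero entropy
print(mixed)       # 1.0
print(four_way)    # 2.0
```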

We can compute the entropy of both the dependent variable and the independent variables. In other words, we can compute the entropy of every type of information we have. For example, let us assume that we are given a set of online customer data and asked to build a classification tree that helps predict fraud. Let us assume the variables we have are gender, country, device and size of transaction (Large, Medium, Small). We can compute the entropy of each of these four variables.

Note: a variable’s own entropy does not decide where the tree starts. We start our classification tree with the variable (other than the variable we are trying to classify into) whose split gives the highest information gain, which is explained in the next section. In our example, we would start with country if it gave the highest information gain.

Information Gain

Now that you have a better understanding of entropy, we will turn our attention to another term – information gain. Entropy and information gain are used together to arrive at a classification tree. A classification tree is nothing but a decision tree that you follow with various pieces of information to classify a data point into various buckets.

Information gain is simply how much additional information we have gained from one piece of information. For example, in our fraud illustration, if we know that a transaction is from a specific category of device, how much information have we gained towards identifying whether it is fraud or not?

Information gain is computed by subtracting the entropy of one level from the entropy of the level above it.

Information Gain (L1,L2) = Entropy (L1) – Entropy(L2)

Note: The entropy of a level is the weighted average of the entropies of the categories in that level, with each category weighted by its share of the records.

If you think of the top level as a parent and the next level as children, information gain is defined as follows.

Information Gain = Entropy (parent) – Entropy(children)
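As a rough sketch of this calculation, the Python below computes the parent entropy, the weighted average entropy of the children, and their difference. The function names and the tiny device, gender and fraud columns are invented for illustration and are not taken from any real data set.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(feature_values, labels):
    """Entropy(parent) minus the weighted average Entropy(children)."""
    # Group the target labels by the category of the splitting variable.
    groups = defaultdict(list)
    for value, label in zip(feature_values, labels):
        groups[value].append(label)
    total = len(labels)
    # Each child is weighted by its share of the records.
    children = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - children

# Hypothetical toy data: four transactions
fraud = ["yes", "yes", "no", "no"]
device = ["mobile", "mobile", "desktop", "desktop"]  # separates fraud perfectly
gender = ["F", "M", "F", "M"]                        # tells us nothing

print(information_gain(device, fraud))  # 1.0
print(information_gain(gender, fraud))  # 0.0
```

In this toy data, knowing the device removes all uncertainty about fraud, so the gain equals the full parent entropy, while gender leaves each child as mixed as the parent and gains nothing.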

Building the Classification Tree

You are ready to start building a classification tree once you have understood entropy and information gain.

We choose the variable with the highest information gain to be our starting node on the decision tree. Within each branch of the chosen variable, we then look at the information gain of all the other variables and select the variable with the highest information gain to be the next variable.

We continue this process until no variable gives a significant information gain.
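The procedure can be sketched as a short recursive function. This is an illustrative outline in the spirit of the ID3 algorithm, not a production implementation: the names `build_tree` and `min_gain`, the dictionary representation of the tree, and the toy rows are all assumptions made for this example.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(rows, labels, feature):
    """Parent entropy minus the weighted entropy after splitting on feature."""
    total = len(labels)
    children = 0.0
    for value in set(row[feature] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[feature] == value]
        children += len(subset) / total * entropy(subset)
    return entropy(labels) - children

def build_tree(rows, labels, features, min_gain=1e-9):
    """Greedily split on the highest-gain variable until nothing is gained."""
    majority = Counter(labels).most_common(1)[0][0]
    if not features:
        return majority  # no variables left: predict the most common class
    gains = {f: information_gain(rows, labels, f) for f in features}
    best = max(gains, key=gains.get)
    if gains[best] <= min_gain:
        return majority  # no significant information gain: stop here
    remaining = [f for f in features if f != best]
    branches = {}
    for value in sorted(set(row[best] for row in rows)):
        keep = [i for i, row in enumerate(rows) if row[best] == value]
        branches[value] = build_tree([rows[i] for i in keep],
                                     [labels[i] for i in keep],
                                     remaining, min_gain)
    return {best: branches}

# Hypothetical toy data: device predicts the label, country does not
rows = [
    {"device": "mobile", "country": "A"},
    {"device": "mobile", "country": "B"},
    {"device": "desktop", "country": "A"},
    {"device": "desktop", "country": "B"},
]
labels = ["fraud", "fraud", "ok", "ok"]

tree = build_tree(rows, labels, ["device", "country"])
print(tree)  # {'device': {'desktop': 'ok', 'mobile': 'fraud'}}
```

The `min_gain` threshold stands in for "no significant information gain"; what counts as significant is a judgement call, and real tools typically add further stopping rules such as a maximum depth or a minimum number of records per node.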

Next steps on Classification Trees

This article helps you understand what entropy is and how it is used in classification. We did not build on the intuition for the entropy concept in this article, but this video tries to go deeper on the intuition of entropy.