
RANDOM FOREST WITH US ADULT INCOME DATASET

Random Forest

A Random Forest is simply a collection of Decision Trees, each generated using a random subset of the data. In order to understand a Random Forest, you first have to understand a Decision Tree.

Decision trees


The goal is to create a model that predicts the value of a target variable based on several input variables. A Decision Tree is a flowchart-like tree structure, where each internal node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node (terminal node) holds a class label. The topmost node in a tree is the root node. A tree can be “learned” by splitting the source set into subsets based on an attribute value test. This process is repeated on each derived subset in a recursive manner.


Appropriate problems for decision tree learning
  • Instances are represented by attribute-value pairs
  • The target function has discrete output values
  • Disjunctive descriptions may be required
  • The training data may contain errors: decision tree methods are robust to errors
  • The training data may contain missing attribute values.

Decision tree learning has been applied to problems such as classifying medical patients by their disease, equipment malfunctions by their cause, and loan applicants by their likelihood of defaulting on payments.

In real data mining situations, the problem is that the decision tree constructed may become gigantic. So it becomes important to find techniques to minimize the size of the tree, as far as possible. The idea is to put the decision nodes which give the most information highest (closest to the root) in the decision tree.


The basic decision tree learning algorithm

Most algorithms used in decision tree learning are variations of the ID3 algorithm, which can be summarized in the following way (a minimal Python sketch follows the steps):

1- Calculate the entropy of every attribute using the data set

2- Split the set \(S\) into subsets using the attribute for which the resulting entropy (after splitting) is minimum (or, equivalently, information gain is maximum)

3- Make a decision tree node containing that attribute

4- Recurse on subsets using remaining attributes.
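
A minimal sketch of this recursion in plain Python (the helper names entropy, average_entropy and id3 are ours, chosen for illustration); it assumes each example is a dict mapping attribute names to values, with the class label stored under the target key:

from collections import Counter
from math import log2

def entropy(labels):
    """Entropy of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def average_entropy(examples, attribute, target):
    """Size-weighted entropy of the subsets produced by splitting on `attribute`."""
    total = len(examples)
    result = 0.0
    for value in set(ex[attribute] for ex in examples):
        subset = [ex[target] for ex in examples if ex[attribute] == value]
        result += len(subset) / total * entropy(subset)
    return result

def id3(examples, attributes, target):
    labels = [ex[target] for ex in examples]
    if len(set(labels)) == 1:                    # pure subset: return a leaf
        return labels[0]
    if not attributes:                           # no attributes left: majority class
        return Counter(labels).most_common(1)[0][0]
    # Steps 1-2: choose the attribute with minimum average entropy (maximum gain).
    best = min(attributes, key=lambda a: average_entropy(examples, a, target))
    # Step 3: create a node for that attribute; step 4: recurse on each subset.
    tree = {best: {}}
    for value in set(ex[best] for ex in examples):
        subset = [ex for ex in examples if ex[best] == value]
        tree[best][value] = id3(subset, [a for a in attributes if a != best], target)
    return tree

# e.g. id3(examples, ['Outlook', 'Humidity', 'Windy', 'Temperature'], 'PlayGolf')
# on the golf table shown further below.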

Which attribute is the best classifier?

First we define a statistical property called, information gain that measures how well a given attribute separates the training examples according to their target classification. ID3 uses this information gain measure to select among the candidate attributes at each step while growing the tree.

In order to define the information gain measure, we first define entropy, which characterizes the (im)purity of a collection of examples. Given a collection \(S\) containing positive and negative examples of some target concept, the entropy of \(S\) relative to this boolean classification is:

\[Entropy\space(S) = -p_+log_2p_+ -p_{-}log_2p_-\]

Where \(p_+\) is the proportion of positive examples in \(S\) and \(p_{-}\) is the proportion of negative examples in \(S\). To illustrate, suppose \(S\) is a collection of 14 examples of some boolean concept, including 9 positive and 5 negative examples. Then the entropy is calculated in the following way:

\[Entropy\space(S) = - \frac{9}{14}log_2(\frac{9}{14}) - \frac{5}{14}log_2(\frac{5}{14})\]

Note that the entropy is 1 when we have an equal quantity of negative and positive examples. More generally, if the target attribute can take on \(c\) different values, then the entropy of \(S\) relative to this \(c\)-wise classification is defined as

\[Entropy\space(S) = \sum_{i=1}^{c} -p_i\,log_2p_i\]

where \(p_i\) is the proportion of \(S\) belonging to class \(i\).
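
As a quick numerical check of the two-class case, using only the Python standard library:

from math import log2

# Entropy of the collection with 9 positive and 5 negative examples
print(-(9/14) * log2(9/14) - (5/14) * log2(5/14))   # ≈ 0.940

This 0.940 is the value of \(E(S)\) that is reused later when computing the information gains.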


Which attribute to select as the root?
| Windy | Outlook | Humidity | Temperature | Play golf? |
|--------|----------|----------|-------------|------------|
| Weak | sunny | high | hot | No |
| Strong | sunny | high | hot | No |
| Weak | overcast | high | hot | Yes |
| Weak | rain | normal | cool | Yes |
| Strong | overcast | normal | cool | Yes |
| Weak | sunny | high | mild | No |
| Weak | sunny | normal | cool | Yes |
| Weak | rain | normal | mild | Yes |
| Strong | sunny | normal | mild | Yes |
| Strong | overcast | high | mild | Yes |
| Weak | overcast | normal | hot | Yes |
| Strong | rain | high | mild | No |
| Strong | rain | normal | cool | No |
| Weak | rain | high | mild | Yes |


Let’s calculate the entropy

  • Outlook -> sunny: 2 examples yes, 3 examples no.
\[E(Outlook)_{sunny} = − \frac{2}{5}log_2(\frac{2}{5}) −\frac{3}{5}log_2(\frac{3}{5})=0.971\]
  • Outlook -> overcast: 4 examples yes, 0 examples no.
\[E(Outlook)_{overcast}= −1\cdot log_2(1)−0\cdot log_2(0)=0\]
  • Outlook -> rainy: 3 examples yes, 2 examples no.
\[E(Outlook)_{rainy} = − \frac{3}{5}log_2(\frac{3}{5})- \frac{2}{5}log_2(\frac{2}{5})=0.971\]


How can we compute the quality of the entire split?

Compute the average entropy over all subsets resulting from the split, weighted by their size.

\[I(S,A) = \sum_ {i} \frac{|S_i|}{|S|} E(S_i)\]

Average entropy for attribute Outlook:

\[I(S, Outlook)=\frac{5}{14}\cdot 0.971+ \frac{4}{14}\cdot 0 +\frac{5}{14}\cdot 0.971=0.693\]

When an attribute \(A\) splits the set \(S\) into subsets \(S_i\)

  • we compute the average entropy
  • and compare the sum to the entropy of the original set \(S\)


Information Gain for Attribute A
\[Gain(S,A) = E(S)- I(S,A) = E(S)- \sum_ {i} \frac{|S_i|}{|S|} E(S_i)\]

The attribute that maximizes the difference is selected: Maximizing information gain is equivalent to minimizing average entropy, because \(E(S)\) is constant for all attributes A.
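
A minimal sketch of this computation in Python (the entropy and gain helpers here are ours and work on class counts rather than raw examples), checked against the Outlook split from the table above:

from math import log2

def entropy(counts):
    """Entropy of a set given its class counts, e.g. [9, 5]."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def gain(parent_counts, child_counts):
    """Information gain: entropy of S minus the size-weighted entropy of its subsets."""
    total = sum(parent_counts)
    average = sum(sum(child) / total * entropy(child) for child in child_counts)
    return entropy(parent_counts) - average

# Outlook splits the 14 examples (9 Yes / 5 No) into sunny (2/3), overcast (4/0), rain (3/2).
print(gain([9, 5], [[2, 3], [4, 0], [3, 2]]))   # ≈ 0.247 (0.246 with the rounded entropies used below)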

Entropy:
  • Humidity -> High: 3 examples ‘Yes’, 4 examples ‘No’
\[E(Humidity)_{high}= − \frac{3}{7}log_2(\frac{3}{7}) −\frac{4}{7}log_2(\frac{4}{7})=0.985\]
  • Humidity -> Normal: 6 examples ‘Yes’, 1 example ‘No’
\[E(Humidity)_{normal}= − \frac{6}{7}log_2(\frac{6}{7}) −\frac{1}{7}log_2(\frac{1}{7})=0.592\]
  • Wind ->Weak: 6 examples ‘Yes’, 2 examples ‘No’
\[E(Wind)_{Weak}= − \frac{6}{8}log_2(\frac{6}{8}) −\frac{2}{8}log_2(\frac{2}{8})=0.811\]
  • Wind ->Strong :3 examples ‘yes’, 3 examples ‘No’
\[E(Wind)_{Strong}= − \frac{3}{6}log_2(\frac{3}{6}) −\frac{3}{6}log_2(\frac{3}{6})= 1\]
  • Temperature -> Hot: 2 examples ‘Yes’, 2 examples ‘No’
\[E(Temperature)_{Hot}= − \frac{2}{4}log_2(\frac{2}{4}) −\frac{2}{4}log_2(\frac{2}{4})= 1\]
  • Temperature -> Cool: 3 examples ‘Yes’, 1 example ‘No’
\[E(Temperature)_{Cool}= − \frac{3}{4}log_2(\frac{3}{4}) −\frac{1}{4}log_2(\frac{1}{4})=0.8112\]
  • Temperature-> Mild: 4 examples ‘Yes’, 2 ‘No’
\[E(Temperature)_{Mild}= − \frac{4}{6}log_2(\frac{4}{6}) −\frac{2}{6}log_2(\frac{2}{6}) = 0.9182\]
  • The set:
\[E(S) = − \frac{9}{14}log_2(\frac{9}{14}) −\frac{5}{14}log_2(\frac{5}{14})=0.940\]
Information Gain
\[Gain(S, Humidity) = 0.940- \frac{7}{14}\cdot 0.985 - \frac{7}{14}\cdot 0.592 = 0.151\]
\[Gain(S, Outlook) = 0.940- \frac{5}{14}\cdot 0.971 - \frac{4}{14}\cdot 0 - \frac{5}{14}\cdot 0.971 = 0.246\]
\[Gain(S, Wind) = 0.940- \frac{8}{14}\cdot 0.811 - \frac{6}{14}\cdot 1 = 0.048\]
\[Gain(S, Temperature) = 0.940- \frac{4}{14}\cdot 1 - \frac{4}{14}\cdot 0.8112 - \frac{6}{14}\cdot 0.9182 = 0.029\]

Outlook has the highest information gain, so it is selected as the root node.

Entropy (within the Outlook = sunny branch, 5 examples: 2 ‘Yes’, 3 ‘No’):
  • Humidity -> High: 3 sunny examples, all of them belong to the ‘No’ class
\[E(Humidity_{high}) =− 1\cdot log_2(1) − 0\cdot log_2(0) = 0\]
  • Humidity -> Normal: 2 sunny examples, all of them are ‘Yes’
\[E(Humidity_{normal}) =− 1\cdot log_2(1) − 0\cdot log_2(0) = 0\]
  • Wind -> Weak: 3 sunny examples, 2 are ‘No’ and 1 is ‘Yes’
\[E(Wind_{Weak})= −\frac{2}{3}log_2(\frac{2}{3}) −\frac{1}{3}log_2(\frac{1}{3})=0.9182\]
  • Wind -> Strong: 2 sunny examples, 1 is ‘Yes’ and the other ‘No’
\[E(Wind_{Strong}) = −\frac{1}{2}log_2(\frac{1}{2}) −\frac{1}{2}log_2(\frac{1}{2})= 1\]
  • Temperature -> Hot: 2 sunny examples, both ‘No’
\[E(Temperature_{Hot}) =− 1\cdot log_2(1) − 0\cdot log_2(0) = 0\]
  • Temperature -> Mild: 2 sunny examples, 1 ‘No’ and 1 ‘Yes’
\[E(Temperature_{Mild}) = −\frac{1}{2}log_2(\frac{1}{2}) −\frac{1}{2}log_2(\frac{1}{2})= 1\]
  • Temperature -> Cool: 1 sunny example, belonging to the ‘Yes’ class
\[E(Temperature_{Cool}) =− 1\cdot log_2(1) − 0\cdot log_2(0) = 0\]


Information gain (within the sunny branch):
\[Gain(Sunny, Humidity) = 0.971 -\frac{3}{5}\cdot 0 - \frac{2}{5}\cdot 0 = 0.971\]
\[Gain(Sunny, Wind) = 0.971 -\frac{3}{5}\cdot 0.9182 -\frac{2}{5}\cdot 1.0 = 0.020\]
\[Gain(Sunny, Temperature) = 0.971 -\frac{2}{5}\cdot 0 - \frac{1}{5}\cdot 0 - \frac{2}{5}\cdot 1 = 0.571\]

We choose the attribute with the largest gain, so Humidity is selected for the sunny branch.

We then repeat the whole procedure on the remaining branches, each time choosing the attribute with the highest information gain.

Final Decision tree


Issues in decision tree learning

Overfitting typically occurs when we try to model the training data perfectly; it can also be caused by an insufficient number of examples. Overfitting results in decision trees that are more complex than necessary. How can we spot overfitting? The training error is low while the test error is comparably high.

Measures against overfitting or growing the tree too deep

Pre-pruning: stop growing the tree earlier, before it reaches the point where it perfectly classifies the training data.

Post-pruning: allow the tree to overfit the data, and then post-prune it.

In practice we usually use the second approach, as it is easier to post-prune the tree than to estimate when to stop growing it.

The important step of tree pruning is to define a criterion to be used to determine the correct final tree size. We can use the following methods (a concrete scikit-learn sketch follows the list):

  • Use a validation set to evaluate the effect of post-pruning nodes from the tree: a training set to build the tree, a validation set to prune it, and a test set to evaluate the final model.
  • Build the tree using the training set, then apply a statistical test to estimate whether pruning or expanding a particular node is likely to produce an improvement beyond the training set (e.g., a chi-square test).
  • Minimum Description Length principle: use an explicit measure of the complexity of encoding the training set and the decision tree, stopping growth of the tree when this encoding size (size(tree) + size(misclassifications(tree))) is minimized.
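
Scikit-learn does not implement rule post-pruning or chi-square pruning directly, but its minimal cost-complexity pruning combined with a validation set illustrates the same idea: grow a full tree, then keep the pruned version that validates best. A rough sketch on synthetic data (the dataset and values below are arbitrary placeholders, not part of this article's example):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy data purely for illustration.
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=42)

# Grow a full tree, then evaluate increasingly pruned versions on the validation set.
path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(X_train, y_train)
best_alpha, best_score = 0.0, 0.0
for alpha in path.ccp_alphas:
    pruned = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42).fit(X_train, y_train)
    score = pruned.score(X_val, y_val)
    if score >= best_score:
        best_alpha, best_score = alpha, score

print(f'Best ccp_alpha: {best_alpha:.5f}, validation accuracy: {best_score:.3f}')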

Rule Post-Pruning

1- Grow the tree until the training data is fit as well as possible, allowing overfitting.

2- Convert the learned tree into an equivalent set of rules by creating one rule for each path from the root node to the leaf node.

3- Prune each rule by removing any preconditions that result in improving its estimated accuracy.

4- Sort the pruned rules by estimated accuracy.

To illustrate this, consider the final decision tree above. Each attribute test along the path from the root to a leaf node becomes a rule antecedent (precondition) and the classification at the leaf node becomes the rule consequent (postcondition).

For example, the leftmost path of the tree is translated into the rule: IF (Outlook = Sunny) ∧ (Humidity = Normal) THEN (PlayGolf = Yes)

Next, each rule is pruned by removing any precondition whose removal does not worsen its estimated accuracy. Given the above rule, we would consider removing (Outlook = Sunny) and (Humidity = Normal) and keep whichever of these pruning steps improves the estimated accuracy. No pruning step is performed if it reduces the estimated rule accuracy. So how do we measure this? One way is to use a validation set disjoint from the training set.

Important parameters to tune in the scikit-learn implementation

1) n_estimators: the number of trees. The more trees, the less likely the algorithm is to overfit, at the cost of more computation.

2) criterion: the measure of split quality; you can use entropy (information gain) or the Gini impurity.

3) max_depth: limits the vertical depth of each tree, i.e. up to which level it may grow before being cut off.

4) min_samples_split: the minimum number of samples required to split an internal node. A high value leads to underfitting, a low value to overfitting.

5) min_samples_leaf: the minimum number of samples required at a leaf; a branch stops splitting once its leaves reach that size.

6) max_features: the number of features to consider when looking for the best split. The smaller it is, the less likely the forest is to overfit, but too small a value starts to introduce underfitting.
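
In practice these parameters are usually tuned with cross-validation. A rough sketch using GridSearchCV on synthetic data (the grid values are arbitrary examples, not recommendations):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [10, 25, None],
    'min_samples_split': [2, 50],
    'max_features': ['sqrt', 0.5],
}
search = GridSearchCV(RandomForestClassifier(criterion='entropy', random_state=42),
                      param_grid, cv=3, n_jobs=-1)
search.fit(X, y)
print(search.best_params_)
print(f'Best cross-validation accuracy: {search.best_score_:.3f}')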


US Adult Income Dataset

Now let's put this into practice with the US Adult Income dataset. We are going to use a random forest to predict whether an individual's salary is greater than $50K or not.

  • First we explore the data to understand the trends in the corpus.
  • Then we use this information to predict whether an individual made more or less than $50,000.

Each entry contains the following information about an individual:

  • Age
  • Workclass: term to represent the employment status of an individual, such as Private, Local-gov, State-gov, Without-pay, Never-worked and so on
  • Fnlwgt: term to represent final weight, the number of units in the target population that the responding unit represents
  • Education: the highest level of education achieved by an individual, such as Bachelors, Some-college, 11th, HS-grad, Prof-school, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th and so on
  • Education-num: the highest level of education achieved, in numerical form
  • Marital-status: marital status of an individual, such as Divorced, Never-married, Separated, Widowed
  • Occupation: the general type of occupation of an individual, such as Tech-support, Craft-repair, Other-service, Sales, Exec-managerial and so on
  • Relationship: the responding unit's role in the family, such as Wife, Husband and so on
  • Race: description of an individual's race, such as White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black
  • Sex: the biological sex of the individual: Male, Female
  • Capital-gain: capital gains for an individual
  • Capital-loss: capital losses for an individual
  • Hours-per-week: the hours an individual has reported to work per week
  • Native-country: country of origin for an individual, such as United-States, Cambodia, England, Puerto-Rico and so on
  • Income: whether or not an individual makes more than $50,000 annually (<=50K, >50K)

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score


dataset = 'https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data'
column_names = ["Age", "Workclass", "Fnlwgt", "Education", "EducationNum", "MaritalStatus",
        "Occupation", "Relationship", "Race", "Gender", "CapitalGain", "CapitalLoss",
        "HoursPerWeek", "Country", "Income"]
us_adult_income = pd.read_csv(dataset, names=column_names, na_values=[' ?'])

us_adult_income.describe()
|       | Age | Fnlwgt | EducationNum | CapitalGain | CapitalLoss | HoursPerWeek |
|-------|-----|--------|--------------|-------------|-------------|--------------|
| count | 32561.000 | 32561.000 | 32561.000 | 32561.000 | 32561.000 | 32561.000 |
| mean  | 38.582 | 189778.367 | 10.081 | 1077.649 | 87.304 | 40.437 |
| std   | 13.640 | 105549.978 | 2.573 | 7385.292 | 402.960 | 12.347 |
| min   | 17.000 | 12285.000 | 1.000 | 0.000 | 0.000 | 1.000 |
| 25%   | 28.000 | 117827.000 | 9.000 | 0.000 | 0.000 | 40.000 |
| 50%   | 37.000 | 178356.000 | 10.000 | 0.000 | 0.000 | 40.000 |
| 75%   | 48.000 | 237051.000 | 12.000 | 0.000 | 0.000 | 45.000 |
| max   | 90.000 | 1484705.000 | 16.000 | 99999.000 | 4356.000 | 99.000 |
Distribution of educational attainment

plt.figure(figsize=(25,8))
sns.countplot(y='Education', data = us_adult_income, order = us_adult_income.Education.value_counts().index)
plt.title('Education')
plt.xlabel('Number of Occurrences', fontsize=12)
plt.ylabel('Degree', fontsize=12)
plt.show()

Most of the individuals in the dataset have at most a high-school diploma or some college education, while only a small portion hold a doctorate.

Job title Distribution

plt.figure(figsize=(25,8))
sns.countplot(y='Occupation',order= us_adult_income.Occupation.value_counts().index, alpha=0.8,data = us_adult_income)
plt.title('Occupation')
plt.xlabel('Number of Occurrences', fontsize=12)
plt.ylabel('Job title', fontsize=12)
plt.show()

Countries Distribution

plt.figure(figsize=(20,4))
sns.countplot(y='Country',order=us_adult_income.Country.value_counts().index[:10], alpha=0.8,data = us_adult_income)
plt.title('Countries')
plt.xlabel('Number of Occurrences', fontsize=12)
plt.ylabel('Countries', fontsize=12)
plt.show()

Most of the people surveyed are from the US.

Race Distribution

plt.figure(figsize=(20,4))
plt.title('Race Distribution')
plt.ylabel('Race', fontsize=15)
sns.countplot(y = 'Race', data=us_adult_income)

The sample size of Whites in the dataset is disproportionately large in comparison to all other races.

Gender vs Income
sns.barplot(x = 'Gender',y='Age' ,hue='Income', data=us_adult_income)

The percentage of males who make greater than $50,000 is much greater than the percentage of females that make the same amount.

Box plot for hours per week grouped by gender vs Income

sns.boxplot(x = 'Gender',y='HoursPerWeek' ,hue='Income', data=us_adult_income)

The median hours worked per week is lower for women than for men. The majority of people who earn more than $50K are men.

Education vs Income

plt.figure(figsize=(25,10))
sns.countplot(x = 'Education',hue='Income', data=us_adult_income)

People who achieve a higher level of education are more likely to earn more than $50K.

Occupation vs Income

plt.figure(figsize=(25,10))
sns.countplot(x = 'Occupation',hue='Income', data=us_adult_income)

Exec-managerial and Prof-specialty stand out as having very high percentages of individuals making over $50,000. In addition, almost everyone in Priv-house-serv earns less than $50K.

Marital Status vs Income

plt.figure(figsize=(30,10))
sns.countplot(x = 'MaritalStatus',hue='Income', data=us_adult_income)

Work-class vs Income

plt.figure(figsize=(20,10))
sns.countplot(x = 'Workclass',hue='Income', data=us_adult_income)

Federal-gov and Self-emp-inc are the two classes with roughly equal numbers of people making more than $50K and less.

Age vs Income
plt.figure(figsize=(20,10))
plt.title('Age vs Income')
plt.xlabel('Age', fontsize=15)
sns.countplot(x = 'Age',hue='Income', data=us_adult_income)

We can see that almost no one under 20 or over 80 years old earns more than $50K. The number of people earning more than $50K peaks around age 37, and again around age 46.

Race vs Income

plt.figure(figsize=(20,4))
plt.title('Race vs Income')
plt.xlabel('Race', fontsize=15)
sns.countplot(x = 'Race',hue='Income', data=us_adult_income)

Income Distribution vs Age

density = sns.FacetGrid(us_adult_income, col = "Income",height=5)
density.map(sns.kdeplot, "Age")

Correlation Heat map

fig, (ax) = plt.subplots(1, 1, figsize=(10,6))

hm = sns.heatmap(us_adult_income.corr(numeric_only=True),  # Correlation of the numeric columns only.
                 ax=ax,           # Axes in which to draw the plot, otherwise use the currently-active Axes.
                 cmap="coolwarm", # Color Map.
                 #square=True,    # If True, set the Axes aspect to “equal” so each cell will be square-shaped.
                 annot=True,
                 fmt='.2f',       # String formatting code to use when adding annotations.
                 #annot_kws={"size": 14},
                 linewidths=.05)

fig.subplots_adjust(top=0.93)
fig.suptitle('US DATA INCOME Correlation Heatmap',
              fontsize=14,
              fontweight='bold')

There is no strong correlation to point out, but we can mention that Fnlwgt has a slight inverse correlation with the other quantitative variables shown in the heat map, except CapitalGain.

Since EducationNum and Education encode the same information, we are only going to keep the numerical variable.


del us_adult_income['Education']

NaN values in our dataset

us_adult_income.isnull().sum()

| Variable | NaN values |
|----------|------------|
| Age | 0 |
| Workclass | 1836 |
| Fnlwgt | 0 |
| EducationNum | 0 |
| MaritalStatus | 0 |
| Occupation | 1843 |
| Relationship | 0 |
| Race | 0 |
| Gender | 0 |
| CapitalGain | 0 |
| CapitalLoss | 0 |
| HoursPerWeek | 0 |
| Country | 583 |
| Income | 0 |

us_adult_income.Income.value_counts()

| Target | Quantity |
|--------|----------|
| <=50K | 24720 |
| >50K | 7841 |

Most of the examples belong to the class <=50K, so the dataset is imbalanced.

Encoding

We one-hot encode the categorical variables in order to train our model, and replace any remaining missing values with the mean of the corresponding column.


one_hot_encoding = pd.get_dummies(us_adult_income)
one_hot_encoding.fillna(one_hot_encoding.mean(), inplace=True)
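# Note: the raw values contain a leading space, so get_dummies produces columns
# named 'Income_ <=50K' and 'Income_ >50K'.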
one_hot_encoding["Income"] = one_hot_encoding['Income_ >50K']
del one_hot_encoding['Income_ <=50K']
del one_hot_encoding['Income_ >50K']
sns.set(font_scale=2.8)
sns.clustermap(one_hot_encoding.corr(), metric="correlation",figsize=(50, 50))

Split the dataset into train and testing

X = one_hot_encoding.iloc[:,0:-1]
y = one_hot_encoding.iloc[:,-1]
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.25, random_state = 42)

Random Forest Classifier

rf = RandomForestClassifier(max_depth=25,n_estimators=200, min_samples_leaf=1,min_samples_split=50,criterion='entropy', oob_score=True,random_state=42)
rf.fit(X_train, y_train)


predicted = rf.predict(X_test)
accuracy = accuracy_score(y_test, predicted)
print(f'Out-of-bag score estimate: {rf.oob_score_:.3}')
print(f'Mean accuracy score: {accuracy:.3}')

Estimation

Out-of-bag score estimate: 0.862
Mean accuracy score: 0.866

Confusion Matrix

cm = confusion_matrix(y_test, predicted)
fig, (ax) = plt.subplots(1, 1, figsize=(8,7))
sns.heatmap(cm, annot=True, ax = ax,fmt='g')
# labels, title and ticks
ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels');
ax.set_title('Confusion Matrix')
sns.set(font_scale=1.5)
ax.xaxis.set_ticklabels(['Income <= 50K', 'Income > 50K'])
ax.yaxis.set_ticklabels(['Income <= 50K', 'Income > 50K'])

Conclusion

Random forest performs well. If we had more individuals who make more than $50K, it might be easier for the model to predict those cases. We could also have chosen to remove the Fnlwgt column, which does not seem to contribute much information, and the same goes for CapitalGain/CapitalLoss and Relationship.
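
One way to check that intuition is to inspect the forest's impurity-based feature importances; a short sketch reusing the rf and X objects fitted above:

importances = pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False)
print(importances.head(10))
print(importances[['Fnlwgt', 'CapitalGain', 'CapitalLoss']])
# Relationship is spread over several one-hot columns, so aggregate them:
print(importances.filter(like='Relationship_').sum())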

References