RANDOM FOREST WITH US ADULT INCOME DATASET

08/03/2019 18:00

Machine Learning

Random forest

Decision trees

Pruning

Random Forest

Random Forest is simply a collection of Decision Trees that have been generated using a random subset of data. In order to understand a Random Forest, you have to understand a Decision Tree.

Decision trees

The goal is to create a model that predicts the value of a target variable based on several input variables. A Decision tree is a flowchart like tree structure, where each internal node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node (terminal node) holds a class label. The topmost node in a tree is the root node. A tree can be “learned” by splitting the source set into subsets based on an attribute value test. This process is repeated on each derived subset in a recursive manner.

Appropriate problems for decision tree learning

Instances are represented by attribute-value pairs
The target function has discrete output values
Disjunctive descriptions may be required
The training data may contain errors : Decision tree methods are robust to errors
The training data may contain missing attribute values.

Decision tree learning has been applied to problems such as classify medical patients by their disease, equipment malfunctions by their cause and loan applicants by their likelihood of defaulting on payments.

In real data mining situations, the problem is that the decision tree constructed may become gigantic. So it becomes important to find techniques to minimize the size of the tree, as far as possible. The idea is to put the decision nodes which give the most information highest (closest to the root) in the decision tree.

The basic decision tree learning algorithm

Most algorithms used in decision tree learning algorithms are variations of ID3 algorithm.The ID3 algorithm can be summarized in the following way:

1- Calculate the entropy of every attribute using the data set

2- Split the set $S$ into subsets using the attribute for which the resulting entropy (after splitting) is minimum (or, equivalently, information gain is maximum)

3- Make a decision tree node containing that attribute

4- Recurse on subsets using remaining attributes.

Which attribute is the best classifier?

First we define a statistical property called, information gain that measures how well a given attribute separates the training examples according to their target classification. ID3 uses this information gain measure to select among the candidate attributes at each step while growing the tree.

In order to define information gain measure first we define entropy that characterizes the (im)purity of an collection of examples. Given a collection $S$ containing positive and negative example of some target concept, the entropy of $S$ relative to this boolean classification is:

\[Entropy\space(S) = -p_+log_2p_+ -p_{-}log_2p_-\]

Where $p_+$ is the proportion of positive example in $S$ and $p_{-}$ is the proportion of negative example in $S$. To illustrate suppose $S$ is a collection of 14 examples of some boolean concept ,including 9 positive and 5 negative The the entropy is calculated in the following way:

\[Entropy\space(S) = - \frac{9}{14}log_2(\frac{9}{14}) - \frac{5}{14}log_2(\frac{5}{14})\]

Note the entropy is 1 when we have an equal quantity of negative and positive values. More generally, if the target attribute can take on c different values, then the entropy of $S$ relative

Which attribute to select as the root?

Windy	Outlook	Humidity	Temperature	Play golf?
Weak	sunny	high	hot	no
Strong	sunny	high	hot	no
Weak	overcast	high	hot	Yes
Weak	rain	normal	cool	Yes
Strong	overcast	normal	cool	Yes
Weak	sunny	hight	mild	no
Weak	sunny	normal	cool	Yes
Weak	rain	normal	mild	Yes
Strong	sunny	normal	mild	Yes
Strong	overcast	high	mild	Yes
Weak	overcast	normal	hot	Yes
Strong	rain	high	mild	no
Strong	rain	normal	cool	no
Weak	rain	high	mild	Yes

Let’s calculate the entropy

Outlook -> sunny: 3 examples yes, 2 examples no.

\[E(Outlook)_{sunny} = − \frac{2}{5}log_2(\frac{2}{5}) −\frac{3}{5}log_2(\frac{3}{5})=0.971\]

Outlook -> overcast: 4 examples yes, 0 examples no

\[E(Outlook)_{overcast}= −1 log_2(1)−0.log_2(0)=0\]

Outlook->rainy : 2 examples yes, 3 examples no.

\[E(Outlook)_{rainy} = − \frac{3}{5}log_2(\frac{3}{5})- \frac{2}{5}log_2(\frac{2}{5})0.971\]

How can we compute the quality of the entire split?

Compute the weighted average over all sets resulting from the split weighted by their size.

\[I(S,A) = \sum_ {i} \frac{|S_i|}{|S|} E(S_i)\]

Average entropy for attribute Outlook:

\[I(Outlook)=\frac{5}{14}⋅0.971+ \frac{4}{14}.0 +\frac{5}{14}⋅0.971=0.693\]

When an attribute $A$ splits the set $S$ into subsets $S_i$

we compute the average entropy
and compare the sum to the entropy of the original set $S$

Information Gain for Attribute A

\[Gain(S,A) = E(S)- I(S,A) = E(S)- \sum_ {i} \frac{|S_i|}{|S|} E(S_i)\]

The attribute that maximizes the difference is selected: Maximizing information gain is equivalent to minimizing average entropy, because $E(S)$ is constant for all attributes A.

Entropy:

Humidity -> High: 3 examples ‘Yes’, 4 examples ‘No’

\[E(Humidity)_{high}= − \frac{3}{7}log_2(\frac{3}{7}) −\frac{4}{7}log_2(\frac{4}{7})=0.985\]

Humidity -> Normal: 6 examples ‘Yes’, 1 example ‘No’

\[E(Humidity)_{normal}= − \frac{6}{7}log_2(\frac{6}{7}) −\frac{1}{7}log_2(\frac{1}{7})=0.592\]

Wind ->Weak: 6 examples ‘Yes’, 2 examples ‘No’

\[E(Wind)_{Weak}= − \frac{6}{8}log_2(\frac{6}{8}) −\frac{2}{8}log_2(\frac{2}{8})=0.811\]

Wind ->Strong :3 examples ‘yes’, 3 examples ‘No’

\[E(Wind)_{Strong}= − \frac{3}{6}log_2(\frac{3}{6}) −\frac{3}{6}log_2(\frac{3}{6})= 1\]

Temperature -> Hot : 2 examples ‘No’, 2 example ‘Yes’

\[E(Temperature)_{Hot}= − \frac{2}{4}log_2(\frac{2}{4}) −\frac{2}{4}log_2(\frac{2}{4})= 1\]

Temperature-> Cool: 3 examples ‘Yes’, 1 examples ‘No’

\[E(Temperature)_{Mild}= − \frac{3}{4}log_2(\frac{3}{4}) −\frac{1}{4}log_2(\frac{1}{4})=0.8112\]

Temperature-> Mild: 4 examples ‘Yes’, 2 ‘No’

\[E(Temperature)_{Mild}= − \frac{4}{6}log_2(\frac{4}{6}) −\frac{2}{6}log_2(\frac{2}{6}) = 0.9182\]

The set:

\[E(S) = − \frac{9}{14}log_2(\frac{9}{14}) −\frac{5}{14}log_2(\frac{5}{14})=0.940\]

Information Gain

\[Gain (S, Humidity) = 0.940- \frac{7}{14}0.985 - \frac{7}{14}0.592 = 0.151\] \[Gain (S, Outlook) = 0.940- \frac{7}{14}0.985 - \frac{7}{14}0.592 = 0.246\] \[Gain (S, Wind) = 0.940- \frac{7}{14}0.811 - \frac{7}{14}1 = 0.048\] \[Gain (S, Temperature) = 0.940- \frac{4}{14}1 - \frac{4}{14}0.8112 - \frac{6}{14}0.9182 = 0.029\]

Outlook has the highest value, so it is selected as the root node.

Entropy

Humidity -> High : 3 examples sunny and all of them belongs to ‘No’ class

\[E(Humidity_{high}) =− 1.log_2(1) − 0.log_2(0) = 0\]

Humidity -> Normal: 2 examples sunny and all of them are ‘Yes’

\[E(Humidity_{normal}) =− 1.log_2(1) − 0.log_2(0) = 0\]

Wind -> Weak: 3 examples sunny, 2 are ‘no’, and the other is ‘Yes’

\[E(wind_{Weak})= −\frac{2}{3}log_2(\frac{2}{3}) −\frac{1}{3}log_2(\frac{1}{3})=0.9182\]

Wind -> Strong: 2 examples sunny, 1 is ‘yes’, and the other ‘no’

\[E(wind_{Strong}) = −\frac{1}{2}log_2(\frac{1}{2}) −\frac{1}{2}log_2(\frac{1}{2})= 1\]

Temperature -> Hot: 2 examples sunny and both ‘No’

\[E(Temperature_{Hot}) =− 1.log_2(1) − 0.log_2(0) = 0\]

Temperature -> Mild: 2 examples sunny, 1 ‘No and 1 ‘Yes’.

\[E(Temperature_{Mild}) = −\frac{1}{2}log_2(\frac{1}{2}) −\frac{1}{2}log_2(\frac{1}{2})= 1\]

Temperature -> Cool: 1 examples sunny and belongs to ‘Yes’ class.

\[E(Temperature_{Cool}) =− 1.log_2(1) − 0.log_2(0) = 0\]

Information gain:

\[Gain(Sunny,Humidity) = 0.971 -\frac{3}{5}.0 - \frac{2}{5}0 = 0.971\] \[Gain (Sunny, Wind) = 0.971 -\frac{3}{5}0.9182 -\frac{2}{5}1.0 = 0.20\] \[Gain (Sunny, Temperature) = 0.971 -\frac{2}{5}.0 - \frac{1}{5}.0 - \frac{2}{5}.1 = 0.571\]

We choose the biggest value, so Humidity is selected.

Then we do it all again for the two left attributes and we decide the attribute with the highest value.

Final Decision tree

Issues in decision tree learning

Overfitting typically occurs when we try to model the data training perfectly.It might be due to insufficient examples. Overfitting results in decision trees that are more complex than necessary How can we spot overfitting? Low training error and testing error is comparably high

Measures against overfitting or growing the tree too deep

Pre-Pruning : Stop growing the tree earlier, before it reaches the point where it perfectly classifies the training data.

Post-Puning :Allows the tree to overfit the data, and then post-prune the tree

In practice we usually use the second approach as it is easier to post-prune the data than estimating the time to stop growing the tree.

The important step of tree pruning is to define a criterion be used to determine the correct final tree size. We can use the following methods

Use a validation set, to evaluate the effect of post-pruning nodes from the tree. Then we have a validation train to build a tree, a validation set to prune the tree and then the set to evaluate the model.
Build the tree by using the training set, then apply a statistical test to estimate whether pruning or expanding a particular node is likely to produce an improvement beyond the training set. (E.g Chi-square test)
Minimum Description Length Principle : Use an explicit measure of the complexity for encoding the training set and the decision tree, stopping growth of the tree when this encoding size (size(tree) + size(misclassifications(tree)) is minimized.

Rule Post-Pruning

1- Growing the tree until the training data is fit as well as possible and allowing overfitting.

2- Convert the learned tree into an equivalent set of rules by creating one rule for each path from the root node to the leaf node.

3- Prune each rule by removing any preconditions that result in improving its estimated accuracy.

4- Sort the pruned rules by estimated accuracy.

To illustrate this considerer the last diagram. Each attribute test along the path from the root to the leaf node becomes a rule antecedent (precondition) and the classification at the leaf node becomes the rule consequent (postcondition).

For example the leftmost path of the tree is translated into the rule : IF (OUTLOOK=SUNNY) ^ (HUMIDITY = NORMAL) THEN (PLAYTENNIS = YES)

Next each rule is pruned by removing any precondition whose removal doesn’t worsen its estimated accuracy. Given the above rule we will considerer to remove (OUTLOOK=SUNNY) and (HUMIDITY = NORMAL) and select whichever of this pruning steps improve estimated accuracy. No pruning step is performed if it reduces the estimated rule accuracy. So how do we measure this? One way is to use a validation set disjoint for the training set

Important Parameters to tune in Scikit Algorithm

1) N_ESTIMATORS: The more trees the less likely the algorithm is to overfit. The higher the number of trees the better to learn the data.

2) CRITERION: Here you can use Entropy or Gini criteria.

3) MAX_DEPTH : Gives limit on vertical depth decide up to which level pruning is required.

4) MIN SAMPLES SPLIT : Minimum number of samples required to split an internal node. High value: Underfitting, Low value: Overfitting

5) MIN SAMPLES LEAF: The branch will stop splitting once the leaves have that number of samples each.

6) MAX FEATURES: Represents the number of features to consider when looking for the best split The smaller, the less likely to overfit, but too small will start to introduce under fitting.

US Adult Income Dataset

Now let's put this into practice with us adult income salary . We are going to use random forest in order to predict whether an individual's salary is greater than 50 K or not.

First we explore the data to understand the trends in the corpus.
Then use this information to predict whether an -individual made more or less than $50,000

Each entry contains the following information about an individual:

Age
Work-class: Term to represent the employment status of an individual. Such as : Private, Localgov, Stategov, Withoutpay, Neverworked and so on
Fnlwgt: Term to represent final weight.The number of units in the target population that the responding unit represents
Education: the highest level of education achieved by an individual. Such as Bachelors, Somecollege, 11th, HSgrad, Profschool,9th, 7th8th, 12th, Masters, 1st4th, 10th and so on.
Education num: the highest level of education achieved in numerical form
Marital status: marital status of an individual. Such as Divorced, Never married, Separated, Widowed,
Occupation: the general type of occupation of an individual Such as Techsupport, Craftrepair, Otherservice, Sales, Execmanagerial and so on
Relationship: Represents the responding unit’s role in the family. Such as Wife, Husband and so on
Race: Descriptions of an individual’s race Such as: White, AsianPacIslander, AmerIndianEskimo, Other, Black.
Sex: the biological sex of the individual Such as :Male, Female
Capital gain: capital gains for an individual
Capital loss: capital loss for an individual
Hours per week: the hours an individual has reported to work per week
Native country: country of origin for an individual Such as UnitedStates, Cambodia, England, PuertoRico and so on
Income: whether or not an individual makes more than $50,000 annually. ○ <=50k, >50k

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import Imputer
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

dataset = 'https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data'
row_names = ["Age", "Workclass", "Fnlwgt", "Education", "EducationNum", "MaritalStatus",
        "Occupation", "Relationship", "Race", "Gender", "CapitalGain", "CapitalLoss",
        "HoursPerWeek", "Country", "Income"]
us_adult_income = pd.read_csv(dataset, names=row_names,na_values=[' ?'])

us_adult_income.describe()

	Age	Fnlwgt	EducationNum	CapitalGain	CapitalLoss	HoursPerWeek
count	32561.000	32561.000	32561.000	32561.000	32561.000	32561.000
mean	38.582	189778.367	10.081	1077.649	87.304	40.437
std	13.640	105549.978	2.573	7385.292	402.960	12.347
min	17.000	12285.000	1.000	0.000	0.000	1.000
25%	28.000	117827.000	9.000	0.000	0.000	40.000
50%	37.000	178356.000	10.000	0.000	0.000	40.000
75%	48.000	237051.000	12.000	0.000	0.000	45.000
max	90.000	1484705.000	16.000	99999.000	4356.000	99.000

Distribution of educational attainment

plt.figure(figsize=(25,8))
sns.countplot(y='Education', data = us_adult_income, order = us_adult_income.Education.value_counts().index)
plt.title('Education')
plt.xlabel('Number of Occurrences', fontsize=12)
plt.ylabel('Degree', fontsize=12)
plt.show()

Most of the individuals in the dataset have at most a high school education, and have attended an University while only a small portion have a doctorate.

Job title Distribution

plt.figure(figsize=(25,8))
sns.countplot(y='Occupation',order= us_adult_income.Occupation.value_counts().index, alpha=0.8,data = us_adult_income)
plt.title('Occupation')
plt.xlabel('Number of Occurrences', fontsize=12)
plt.ylabel('Job title', fontsize=12)
plt.show()

Countries Distribution

plt.figure(figsize=(20,4))
sns.countplot(y='Country',order=us_adult_income.Country.value_counts().index[:10], alpha=0.8,data = us_adult_income)
plt.title('Countries')
plt.ylabel('Number of Occurrences', fontsize=12)
plt.xlabel('Countries', fontsize=12)
plt.show()

Most the people that have been surveyed are from US.

Race Distribution

plt.figure(figsize=(20,4))
plt.title('Race Distribution')
plt.ylabel('Race', fontsize=15)
sns.countplot(y = 'Race', data=us_adult_income)

The sample size of Whites in the dataset is disproportionately large in comparison to all other races.

Gender vs Income

sns.barplot(x = 'Gender',y='Age' ,hue='Income', data=us_adult_income)

The percentage of males who make greater than $50,000 is much greater than the percentage of females that make the same amount.

Box plot for hours per week grouped by gender vs Income

sns.boxplot(x = 'Gender',y='HoursPerWeek' ,hue='Income', data=us_adult_income)

The median of hours per week for women is smaller than the one for men. The majority of people who perceive more than 50 K are men

Education vs Income

plt.figure(figsize=(25,10))
sns.countplot(x = 'Education',hue='Income', data=us_adult_income)

People who achieve a higher level of education tend to earn more than 50K

Occupation vs Income

plt.figure(figsize=(25,10))
sns.countplot(x = 'Occupation',hue='Income', data=us_adult_income)

Exec-managerial and Prof-specialty stand out as having very high percentages of individuals making over $50,000. In addition, Priv-House-Serv make only less than 50 K

Marital Status vs Income

plt.figure(figsize=(30,10))
sns.countplot(x = 'MaritalStatus',hue='Income', data=us_adult_income)

Work-class vs Income

plt.figure(figsize=(20,10))
sns.countplot(x = 'Workclass',hue='Income', data=us_adult_income)

Federal-gov and self-emp-inc are the the two classes which almost equal in people who make more than 50K and less.

Age vs Income

plt.figure(figsize=(20,10))
plt.title('Age vs Income')
plt.xlabel('Age', fontsize=15)
sns.countplot(x = 'Age',hue='Income', data=us_adult_income)

As we might notice seems impossible for people who are less than 20 years old or more than 80 years old to earn more than 50K . At the age of 37 is where we have the most part of people earning 50K. The same occurs at age 46

Race vs Income

plt.figure(figsize=(20,4))
plt.title('Race vs Income')
plt.xlabel('Race', fontsize=15)
sns.countplot(x = 'Race',hue='Income', data=us_adult_income)

Income Distribution vs Age

density = sns.FacetGrid(us_adult_income, col = "Income",height=5)
density.map(sns.kdeplot, "Age")

Correlation Heat map

fig, (ax) = plt.subplots(1, 1, figsize=(10,6))

hm = sns.heatmap(us_adult_income.corr(),
                 ax=ax,           # Axes in which to draw the plot, otherwise use the currently-active Axes.
                 cmap="coolwarm", # Color Map.
                 #square=True,    # If True, set the Axes aspect to “equal” so each cell will be square-shaped.
                 annot=True,
                 fmt='.2f',       # String formatting code to use when adding annotations.
                 #annot_kws={"size": 14},
                 linewidths=.05)

fig.subplots_adjust(top=0.93)
fig.suptitle('US DATA INCOME Correlation Heatmap',
              fontsize=14,
              fontweight='bold')

There is not an strong correlation that we can point out but we could mention that Fnlwgt have an inverse correlation to the quantitive variables in the showed in the heat map, except Capital gain.

Since Education Num and Education are the same we are only going to keep the quantitive variable.

del us_adult_income['Education']

Nan Values in our dataset

us_adult_income.isnull().sum()

Variable	Nan values
Age	0
Workclass	1836
Fnlwgt	0
EducationNum	0
MaritalStatus	0
Occupation	1843
Relationship	0
Race	0
Gender	0
CapitalGain	0
CapitalLoss	0
HoursPerWeek	0
Country	583
Income	0

us_adult_income.Income.value_counts()

Target	Quantity
<=50K	24720
>50K	7841

Most of our dataset belongs to the class <=50K

Encoding

We encode the categorical variables in order to train our model. We replace the missing values with the statistical mean of each column.

one_hot_encoding = pd.get_dummies(us_adult_income)
one_hot_encoding.fillna(one_hot_encoding .mean(),inplace=True)
one_hot_encoding["Income"] = one_hot_encoding['Income_ >50K']
del one_hot_encoding['Income_ <=50K']
del one_hot_encoding['Income_ >50K']
sns.set(font_scale=2.8)
sns.clustermap(one_hot_encoding.corr(), metric="correlation",figsize=(50, 50))

Split the dataset into train and testing

X = one_hot_encoding.iloc[:,0:-1]
y = one_hot_encoding.iloc[:,-1]
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.25, random_state = 42)

Random Forest Classifier

rf = RandomForestClassifier(max_depth=25,n_estimators=200, min_samples_leaf=1,min_samples_split=50,criterion='entropy', oob_score=True,random_state=42)
rf.fit(X_train, y_train)

predicted = rf.predict(X_test)
accuracy = accuracy_score(y_test, predicted)
print(f'Out-of-bag score estimate: {rf.oob_score_:.3}')
print(f'Mean accuracy score: {accuracy:.3}')

Estimation

Out-of-bag score estimate: 0.862
Mean accuracy score: 0.866

Confusion Matrix

cm = confusion_matrix(y_test, predicted)
fig, (ax) = plt.subplots(1, 1, figsize=(8,7))
sns.heatmap(cm, annot=True, ax = ax,fmt='g')
# labels, title and ticks
ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels');
ax.set_title('Confusion Matrix')
sns.set(font_scale=1.5)
ax.xaxis.set_ticklabels(['Income > 50 K', 'Income < 50 K'])
ax.yaxis.set_ticklabels(['Income > 50 K', 'Income < 50 K'])

Conclusion

Random forest performs well, maybe if we have more individuals that make more than 50 K, it might be easier for the model to predict those cases. Maybe we could have chosen to remove the column Fnlwgt, which seems not to aport a great deal of information, the same with Capital Gains\Loss and Relationship.