Jackknife variance estimates for random forest

In statistics, jackknife variance estimates for random forests are a way to estimate the variance of random forest predictions while eliminating the effects of the bootstrap resampling used to build the forest.

Jackknife variance estimates

The sampling variance of a bagged learner, where \hat{\theta}^{\infty}(x) denotes the prediction of the ideal bagged estimator built from infinitely many bootstrap replicates, is:

V(x) = \mathrm{Var}[\hat{\theta}^{\infty}(x)]
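
This definition can be checked directly only by simulation: refit the forest on many independent training sets and take the variance of the predictions at x. The following is a minimal simulation sketch, assuming scikit-learn's RandomForestRegressor and a synthetic data-generating process chosen purely for illustration:

# Minimal simulation sketch of V(x): the sampling variance of a bagged
# learner's prediction at a fixed test point x. The synthetic data and
# all parameter choices are illustrative assumptions.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
x_test = np.zeros((1, 5))              # fixed query point x
preds = []
for rep in range(200):                 # independent training sets
    X = rng.normal(size=(100, 5))
    y = X[:, 0] + rng.normal(size=100)
    forest = RandomForestRegressor(n_estimators=500, random_state=rep)
    forest.fit(X, y)
    preds.append(forest.predict(x_test)[0])

print("V(x) estimated by simulation:", np.var(preds))

In practice this simulation is unavailable, because only one training set is observed; the jackknife estimators below approximate V(x) from a single fitted forest.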

Jackknife estimates can be used to eliminate these bootstrap effects. The jackknife variance estimator is defined as:[1]

\hat{V}_{j} = \frac{n-1}{n} \sum_{i=1}^{n} \left( \hat{\theta}_{(-i)} - \bar{\theta} \right)^{2}

where \hat{\theta}_{(-i)} is the estimate computed with the ith observation left out and \bar{\theta} is the mean of the n leave-one-out estimates.
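
As a concrete illustration of this formula, the following sketch computes the jackknife variance of a generic statistic in Python; the choice of the sample mean as the statistic is an illustrative assumption:

# Minimal sketch of the jackknife variance estimator for a generic
# statistic theta_hat; using the sample mean is an illustrative choice.
import numpy as np

def jackknife_variance(data, theta_hat):
    # V_j = (n-1)/n * sum_i (theta_(-i) - theta_bar)^2
    n = len(data)
    leave_one_out = np.array(
        [theta_hat(np.delete(data, i)) for i in range(n)]
    )
    theta_bar = leave_one_out.mean()
    return (n - 1) / n * np.sum((leave_one_out - theta_bar) ** 2)

data = np.random.default_rng(1).normal(size=50)
print(jackknife_variance(data, np.mean))  # approx. var(data) / n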

In classification problems where a random forest is used to fit models, the jackknife estimate of variance is defined as:

\hat{V}_{j} = \frac{n-1}{n} \sum_{i=1}^{n} \left( \bar{t}^{\star}_{(-i)}(x) - \bar{t}^{\star}(x) \right)^{2}

Here, t^{\star} denotes a decision tree after training, \bar{t}^{\star}(x) is the average prediction of all trees at the point x, and \bar{t}^{\star}_{(-i)}(x) is the average prediction of only those trees whose bootstrap samples do not contain the ith observation.
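
A minimal sketch of this jackknife-after-bootstrap estimator, with the bagging loop written out by hand so that the in-bag counts are available (the data and parameter choices are illustrative assumptions):

# Jackknife-after-bootstrap variance of a bagged ensemble at a point x.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))
y = X[:, 0] + rng.normal(size=100)
x_test = np.zeros((1, 5))
n, B = len(y), 1000

tree_preds = np.empty(B)
in_bag = np.zeros((B, n), dtype=int)   # times obs i appears in bag b
for b in range(B):
    idx = rng.integers(0, n, size=n)   # bootstrap sample
    np.add.at(in_bag[b], idx, 1)
    tree = DecisionTreeRegressor().fit(X[idx], y[idx])
    tree_preds[b] = tree.predict(x_test)[0]

t_bar = tree_preds.mean()
# average over trees whose bootstrap sample excludes observation i
t_bar_minus_i = np.array(
    [tree_preds[in_bag[:, i] == 0].mean() for i in range(n)]
)
V_j = (n - 1) / n * np.sum((t_bar_minus_i - t_bar) ** 2)
print("jackknife-after-bootstrap variance:", V_j)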

Examples

The e-mail spam problem is a common classification problem in which 57 features are used to classify spam and non-spam e-mails. Wager, Hastie and Efron applied the \hat{V}_{IJ-U} variance formula to evaluate random forests with m = 5, 19 and 57. The results reported in their paper Confidence Intervals for Random Forests: The Jackknife and the Infinitesimal Jackknife[1] show that the m = 57 random forest appears to be quite unstable, while predictions made by the m = 5 random forest appear to be quite stable. This agrees with the evaluation made by error rate, in which the accuracy of the model with m = 5 is high and that of the model with m = 57 is low.

Here, accuracy is measured by the error rate, which is defined as:

\mathrm{ErrorRate} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\{\hat{y}_{i} \neq y_{i}\},

where N is the number of samples, y_{i} is the true class of the ith observation and \hat{y}_{i} is its predicted class, so the sum counts the misclassified observations. No predicted probabilities are considered here. There is another method, similar to the error rate, for measuring accuracy:

\mathrm{logloss} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} y_{ij} \log(p_{ij})

Here N is the number of samples, M is the number of classes, y_{ij} is the indicator function which equals 1 when the ith observation is in class j and 0 otherwise, and p_{ij} is the predicted probability that the ith observation is in class j. This metric is used in Kaggle competitions.[2] The two methods are very similar, but the log loss additionally takes the confidence of the predictions into account.
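
Both measures are straightforward to compute; the following is a minimal sketch with NumPy, where the small label and probability arrays are illustrative assumptions:

# Error rate and multiclass log loss for toy predictions.
import numpy as np

y_true = np.array([0, 2, 1, 1])        # true class of each observation
p = np.array([[0.7, 0.2, 0.1],         # p[i, j]: predicted probability
              [0.1, 0.3, 0.6],         # that observation i is in class j
              [0.2, 0.5, 0.3],
              [0.4, 0.4, 0.2]])

y_pred = p.argmax(axis=1)              # hard class predictions
error_rate = np.mean(y_pred != y_true)

N, M = p.shape
y_ind = np.zeros_like(p)               # indicator matrix y_ij
y_ind[np.arange(N), y_true] = 1
logloss = -np.sum(y_ind * np.log(p)) / N

print(error_rate, logloss)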

Modification for bias

When V_{IJ}^{\infty} and V_{J}^{\infty} are estimated by Monte Carlo, using a finite number B of bootstrap replicates, the Monte Carlo bias should be considered; in particular, when n is large the bias becomes large:

E[\hat{V}_{IJ}^{B}] - \hat{V}_{IJ}^{\infty} \approx \frac{n}{B^{2}} \sum_{b=1}^{B} \left( t_{b}^{\star} - \bar{t}^{\star} \right)^{2}

To eliminate this bias, the following bias-corrected modifications are suggested:

\hat{V}_{IJ-U}^{B} = \hat{V}_{IJ}^{B} - \frac{n}{B^{2}} \sum_{b=1}^{B} \left( t_{b}^{\star} - \bar{t}^{\star} \right)^{2}

\hat{V}_{J-U}^{B} = \hat{V}_{J}^{B} - (e - 1) \frac{n}{B^{2}} \sum_{b=1}^{B} \left( t_{b}^{\star} - \bar{t}^{\star} \right)^{2}

where e is the base of the natural logarithm.
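
A minimal sketch of the bias-corrected infinitesimal jackknife \hat{V}_{IJ-U}^{B}, again with a hand-rolled bagging loop so that the in-bag count matrix N_{bi} is available (the data and parameter choices are illustrative assumptions):

# Bias-corrected infinitesimal jackknife variance estimate V_IJ-U.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 5))
y = X[:, 0] + rng.normal(size=100)
x_test = np.zeros((1, 5))
n, B = len(y), 1000

t = np.empty(B)                        # t_b: prediction of tree b at x
N = np.zeros((B, n))                   # N[b, i]: times obs i is in bag b
for b in range(B):
    idx = rng.integers(0, n, size=n)
    np.add.at(N[b], idx, 1)
    t[b] = DecisionTreeRegressor().fit(X[idx], y[idx]).predict(x_test)[0]

t_centered = t - t.mean()
# infinitesimal jackknife: sum over i of Cov_b[N_bi, t_b]^2
cov = (N - N.mean(axis=0)).T @ t_centered / B
V_IJ = np.sum(cov ** 2)
# subtract the Monte Carlo bias term (n / B^2) * sum_b (t_b - t_bar)^2
V_IJ_U = V_IJ - n / B**2 * np.sum(t_centered ** 2)
print(V_IJ, V_IJ_U)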

References

  1. ^ Wager, Stefan; Hastie, Trevor; Efron, Bradley (2014-05-14). "Confidence Intervals for Random Forests: The Jackknife and the Infinitesimal Jackknife". Journal of Machine Learning Research. 15 (1): 1625–1651. arXiv:1311.4555. Bibcode:2013arXiv1311.4555W. PMC 4286302. PMID 25580094.
  2. ^ "Otto Group Product Classification Challenge". Kaggle.