Title: Bayesian Additive Regression Trees
Description: This is an implementation of BART: Bayesian Additive Regression Trees, by Chipman, George, and McCulloch (2010).
Authors: Hugh Chipman <[email protected]>, Robert McCulloch <[email protected]>
Maintainer: Robert McCulloch <[email protected]>
License: GPL (>= 2)
Version: 0.3-1.5
Built: 2025-03-02 04:18:32 UTC
Source: https://github.com/cran/BayesTree
BART is a Bayesian “sum-of-trees” model.
For a numeric response $y$, we have $y = f(x) + \epsilon$, where $\epsilon \sim N(0, \sigma^2)$.
For a binary response $y$, $P(Y=1 \mid x) = \Phi(f(x))$, where $\Phi$ denotes the standard normal cdf (probit link).
In both cases, $f$ is the sum of many tree models.
The goal is to have very flexible inference for the unknown function $f$.
In the spirit of “ensemble models”, each tree is constrained by a prior to be a weak learner so that it contributes a small amount to the overall fit.
bart(x.train, y.train, x.test=matrix(0.0,0,0),
     sigest=NA, sigdf=3, sigquant=.90,
     k=2.0, power=2.0, base=.95,
     binaryOffset=0, ntree=200,
     ndpost=1000, nskip=100,
     printevery=100, keepevery=1, keeptrainfits=TRUE,
     usequants=FALSE, numcut=100, printcutoffs=0,
     verbose=TRUE)

## S3 method for class 'bart'
plot(x, plquants=c(.05,.95), cols=c('blue','black'), ...)
x.train: Explanatory variables for training (in sample) data.

y.train: Dependent variable for training (in sample) data.

x.test: Explanatory variables for test (out of sample) data.

sigest: The prior for the error variance ($\sigma^2$) is inverted chi-squared (the standard conditionally conjugate prior). The prior is specified by choosing the degrees of freedom, a rough estimate of the error standard deviation, and a quantile to put this rough estimate at. If sigest=NA, the rough estimate will be the usual least-squares estimate; otherwise the supplied value is used. Not used if y is binary.

sigdf: Degrees of freedom for the error variance prior. Not used if y is binary.

sigquant: The quantile of the prior that the rough estimate (see sigest) is placed at. The closer the quantile is to 1, the more aggressive the fit will be, as you are putting more prior weight on error standard deviations ($\sigma$) less than the rough estimate. Not used if y is binary.

k: For numeric y, k is the number of prior standard deviations $E(Y \mid x) = f(x)$ is away from +/- 0.5 (the response y.train is internally scaled to range from -0.5 to 0.5). For binary y, k is the number of prior standard deviations $f(x)$ is away from +/- 3. In both cases, the bigger k is, the more conservative the fitting will be.

power: Power parameter for the tree prior.

base: Base parameter for the tree prior.

binaryOffset: Used for binary y. The model is $P(Y=1 \mid x) = \Phi(f(x) + \mathrm{binaryOffset})$. The idea is that $f$ is shrunk towards 0, so the offset allows fits with probabilities shrunk towards values other than 0.5.

ntree: The number of trees in the sum.

ndpost: The number of posterior draws after burn-in; ndpost/keepevery draws will actually be returned.

nskip: Number of MCMC iterations to be treated as burn-in.

printevery: As the MCMC runs, a message is printed every printevery draws.

keepevery: Every keepevery'th draw is kept to be returned to the user.

keeptrainfits: If TRUE, the draws of $f(x)$ for $x$ corresponding to the rows of x.train are returned.

usequants: Decision rules in the trees are of the form $x \le c$ vs. $x > c$ for each variable corresponding to a column of x.train. usequants determines how the set of possible c is chosen. If usequants is TRUE, the c are the midpoints between unique sorted values of the corresponding column of x.train; if FALSE, the cutoffs are equally spaced across the range of values taken on by that column.

numcut: The number of possible values of c (see usequants). If a single number is given, it is used for all variables. Otherwise a vector with length equal to ncol(x.train) is required, where the i'th element gives the number of c used for the i'th variable.

printcutoffs: The number of cutoff rules c to be printed to the screen before the MCMC is run. If a single integer is given, the same value is used for all variables. If 0, nothing is printed.

verbose: Logical; if FALSE, printing is suppressed.

x: Value returned by bart, which contains the information to be plotted.

plquants: In the plots, beliefs about $f(x)$ are indicated by plotting the posterior median and a lower and upper quantile. plquants is a double vector of length two giving the lower and upper quantiles.

cols: Vector of two colors. The first color is used to plot the median of $f(x)$ and the second to plot the lower and upper quantiles.

...: Additional arguments passed on to plot.
BART is a Bayesian MCMC method.
At each MCMC iteration, we produce a draw from the joint posterior $(f, \sigma) \mid (x, y)$ in the numeric case, and just from $f \mid (x, y)$ in the binary case.
Thus, unlike a lot of other modelling methods in R, we do not produce a single model object from which fits and summaries may be extracted. The output consists of values $f^*(x)$ (and $\sigma^*$ in the numeric case), where * denotes a particular draw and $x$ is either a row from the training data (x.train) or the test data (x.test).
The plot method sets mfrow to c(1,2) and makes two plots.
The first plot is the sequence of kept draws of $\sigma$, including the burn-in draws. Initially these draws will decline as BART finds a good fit, and then level off when the MCMC has burnt in.
The second plot has $y$ on the horizontal axis and posterior intervals for the corresponding $f(x)$ on the vertical axis.
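For example, the draws can be summarized directly to get pointwise posterior intervals for $f$. A minimal sketch, assuming a fitted object bf from a call like bf = bart(x, y, x.test=xtest) (the names bf and xtest are illustrative):

## each column of yhat.test holds the kept draws of f at one test x,
## so column-wise quantiles give pointwise posterior intervals
fhat.med = apply(bf$yhat.test, 2, median)
fhat.int = apply(bf$yhat.test, 2, quantile, probs=c(.05,.95))
fhat.int[,1]  #90% posterior interval for f at the first test observation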
bart returns a list assigned class ‘bart’. In the numeric case, the list has components:
yhat.train: A matrix with (ndpost/keepevery) rows and nrow(x.train) columns. Each row corresponds to a draw $f^*$ from the posterior of $f$, and each column corresponds to a row of x.train. The $(i,j)$ value is $f^*(x)$ for the i'th kept draw and the j'th row of x.train. Burn-in draws are dropped.

yhat.test: Same as yhat.train, but now the x's are the rows of the test data.

yhat.train.mean: Train data fits = mean of yhat.train columns.

yhat.test.mean: Test data fits = mean of yhat.test columns.

sigma: Post-burn-in draws of $\sigma$; length = ndpost/keepevery.

first.sigma: Burn-in draws of $\sigma$.

varcount: A matrix with (ndpost/keepevery) rows and ncol(x.train) columns; each row is for a kept draw. For each variable (corresponding to the columns), the total count of the number of times that variable is used in a tree decision rule (over all trees) is given (see the sketch following this list).

sigest: The rough error standard deviation ($\sigma$) used in the prior.

y: The input vector of values for the dependent variable.
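One illustrative (not definitive) use of varcount is as a rough variable-importance measure, based on the per-draw proportion of decision rules that use each variable; bf again denotes a hypothetical fitted bart object:

vc = bf$varcount            #(draws) x (variables) matrix of rule counts
prop = vc / rowSums(vc)     #per-draw proportion of rules using each variable
round(colMeans(prop), 3)    #posterior mean proportion for each variable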
In the binary case, the returned list has the components yhat.train, yhat.test, and varcount as above. In addition the list has a binaryOffset component giving the value used.
Note that in the binary case, yhat.train and yhat.test are draws of $f(x)$ + binaryOffset. If you want draws of the probability $P(Y=1 \mid x)$, you need to apply the normal cdf (pnorm) to these values.
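A minimal sketch of the binary case, with simulated 0/1 data (all names here are illustrative, and the run is kept short):

set.seed(12)
n = 200
x = matrix(runif(n*3), n, 3)
p = pnorm(-1 + 2*x[,1])     #true P(Y=1|x) under a probit model
y = rbinom(n, 1, p)         #0/1 response, so bart uses the probit link
bf = bart(x, y, ntree=50, ndpost=200)  #small run, for illustration only
## yhat.train holds draws of f(x)+binaryOffset; pnorm maps them to probabilities
phat = pnorm(bf$yhat.train)
phat.mean = colMeans(phat)  #posterior mean of P(Y=1|x) at each training x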
There was a bug in BayesTree_0.1-0 (now fixed of course).
If the number of test observations was less than the number of trees
(200 is the default), the yhat.test and yhat.test.mean components were suspect.
Hugh Chipman: [email protected]
Robert McCulloch: [email protected].
Chipman, H., George, E., and McCulloch, R. (2010) Bayesian Additive Regression Trees. The Annals of Applied Statistics, 4(1), 266-298.
Chipman, H., George, E., and McCulloch, R. (2006) Bayesian Ensemble Learning. Advances in Neural Information Processing Systems 19, Scholkopf, Platt and Hoffman, Eds., MIT Press, Cambridge, MA, 265-272.
Friedman, J.H. (1991) Multivariate adaptive regression splines. The Annals of Statistics, 19, 1-67.
##simulate data (example from Friedman MARS paper)
f = function(x){
  10*sin(pi*x[,1]*x[,2]) + 20*(x[,3]-.5)^2 + 10*x[,4] + 5*x[,5]
}
sigma = 1.0  #y = f(x) + sigma*z , z~N(0,1)
n = 100      #number of observations
set.seed(99)
x = matrix(runif(n*10), n, 10)  #10 variables, only first 5 matter
Ey = f(x)
y = Ey + sigma*rnorm(n)
lmFit = lm(y~., data.frame(x,y))  #compare lm fit to BART later

##run BART
set.seed(99)
bartFit = bart(x, y, ndpost=200)  #default is ndpost=1000; this is to run the example fast
plot(bartFit)  #plot bart fit

##compare BART fit to the linear model and truth = Ey
fitmat = cbind(y, Ey, lmFit$fitted, bartFit$yhat.train.mean)
colnames(fitmat) = c('y','Ey','lm','bart')
print(cor(fitmat))
Converts factors to dummies.
Note that with more than one level, BART needs a dummy for each level of a factor
(unlike in linear regression where one of the dummies is dropped).
makeind(x, all=TRUE)
x: Data frame of explanatory variables.

all: If all=TRUE, a factor with p levels will be replaced by all p dummies. If all=FALSE, the p'th dummy is dropped.
Uses function class.ind from the nnet library. Note that if you have train and test data frames, it may be best to rbind the two together, apply makeind to the result, and then pull them back apart.
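A small sketch of that recipe, assuming data frames xtrain and xtest with the same columns (both names are illustrative):

xall = rbind(xtrain, xtest)   #combine so factor levels are coded consistently
xallind = makeind(xall)       #one dummy per factor level
xtrainind = xallind[1:nrow(xtrain), ]               #pull back apart
xtestind  = xallind[(nrow(xtrain)+1):nrow(xall), ]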
A matrix.
Numerical variables come first, and then the appended dummies.
Hugh Chipman: [email protected]
Robert McCulloch: [email protected].
x1 = 1:10
x2 = as.factor(c(rep(1,5), rep(2,5)))
x3 = as.factor(c(1,1,1,2,2,2,4,4,4,4))
levels(x3) = c('rob','hugh','ed')
x = data.frame(x1, x2, x3)
junk = makeind(x)
Run bart at test observations constructed so that a plot can be created displaying the effect of a single variable (pdbart) or a pair of variables (pd2bart).
Note that if y is binary with $P(Y=1 \mid x) = \Phi(f(x))$, $\Phi$ the standard normal cdf, then the plots are all on the $f$ scale.
pdbart(x.train, y.train,
       xind=1:ncol(x.train),
       levs=NULL, levquants=c(.05,(1:9)/10,.95),
       pl=TRUE, plquants=c(.05,.95), ...)

## S3 method for class 'pdbart'
plot(x, xind=1:length(x$fd),
     plquants=c(.05,.95), cols=c('black','blue'), ...)

pd2bart(x.train, y.train,
        xind=1:2,
        levs=NULL, levquants=c(.05,(1:9)/10,.95),
        pl=TRUE, plquants=c(.05,.95), ...)

## S3 method for class 'pd2bart'
plot(x, plquants=c(.05,.95), contour.color='white',
     justmedian=TRUE, ...)
x.train: Explanatory variables for training (in sample) data.

y.train: Dependent variable for training (in sample) data.

xind: Integer vector indicating which variables are to be plotted (in pd2bart, the pair of variables).

levs: Gives the values of a variable at which the plot is to be constructed; a list whose i'th component gives the values for the i'th variable named in xind.

levquants: If levs is NULL, the values of each variable used in the plot are set to the quantiles (in x.train) indicated by levquants.

pl: For pdbart and pd2bart: if TRUE, a plot is subsequently made (by calling the corresponding plot method).

plquants: In the plots, beliefs about $f(x)$ are indicated by plotting the posterior median and a lower and upper quantile. plquants is a double vector of length two giving the lower and upper quantiles.

...: Additional arguments. In pdbart and pd2bart, passed on to bart. In plot.pdbart, passed on to plot. In plot.pd2bart, passed on to image.

x: For plot.*, the object returned from pdbart or pd2bart.

cols: Vector of two colors. The first color is for the median of $f$; the second is for the upper and lower quantiles.

contour.color: Color for contours plotted on top of the image.

justmedian: Boolean; if TRUE, just one plot is created, for the median of the $f(x_s)$ draws. If FALSE, three plots are created: one for the median and two for the lower and upper quantiles (in this case, mfrow is set to c(1,3)).
We divide the predictor vector $x$ into a subgroup of interest, $x_s$, and the complement $x_c = x \setminus x_s$. A prediction $f(x)$ can then be written as $f(x_s, x_c)$. To estimate the effect of $x_s$ on the prediction, Friedman suggests the partial dependence function

$f_s(x_s) = \frac{1}{n} \sum_{i=1}^n f(x_s, x_{ic}),$

where $x_{ic}$ is the $i$'th observation of $x_c$ in the data. Note that $(x_s, x_{ic})$ will generally not be one of the observed data points. Using BART it is straightforward to then estimate, and even obtain uncertainty bounds for, $f_s(x_s)$. A draw of $f^*_s(x_s)$ from the induced BART posterior on $f_s(x_s)$ is obtained by simply computing $f^*_s(x_s)$ as a byproduct of each MCMC draw $f^*$. The median (or average) of these MCMC draws then yields an estimate of $f_s(x_s)$, and lower and upper quantiles can be used to obtain intervals.
In pdbart, $x_s$ consists of a single variable in $x$; in pd2bart it is a pair of variables.
This is a computationally intensive procedure. For example, in pdbart, to compute the partial dependence plot for 5 $x_s$ values, we need to compute $f(x_s, x_c)$ for all possible $(x_s, x_{ic})$, and there would be $5n$ of these, where $n$ is the sample size. All of that computation would be done for each kept BART draw. For this reason, running BART with keepevery larger than 1 (e.g. 10) makes the procedure much faster (see the sketch below).
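To make the construction concrete, here is a hedged sketch of what pdbart automates, for a single variable on a small grid (x and y as in the examples below; all other names are illustrative):

s = seq(-1, 1, .5)   #grid of x_s values for variable 1
xtest = NULL
for(v in s) { xs = x; xs[,1] = v; xtest = rbind(xtest, xs) }  #all (x_s, x_ic) pairs
bf = bart(x, y, x.test=xtest, keepevery=10, ndpost=200)  #keepevery>1 speeds this up
n = nrow(x)
## within each kept draw, average f* over the n complement rows for each grid value
fs.draws = sapply(seq_along(s), function(j)
  rowMeans(bf$yhat.test[, ((j-1)*n + 1):(j*n)]))
fs.med = apply(fs.draws, 2, median)  #pointwise estimate of f_s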
The plot methods produce the plots and don't return anything.
pdbart and pd2bart return lists with the components given below. The list returned by pdbart is assigned class ‘pdbart’ and the list returned by pd2bart is assigned class ‘pd2bart’.
fd: A matrix whose $(i,j)$ value is the $i$'th draw of $f_s(x_s)$ for the $j$'th value of $x_s$ ("fd" is for "function draws"). For pdbart, fd is actually a list whose $k$'th component is the matrix described above, corresponding to the $k$'th variable chosen by xind. For pd2bart, fd is a single matrix whose columns correspond to all possible pairs of values of the two variables indicated by xind (see the sketch following this list).

levs: The list of levels used, each component corresponding to a variable.

xlbs: Vector of character strings giving the plotting labels used for the variables.
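For instance, the draws in fd can be summarized by hand; with pdb1 as in the examples below:

fd1 = pdb1$fd[[1]]   #draws of f_s for the first variable named in xind
apply(fd1, 2, quantile, probs=c(.05,.5,.95))  #posterior intervals at each level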
The remaining components returned in the list are the same as in the value of bart. They are simply passed on from the BART run used to create the partial dependence plot. The function plot.bart can be applied to the object returned by pdbart or pd2bart to examine that BART run.
Hugh Chipman: [email protected].
Robert McCulloch: [email protected].
Chipman, H., George, E., and McCulloch, R. (2010) Bayesian Additive Regression Trees. The Annals of Applied Statistics, 4(1), 266-298.
##simulate data
f = function(x) { return(.5*x[,1] + 2*x[,2]*x[,3]) }
sigma = .2  # y = f(x) + sigma*z
n = 100     # number of observations
set.seed(27)
x = matrix(2*runif(n*3)-1, ncol=3)
colnames(x) = c('rob','hugh','ed')
Ey = f(x)
y = Ey + sigma*rnorm(n)
lmFit = lm(y~., data.frame(x,y))  #compare lm fit to BART later
par(mfrow=c(1,3))  #first two frames for pdbart, third for pd2bart

##pdbart: one-dimensional partial dependence plot
set.seed(99)
pdb1 = pdbart(x, y, xind=c(1,2),
              levs=list(seq(-1,1,.2), seq(-1,1,.2)), pl=FALSE,
              keepevery=10, ntree=100, nskip=100, ndpost=200)  #should run longer!
plot(pdb1, ylim=c(-.6,.6))

##pd2bart: two-dimensional partial dependence plot
set.seed(99)
pdb2 = pd2bart(x, y, xind=c(2,3),
               levquants=c(.05,.1,.25,.5,.75,.9,.95), pl=FALSE,
               ntree=100, keepevery=10, verbose=FALSE, nskip=100, ndpost=200)  #should run longer!
plot(pdb2)

##compare BART fit to the linear model and truth = Ey
fitmat = cbind(y, Ey, lmFit$fitted, pdb1$yhat.train.mean)
colnames(fitmat) = c('y','Ey','lm','bart')
print(cor(fitmat))

## plot.bart(pdb1) displays the BART run used to get the plot.