--- title: "Making pdp faster" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Making pdp faster} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include=FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 6, fig.height = 4.5, fig.align = "center", message = FALSE, warning = FALSE ) ``` Computing partial dependence is embarrassingly simple but potentially expensive: for each point in the grid of predictor values, the model has to score a modified copy of the entire training set. This vignette covers the options **pdp** provides for speeding that up. ```{r boston-rf} library(pdp) library(randomForest) data(boston) set.seed(101) boston.rf <- randomForest(cmedv ~ ., data = boston, ntree = 250) ``` ## Batched predictions (`batch.size`) By default, `partial()` calls `predict()` once per grid point. Most `predict()` methods have non-trivial per-call overhead, so it is often much faster to *stack* several grid points together and score them in a single call. The `batch.size` argument specifies the (approximate) maximum number of rows to score per call: ```{r batch} system.time( # classic: one predict() call per grid point pd1 <- partial(boston.rf, pred.var = "lstat", train = boston) ) system.time( # batched: score up to one million rows per predict() call pd2 <- partial(boston.rf, pred.var = "lstat", train = boston, batch.size = 1e6) ) identical(pd1, pd2) ``` Batching only changes *how* the predictions are computed, not the result. To see this, plot the classic curve and overlay the batched results as points --- they fall exactly on the curve: ```{r batch-plot} plot(pd1) # classic (line) tinyplot::tinyplot_add(yhat ~ lstat, data = pd2, type = "p", col = "red2") ``` The trade-off is memory: with `batch.size = 1e6`, up to a million rows are held in memory at once. Pick a batch size that comfortably fits your machine. Note that `batch.size` requires the prediction function to return one prediction per row of `newdata`, so it cannot be combined with a `pred.fun` that aggregates its own predictions. ## Parallel processing `partial()` can compute the grid points (or batches) in parallel via the [foreach](https://cran.r-project.org/package=foreach) package. Register a parallel backend (e.g., with **doParallel**) and set `parallel = TRUE`: ```{r parallel, eval=FALSE} library(doParallel) cl <- makeCluster(4) # use 4 workers registerDoParallel(cl) pd <- partial(boston.rf, pred.var = c("lstat", "rm"), chull = TRUE, train = boston, parallel = TRUE) stopCluster(cl) ``` If the model's `predict()` method lives in a package, pass it along via `paropts` (e.g., `paropts = list(.packages = "ranger")`) so the workers can find it. ## The recursive method for gbm For **gbm** models, `partial()` defaults to `recursive = TRUE`, which uses Friedman's weighted tree traversal method (implemented in C++) instead of the brute force approach. It is *much* faster since it never has to touch the training data: ```{r gbm} library(gbm) set.seed(103) boston.gbm <- gbm(cmedv ~ ., data = boston, distribution = "gaussian", n.trees = 500, interaction.depth = 3, shrinkage = 0.1) system.time( pd.recursive <- partial(boston.gbm, pred.var = "lstat", n.trees = 500, train = boston) # recursive = TRUE is the default ) system.time( pd.brute <- partial(boston.gbm, pred.var = "lstat", n.trees = 500, recursive = FALSE, train = boston, batch.size = 1e6) ) ``` Overlaying the results shows that the two methods produce nearly the same curve: ```{r gbm-plot} plot(pd.recursive) # recursive (line) tinyplot::tinyplot_add(yhat ~ lstat, data = pd.brute, type = "p", col = "red2") ``` The small differences are expected: rather than averaging predictions over the training data, the recursive method weights each path through the tree by the proportion of training observations that followed it, which is not quite the same thing when the predictors are correlated (as they are here). Note that the recursive method cannot be used with `ice = TRUE`, `pred.fun`, or `inv.link`. ## Fast approximate PDPs (`approx = TRUE`) Finally, `approx = TRUE` computes a much cheaper *approximation* to partial dependence: rather than averaging over the training data, predictions are computed for a single "exemplar" observation (continuous predictors are fixed at their median; categorical predictors at their most frequent value). This is in the same spirit as the **plotmo** package and is most useful as a quick first look at very large data sets: ```{r approx} partial(boston.rf, pred.var = "lstat", approx = TRUE, plot = TRUE, train = boston) ``` The underlying `exemplar()` function is also exported; you can achieve the same thing (with more control) by passing an exemplar record to `train` and, optionally, your own `pred.grid`.