Tuesday, February 10, 2009

Large file sizes while persisting R rpart models to disk

I am using rpart to build a model for later predictions. To reuse the model across restarts and share it across nodes, I have been using "save" to persist the result of rpart to a file and "load" to restore it later. But the saved file was unusually large (even in binary, compressed mode), and its size grew in proportion to the amount of data used to create the model.

After tinkering a bit, I figured out that most of the size was due to the $functions component of the fitted rpart object. If I set it to NULL, the size drops dramatically. The effect can be seen with the following lines of R code; the difference is small here, but it becomes much more pronounced with large datasets.

library(rpart)
fit <- rpart(Kyphosis ~ Age + Number + Start, data=kyphosis)
save(fit, file="fit1.sav")   # full fitted object
fit$functions <- NULL        # drop the $functions component
save(fit, file="fit2.sav")   # noticeably smaller file
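The difference can also be measured without writing files, by comparing the serialized sizes in memory (a minimal sketch using base R's serialize; exact numbers will vary by dataset):

```r
library(rpart)
fit <- rpart(Kyphosis ~ Age + Number + Start, data=kyphosis)
length(serialize(fit, NULL))   # serialized size in bytes, with $functions
fit$functions <- NULL
length(serialize(fit, NULL))   # smaller once $functions is dropped
```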

What is the reason behind it? The functions themselves seem small. So where is all the bulk coming from?

Some insight came from this message in the r-devel group, which talks about using function-level environments for caching. A query to r-help threw some more light on it; thanks to all who kindly responded. I do not have much understanding of R internals yet, but to summarize what I gathered: when a function is created, R remembers, along with the function itself, the environment in which it was created, including all the variables present there (this is how R closures work). In rpart's case, the environment captured by $functions apparently holds on to the training data, which would explain why the saved size grows with the dataset. The correct fix would require an understanding of the rpart source; for now, I am going ahead with my approach of setting the functions to NULL.
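The effect can be reproduced independently of rpart: a function created inside another function closes over that call's environment, so anything large in scope at creation time gets serialized along with it, even if the function never uses it (a minimal sketch; the names are made up for illustration):

```r
make_f <- function() {
  big <- rnorm(1e6)    # large object in the enclosing environment
  function(x) x + 1    # never touches 'big', but its environment holds it
}
f <- make_f()
length(serialize(f, NULL))      # large: 'big' is serialized along with f
environment(f) <- globalenv()   # discard the captured environment
length(serialize(f, NULL))      # now tiny
```

Reassigning the environment, as in the last step, is a blunter version of the same trick as setting $functions to NULL: both throw away the captured data, so they are only safe if nothing downstream needs it.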

Another alternative could be not to use R for predictions at all: use R to create the rpart tree, export the tree as PMML using the pmml package in R, and then use another tool (e.g. Weka), outside R, to re-instantiate the model for predictions. See a discussion here.
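Assuming the pmml package is installed, the export step might look like this (a sketch; saveXML comes from the XML package, on which pmml depends, and the file name is arbitrary):

```r
library(rpart)
library(pmml)   # assumes the pmml package is installed
fit <- rpart(Kyphosis ~ Age + Number + Start, data=kyphosis)
saveXML(pmml(fit), file="fit.pmml")   # PMML file for consumption outside R
```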
