Using R as a Production Machine Learning Language (Part I)
There’s often confusion among the data science and machine learning crowd about the quality of R as a production-level language for deploying predictive models. Understandably, many programmers come from languages that don’t behave quite like R does and rely on libraries that R has few equivalents for. However, R lets us solve data analysis problems much more easily than many other languages, and some statistical methods have only been implemented in R. This post has two main goals:
- helping full-time developers better understand and appreciate what R is capable of;
- helping data scientists use their predictive models in reliable business settings.
Specifically, I want to demonstrate how we can take a trained predictive model (i.e. statistical learning model or machine learning model) and deploy it to a server as a RESTful API that accepts JSON requests and returns JSON responses. To do this, we will use the package plumber. By the end of the post we’ll have R code that allows us to access the model through the browser or programmatically in order to get predictions.
Note: The plumber GitHub repo has a giant warning about breaking changes coming in version 0.4. This is a GOOD thing! When developing an API like this, you probably want to run it locally in development and behind a serious web server like nginx in production. This post was created using plumber version 0.4.2, so if you follow along you should not have any problems with the version updates.
Why is this necessary?
There are many ways predictive models can help an organization, such as forecasting finances to better control budget and spending, recommending products to customers for increased conversion and/or upselling, or predicting internet traffic to better prepare infrastructure and reduce costs. There are also many ways we can architect how we use the model, each with trade-offs. Often with web applications, getting the model as close to (or preferably inside) the database will yield the best performance, but unless you have a database that implements R code as stored procedures, that often requires precomputing responses. A RESTful API, on the other hand, allows us to remain more flexible by only serving predictions upon request, thus enabling the ability to turn predictions on/off or change models more quickly. Let’s take a more real-world scenario as an example.
Our objective: recommend a single item for sale to users on their next login.
If we go with the first deployment method, we would train our model on the historical data available at the time and upload our predictions to a database table (let’s call it user_item_recommendations). This method requires us to work in batch mode. We train our data, submit our predictions, wait for results to come in, then start all over. There’s nothing wrong with this approach and it might be right for your organization, especially if you are very concerned about page load times. Since the prediction is already calculated and the data sits very close together (in the same database), the web developers just need to join the users table to our user_item_recommendations table by the user_id (uuid, uid, or whatever you use) to quickly identify what item to display when loading a custom page for each user upon login. A potential downside is that this process may require modification of the user login page code, in which case you might have to wait for developer time to update that part of the codebase. Another potential downside is that your recommendations might not be very useful unless they change frequently, which will almost certainly be detrimental to database performance, and so the update will have to run at weird hours in the night and maybe be limited to only a percentage of your users at once.
The API deployment is able to bypass the downsides of the batch-update, database deployment at the cost of potentially longer page load times. The good news is that modern JavaScript has the ability to load content asynchronously, so predictions can come in and load content after most of the page has already rendered. In this case, we might rethink the objective and end up not really caring about loading the login web page with the recommended item, but instead just needing to provide a prediction when triggered by the login event. That is, when a user actually completes the login to our website, we could respond in a number of different ways, such as sending an SMS/text message, sending an email, displaying a pop-up, or maybe just a banner image in the sidebar. If you take a peek under the hood of many modern websites, you’ll see this pattern of call and response often. When loading a page, the browser sends requests based on the url and cookies you submit, returning custom content.
The real advantage of using the API method for deploying predictive models is that it allows us to more closely align with the Unix Philosophy. By that, I mean we can treat each unique predictive model as a separate codebase that serves one and only one purpose (make a prediction based on provided data). This philosophy helps better isolate our code, which makes it easier to understand (which can be hard enough as is with modeling!), easier to maintain, easier to version, and easier to monitor our results.
Getting Started with Plumber on a trained model
This post assumes you already went through the hard part of data collection, data cleaning (aka munging, transforming, wrangling, etc.), model training, and model validation. If so, save your model as a .Rds file – a binary file format native to R that allows us to efficiently transfer models saved in R over a network.
If you haven’t done this yet, I’m going to quickly breeze through building one on the Titanic example from Kaggle to predict Survival. Note that a smaller version is actually built into R, which you can load with data(Titanic).
Prepping Data
Here we start by reading in the data in csv file format to a variable called titanic_data.
library(readr)
titanic_data <- read_csv("plumber_titanic/train.csv")
In order to ensure the model we build ends up useful and consistent, we’ll need the data that gets passed into it to be clean and consistent. By that I mean things like factors having the same levels and numeric values that make sense for what they represent. To accomplish that, we create a simple function to coerce the raw data into the data we expect. We’ll then be able to apply this function to our training dataset, as well as any future incoming data we want to predict on.
transform_titantic_data <- function(input_titantic_data) {
  # coerce the raw Titanic data into the factors/logicals the model expects
  ouput_titantic_data <- data.frame(
    survived = factor(input_titantic_data$Survived, levels = c(0, 1)),
    pclass = factor(input_titantic_data$Pclass, levels = c(1, 2, 3)),
    female = tolower(input_titantic_data$Sex) == "female",
    age = factor(dplyr::if_else(input_titantic_data$Age < 18,
                                "child", "adult", "unknown"),
                 levels = c("child", "adult", "unknown"))
  )
}
clean_titanic <- transform_titantic_data(titanic_data)
Train Model
When training a machine learning model, it’s important to use validation techniques, otherwise you have no basis for the assumption that it will continue to work on new (unseen) data. Here we implement a very simple technique known as a train/test split and train our model on just the train_df split. For the model, we use a logistic regression to estimate the probability of survival based on the passenger’s age, ticket class, and perceived gender. In business settings, logistic regression is really practical because it is fairly easy to train (no GPUs needed), very fast at prediction (fast response times for our API), and the results are much easier to explain and understand. Specifically, the logistic regression coefficients can help us determine the impact of each independent variable on the odds of the outcome (survival).
set.seed(42)
training_rows <- sample(1:nrow(clean_titanic),
                        size = floor(0.7 * nrow(clean_titanic)))
train_df <- clean_titanic[training_rows, ]
test_df <- clean_titanic[-training_rows, ]

# fit the logistic regression on the training split only
titanic_glm <- glm(survived ~ pclass + female + age,
                   data = train_df, family = binomial(link = "logit"))
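Since we just claimed the coefficients are easy to interpret, here is a minimal sketch of how you might inspect them by converting them to odds ratios (the exact values will depend on your data and split):

# exponentiating logistic regression coefficients gives odds ratios,
# e.g. how much the odds of survival change for a passenger in 3rd class
summary(titanic_glm)
exp(coef(titanic_glm))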
Evaluating Model
There are many ways to evaluate model performance, but since that’s not the point here and there are plenty of areas to confuse and distract (i.e. I could rant for a while…), let’s keep it simple. For our model, we use type = "response", which returns a probability as our prediction. From that, we predict TRUE for survival if our predicted probability is over 50%. We then compare the predicted survival to the actual survival in our test_df split to get a confusion matrix and our prediction accuracy.
test_predictions <- predict(
  titanic_glm,
  newdata = test_df,
  type = "response"
) >= 0.5

test_actuals <- test_df$survived == 1
accuracy <- table(test_predictions, test_actuals)
print(accuracy)
print(paste0(
  "Accuracy: ",
  round(100 * sum(diag(accuracy)) / sum(accuracy), 2),
  "%"
))
                test_actuals
test_predictions FALSE TRUE
           FALSE   147   29
           TRUE     29   63
[1] "Accuracy: 78.36%"
As you can see, our model does a reasonable job, but could definitely be dialed in quite a bit. Let’s just call this version 0.0.1 for now, then as we deploy new and improved versions of our model we can better justify the investment into our fancy data science.
Save model to RDS
In order to reuse this model on another computer or just in another R session, we want to save it. I highly recommend using .Rds, R’s built-in, efficient binary file format. In addition to being efficient, it only saves a single R object instead of the entire R session environment like .RData does, so when you load a .Rds file later you can assign it to a variable, which means we can use the predict function with it just like normal (like we just did in the previous code).
saveRDS(titanic_glm, file = "plumber_titanic/model.Rds", compress = TRUE)
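If you want to convince yourself the round trip works, a quick sanity check might look like this (reloaded_model is just an illustrative name):

# read the model back into a fresh variable and confirm predict() still works
reloaded_model <- readRDS("plumber_titanic/model.Rds")
head(predict(reloaded_model, newdata = test_df, type = "response"))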
Building Plumber API
Finally. The good stuff.
I know it took way too long to get here, but I wanted to make sure you understand why this is important and that people new to R can follow all the way through.
Plumber API files
We start off by creating two .R files:
- A file that will define our API endpoints, i.e. the functions we’ll be exposing.
- A file that will read our API definitions and start up a simple web server with those endpoints available.
According to the plumber package documentation, the first file is conventionally named plumber.R. Feel free to do that if it helps you remember where you saved the plumber code or helps you easily find the API definitions once you build 30 of these. Otherwise, feel free to name it whatever you want. I like to name my API code after the model it exposes, in this case titanic-api.R, since I might include more than one file in a single directory. That may be bad practice, though, so you do you.
The second file contains the code for routing your API endpoints, commonly referred to as a router. There are many options available to customize your API here, including adding custom serializers, adding error handlers, encrypting cookies, and even the ability to serve static sites. For our use case, we have the most bare-bones scenario (keep it simple), so I simply named it server.R. In that code, we do just three things: load the plumber library, plumb (source) our API definitions file, and serve it over port 8000.
library(plumber)

serve_model <- plumb("plumber_titanic/titanic-api.R")
serve_model$run(port = 8000)
The other file has the most interesting bits. At the top of the code for titanic-api.R, we load the plumber library and then read in our trained model. Then you’ll notice some constants that describe the current version of the model (and thus the API) and the variables/features/inputs required. The reason for these constants is twofold:
- We want to make iterative progress on our model, and thus we want to know which version is serving predictions once we update it. Therefore it’s important to update the MODEL_VERSION constant every time you update your model or this file. It’s totally fine to have millions of versions, but it’s totally not cool to have multiple versions out there with the same version number.
- We want to help developers use our prediction model as a service. This means having a way to help them understand what is necessary and what the parameters passed to the API mean. We’ll end up creating a very simple landing page for our API based on these variables.
library(plumber)

model <- readRDS("plumber_titanic/model.Rds")

MODEL_VERSION <- "0.0.1"
VARIABLES <- list(
  pclass = "Pclass = 1, 2, 3 (Ticket Class: 1st, 2nd, 3rd)",
  sex = "Sex = male or female",
  age = "Age = # in years",
  gap = "",
  survival = "Successful submission will result in a calculated Survival Probability from 0 to 1 (Unlikely to More Likely)"
)
First use of Plumber Annotation - Health check endpoint
The first function we’ll write will also be the first endpoint we’ll create. We’re going to start small by simply returning a response to any request containing a status code and our model version. You’ll see that all we did was write a simple R function, just like we always do. However, by simply adding a special type of comment right above our function, plumber will automagically be able to turn our code into an HTTP API. The details of this are maybe a little complex, but not complicated. I encourage you to read up on decorator functions if you are curious, but I’m going to pass on explaining it for now other than saying they work by telling the program that your function should get passed into another function as described in the annotation (special comment).
In this function, the decorator we applied is @get /healthcheck, which tells plumber to expose this function to HTTP GET requests at the url http://127.0.0.1:8000/healthcheck.
#* @get /healthcheck
health_check <- function() {
  result <- data.frame(
    "input" = "",
    "status" = 200,
    "model_version" = MODEL_VERSION
  )
  return(result)
}
For the function above, think of it as a way to test whether the API is actually working. In fact, if you stop right here and save your two files, then source the server.R file, it should start a web server at your localhost over port 8000, and you can check in the browser whether it’s running at http://127.0.0.1:8000/healthcheck.
Over time, I think you’ll find having a simple endpoint like this can be really helpful for debugging whether your model is having problems or if the API server is having problems.
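You don’t have to rely on the browser, either. As a rough sketch, with server.R running in another R session you could hit the endpoint programmatically; here I’m assuming the httr package, which is not required by plumber:

library(httr)
# send a GET request to the running API and inspect the parsed response
resp <- GET("http://127.0.0.1:8000/healthcheck")
content(resp)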
Landing Page
This is probably unnecessary as all software developers are super smart and obviously know exactly what data to pass a simple REST API, but just in case, let’s make a simple home page for users accessing our API to help get them acquainted. You’ll notice that in addition to the @get decorator, we also introduce a second decorator, @html, that tells plumber to render responses from this endpoint as html content instead of json.
#* @get /
#* @html
home <- function() {
  title <- "Titanic Survival API"
  body_intro <- "Welcome to the Titanic Survival API!"
  body_model <- paste("We are currently serving model version:", MODEL_VERSION)
  body_msg <- paste(
    "To receive a prediction on survival probability,",
    "submit the following variables to the <b>/survival</b> endpoint:",
    sep = "\n")
  body_reqs <- paste(VARIABLES, collapse = "<br>")

  # join the HTML fragments with newlines into a single string
  result <- paste(
    "<html>",
    "<h1>", title, "</h1>", "<br>",
    "<body>",
    "<p>", body_intro, "</p>",
    "<p>", body_model, "</p>",
    "<p>", body_msg, "</p>",
    "<p>", body_reqs, "</p>",
    "</body>",
    "</html>",
    sep = "\n"
  )
  return(result)
}
The html we generate with this function is pretty basic, but if you are unfamiliar with HTML, we are displaying:
- a page header with the name of the API
- a short welcome message
- a sentence stating the current version of the model being served
- a short explanation of how to use the API and what the response means
Sourcing the server.R file now should allow you to check it out in your browser and see the landing page.
Prediction Endpoint
Before jumping straight into the function to make predictions, let’s start by implementing the helper function we used to clean the data before training the model, as well as a helper function to validate that inputs passed to our API are useful and appropriate.
transform_titantic_data <- function(input_titantic_data) {
  ouput_titantic_data <- data.frame(
    pclass = factor(input_titantic_data$Pclass, levels = c(1, 2, 3)),
    female = tolower(input_titantic_data$Sex) == "female",
    age = factor(dplyr::if_else(input_titantic_data$Age < 18,
                                "child", "adult", "unknown"),
                 levels = c("child", "adult", "unknown"))
  )
}
validate_feature_inputs <- function(age, pclass, sex) {
  age_valid <- (age >= 0 & age < 200 | is.na(age))
  pclass_valid <- (pclass %in% c(1, 2, 3))
  sex_valid <- (sex %in% c("male", "female"))
  tests <- c("Age must be between 0 and 200 or NA",
             "Pclass must be 1, 2, or 3",
             "Sex must be either male or female")
  test_results <- c(age_valid, pclass_valid, sex_valid)
  if (!all(test_results)) {
    failed <- which(!test_results)
    return(tests[failed])
  } else {
    return("OK")
  }
}
You may notice that this transform_titantic_data function is almost identical to the earlier one, but not quite. Since we are now just making predictions, we don’t need to format the outcome variable we trained against, since it won’t be in the request. For the second function, you should notice it is a series of logical tests against each feature separately, which we then parse to identify any problems we might need to report in the response. For example, if the Age submitted is excessive or negative, we probably want to tell the developer to check their code instead of just returning a probability.
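For example, calling the validator directly shows the kind of messages a developer would get back (the input values here are just illustrative):

# an impossible age fails validation and returns the matching message
validate_feature_inputs(age = -5, pclass = 2, sex = "female")
# [1] "Age must be between 0 and 200 or NA"

# sensible inputs pass
validate_feature_inputs(age = 30, pclass = 1, sex = "male")
# [1] "OK"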
Now onto the main event.
You can see that we implement the API at the single endpoint /survival and allow both GET and POST requests to it. Technically this shouldn’t make a lot of sense, as HTTP requests should roughly map to database operations in the following fashion, according to principles of good API design:
- GET (SELECT): Retrieve a specific Resource from the Server, or a listing of Resources.
- POST (CREATE): Create a new Resource on the Server.
- PUT (UPDATE): Update a Resource on the Server, providing the entire Resource.
- PATCH (UPDATE): Update a Resource on the Server, providing only changed attributes.
- DELETE (DELETE): Remove a Resource from the Server.
However, while we really only want to allow users to GET a single prediction back, not all servers are okay with this strategy. In theory, a GET request should retrieve a specific thing (which this is), but allowing different payloads in a GET request can break any caching the server might provide. So rather than choosing between accepting a GET request with the inputs in the URL query string and a POST request with the inputs in the payload, we’re going to do both to make everyone unhappy.
#* @post /survival
#* @get /survival
predict_survival <- function(Age = NA, Pclass = NULL, Sex = NULL) {
  age <- as.integer(Age)
  pclass <- as.integer(Pclass)
  sex <- tolower(Sex)
  valid_input <- validate_feature_inputs(age, pclass, sex)
  if (valid_input[1] == "OK") {
    payload <- data.frame(Age = age, Pclass = pclass, Sex = sex)
    clean_data <- transform_titantic_data(payload)
    prediction <- predict(model, clean_data, type = "response")
    result <- list(
      input = list(payload),
      response = list("survival_probability" = prediction,
                      "survival_prediction" = (prediction >= 0.5)),
      status = 200,
      model_version = MODEL_VERSION)
  } else {
    result <- list(
      input = list(Age = Age, Pclass = Pclass, Sex = Sex),
      response = list(input_error = valid_input),
      status = 400,
      model_version = MODEL_VERSION)
  }
  return(result)
}
Compared to the rest of the code above, this function is definitely doing a lot more. First off, we explicitly cast our input variables to the data types we want. JSON, which is how the request comes in, is not strongly typed, so we may receive an Age of “22”, which would work fine here but would fail if we didn’t convert it to an integer in R. Next, we use the helper function to test and validate the inputs passed in the request and save the results to a variable named valid_input. Then we proceed down two separate paths.
If valid_input was OK, then we join the data passed in the request into a data.frame, clean that data.frame, and then run the predict function on the request data using our trained model. Instead of just returning the prediction directly, we want to return it in a more standard JSON format with some additional information that will be helpful to us later. Specifically, we return JSON containing:
- the input passed to the predict function
- the prediction as both a probability (0.0 to 1.0) and a recommendation (TRUE/FALSE)
- a status code of 200 since everything worked OK
- the version of the model the prediction came from
Why do we need all that? Well, in addition to making predictions, as data scientists we want to collect data. We want to log how often the model is being used. We want to later compare how strongly our recommendation correlates with other actions. We want to be able to debug our prediction API if things start to go awry.
If valid_input was not OK, we don’t want the API to just break down. Instead we want it to return useful information, like why it didn’t work. Therefore, when we can’t validate the request data, we return:
- the exact input passed to the API (not necessarily the same as what would have been passed to the model if it had been OK)
- an array of the errors with the input, to give the developer a better understanding of why the request failed
- a status code of 400, which indicates a Bad Request
- the version of the model that made the response
In addition to being able to use the API from a web browser, you can also use other tools, like curl through the command line.
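As a sketch of what that looks like from R (assuming the API is running locally on port 8000; the httr package is one option for making the requests, and curl on the command line works just as well):

library(httr)

# GET request with the inputs passed as a URL query string
resp_get <- GET("http://127.0.0.1:8000/survival",
                query = list(Age = 22, Pclass = 1, Sex = "female"))
content(resp_get)

# POST request with the inputs passed as a JSON payload
resp_post <- POST("http://127.0.0.1:8000/survival",
                  body = list(Age = 22, Pclass = 1, Sex = "female"),
                  encode = "json")
content(resp_post)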
Putting it all together
Once you’ve got your code ready, let’s go through this checklist to make sure everything is in order:
- We have our code to build a model isolated from the code to deploy the model.
- We have our model trained and saved to a file.
- We have our code to deploy the model as a RESTful API in two files, one defining our API in decorated functions and one to serve that first file.
- We have versioning for our deployment and make sure to update it with each change.
- We have error handling in place to ensure bad requests get returned with useful information.
Concluding Remarks
Full source code to build this plumber based API is available at https://github.com/raybuhr/plumber-titanic.
Read through the Plumber Documentation.
By default, all plumber endpoints return responses as application/json using the jsonlite::toJSON function. You can read more about alternative serializers (i.e. other types of response content) in the plumber docs.
In my experience, you may be fine with the default serializers or you may need to get pretty deep in the weeds, depending on how you need to receive and return results.
Deploy your models somewhere that is easy to find and easy for your servers to download from, such as AWS S3 or Google Cloud Storage. GitHub is usually not a very good place to store models, unless they are very small and benefit from reviewing text differences (which .Rds files do not).
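One simple pattern for getting the model file onto a server at deploy time is to download it from cloud storage before starting the API. A rough sketch (the URL below is purely a placeholder and assumes the object is publicly readable or pre-signed):

# fetch the trained model from cloud storage, then load it as usual
model_url <- "https://storage.googleapis.com/your-bucket/model.Rds"
download.file(model_url, destfile = "plumber_titanic/model.Rds", mode = "wb")
model <- readRDS("plumber_titanic/model.Rds")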
In Part II of this post, we’ll explore:
- uploading our code to a version control system like GitHub
- uploading our .Rds file to cloud storage
- spinning up a virtual machine in the cloud to host our API
- using Docker to “containerize” our API allowing us to run multiple instances at once
- using Nginx to serve up those docker instances of this API and act as a load balancer