This presentation was given by Anna Connolly (VP, Customer Success, OctoML) and Bassem Yacoube (Solutions Architect, OctoML) at the Computer Vision Summit in San Jose, April 2023.

In this article, we're going to focus on deploying models efficiently, particularly in the cloud. 

But first, a little word on OctoML. We're a company of 100 machine learning systems experts who know a lot about infrastructure and how to deploy models, and our team is drawn from all around the world. 

Our vision is a world where AI is sustainable, accessible, and used thoughtfully to improve lives. And we do that in a specific way: by empowering developers to deploy any trained model into an intelligent application in production, anywhere. 

Our founding team created the Apache TVM open source project and the XGBoost technology.

The challenges of deploying generative AI models

Anna Connolly

First, let’s just take a second to remind ourselves how ubiquitous AI is in the world today. This used to be a niche thing and now it's everywhere. It's in every application and in all the technology we're using. Models that were in academic papers six to eight months ago are now being talked about in the mainstream. 

I was talking about ChatGPT at a birthday party last weekend. When your mom starts asking you about it, and when your kids and teachers are trying to figure out how to deal with this, you know that it's really broken through. And I think the speed at which all of this is developing is really incredible. 

But as you all know, going from research into production and making these apps really usable for people is hard. 

One of our co-founders, Jason Knight, asked GPT-4 why generative AI models like the one that powers this app are so difficult to deploy. And the app dutifully returned a few key reasons.

Number one is performance considerations. In this specific talk, we're going to speak about performance in terms of speed of inference, latency, and throughput, and not issues of model accuracy or quality. So that's the first consideration. 
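To make "latency" and "throughput" concrete, here is a minimal sketch of how you might measure both for an inference call. The `fake_infer` function is a hypothetical stand-in for a real model, and `benchmark` is an illustrative helper, not part of any OctoML tooling.

```python
import statistics
import time


def fake_infer(batch):
    # Hypothetical stand-in for a real model call: sums each input row.
    return [sum(row) for row in batch]


def benchmark(infer, batch, n_runs=50):
    """Time n_runs calls to infer(batch); report latency and throughput."""
    latencies = []
    for _ in range(n_runs):
        start = time.perf_counter()
        infer(batch)
        latencies.append(time.perf_counter() - start)

    total = sum(latencies)
    latencies.sort()
    # Simple index-based estimate of the 95th percentile latency.
    p95_index = min(len(latencies) - 1, int(0.95 * len(latencies)))
    return {
        "mean_latency_s": statistics.mean(latencies),
        "p95_latency_s": latencies[p95_index],
        # Items processed per second across all timed runs.
        "throughput_items_per_s": (
            n_runs * len(batch) / total if total > 0 else float("inf")
        ),
    }


if __name__ == "__main__":
    print(benchmark(fake_infer, [[1, 2, 3]] * 8, n_runs=20))
```

Latency is the time a single request waits, while throughput is how many items the system processes per second. Larger batches usually raise throughput at the expense of per-request latency, which is why the two are worth tracking separately.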

The second one is compute constraints. Do you have enough of the hardware you need to deploy your feature or app and scale it up as demand increases? 
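As a rough illustration of the compute question, the sketch below estimates how many instances a given peak load would need. Every number and name here is a hypothetical assumption (peak request rate, per-instance throughput, target utilization), chosen only to show the arithmetic.

```python
import math


def instances_needed(peak_rps, per_instance_rps, target_utilization=0.7):
    """Estimate instance count for a peak load, leaving some headroom.

    All inputs are assumptions you would replace with measured values.
    """
    if per_instance_rps <= 0 or not (0 < target_utilization <= 1):
        raise ValueError("throughput must be > 0 and utilization in (0, 1]")
    usable_rps = per_instance_rps * target_utilization
    return math.ceil(peak_rps / usable_rps)


if __name__ == "__main__":
    # Hypothetical: 250 requests/s at peak, 40 requests/s per instance.
    print(instances_needed(250, 40))
```

If the answer is more hardware than you can actually get hold of, the remaining lever is raising per-instance throughput.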

And then number three is production costs. Is it economically viable for you to put this feature into production given the revenue or the traffic you might get from it and the cost to serve it? 
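The viability question comes down to simple arithmetic: cost to serve versus what the traffic is worth. Here is a minimal sketch, with entirely hypothetical prices and throughput, of how you might compute cost per thousand inferences and compare it with revenue.

```python
def cost_per_thousand(hourly_price_usd, throughput_per_s):
    """Cost in USD to serve 1,000 inferences on one fully used instance."""
    if throughput_per_s <= 0:
        raise ValueError("throughput_per_s must be positive")
    inferences_per_hour = throughput_per_s * 3600
    return hourly_price_usd / inferences_per_hour * 1000


def is_viable(hourly_price_usd, throughput_per_s, revenue_per_thousand_usd):
    """True if revenue per 1,000 inferences exceeds the cost to serve them."""
    return cost_per_thousand(hourly_price_usd, throughput_per_s) < revenue_per_thousand_usd


if __name__ == "__main__":
    # Hypothetical: a $3.60/hour instance serving 10 inferences per second.
    print(round(cost_per_thousand(3.60, 10), 4))
    print(is_viable(3.60, 10, revenue_per_thousand_usd=0.25))
```

Note that doubling throughput on the same instance halves the cost per inference, so performance and production cost are really two sides of the same problem.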

Before we go into customer stories, we're going to look at these considerations in a little bit more detail.