Serverless GPUs, AI-powered mobile apps, democratized AI,…
The compute landscape for model creation and deployment is undergoing rapid transformation, catalyzed in part by the goal of creating compelling applications, with the cost of compute as the primary constraint.
Applications powered by machine learning (ML) have become commonplace: trained models are used behind the scenes to solve a wide range of tasks. Use case, user experience, and the cost of High Performance Compute (HPC) such as GPUs determine the choice of compute for training and deploying models. For instance, applications relying on large models have no choice but to run in the cloud. On the other hand, a real-time object detection application running on a smartphone requires a model that is small and fast; a model running in the cloud would deliver a suboptimal user experience.
This post is a snapshot of the compute landscape through the lens of a practitioner or a small business.
How does one leverage state-of-the-art (SOTA) solutions for a task, particularly in areas where rapid progress means SOTA records are frequently broken?
Specifically, this post examines cost-effective compute options to
- discover and interact (no code) with state-of-the-art models for a task
- tinker (notebook) with a specific state-of-the-art model
- optionally train or fine-tune it on custom data
- and finally deploy it
As mentioned earlier, the cost of High Performance Compute (HPC) constrains what a company, practitioner, or even a researcher can aspire to do. For instance, training a large language model, or even a vision model like the recent diffusion models used for generating art, is beyond the reach of the large majority of us, given the cost can range into the hundreds of thousands if not millions of dollars. Inference, on the other hand, is possible in some cases with distilled/quantized versions of large models, even on laptops with an accelerator card and sufficient memory.
While several established enterprise MLOps players claim to have cost-optimized the entire model life cycle (training, deployment, monitoring, and improvement), in reality they have not. This has led to the emergence of new startups attempting to address specific portions of the model life cycle where there is opportunity to drive costs down.
The high performance compute required for research is being addressed by startups like Stability.ai, which funds research and developer communities. Stability.ai recently sparked a flurry of excitement on Twitter over the open release of a model through one of the entities it funded, DreamStudio. Stable Diffusion generates photorealistic images from text input. It is a diffusion model trained on 256 GPUs for a month at a cost of roughly $600,000. Stability.ai's contribution to democratizing AI goes well beyond funding the training of this generative model, which was a collaborative effort with researchers of the CompVis group at Heidelberg University. For instance, the creation of the LAION dataset of 5 billion image links harvested from Common Crawl was funded by Stability.ai. Entities in the Stability.ai ecosystem are building open models for language, image, audio, video, 3D, and biology. In addition to DreamStudio, EleutherAI and CarperAI have already published code and model checkpoints for public use, such as GPT-NeoX, while others like Harmonai and OpenBioML are works in progress.
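As a rough sanity check on that training cost, a back-of-the-envelope estimate lands in the same ballpark. The hourly rate below is an assumption for illustration; the actual rate Stability.ai paid is not public, and cloud prices for A100-class GPUs vary widely by provider and contract.

```python
# Back-of-the-envelope training cost estimate for Stable Diffusion.
GPUS = 256
DAYS = 30
RATE_PER_GPU_HOUR = 3.25  # assumed illustrative $/GPU-hour, not a quoted price

gpu_hours = GPUS * DAYS * 24          # 184,320 GPU-hours
cost = gpu_hours * RATE_PER_GPU_HOUR  # = $599,040, close to the reported $600k

print(f"{gpu_hours:,} GPU-hours -> ${cost:,.0f}")
```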
Examined below are details of the cost-effective compute options for a practitioner or small business.
Discover state-of-the-art models for a task
Papers with Code makes it quite easy to find state-of-the-art papers for a task. It is the only site to date that consolidates and organizes SOTA models based on performance benchmarks. A quick glance at the aggregate stats below, derived from Papers with Code, illustrates the wealth of models that could be leveraged in building applications while honoring code and model release licenses. For instance, 50% of the papers have a GitHub repo, and about a quarter (84,160) of the repos (314,760) have official code published along with the paper. These papers span 2,634 unique tasks distributed across 16 categories.
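The proportions above follow directly from the quoted counts:

```python
# Aggregate stats as quoted in the text (Papers with Code).
total_repos = 314_760
official_repos = 84_160

share_official = official_repos / total_repos
print(f"{share_official:.1%} of repos are official code releases")  # ~26.7%
```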
However, finding and keeping track of SOTA papers with released code and pre-trained models requires additional effort, for the following reasons:
- The official code for a state-of-the-art model may not be released, only the paper. There might be unofficial implementations of the code that could be leveraged, but reproducing results with them can be challenging.
- Even if the official code is released, the model used to produce the results may not be. We would have to train the model ourselves, a fact we would only find out from the GitHub repository for the model.
- Even if the model is released, the license may impose restrictions on model use for commercial purposes, although a significant proportion of GitHub releases have permissive licenses.
- There could be multiple state-of-the-art models for a task of interest, since every dataset for a task is a benchmark. We would have to pick the model that performs best on the dataset closest to the data distribution of our use case. For instance, the semantic segmentation task has SOTA models for diverse datasets such as naturally occurring images, medical images, and aerial-view images. Typically there is high variance in the SOTA scores within a task category, owing to the diversity of the datasets as well as the difficulty each dataset poses.
- Periodically checking Papers with Code for SOTA updates may be cumbersome, particularly when tracking state-of-the-art for multiple tasks, or incremental improvements to the same paper that eventually becomes state-of-the-art. Relying on social media for SOTA updates may be insufficient: we are likely to miss some, given not all authors write about their work. Even when they do, some models grab more attention than the rest and even eclipse superior alternatives. An example is the Stable Diffusion model, currently perhaps the most talked about model on social media. It relies on a SOTA paper published back in December 2020 that demonstrated the learning of a codebook using VQGAN. VQGAN's reconstruction of an image from its learned codebook representation was imperceptibly close to the original image and far superior to the reconstruction quality of the dVAE used in DALL-E. Despite this, the VQGAN work, and a diffusion model that leveraged it, was largely eclipsed on social media by DALL-E. That changed when Stability.ai funded the VQGAN researchers to create Stable Diffusion, a latent text-to-image model. The VQGAN paper is in the limelight now, since it is the first stage of the Stable Diffusion model.
Tasks with Code complements Papers with Code in the discovery of SOTA models with code by addressing the issues above to some degree. Tasks with Code publishes daily, and consolidated weekly, updates of changes to SOTA by papers with official code releases. These updates report the availability of pretrained/fine-tuned models, the datasets a model was trained on, app links to interact with the model, Google Colab notebooks to tinker with the model, containers for cloud deployment, etc.
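At its core, this kind of tracking is a diff between successive snapshots of a leaderboard. A minimal sketch of the idea follows; the benchmark names, model names, and scores are hypothetical, purely for illustration:

```python
# Detect SOTA changes between two leaderboard snapshots.
# Each snapshot maps (task, dataset) -> (model_name, score).
def sota_changes(previous, current):
    """Return benchmarks whose top entry changed since the last snapshot."""
    changes = {}
    for benchmark, entry in current.items():
        if previous.get(benchmark) != entry:
            changes[benchmark] = entry
    return changes

# Hypothetical example data, not real leaderboard numbers.
yesterday = {("semantic segmentation", "ADE20K"): ("ModelA", 57.0)}
today = {
    ("semantic segmentation", "ADE20K"): ("ModelB", 58.2),   # new SOTA
    ("image classification", "ImageNet"): ("ModelC", 91.0),  # new benchmark
}
print(sota_changes(yesterday, today))
```

A daily digest like the one Tasks with Code publishes amounts to running such a diff against a fresh scrape of the benchmarks and reporting the changed entries.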
Sites like Hugging Face attempt to circumvent the search for SOTA models through automatic training of a pre-selected list of around 15 models for a task. But this is only offered for a handful of tasks (< 20), compared with two orders of magnitude more tasks on Papers with Code (2,634). Also, these pre-selected models include models uploaded by users, which in most cases carry no information about how they performed on a benchmark.
Links for discovering and tracking SOTA with code and released models:
“No code” interaction with state-of-the-art models for a task
Interacting with SOTA models for a task is possible in some instances through apps showcasing popular models on sites like Hugging Face and replicate.ai, as well as task-specific sites with open playgrounds like OpenAI, Co:here, AI21, Stability.ai, etc. The list of task-specific sites is likely to grow over time.
Links for interacting with state-of-the-art models for a task
Tinker with a specific state-of-the-art model
Depending on model size, compute, and storage needs, the choices for tinkering could be one or more of the following:
- Google Colab (free), Google Colab Pro (monthly subscription), Kaggle (free), Paperspace (free and paid tiers), etc. The choice of GPU is not under user control in the case of Google Colab/Colab Pro, whereas Paperspace allows for GPU choice (with choices limited by pricing tier). Paperspace has recently partnered with Graphcore to offer model building, training, and deployment on Graphcore's HPC machines, called IPUs. Baseten has free and paid tiers like Paperspace, with no explicit option to choose GPUs. Unlike Paperspace, Baseten does not offer the option to train models; it allows building apps to deploy models.
- HPC cloud providers. This is perhaps the only choice when models are large or when data storage/compute needs are high. There are a large number of HPC cloud providers, some targeting both individual users and companies, and others targeting only the enterprise market. The HPC list below focuses on the former.
- Cloud compute offerings by HPC hardware vendors. While Nvidia GPUs are the only HPC hardware offered in the cloud by most HPC cloud providers to date, competing offerings from AMD, Tenstorrent, etc. are vying with Nvidia for hardware market share, particularly in the cloud. On the high end of the HPC spectrum are cloud offerings from Cerebras, Sambanova, and Graphcore, but these vendors don't seem to target end users directly through their own cloud offerings, although Graphcore recently has through Paperspace, and last year through Hugging Face.
- Software-accelerated compute as an alternative to HPC. This is a fairly new space created by startups like Colossal-AI (which raised $6M a few days ago). Colossal-AI accelerates computing, specifically for transformer models, enabling for instance an 8B-parameter model to run on a standard GPU. They just released a demo of Meta's OPT-175B running on hardware optimized with their software.
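The memory arithmetic alone shows why that matters. Model weights dominate the inference footprint, and a sketch of the weight memory at different precisions (activations and runtime overhead excluded) shows that an 8B-parameter model overflows a typical 24 GB consumer GPU at full precision:

```python
def weight_memory_gb(params_billion, bytes_per_param):
    """Memory needed just for model weights, in GB (1 GB = 1e9 bytes)."""
    return params_billion * 1e9 * bytes_per_param / 1e9

params = 8  # billions of parameters
for name, nbytes in [("fp32", 4), ("fp16", 2), ("int8", 1)]:
    print(f"{name}: {weight_memory_gb(params, nbytes):.0f} GB")
# fp32: 32 GB, fp16: 16 GB, int8: 8 GB (weights only)
```

Techniques like offloading and quantization, which software-acceleration startups rely on, are essentially ways of closing this gap between model size and available GPU memory.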
Links for tinkering with models
Training a model
Training a model involves the same choices as tinkering, based on model size, compute, data storage needs, number of training epochs, etc. However, training is more likely to require the use of HPC in the cloud. Costs aside, most HPC provider offerings limit the number of GPUs that can be provisioned to 8; provisioning beyond that number typically requires negotiating prices with the provider. This 8-GPU limit serves as a crude threshold for the spend an individual or a small company can afford, separating this category of HPC users from those who can afford to provision tens or even hundreds of GPUs.
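To put that 8-GPU threshold in perspective, here is a hedged estimate of what a week of continuous training on a full 8-GPU allocation costs; the $2/GPU-hour rate is an assumption, and actual on-demand prices vary widely by GPU model and provider:

```python
def training_cost(num_gpus, hours, rate_per_gpu_hour):
    """On-demand training cost; the hourly rate here is an assumption."""
    return num_gpus * hours * rate_per_gpu_hour

# One week (168 hours) on 8 GPUs at an assumed $2/GPU-hour.
print(f"${training_cost(8, 24 * 7, 2.0):,.0f}")  # $2,688
```

A few thousand dollars a week is within reach of a small business for a focused training run, while scaling to hundreds of GPUs multiplies this into territory only well-funded labs can sustain.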
As mentioned earlier, Hugging Face offers automatic training options for select tasks.
While deploying a model, to date, mostly happens in the cloud, smaller models are increasingly being deployed on end-user devices like smartphones. These two cases are examined separately below.
Deploying a model in the cloud
Provisioning compute driven by demand has become critical when deploying applications powered by models, given the cost of HPC. While serverless CPUs have been around for a while, the serverless GPU market was virtually absent until recently. Startups like replicate.ai, pipeline.ai, and banana.dev are attempting to fill that void. They enable cost-effective deployment of models without having to explicitly provision machines with GPUs (serverless GPUs; this phrase is used in this post as a catchall term for all forms of high performance compute). Users only pay for a GPU when it is used. Algorithmia, an early provider of serverless GPUs, was acquired by DataRobot last year.
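The economics of serverless GPUs reduce to utilization. A sketch with assumed prices (a dedicated GPU instance at $2/hour versus serverless billing of $0.0005 per second of actual inference; both figures are illustrative, not quotes from any provider) shows how low utilization tips the balance:

```python
def monthly_cost_dedicated(rate_per_hour, hours=730):
    """Always-on instance: pay for every hour, used or not."""
    return rate_per_hour * hours

def monthly_cost_serverless(rate_per_second, utilization, hours=730):
    """Serverless: pay only for the seconds the GPU actually runs."""
    return rate_per_second * utilization * hours * 3600

# Assumed illustrative prices; 5% utilization is typical of a bursty app.
dedicated = monthly_cost_dedicated(2.0)             # $1,460/month
serverless = monthly_cost_serverless(0.0005, 0.05)  # ~$66/month
print(f"dedicated ${dedicated:,.0f} vs serverless ${serverless:,.0f}")
```

Under these assumptions, an application whose GPU sits idle 95% of the time pays over 20x more on a dedicated instance, which is exactly the gap the serverless GPU startups are targeting.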
The challenges of deploying large language models have opened up another market: large model hosting. Users can directly use pretrained models hosted in the cloud, or custom fine-tuned versions of them, also hosted in the cloud. Access to these models during deployment is through a metered API, the revenue source for the hosted model approach. OpenAI, for instance, allows large models to be fine-tuned in the cloud, with access to the model made possible through a metered API.
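Metered APIs for large models typically bill per token processed. A sketch of what that metering implies for an application budget; the $0.02 per 1,000 tokens rate is hypothetical, not any provider's actual price:

```python
def api_cost(requests, tokens_per_request, price_per_1k_tokens):
    """Monthly bill for a metered large-model API; the price is assumed."""
    return requests * tokens_per_request * price_per_1k_tokens / 1000

# 100,000 requests/month averaging 500 tokens each,
# at an assumed $0.02 per 1K tokens.
print(f"${api_cost(100_000, 500, 0.02):,.0f}")  # $1,000
```

Because the bill scales linearly with traffic and prompt length, usage-based pricing makes hosted large models accessible at small scale while still generating revenue for the host as an application grows.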
Links for deploying in the cloud with serverless GPUs
Links for API-driven access to large models at deployment
Deploying a model on a smartphone
With continued hardware improvements, smartphones are able to run models locally for certain applications. PlayTorch recently announced a toolkit for rapidly prototyping applications powered by models. Models are downloaded to the device as needed and cached. The snappy user experience of real-time object detection and classification, even on an older generation iPhone, is evidence that we are likely to see more mobile applications powered by models running on device. While any existing iPhone or Android application could bundle a model, the noteworthy aspect of PlayTorch is the ease of creating and deploying an application powered by a model.
Link for creating and deploying model-powered apps on a smartphone
The compute landscape is witnessing rapid transformation along many dimensions, from hardware to new business models that address the challenges of large model deployment through metered APIs, or that drive down inference costs for user-uploaded models with "serverless GPUs". While this transformation is in part catalyzed by the objective of creating compelling applications, the primary driving force behind the progress is perhaps open and collaborative innovation.
It would not be inaccurate to say that the machine learning community, unlike any other community in the sciences or engineering disciplines, owes its progress in large part to the open sharing of papers, code, and even models that cost hundreds of thousands of dollars to train. The recent trend of large language models demanding ever greater training spend, and the high cost of training even vision models, may seem to threaten this open sharing culture, at least for trained models. Startups like Stability.ai are trying to counter this threat by both granting open access to, and releasing, trained models. This is accomplished by funding collaborations of research groups who would otherwise not be able to do such work given the cost (unless they decide to work for a company with deep pockets), and by leveraging the public release of trained models to improve applications whose commercial access becomes a revenue source.
This open strategy appears to be having an impact beyond the machine learning community. End users are unleashing their creativity, either by directly interacting with trained models in the cloud or, in some cases, by downloading models onto their laptops and tinkering.