
Large Model for Everyone

Alpa is a system for training and serving gigantic machine learning models.
Alpa makes training and serving large models like GPT-3 simple, affordable, and accessible to everyone.

Try Live Generation (OPT-175B)
Host Your Own Service (OPT, BLOOM, CodeGen)

Free, Unlimited OPT-175B Text Generation

Warning: This model might generate offensive content. As a free service, it has no safety measures in place.

[Generation controls: max output length, temperature, and top-p (defaults: 64, 0.7, 0.7)]
Please be patient. Your generation may take X seconds.

Like the results? ⭐ Support Alpa development by starring Alpa on GitHub


Frequently Asked Questions

What is Alpa?

Alpa is an open-source system for training and serving large-scale neural networks. Alpa aims to automate large-scale distributed training and serving with just a few lines of code. Alpa was initially developed by researchers in the Sky Lab at UC Berkeley. The core techniques used in Alpa are described in a paper published at OSDI 2022. The Alpa community is growing, with new contributors from Google, Amazon, AnyScale, and more.
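To give a sense of the "few lines of code" claim, here is a training-step sketch in the style of Alpa's README; a minimal sketch, assuming a JAX/Flax setup where state is a Flax TrainState and batch is a dict of arrays:

    import alpa
    import jax.numpy as jnp

    # The decorator is the only Alpa-specific change: it compiles the step
    # function to run across a whole GPU cluster, choosing the parallelization
    # strategy automatically.
    @alpa.parallelize
    def train_step(state, batch):
        def loss_fn(params):
            out = state.apply_fn(params, batch["x"])
            return jnp.mean((out - batch["y"]) ** 2)

        grads = alpa.grad(loss_fn)(state.params)  # drop-in for jax.grad
        return state.apply_gradients(grads=grads)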

What is a language model?

A language model is a probability distribution over sequences of words. It predicts the next word based on all the previous words. It is useful for a variety of AI applications, such as the auto-completion in your email or a chatbot service. For more information, check out the Wikipedia page on language models.
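Formally, the probability of a sequence factorizes into next-word predictions via the chain rule:

    P(w_1, \dots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \dots, w_{i-1})

Generation simply samples the next word from this conditional distribution, appends it to the context, and repeats.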

What is GPT-3?

GPT-3 is a very large language model, with 175 billion parameters, that uses deep learning to produce human-like text. Many researchers and news articles have described GPT-3 as "one of the most interesting and important AI systems ever produced". GPT-3 is increasingly used as a backbone in the latest NLP research and applications.

Due to its gigantic size, training and serving GPT-3 are very difficult and expensive, and pose significant challenges to the underlying software systems. The original GPT-3 trained by OpenAI is closed source and offered as a paid service: users pay for every token generated.

What is OPT-175B?

OPT-175B is a GPT-3-equivalent model trained by Meta. It is by far the largest pretrained language model available, with 175 billion parameters. You can request access to the trained weights by filling out this form. For details on OPT-175B's performance, see the OPT paper.

How should I write my prompts?

You can start with the provided examples. Avoid spaces at the end of your query. New lines are great, though. More examples can be found in the appendix of the OPT paper.

Why are the results different each time, and what do the generation parameters mean?

Right now we use random sampling, so the generated result may differ every time you click "generate". The temperature controls how sharp the sampling distribution is: a lower temperature pushes the generator to pick higher-scoring tokens. Top-p sampling chooses from the smallest possible set of words whose cumulative probability exceeds the probability p: a small value of p prevents the model from choosing low-scoring tokens. See this page from Hugging Face for a more detailed description of sampling strategies.
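As a rough illustration of how these two parameters interact (a generic sketch, not Alpa's actual decoding code; logits stands for the model's per-token scores):

    import numpy as np

    def sample_next_token(logits, temperature=0.7, top_p=0.7):
        # Temperature scaling: lower temperature sharpens the distribution.
        scaled = logits / temperature
        probs = np.exp(scaled - np.max(scaled))  # numerically stable softmax
        probs /= probs.sum()

        # Top-p (nucleus) sampling: keep the smallest set of tokens whose
        # cumulative probability exceeds p, then renormalize and sample.
        order = np.argsort(probs)[::-1]
        cumulative = np.cumsum(probs[order])
        cutoff = np.searchsorted(cumulative, top_p) + 1
        keep = order[:cutoff]
        kept_probs = probs[keep] / probs[keep].sum()
        return np.random.default_rng().choice(keep, p=kept_probs)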

Can I tune other generation arguments?

This web interface exposes only three arguments for simplicity, although our backend supports a diverse set of generation techniques and arguments.

We are developing a RESTful API to expose the full set of arguments. Stay tuned. Meanwhile, if you want to try out different generation techniques and hyperparameters now, you can set up your own OPT-175B service using Alpa, starting from here.
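Once such an API exists, a client call could look roughly like this; a minimal sketch in which the URL, endpoint path, and JSON field names are all placeholders, not a finalized interface:

    import requests

    # Hypothetical endpoint and payload; the real API may differ.
    resp = requests.post(
        "http://localhost:8001/completions",
        json={
            "prompt": "Computer science is the study of",
            "max_tokens": 64,
            "temperature": 0.7,
            "top_p": 0.7,
        },
    )
    print(resp.json())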

Do you store my inputs?

We do not store the content of your inputs. We only log traffic patterns, such as the timestamp when you submitted your input and the length of your input.

How does Alpa compare to existing systems?

At a high level, Alpa is more automatic, scalable, and cost-effective than existing systems.

In more detail: if you are an ML developer or data scientist looking for a system that can train or serve large models like GPT-3, Alpa provides state-of-the-art performance while requiring the least systems expertise to set up. Alpa can also train or serve large models on older generations of (hence cheaper) GPUs, such as the 40GB A100, V100, T4, and M60, which are common in in-house clusters and more accessible to many people.

If you are a systems developer aiming to build better training or serving systems, Alpa, as a compiler, offers the most flexibility for trying out various ML parallelization methods (inter- and intra-operator parallelism) and the richest coverage of big model architectures (GPT-3, MoE, WideResNet, etc.). Alpa can be a good starting point for your prototyping.
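For instance, switching parallelization methods is a one-argument change; a minimal sketch, reusing the train_step from above, where the method classes follow the names in Alpa's documentation (exact signatures may vary across versions):

    import alpa

    # Intra-operator parallelism: shard individual operators across devices.
    shard_step = alpa.parallelize(train_step, method=alpa.ShardParallel())

    # Combined inter- and intra-operator parallelism: pipeline the model into
    # stages and shard within each stage, feeding micro-batches through.
    pipeshard_step = alpa.parallelize(
        train_step, method=alpa.PipeshardParallel(num_micro_batches=16)
    )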

If you are an amateur in ML/NLP/systems, well 😛, you can play with OPT-175B inference for free, while existing services charge you for each token generated.

How many GPUs do I need to host my own OPT-175B service?

It depends on the type of GPUs you use. A hard constraint for now is that the total GPU memory in the cluster needs to be greater than 350GB to successfully run model inference. Many existing training or serving systems rely on the latest generations of GPUs with the largest memory capacity, such as the 80GB A100. In contrast, Alpa, thanks to its more powerful backend, can serve OPT-175B with more flexible parallelisms on older generations of GPUs, such as the 40GB A100, V100, T4, and M60.

For example, if you choose 16GB V100 GPUs, you would need ⌈350 / 16⌉ = 22 V100 GPUs to run the service.
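The count is simply the 350GB memory requirement divided by the per-GPU memory, rounded up:

    import math

    TOTAL_MEMORY_GB = 350  # minimum cluster GPU memory for OPT-175B inference

    def gpus_needed(gpu_memory_gb):
        # Round up: a fraction of a GPU still requires a whole GPU.
        return math.ceil(TOTAL_MEMORY_GB / gpu_memory_gb)

    print(gpus_needed(16))  # 16GB V100 -> 22
    print(gpus_needed(40))  # 40GB A100 -> 9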

We are working on a feature to enable serving models even if you do not have enough GPU memory. Stay tuned.

How is this free service funded?

Alpa does not require the latest generation of GPUs (such as the 80GB A100), which reduces the machine cost. With that, we leverage older generations of hardware provided by our sponsors: MBZUAI and the Sky Lab at UC Berkeley.

If you are interested in any form of donation or sponsorship to support Alpa's development, please get in touch with the Alpa authors on the Alpa Slack.

Do I need to pay to use this service?

No. This is a public service provided by the Alpa authors and sponsors. Your usage of this service is subject to Alpa's open-source license. Your usage of the OPT-175B model is subject to Meta's OPT-175B license, which limits use to research purposes.

Why does the model generate offensive content?

This is a well-known problem with large language models trained on text corpora collected from the Internet. There is an active line of research in the NLP and ML communities on addressing this issue; see this article. We will incorporate the latest research results into this service in future iterations.

What is the relationship between Alpa and Ray?

Alpa currently runs on top of a Ray cluster and uses some of Ray's functionality to coordinate distributed processes. In contrast to Ray, however, Alpa is designed as a compiler for high-performance, large-scale distributed machine learning training and serving.
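Concretely, the layering looks like this; a minimal sketch following the pattern in Alpa's documentation (the shell command assumes a single-node Ray setup):

    # First start a Ray cluster, e.g. `ray start --head` on the head node.
    import alpa

    # Attach Alpa to the running Ray cluster; Alpa then manages the
    # cluster's GPUs for parallelized training and serving.
    alpa.init(cluster="ray")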

Alpa Partners

Interested in contributing to the Alpa project?