Benchmarking Popular Open-Source LLMs: Llama 2, Falcon, and Mistral
In this blog, we summarize the various open-source LLMs that we have benchmarked. We benchmarked these models from a latency, cost, and requests-per-second perspective, which should help you evaluate whether a given model is a good choice for your business requirements. Please note that we don't cover qualitative performance in this article; there are different methods for comparing LLM quality, which can be found here.
Use Cases Benchmarked
The key use cases across which we benchmarked are:
- 1500 Input tokens, 100 output tokens (Similar to Retrieval Augmented Generation use cases)
- 50 Input tokens, 500 output tokens (Generation Heavy use cases)
Benchmarking Setup
For benchmarking, we used Locust, an open-source load-testing tool. Locust works by spawning users/workers that send requests in parallel. At the beginning of each test, we set the Number of Users and the Spawn Rate: the Number of Users is the maximum number of users that can run concurrently, while the Spawn Rate is how many users are spawned per second.
In each benchmarking test for a deployment config, we started from 1 user and gradually increased the Number of Users for as long as the RPS kept rising steadily. During the test, we also plotted the response times (in ms) and the total requests per second.
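As a reference, below is a minimal sketch of the kind of locustfile that drives such a load test; the endpoint path, payload, and token counts are illustrative assumptions, not the exact configuration from our runs.

```python
# locustfile.py -- minimal load-test sketch (illustrative only).
# The endpoint path, prompt, and max_new_tokens are assumptions.
from locust import HttpUser, task, between

class LLMUser(HttpUser):
    wait_time = between(1, 2)  # each simulated user waits 1-2 s between requests

    @task
    def generate(self):
        # text-generation-inference exposes POST /generate; the prompt and
        # max_new_tokens below are placeholders for the RAG-style use case
        self.client.post(
            "/generate",
            json={
                "inputs": "Summarize the following document: ...",
                "parameters": {"max_new_tokens": 100},
            },
        )
```

The test is then started with the target host, Number of Users, and Spawn Rate, for example: `locust -f locustfile.py --host http://<tgi-endpoint> --users 20 --spawn-rate 1`.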
In each of the two deployment configurations, we used the Hugging Face text-generation-inference model server, version 0.9.4, and passed model-specific parameters to the text-generation-inference image for each configuration.
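The sketch below shows how a text-generation-inference container is typically launched with these kinds of parameters; the model id, shard count, and token limits are illustrative assumptions rather than the exact values used in our runs.

```python
# launch_tgi.py -- illustrative launch of text-generation-inference v0.9.4.
# Model id, shard count, and token limits are assumptions for a
# Llama 2 7B, RAG-style (1500 input / 100 output tokens) configuration.
import subprocess

cmd = [
    "docker", "run", "--gpus", "all", "-p", "8080:80",
    "ghcr.io/huggingface/text-generation-inference:0.9.4",
    "--model-id", "meta-llama/Llama-2-7b-chat-hf",  # model to serve
    "--num-shard", "1",            # shard across 1 GPU (4 for the 70B/40B runs)
    "--max-input-length", "1500",  # maximum prompt length in tokens
    "--max-total-tokens", "1600",  # prompt + generated tokens per request
]
subprocess.run(cmd, check=True)
```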
LLMs Benchmarked
The 5 open-source LLMs benchmarked are Mistral 7B, Llama 2 7B, Llama 2 13B, Llama 2 70B, and Falcon 40B.
The following table summarizes the benchmarking results:
Model | Input / Output Tokens | Concurrent Users / Throughput (req/s) | GPU Type | AWS Machine Type (Cost/hr), Region: us-east-1 | GCP Machine Type (Cost/hr), Region: us-east4 | Azure Machine Type (Cost/hr), Region: East US (Virginia) | Sagemaker Instance Type (Cost/hr), Region: us-east-1 |
---|---|---|---|---|---|---|---|
Mistral 7b | 1500 Input, 100 Output | 7 users / 2.8 | A100 40 GB (Count: 1) | p4d.24xlarge (Spot: $7.79/hr, On-Demand: $32.77/hr) | a2-highgpu-1g (Spot: $1.21/hr, On-Demand: $3.93/hr) | Standard_NC24ads_A100_v4 (Spot: $0.95/hr, On-Demand: $3.67/hr) | ml.p4d.24xlarge (On-Demand: $37.68/hr) |
Mistral 7b | 50 Input, 500 Output | 40 users / 1.5 | A100 40 GB (Count: 1) | p4d.24xlarge (Spot: $7.79/hr, On-Demand: $32.77/hr) | a2-highgpu-1g (Spot: $1.21/hr, On-Demand: $3.93/hr) | Standard_NC24ads_A100_v4 (Spot: $0.95/hr, On-Demand: $3.67/hr) | ml.p4d.24xlarge (On-Demand: $37.68/hr) |
Llama 2 7b | 1500 Input, 100 Output | 20 users / 3.6 | A100 40 GB (Count: 1) | p4d.24xlarge (Spot: $7.79/hr, On-Demand: $32.77/hr) | a2-highgpu-1g (Spot: $1.21/hr, On-Demand: $3.93/hr) | Standard_NC24ads_A100_v4 (Spot: $0.95/hr, On-Demand: $3.67/hr) | ml.p4d.24xlarge (On-Demand: $37.68/hr) |
Llama 2 7b | 50 Input, 500 Output | 62 users / 3.5 | A100 40 GB (Count: 1) | p4d.24xlarge (Spot: $7.79/hr, On-Demand: $32.77/hr) | a2-highgpu-1g (Spot: $1.21/hr, On-Demand: $3.93/hr) | Standard_NC24ads_A100_v4 (Spot: $0.95/hr, On-Demand: $3.67/hr) | ml.p4d.24xlarge (On-Demand: $37.68/hr) |
Llama 2 13b | 1500 Input, 100 Output | 7 users / 1.4 | A100 40 GB (Count: 1) | p4d.24xlarge (Spot: $7.79/hr, On-Demand: $32.77/hr) | a2-highgpu-1g (Spot: $1.21/hr, On-Demand: $3.93/hr) | Standard_NC24ads_A100_v4 (Spot: $0.95/hr, On-Demand: $3.67/hr) | ml.p4d.24xlarge (On-Demand: $37.68/hr) |
Llama 2 13b | 50 Input, 500 Output | 23 users / 1.5 | A100 40 GB (Count: 1) | p4d.24xlarge (Spot: $7.79/hr, On-Demand: $32.77/hr) | a2-highgpu-1g (Spot: $1.21/hr, On-Demand: $3.93/hr) | Standard_NC24ads_A100_v4 (Spot: $0.95/hr, On-Demand: $3.67/hr) | ml.p4d.24xlarge (On-Demand: $37.68/hr) |
Llama 2 70b | 1500 Input, 100 Output | 15 users / 1.1 | A100 40 GB (Count: 4) | p4d.24xlarge (Spot: $7.79/hr, On-Demand: $32.77/hr) | a2-highgpu-4g (Spot: $4.85/hr, On-Demand: $15.73/hr) | Standard_NC96ads_A100_v4 (Spot: $3.82/hr, On-Demand: $14.69/hr) | ml.p4d.24xlarge (On-Demand: $37.68/hr) |
Llama 2 70b | 50 Input, 500 Output | 38 users / 0.8 | A100 40 GB (Count: 4) | p4d.24xlarge (Spot: $7.79/hr, On-Demand: $32.77/hr) | a2-highgpu-4g (Spot: $4.85/hr, On-Demand: $15.73/hr) | Standard_NC96ads_A100_v4 (Spot: $3.82/hr, On-Demand: $14.69/hr) | ml.p4d.24xlarge (On-Demand: $37.68/hr) |
Falcon 40b | 1500 Input, 100 Output | 16 users / 2 | A100 40 GB (Count: 4) | p4d.24xlarge (Spot: $7.79/hr, On-Demand: $32.77/hr) | a2-highgpu-4g (Spot: $4.85/hr, On-Demand: $15.73/hr) | Standard_NC96ads_A100_v4 (Spot: $3.82/hr, On-Demand: $14.69/hr) | ml.p4d.24xlarge (On-Demand: $37.68/hr) |
Falcon 40b | 50 Input, 500 Output | 75 users / 2.5 | A100 40 GB (Count: 4) | p4d.24xlarge (Spot: $7.79/hr, On-Demand: $32.77/hr) | a2-highgpu-4g (Spot: $4.85/hr, On-Demand: $15.73/hr) | Standard_NC96ads_A100_v4 (Spot: $3.82/hr, On-Demand: $14.69/hr) | ml.p4d.24xlarge (On-Demand: $37.68/hr) |
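To relate the throughput numbers above to the hourly machine prices, divide the hourly cost by the number of requests served per hour. The sketch below works this out for the Mistral 7b RAG-style row (2.8 req/s on one A100 40 GB, GCP on-demand at $3.93/hr); swap in your own cloud and pricing.

```python
# Back-of-the-envelope cost per request from the table above
throughput_rps = 2.8   # Mistral 7b, 1500 input / 100 output tokens
cost_per_hour = 3.93   # GCP a2-highgpu-1g, on-demand, $/hr

requests_per_hour = throughput_rps * 3600
cost_per_request = cost_per_hour / requests_per_hour
print(f"~${cost_per_request:.5f} per request, "
      f"~${cost_per_request * 1000:.2f} per 1,000 requests")
```

At these numbers, that works out to roughly $0.39 per 1,000 requests on on-demand pricing, before accounting for any idle capacity.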
Detailed Benchmarking Blogs for Each LLM
For each of the models mentioned above, refer to the detailed benchmarking blogs linked below: