Deploying Open LLMs with LLAMA-CPP Server: A Step-by-Step Guide

Learn how to install and set up the LLAMA-CPP server to serve open-source large language models, and how to make requests via cURL, the OpenAI client, and Python's requests package. Optimize for local and remote deployment.

June 17, 2024


Unlock the power of open-source large language models (LLMs) with this comprehensive guide on deploying LLAMA-CPP Server. Discover how to efficiently serve multiple users with a single LLM, optimizing performance and accessibility for your AI-powered applications.

Installing LLAMA-CPP

The easiest way to get started with LLAMA-CPP is to use the Homebrew package manager to install it. This will work natively on both macOS and Linux machines. To install LLAMA-CPP on a Windows machine, you will need to use Windows Subsystem for Linux (WSL).

To install LLAMA-CPP using Homebrew, run the following command in your terminal:

brew install llama.cpp

This command downloads and installs the LLAMA-CPP package on your system. Once the installation is complete, the llama-server command is available for serving your models.
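A quick way to confirm the installation is to check that the llama-server binary is reachable on your PATH. The snippet below is a minimal sketch using only the Python standard library; the helper name find_llama_server is hypothetical and not part of LLAMA-CPP.

```python
import shutil


def find_llama_server(binary_name: str = "llama-server"):
    """Return the absolute path of the binary if it is on PATH, else None."""
    return shutil.which(binary_name)


path = find_llama_server()
if path:
    print(f"llama-server found at {path}")
else:
    print("llama-server not found on PATH; check the Homebrew installation")
```

If the binary is not found right after installing, opening a new terminal session so that PATH is reloaded is usually the first thing to try.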

Starting the LLAMA-CPP Server

To start the LLAMA-CPP server, follow these steps:

  1. If you have not already done so, install LLAMA-CPP using the Homebrew package manager:

    brew install llama.cpp

    This installs LLAMA-CPP on a macOS or Linux machine. On Windows, install it inside WSL (Windows Subsystem for Linux).

  2. Start the LLAMA-CPP server by running the following command:

    llama-server --hf-repo <hugging-face-repo-id> --hf-file <quantization-file>

    Replace <hugging-face-repo-id> with the Hugging Face repository ID of the model you want to serve, and <quantization-file> with the specific quantization file you want to use (e.g., the 4-bit quantized version in GGUF format). To serve a model file that is already on disk, pass its path with --model instead.

  3. The LLAMA-CPP server will start listening for incoming requests on localhost:8080 by default. You can customize the host address and port using the available options, such as --host and --port.

  4. The server supports various configuration options, including setting the maximum context window, batch size, and more. You can explore these options by running llama-server --help.

  5. Once the server is running, you can interact with it using different methods, such as cURL, the OpenAI client, or the Python requests package, as demonstrated in the following sections.

Remember, the LLAMA-CPP server is designed to provide a fast and efficient way to serve open-source large language models on your local machine or in a production environment. By leveraging the server, you can easily integrate these models into your applications and serve multiple users with a single GPU.
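The startup steps above can also be scripted. The following is a rough sketch, not an official launcher: it assumes llama-server accepts the --hf-repo, --hf-file, --host, and --port flags and exposes a /health endpoint that returns HTTP 200 once the model is loaded (confirm both with llama-server --help for your version). The function names are hypothetical.

```python
import subprocess
import time
import urllib.error
import urllib.request


def build_server_command(hf_repo, hf_file, host="127.0.0.1", port=8080):
    """Assemble the llama-server argument list (nothing is executed here)."""
    return [
        "llama-server",
        "--hf-repo", hf_repo,
        "--hf-file", hf_file,
        "--host", host,
        "--port", str(port),
    ]


def wait_until_healthy(base_url, timeout_s=120.0, interval_s=1.0):
    """Poll GET <base_url>/health until it answers 200 or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(base_url + "/health", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # not accepting connections yet, or still loading the model
        time.sleep(interval_s)
    return False


def start_server(hf_repo, hf_file, host="127.0.0.1", port=8080):
    """Launch llama-server in the background and wait until it is ready."""
    proc = subprocess.Popen(build_server_command(hf_repo, hf_file, host, port))
    if not wait_until_healthy(f"http://{host}:{port}"):
        proc.terminate()
        raise RuntimeError("llama-server did not become healthy in time")
    return proc
```

Calling start_server with your repository ID and quantization file returns the process handle, which you can later stop with proc.terminate(). The first launch may need a longer timeout, because the model file has to be downloaded before the server can load it.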

Making Requests to the LLAMA-CPP Server

There are several ways to interact with the LLAMA-CPP server and make requests to the served model:

  1. Using the cURL command:

    curl -X POST -H "Content-Type: application/json" -d '{"messages": [{"role": "user", "content": "Hello, how are you?"}], "max_tokens": 50}' http://localhost:8080/v1/chat/completions

    This sends a POST request to the chat completion endpoint running on localhost at port 8080. The chat endpoint expects a list of messages rather than a bare prompt.

  2. Using the OpenAI client:

    from openai import OpenAI

    # Point the client at the local LLAMA-CPP server. The key is a placeholder:
    # the client library requires one even when the server does not check it.
    client = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-no-key-required")
    client.models.list()

    response = client.chat.completions.create(
        model="chat-gpt-3.5",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Hello, how are you?"},
        ],
    )

    print(response.choices[0].message.content)

    This uses the OpenAI client (version 1.0 or later of the openai package) to talk to the LLAMA-CPP server, which exposes an OpenAI-compatible API.

  3. Using the Python Requests package:

    import requests

    url = "http://localhost:8080/v1/chat/completions"
    headers = {"Content-Type": "application/json"}
    # The chat endpoint expects a list of messages rather than a bare prompt
    data = {
        "messages": [{"role": "user", "content": "Hello, how are you?"}],
        "max_tokens": 50,
    }

    response = requests.post(url, headers=headers, json=data)
    print(response.json())

    This uses the Python Requests package to make a POST request to the chat completion endpoint.

In all these examples, the LLAMA-CPP server is running on the local host at port 8080, serving the specified model. You can customize the server configuration, such as the host address, port, and model, as needed.
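Whichever method you use, the chat completion endpoint returns JSON in the OpenAI response format, so the generated text sits under choices[0].message.content. The helper below is a small sketch for pulling that text out of a parsed response; extract_reply is a hypothetical name and the sample dictionary is hand-written for illustration, not captured server output.

```python
def extract_reply(response_json: dict) -> str:
    """Return the assistant's text from an OpenAI-style chat completion response."""
    choices = response_json.get("choices") or []
    if not choices:
        raise ValueError("response contains no choices")
    return choices[0]["message"]["content"]


# Hand-written sample that mirrors the OpenAI chat completion layout
sample = {
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "I am doing well."},
            "finish_reason": "stop",
        }
    ]
}

print(extract_reply(sample))
```

With the Requests example, you would call extract_reply(response.json()) instead of printing the whole payload.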

Customizing the LLAMA-CPP Server

LLAMA-CPP provides a highly customizable server that allows you to fine-tune the behavior of your LLM deployment. Here are some of the key options you can configure:

  1. Max Context Window: You can define the maximum context window size for the LLM, which determines the maximum length of the input sequence the model can process.

  2. Batch Size: LLAMA-CPP supports batching of prompts, allowing you to process multiple inputs simultaneously for improved throughput. You can configure the batch size to optimize performance.

  3. Host Address: By default, the LLAMA-CPP server listens on localhost, but you can change the host address to make the server accessible from other machines on your network.

  4. Port: The server listens on port 8080 by default, but you can specify a different port if needed.

  5. Model Path: LLAMA-CPP allows you to customize the path from which it loads the LLM model files, giving you flexibility in how you organize your model assets.

  6. Embedding Models: In addition to language models, LLAMA-CPP can also serve embedding models, allowing you to integrate both text generation and text encoding capabilities into your applications.

  7. Metrics Tracking: The LLAMA-CPP server can track various metrics, such as request latency and throughput, to help you monitor and optimize the performance of your deployment.

By leveraging these customization options, you can tailor the LLAMA-CPP server to your specific deployment requirements, whether you're running it in a production environment or using it for local development and experimentation.
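As an illustration of how these options map onto a launch command, the sketch below collects them in a small dataclass and turns them into an argument list. The flag names (--model, --ctx-size, --batch-size, --host, --port, --embedding, --metrics) are the ones I would expect from llama-server --help, but they can differ between versions, so treat them as assumptions to verify. ServerOptions, to_argv, the model path, and the default values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ServerOptions:
    model_path: str             # path to a local model file
    ctx_size: int = 4096        # maximum context window (illustrative value)
    batch_size: int = 512       # prompt batch size (illustrative value)
    host: str = "127.0.0.1"     # use "0.0.0.0" to accept connections from other machines
    port: int = 8080
    embedding: bool = False     # serve embeddings
    metrics: bool = False       # enable metrics tracking


def to_argv(opts: ServerOptions) -> list:
    """Translate the options into a llama-server argument list."""
    argv = [
        "llama-server",
        "--model", opts.model_path,
        "--ctx-size", str(opts.ctx_size),
        "--batch-size", str(opts.batch_size),
        "--host", opts.host,
        "--port", str(opts.port),
    ]
    if opts.embedding:
        argv.append("--embedding")
    if opts.metrics:
        argv.append("--metrics")
    return argv


print(" ".join(to_argv(ServerOptions("models/example.gguf", host="0.0.0.0", metrics=True))))
```

Keeping the options in one structure makes it easier to keep separate settings for local development and for a production deployment.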

Interacting with the LLAMA-CPP Server Using Different Methods

To interact with the LLAMA-CPP server, we can use various methods:

  1. Using the cURL Command:

    • Make a POST request to the chat completion endpoint running on localhost.
    • Provide the necessary headers and a data object containing the prompt and the maximum number of tokens to generate.
    • The response will include the generated text, as well as information about the generation process, such as temperature, top-P, top-K, and predicted tokens per second.

  2. Using the OpenAI Client:

    • Create an OpenAI client with the base URL set to the URL of the local LLAMA-CPP server.
    • Call the client's chat completion endpoint and provide a model name (e.g., chat-gpt-3.5).
    • Set the system prompt and the user prompt, then make the request to the server.
    • The response will be returned in the same format as the OpenAI API.

  3. Using the Requests Package (Python):

    • Define the URL and headers for the POST request.
    • Send several different messages to the server at the same time and observe how it handles the simultaneous requests.
    • The server will queue the requests and process them one at a time, without being overwhelmed.

By using these different methods, you can interact with the LLAMA-CPP server and serve multiple users with a single LLM and a single GPU. The server provides a flexible and customizable way to deploy open-source language models, allowing you to adjust various parameters to suit your specific needs.
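To try the multi-user scenario yourself, you can fire several requests at once from a thread pool. The sketch below uses only the Python standard library (urllib in place of the Requests package) and assumes the server from the earlier sections is listening on localhost:8080. The names ask and ask_all are hypothetical, and the send parameter exists only so the fan-out logic can be exercised without a running server.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

URL = "http://localhost:8080/v1/chat/completions"


def ask(prompt, url=URL, max_tokens=50):
    """Send one chat request and return the generated text."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=120) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


def ask_all(prompts, send=ask, max_workers=4):
    """Submit all prompts concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(send, prompts))
```

With the server running, ask_all(["Hello, how are you?", "Tell me a joke.", "What is an LLM?"]) submits the three prompts together. The server decides how to queue them, so each answer may take longer to arrive than it would for a single request.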


In this guide, we have explored the installation and usage of LLAMA-CPP, an open-source project for serving open-source large language models. We have learned how to install LLAMA-CPP on a local machine, start the server, and interact with it using various methods, including cURL, the OpenAI client, and the Python requests package.

We have also discussed the configuration options available in LLAMA-CPP, which let us customize the server to our specific needs by setting the maximum context window, batch size, and host address. Additionally, we have seen that LLAMA-CPP can track metrics, making it a suitable choice for production environments.

Finally, we have touched upon the practical applications of LLAMA-CPP, particularly in the context of deploying large language models for various use cases. The Rasa framework was mentioned as a potential application, and a link to a related course is provided in the video description.

Overall, this guide has provided an introduction to LLAMA-CPP and its capabilities, equipping you with the knowledge to start serving open-source large language models on your local machine or in a production environment.