Audience: Developers (advanced)

llama.cpp Server

While Ollama is the default model server for NeuralDrive, the system also includes the llama-server binary for users who require lower-level control or specialized inference features.

When to Use llama.cpp

The llama-server is ideal for developers who need more granular configuration than the Ollama API provides.

Feature        | Ollama                    | llama.cpp
Ease of Use    | High (managed downloads)  | Medium (manual model placement)
Model Format   | Managed blobs             | Direct GGUF loading
Control        | Standardized API          | Fine-grained sampling & batching
Resource Usage | Integrated management     | Lower overhead per instance
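
As an illustration of the fine-grained control, llama-server exposes an HTTP completion endpoint that accepts per-request sampling parameters. The sketch below assumes a server already running on port 8080; the prompt and sampling values are placeholders:

# Illustrative request against a llama-server instance on port 8080
curl http://localhost:8080/completion \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Explain KV caching in one sentence.",
    "n_predict": 128,
    "temperature": 0.7,
    "top_k": 40,
    "top_p": 0.9
  }'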

Enabling the Server

The llama-server is included in the base image but is not enabled by default. You can start it manually or configure a custom systemd service.

# Example command to start the llama.cpp server
llama-server --model /path/to/my-model.gguf --port 8080 --host 0.0.0.0
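
If you want the server to start automatically at boot, one approach is a custom systemd unit. The sketch below is illustrative only, not a unit shipped with NeuralDrive; the binary path, model path, and unit name are assumptions you should adapt to your installation:

# Sketch of a custom unit file (paths and unit name are placeholders)
sudo tee /etc/systemd/system/llama-server.service > /dev/null <<'EOF'
[Unit]
Description=llama.cpp inference server
After=network.target

[Service]
ExecStart=/usr/bin/llama-server --model /path/to/my-model.gguf --port 8080 --host 0.0.0.0
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

# Reload unit files and start the service now and on boot
sudo systemctl daemon-reload
sudo systemctl enable --now llama-server.service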

Configuration Options

The server supports a wide variety of command-line flags to tune performance; a combined example follows the list:

  • --ctx-size: (Integer) Maximum context window size, in tokens.
  • --n-gpu-layers: (Integer) Number of model layers to offload to the GPU.
  • --threads: (Integer) Number of CPU threads to use.
  • --batch-size: (Integer) Batch size for prompt processing.
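
These flags can be combined with the basic startup command shown earlier. The values below are placeholders; adjust them for your hardware and model:

# Illustrative invocation combining the tuning flags (values are placeholders)
llama-server --model /path/to/my-model.gguf \
  --port 8080 --host 0.0.0.0 \
  --ctx-size 4096 \
  --n-gpu-layers 32 \
  --threads 8 \
  --batch-size 512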

Important Note: Model Formats

It's important to understand the difference in model storage between the two servers. Ollama stores models in a proprietary blob format within /var/lib/neuraldrive/models/. These blobs are not directly compatible with the llama-server. To use a model with llama.cpp, you must provide a raw .gguf file.
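
If you are unsure whether a file is a raw GGUF model, a quick check is its magic bytes: GGUF files begin with the ASCII string GGUF. A minimal sketch (the path is a placeholder):

# GGUF files start with the 4-byte magic "GGUF"
head -c 4 /path/to/my-model.gguf
# Expected output: GGUF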

For more information on model formats, see Understanding LLM Models. To further optimize your inference setup, refer to Performance Tuning.