Audience: Developers (advanced)

llama.cpp Server

While Ollama is the default model server for NeuralDrive, the system also includes the llama-server binary for users who require lower-level control or specialized inference features.

When to Use llama.cpp

The llama-server is ideal for developers who need more granular configuration than the Ollama API provides.

Feature        | Ollama                    | llama.cpp
Ease of Use    | High (managed downloads)  | Medium (manual model placement)
Model Format   | Managed blobs             | Direct GGUF loading
Control        | Standardized API          | Fine-grained sampling & batching
Resource Usage | Integrated management     | Lower overhead per instance
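
As an illustration of the fine-grained control, llama-server exposes an HTTP completion endpoint that accepts per-request sampling parameters. The sketch below assumes a server already running on port 8080; the prompt and sampling values are placeholders:

# Illustrative request against a llama-server instance on port 8080
curl http://localhost:8080/completion \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Explain KV caching in one sentence.",
    "n_predict": 128,
    "temperature": 0.7,
    "top_k": 40,
    "top_p": 0.9
  }'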

Enabling the Server

The llama-server is included in the base image but is not enabled by default. You can start it manually or configure a custom systemd service.

# Example command to start the llama.cpp server
llama-server --model /path/to/my-model.gguf --port 8080 --host 0.0.0.0
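
If you want the server to start automatically at boot, one approach is a custom systemd unit. The sketch below is illustrative only, not a unit shipped with NeuralDrive; the binary path, model path, and unit name are assumptions you should adapt to your installation:

# Sketch of a custom unit file (paths and unit name are placeholders)
sudo tee /etc/systemd/system/llama-server.service > /dev/null <<'EOF'
[Unit]
Description=llama.cpp inference server
After=network.target

[Service]
ExecStart=/usr/bin/llama-server --model /path/to/my-model.gguf --port 8080 --host 0.0.0.0
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

# Reload unit files and start the service now and on boot
sudo systemctl daemon-reload
sudo systemctl enable --now llama-server.service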

Configuration Options

The server supports a wide variety of command-line flags to tune performance; a combined example follows the list:

  • --ctx-size: (Integer) Maximum context window size, in tokens.
  • --n-gpu-layers: (Integer) Number of model layers to offload to the GPU.
  • --threads: (Integer) Number of CPU threads to use.
  • --batch-size: (Integer) Batch size for prompt processing.
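
These flags can be combined with the basic startup command shown earlier. The values below are placeholders; adjust them for your hardware and model:

# Illustrative invocation combining the tuning flags (values are placeholders)
llama-server --model /path/to/my-model.gguf \
  --port 8080 --host 0.0.0.0 \
  --ctx-size 4096 \
  --n-gpu-layers 32 \
  --threads 8 \
  --batch-size 512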

Important Note: Model Formats

It's important to understand the difference in model storage between the two servers. Ollama stores models in a proprietary blob format within /var/lib/neuraldrive/models/. These blobs are not directly compatible with the llama-server. To use a model with llama.cpp, you must provide a raw .gguf file.
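
If you are unsure whether a file is a raw GGUF model, a quick check is its magic bytes: GGUF files begin with the ASCII string GGUF. A minimal sketch (the path is a placeholder):

# GGUF files start with the 4-byte magic "GGUF"
head -c 4 /path/to/my-model.gguf
# Expected output: GGUF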

For more information on model formats, see Understanding LLM Models. To further optimize your inference setup, refer to Performance Tuning.