If you’re running LLMs locally, you’ve probably used Ollama or LM Studio. They’re both excellent tools, but I hit some limitations. LM Studio is primarily a desktop app that can’t run truly headless, while Ollama requires SSH-ing into your server every time you want to switch models or adjust parameters.
For more control, there’s llama-server from llama.cpp. It’s powerful and lightweight: it supports virtually every model format, offers extensive configuration options, provides OpenAI-compatible APIs, and, in my opinion, is noticeably faster than Ollama. But it’s CLI-only – want to switch models? SSH in and restart.
I wanted to manage my home LLM server from anywhere without constantly SSH-ing just to switch models.
Enter Llamactl
So I built what was missing: a management layer on top of llama-server. Meet llamactl – a management server and proxy that gives you the power of llama-server with the convenience of remote management.
You get a modern React web dashboard for visual management, REST APIs for programmatic control, and the ability to create, start, and stop instances with just a few clicks. Need a 7B model for quick responses and a 70B model for complex reasoning? Run both. Want to switch between them based on the task? Just change the model name in your API request.
It’s also OpenAI API compatible, so your existing tools, scripts, and integrations work without modification – just point them to your llamactl server instead of OpenAI’s endpoints. Want to use Open WebUI for a ChatGPT-like interface? Just configure it to use your llamactl server as the OpenAI API base URL, and you’re instantly chatting with any of your local models.
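For example, here’s roughly what that looks like with the official OpenAI Python client. The host, port, and instance name are placeholders from my own setup, not values llamactl prescribes:

```python
# Minimal sketch: point the standard OpenAI client at a llamactl server.
# The base URL and the model/instance name are placeholders for whatever
# you've configured in llamactl; adjust them to your own setup.
from openai import OpenAI

client = OpenAI(
    base_url="http://your-llamactl-host:8080/v1",  # llamactl's OpenAI-compatible endpoint
    api_key="your-inference-api-key",              # the inference key you configured
)

response = client.chat.completions.create(
    model="gemma-3-27b",  # name of the llamactl instance to route to
    messages=[{"role": "user", "content": "Summarize what llamactl does in one sentence."}],
)
print(response.choices[0].message.content)
```

The same pattern applies to any OpenAI-compatible SDK or tool: only the base URL and API key change.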
What Llamactl Brings to the Table
- Multiple Model Serving: Run different models simultaneously — a 7B model for speed, a 70B model for quality, or a vision model for image analysis. Switch between them by simply changing the model name in your API requests (see the sketch after this list).
- Web Dashboard: A modern React UI that beats SSH-ing into servers. Create instances, monitor health, view logs, and manage everything from your browser.
- Smart Resource Management: Idle timeout automatically stops unused instances to save resources. LRU eviction ensures your most-used models stay available. Configurable instance limits prevent resource exhaustion.
- On-Demand Starting: Point your application at a model that isn’t running? Llamactl automatically starts it for you. No more “is the server running?” guesswork.
- API Key Authentication: Separate keys for management operations vs inference requests.
- State Persistence: Server restarts don’t kill your carefully configured instances. Everything comes back exactly as you left it.
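To make the multiple-model point concrete, here’s a rough sketch of task-based routing. The instance names and URL are illustrative, and the on-demand behavior depends on how you’ve configured llamactl:

```python
# Sketch: switch models per request by changing only the "model" field.
# Instance names and the server URL are illustrative, not llamactl defaults.
from openai import OpenAI

client = OpenAI(
    base_url="http://your-llamactl-host:8080/v1",
    api_key="your-inference-api-key",  # inference key, separate from the management key
)

def ask(prompt: str, heavy: bool = False) -> str:
    # Quick questions go to a small, fast instance; harder ones to a large instance.
    # If the target instance is stopped, llamactl's on-demand starting brings it up.
    model = "quality-70b" if heavy else "fast-7b"
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

print(ask("What's the capital of France?"))
print(ask("Walk through the trade-offs of quantizing a 70B model.", heavy=True))
```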
My Setup: LLMs Anywhere, Anytime
Here’s how I’ve configured my home LLM infrastructure to work seamlessly from anywhere:
I run my LLMs on a Mac Mini M4 Pro at home. The 48GB of unified memory gives me enough room to run larger models like Gemma 3 27B or Qwen 3 Coder 32B and switch between them as needed.
Both my home Mac Mini and a cloud VPS are connected via Tailscale. This creates a secure, private network that lets them communicate as if they were on the same LAN, regardless of where I am.
The setup is straightforward:
- Llamactl runs on the Mac Mini, managing my llama-server instances
- Open WebUI also runs locally, providing a ChatGPT-like interface
- Traefik runs on my VPS as a reverse proxy
Traefik on the VPS proxies requests through the Tailscale network to my home setup, giving me a clean public URL (like llm.mydomain.com) that securely tunnels to my home lab.
This setup lets me deploy any model, switch between them, and chat with my LLMs from anywhere with internet access. No VPN client needed on the device I’m connecting from, no SSH required, and Tailscale handles the security with zero-trust networking.
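From any device, the client code is the same as on the LAN; only the base URL changes to the public hostname Traefik exposes (llm.mydomain.com stands in for my real domain):

```python
# Same OpenAI-compatible call as before, just through the public Traefik URL.
# The domain and instance name are placeholders from my setup.
from openai import OpenAI

client = OpenAI(
    base_url="https://llm.mydomain.com/v1",
    api_key="your-inference-api-key",
)

reply = client.chat.completions.create(
    model="qwen3-coder-32b",  # any instance managed by llamactl at home
    messages=[{"role": "user", "content": "Hello from the road!"}],
)
print(reply.choices[0].message.content)
```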
What’s Next
Llamactl is still evolving, and I have several ideas for future improvements:
- Enhanced Admin Dashboard: While the current web UI covers the essentials, I’d like to add proper user authentication with usernames and passwords, and better management of inference API keys.
- Multiple Backend Support: Right now it’s focused on llama-server, but adding support for other inference engines like vLLM or mlx_lm.server could make it even more versatile.
- Built-in Chat Interface: Open WebUI is fantastic, but having a simple chat UI built directly into llamactl could reduce the setup complexity for basic use cases.
- Better Resource Scheduling: Smart load balancing and automatic model placement based on hardware capabilities.
Get Involved
Llamactl is open source and available on GitHub. You can find complete documentation and guides at llamactl.org. Whether you’re interested in trying it out, reporting bugs, suggesting features, or contributing code, I welcome all feedback and contributions.
If you’re tired of SSH-ing into servers just to switch LLM models, give llamactl a try. It might just be the management layer you didn’t know you needed.