I’ve been dying to find a reliable lightweight local model to pair with my agent Goose so I can run everything locally and keep my work private. Every time a new local model drops, I think “is this it? are we finally here?” and each time, I’m disappointed.
Today was no different 😔
Ollama’s announcement of devstral stopped me in my tracks. They specifically touted that the Mistral AI model was “designed to excel at agentic coding tasks”.
When testing models for agentic capabilities, I run through a sequence of prompts. Here’s how devstral performed.
Note that I lowered the temperature of devstral to 0.2, which is better for exact tasks like tool calls. However, at its default temperature, the results were comparable.
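If you want to reproduce that setting outside of Goose, Ollama’s REST API accepts a per-request options block. This is just a minimal sketch, assuming a local Ollama server with devstral pulled; the prompt is a throwaway placeholder:

```python
import requests

# Ask the local Ollama server for a single, non-streaming completion
# with the sampling temperature pinned to 0.2.
response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "devstral",          # whatever `ollama list` shows for your pull
        "prompt": "Reply with the single word: ready",
        "stream": False,
        "options": {"temperature": 0.2},
    },
    timeout=120,
)
print(response.json()["response"])
```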
Test 1: Tool Calling
Create a file at ~/workspace/loose-goose/tool-test.txt with the contents "Hello World".
Use the write tool. Do not ask for confirmation. Just do it.
It did call the appropriate tool but hallucinated the location. I don’t know who Paul Joshua is but hiiii lol.
─── text_editor | developer ──────────────────────────
path: /Users/pauljoshua/workspace/loose-goose/tool-test.txt
command: write
file_text: Hello World.
Ok, let’s try the test again with the explicit absolute location instead of the ~. I started a fresh session with this prompt:
Create a file at /Users/angiej/workspace/loose-goose/tool-test.txt with the contents "Hello World".
Use the write tool. Do not ask for confirmation. Just do it.
This time it didn’t attempt tool calling 😫
I'm here to help, but I'm unable to directly create files or execute commands on your system. However, I can guide you through the process if you'd like! Here are the steps you can follow:
1. Open a text editor (like Notepad, Sublime Text, or even a code editor like Visual Studio Code).
2. Write the content "Hello World".
3. Save the file at `/Users/angiej/workspace/loose-goose/tool-test.txt`.
Ok, so what happened here? Even though I lowered the temperature, devstral still seems to be in chatbot mode by default vs agent mode. It doesn’t consistently treat tool use as the expected outcome.
In the first attempt, it took the hint and called the tool. But in the second, even though the prompt was nearly identical (and arguably clearer), it defaulted to a help desk style response.
That level of unpredictability is exactly why we test.
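If you want to check for yourself whether a model will emit a structured tool call at all, you can hit Ollama’s chat endpoint directly with a tool definition and see whether tool_calls comes back. A rough sketch, assuming a recent Ollama build with tool-calling support; the write_file tool below is a stand-in I made up, not Goose’s actual text_editor definition:

```python
import requests

# A stand-in tool definition -- just enough to see whether the model
# produces a structured call or falls back to prose.
tools = [{
    "type": "function",
    "function": {
        "name": "write_file",
        "description": "Write text content to a file at the given path.",
        "parameters": {
            "type": "object",
            "properties": {
                "path": {"type": "string"},
                "file_text": {"type": "string"},
            },
            "required": ["path", "file_text"],
        },
    },
}]

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "devstral",
        "messages": [{"role": "user", "content":
            'Create /tmp/tool-test.txt with the contents "Hello World". Use the write_file tool.'}],
        "tools": tools,
        "stream": False,
        "options": {"temperature": 0.2},
    },
    timeout=120,
).json()

# Agent mode: tool_calls holds structured calls. Chatbot mode: it's empty
# and the content field holds help-desk prose instead.
print(resp["message"].get("tool_calls", []))
print(resp["message"].get("content", ""))
```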
Let’s try this test once more, and this time I’ll give it the exact name of the tool to call:
Create a file at /Users/angiej/workspace/loose-goose/tool-test.txt with the contents "Hello World".
Use the text_editor tool. Do not ask for confirmation. Just do it.
This time it attempted to call the tool but the structure of the tool call changed 😳
─── text_editor | developer ──────────────────────────
content: Hello World
file_path: ...
I apologize, but I encountered an error because the 'command' parameter was missing. Let me fix that for you.
I'll now create a file at `/Users/angiej/workspace/loose-goose/tool-test.txt` with the contents "Hello World".
To compare, the first tool call was:
─── text_editor | developer ──────────────────────────
path: /Users/pauljoshua/workspace/loose-goose/tool-test.txt
command: write
file_text: Hello World
And the second was:
─── text_editor | developer ──────────────────────────
content: Hello World
file_path: ...
That looks like the model hallucinating the tool’s own schema: it invented `content` and `file_path` and dropped the required `command` parameter. Agentic models need to treat tool interfaces like APIs, not suggestions.
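This is the kind of thing a harness can catch mechanically before executing anything. A minimal sketch of that check, using the parameter names from the first (well-formed) call above; the required set is inferred from that output, not pulled from Goose’s source:

```python
# Required parameters for a text_editor write, inferred from the first
# well-formed tool call above -- not Goose's actual schema.
REQUIRED_PARAMS = {"command", "path", "file_text"}

def validate_tool_call(name: str, args: dict) -> list[str]:
    """Return a list of problems with a proposed tool call, empty if it looks sane."""
    problems = []
    if name != "text_editor":
        problems.append(f"unknown tool: {name}")
    missing = REQUIRED_PARAMS - args.keys()
    if missing:
        problems.append(f"missing parameters: {sorted(missing)}")
    unexpected = args.keys() - REQUIRED_PARAMS
    if unexpected:
        problems.append(f"unexpected parameters: {sorted(unexpected)}")
    return problems

# The second call devstral produced:
print(validate_tool_call("text_editor", {"content": "Hello World", "file_path": "..."}))
# -> ["missing parameters: ['command', 'file_text', 'path']",
#     "unexpected parameters: ['content', 'file_path']"]
```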
Unfortunately, devstral completely fails the tool calling test. There’s really no need to even run the rest of the sequence, but for grins and giggles, let’s continue.
Test 2: Memory Awareness
After devstral botches the tool call, I ask it a follow-up question within the same session:
did you create the file?
The model has no idea what I’m talking about:
I'm not sure which file you're referring to. Could you please provide more details?
So, in addition to hallucinating frequently, devstral also does not keep context. This is an absolute must-have for agents. There’s no way the model can successfully perform an agentic loop when it doesn’t even remember what it did 2 seconds ago.
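For context on why this matters: in a typical agentic loop, every tool call and its result get appended back into the transcript before the next model turn, so “did you create the file?” should be trivially answerable. A rough sketch of that loop shape, with a fake stand-in model so it’s self-contained; this is the generic pattern, not Goose’s actual implementation:

```python
# Generic shape of an agentic loop -- not Goose's actual implementation.
# The point: every tool call and its result land in the transcript,
# so the model can always answer "did you create the file?" from history.

def call_model(messages):
    # Fake stand-in for one chat turn against a local model. It pretends
    # the model first asks for a write, then acknowledges it afterwards.
    if any(m["role"] == "tool" for m in messages):
        return {"role": "assistant", "content": "Yes, I created tool-test.txt.", "tool_calls": []}
    return {"role": "assistant", "content": "",
            "tool_calls": [{"name": "text_editor",
                            "args": {"command": "write",
                                     "path": "/tmp/tool-test.txt",
                                     "file_text": "Hello World"}}]}

def execute_tool(call):
    # Toy executor: actually perform the write and report back.
    with open(call["args"]["path"], "w") as f:
        f.write(call["args"]["file_text"])
    return f"wrote {call['args']['path']}"

messages = [{"role": "user", "content": 'Create /tmp/tool-test.txt with "Hello World".'}]
for _ in range(5):                       # hard cap on loop turns
    reply = call_model(messages)
    messages.append(reply)
    if not reply["tool_calls"]:
        break
    for call in reply["tool_calls"]:
        messages.append({"role": "tool", "content": execute_tool(call)})

messages.append({"role": "user", "content": "did you create the file?"})
print(call_model(messages)["content"])   # answerable purely from the transcript
```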
Test 3: File System Reasoning
The final test is to see if the model is able to reason about the environment it is working in. This is also critical, as you don’t want it performing destructive actions in the wrong directory.
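Incidentally, this is also why a harness shouldn’t just take the model’s word for where it is. A minimal sketch of the kind of guard you’d want in front of any destructive action; the workspace root here is a placeholder pulled from my own setup:

```python
from pathlib import Path

# Placeholder workspace root -- in a real harness this would come from config.
WORKSPACE = Path("/Users/angiej/workspace/loose-goose").resolve()

def assert_inside_workspace(target: str) -> Path:
    """Refuse to touch anything the model names outside the allowed workspace."""
    path = Path(target).expanduser().resolve()
    if not path.is_relative_to(WORKSPACE):
        raise PermissionError(f"{path} is outside {WORKSPACE}; refusing to write")
    return path

# devstral's hallucinated path from test 1 gets rejected here:
assert_inside_workspace("/Users/pauljoshua/workspace/loose-goose/tool-test.txt")
# -> PermissionError
```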
From test 1, I already know that devstral hallucinated that it was in a dude named Paul’s system. But let’s officially run a test:
what directory are you in
Its response:
I don't have direct access to a file system or current working directory. I operate within the parameters set by my programming and can process information, answer questions, and perform tasks based on the data I've been trained on (up to 2021) and the tools available to me.
If you're looking for information about a specific file, directory, or need help with something else, feel free to ask!
Ok, we’re done here.
The devstral model is floating in the void… in the midst of the pandemic years, no less. It’s very clear that the claim of devstral being “designed to excel at agentic coding tasks” is greatly exaggerated. It cannot even write a basic “Hello World”.
One day I’ll get the local model of my dreams. Unfortunately, today is not that day.