I’ve been building a knowledge base of notes on working with text-based models. This includes exploring how they all respond to the same prompt, what they return as a result, what parameters they support, and what kinds of features they offer, for example, structured output.

Based on what I’ve captured, I believe the content makes for an interesting article from both a sharing and learning perspective.

The models I tested are:

  • Gemini Flash
  • OpenAI GPT-4o
  • Docker model runner using ai/smollm2
  • Docker model runner using ai/deepseek-r1-distill-llama:8B-Q4_K_M
  • Anthropic claude-3-7-sonnet-20250219

To keep the blog simple, I used a very simple prompt: “just respond with hello world”. The goal was to gain some insight into what each of these APIs provided as a response and where the actual response data was stored.

Below is the Python code used to submit the request, the full response, and the code required to get the response text from the LLM. What you’ll notice is that a lot of peripheral metadata is returned in addition to the inference response. Not all of it is immediately useful or interesting, but as you start to build out production enterprise applications, some of these metrics will end up being very important.

Gemini Flash

import json
from google import genai

client = genai.Client(api_key=" ")

response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents="just respond with hello world"
)

Of note, the Gemini API docs example has print(response.text), which won’t work because there is no text element as a direct child of response. I think the worst thing you can do is fail your developers with non-working code right out of the gate in the quickstart.

Results

# Code to retrieve the actual response text
print(response.candidates[0].content.parts[0].text)

The type of the response is <class ‘google.genai.types.GenerateContentResponse’>. Gemini, OpenAI, and Claude all use a structured response model. For each of these models I will provide the type of the response, as it highlights the complexity in working across models.
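For reference, the type strings quoted in this post can be pulled straight off the response object; no special API is needed:

# Inspect the structured response object the SDK hands back
print(type(response))  # <class 'google.genai.types.GenerateContentResponse'>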

OpenAI GPT-4o

from openai import OpenAI
client = OpenAI()

response = client.responses.create(
    model="gpt-4o",
    input="just respond with hello world"
)

Results

print(response.output[0].content[0].text)

The type of the response object used by OpenAI is <class ‘openai.types.responses.response.Response’>.
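This response object is also where the peripheral metadata mentioned earlier lives. A quick sketch of poking at it (the model and usage attribute names are based on my reading of the OpenAI Python SDK, so verify them against the current docs):

# Peripheral metadata returned alongside the inference text
print(response.model)  # which model actually served the request
print(response.usage)  # input/output token accounting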

Docker model runner using ai/smollm2

If you haven’t tried Docker model runner yet, it’s essentially Ollama built into Docker. It allows you to treat models like Docker containers: there’s a registry of models you can pull from, and multiple models are then available through the same API.

The other aspect of Docker model runner is that there is no SDK as of yet; API access is through a REST interface.

import asyncio
import aiohttp

api_url = "http://localhost:9001/engines/llama.cpp/v1/chat/completions"

async def main():
    payload = {
        "model": "ai/smollm2",
        "messages": [
            {"role": "user", "content": "just respond with hello world"}
        ],
        "stream": False,
    }
    headers = {"Content-Type": "application/json"}
    try:
        async with aiohttp.ClientSession(headers=headers) as session:
            async with session.post(api_url, json=payload) as response:
                if response.status == 200:
                    response_data = await response.json()
                    print(response_data)
    except aiohttp.ClientError as e:
        print(f"Request to docker model runner API failed: {e}")
    finally:
        print("Request completed.")

asyncio.run(main())

Results

print(response_data['choices'][0]['message']['content'])


Switching to Docker model runner means that I’m using a model that is quantized and not as accurate as some of the larger models. That said, after switching to the local model I noticed some unusual behavior, or what I like to call:

When your LLM decides to be a defiant teenager:

I swear this triggered me, as it was reminiscent of talking to my 16-year-old. All the other models up to this point, no matter how many times I ran them (up to 5), just responded with “Hello world”.

Run 1: Hello world!

Run 2: Hello and welcome to my humble abode of answers. I’m here to assist you with any questions or queries you may have. What is it that you need help with?

Run 3: Hello world!

Run 4: hello, how can I assist you today?

Docker model runner using ai/deepseek-r1-distill-llama:8B-Q4_K_M

This uses the same code as ai/smollm2; all we need to do is pull the model and change the model parameter in the API request, as shown below.
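For clarity, the only difference from the smollm2 snippet is the payload’s model field (the docker model pull command in the comment reflects my understanding of the Docker model runner CLI, so check it against the docs):

payload = {
    # model pulled beforehand, e.g.: docker model pull ai/deepseek-r1-distill-llama:8B-Q4_K_M
    "model": "ai/deepseek-r1-distill-llama:8B-Q4_K_M",
    "messages": [
        {"role": "user", "content": "just respond with hello world"}
    ],
    "stream": False,
}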

Like smollm2, deepseek-r1 has its own issues, which I likened to having ADHD:

Results

However, this was just run 1; runs 2 and 3 operated as expected:

Run 1: To respond with "hello world" using Python, follow these steps:

1. Open a text editor and create a new file named `hello.py`.
2. Type the following line in the file:
   ```python
   print("hello world")
   ```
3. Save the file and close the editor.
4. Run the script using a command line or an IDE.
5. The output will be "hello world".

Here is the code:

```python
print("hello world")
```

Run 2: Hello, world!
Run 3: Hello, World!

Anthropic claude-3-7-sonnet-20250219

import anthropic

client = anthropic.Anthropic(
    # defaults to os.environ.get("ANTHROPIC_API_KEY")
    #api_key="my_api_key",
)
message = client.messages.create(
    model="claude-3-7-sonnet-20250219",
    max_tokens=1024,
    messages=[
        {"role": "user", "content": "Just respond with hello world"}
    ]
)
print(message)

Results

print(message.content[0].text)

Conclusions

As I stated at the start of the blog, I wrote this in the interest of learning as much as sharing. When MCP came out, I said I’d have preferred that we had started with standardization of LLM output. I am curious what others are finding along their journey across LLMs.
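To make that concrete, here is a minimal sketch of one way to paper over the differences, built only from the response shapes shown above (extract_text is an illustrative helper, not any provider’s API):

def extract_text(provider, response):
    """Pull the response text out of each provider's structure, per the examples above."""
    if provider == "gemini":
        return response.candidates[0].content.parts[0].text
    if provider == "openai":
        return response.output[0].content[0].text
    if provider == "anthropic":
        return response.content[0].text
    if provider == "docker-model-runner":
        # REST response already parsed into a dict
        return response["choices"][0]["message"]["content"]
    raise ValueError(f"Unknown provider: {provider}")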

Is there a way that you mitigate the variances in responses that are provided by the multitude of LLMs?

Do you find that for each LLM there is clear and sufficient documentation with regard to formatting prompts? For example, there are all different types of roles; I’ve seen assistant, developer, user, system, etc. What about the options that the LLM accepts?

If an SDK exists, would you still prefer to use the REST API?

Do you prefer hosted LLMs versus embedded LLMs? I prefer the hosted ones, but I’m a distributed computing engineer by nature. I prefer a service-based approach versus an SDK-based approach.
