I’ve been building a knowledge base on working with text-based models. This includes exploring how they all respond to the same prompt, what they return as a result, what parameters they support, and what kinds of features they offer, for example, structured output.
Based on what I’ve captured, I believe the content makes for an interesting article from both a sharing and learning perspective.
The models I tested are:
- Gemini Flash
- OpenAI GPT-4o
- Docker model runner using ai/smollm2
- Docker model runner using ai/deepseek-r1-distill-llama:8B-Q4_K_M
- Anthropic claude-3-7-sonnet-20250219
To keep the blog simple, I used a very simple prompt: “just respond with hello world”. The goal was to gain some insight into what each of these APIs returns as a response and where the actual response data is stored.
Below is the Python code used to submit the request, the full response, and the code required to get the response text from the LLM. What you’ll notice is that a lot of peripheral metadata is returned in addition to the inference response. Not all of it is immediately useful or interesting, but as you start to build out production enterprise applications, some of these metrics will end up being very important.
Gemini Flash
import json
from google import genai

# Replace with your Gemini API key (or load it from the environment)
client = genai.Client(api_key=" ")

response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents="just respond with hello world"
)
Of note, the Gemini API docs quickstart example uses print(response.text), which didn’t work for me because there is no text element as a direct child of the response. I think the worst thing you can do is fail your developers with non-working code right out of the gate in the quickstart.
Results
candidates=[Candidate(content=Content(parts=[Part(video_metadata=None, thought=None, code_execution_result=None, executable_code=None, file_data=None, function_call=None, function_response=None, inline_data=None, text='hello world\n')], role='model'), citation_metadata=None, finish_message=None, token_count=None, finish_reason=<FinishReason.STOP: 'STOP'>, avg_logprobs=-0.29720538854599, grounding_metadata=None, index=None, logprobs_result=None, safety_ratings=None)] create_time=None response_id=None model_version='gemini-2.0-flash' prompt_feedback=None usage_metadata=GenerateContentResponseUsageMetadata(cache_tokens_details=None, cached_content_token_count=None, candidates_token_count=3, candidates_tokens_details=[ModalityTokenCount(modality=<MediaModality.TEXT: 'TEXT'>, token_count=3)], prompt_token_count=5, prompt_tokens_details=[ModalityTokenCount(modality=<MediaModality.TEXT: 'TEXT'>, token_count=5)], thoughts_token_count=None, tool_use_prompt_token_count=None, tool_use_prompt_tokens_details=None, total_token_count=8) automatic_function_calling_history=[] parsed=None
# Code to retrieve the actual response text
print(response.candidates[0].content.parts[0].text)
The type of the response is <class ‘google.genai.types.GenerateContentResponse’>. Gemini, OpenAI, and Claude all use a structured response model. For each of these models I will provide the type of the response, as it highlights the complexity of working across models.
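If you want to check this yourself, the class name comes straight from Python’s built-in type(); the same one-liner works against whichever SDK’s response object you’re holding:

# Inspect the response class for the SDK you are using
print(type(response))
# <class 'google.genai.types.GenerateContentResponse'> for the Gemini example above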
OpenAI GPT-4o
from openai import OpenAI

client = OpenAI()

response = client.responses.create(
    model="gpt-4o",
    input="just respond with hello world"
)
Response
Response(id='resp_67eea85673208192a0f9a22a581bd57908d3486167a3e0c1', created_at=1743693910.0, error=None, incomplete_details=None, instructions=None, metadata={}, model='gpt-4o-2024-08-06', object='response', output=[ResponseOutputMessage(id='msg_67eea856d5c08192843f52ac9027236908d3486167a3e0c1', content=[ResponseOutputText(annotations=[], text='Hello, world!', type='output_text')], role='assistant', status='completed', type='message')], parallel_tool_calls=True, temperature=1.0, tool_choice='auto', tools=[], top_p=1.0, max_output_tokens=None, previous_response_id=None, reasoning=Reasoning(effort=None, generate_summary=None), status='completed', text=ResponseTextConfig(format=ResponseFormatText(type='text')), truncation='disabled', usage=ResponseUsage(input_tokens=12, input_tokens_details=InputTokensDetails(cached_tokens=0), output_tokens=5, output_tokens_details=OutputTokensDetails(reasoning_tokens=0), total_tokens=17), user=None, store=True)
print(response.output[0].content[0].text)
The type of the response object used by OpenAI is <class ‘openai.types.responses.response.Response’>.
Docker model runner using ai/smollm2
If you haven’t tried Docker model runner yet, it’s essentially Ollama built into Docker: it lets you treat models like Docker containers. There’s a registry of models you can pull from, so multiple models are available through the same API.
The other aspect of Docker model runner is that there is no SDK yet; API access is through a REST-based interface.
import aiohttp
import asyncio

api_url = "http://localhost:9001/engines/llama.cpp/v1/chat/completions"

async def main():
    headers = {'Content-Type': 'application/json'}
    payload = {
        "model": "ai/smollm2",
        "messages": [
            {"role": "user", "content": "just respond with hello world"}
        ],
        "stream": False,
    }
    try:
        async with aiohttp.ClientSession(headers=headers) as session:
            async with session.post(api_url, json=payload) as response:
                if response.status == 200:
                    response_data = await response.json()
                    print(response_data)
    except aiohttp.ClientError as e:
        print(f"Request to docker model runner API failed: {e}")
    finally:
        print("Request completed.")

asyncio.run(main())
Results
{'choices': [{'finish_reason': 'stop', 'index': 0, 'message': {'role': 'assistant', 'content': 'Hello, world!'}}], 'created': 1743695344, 'model': 'ai/smollm2', 'system_fingerprint': 'b1-be7c303', 'object': 'chat.completion', 'usage': {'completion_tokens': 5, 'prompt_tokens': 34, 'total_tokens': 39}, 'id': 'chatcmpl-M9T18dvQAYZcO8w0k8xN2AqXbo5CTjNu', 'timings': {'prompt_n': 1, 'prompt_ms': 70.213, 'prompt_per_token_ms': 70.213, 'prompt_per_second': 14.242376767835017, 'predicted_n': 5, 'predicted_ms': 34.417, 'predicted_per_token_ms': 6.8834, 'predicted_per_second': 145.27704332161431}}
print(response_data['choices'][0]['message']['content'])
Switching to Docker model runner means I’m using a model that is quantized and not as accurate as some of the larger models. That said, after switching to the local model I noticed some unusual behavior, or what I like to call:
When your LLM decides to be a defiant teenager:
I swear this triggered me, as it was reminiscent of talking to my 16-year-old. All the other models up to this point, no matter how many times I ran the prompt (up to 5), just responded with “Hello world”.
Run 1: Hello world!
Run 2: Hello and welcome to my humble abode of answers. I’m here to assist you with any questions or queries you may have. What is it that you need help with?
Run 3: Hello world!
Run 4: hello, how can I assist you today?
Docker model runner using ai/deepseek-r1-distill-llama:8B-Q4_K_M
This uses the same code as ai/smollm2; all we need to do is pull the model and change the model parameter in the API request, as shown below.
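For example, reusing the request code from the smollm2 section, the only change is the model field in the payload (assuming the model has already been pulled locally):

payload = {
    "model": "ai/deepseek-r1-distill-llama:8B-Q4_K_M",  # was "ai/smollm2"
    "messages": [
        {"role": "user", "content": "just respond with hello world"}
    ],
    "stream": False,
}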
Like smollm2, deepseek-r1 has its own issues, which I likened to having ADHD:
Results
{'choices': [{'finish_reason': 'stop', 'index': 0, 'message': {'role': 'assistant', 'reasoning_content': 'Okay, so I need to figure out how to respond with "hello world" using Python. I\'m pretty new to programming, so I\'m not entirely sure how to start. I know that "hello world" is a common first program that prints a message. Let me think about the steps I need to take.\n\nFirst, I need to write a Python program. I remember that in Python, we can write simple scripts using a text editor. Maybe I can just type it into a new file. I should start with something like opening a file and writing to it. I\'ve heard that the print function is used for outputting messages.\n\nSo, maybe I can write something like print("hello world"). That seems straightforward. But wait, I want to make sure that\'s all. Do I need to include any additional parts, like a main function or something else? No, I think for such a simple program, the script can just have the print statement.\n\nI also recall that when you run a Python script, it executes each line. So if I have only the print line, that should execute and display "hello world". I don\'t think I need any imports because I\'m not using any libraries here.\n\nLet me check if I\'m missing anything. Oh, right, sometimes people add a shebang line at the top of the file to specify the Python interpreter. But I think that\'s optional. Without it, the script should still run fine on most systems.\n\nWait, should I name the file something specific, like "hello.py"? That makes sense because it\'s a Python file. So I can save the content as hello.py and then execute it using the command line. If I\'m using an IDE, I might just run it from there, but for now, the command line should work.\n\nI\'m a bit nervous about syntax errors. How can I make sure that my code is correct? Maybe I can run it and see if it outputs "hello world" or if it throws an error. If it works, then great! If not, I\'ll need to troubleshoot.\n\nI think that\'s all I need to do. So, putting it all together, my Python script will have a single line: print("hello world"). I\'ll save it as hello.py and execute it to see the output.', 'content': '\n\nTo respond with "hello world" using Python, follow these steps:\n\n1. Open a text editor and create a new file named `hello.py`.\n2. Type the following line in the file:\n ```python\n print("hello world")\n ```\n3. Save the file and close the editor.\n4. Run the script using a command line or an IDE.\n5. The output will be "hello world".\n\nHere is the code:\n\n```python\nprint("hello world")\n```'}}], 'created': 1743695786, 'model': 'ai/deepseek-r1-distill-llama:8B-Q4_K_M', 'system_fingerprint': 'b1-be7c303', 'object': 'chat.completion', 'usage': {'completion_tokens': 571, 'prompt_tokens': 10, 'total_tokens': 581}, 'id': 'chatcmpl-Q9q4cGIEaZdLYvJatPOBTCrmpcBjIzv5', 'timings': {'prompt_n': 10, 'prompt_ms': 184.437, 'prompt_per_token_ms': 18.4437, 'prompt_per_second': 54.21905582936178, 'predicted_n': 571, 'predicted_ms': 29219.523, 'predicted_per_token_ms': 51.17254465849387, 'predicted_per_second': 19.541729000846452}}
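One wrinkle with this model: the reply comes back split into a reasoning_content field (the chain of thought) and the usual content field. Pulling out the answer looks the same as with smollm2; using .get() for the reasoning is my own defensive choice, since the other models don’t return that field:

# The final answer is in the standard OpenAI-style location
print(response_data['choices'][0]['message']['content'])

# The chain of thought is returned separately; .get() avoids a KeyError with
# models that don't include it
print(response_data['choices'][0]['message'].get('reasoning_content'))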
However, this was just run 1; runs 2 and 3 operated as expected:
Run 1: To respond with "hello world" using Python, follow these steps:
1. Open a text editor and create a new file named `hello.py`.
2. Type the following line in the file:
```python
print("hello world")
```
3. Save the file and close the editor.
4. Run the script using a command line or an IDE.
5. The output will be "hello world".
Here is the code:
```python
print("hello world")
```
Run 2: Hello, world!
Run 3: Hello, World!
Anthropic claude-3-7-sonnet-20250219
import anthropic

client = anthropic.Anthropic(
    # defaults to os.environ.get("ANTHROPIC_API_KEY")
    # api_key="my_api_key",
)

message = client.messages.create(
    model="claude-3-7-sonnet-20250219",
    max_tokens=1024,
    messages=[
        {"role": "user", "content": "Just respond with hello world"}
    ]
)
print(message)
Results
Message(id='msg_01Lz93UZy3eJTiLigpjkakU3', content=[TextBlock(citations=None, text='hello world', type='text')], model='claude-3-7-sonnet-20250219', role='assistant', stop_reason='end_turn', stop_sequence=None, type='message', usage=Usage(cache_creation_input_tokens=0, cache_read_input_tokens=0, input_tokens=12, output_tokens=5))
print(message.content[0].text)
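Coming back to the peripheral metadata mentioned at the top, token accounting is one place it pays off quickly. Based purely on the fields visible in the response dumps above (and using each section’s own variable names), a rough sketch of pulling usage counts out of each structured response looks like this; treat it as an illustration, not a hardened utility:

# Gemini (google-genai): usage_metadata on the response
print(response.usage_metadata.total_token_count)    # 8 in the example above

# OpenAI Responses API: usage on the response
print(response.usage.total_tokens)                   # 17 in the example above

# Anthropic: usage exposes input/output counts; summing them is my own
# assumption, since no total field appears in the dump
print(message.usage.input_tokens + message.usage.output_tokens)  # 12 + 5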
Conclusions
As I stated at the start of the blog, I wrote this in the interest of learning as much as sharing. When MCP came out, I said I would have preferred that we start with standardizing LLM output. I am curious what others are finding along their journey across LLMs.
Is there a way you mitigate the variance in the responses provided by the multitude of LLMs?
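One option is a thin adapter per provider that normalizes everything down to plain text. A minimal sketch, using the extraction paths shown earlier (the function and provider names are illustrative assumptions, not anyone’s official API):

def extract_text(provider: str, response) -> str:
    """Normalize 'where does the text live?' across providers."""
    if provider == "gemini":
        return response.candidates[0].content.parts[0].text
    if provider == "openai":
        return response.output[0].content[0].text
    if provider == "anthropic":
        return response.content[0].text
    if provider == "docker-model-runner":
        # REST response already parsed into a dict
        return response['choices'][0]['message']['content']
    raise ValueError(f"Unknown provider: {provider}")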
Do you find that each LLM has clear and sufficient documentation on formatting prompts? For example, there are all different types of roles; I’ve seen assistant, developer, user, system, etc. What about the options each LLM accepts?
If an SDK exists, would you still prefer to use the REST API?
Do you prefer hosted LLMs or embedded LLMs? I prefer the hosted ones, but I’m a distributed computing engineer by nature; I prefer a service-based approach over an SDK-based approach.