Ollama Architecture Analysis / Real-World Q&A
A comprehensive architecture analysis of Ollama, a local LLM execution platform implemented in Go. Covers the tech stack, scheduler, model management, llama.cpp integration, and the Runner system in detail.
Analyzed: 2026-03-15 Package: v0.18.0 Repository: https://github.com/ollama/ollama
This article is mostly written by Claude Code
1. Project Overview
Ollama is a local LLM (Large Language Model) execution platform written in Go. Inspired by Docker's design philosophy, it lets you run the latest open-source LLMs locally with a single command and no complex setup.
- Tagline: "Get up and running with large language models locally."
- Core values: Local execution, privacy protection, automatic GPU optimization, Docker-style model management
- Supported models: LLaMA, Mistral, Qwen, Gemma, Phi, Deepseek, GLM, and dozens more
- Supported GPUs: NVIDIA CUDA, AMD ROCm, Apple Metal (M1/M2/M3/M4)
- Supported platforms: macOS, Linux, Windows
2. Tech Stack
| Area | Technology |
|---|---|
| Language | Go 1.24.1 |
| HTTP framework | Gin v1.10.0 |
| CLI framework | Cobra v1.7.0 |
| Inference backend | llama.cpp (CGO bindings) |
| DB | SQLite (blob metadata) |
| Compression | zstd |
| Serialization | protobuf, JSON |
| GPU support | CUDA / ROCm / Metal / CPU |
GPU Support Layers
| Backend | Target hardware |
|---|---|
| NVIDIA CUDA | GeForce, RTX, Tesla, A100, etc. |
| AMD ROCm | RX 7000/6000 series, MI300, etc. |
| Apple Metal | Apple Silicon (M1–M4) |
| CPU | AVX2/AVX512-optimized x86, ARM |
3. Overall Architecture
╔══════════════════════════════════════════════════════════════════╗
║ Ollama System ║
║ ║
║ ┌──────────────────────────────────────────────────────────┐ ║
║ │ CLI Layer (Cobra) │ ║
║ │ ollama run / pull / push / create / list / show / serve │ ║
║ └──────────────────────┬───────────────────────────────────┘ ║
║ │ ║
║ ┌──────────────────────▼───────────────────────────────────┐ ║
║ │ HTTP Server (Gin + CORS) │ ║
║ │ 127.0.0.1:11434 (configurable via OLLAMA_HOST) │ ║
║ │ │ ║
║ │ OpenAI Compatibility Anthropic Compatibility │ ║
║ │ Middleware Middleware │ ║
║ └──────────────────────┬───────────────────────────────────┘ ║
║ │ ║
║ ┌──────────────────────▼───────────────────────────────────┐ ║
║ │ Route Handlers │ ║
║ │ ChatHandler / GenerateHandler / EmbedHandler │ ║
║ │ PullHandler / PushHandler / CreateHandler │ ║
║ └──────────────────────┬───────────────────────────────────┘ ║
║ │ scheduleRunner() ║
║ ┌──────────────────────▼───────────────────────────────────┐ ║
║ │ Scheduler │ ║
║ │ - Model load/unload management │ ║
║ │ - VRAM-based Eviction │ ║
║ │ - Reference Counting │ ║
║ │ - Keep-Alive timer │ ║
║ └──────────────────────┬───────────────────────────────────┘ ║
║ │ spawn process ║
║ ┌──────────────────────▼───────────────────────────────────┐ ║
║ │ Runner Process (separate process) │ ║
║ │ │ ║
║ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────────┐ │ ║
║ │ │LlamaRunner │ │OllamaRunner │ │ImageGen Runner │ │ ║
║ │ │(llama.cpp) │ │(Go native) │ │(diffusion) │ │ ║
║ │ └──────┬──────┘ └──────┬──────┘ └────────┬────────┘ │ ║
║ │ └────────────────┴──────────────────┘ │ ║
║ │ │ │ ║
║ │ GGML Backend (GPU/CPU) │ ║
║ │ CUDA / ROCm / Metal / CPU │ ║
║ └──────────────────────────────────────────────────────────┘ ║
║ ║
║ ┌──────────────────────────────────────────────────────────┐ ║
║ │ Model Storage (Content-Addressable) │ ║
║ │ ~/.ollama/models/manifests/ + blobs/sha256-[digest] │ ║
║ └──────────────────────────────────────────────────────────┘ ║
╚══════════════════════════════════════════════════════════════════╝
4. Core Module Structure
/api — Client API Package
A Go client library for external access to the Ollama server.
- Client struct: HTTP client
- Request/response type definitions: GenerateRequest, ChatRequest, EmbedRequest, etc.
- Streaming response handling
- Error types such as StatusError and AuthorizationError
/server — HTTP Server + Core Logic
type Server struct {
addr net.Addr // listening address
sched *Scheduler // model scheduler
defaultNumCtx int // default context length
}
Key handlers:
- ChatHandler: multi-turn conversation
- GenerateHandler: single-prompt generation
- EmbedHandler: embedding generation
- PullHandler: model download
- CreateHandler: custom model creation
/llm — LLM Server Interface
An abstract interface for communicating with Runner processes.
type LlamaServer interface {
Load(ctx context.Context, opts api.Options) error
Ping(ctx context.Context) error
Completion(ctx context.Context, req CompletionRequest, fn func(CompletionResponse)) error
Embedding(ctx context.Context, input string) ([]float64, error)
Tokenize(ctx context.Context, content string) ([]int, error)
Detokenize(ctx context.Context, tokens []int) (string, error)
MemorySize(ctx context.Context) (uint64, error)
Close() error
}
/runner — Model Execution Engines
| Runner | Path | Description |
|---|---|---|
| LlamaRunner | runner/llamarunner/ | Legacy runner based on llama.cpp |
| OllamaRunner | runner/ollamarunner/ | New Go-native engine |
| ImageGen | runner/x/imagegen/ | Stable Diffusion-class image generation |
| MLX | runner/x/mlxrunner/ | Apple MLX-optimized runner |
/model — Model Architecture Implementations
Model implementations used by the Go-native engine.
- LLaMA, Mistral, Qwen, Gemma, Phi, Deepseek, GLM, and more
- Model interface: implements the forward pass
- MultimodalProcessor: image encoding (Vision models)
- Model architecture registry (model.Register())
/manifest — Content-Addressable Storage
A model storage system similar to Docker image layers.
Manifest (JSON)
└── Layers []Layer
├── Digest: "sha256-abc123..."
├── MediaType: "application/vnd.ollama.image.model"
└── Size: 4_000_000_000
Media types:
- application/vnd.ollama.image.model: model weights (GGUF)
- application/vnd.ollama.image.projector: Vision projector
- application/vnd.ollama.image.adapter: LoRA adapter
- application/vnd.ollama.image.template: chat template
- application/vnd.ollama.image.params: parameters
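To make the manifest-to-blob relationship concrete, here is a minimal Python sketch that indexes layers by media type and derives the path of the weights blob. The manifest literal, its JSON key names, the digests, and the `blob_path` and `layers_by_type` helpers are all invented for illustration; only the media-type strings and the `blobs/sha256-[digest]` layout come from the description above.

```python
import json
from pathlib import PurePosixPath

# Hypothetical manifest; digests and sizes are placeholders, not real blobs.
MANIFEST_JSON = """
{
  "layers": [
    {"mediaType": "application/vnd.ollama.image.model",    "digest": "sha256-aaa", "size": 4000000000},
    {"mediaType": "application/vnd.ollama.image.template", "digest": "sha256-bbb", "size": 1200},
    {"mediaType": "application/vnd.ollama.image.params",   "digest": "sha256-ccc", "size": 96}
  ]
}
"""

def blob_path(models_dir: str, digest: str) -> PurePosixPath:
    """Map a layer digest to its file under blobs/ (assumed layout)."""
    # Assumption: a "sha256:" prefix, if present, is stored on disk as "sha256-".
    return PurePosixPath(models_dir) / "blobs" / digest.replace(":", "-")

def layers_by_type(manifest: dict) -> dict:
    """Index layers by the last media-type segment, e.g. 'model' or 'template'."""
    return {layer["mediaType"].rsplit(".", 1)[-1]: layer for layer in manifest["layers"]}

manifest = json.loads(MANIFEST_JSON)
layers = layers_by_type(manifest)
weights = blob_path("~/.ollama/models", layers["model"]["digest"])
total_size = sum(layer["size"] for layer in manifest["layers"])
print(weights, total_size)
```

Because every layer is addressed by digest, two manifests that list the same digest resolve to the same file on disk.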
/template — Chat Template System
Handles the prompt format expected by each model (ChatML, Llama3, Gemma, etc.) using Go's text/template.
/discover — GPU Discovery
Detects installed GPUs at runtime and collects metadata such as VRAM size and driver version.
5. Request Processing Pipeline
Full Chat Request Flow
Client
│
│ POST /api/chat
▼
ChatHandler()
│ JSON parsing + validation
│
▼
scheduleRunner(modelName, options)
│
▼
Scheduler.GetRunner()
├─ [already loaded] → return Runner reference
└─ [not loaded] → enqueue as LlmRequest
│
▼
processPending() goroutine
│
├─ detect GPU devices
├─ calculate available VRAM
├─ evict existing model if needed
└─ spawn Runner process
│
▼
Runner process (separate port)
│
HTTP RPC communication
│
▼
Runner.Completion()
│
├─ messages → tokenize
├─ initialize / reuse KV cache
├─ execute forward pass (GPU)
├─ sample next token
└─ repeat (until EOS or max_tokens)
│
▼
ChatResponse chunks streamed (NDJSON)
│
▼
Client (done: true + metrics)
Model Resolution Flow
"llama3.2:3b"
│
▼
GetModel(name)
│
├─ check for local manifest
├─ if absent, Pull from registry
▼
parse manifest.json
│
▼
parse GGUF header → extract model metadata
│ (parameter count, context length, quantization type, etc.)
▼
auto-detect or explicitly load chat template
│
▼
return Model struct
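The first step of this flow, turning a short name into a manifest location, can be sketched in Python as follows. This is not Ollama's resolver: `ModelName`, `parse_model_name`, and the three default values are assumptions chosen to match the storage layout described in this article.

```python
from dataclasses import dataclass

# Assumed defaults; the real resolver may differ.
DEFAULT_REGISTRY = "registry.ollama.ai"
DEFAULT_NAMESPACE = "library"
DEFAULT_TAG = "latest"

@dataclass(frozen=True)
class ModelName:
    registry: str
    namespace: str
    name: str
    tag: str

    def manifest_path(self) -> str:
        """Relative path of the manifest under the manifests/ directory."""
        return f"{self.registry}/{self.namespace}/{self.name}/{self.tag}"

def parse_model_name(ref: str) -> ModelName:
    """Split [registry/][namespace/]name[:tag], filling in defaults."""
    path, sep, tag = ref.rpartition(":")
    if not sep or "/" in tag:  # no tag, or the colon belonged to host:port
        path, tag = ref, DEFAULT_TAG
    parts = path.split("/")
    if len(parts) > 3 or not all(parts):
        raise ValueError(f"invalid model name: {ref!r}")
    name = parts[-1]
    namespace = parts[-2] if len(parts) >= 2 else DEFAULT_NAMESPACE
    registry = parts[-3] if len(parts) == 3 else DEFAULT_REGISTRY
    return ModelName(registry, namespace, name, tag)

print(parse_model_name("llama3.2:3b").manifest_path())
print(parse_model_name("mistral").manifest_path())
```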
6. Scheduler System
The scheduler (/server/sched.go) is a core Ollama component that efficiently manages VRAM when multiple model requests arrive concurrently.
Key Types
// Unit of work queued for scheduling
type LlmRequest struct {
ctx context.Context
model *Model
opts api.Options
sessionDuration *api.Duration
successCh chan *runnerRef // delivers Runner once loaded
errCh chan error
}
// Reference-counted handle to a running Runner
type runnerRef struct {
refMu sync.Mutex
refCount uint // number of in-flight requests
llama llm.LlamaServer
model *Model
pid int // Runner process PID
gpus []ml.DeviceID // GPUs in use
expireTimer *time.Timer
}
How the Scheduler Works
- Single-threaded loading: Model loading is executed sequentially to prevent GPU memory contention.
- Keep-Alive management: After a request completes, the model stays in memory for a configurable duration (default: 5 minutes).
- Eviction policy: When loading a new model and VRAM is insufficient, the least-recently-used model is unloaded first.
- Reference counting: Concurrent requests for the same model share the same Runner instance.
Request A → load llama3.2 → refCount: 1
Request B → llama3.2 already loaded → refCount: 2
Request A done → refCount: 1
Request B done → refCount: 0 → Keep-Alive timer starts
5 min later → model unloaded (reloaded on next request)
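The trace above can be reproduced with a small Python simulation. It is a toy model of the idea, not the code in sched.go: `ToyScheduler`, `RunnerRef`, and the explicit `now` timestamps (used instead of real timers) are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

KEEP_ALIVE_SECONDS = 5 * 60  # default keep-alive from the text

@dataclass
class RunnerRef:
    """Toy stand-in for a loaded runner; not Ollama's actual type."""
    model: str
    ref_count: int = 0
    expires_at: Optional[float] = None  # None while requests are in flight

class ToyScheduler:
    def __init__(self) -> None:
        self.loaded: Dict[str, RunnerRef] = {}

    def acquire(self, model: str) -> RunnerRef:
        """Return the shared runner for a model, loading it on first use."""
        runner = self.loaded.setdefault(model, RunnerRef(model))
        runner.ref_count += 1
        runner.expires_at = None  # cancel any pending keep-alive timer
        return runner

    def release(self, model: str, now: float) -> None:
        """Drop one reference; start the keep-alive timer when it reaches zero."""
        runner = self.loaded[model]
        runner.ref_count -= 1
        if runner.ref_count == 0:
            runner.expires_at = now + KEEP_ALIVE_SECONDS

    def expire(self, now: float) -> List[str]:
        """Unload every idle runner whose keep-alive deadline has passed."""
        expired = [m for m, r in self.loaded.items()
                   if r.expires_at is not None and now >= r.expires_at]
        for model in expired:
            del self.loaded[model]
        return expired

sched = ToyScheduler()
a = sched.acquire("llama3.2")      # request A loads the model
b = sched.acquire("llama3.2")      # request B shares the same runner
sched.release("llama3.2", now=10)  # A done, B still running
sched.release("llama3.2", now=20)  # B done, keep-alive timer starts
print(sched.expire(now=100))                      # still within keep-alive
print(sched.expire(now=20 + KEEP_ALIVE_SECONDS))  # deadline reached
```

A new request that arrives during the keep-alive window clears the deadline, which is why a frequently used model never gets unloaded.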
7. Model Management System
Storage Structure
~/.ollama/models/
├── manifests/
│ └── registry.ollama.ai/
│ └── library/
│ ├── llama3.2/
│ │ ├── latest
│ │ └── 3b
│ └── mistral/
│ └── latest
└── blobs/
├── sha256-abc123... (model weights GGUF file)
├── sha256-def456... (template)
└── sha256-ghi789... (parameters)
- Content-addressable: Files are stored by SHA256 digest — identical files are shared across multiple models.
- Manifest: Describes which blobs compose a model (the same concept as Docker image layers).
Model Pull Process
1. Parse model name (registry/namespace/name:tag)
2. Fetch Manifest from registry API
3. Parallel download of required layer blobs (16-part chunks)
4. Resumable download support (Range header)
5. SHA256 digest verification
6. Save Manifest + blobs locally
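Steps 3 to 5 can be illustrated with a short Python sketch that splits a blob into byte ranges, reassembles it, and checks the digest. The even 16-way split and the helper names are assumptions for illustration; the real downloader may size its parts differently.

```python
import hashlib
from typing import List, Tuple

def split_ranges(size: int, parts: int = 16) -> List[Tuple[int, int]]:
    """Split [0, size) into up to `parts` inclusive (start, end) byte ranges."""
    if size <= 0:
        return []
    chunk = -(-size // parts)  # ceiling division
    return [(start, min(start + chunk, size) - 1) for start in range(0, size, chunk)]

def range_header(start: int, end: int) -> str:
    """HTTP Range header value for one part."""
    return f"bytes={start}-{end}"

def verify_digest(data: bytes, expected: str) -> bool:
    """Compare the SHA-256 of the downloaded bytes with the expected digest."""
    return "sha256-" + hashlib.sha256(data).hexdigest() == expected

blob = b"x" * 1000  # stand-in for a downloaded layer
ranges = split_ranges(len(blob))
# Simulate fetching each part independently and concatenating in order.
reassembled = b"".join(blob[s:e + 1] for s, e in ranges)
digest = "sha256-" + hashlib.sha256(blob).hexdigest()
print(len(ranges), range_header(*ranges[0]), verify_digest(reassembled, digest))
```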
Modelfile (Custom Model Creation)
A format inspired by Docker's Dockerfile.
FROM llama3.2
SYSTEM """
You are a helpful Korean assistant.
Always answer in Korean.
"""
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER num_ctx 8192
TEMPLATE """
{{- if .System }}<|system|>
{{ .System }}<|end|>
{{- end }}
{{- range .Messages }}<|{{ .Role }}|>
{{ .Content }}<|end|>
{{- end }}<|assistant|>
"""
8. Runner Architecture
Process-Isolation Design
[Ollama Server Process]
│
│ spawn Runner via exec.Command()
│ pass port/config via environment variables
▼
[Runner Process (llama.cpp / ollama-engine)]
│
│ HTTP server on localhost:<random port>
│ JSON RPC communication
▼
[GGML Backend]
│
├── CUDA (libcuda.so)
├── ROCm (librocblas.so)
├── Metal (Apple Framework)
└── CPU (AVX2/AVX512)
Why processes are isolated:
- A model crash does not bring down the entire server.
- Prevents GPU memory leaks (process exit = memory returned).
- Provides a unified interface across different backends (llama.cpp, Go native, imagegen).
llama.cpp CGO Integration
/llama/llama.go calls llama.cpp C/C++ code directly via CGO.
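// CGO binding example (conceptual)
/*
#include <stdlib.h>
#include "llama.h"
*/
import "C"
import "unsafe"
// loadModel loads a GGUF file through llama.cpp's C API.
func loadModel(modelPath string) *C.struct_llama_model {
	cPath := C.CString(modelPath)       // copied onto the C heap
	defer C.free(unsafe.Pointer(cPath)) // Go's GC does not free C allocations
	params := C.llama_model_default_params()
	return C.llama_load_model_from_file(cPath, params)
}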
New Ollama Engine (Go Native)
An inference engine implemented in /runner/ollamarunner/, with model architectures written in Go.
- Multi-sequence parallel processing (batch inference)
- KV cache slot management
- Multimodal (image) support
- Model code is Go-native, while tensor operations still run on the GGML backend shown in the architecture diagram
9. Configuration System
Environment Variables (envconfig package)
| Environment variable | Default | Description |
|---|---|---|
| OLLAMA_HOST | 127.0.0.1:11434 | Server listening address |
| OLLAMA_MODELS | ~/.ollama/models | Model storage directory |
| OLLAMA_KEEP_ALIVE | 5m | Duration to keep model in memory |
| OLLAMA_NUM_PARALLEL | 1 | Maximum number of parallel requests served per loaded model |
| OLLAMA_MAX_QUEUE | 512 | Maximum request queue size |
| OLLAMA_DEBUG | false | Enable debug logging |
| OLLAMA_ORIGINS | localhost | Allowed CORS origins |
| OLLAMA_NUM_GPU | -1 (all) | Number of layers to offload to GPU |
| OLLAMA_NUM_THREAD | auto | Number of CPU threads |
| OLLAMA_CONTEXT_LENGTH | 2048 | Default context length |
| OLLAMA_LLM_LIBRARY | auto | Force a specific GPU library |
| OLLAMA_USE_MMAP | auto | Whether to use memory mapping |
Inference Parameters (per-request configuration)
{
"model": "llama3.2",
"messages": [...],
"options": {
"temperature": 0.8,
"top_k": 40,
"top_p": 0.9,
"min_p": 0.05,
"repeat_penalty": 1.1,
"num_ctx": 4096,
"num_predict": 512,
"seed": 42,
"stop": ["\n\n", "<|end|>"],
"num_gpu": 35
},
"keep_alive": "10m"
}
10. REST API Structure
Key Endpoints
| Method | Path | Description |
|---|---|---|
| POST | /api/chat | Multi-turn chat (streaming / non-streaming) |
| POST | /api/generate | Single-prompt text generation |
| POST | /api/embed | Text embedding generation |
| POST | /api/pull | Download a model |
| POST | /api/push | Upload a model |
| POST | /api/create | Create a custom model from a Modelfile |
| DELETE | /api/delete | Delete a model |
| POST | /api/copy | Copy a model |
| GET | /api/tags | List locally available models |
| GET | /api/ps | List currently loaded models |
| POST | /api/show | Show model details |
| HEAD | /api/blobs/:digest | Check if a blob exists |
OpenAI-Compatible Endpoints
Ollama also supports the OpenAI API format via middleware.
POST /v1/chat/completions → ChatHandler (OpenAI format)
POST /v1/completions → GenerateHandler (OpenAI format)
POST /v1/embeddings → EmbedHandler (OpenAI format)
GET /v1/models → ListHandler (OpenAI format)
Streaming Response Format (NDJSON)
{"model":"llama3.2","message":{"role":"assistant","content":"H"},"done":false}
{"model":"llama3.2","message":{"role":"assistant","content":"el"},"done":false}
{"model":"llama3.2","message":{"role":"assistant","content":"lo"},"done":false}
{"model":"llama3.2","message":{"role":"assistant","content":"!"},"done":false}
{
"model":"llama3.2",
"done":true,
"total_duration": 2340000000,
"load_duration": 100000000,
"prompt_eval_count": 12,
"prompt_eval_duration": 340000000,
"eval_count": 48,
"eval_duration": 1900000000
}
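A client typically concatenates the content chunks and reads throughput from the final object. The Python sketch below does this on an inline sample shaped like the stream above, without contacting a server; the duration fields are nanoseconds, and `consume` and `tokens_per_second` are illustrative helper names.

```python
import json

# Sample NDJSON stream: one JSON object per line, shaped like the example above.
STREAM = "\n".join([
    '{"model":"llama3.2","message":{"role":"assistant","content":"Hel"},"done":false}',
    '{"model":"llama3.2","message":{"role":"assistant","content":"lo"},"done":false}',
    '{"model":"llama3.2","done":true,"prompt_eval_count":12,'
    '"prompt_eval_duration":340000000,"eval_count":48,"eval_duration":1900000000}',
])

def consume(stream: str):
    """Concatenate streamed content and return it with the final metrics chunk."""
    text, final = [], {}
    for line in stream.splitlines():
        if not line.strip():
            continue
        chunk = json.loads(line)
        if chunk.get("done"):
            final = chunk
        else:
            text.append(chunk["message"]["content"])
    return "".join(text), final

def tokens_per_second(count: int, duration_ns: int) -> float:
    """Convert a token count and a nanosecond duration into tokens per second."""
    return count / (duration_ns / 1e9)

text, metrics = consume(STREAM)
gen_tps = tokens_per_second(metrics["eval_count"], metrics["eval_duration"])
prompt_tps = tokens_per_second(metrics["prompt_eval_count"], metrics["prompt_eval_duration"])
print(f"{text!r} generation={gen_tps:.1f} tok/s prompt={prompt_tps:.1f} tok/s")
```

With the sample metrics, 48 tokens over 1.9 s works out to roughly 25 tokens per second for generation.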
11. Core Data Structures
Request Types
type ChatRequest struct {
Model string `json:"model"`
Messages []Message `json:"messages"`
Stream *bool `json:"stream"`
Tools []Tool `json:"tools"` // function calling
Think *ThinkValue `json:"think"` // reasoning mode (DeepSeek, etc.)
KeepAlive *Duration `json:"keep_alive"`
Options map[string]any `json:"options"`
}
type Message struct {
Role string `json:"role"` // system/user/assistant/tool
Content string `json:"content"`
Images []ImageData `json:"images"` // Base64 images (multimodal)
ToolCalls []ToolCall `json:"tool_calls"`
}
Runtime Options
type Options struct {
// Sampling parameters (can be changed per request)
Temperature float32 // creativity (0.0 – 2.0)
TopK int // Top-K sampling
TopP float32 // Top-P (nucleus) sampling
MinP float32 // Min-P sampling
RepeatPenalty float32 // repetition penalty
Seed int // seed for reproducibility
// Load-time parameters (applied when the model is reloaded)
NumCtx int // context window size
NumBatch int // batch size
NumGPU int // GPU offload layers (-1 = all)
NumThread int // CPU thread count
UseMMap *bool // memory mapping
UseMLock bool // memory lock (prevent swapping)
}
Model Struct
type Model struct {
Name string
Config model.ConfigV2
ModelPath string
AdapterPaths []string // LoRA adapters
ProjectorPaths []string // Vision projectors
System string // system prompt
Template *template.Template
Digest string
Options map[string]any
Messages []api.Message // few-shot examples
}
12. Directory Tree
ollama/
├── api/ # Go client library + type definitions
├── auth/ # Ed25519-based authentication
├── cmd/ # CLI commands (cobra)
│ └── cmd.go # run, pull, push, list, show, create...
├── convert/ # Model format conversion (HuggingFace → GGUF, etc.)
├── discover/ # GPU detection (CUDA, ROCm, Metal)
├── envconfig/ # Environment variable config parsing
├── format/ # Formatting utilities (file size, duration, etc.)
├── integration/ # Integration tests
├── llama/ # llama.cpp CGO bindings + C/C++ sources
├── llm/ # LlamaServer interface + implementations
├── middleware/ # OpenAI/Anthropic compatibility middleware
├── ml/ # ML backend abstraction layer
├── model/ # Go-native model architecture implementations
│ ├── llama/
│ ├── mistral/
│ ├── qwen/
│ └── ...
├── manifest/ # Model manifests + blob management
├── runner/ # Model execution engines
│ ├── llamarunner/ # llama.cpp-based
│ ├── ollamarunner/ # Go-native
│ └── x/
│ ├── imagegen/ # image generation
│ └── mlxrunner/# Apple MLX
├── server/ # HTTP server + handlers + scheduler
│ ├── routes.go # route definitions
│ ├── sched.go # scheduler
│ └── ...
├── template/ # chat template system
└── tokenizer/ # tokenizer utilities
13. Key Design Decisions
1. Process Isolation (Process-per-Model)
Each model runs in its own separate process. If a model crashes, the main server stays alive, and when the process exits, GPU memory is automatically reclaimed.
2. Content-Addressable Storage
Like Docker Hub, models are managed as SHA256 digest-based layers. Multiple model tags can share the same weight file, saving disk space.
3. Single-Threaded Model Loading
Loading models onto the GPU is performed sequentially. Loading multiple models in parallel can cause GPU memory fragmentation and incorrect VRAM accounting.
4. Keep-Alive Caching
Once a model is loaded, it stays in memory for 5 minutes by default. Repeated requests receive immediate responses without reloading (which can take tens of seconds). Set OLLAMA_KEEP_ALIVE=0 to unload immediately, or -1 to keep the model loaded indefinitely.
5. Dynamic GPU Layer Allocation
When loading a model, Ollama measures available VRAM and offloads as many layers as possible to the GPU. When VRAM is insufficient, the remaining layers fall back to CPU RAM.
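As a back-of-the-envelope illustration of this split, the Python sketch below estimates how many layers fit in free VRAM. It assumes equally sized layers and a fixed overhead, which is a simplification; the model dimensions are hypothetical and Ollama's real estimator accounts for more factors.

```python
def layers_on_gpu(free_vram: int, layer_size: int, n_layers: int, overhead: int = 0) -> int:
    """Number of equally sized layers that fit in VRAM after a fixed overhead."""
    usable = max(free_vram - overhead, 0)
    return min(n_layers, usable // layer_size)

GIB = 1024 ** 3
MIB = 1024 ** 2

# Hypothetical model: 32 layers of 200 MiB each, 1 GiB reserved for KV cache and buffers.
n_layers, layer_size, overhead = 32, 200 * MIB, 1 * GIB

full = layers_on_gpu(8 * GIB, layer_size, n_layers, overhead)     # everything fits
partial = layers_on_gpu(4 * GIB, layer_size, n_layers, overhead)  # only some layers fit
print(f"8 GiB: {full} GPU layers; 4 GiB: {partial} GPU layers, {n_layers - partial} on CPU")
```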
14. Performance Optimization Strategies
| Optimization technique | Description |
|---|---|
| KV cache reuse | Reuse the KV cache from a previous request via the context field |
| Memory mapping (mmap) | Map the GGUF file into memory for fast loading |
| Batch inference | Process multiple sequences in a single batch |
| GPU layer offloading | Distribute layers between GPU and CPU via the num_gpu param |
| Token streaming | Deliver generated tokens to the client immediately, no buffering |
| Reference counting | Share the same Runner instance for concurrent requests to the same model |
| CPU SIMD | Leverage AVX2/AVX512 instruction sets |
15. Q&A: Real-World Usage Scenarios
Q1. After running ollama run llama3.2, responses are very slow. How can I speed things up?
Root cause analysis:
- GPU not being used: Check the current model state with ollama ps. The GPU/CPU ratio is shown next to the Size column.
- CPU fallback due to insufficient VRAM: If the model is larger than the available VRAM, some layers run on the CPU. This can make inference more than 10× slower.
Solutions:
# Check GPU status
ollama ps
# Inspect the number of layers loaded onto VRAM (debug logs)
OLLAMA_DEBUG=1 ollama run llama3.2
# If VRAM is insufficient, use a smaller quantized model
ollama pull llama3.2:3b-instruct-q4_K_M # 4-bit quantization, ~2GB
# Or increase the CPU thread count
OLLAMA_NUM_THREAD=8 ollama serve
Q2. I'm getting a "context length exceeded" error when calling the API.
Cause: The request has exceeded the model's default context length (2048 tokens).
Solution:
# Increase num_ctx in the request
curl http://localhost:11434/api/chat -d '{
"model": "llama3.2",
"messages": [{"role": "user", "content": "A long document..."}],
"options": {
"num_ctx": 8192
}
}'
Or change the default via a Modelfile.
FROM llama3.2
PARAMETER num_ctx 8192
ollama create my-llama --file ./Modelfile
Note: Increasing num_ctx grows the KV cache size and requires more VRAM. If VRAM is insufficient, the model may fail to load at all.
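The growth is linear in num_ctx, which a rough estimate makes visible. The Python sketch below uses the usual formula for a standard attention KV cache with 16-bit entries; the layer count, KV head count, and head dimension are illustrative values for a small grouped-query-attention model, not figures read from any specific GGUF file.

```python
def kv_cache_bytes(n_ctx: int, n_layers: int, n_kv_heads: int, head_dim: int,
                   bytes_per_elem: int = 2) -> int:
    """Keys and values: 2 tensors per layer, each n_ctx x n_kv_heads x head_dim."""
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem

# Illustrative dimensions (assumed values, 16-bit cache entries).
dims = dict(n_layers=28, n_kv_heads=8, head_dim=128)

for n_ctx in (2048, 8192, 32768):
    mib = kv_cache_bytes(n_ctx, **dims) / 2**20
    print(f"num_ctx={n_ctx:>6}: {mib:.0f} MiB")
```

Under these assumptions, going from 2048 to 8192 tokens quadruples the cache from 224 MiB to 896 MiB, on top of the model weights.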
Q3. I get a different answer every time I ask the same question. How do I get consistent results?
Cause: By default, temperature > 0, so sampling introduces randomness.
Solution:
curl http://localhost:11434/api/chat -d '{
"model": "llama3.2",
"messages": [{"role": "user", "content": "2+2=?"}],
"options": {
"temperature": 0,
"seed": 42
}
}'
Setting temperature: 0 always selects the highest-probability token.
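Why these two options give repeatable output can be seen in a toy sampler. The Python sketch below is not Ollama's sampler: the logits are invented, and `sample_token` and `generate` are illustrative names. It only shows that temperature 0 reduces to picking the top-scoring token and that a fixed seed makes random sampling reproducible.

```python
import math
import random

LOGITS = {"4": 5.0, "four": 3.5, "5": 1.0}  # made-up scores for "2+2=?"

def sample_token(logits: dict, temperature: float, rng: random.Random) -> str:
    """Pick a token from raw logits; temperature 0 degenerates to argmax."""
    if temperature <= 0:
        return max(logits, key=logits.get)
    peak = max(logits.values())  # subtract the max for numerical stability
    weights = [math.exp((v - peak) / temperature) for v in logits.values()]
    return rng.choices(list(logits), weights=weights, k=1)[0]

def generate(seed: int, temperature: float, n: int = 20) -> list:
    """Draw n tokens with a generator seeded once, like one request with a seed."""
    rng = random.Random(seed)
    return [sample_token(LOGITS, temperature, rng) for _ in range(n)]

greedy = generate(seed=1, temperature=0.0)
same_seed = generate(seed=42, temperature=0.8) == generate(seed=42, temperature=0.8)
print(set(greedy), same_seed)
```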
Q4. How do I access Ollama from an external network (another machine, a Docker container, etc.)?
By default, Ollama only listens on 127.0.0.1:11434.
# Listen on all interfaces
OLLAMA_HOST=0.0.0.0 ollama serve
# Or bind to a specific IP
OLLAMA_HOST=192.168.1.100:11434 ollama serve
You may also need to configure CORS.
OLLAMA_ORIGINS="http://localhost:3000,https://myapp.com" ollama serve
Security note: When exposing Ollama externally, restrict access with firewall rules. Ollama itself has no authentication (except in cloud mode).
Q5. I want to use multiple models at the same time. Is that possible?
Yes, but VRAM is the limiting factor.
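# Set the maximum number of models that can stay loaded at once
OLLAMA_MAX_LOADED_MODELS=2 ollama serve
# OLLAMA_NUM_PARALLEL is a different knob: it sets how many requests one loaded model serves in parallel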
How it works:
- First model loaded: llama3.2 → placed in VRAM
- Second model loaded: mistral → loaded alongside if VRAM allows; otherwise llama3.2 is unloaded first
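The load-alongside-or-evict behavior can be sketched as a least-recently-used pool. This Python toy uses made-up sizes and a hypothetical `big-model`, and it ignores reference counts, so it treats every loaded model as idle and evictable; `VramPool` is an illustrative name, not an Ollama type.

```python
from collections import OrderedDict

class VramPool:
    """Toy VRAM accounting with least-recently-used eviction (illustrative only)."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.loaded = OrderedDict()  # model -> size, oldest first

    def free(self) -> int:
        return self.capacity - sum(self.loaded.values())

    def use(self, model: str, size: int) -> list:
        """Load (or touch) a model and return the models evicted to make room."""
        if size > self.capacity:
            raise MemoryError(f"{model} does not fit in VRAM at all")
        evicted = []
        if model in self.loaded:
            self.loaded.move_to_end(model)  # mark as most recently used
            return evicted
        while self.free() < size:
            victim, _ = self.loaded.popitem(last=False)  # least recently used
            evicted.append(victim)
        self.loaded[model] = size
        return evicted

pool = VramPool(capacity=12)      # sizes in GB, made up
first = pool.use("llama3.2", 3)   # fits
second = pool.use("mistral", 5)   # fits alongside
pool.use("llama3.2", 3)           # touch: mistral becomes the LRU entry
third = pool.use("big-model", 8)  # needs room, so the LRU model goes first
print(first, second, third, list(pool.loaded))
```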
To keep a model permanently in memory:
# Call the model with keep_alive=-1 to retain it indefinitely
curl http://localhost:11434/api/generate -d '{
"model": "llama3.2",
"keep_alive": -1,
"prompt": ""
}'
Q6. My internet connection dropped during ollama pull. Do I have to start over?
No. Ollama supports resumable downloads.
Partial download files are saved to ~/.ollama/models/blobs/ in the form sha256-[digest]-partial. Simply re-run the same ollama pull command and the download will resume from where it left off.
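The resume logic boils down to asking the server only for the bytes that are still missing. The Python sketch below treats the partial file as a single contiguous prefix, which is a simplification of a multi-part download; `resume_range` and the placeholder digest in the file name are illustrative.

```python
import os
import tempfile

def resume_range(partial_path: str, total_size: int):
    """Return the Range header needed to finish a download, or None if complete."""
    done = os.path.getsize(partial_path) if os.path.exists(partial_path) else 0
    if done >= total_size:
        return None
    return {"Range": f"bytes={done}-{total_size - 1}"}

with tempfile.TemporaryDirectory() as blobs_dir:
    # Hypothetical partial file; the digest is a placeholder.
    partial = os.path.join(blobs_dir, "sha256-abc123-partial")
    fresh = resume_range(partial, 1000)    # nothing on disk yet
    with open(partial, "wb") as f:
        f.write(b"\0" * 400)               # pretend 400 bytes arrived before the drop
    resumed = resume_range(partial, 1000)
    with open(partial, "ab") as f:
        f.write(b"\0" * 600)               # the rest arrives after resuming
    finished = resume_range(partial, 1000)

print(fresh, resumed, finished)
```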
Q7. What is the fastest configuration on Apple Silicon Macs?
The Metal backend is activated automatically on Apple Silicon.
# Leverage Unified Memory (shared CPU+GPU memory)
# MacBook Pro M3 Pro (18GB): can run llama3.1:8b Q8 fully on GPU
# Disabling memory mapping slows initial loading but can improve inference speed
OLLAMA_USE_MMAP=0 ollama serve
# Use the MLX Runner (experimental — faster Metal inference)
OLLAMA_LLM_LIBRARY=mlx ollama serve
Because M1/M2/M3 Unified Memory has no distinct VRAM boundary, setting num_gpu=-1 makes the entire memory available to the GPU.
Q8. How do I call the Ollama API from Python or JavaScript?
Python (official library):
import ollama
# Streaming chat
for chunk in ollama.chat(
    model='llama3.2',
    messages=[{'role': 'user', 'content': 'What is the capital of South Korea?'}],
    stream=True,
):
    print(chunk['message']['content'], end='', flush=True)
# Embeddings
response = ollama.embed(model='nomic-embed-text', input='some text')
print(response['embeddings'])
JavaScript/TypeScript (official library):
import ollama from 'ollama'
const response = await ollama.chat({
model: 'llama3.2',
messages: [{ role: 'user', content: 'Hello!' }],
stream: true,
})
for await (const part of response) {
process.stdout.write(part.message.content)
}
Using the OpenAI-compatible SDK:
from openai import OpenAI
client = OpenAI(
base_url='http://localhost:11434/v1',
api_key='ollama', # any value works
)
response = client.chat.completions.create(
model='llama3.2',
messages=[{'role': 'user', 'content': 'Hello!'}],
)
Q9. How do I integrate with LangChain or LlamaIndex?
# LangChain
from langchain_ollama import ChatOllama
llm = ChatOllama(model="llama3.2", temperature=0)
response = llm.invoke("Implement the Fibonacci sequence in Python")
# LlamaIndex
from llama_index.llms.ollama import Ollama
llm = Ollama(model="llama3.2", request_timeout=120.0)
response = llm.complete("Hello!")
Q10. Can I run a GGUF model from HuggingFace directly in Ollama?
Yes. You can specify a local path or a HuggingFace repository in the FROM directive of a Modelfile.
# Option 1: local GGUF file
cat > Modelfile << 'EOF'
FROM /path/to/my-model.gguf
PARAMETER temperature 0.7
SYSTEM "You are a helpful assistant."
EOF
ollama create my-custom-model --file Modelfile
ollama run my-custom-model
# Option 2: directly from a HuggingFace repo
ollama run hf.co/bartowski/Llama-3.2-3B-Instruct-GGUF:Q4_K_M
Q11. How do I use a Vision model (image understanding)?
# Use the LLaVA model
ollama pull llava
# Send an image via the API (Base64-encoded)
import base64, ollama
with open('image.jpg', 'rb') as f:
    image_data = base64.b64encode(f.read()).decode()
response = ollama.chat(
    model='llava',
    messages=[{
        'role': 'user',
        'content': 'What do you see in this image?',
        'images': [image_data]
    }]
)
Q12. How do I run Ollama with a GPU inside a Docker container?
# NVIDIA GPU
docker run -d \
--gpus=all \
-v ollama:/root/.ollama \
-p 11434:11434 \
--name ollama \
ollama/ollama
# CPU-only (no GPU)
docker run -d \
-v ollama:/root/.ollama \
-p 11434:11434 \
--name ollama \
ollama/ollama
# Run a model
docker exec -it ollama ollama run llama3.2
Q13. Is Function Calling (Tool Use) supported?
Yes! It follows the OpenAI-compatible format.
import ollama
tools = [{
    'type': 'function',
    'function': {
        'name': 'get_weather',
        'description': 'Look up the weather for a given city',
        'parameters': {
            'type': 'object',
            'properties': {
                'city': {'type': 'string', 'description': 'City name'},
            },
            'required': ['city'],
        },
    },
}]
response = ollama.chat(
    model='llama3.2',
    messages=[{'role': 'user', 'content': 'Tell me the weather in Seoul'}],
    tools=tools,
)
# If tool_calls are present, execute the corresponding function
if response.message.tool_calls:
    for tool_call in response.message.tool_calls:
        print(f"Call: {tool_call.function.name}({tool_call.function.arguments})")
Note: Tool use support varies by model. Recent models such as llama3.2, qwen2.5, and mistral are recommended.
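Once the model returns tool calls, the application still has to run them and send the results back. The Python sketch below shows one way to dispatch calls to local functions using plain dicts as a stand-in for `response.message.tool_calls`, so it runs without a server. `get_weather`, `TOOLS`, and `run_tool_calls` are hypothetical, and the result message carries only the `role` and `content` fields.

```python
import json

def get_weather(city: str) -> str:
    """Hypothetical local tool; a real one would call a weather service."""
    return f"Sunny in {city}"

TOOLS = {"get_weather": get_weather}  # tool name -> Python callable

def run_tool_calls(tool_calls: list) -> list:
    """Execute each requested tool and build 'tool' role messages for the next turn."""
    results = []
    for call in tool_calls:
        fn = call["function"]
        name, args = fn["name"], fn["arguments"]
        if isinstance(args, str):  # some clients deliver arguments as a JSON string
            args = json.loads(args)
        handler = TOOLS.get(name)
        content = handler(**args) if handler else f"unknown tool: {name}"
        results.append({"role": "tool", "content": content})
    return results

# Shape of a tool call as plain dicts (assumed stand-in for the client objects).
calls = [{"function": {"name": "get_weather", "arguments": {"city": "Seoul"}}}]
messages = run_tool_calls(calls)
print(messages)
```

The returned messages would be appended to the conversation and sent in a follow-up chat request so the model can phrase the final answer.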
Source code version analyzed: Ollama v0.18.0 (as of March 2026)