NVIDIA A6000 Pro Blackwell Server

This guide covers working with LLMs and GPU resources on the Blackwell server, which has two GPUs and runs two LLM services: Ollama and vLLM.

Server Overview

Specification    Details
GPUs             2x NVIDIA A6000 Pro Blackwell
VRAM             96 GB per GPU (192 GB total)
Server IP        129.10.156.97

LLM Services

Service             Container          Port    GPU
Ollama              ollama-blackwell   5000    GPU 1
vLLM (Gemma 12B)    gemma_12b          8000    GPU 0

GPU Commands

Check GPU Status

View GPU utilization and running processes:

nvidia-smi

For continuous monitoring:

watch -n 1 nvidia-smi
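
For a one-line summary per GPU, nvidia-smi can also emit selected fields as CSV (these are standard nvidia-smi query options):

nvidia-smi --query-gpu=index,name,memory.used,memory.total,utilization.gpu --format=csv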

Ollama (Port 5000)

Ollama runs on GPU 1 and is configured via:

/usr/local/bin/ollama-gpu.sh
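
The script itself is not reproduced here. As a purely hypothetical sketch (the image name, port mapping, and flags below are assumptions, not the actual script contents), a wrapper like this typically pins the container to GPU 1 and maps Ollama's default port 11434 to host port 5000:

#!/bin/bash
# Hypothetical sketch only -- the real /usr/local/bin/ollama-gpu.sh may differ
# Pin the Ollama container to GPU 1 and expose its default port 11434 as host port 5000
docker run -d \
  --name ollama-blackwell \
  --gpus '"device=1"' \
  -p 5000:11434 \
  --restart unless-stopped \
  ollama/ollama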

Docker Commands

View Running Containers

docker ps

List Available Models

docker exec ollama-blackwell ollama list

Check GPU Usage by Container

docker exec ollama-blackwell nvidia-smi

Managing Models

Pull a New Model

Download models from the Ollama Model Library:

docker exec ollama-blackwell ollama pull <model_name>

Examples:

# Pull Llama 3.2
docker exec ollama-blackwell ollama pull llama3.2

# Pull Mistral
docker exec ollama-blackwell ollama pull mistral

# Pull CodeLlama
docker exec ollama-blackwell ollama pull codellama

Remove a Model

docker exec ollama-blackwell ollama rm <model_name>

Accessing Ollama from Local Machine

Step 1: Create SSH Tunnel

ssh -L 5000:localhost:5000 test_user@129.10.156.97
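
To keep the tunnel running in the background without an interactive shell, add the standard ssh flags -f (background after authentication) and -N (no remote command):

ssh -fN -L 5000:localhost:5000 test_user@129.10.156.97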

Step 2: Test the Connection

curl http://localhost:5000/api/tags

Step 3: Use the API

curl http://localhost:5000/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Hello, how are you?"
}'
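
By default, /api/generate streams the response as newline-delimited JSON. To get a single JSON object instead, set the standard Ollama API field "stream" to false:

curl http://localhost:5000/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Hello, how are you?",
  "stream": false
}'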

vLLM - Gemma 12B (Port 8000)

vLLM runs on GPU 0, serving the google/gemma-3-12b-it model.

Container Configuration

The vLLM container is started with the following command:

docker run -d \
  --name gemma_12b \
  --gpus '"device=0"' \
  -p 8000:8000 \
  --network presbot-server-uat_presbot-network \
  --restart unless-stopped \
  vllm/vllm-openai:latest \
  --model google/gemma-3-12b-it \
  --gpu-memory-utilization 0.70 \
  --max-num-seqs 20 \
  --max-model-len 8192 \
  --enable-prefix-caching \
  --disable-log-requests \
  --host 0.0.0.0 \
  --port 8000

Key Parameters

Parameter                   Value                   Description
--model                     google/gemma-3-12b-it   Gemma 3 12B Instruct model
--gpu-memory-utilization    0.70                    Caps vLLM at 70% of GPU memory
--max-num-seqs              20                      Maximum concurrent sequences
--max-model-len             8192                    Maximum context length (tokens)
--enable-prefix-caching     (flag)                  Enables prefix caching for efficiency

Docker Commands

Check Container Status

docker ps | grep gemma_12b

View Container Logs

docker logs gemma_12b

Restart Container

docker restart gemma_12b

Stop Container

docker stop gemma_12b

Start Container

docker start gemma_12b

Accessing vLLM from Local Machine

Step 1: Create SSH Tunnel

ssh -L 8000:localhost:8000 test_user@129.10.156.97

Step 2: Test the Connection

curl http://localhost:8000/v1/models
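
To print just the model IDs from the response (assuming jq is installed on your machine):

curl -s http://localhost:8000/v1/models | jq '.data[].id'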

Step 3: Use the OpenAI-Compatible API

vLLM provides an OpenAI-compatible API:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemma-3-12b-it",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ]
  }'
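
Standard OpenAI request fields such as max_tokens and temperature are accepted as well, for example:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemma-3-12b-it",
    "messages": [
      {"role": "user", "content": "Explain prefix caching in one sentence."}
    ],
    "max_tokens": 100,
    "temperature": 0.7
  }'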

Troubleshooting

Container Not Running

# Check all containers
docker ps -a

# Start Ollama container
docker start ollama-blackwell

# Start vLLM container
docker start gemma_12b

Model Not Found (Ollama)

# List available models
docker exec ollama-blackwell ollama list

# Pull the model
docker exec ollama-blackwell ollama pull <model_name>

vLLM Container Crashes

Check the logs for errors:

docker logs gemma_12b --tail 100

Common issues:

  • Out of memory: Reduce --gpu-memory-utilization or --max-num-seqs (see the relaunch sketch below)
  • Model download failed: Check network connectivity
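
Because these flags are fixed at container creation, lowering --gpu-memory-utilization requires recreating the container. A sketch reusing the launch command above, with an illustrative 0.60 ceiling:

# Remove the existing container, then relaunch with a lower memory ceiling
docker stop gemma_12b && docker rm gemma_12b
docker run -d \
  --name gemma_12b \
  --gpus '"device=0"' \
  -p 8000:8000 \
  --network presbot-server-uat_presbot-network \
  --restart unless-stopped \
  vllm/vllm-openai:latest \
  --model google/gemma-3-12b-it \
  --gpu-memory-utilization 0.60 \
  --max-num-seqs 20 \
  --max-model-len 8192 \
  --enable-prefix-caching \
  --disable-log-requests \
  --host 0.0.0.0 \
  --port 8000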

Out of Memory (OOM)

  1. Check GPU usage: nvidia-smi
  2. Identify which container is using too much memory (see the per-process query below)
  3. Restart the problematic container or wait for processes to complete
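
For step 2, a per-process view of GPU memory (standard nvidia-smi query options) shows which PID holds the memory; the PIDs can then be matched to containers:

nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv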

Connection Refused

  • Ensure the SSH tunnel is active
  • Verify the correct port (5000 for Ollama, 8000 for vLLM)
  • Check that the container is running: docker ps