NVIDIA A6000 Pro Blackwell Server

This guide covers working with LLMs and GPU resources on the Blackwell server, which has two GPUs and runs two LLM services: Ollama and vLLM.

Server Overview

Specification    Details
GPUs             2x NVIDIA A6000 Pro Blackwell
VRAM             96 GB per GPU (192 GB total)
Server IP        129.10.156.97

LLM Services

Service             Container          Port    GPU
Ollama              ollama-blackwell   5000    GPU 1
vLLM (Gemma 12B)    gemma_12b          8000    GPU 0

GPU Commands

Check GPU Status

View GPU utilization and running processes:

nvidia-smi

For continuous monitoring:

watch -n 1 nvidia-smi
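
For a one-line summary per GPU, nvidia-smi can also emit selected fields as CSV (these are standard nvidia-smi query options):

nvidia-smi --query-gpu=index,name,memory.used,memory.total,utilization.gpu --format=csv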

Ollama (Port 5000)

Ollama runs on GPU 1 and is configured via:

/usr/local/bin/ollama-gpu.sh
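
The script itself is not reproduced here. As a purely hypothetical sketch (the image name, port mapping, and flags below are assumptions, not the actual script contents), a wrapper like this typically pins the container to GPU 1 and maps Ollama's default port 11434 to host port 5000:

#!/bin/bash
# Hypothetical sketch only -- the real /usr/local/bin/ollama-gpu.sh may differ
# Pin the Ollama container to GPU 1 and expose its default port 11434 as host port 5000
docker run -d \
  --name ollama-blackwell \
  --gpus '"device=1"' \
  -p 5000:11434 \
  --restart unless-stopped \
  ollama/ollama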

Docker Commands

View Running Containers

docker ps

List Available Models

docker exec ollama-blackwell ollama list

Check GPU Usage by Container

docker exec ollama-blackwell nvidia-smi

Managing Models

Pull a New Model

Download models from the Ollama Model Library:

docker exec ollama-blackwell ollama pull <model_name>

Examples:

# Pull Llama 3.2
docker exec ollama-blackwell ollama pull llama3.2

# Pull Mistral
docker exec ollama-blackwell ollama pull mistral

# Pull CodeLlama
docker exec ollama-blackwell ollama pull codellama

Remove a Model

docker exec ollama-blackwell ollama rm <model_name>

Accessing Ollama from Local Machine

Step 1: Create SSH Tunnel

ssh -L 5000:localhost:5000 test_user@129.10.156.97
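
To keep the tunnel running in the background without an interactive shell, add the standard ssh flags -f (background after authentication) and -N (no remote command):

ssh -fN -L 5000:localhost:5000 test_user@129.10.156.97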

Step 2: Test the Connection

curl http://localhost:5000/api/tags

Step 3: Use the API

curl http://localhost:5000/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Hello, how are you?"
}'
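
By default, /api/generate streams the response as newline-delimited JSON. To get a single JSON object instead, set the standard Ollama API field "stream" to false:

curl http://localhost:5000/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Hello, how are you?",
  "stream": false
}'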

vLLM - Gemma 12B (Port 8000)

vLLM runs on GPU 0, serving the google/gemma-3-12b-it model.

Container Configuration

The vLLM container is started with the following command:

docker run -d \
  --name gemma_12b \
  --gpus '"device=0"' \
  -p 8000:8000 \
  --network presbot-server-uat_presbot-network \
  --restart unless-stopped \
  vllm/vllm-openai:latest \
  --model google/gemma-3-12b-it \
  --gpu-memory-utilization 0.70 \
  --max-num-seqs 20 \
  --max-model-len 8192 \
  --enable-prefix-caching \
  --disable-log-requests \
  --host 0.0.0.0 \
  --port 8000

Key Parameters

Parameter                   Value                   Description
--model                     google/gemma-3-12b-it   Gemma 3 12B Instruct model
--gpu-memory-utilization    0.70                    Caps vLLM at 70% of GPU memory
--max-num-seqs              20                      Maximum concurrent sequences
--max-model-len             8192                    Maximum context length (tokens)
--enable-prefix-caching     (flag)                  Enables prefix caching for efficiency

Docker Commands

Check Container Status

docker ps | grep gemma_12b

View Container Logs

docker logs gemma_12b

Restart Container

docker restart gemma_12b

Stop Container

docker stop gemma_12b

Start Container

docker start gemma_12b

Accessing vLLM from Local Machine

Step 1: Create SSH Tunnel

ssh -L 8000:localhost:8000 test_user@129.10.156.97

Step 2: Test the Connection

curl http://localhost:8000/v1/models
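
To print just the model IDs from the response (assuming jq is installed on your machine):

curl -s http://localhost:8000/v1/models | jq '.data[].id'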

Step 3: Use the OpenAI-Compatible API

vLLM provides an OpenAI-compatible API:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemma-3-12b-it",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ]
  }'
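
Standard OpenAI request fields such as max_tokens and temperature are accepted as well, for example:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemma-3-12b-it",
    "messages": [
      {"role": "user", "content": "Explain prefix caching in one sentence."}
    ],
    "max_tokens": 100,
    "temperature": 0.7
  }'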

Troubleshooting

Container Not Running

# Check all containers
docker ps -a

# Start Ollama container
docker start ollama-blackwell

# Start vLLM container
docker start gemma_12b

Model Not Found (Ollama)

# List available models
docker exec ollama-blackwell ollama list

# Pull the model
docker exec ollama-blackwell ollama pull <model_name>

vLLM Container Crashes

Check the logs for errors:

docker logs gemma_12b --tail 100

Common issues:

  • Out of memory: Reduce --gpu-memory-utilization or --max-num-seqs (see the relaunch sketch below)
  • Model download failed: Check network connectivity
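
Because these flags are fixed at container creation, lowering --gpu-memory-utilization requires recreating the container. A sketch reusing the launch command above, with an illustrative 0.60 ceiling:

# Remove the existing container, then relaunch with a lower memory ceiling
docker stop gemma_12b && docker rm gemma_12b
docker run -d \
  --name gemma_12b \
  --gpus '"device=0"' \
  -p 8000:8000 \
  --network presbot-server-uat_presbot-network \
  --restart unless-stopped \
  vllm/vllm-openai:latest \
  --model google/gemma-3-12b-it \
  --gpu-memory-utilization 0.60 \
  --max-num-seqs 20 \
  --max-model-len 8192 \
  --enable-prefix-caching \
  --disable-log-requests \
  --host 0.0.0.0 \
  --port 8000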

Out of Memory (OOM)

  1. Check GPU usage: nvidia-smi
  2. Identify which container is using too much memory (see the per-process query below)
  3. Restart the problematic container or wait for processes to complete
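
For step 2, a per-process view of GPU memory (standard nvidia-smi query options) shows which PID holds the memory; the PIDs can then be matched to containers:

nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv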

Connection Refused

  • Ensure the SSH tunnel is active
  • Verify the correct port (5000 for Ollama, 8000 for vLLM)
  • Check that the container is running: docker ps