NVIDIA A6000 Pro Blackwell Server
This guide covers working with LLMs and GPU resources on the Blackwell server. This server has two GPUs and runs two LLM services: Ollama and vLLM.
Server Overview
| Specification | Details |
|---|---|
| GPUs | 2x NVIDIA A6000 Pro Blackwell |
| VRAM | 96 GB per GPU (192 GB total) |
| Server IP | 129.10.156.97 |
LLM Services
| Service | Container | Port | GPU |
|---|---|---|---|
| Ollama | ollama-blackwell | 5000 | GPU 1 |
| vLLM (Gemma 12B) | gemma_12b | 8000 | GPU 0 |
GPU Commands
Check GPU Status
View GPU utilization and running processes:
nvidia-smi
For continuous monitoring:
watch -n 1 nvidia-smi
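For scripted monitoring (e.g. from a cron job or a simple dashboard), a short Python sketch can poll both GPUs using nvidia-smi's CSV query output. The query flags are standard nvidia-smi options; the script itself is illustrative and not part of the server setup:

```python
import subprocess

# Fields to request from nvidia-smi (standard --query-gpu fields).
QUERY = "index,name,memory.used,memory.total,utilization.gpu"

def gpu_snapshot():
    # csv,noheader,nounits gives plain comma-separated numbers (memory in MiB).
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    for line in out.strip().splitlines():
        idx, name, used, total, util = [field.strip() for field in line.split(",")]
        print(f"GPU {idx} ({name}): {used}/{total} MiB used, {util}% utilization")

if __name__ == "__main__":
    gpu_snapshot()
```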
Ollama (Port 5000)
Ollama runs on GPU 1 and is configured via:
/usr/local/bin/ollama-gpu.sh
Docker Commands
View Running Containers
docker ps
List Available Models
docker exec ollama-blackwell ollama list
Check GPU Usage by Container
docker exec ollama-blackwell nvidia-smi
Managing Models
Pull a New Model
Download models from the Ollama Model Library:
docker exec ollama-blackwell ollama pull <model_name>
Examples:
# Pull Llama 3.2
docker exec ollama-blackwell ollama pull llama3.2
# Pull Mistral
docker exec ollama-blackwell ollama pull mistral
# Pull CodeLlama
docker exec ollama-blackwell ollama pull codellama
Remove a Model
docker exec ollama-blackwell ollama rm <model_name>
Accessing Ollama from Local Machine
Step 1: Create SSH Tunnel
ssh -L 5000:localhost:5000 test_user@129.10.156.97
Step 2: Test the Connection
curl http://localhost:5000/api/tags
Step 3: Use the API
curl http://localhost:5000/api/generate -d '{
"model": "llama3.2",
"prompt": "Hello, how are you?"
}'
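The same endpoint can be called from Python. A minimal sketch, assuming the SSH tunnel from Step 1 is active, the requests package is installed, and llama3.2 has already been pulled:

```python
import requests

# Assumes the tunnel from Step 1 is forwarding localhost:5000 to the server.
resp = requests.post(
    "http://localhost:5000/api/generate",
    json={
        "model": "llama3.2",
        "prompt": "Hello, how are you?",
        "stream": False,  # return one JSON object instead of a stream of chunks
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["response"])
```

Setting stream to false returns a single JSON object; omit it to receive Ollama's default streamed response, one JSON chunk per line.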
vLLM - Gemma 12B (Port 8000)
vLLM runs on GPU 0, serving the google/gemma-3-12b-it model.
Container Configuration
The vLLM container is started with the following command:
docker run -d \
--name gemma_12b \
--gpus '"device=0"' \
-p 8000:8000 \
--network presbot-server-uat_presbot-network \
--restart unless-stopped \
vllm/vllm-openai:latest \
--model google/gemma-3-12b-it \
--gpu-memory-utilization 0.70 \
--max-num-seqs 20 \
--max-model-len 8192 \
--enable-prefix-caching \
--disable-log-requests \
--host 0.0.0.0 \
--port 8000
Key Parameters
| Parameter | Value | Description |
|---|---|---|
| --model | google/gemma-3-12b-it | Gemma 3 12B Instruct model |
| --gpu-memory-utilization | 0.70 | Uses 70% of GPU memory |
| --max-num-seqs | 20 | Maximum concurrent sequences |
| --max-model-len | 8192 | Maximum context length in tokens |
| --enable-prefix-caching | - | Enables prefix caching for efficiency |
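As a rough memory budget: with 96 GB of VRAM on GPU 0, a utilization of 0.70 lets vLLM claim about 0.70 × 96 GB ≈ 67 GB for model weights and KV cache, leaving roughly 29 GB of headroom for CUDA overhead and any other processes on that GPU.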
Docker Commands
Check Container Status
docker ps | grep gemma_12b
View Container Logs
docker logs gemma_12b
Restart Container
docker restart gemma_12b
Stop Container
docker stop gemma_12b
Start Container
docker start gemma_12b
Accessing vLLM from Local Machine
Step 1: Create SSH Tunnel
ssh -L 8000:localhost:8000 test_user@129.10.156.97
Step 2: Test the Connection
curl http://localhost:8000/v1/models
Step 3: Use the OpenAI-Compatible API
vLLM provides an OpenAI-compatible API:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "google/gemma-3-12b-it",
"messages": [
{"role": "user", "content": "Hello!"}
]
}'
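Because the endpoint is OpenAI-compatible, the official openai Python client (version 1.0 or later) can be pointed at the tunnel. A minimal sketch, assuming the SSH tunnel from Step 1 is active; the api_key value is a placeholder, since the container was started without an --api-key flag and does not enforce authentication:

```python
from openai import OpenAI

# Assumes the tunnel from Step 1 is forwarding localhost:8000 to the server.
# The api_key is a placeholder; this vLLM deployment does not check it.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="google/gemma-3-12b-it",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=256,
)
print(completion.choices[0].message.content)
```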
Troubleshooting
Container Not Running
# Check all containers
docker ps -a
# Start Ollama container
docker start ollama-blackwell
# Start vLLM container
docker start gemma_12b
Model Not Found (Ollama)
# List available models
docker exec ollama-blackwell ollama list
# Pull the model
docker exec ollama-blackwell ollama pull <model_name>
vLLM Container Crashes
Check the logs for errors:
docker logs gemma_12b --tail 100
Common issues:
- Out of memory: Reduce --gpu-memory-utilization or --max-num-seqs
- Model download failed: Check network connectivity
Out of Memory (OOM)
- Check GPU usage: nvidia-smi
- Identify which container is using too much memory
- Restart the problematic container or wait for processes to complete
Connection Refused
- Ensure SSH tunnel is active
- Verify correct port (5000 for Ollama, 8000 for vLLM)
- Check if container is running: docker ps
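When a request is refused, a small Python probe can help distinguish a dead tunnel or stopped container from an application-level error. A minimal sketch, assuming the requests package is installed and the SSH tunnels described above are expected to be active:

```python
import requests

# Reachability probe for both services through their local tunnels.
ENDPOINTS = {
    "Ollama": "http://localhost:5000/api/tags",
    "vLLM": "http://localhost:8000/v1/models",
}

for name, url in ENDPOINTS.items():
    try:
        r = requests.get(url, timeout=5)
        print(f"{name}: reachable (HTTP {r.status_code})")
    except requests.ConnectionError:
        print(f"{name}: connection refused - check the tunnel and the container")
```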