Running NVIDIA Nemotron 3 Super on Vast.ai

NVIDIA Nemotron 3 Super is a 120B parameter model that only activates 12B parameters per token, which means you get the quality of a much larger model at a fraction of the compute. It uses a novel hybrid architecture — Mamba-2 for fast sequence processing, Transformer attention where precision matters, and a Latent Mixture-of-Experts layer for efficient routing — and supports context windows up to 1M tokens. The model is particularly interesting because it ships with a built-in reasoning toggle. You can turn reasoning on for complex tasks like math and coding, switch to a low-effort mode for lighter thinking, or turn it off entirely for fast direct answers — all from the same deployment, controlled per request. It also supports Multi-Token Prediction for faster inference through speculative decoding, and performs well on agentic benchmarks involving tool use and multi-step task execution. This guide deploys the FP8 variant on Vast.ai using SGLang and queries it via the OpenAI-compatible API.

Prerequisites

Before getting started, you’ll need:
  • A Vast.ai account with credits
  • Vast.ai CLI installed (pip install vastai)
  • Your Vast.ai API key configured
  • Python 3.8+ (for the OpenAI SDK examples)
Get your API key from the Vast.ai account page and set it with vastai set api-key YOUR_API_KEY.

Understanding Nemotron 3 Super

Key capabilities:
  • Efficient MoE Architecture: 120B total parameters, only 12B active per token
  • Hybrid Layers: Mamba-2 (linear-time) + Transformer attention + Latent MoE
  • Reasoning Toggle: On, off, or low-effort modes via chat_template_kwargs
  • Long Context: Up to 1M tokens (256K default)
  • Commercial License: NVIDIA Nemotron Open Model License

Hardware Requirements

The FP8 variant requires:
  • GPUs: 2× H100 SXM (80GB each) with NVLink for tensor parallelism
  • Disk Space: 200GB minimum (model is ~120GB)
  • CUDA Version: 12.4 or higher
  • Docker Image: lmsysorg/sglang:v0.5.9
H100 SXM GPUs are required (not PCIe) because NVLink is needed for efficient tensor parallelism across 2 GPUs.
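As a rough back-of-envelope check (ignoring the CUDA context, activations, and KV cache growth, which eat into the remainder), the FP8 weights alone account for most of the VRAM:

```python
# Approximate VRAM budget for the FP8 variant on 2x H100 SXM.
params_b = 120             # total parameters, in billions
bytes_per_param = 1        # FP8 stores one byte per parameter
weights_gb = params_b * bytes_per_param   # ~120 GB of weights

total_vram_gb = 2 * 80     # 2x H100 SXM, 80 GB each
headroom_gb = total_vram_gb - weights_gb  # ~40 GB left for KV cache, activations, runtime

print(f"weights: ~{weights_gb} GB, headroom: ~{headroom_gb} GB")
```

This is why a single 80GB GPU cannot hold the model and tensor parallelism across two NVLink-connected GPUs is needed.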

Instance Configuration

Step 1: Search for Suitable Instances

Bash
vastai search offers \
  "gpu_name=H100_SXM num_gpus=2 gpu_ram>=80 cuda_vers>=12.4 \
   disk_space>=200 direct_port_count>1 inet_down>=500 rentable=true" \
  --order "dph_base" --limit 10
This searches for:
  • 2× H100 SXM GPUs with at least 80GB VRAM each
  • CUDA 12.4 or higher
  • At least 200GB disk space
  • Direct port access for the API endpoint
  • High download speed for faster model loading
  • Sorted by price (lowest first)

Step 2: Create the Instance

Select an instance ID from the search results and deploy:
Bash
vastai create instance <INSTANCE_ID> \
  --image lmsysorg/sglang:v0.5.9 \
  --env '-p 5000:5000' \
  --disk 200 \
  --onstart-cmd "python3 -m sglang.launch_server \
    --model-path nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8 \
    --served-model-name nvidia/nemotron-3-super \
    --host 0.0.0.0 \
    --port 5000 \
    --trust-remote-code \
    --tp 2 \
    --kv-cache-dtype fp8_e4m3 \
    --reasoning-parser nano_v3"
Key parameters explained:
  • --image lmsysorg/sglang:v0.5.9 — SGLang stable release with Nemotron 3 Super support
  • --env '-p 5000:5000' — Expose port 5000 for the API endpoint
  • --disk 200 — 200GB for the ~120GB model weights plus overhead
  • --tp 2 — Tensor parallelism across both H100 GPUs
  • --kv-cache-dtype fp8_e4m3 — FP8 KV cache for efficient memory usage
  • --reasoning-parser nano_v3 — Enables reasoning content parsing for thinking mode
  • --trust-remote-code — Required for the custom Nemotron architecture

Monitoring Deployment

Check Deployment Status

Bash
vastai logs <INSTANCE_ID>
Look for this message indicating the server is ready:
Text
The server is fired up and ready to roll!
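If you prefer to poll programmatically rather than tail logs, a minimal sketch is to hit the OpenAI-compatible /v1/models endpoint until it responds (the URL layout matches the endpoint you derive in the next step; adjust attempts and delay to taste):

```python
import time
import urllib.request
import urllib.error

def wait_until_ready(base_url, attempts=60, delay=10):
    """Poll the OpenAI-compatible /v1/models endpoint until the server responds."""
    url = base_url.rstrip("/") + "/v1/models"
    for _ in range(attempts):
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not up yet (or still loading weights); keep waiting
        time.sleep(delay)
    return False

# Example: wait_until_ready("http://<PUBLIC_IP>:<EXTERNAL_PORT>")
```

Model loading can take several minutes on first start, since the ~120GB of weights must be downloaded and sharded across both GPUs.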

Get Your Endpoint

Once deployment completes, get your instance details:
Bash
vastai show instance <INSTANCE_ID> --raw
Look for the ports field — it maps internal port 5000 to an external port. Your API endpoint will be:
Text
http://<PUBLIC_IP>:<EXTERNAL_PORT>/v1
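The port lookup can be scripted. The sketch below assumes the raw JSON carries a Docker-style "ports" mapping plus a "public_ipaddr" field; the exact key names may differ across CLI versions, so verify against your own --raw output:

```python
import json
import subprocess

def endpoint_from_info(info):
    """Map internal port 5000 to its external port and build the API base URL.

    Key names ("ports", "5000/tcp", "HostPort", "public_ipaddr") are
    assumptions about the CLI's raw JSON; adjust if your output differs.
    """
    external_port = info["ports"]["5000/tcp"][0]["HostPort"]
    return f"http://{info['public_ipaddr']}:{external_port}/v1"

def get_endpoint(instance_id):
    """Shell out to the vastai CLI and parse the instance details."""
    raw = subprocess.check_output(
        ["vastai", "show", "instance", str(instance_id), "--raw"])
    return endpoint_from_info(json.loads(raw))
```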

Using the Nemotron 3 Super API

Quick Test with cURL

Verify the server is responding:
Bash
curl -X POST http://<IP>:<PORT>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/nemotron-3-super",
    "messages": [{"role": "user", "content": "What is 25 * 37?"}],
    "max_tokens": 500,
    "temperature": 1.0,
    "top_p": 0.95
  }'
NVIDIA recommends temperature=1.0 and top_p=0.95 for inference with this model; keep these settings unless you have a specific reason to deviate.

Python Integration

Using the OpenAI Python SDK:
Python
from openai import OpenAI

client = OpenAI(
    base_url="http://<IP>:<PORT>/v1",
    api_key="EMPTY"  # SGLang doesn't require an API key
)

response = client.chat.completions.create(
    model="nvidia/nemotron-3-super",
    messages=[{"role": "user", "content": "Explain quantum entanglement in 2 sentences."}],
    max_tokens=300,
    temperature=1.0,
    top_p=0.95
)

print(response.choices[0].message.content)

Reasoning Modes

Nemotron 3 Super supports three reasoning modes, controlled via chat_template_kwargs. By default, reasoning is enabled.
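The three modes detailed below differ only in the kwargs they pass. A small helper (hypothetical, not part of any SDK) keeps the payloads in one place:

```python
# Hypothetical helper mapping mode names to the chat_template_kwargs
# combinations used by Nemotron 3 Super's chat template.
REASONING_MODES = {
    "on": {"enable_thinking": True},
    "off": {"enable_thinking": False},
    "low": {"enable_thinking": True, "low_effort": True},
}

def reasoning_kwargs(mode):
    """Build the extra_body payload for a given reasoning mode."""
    return {"chat_template_kwargs": REASONING_MODES[mode]}

# Usage: client.chat.completions.create(..., extra_body=reasoning_kwargs("low"))
```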

Reasoning ON (Default)

The model shows its thinking in reasoning_content before giving the final answer in content:
Python
response = client.chat.completions.create(
    model="nvidia/nemotron-3-super",
    messages=[{"role": "user", "content": "Explain quantum entanglement in 2 sentences."}],
    max_tokens=300,
    temperature=1.0,
    top_p=0.95,
    extra_body={"chat_template_kwargs": {"enable_thinking": True}}
)

msg = response.choices[0].message
print("Thinking:", msg.reasoning_content)
print("Answer:", msg.content)

Reasoning OFF

Disable reasoning for faster, direct responses:
Python
response = client.chat.completions.create(
    model="nvidia/nemotron-3-super",
    messages=[{"role": "user", "content": "Explain quantum entanglement in 2 sentences."}],
    max_tokens=300,
    temperature=1.0,
    top_p=0.95,
    extra_body={"chat_template_kwargs": {"enable_thinking": False}}
)

msg = response.choices[0].message
# With reasoning OFF, the answer is in reasoning_content
print("Answer:", msg.reasoning_content)
When reasoning is disabled via SGLang’s nano_v3 parser, the response text is returned in reasoning_content instead of content (which will be None). Make sure to read from the correct field based on the mode you’re using.

Low-Effort Reasoning

A middle ground — brief reasoning with fast responses:
Python
response = client.chat.completions.create(
    model="nvidia/nemotron-3-super",
    messages=[{"role": "user", "content": "What is 25 * 37?"}],
    max_tokens=300,
    temperature=1.0,
    top_p=0.95,
    extra_body={"chat_template_kwargs": {"enable_thinking": True, "low_effort": True}}
)

msg = response.choices[0].message
print("Thinking:", msg.reasoning_content)  # Brief reasoning
print("Answer:", msg.content)

Reasoning with cURL

Pass chat_template_kwargs at the top level of the JSON body:
Bash
curl -X POST http://<IP>:<PORT>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/nemotron-3-super",
    "messages": [{"role": "user", "content": "What is 25 * 37?"}],
    "max_tokens": 500,
    "temperature": 1.0,
    "top_p": 0.95,
    "chat_template_kwargs": {"enable_thinking": false}
  }'

Cleanup

When you’re done, destroy the instance to stop billing:
Bash
vastai destroy instance <INSTANCE_ID>
Always destroy your instance when you’re finished to avoid unnecessary charges.

Conclusion

Nemotron 3 Super delivers frontier-class reasoning performance by activating only 12B of its 120B parameters per token. With SGLang and Vast.ai, you can deploy the model on 2× H100 GPUs and start querying it via the OpenAI-compatible API. The reasoning toggle is particularly useful: enable it for complex tasks like math, coding, and analysis, or disable it for fast direct answers in production pipelines.