Running NVIDIA Nemotron 3 Super on Vast.ai
NVIDIA Nemotron 3 Super is a 120B-parameter model that activates only 12B parameters per token, which means you get the quality of a much larger model at a fraction of the compute. It uses a novel hybrid architecture, combining Mamba-2 for fast sequence processing, Transformer attention where precision matters, and a Latent Mixture-of-Experts layer for efficient routing, and supports context windows up to 1M tokens.

The model is particularly interesting because it ships with a built-in reasoning toggle. You can turn reasoning on for complex tasks like math and coding, switch to a low-effort mode for lighter thinking, or turn it off entirely for fast, direct answers, all from the same deployment and controlled per request. It also supports Multi-Token Prediction for faster inference through speculative decoding, and performs well on agentic benchmarks involving tool use and multi-step task execution.

This guide deploys the FP8 variant on Vast.ai using SGLang and queries it via the OpenAI-compatible API.

Prerequisites
Before getting started, you’ll need:

- A Vast.ai account with credits (Sign up here)
- The Vast.ai CLI installed (pip install vastai)
- Your Vast.ai API key configured
- Python 3.8+ (for the OpenAI SDK examples)
Get your API key from the Vast.ai account page and set it with vastai set api-key YOUR_API_KEY.

Understanding Nemotron 3 Super
Key capabilities:

- Efficient MoE Architecture: 120B total parameters, only 12B active per token
- Hybrid Layers: Mamba-2 (linear-time) + Transformer attention + Latent MoE
- Reasoning Toggle: On, off, or low-effort modes via chat_template_kwargs
- Long Context: Up to 1M tokens (256K default)
- Commercial License: NVIDIA Nemotron Open Model License
Hardware Requirements
The FP8 variant requires:

- GPUs: 2× H100 SXM (80GB each) with NVLink for tensor parallelism
- Disk Space: 200GB minimum (model is ~120GB)
- CUDA Version: 12.4 or higher
- Docker Image: lmsysorg/sglang:v0.5.9
H100 SXM GPUs are required (not PCIe) because NVLink is needed for efficient tensor parallelism across 2 GPUs.
Instance Configuration
Step 1: Search for Suitable Instances
Search for offers that meet the model's requirements:

- 2× H100 SXM GPUs with at least 80GB VRAM each
- CUDA 12.4 or higher
- At least 200GB disk space
- Direct port access for the API endpoint
- High download speed for faster model loading
- Sorted by price (lowest first)
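A search along these lines can be written with the Vast.ai CLI. This is a sketch: the field names follow the CLI's query syntax, but the download-speed floor (500 Mbps here) is an assumption you can tune:

```shell
# Find 2x H100 SXM offers with enough VRAM, disk, and CUDA version,
# direct port access, fast downloads, sorted by hourly price
vastai search offers \
  'gpu_name=H100_SXM num_gpus=2 gpu_ram>=80 cuda_vers>=12.4 disk_space>=200 direct_port_count>=1 inet_down>=500' \
  -o 'dph'
```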
Step 2: Create the Instance
Select an instance ID from the search results and deploy. The important flags:

- --image lmsysorg/sglang:v0.5.9: SGLang stable release with Nemotron 3 Super support
- --env '-p 5000:5000': Expose port 5000 for the API endpoint
- --disk 200: 200GB for the ~120GB model weights plus overhead
- --tp 2: Tensor parallelism across both H100 GPUs
- --kv-cache-dtype fp8_e4m3: FP8 KV cache for efficient memory usage
- --reasoning-parser nano_v3: Enables reasoning content parsing for thinking mode
- --trust-remote-code: Required for the custom Nemotron architecture
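Putting those flags together, the launch might look like the sketch below. The model ID is a placeholder; substitute the exact repository name from the HuggingFace model card, and replace INSTANCE_ID with an offer ID from your search:

```shell
vastai create instance INSTANCE_ID \
  --image lmsysorg/sglang:v0.5.9 \
  --env '-p 5000:5000' \
  --disk 200 \
  --onstart-cmd 'python3 -m sglang.launch_server \
    --model-path nvidia/NVIDIA-Nemotron-3-Super-FP8 \
    --host 0.0.0.0 --port 5000 \
    --tp 2 \
    --kv-cache-dtype fp8_e4m3 \
    --reasoning-parser nano_v3 \
    --trust-remote-code'
```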
Monitoring Deployment
Check Deployment Status
Watch the instance status until it reports running, then follow the logs until the model weights finish loading and the server starts accepting requests.
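One way to do this with the CLI (substitute the instance ID returned by the create step):

```shell
# Check the instance state
vastai show instance INSTANCE_ID

# Stream container logs to watch model loading progress
vastai logs INSTANCE_ID
```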
Get Your Endpoint
Once deployment completes, get your instance details:
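For example, via the CLI (INSTANCE_ID is your instance's ID):

```shell
vastai show instance INSTANCE_ID
```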
Check the ports field: it maps internal port 5000 to an external port. Your API endpoint will be:

http://<INSTANCE_IP>:<EXTERNAL_PORT>
Using the Nemotron 3 Super API
Quick Test with cURL
Verify the server is responding:
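A minimal smoke test against the OpenAI-compatible endpoint might look like this; the model name is a placeholder for whatever ID the server reports at /v1/models:

```shell
curl http://<INSTANCE_IP>:<EXTERNAL_PORT>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/NVIDIA-Nemotron-3-Super-FP8",
    "messages": [{"role": "user", "content": "What is 2 + 2?"}],
    "temperature": 1.0,
    "top_p": 0.95,
    "max_tokens": 128
  }'
```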
NVIDIA requires temperature=1.0 and top_p=0.95 for all inference with this model.

Python Integration
Using the OpenAI Python SDK:
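A minimal sketch, assuming the endpoint from the previous section. SGLang does not check the API key by default, so any placeholder string works, and the model ID below is an assumption; use the ID the server actually serves:

```python
from openai import OpenAI

# Point the SDK at your Vast.ai instance's OpenAI-compatible endpoint.
client = OpenAI(
    base_url="http://<INSTANCE_IP>:<EXTERNAL_PORT>/v1",
    api_key="EMPTY",  # SGLang ignores this by default
)

response = client.chat.completions.create(
    model="nvidia/NVIDIA-Nemotron-3-Super-FP8",  # placeholder: use the served model ID
    messages=[{"role": "user", "content": "Summarize tensor parallelism in two sentences."}],
    temperature=1.0,  # required by NVIDIA for this model
    top_p=0.95,
    max_tokens=512,
)
print(response.choices[0].message.content)
```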
Reasoning Modes
Nemotron 3 Super supports three reasoning modes, controlled via chat_template_kwargs. By default, reasoning is enabled.
Reasoning ON (Default)
The model shows its thinking in reasoning_content before giving the final answer in content:
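Since this is the default, no extra parameters are needed; the nano_v3 parser splits the output into the two fields. A sketch, assuming a client configured as in the Python Integration section (model ID is a placeholder):

```python
from openai import OpenAI

client = OpenAI(base_url="http://<INSTANCE_IP>:<EXTERNAL_PORT>/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="nvidia/NVIDIA-Nemotron-3-Super-FP8",  # placeholder
    messages=[{"role": "user", "content": "A train covers 120 km in 1.5 hours. What is its average speed?"}],
    temperature=1.0,
    top_p=0.95,
)
msg = response.choices[0].message
print("Thinking:", msg.reasoning_content)  # populated by the nano_v3 parser
print("Answer:", msg.content)
```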
Reasoning OFF
Disable reasoning for faster, direct responses:
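A hedged sketch: the payload shape follows SGLang's extra_body convention, but the kwarg name enable_thinking is an assumption; check the model card for the exact toggle. Note also the field swap described in the caveat below:

```python
from openai import OpenAI

client = OpenAI(base_url="http://<INSTANCE_IP>:<EXTERNAL_PORT>/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="nvidia/NVIDIA-Nemotron-3-Super-FP8",  # placeholder
    messages=[{"role": "user", "content": "Name three uses of FP8 quantization."}],
    temperature=1.0,
    top_p=0.95,
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},  # assumed kwarg name
)
# With reasoning off under the nano_v3 parser, the text lands in reasoning_content.
print(response.choices[0].message.reasoning_content)
```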
When reasoning is disabled via SGLang’s nano_v3 parser, the response text is returned in reasoning_content instead of content (which will be None). Make sure to read from the correct field based on the mode you’re using.

Low-Effort Reasoning
A middle ground: brief reasoning with fast responses:
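One plausible shape for this mode; both the kwarg name and value here are assumptions, so confirm the exact spelling against the model card before relying on it:

```python
from openai import OpenAI

client = OpenAI(base_url="http://<INSTANCE_IP>:<EXTERNAL_PORT>/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="nvidia/NVIDIA-Nemotron-3-Super-FP8",  # placeholder
    messages=[{"role": "user", "content": "Is 97 prime?"}],
    temperature=1.0,
    top_p=0.95,
    extra_body={"chat_template_kwargs": {"reasoning_effort": "low"}},  # assumed kwarg
)
msg = response.choices[0].message
print("Brief thinking:", msg.reasoning_content)
print("Answer:", msg.content)
```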
Reasoning with cURL
Pass chat_template_kwargs at the top level of the JSON body:
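For example, turning reasoning off over raw HTTP (the kwarg name is assumed, as above, and the model name is a placeholder):

```shell
curl http://<INSTANCE_IP>:<EXTERNAL_PORT>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/NVIDIA-Nemotron-3-Super-FP8",
    "messages": [{"role": "user", "content": "Give a one-line definition of MoE."}],
    "temperature": 1.0,
    "top_p": 0.95,
    "chat_template_kwargs": {"enable_thinking": false}
  }'
```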
Cleanup
When you’re done, destroy the instance to stop billing:
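Using the CLI (substitute your instance's ID):

```shell
# Permanently delete the instance and stop all charges
vastai destroy instance INSTANCE_ID
```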
Always destroy your instance when you’re finished to avoid unnecessary charges.
Additional Resources
- NVIDIA Nemotron 3 Super Blog Post — Architecture details and benchmarks
- HuggingFace Model Card (FP8) — Model card and usage instructions
- SGLang Documentation — SGLang configuration and usage
- Vast.ai CLI Guide — Learn more about the Vast.ai CLI
- GPU Instance Guide — Understanding Vast.ai instances