Running NVIDIA Nemotron 3 Super on Vast.ai

NVIDIA Nemotron 3 Super is a 120B parameter model that only activates 12B parameters per token, which means you get the quality of a much larger model at a fraction of the compute. It uses a novel hybrid architecture — Mamba-2 for fast sequence processing, Transformer attention where precision matters, and a Latent Mixture-of-Experts layer for efficient routing — and supports context windows up to 1M tokens. The model is particularly interesting because it ships with a built-in reasoning toggle. You can turn reasoning on for complex tasks like math and coding, switch to a low-effort mode for lighter thinking, or turn it off entirely for fast direct answers — all from the same deployment, controlled per request. It also supports Multi-Token Prediction for faster inference through speculative decoding, and performs well on agentic benchmarks involving tool use and multi-step task execution. This guide deploys the FP8 variant on Vast.ai using SGLang and queries it via the OpenAI-compatible API.

Prerequisites

Before getting started, you’ll need:
  • A Vast.ai account with credits
  • Vast.ai CLI installed (pip install vastai)
  • Your Vast.ai API key configured
  • Python 3.8+ (for the OpenAI SDK examples)
Get your API key from the Vast.ai account page and set it with vastai set api-key YOUR_API_KEY.

Understanding Nemotron 3 Super

Key capabilities:
  • Efficient MoE Architecture: 120B total parameters, only 12B active per token
  • Hybrid Layers: Mamba-2 (linear-time) + Transformer attention + Latent MoE
  • Reasoning Toggle: On, off, or low-effort modes via chat_template_kwargs
  • Long Context: Up to 1M tokens (256K default)
  • Commercial License: NVIDIA Nemotron Open Model License

Hardware Requirements

The FP8 variant requires:
  • GPUs: 2× H100 SXM (80GB each) with NVLink for tensor parallelism
  • Disk Space: 200GB minimum (model is ~120GB)
  • CUDA Version: 12.4 or higher
  • Docker Image: lmsysorg/sglang:v0.5.9
H100 SXM GPUs are required (not PCIe) because NVLink is needed for efficient tensor parallelism across 2 GPUs.
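As a rough back-of-envelope check (ignoring the CUDA context, activations, and KV cache growth, which eat into the remainder), the FP8 weights alone account for most of the VRAM:

```python
# Approximate VRAM budget for the FP8 variant on 2x H100 SXM.
params_b = 120             # total parameters, in billions
bytes_per_param = 1        # FP8 stores one byte per parameter
weights_gb = params_b * bytes_per_param   # ~120 GB of weights

total_vram_gb = 2 * 80     # 2x H100 SXM, 80 GB each
headroom_gb = total_vram_gb - weights_gb  # ~40 GB left for KV cache, activations, runtime

print(f"weights: ~{weights_gb} GB, headroom: ~{headroom_gb} GB")
```

This is why a single 80GB GPU cannot hold the model and tensor parallelism across two NVLink-connected GPUs is needed.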

Instance Configuration

Step 1: Search for Suitable Instances

Bash
vastai search offers \
  "gpu_name=H100_SXM num_gpus=2 gpu_ram>=80 cuda_vers>=12.4 \
   disk_space>=200 direct_port_count>1 inet_down>=500 rentable=true" \
  --order "dph_base" --limit 10
This searches for:
  • 2× H100 SXM GPUs with at least 80GB VRAM each
  • CUDA 12.4 or higher
  • At least 200GB disk space
  • Direct port access for the API endpoint
  • High download speed for faster model loading
  • Sorted by price (lowest first)

Step 2: Create the Instance

Select an instance ID from the search results and deploy:
Bash
vastai create instance <INSTANCE_ID> \
  --image lmsysorg/sglang:v0.5.9 \
  --env '-p 5000:5000' \
  --disk 200 \
  --onstart-cmd "python3 -m sglang.launch_server \
    --model-path nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8 \
    --served-model-name nvidia/nemotron-3-super \
    --host 0.0.0.0 \
    --port 5000 \
    --trust-remote-code \
    --tp 2 \
    --kv-cache-dtype fp8_e4m3 \
    --reasoning-parser nano_v3"
Key parameters explained:
  • --image lmsysorg/sglang:v0.5.9 — SGLang stable release with Nemotron 3 Super support
  • --env '-p 5000:5000' — Expose port 5000 for the API endpoint
  • --disk 200 — 200GB for the ~120GB model weights plus overhead
  • --tp 2 — Tensor parallelism across both H100 GPUs
  • --kv-cache-dtype fp8_e4m3 — FP8 KV cache for efficient memory usage
  • --reasoning-parser nano_v3 — Enables reasoning content parsing for thinking mode
  • --trust-remote-code — Required for the custom Nemotron architecture

Monitoring Deployment

Check Deployment Status

Bash
vastai logs <INSTANCE_ID>
Look for this message indicating the server is ready:
Text
The server is fired up and ready to roll!
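If you prefer to poll programmatically rather than tail logs, a minimal sketch is to hit the OpenAI-compatible /v1/models endpoint until it responds (the URL layout matches the endpoint you derive in the next step; adjust attempts and delay to taste):

```python
import time
import urllib.request
import urllib.error

def wait_until_ready(base_url, attempts=60, delay=10):
    """Poll the OpenAI-compatible /v1/models endpoint until the server responds."""
    url = base_url.rstrip("/") + "/v1/models"
    for _ in range(attempts):
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not up yet (or still loading weights); keep waiting
        time.sleep(delay)
    return False

# Example: wait_until_ready("http://<PUBLIC_IP>:<EXTERNAL_PORT>")
```

Model loading can take several minutes on first start, since the ~120GB of weights must be downloaded and sharded across both GPUs.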

Get Your Endpoint

Once deployment completes, get your instance details:
Bash
vastai show instance <INSTANCE_ID> --raw
Look for the ports field — it maps internal port 5000 to an external port. Your API endpoint will be:
Text
http://<PUBLIC_IP>:<EXTERNAL_PORT>/v1
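The port lookup can be scripted. The sketch below assumes the raw JSON carries a Docker-style "ports" mapping plus a "public_ipaddr" field; the exact key names may differ across CLI versions, so verify against your own --raw output:

```python
import json
import subprocess

def endpoint_from_info(info):
    """Map internal port 5000 to its external port and build the API base URL.

    Key names ("ports", "5000/tcp", "HostPort", "public_ipaddr") are
    assumptions about the CLI's raw JSON; adjust if your output differs.
    """
    external_port = info["ports"]["5000/tcp"][0]["HostPort"]
    return f"http://{info['public_ipaddr']}:{external_port}/v1"

def get_endpoint(instance_id):
    """Shell out to the vastai CLI and parse the instance details."""
    raw = subprocess.check_output(
        ["vastai", "show", "instance", str(instance_id), "--raw"])
    return endpoint_from_info(json.loads(raw))
```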

Using the Nemotron 3 Super API

Quick Test with cURL

Verify the server is responding:
Bash
curl -X POST http://<IP>:<PORT>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/nemotron-3-super",
    "messages": [{"role": "user", "content": "What is 25 * 37?"}],
    "max_tokens": 500,
    "temperature": 1.0,
    "top_p": 0.95
  }'
NVIDIA recommends temperature=1.0 and top_p=0.95 for inference with this model; keep these settings unless you have a specific reason to deviate.

Python Integration

Using the OpenAI Python SDK:
Python
from openai import OpenAI

client = OpenAI(
    base_url="http://<IP>:<PORT>/v1",
    api_key="EMPTY"  # SGLang doesn't require an API key
)

response = client.chat.completions.create(
    model="nvidia/nemotron-3-super",
    messages=[{"role": "user", "content": "Explain quantum entanglement in 2 sentences."}],
    max_tokens=300,
    temperature=1.0,
    top_p=0.95
)

print(response.choices[0].message.content)

Reasoning Modes

Nemotron 3 Super supports three reasoning modes, controlled via chat_template_kwargs. By default, reasoning is enabled.
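The three modes detailed below differ only in the kwargs they pass. A small helper (hypothetical, not part of any SDK) keeps the payloads in one place:

```python
# Hypothetical helper mapping mode names to the chat_template_kwargs
# combinations used by Nemotron 3 Super's chat template.
REASONING_MODES = {
    "on": {"enable_thinking": True},
    "off": {"enable_thinking": False},
    "low": {"enable_thinking": True, "low_effort": True},
}

def reasoning_kwargs(mode):
    """Build the extra_body payload for a given reasoning mode."""
    return {"chat_template_kwargs": REASONING_MODES[mode]}

# Usage: client.chat.completions.create(..., extra_body=reasoning_kwargs("low"))
```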

Reasoning ON (Default)

The model shows its thinking in reasoning_content before giving the final answer in content:
Python
response = client.chat.completions.create(
    model="nvidia/nemotron-3-super",
    messages=[{"role": "user", "content": "Explain quantum entanglement in 2 sentences."}],
    max_tokens=300,
    temperature=1.0,
    top_p=0.95,
    extra_body={"chat_template_kwargs": {"enable_thinking": True}}
)

msg = response.choices[0].message
print("Thinking:", msg.reasoning_content)
print("Answer:", msg.content)

Reasoning OFF

Disable reasoning for faster, direct responses:
Python
response = client.chat.completions.create(
    model="nvidia/nemotron-3-super",
    messages=[{"role": "user", "content": "Explain quantum entanglement in 2 sentences."}],
    max_tokens=300,
    temperature=1.0,
    top_p=0.95,
    extra_body={"chat_template_kwargs": {"enable_thinking": False}}
)

msg = response.choices[0].message
# With reasoning OFF, the answer is in reasoning_content
print("Answer:", msg.reasoning_content)
When reasoning is disabled via SGLang’s nano_v3 parser, the response text is returned in reasoning_content instead of content (which will be None). Make sure to read from the correct field based on the mode you’re using.

Low-Effort Reasoning

A middle ground — brief reasoning with fast responses:
Python
response = client.chat.completions.create(
    model="nvidia/nemotron-3-super",
    messages=[{"role": "user", "content": "What is 25 * 37?"}],
    max_tokens=300,
    temperature=1.0,
    top_p=0.95,
    extra_body={"chat_template_kwargs": {"enable_thinking": True, "low_effort": True}}
)

msg = response.choices[0].message
print("Thinking:", msg.reasoning_content)  # Brief reasoning
print("Answer:", msg.content)

Reasoning with cURL

Pass chat_template_kwargs at the top level of the JSON body:
Bash
curl -X POST http://<IP>:<PORT>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/nemotron-3-super",
    "messages": [{"role": "user", "content": "What is 25 * 37?"}],
    "max_tokens": 500,
    "temperature": 1.0,
    "top_p": 0.95,
    "chat_template_kwargs": {"enable_thinking": false}
  }'

Cleanup

When you’re done, destroy the instance to stop billing:
Bash
vastai destroy instance <INSTANCE_ID>
Always destroy your instance when you’re finished to avoid unnecessary charges.

Conclusion

Nemotron 3 Super delivers frontier-class reasoning performance by activating only 12B of its 120B parameters per token. With SGLang and Vast.ai, you can deploy the model on 2× H100 GPUs and start querying it via the OpenAI-compatible API. The reasoning toggle is particularly useful: enable it for complex tasks like math, coding, and analysis, or disable it for fast direct answers in production pipelines.