Brightnode offers custom Instant Cluster pricing plans for large-scale and enterprise workloads. If you’re interested in learning more, contact our sales team.
Brightnode Instant Clusters provide fully managed compute clusters with high-performance networking for distributed workloads like multi-bnode training and large-scale AI inference.

Key features

  • High-speed networking from 1600 to 3200 Gbps within a single data center.
  • On-demand clusters are available with 2 to 8 bnodes (16 to 64 GPUs).
  • Contact our sales team for larger clusters (up to 512 GPUs).
  • Supports H200, B200, H100, and A100 GPUs.
  • Automatic cluster configuration with static IP and environment variables.
  • Multiple deployment options for different frameworks and use cases.

Networking performance

Instant Clusters feature high-speed local networking for efficient data movement between bnodes:
  • Most clusters include 3200 Gbps networking.
  • A100 clusters offer up to 1600 Gbps networking.
This fast networking enables efficient scaling of distributed training and inference workloads. Brightnode ensures bnodes selected for clusters are within the same data center for optimal performance.

Zero configuration

Brightnode automates cluster setup so you can focus on your workloads:
  • Clusters are pre-configured with static IP address management.
  • All necessary environment variables for distributed training are pre-configured.
  • Supports popular frameworks like PyTorch, TensorFlow, and Slurm.

Get started

Choose the tutorial that matches your preferred framework and use case:
  • Deploy a Slurm cluster: Set up a managed Slurm cluster for high-performance computing workloads. Slurm provides job scheduling, resource allocation, and queue management for research environments and batch processing workflows.
  • Deploy a PyTorch distributed training cluster: Set up multi-bnode PyTorch training for deep learning models. This tutorial covers distributed data parallel training, gradient synchronization, and performance optimization techniques.
  • Deploy an Axolotl fine-tuning cluster: Use Axolotl’s framework to fine-tune large language models across multiple GPUs. This approach simplifies customizing pre-trained models like Llama or Mistral with built-in training optimizations.
  • Deploy an unmanaged Slurm cluster: For advanced users who need full control over Slurm configuration. This option provides a basic Slurm installation that you can customize for specialized workloads.
You can also follow this video tutorial to learn how to deploy Kimi K2 using Instant Clusters.
All accounts have a default spending limit. To deploy a larger cluster, submit a support ticket at help@brightnode.cloud.

Network interfaces

High-bandwidth interfaces (ens1, ens2, and so on) handle communication between bnodes, while the management interface (eth0) handles external traffic. By default, the NCCL environment variable NCCL_SOCKET_IFNAME includes all available interfaces. PRIMARY_ADDR corresponds to ens1, which is used to launch and bootstrap distributed processes. Instant Clusters support up to 8 interfaces per bnode. Each interface (ens1 through ens8) provides a private network connection for inter-bnode communication and is made available to distributed backends such as NCCL or Gloo.
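To check which interfaces are present on a bnode, you can list them with standard iproute2 tooling. This is a minimal sketch; the interface names are the defaults described above and may differ on your cluster.
# List the management and high-bandwidth interfaces with their addresses.
ip -brief addr show | grep -E '^(eth0|ens[1-8])'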

Environment variables

The following environment variables are set on all bnodes in an Instant Cluster:
  • PRIMARY_ADDR / MASTER_ADDR: The address of the primary bnode.
  • PRIMARY_PORT / MASTER_PORT: The port of the primary bnode. All ports are available.
  • NODE_ADDR: The static IP of this bnode within the cluster network.
  • NODE_RANK: The cluster rank (i.e. global rank) assigned to this bnode. NODE_RANK is 0 for the primary bnode.
  • NUM_NODES: The number of bnodes in the cluster.
  • NUM_TRAINERS: The number of GPUs per bnode.
  • HOST_NODE_ADDR: A convenience variable, defined as PRIMARY_ADDR:PRIMARY_PORT.
  • WORLD_SIZE: The total number of GPUs in the cluster (NUM_NODES * NUM_TRAINERS).
Each bnode receives a static IP address (NODE_ADDR) on the overlay network. When a cluster is deployed, the system designates one bnode as the primary bnode by setting the PRIMARY_ADDR and PRIMARY_PORT environment variables. This simplifies working with multiprocessing libraries that require a primary bnode. The following variables are equivalent:
  • MASTER_ADDR and PRIMARY_ADDR
  • MASTER_PORT and PRIMARY_PORT
MASTER_* variables are available to provide compatibility with tools that expect these legacy names.
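To confirm that these variables are set on a bnode, you can print them from a shell. This is a minimal sketch; the values shown will depend on your cluster's size and configuration.
# Print the Instant Cluster environment variables on this bnode.
env | grep -E '^(PRIMARY_|MASTER_|NODE_|NUM_|HOST_NODE_ADDR|WORLD_SIZE)' | sort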

NCCL configuration for multi-bnode training

For distributed training frameworks like PyTorch, you must explicitly configure NCCL to use the internal network interface to ensure proper inter-bnode communication:
export NCCL_SOCKET_IFNAME=ens1
Without this configuration, bnodes may attempt to communicate using external IP addresses in the 172.xxx range, which are reserved for internet connectivity only. This will result in connection timeouts and failed distributed training jobs in your cluster.
When troubleshooting multi-bnode communication issues, also consider adding debug information:
export NCCL_DEBUG=INFO
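As a sketch of how these settings fit together, the following launches a PyTorch distributed job with torchrun on each bnode, using the cluster environment variables described above. The training script name (train.py) is a placeholder for your own entry point, and the example assumes PyTorch is installed on the bnodes.
export NCCL_SOCKET_IFNAME=ens1
export NCCL_DEBUG=INFO  # optional: verbose NCCL logging while troubleshooting
# Run the same command on every bnode; the environment variables
# identify the primary bnode and this bnode's rank in the cluster.
torchrun \
  --nnodes=$NUM_NODES \
  --nproc_per_node=$NUM_TRAINERS \
  --node_rank=$NODE_RANK \
  --master_addr=$PRIMARY_ADDR \
  --master_port=$PRIMARY_PORT \
  train.py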

When to use Instant Clusters

Instant Clusters offer distributed computing power beyond the capabilities of single-machine setups. Consider using Instant Clusters for:
  • Multi-GPU language model training: Accelerate training of models like Llama or GPT across multiple GPUs.
  • Large-scale computer vision projects: Process massive imagery datasets for autonomous vehicles or medical analysis.
  • Scientific simulations: Run climate, molecular dynamics, or physics simulations that require massively parallel processing.
  • Real-time AI inference: Deploy production AI models that demand multiple GPUs for fast output.
  • Batch processing pipelines: Create systems for large-scale data processing, including video rendering and genomics.