Together AI Upgrades GPU Clusters With Autoscaling and Self-Healing Features

2026/03/11 01:34
Lawrence Jengar Mar 10, 2026 17:34

Together AI adds enterprise-grade autoscaling, RBAC, observability dashboards, and self-healing node repair to GPU Clusters as company pursues $1B funding round.

Together AI has rolled out a significant infrastructure upgrade to its GPU Clusters platform, adding autoscaling, role-based access control, full-stack observability, and self-healing node repair. The enhancements arrive as the AI cloud company pursues $1 billion in fresh funding, according to reports from earlier this month.

The timing isn't coincidental. Enterprise customers running distributed training workloads across hundreds of GPUs need more than raw compute—they need infrastructure that doesn't require babysitting.

Autoscaling Targets GPU Waste

The new autoscaling feature, powered by the Kubernetes Cluster Autoscaler, monitors for GPU-constrained workloads and automatically provisions or decommissions nodes based on real-time demand. For teams running variable inference workloads or bursty training jobs, this means no more paying for idle hardware during quiet periods.

Static GPU provisioning has been a persistent pain point. Organizations either overprovision (expensive) or underprovision (performance bottlenecks during demand spikes). Together's approach lets clusters expand during peak load and contract when demand subsides.
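
As a rough illustration of the expand-and-contract behavior, the sketch below sizes a node pool from GPU demand. It is a simplified model, not the Kubernetes Cluster Autoscaler's actual algorithm; the function names, the 8-GPUs-per-node default, and the node limits are assumptions chosen for the example.

```python
import math


def desired_node_count(pending_gpus, busy_gpus, gpus_per_node=8,
                       min_nodes=1, max_nodes=16):
    """Return the node count needed to host all requested GPUs.

    All defaults are illustrative assumptions, not Together AI settings.
    """
    if gpus_per_node <= 0:
        raise ValueError("gpus_per_node must be positive")
    total_gpus = pending_gpus + busy_gpus
    needed = math.ceil(total_gpus / gpus_per_node)
    # Clamp to the configured pool limits.
    return max(min_nodes, min(max_nodes, needed))


def scaling_action(current_nodes, target_nodes):
    """Describe the change required to reach the target node count."""
    if target_nodes > current_nodes:
        return f"scale up by {target_nodes - current_nodes}"
    if target_nodes < current_nodes:
        return f"scale down by {current_nodes - target_nodes}"
    return "no change"


if __name__ == "__main__":
    target = desired_node_count(pending_gpus=20, busy_gpus=16)
    print(target, scaling_action(current_nodes=4, target_nodes=target))
```

With 16 GPUs in use and 20 more requested, the model asks for five 8-GPU nodes; once demand drops to zero it falls back to the configured minimum rather than to an empty pool.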

Self-Healing Addresses Hardware Reality

GPU hardware fails. In large fleets, it's not a question of if but when. For distributed training, a single unstable node can invalidate hours of compute time.

Together's solution: self-serve health checks that users can trigger before launching major training jobs. Tests range from basic DCGM diagnostics to multi-node NCCL and InfiniBand bandwidth tests. When a node does fail, a three-click self-repair process automatically cordons, drains, and recreates the node—bringing clusters back to healthy status within minutes rather than hours.

Acceptance tests now run automatically during provisioning. Clusters won't be marked ready until they pass.
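
The cordon, drain, and recreate sequence can be sketched as an ordered plan. This is a hypothetical outline, not Together AI's implementation: the first two steps are shown as the standard `kubectl cordon` and `kubectl drain` commands an operator would run by hand, while the recreate step is left as a placeholder because the article does not describe how the platform reprovisions a node.

```python
def repair_plan(node_name):
    """Return the ordered steps for replacing an unhealthy node."""
    return [
        # Stop new pods from being scheduled onto the node.
        ("cordon", ["kubectl", "cordon", node_name]),
        # Evict running pods so workloads can move elsewhere.
        ("drain", ["kubectl", "drain", node_name,
                   "--ignore-daemonsets", "--delete-emptydir-data"]),
        # Provider-specific reprovisioning; no command is assumed here.
        ("recreate", None),
    ]


def run_plan(plan, execute):
    """Run steps in order, stopping at the first one that fails.

    `execute` is a caller-supplied function (hypothetical) that takes
    (step, command) and returns True on success.
    """
    completed = []
    for step, command in plan:
        if not execute(step, command):
            break
        completed.append(step)
    return completed


if __name__ == "__main__":
    plan = repair_plan("gpu-node-7")  # node name is made up
    # Dry run: print each step instead of executing anything.
    run_plan(plan, lambda step, command: print(step, command) is None)
```

Stopping at the first failed step matters here: recreating a node that was never drained would kill any pods still running on it.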

Enterprise Access Controls

The RBAC implementation introduces "Projects" as isolation boundaries for teams. Two default roles split responsibilities cleanly: Admins get full control plane access for cluster creation and deletion, while Members can access GPU worker nodes and run workloads without touching infrastructure provisioning.

This matters for organizations where platform engineers need to lock down infrastructure while giving ML researchers freedom to experiment.
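
A minimal sketch of how the two roles might map to permissions inside a project is shown below. The action names, user names, and the idea that Admins can also run workloads are assumptions made for illustration; the article only states that Admins handle cluster creation and deletion and that Members work on GPU worker nodes.

```python
# Hypothetical action names; the article does not list real ones.
ROLE_PERMISSIONS = {
    "admin": {"create_cluster", "delete_cluster",
              "access_worker_nodes", "run_workload"},
    "member": {"access_worker_nodes", "run_workload"},
}

# (user, project) -> role. A user has no rights in a project
# they do not belong to, which is what makes projects a boundary.
MEMBERSHIPS = {
    ("alice", "platform"): "admin",
    ("bob", "platform"): "member",
    ("bob", "research"): "admin",
}


def is_allowed(user, project, action):
    """Check whether a user may perform an action within a project."""
    role = MEMBERSHIPS.get((user, project))
    if role is None:
        return False
    return action in ROLE_PERMISSIONS.get(role, set())


if __name__ == "__main__":
    print(is_allowed("bob", "platform", "run_workload"))    # member can run jobs
    print(is_allowed("bob", "platform", "delete_cluster"))  # but not delete clusters
    print(is_allowed("alice", "research", "run_workload"))  # not in that project
```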

Observability Gets Native

Every GPU Cluster project now includes a dedicated Grafana instance with pre-built dashboards. Telemetry covers GPU utilization via DCGM metrics, InfiniBand and NIC-level networking data, storage I/O performance, and Kubernetes orchestration health. The feature is currently in private preview.
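
To show the kind of signal the GPU utilization dashboards are built on, the sketch below parses utilization readings and flags idle GPUs. It assumes the telemetry is available in the Prometheus text format emitted by NVIDIA's dcgm-exporter, using its `DCGM_FI_DEV_GPU_UTIL` metric; the article does not say how Together exposes the data, and the sample values, host names, and 10% idle threshold are invented.

```python
import re

# Made-up sample in Prometheus text exposition format.
SAMPLE = '''\
# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",Hostname="node-a"} 92
DCGM_FI_DEV_GPU_UTIL{gpu="1",Hostname="node-a"} 88
DCGM_FI_DEV_GPU_UTIL{gpu="0",Hostname="node-b"} 3
'''

LINE = re.compile(
    r'^DCGM_FI_DEV_GPU_UTIL\{(?P<labels>[^}]*)\}\s+(?P<value>[0-9.]+)$'
)
LABEL = re.compile(r'(\w+)="([^"]*)"')


def gpu_utilization(text):
    """Map (hostname, gpu index) to utilization percent."""
    readings = {}
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if match is None:
            continue  # skips comments and unrelated metrics
        labels = dict(LABEL.findall(match.group("labels")))
        key = (labels.get("Hostname"), labels.get("gpu"))
        readings[key] = float(match.group("value"))
    return readings


def idle_gpus(readings, threshold=10.0):
    """Return the GPUs whose utilization is below the threshold."""
    return sorted(key for key, value in readings.items() if value < threshold)


if __name__ == "__main__":
    print(idle_gpus(gpu_utilization(SAMPLE)))
```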

Market Context

Together AI has been building momentum in the GPU-as-a-service space. The company launched self-service GPU infrastructure in September 2025 and introduced Instant GPU Clusters at NVIDIA GTC 2025 in March of that year. The platform supports NVIDIA Hopper (H100) and Blackwell (B200) GPUs, with Instant Clusters scaling up to 64 GPUs and Dedicated Clusters reaching 1,000 GPUs.

With a reported $7.5 billion valuation and a potential billion-dollar funding round in progress, Together is positioning itself as a serious alternative to hyperscaler GPU offerings, targeting teams that want bare-metal performance without the operational overhead of managing their own hardware.

The new features are available immediately to existing Together GPU Clusters customers, with the exception of the observability dashboards, which are in private preview.
