Decision 1: ROS2 vs. Custom Middleware

This is the question every robotics startup debates, and the one most get wrong by over-engineering. ROS2 gets you a running robot faster, gives you access to a large ecosystem of drivers and tools, and makes it easier to hire -- most robotics engineers know ROS2. The pub/sub architecture handles the complexity of multi-node systems, and packages like MoveIt2, Nav2, and ros2_control solve problems that would take months to implement from scratch.

Custom middleware delivers two genuine advantages: latency below 5ms (ROS2's DDS overhead makes this nearly impossible) and freedom from licensing concerns if your commercial deployment model involves redistributing a modified middleware layer. There's also a real argument that custom middleware is simpler to debug when you fully own the stack.

The practical rule: use ROS2 unless you are at Series B or later with a shipping product that has demonstrated real latency constraints. Premature optimization of middleware is one of the most expensive mistakes in robotics startups -- the opportunity cost of 6 months rebuilding DDS is enormous in the early stages. If latency becomes a real constraint later, you can always replace the transport layer while keeping ROS2 interfaces.

ROS2 DDS Vendor Selection

If you choose ROS2, the DDS vendor matters more than most teams realize. The default CycloneDDS is a solid choice for most applications. FastDDS (the previous default) has better discovery in large multi-machine setups but higher memory overhead. For real-time control loops, CycloneDDS with iceoryx shared-memory transport (zero-copy) eliminates the serialization overhead that makes standard DDS too slow for inner-loop control.

A common pattern at SVRC-supported startups: run CycloneDDS for all non-real-time nodes (perception, planning, logging), and use a lightweight custom transport (raw UDP or shared memory) for the 500Hz+ inner control loop between the policy inference node and the motor controller. This hybrid approach gets you 95% of the ROS2 ecosystem with the latency characteristics of custom middleware where it matters.

# Example: ROS2 + zero-copy shared memory via iceoryx
# In your ROS2 workspace, select the CycloneDDS middleware:
export RMW_IMPLEMENTATION=rmw_cyclonedds_cpp

# cyclonedds.xml -- enable the iceoryx shared-memory transport:
# <CycloneDDS>
#   <Domain>
#     <SharedMemory>
#       <Enable>true</Enable>
#       <LogLevel>warning</LogLevel>
#     </SharedMemory>
#   </Domain>
# </CycloneDDS>

# Point CycloneDDS at the config file:
export CYCLONEDDS_URI=file://$(pwd)/cyclonedds.xml
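The custom side of the hybrid pattern can stay outside DDS entirely. Below is a minimal sketch of a raw-UDP action channel between a policy node and a motor controller; the 7-joint layout, port number, and sequence-number-plus-float32 wire format are illustrative assumptions, not any real controller's protocol:

```python
import socket
import struct

# Hypothetical wire format: uint32 sequence number + 7 float32 joint targets.
ACTION_FMT = "<I7f"
ACTION_SIZE = struct.calcsize(ACTION_FMT)  # 4 + 7*4 = 32 bytes

def pack_action(seq: int, joint_targets: list) -> bytes:
    """Serialize one control packet for the inner loop."""
    return struct.pack(ACTION_FMT, seq, *joint_targets)

def unpack_action(payload: bytes) -> tuple:
    """Deserialize a control packet into (sequence, joint targets)."""
    seq, *targets = struct.unpack(ACTION_FMT, payload)
    return seq, list(targets)

def make_sender(host: str = "127.0.0.1", port: int = 9000) -> socket.socket:
    # UDP: no connection setup, no retransmission. At 500 Hz a stale
    # action is worthless, so dropping a packet beats blocking on it.
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.connect((host, port))
    return sock
```

The receiving side keeps the last sequence number it saw and discards anything older -- the usual trade for high-rate control, where a late action is worse than no action.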

Decision 2: Simulation Platform

The simulation choice shapes your training loop for years. Four platforms dominate in 2026, each for different reasons:

| Simulator | Parallel Envs | Contact Physics | License | Best For | GPU Required |
|---|---|---|---|---|---|
| Isaac Lab | 4,096+ on A100 | Adequate | MIT (BSD-3 for Sim) | GPU-parallel RL, locomotion | Yes (RTX 3090+) |
| MuJoCo | ~1,000 (MJX on GPU) | Excellent | Apache 2.0 | Contact-rich manipulation, dexterous hands | No (CPU OK; GPU for MJX) |
| Genesis | 10,000+ on A100 | Good (MPM for soft body) | Apache 2.0 | Soft body, deformables, fluids | Yes |
| Gazebo (Ignition) | 1-10 (CPU only) | Adequate | Apache 2.0 | ROS2 integration testing, sensor sim | No |

NVIDIA Isaac Lab: Best choice if GPU-accelerated RL training is your primary use case. 4,096+ parallel environments on a single A100, tight Isaac Sim integration, MIT license, and a growing model zoo including Unitree G1 and Franka. Weakest on contact accuracy -- fine for locomotion and pick-and-place, but it struggles with precision assembly. The PhysX backend is fast but uses penalty-based contact that can allow penetration at large timesteps.

MuJoCo: The best contact physics of any general-purpose simulator. Constraint-based dynamics with stable contact handling at high compliance. Choose it for dexterous-hand research, contact-rich manipulation, or any task where contact accuracy matters more than parallelization. Free under Apache 2.0 since the DeepMind acquisition. MuJoCo XLA (MJX) brings GPU acceleration via JAX, achieving ~1,000 parallel environments -- not as many as Isaac Lab, but sufficient for most manipulation RL.

Genesis: The newest entrant, gaining rapid adoption for tasks involving deformable objects, fluids, and soft bodies. Genesis uses Material Point Method (MPM) for soft-body physics, which is superior to both MuJoCo and Isaac Lab for simulating cloth, food, and human tissue. If your startup works in food handling, textile manipulation, or surgical robotics, Genesis is worth serious evaluation. Throughput on rigid-body tasks rivals Isaac Lab.

Gazebo (Ignition): Choose only if deep ROS2 integration is a hard requirement -- Gazebo is the official ROS2 simulator and has the tightest toolchain integration. Physics is weaker than the GPU-accelerated alternatives. Acceptable for navigation, sensor simulation, and integration testing. Not viable for RL training due to CPU-only execution.

Policy Architecture: ACT vs Diffusion Policy vs OpenVLA

Your policy architecture choice is tightly coupled with your simulation and data strategy. Here is the current landscape:

| Architecture | Paradigm | Data Needed | Inference Speed | Best For |
|---|---|---|---|---|
| ACT | IL (CVAE + Transformer) | 50-200 demos/task | ~5ms (fast) | Precise bimanual manipulation |
| Diffusion Policy | IL (DDPM denoising) | 100-500 demos/task | ~100-200ms (slow) | Multi-modal tasks, diverse strategies |
| OpenVLA / RT-2 | VLA (vision-language-action) | Fine-tune: 20-100 demos + pre-trained backbone | ~200-500ms (very slow) | Language-conditioned, multi-task |
| PPO/SAC | RL (reward-driven) | 0 demos (sim only) | ~1-5ms (very fast) | Locomotion, tasks with clear rewards |

Practical guidance: Start with ACT if you are collecting teleoperation data and need fast iteration. ACT trains in hours on a single GPU and the inference is fast enough for real-time control. Switch to Diffusion Policy if you find that your task has multiple valid strategies (e.g., approach from left or right) -- ACT's unimodal CVAE struggles with multi-modal demonstrations, while Diffusion Policy handles them naturally. Use OpenVLA/RT-2 only if you need language conditioning (executing natural language instructions) or are fine-tuning from a pre-trained foundation model. The inference speed of VLAs (200-500ms) limits them to tasks where reaction time is not critical.
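Part of why ACT stays fast at inference is chunking: the policy predicts a whole chunk of future actions at once, and overlapping chunks are blended at execution time. A simplified temporal-ensembling sketch in plain Python -- the weighting constant and scalar actions are illustrative, not ACT's exact formulation:

```python
import math

def temporal_ensemble(chunks, t, m=0.1):
    """Blend every chunk prediction that covers timestep t.

    chunks: list of (start_time, actions), oldest first, where actions[j]
    is the predicted action for timestep start_time + j. Actions are
    scalars here for clarity; real policies emit joint vectors.
    """
    covering = [(start, actions) for start, actions in chunks
                if 0 <= t - start < len(actions)]
    weighted, total = 0.0, 0.0
    for rank, (start, actions) in enumerate(covering):
        # Exponential weighting over prediction age; rank 0 is the
        # oldest chunk covering t and receives the largest weight.
        w = math.exp(-m * rank)
        weighted += w * actions[t - start]
        total += w
    return weighted / total
```

The ensemble smooths the discontinuities that would otherwise appear each time the policy commits to a fresh chunk, at the cost of a small buffer of past predictions.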

Decision 3: Cloud vs. Edge Inference

Inference latency determines where your model runs. The threshold is approximately 150ms round-trip: below it, cloud inference from a co-located data center is viable for most applications; above it, the delay introduces perceptible lag in teleoperation and destabilizes closed-loop manipulation controllers.
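The 150ms figure is a budget, not a single measurement, so it pays to decompose it. A back-of-envelope breakdown -- every component value below is an illustrative assumption, not a benchmark:

```python
def round_trip_ms(uplink, inference, downlink,
                  serialization=2.0, actuation=5.0):
    """Sum one observation -> action cycle for cloud inference (ms)."""
    return uplink + inference + downlink + serialization + actuation

# Illustrative scenarios: co-located data center over fiber vs. LTE.
fiber = round_trip_ms(uplink=10, inference=60, downlink=10)  # 87 ms: under budget
lte = round_trip_ms(uplink=60, inference=60, downlink=60)    # 187 ms: over budget
```

The point of writing it out is that network transit usually dominates only on poor links; on a good link, inference time itself is the largest line item, which is why model size drives the cloud/edge decision.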

Edge Compute: Current Hardware Landscape

| Device | TOPS | Price | ACT Inference | Diffusion Policy | OpenVLA 7B |
|---|---|---|---|---|---|
| Jetson Orin Nano (8GB) | 40 | $249 | ~15ms | ~300ms | Not feasible |
| Jetson AGX Orin (64GB) | 275 | $1,999 | ~5ms | ~120ms | ~800ms (quantized) |
| Jetson Thor (expected 2025-26) | ~800 | ~$2,500 est. | ~2ms | ~40ms | ~200ms (feasible) |

Cloud inference makes sense when: robot connectivity is reliable (lab or factory floor with fiber), latency > 150ms is acceptable for the application, and model size exceeds what edge hardware can serve. A hybrid approach works well -- run fast low-latency reactive controllers on-device, offload slow planning and perception to cloud.

Cloud Training: Cost Benchmarks

Training costs vary dramatically by policy architecture and dataset size. Here are real numbers from SVRC training runs as of Q1 2026:

  • ACT (single task, 200 demos): ~8 hours on 1x A100 (40GB). Cost: ~$16 on Lambda Cloud ($2/hr). ~$24 on AWS p4d instances.
  • Diffusion Policy (single task, 500 demos): ~24 hours on 1x A100. Cost: ~$48 on Lambda. Image observations with 3 cameras roughly triple training time vs. low-dim observations.
  • OpenVLA fine-tune (7B, 100 demos): ~12 hours on 4x A100 (LoRA). Cost: ~$96 on Lambda. Full fine-tune requires 8x A100 and ~$200+.
  • PPO locomotion (Isaac Lab, 4096 envs): ~4 hours on 1x RTX 4090. Cost: ~$4 on vast.ai.
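The bullet-point costs above are just hours × GPU count × hourly rate, and it is worth re-running them with your own provider's rates. A trivial helper (the rates echoed here are the quoted ones, which change often):

```python
def training_cost(hours: float, num_gpus: int, rate_per_gpu_hr: float) -> float:
    """Estimated cloud cost in dollars for one training run."""
    return hours * num_gpus * rate_per_gpu_hr

# Sanity checks against the figures above (Lambda A100 at ~$2/GPU-hr):
act = training_cost(8, 1, 2.0)        # ACT single task: ~$16
diffusion = training_cost(24, 1, 2.0) # Diffusion Policy: ~$48
openvla = training_cost(12, 4, 2.0)   # OpenVLA LoRA fine-tune: ~$96
```

Note this omits storage, data transfer, and failed runs -- in practice, budget a multiple of the single-run number for hyperparameter sweeps.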

Decision 4: Data Platform Strategy

The build-vs-buy decision on data infrastructure is simpler than it appears. Build your own data platform only if you have a dedicated ML infrastructure engineer on staff whose primary job is data tooling -- not a researcher who also maintains tooling on the side. Otherwise, the maintenance burden compounds into a significant ongoing tax on your most expensive people.

What a Data Platform Must Do

The core capabilities you need: episode storage with versioning, metadata indexing for fast retrieval, visualization for QA, dataset splitting and export to training formats (HDF5/Zarr/RLDS), and access control for multi-operator environments. Building this from scratch takes 3-6 months and requires continuous maintenance.
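Of those capabilities, metadata indexing is the one teams most often underestimate. A minimal sketch of an episode index using SQLite from the Python standard library -- the schema, field names, and storage URIs are all illustrative, not any particular platform's layout:

```python
import sqlite3

def make_index(path=":memory:"):
    """Create (or open) a tiny episode-metadata index."""
    db = sqlite3.connect(path)
    db.execute("""
        CREATE TABLE IF NOT EXISTS episodes (
            episode_id  TEXT PRIMARY KEY,
            task        TEXT NOT NULL,
            operator    TEXT,
            n_steps     INTEGER,
            success     INTEGER,   -- 1 = success, 0 = failure (for QA splits)
            storage_uri TEXT       -- where the HDF5/Zarr blob actually lives
        )""")
    return db

def successful_episodes(db, task):
    """Fast retrieval for dataset export: all successful demos of one task."""
    rows = db.execute(
        "SELECT episode_id, storage_uri FROM episodes "
        "WHERE task = ? AND success = 1", (task,))
    return rows.fetchall()
```

Even this toy version shows the split that matters: metadata lives in a queryable index, while the bulky episode arrays live in object storage and are only fetched at export time.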

Dataset Management Tools Comparison

| Tool | Type | Robot-Specific | Collection Tools | Training Integration |
|---|---|---|---|---|
| SVRC Platform | Managed service | Yes (teleop, annotation, QA) | Full (hardware + software) | HDF5/RLDS/LeRobot export |
| HuggingFace LeRobot | Open-source lib | Yes (robot datasets) | Recording scripts | ACT, DP, TDMPC built-in |
| Weights & Biases | Managed service | No (general ML) | Experiment tracking only | Any framework |
| DVC + MinIO | Open-source stack | No (general ML) | Version control only | Any framework |

The principle that should guide every stack decision: buy infrastructure, build differentiation. Your competitive advantage is your robot hardware, your task expertise, and your policy architecture -- not your episode storage system. Spend engineering time accordingly.

The SVRC platform provides the full data infrastructure stack -- collection, storage, annotation, training pipeline -- as a managed service with API access.

Recommended Stack by Stage

| Stage | Middleware | Sim | Policy | Inference | Data Platform |
|---|---|---|---|---|---|
| Pre-seed / Seed | ROS2 Humble | MuJoCo or Isaac Lab | ACT (fast iteration) | Edge (Orin Nano) | SVRC or LeRobot |
| Series A | ROS2 + custom control | Isaac Lab + MuJoCo | Diffusion Policy or VLA fine-tune | Hybrid edge+cloud | SVRC or build if infra hire |
| Series B+ | Custom if proven need | Custom + one of above | Custom architecture | Custom serving (TRT) | Build if 2+ infra eng |

Related Reading