Decision 1: ROS2 vs. Custom Middleware

This is the question every robotics startup debates, and the one most get wrong by over-engineering. ROS2 gets you a running robot faster, gives you access to a large ecosystem of drivers and tools, and makes it easier to hire -- most robotics engineers know ROS2. The pub/sub architecture handles the complexity of multi-node systems, and packages like MoveIt2, Nav2, and ros2_control solve problems that would take months to implement from scratch.

Custom middleware delivers two genuine advantages: latency below 5ms (ROS2's DDS overhead makes this nearly impossible) and freedom from licensing concerns if your commercial deployment model involves redistributing a modified middleware layer. There's also a real argument that custom middleware is simpler to debug when you fully own the stack.

The practical rule: use ROS2 unless you are at Series B or later with a shipping product that has demonstrated real latency constraints. Premature optimization of middleware is one of the most expensive mistakes in robotics startups -- the opportunity cost of 6 months rebuilding DDS is enormous in the early stages. If latency becomes a real constraint later, you can always replace the transport layer while keeping ROS2 interfaces.

ROS2 DDS Vendor Selection

If you choose ROS2, the DDS vendor matters more than most teams realize. The default CycloneDDS is a solid choice for most applications. FastDDS (the previous default) has better discovery in large multi-machine setups but higher memory overhead. For real-time control loops, CycloneDDS with iceoryx shared-memory transport (zero-copy) eliminates the serialization overhead that makes standard DDS too slow for inner-loop control.

A common pattern at SVRC-supported startups: run CycloneDDS for all non-real-time nodes (perception, planning, logging), and use a lightweight custom transport (raw UDP or shared memory) for the 500Hz+ inner control loop between the policy inference node and the motor controller. This hybrid approach gets you 95% of the ROS2 ecosystem with the latency characteristics of custom middleware where it matters.

# Example: ROS2 + zero-copy shared memory via iceoryx
# In your ROS2 workspace, select the CycloneDDS middleware:
export RMW_IMPLEMENTATION=rmw_cyclonedds_cpp

# cyclonedds.xml -- enable the iceoryx shared-memory transport:
# <CycloneDDS>
#   <Domain>
#     <SharedMemory>
#       <Enable>true</Enable>
#       <LogLevel>warning</LogLevel>
#     </SharedMemory>
#   </Domain>
# </CycloneDDS>

# Point CycloneDDS at the config file:
export CYCLONEDDS_URI=file://$(pwd)/cyclonedds.xml
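The custom side of the hybrid pattern can stay outside DDS entirely. Below is a minimal sketch of a raw-UDP action channel between a policy node and a motor controller; the 7-joint layout, port number, and sequence-number-plus-float32 wire format are illustrative assumptions, not any real controller's protocol:

```python
import socket
import struct

# Hypothetical wire format: uint32 sequence number + 7 float32 joint targets.
ACTION_FMT = "<I7f"
ACTION_SIZE = struct.calcsize(ACTION_FMT)  # 4 + 7*4 = 32 bytes

def pack_action(seq: int, joint_targets: list) -> bytes:
    """Serialize one control packet for the inner loop."""
    return struct.pack(ACTION_FMT, seq, *joint_targets)

def unpack_action(payload: bytes) -> tuple:
    """Deserialize a control packet into (sequence, joint targets)."""
    seq, *targets = struct.unpack(ACTION_FMT, payload)
    return seq, list(targets)

def make_sender(host: str = "127.0.0.1", port: int = 9000) -> socket.socket:
    # UDP: no connection setup, no retransmission. At 500 Hz a stale
    # action is worthless, so dropping a packet beats blocking on it.
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.connect((host, port))
    return sock
```

The receiving side keeps the last sequence number it saw and discards anything older -- the usual trade for high-rate control, where a late action is worse than no action.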

Decision 2: Simulation Platform

The simulation choice shapes your training loop for years. Four platforms dominate in 2026, each for different reasons:

| Simulator | Parallel Envs | Contact Physics | License | Best For | GPU Required |
|---|---|---|---|---|---|
| Isaac Lab | 4,096+ on A100 | Adequate | MIT (BSD-3 for Sim) | GPU-parallel RL, locomotion | Yes (RTX 3090+) |
| MuJoCo | ~1,000 (MJX on GPU) | Excellent | Apache 2.0 | Contact-rich manipulation, dexterous hands | No (CPU OK; GPU for MJX) |
| Genesis | 10,000+ on A100 | Good (MPM for soft body) | Apache 2.0 | Soft body, deformables, fluids | Yes |
| Gazebo (Ignition) | 1-10 (CPU only) | Adequate | Apache 2.0 | ROS2 integration testing, sensor sim | No |

NVIDIA Isaac Lab: Best choice if GPU-accelerated RL training is your primary use case. 4,096+ parallel environments on a single A100, tight Isaac Sim integration, MIT license, and a growing model zoo including Unitree G1 and Franka. Weakest on contact accuracy -- fine for locomotion and pick-and-place, but it struggles with precision assembly. The PhysX backend is fast but uses penalty-based contact that can allow penetration at large timesteps.

MuJoCo: The best contact physics of any general-purpose simulator. Constraint-based dynamics with stable contact handling at high compliance. Choose it for dexterous-hand research, contact-rich manipulation, or any task where contact accuracy matters more than parallelization. Free under Apache 2.0 since the DeepMind acquisition. MuJoCo XLA (MJX) brings GPU acceleration via JAX, achieving ~1,000 parallel environments -- not as many as Isaac Lab, but sufficient for most manipulation RL.

Genesis: The newest entrant, gaining rapid adoption for tasks involving deformable objects, fluids, and soft bodies. Genesis uses Material Point Method (MPM) for soft-body physics, which is superior to both MuJoCo and Isaac Lab for simulating cloth, food, and human tissue. If your startup works in food handling, textile manipulation, or surgical robotics, Genesis is worth serious evaluation. Throughput on rigid-body tasks rivals Isaac Lab.

Gazebo (Ignition): Choose only if deep ROS2 integration is a hard requirement -- Gazebo is the official ROS2 simulator and has the tightest toolchain integration. Physics is weaker than the GPU-accelerated alternatives. Acceptable for navigation, sensor simulation, and integration testing. Not viable for RL training due to CPU-only execution.

Policy Architecture: ACT vs Diffusion Policy vs OpenVLA

Your policy architecture choice is tightly coupled with your simulation and data strategy. Here is the current landscape:

| Architecture | Paradigm | Data Needed | Inference Speed | Best For |
|---|---|---|---|---|
| ACT | IL (CVAE + Transformer) | 50-200 demos/task | ~5ms (fast) | Precise bimanual manipulation |
| Diffusion Policy | IL (DDPM denoising) | 100-500 demos/task | ~100-200ms (slow) | Multi-modal tasks, diverse strategies |
| OpenVLA / RT-2 | VLA (vision-language-action) | Fine-tune: 20-100 demos + pre-trained backbone | ~200-500ms (very slow) | Language-conditioned, multi-task |
| PPO/SAC | RL (reward-driven) | 0 demos (sim only) | ~1-5ms (very fast) | Locomotion, tasks with clear rewards |

Practical guidance: Start with ACT if you are collecting teleoperation data and need fast iteration. ACT trains in hours on a single GPU and the inference is fast enough for real-time control. Switch to Diffusion Policy if you find that your task has multiple valid strategies (e.g., approach from left or right) -- ACT's unimodal CVAE struggles with multi-modal demonstrations, while Diffusion Policy handles them naturally. Use OpenVLA/RT-2 only if you need language conditioning (executing natural language instructions) or are fine-tuning from a pre-trained foundation model. The inference speed of VLAs (200-500ms) limits them to tasks where reaction time is not critical.
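Part of why ACT stays fast at inference is chunking: the policy predicts a whole chunk of future actions at once, and overlapping chunks are blended at execution time. A simplified temporal-ensembling sketch in plain Python -- the weighting constant and scalar actions are illustrative, not ACT's exact formulation:

```python
import math

def temporal_ensemble(chunks, t, m=0.1):
    """Blend every chunk prediction that covers timestep t.

    chunks: list of (start_time, actions), oldest first, where actions[j]
    is the predicted action for timestep start_time + j. Actions are
    scalars here for clarity; real policies emit joint vectors.
    """
    covering = [(start, actions) for start, actions in chunks
                if 0 <= t - start < len(actions)]
    weighted, total = 0.0, 0.0
    for rank, (start, actions) in enumerate(covering):
        # Exponential weighting over prediction age; rank 0 is the
        # oldest chunk covering t and receives the largest weight.
        w = math.exp(-m * rank)
        weighted += w * actions[t - start]
        total += w
    return weighted / total
```

The ensemble smooths the discontinuities that would otherwise appear each time the policy commits to a fresh chunk, at the cost of a small buffer of past predictions.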

Decision 3: Cloud vs. Edge Inference

Inference latency determines where your model runs. The threshold is approximately 150ms round-trip: below it, cloud inference from a co-located data center is viable for most applications; above it, the delay introduces perceptible lag in teleoperation and destabilizes closed-loop manipulation controllers.
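The 150ms figure is a budget, not a single measurement, so it pays to decompose it. A back-of-envelope breakdown -- every component value below is an illustrative assumption, not a benchmark:

```python
def round_trip_ms(uplink, inference, downlink,
                  serialization=2.0, actuation=5.0):
    """Sum one observation -> action cycle for cloud inference (ms)."""
    return uplink + inference + downlink + serialization + actuation

# Illustrative scenarios: co-located data center over fiber vs. LTE.
fiber = round_trip_ms(uplink=10, inference=60, downlink=10)  # 87 ms: under budget
lte = round_trip_ms(uplink=60, inference=60, downlink=60)    # 187 ms: over budget
```

The point of writing it out is that network transit usually dominates only on poor links; on a good link, inference time itself is the largest line item, which is why model size drives the cloud/edge decision.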

Edge Compute: Current Hardware Landscape

| Device | TOPS | Price | ACT Inference | Diffusion Policy | OpenVLA 7B |
|---|---|---|---|---|---|
| Jetson Orin Nano (8GB) | 40 | $249 | ~15ms | ~300ms | Not feasible |
| Jetson AGX Orin (64GB) | 275 | $1,999 | ~5ms | ~120ms | ~800ms (quantized) |
| Jetson Thor (expected 2025-26) | ~800 | ~$2,500 est. | ~2ms | ~40ms | ~200ms (feasible) |

Cloud inference makes sense when: robot connectivity is reliable (lab or factory floor with fiber), latency > 150ms is acceptable for the application, and model size exceeds what edge hardware can serve. A hybrid approach works well -- run fast low-latency reactive controllers on-device, offload slow planning and perception to cloud.

Cloud Training: Cost Benchmarks

Training costs vary dramatically by policy architecture and dataset size. Here are real numbers from SVRC training runs as of Q1 2026:

  • ACT (single task, 200 demos): ~8 hours on 1x A100 (40GB). Cost: ~$16 on Lambda Cloud ($2/hr). ~$24 on AWS p4d instances.
  • Diffusion Policy (single task, 500 demos): ~24 hours on 1x A100. Cost: ~$48 on Lambda. Image observations with 3 cameras roughly triple training time vs. low-dim observations.
  • OpenVLA fine-tune (7B, 100 demos): ~12 hours on 4x A100 (LoRA). Cost: ~$96 on Lambda. Full fine-tune requires 8x A100 and ~$200+.
  • PPO locomotion (Isaac Lab, 4096 envs): ~4 hours on 1x RTX 4090. Cost: ~$4 on vast.ai.
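The bullet-point costs above are just hours × GPU count × hourly rate, and it is worth re-running them with your own provider's rates. A trivial helper (the rates echoed here are the quoted ones, which change often):

```python
def training_cost(hours: float, num_gpus: int, rate_per_gpu_hr: float) -> float:
    """Estimated cloud cost in dollars for one training run."""
    return hours * num_gpus * rate_per_gpu_hr

# Sanity checks against the figures above (Lambda A100 at ~$2/GPU-hr):
act = training_cost(8, 1, 2.0)        # ACT single task: ~$16
diffusion = training_cost(24, 1, 2.0) # Diffusion Policy: ~$48
openvla = training_cost(12, 4, 2.0)   # OpenVLA LoRA fine-tune: ~$96
```

Note this omits storage, data transfer, and failed runs -- in practice, budget a multiple of the single-run number for hyperparameter sweeps.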

Decision 4: Data Platform Strategy

The build-vs-buy decision on data infrastructure is simpler than it appears. Build your own data platform only if you have a dedicated ML infrastructure engineer on staff whose primary job is data tooling -- not a researcher who also maintains tooling on the side. Otherwise, the maintenance burden compounds into a significant ongoing tax on your most expensive people.

What a Data Platform Must Do

The core capabilities you need: episode storage with versioning, metadata indexing for fast retrieval, visualization for QA, dataset splitting and export to training formats (HDF5/Zarr/RLDS), and access control for multi-operator environments. Building this from scratch takes 3-6 months and requires continuous maintenance.
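Of those capabilities, metadata indexing is the one teams most often underestimate. A minimal sketch of an episode index using SQLite from the Python standard library -- the schema, field names, and storage URIs are all illustrative, not any particular platform's layout:

```python
import sqlite3

def make_index(path=":memory:"):
    """Create (or open) a tiny episode-metadata index."""
    db = sqlite3.connect(path)
    db.execute("""
        CREATE TABLE IF NOT EXISTS episodes (
            episode_id  TEXT PRIMARY KEY,
            task        TEXT NOT NULL,
            operator    TEXT,
            n_steps     INTEGER,
            success     INTEGER,   -- 1 = success, 0 = failure (for QA splits)
            storage_uri TEXT       -- where the HDF5/Zarr blob actually lives
        )""")
    return db

def successful_episodes(db, task):
    """Fast retrieval for dataset export: all successful demos of one task."""
    rows = db.execute(
        "SELECT episode_id, storage_uri FROM episodes "
        "WHERE task = ? AND success = 1", (task,))
    return rows.fetchall()
```

Even this toy version shows the split that matters: metadata lives in a queryable index, while the bulky episode arrays live in object storage and are only fetched at export time.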

Dataset Management Tools Comparison

| Tool | Type | Robot-Specific | Collection Tools | Training Integration |
|---|---|---|---|---|
| SVRC Platform | Managed service | Yes (teleop, annotation, QA) | Full (hardware + software) | HDF5/RLDS/LeRobot export |
| HuggingFace LeRobot | Open-source lib | Yes (robot datasets) | Recording scripts | ACT, DP, TDMPC built-in |
| Weights & Biases | Managed service | No (general ML) | Experiment tracking only | Any framework |
| DVC + MinIO | Open-source stack | No (general ML) | Version control only | Any framework |

The principle that should guide every stack decision: buy infrastructure, build differentiation. Your competitive advantage is your robot hardware, your task expertise, and your policy architecture -- not your episode storage system. Spend engineering time accordingly.

The SVRC platform provides the full data infrastructure stack -- collection, storage, annotation, training pipeline -- as a managed service with API access.

Recommended Stack by Stage

| Stage | Middleware | Sim | Policy | Inference | Data Platform |
|---|---|---|---|---|---|
| Pre-seed / Seed | ROS2 Humble | MuJoCo or Isaac Lab | ACT (fast iteration) | Edge (Orin Nano) | SVRC or LeRobot |
| Series A | ROS2 + custom control | Isaac Lab + MuJoCo | Diffusion Policy or VLA fine-tune | Hybrid edge+cloud | SVRC or build if infra hire |
| Series B+ | Custom if proven need | Custom + one of above | Custom architecture | Custom serving (TRT) | Build if 2+ infra eng |

Related Reading