AXIS: A Robot Orchestration Layer for Physical AI

AXIS, short for Autonomous eXecution & Integration System, is an experiment in robot orchestration: how can one operator coordinate robot work through a single control layer instead of manually controlling each machine?

The project explores the layer between human intent and physical execution. Rather than treating each robot as a separate system, AXIS connects high-level instructions, robot skills, robot state, camera context, and execution tools into one control loop.

Watch demo here

Motivation

Most robotics demos focus on a single robot performing a single task. But real physical AI systems need more than isolated skills. They need a way to understand what work should be done, which robot can do it, whether the robot is ready, and how execution should be monitored.

AXIS was built around that problem.

The goal was not to create a new low-level controller. The goal was to build an orchestration layer above existing robot skills — a system that can receive an instruction, inspect available capabilities, choose the right action, and keep the operator informed while the robot executes.

In this architecture, the human gives intent. AXIS handles coordination. The robot skills handle physical execution.

What AXIS Does

AXIS acts as a central coordination layer for robot work.

When a task is given, AXIS checks the available robot skills, robot status, and scene context. It then maps the task to the appropriate robot behavior and sends commands through the execution layer.

A simplified flow looks like this:

The operator gives a high-level task.
AXIS reads robot state, available skills, and scene context.
The system selects the right robot skill.
The command is sent to the robot.
The robot reports progress back.
The operator keeps oversight through video and status updates.

This makes AXIS a bridge between natural language tasking and robot control. The system does not assume every command should immediately run. It first checks what the robot can do, what state it is in, and what is happening around it.

Architecture

AXIS is built around four main components.

Robot skills
The robot executes reusable skills trained or scripted ahead of time. In the prototype, these skills were built using LeRobot workflows and Action Chunking Transformer-style imitation learning.

Agent orchestration layer
The AXIS agent receives the task, reasons over available tools and context, selects the right skill, and issues execution commands.

Robot state and skill registry
The system tracks which skills are available, whether the robot is ready, and what execution state the robot is currently in.

Scene and camera context
Camera snapshots and scene descriptions are used to give the agent awareness of the environment. Since the planning model was text-based, visual information was converted into language before being injected into the agent context.

This separation is important. AXIS does not replace the robot controller. It operates above it. The low-level skill controls the physical motion, while AXIS decides when, why, and how that skill should be used.

Implementation

The prototype ran on the open-source LeRobot platform, with a Jetson Orin Nano serving as the local robotics computer. The Jetson handled local robot control, camera streams, and execution tools.

For planning, AXIS used an AI agent powered by openai/gpt-oss-120b. The agent was connected to orchestration tools that exposed robot capabilities, skill execution, and system state.

Engineering Challenges

The main challenges were not just robotics problems. They were system integration problems.

Compute
gpt-oss-120b required more compute than the Jetson could provide locally. AXIS solved this by moving large model calls off-device while keeping robot execution local.

Video processing
Camera decoding initially pushed CPU usage close to 99%. A custom GStreamer pipeline with CUDA decoding reduced CPU usage to around 30%, making the system much more usable.

Context reliability
The agent needed updated information about skills, robot state, and scene context. Static prompting was not enough, so AXIS dynamically injected the latest system state into the agent context.

Vision gap
The model was text-only, so it could not directly understand camera images. To work around this, AXIS used an image-to-text tool that converted camera snapshots into scene descriptions.

Conclusion

AXIS is an early step toward a more general physical AI stack: systems where robots are not controlled one command at a time, but coordinated through layers that connect intent, perception, skills, and execution.

The main lesson from this project is that robot intelligence is not only about training better models. It is also about building the infrastructure around those models; the skill registry, execution tools, state tracking, camera context, and operator feedback loops that make them useful in the real world.

For Daionics, AXIS represents the orchestration side of that stack. It focuses on how robot work is assigned, monitored, and executed. Future work will expand this into multi-robot coordination and connect it with adaptive control layers like Cortex, where trained policies can adjust behavior based on uncertainty, risk, and changing conditions.