
Simulating a $200K teleoperation setup for $2K

Dima Yanovsky, Mauricio Pereira
Prox, MIT
yanovsky@mit.edu, maurici0@mit.edu

Real-time teleoperation of a dual Shadow Hand setup ($200K hardware value) running on our DART teleoperation platform on Apple Vision Pro

Bracing for the Challenge

Lots of things in robotics are hard, but some are insanely hard. Collecting large datasets of dexterous manipulation is one of the latter. If we wanted to collect such a dataset using two Shadow Hands (among the most dexterous robotic hands in the world), we'd have to spend $200,000+ on the hardware alone.

Not having those kinds of resources, we decided to port the Shadow Hand setup into DART, our teleoperation platform for Apple Vision Pro. In robotics, failure comes first, so this blog recounts the weeks of trial and error that led to a teleoperable bimanual Shadow Hand setup. It gives us a throughput of 40 hours of teleop data per day, meaning we can collect more simulation data in a single day than any existing Shadow Hand dataset contains.

The Shadow Hand is one of the most complex inventions in mechatronics, and we knew that adding it to our teleoperation platform was going to be tough. We had plenty of experience building teleoperation for bimanual arm and gripper setups, but a gripper is a 1D control problem: you can describe it entirely as the distance between its fingertips.

UR5e with a gripper: a 1D control problem


A robotic hand is a high-dimensional control problem: dozens of joints must move in sync to do something as simple as a pinch. To prepare ourselves, we first added the simpler Allegro Hand, which is a step beyond a plain gripper but not quite as dexterous as the most complex hands. Most teleoperation demos of robotic hands stop at "coarse dexterity," such as grabbing a bottle or vaguely moving fingers. Our bar was much higher: true dexterity means being able to pick up even small bolts and nuts.

Allegro Hand working

Coarse Shadow Hand movements



Allegro was a good warm-up choice because every joint is directly actuated, which makes inverse kinematics (IK) fairly straightforward. The Shadow Hand, however, is underactuated: tendons couple some of its joints together. To produce control signals for those coupled joints from the IK solution, we approximated each command as a weighted sum based on tendon lengths. This worked well enough to get coarse movements out of the Shadow Hand, but to unlock true dexterity we had to iterate much harder.
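In the MuJoCo model, each coupling tendon's length is a fixed linear combination of its joint angles, so the collapse amounts to a dot product of the IK-solved angles with the tendon coefficients. A minimal sketch of the idea, with illustrative joint names and coefficients:

```python
# Sketch: collapse IK angles for tendon-coupled joints into one actuator
# command. The coefficients here are illustrative; in practice they come
# from the fixed-tendon definition in the hand's MJCF, where
# tendon_length = sum(coef_i * q_i).
TENDON_COEFS = {"FFJ1": 0.5, "FFJ2": 0.5}

def tendon_command(q_ik: dict[str, float]) -> float:
    """Weighted sum of IK joint angles for one coupling tendon."""
    return sum(coef * q_ik[joint] for joint, coef in TENDON_COEFS.items())

# IK wants FFJ1 = 0.6 rad and FFJ2 = 0.9 rad -> one command of 0.75
print(tendon_command({"FFJ1": 0.6, "FFJ2": 0.9}))
```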

Placing Bets

Before touching the code, we reasoned our way to three bets that seemed like the obvious fastest path to a working implementation:

Bet 1: Keep the inverse kinematics (IK) code elegant by using as few IK tasks as possible (e.g., just solve for fingertip poses; see the sketch after this list)
Bet 2: Keep the hand configuration file as simple as possible, staying as close as we could to the default from the MuJoCo Menagerie
Bet 3: Attach motion capture (mocap) tracking directly to the fingertip segments of the Shadow Hand
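For concreteness, here is roughly what Bet 1 looked like, sketched with the Python mink API that our Swift port mirrors (site names are hypothetical): one position-only task per fingertip and nothing else.

```python
import mink

# Bet 1, naively: five fingertip tasks and no other constraints, which
# leaves the solver maximal freedom over the remaining joints.
FINGERTIP_SITES = ["th_tip", "ff_tip", "mf_tip", "rf_tip", "lf_tip"]

fingertip_tasks = [
    mink.FrameTask(
        frame_name=site,
        frame_type="site",
        position_cost=1.0,
        orientation_cost=0.0,  # track position only; orientation stays free
    )
    for site in FINGERTIP_SITES
]
```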

We thought we were reasoning about these bets from first principles. Well, it turns out we weren't. We failed on all of them. After a week of work, the Shadow Hand was still moving horribly.

One of the worst results during iteration was a set of IK tasks that produced inverted movement: extending the teleoperator's fingers bent the Shadow Hand's fingers, and vice versa.


Infinite Iteration

Why did the bets fail?

Bet 1: The Shadow Hand is bigger and has different proportions than a human hand. Giving the IK solver too much freedom backfires when you are trying to make a robotic hand with different dimensions match the movements of every finger joint of the teleoperator: with only fingertip targets to satisfy, the solver is free to park the remaining joints in poses that look nothing like the operator's hand.

Bet 2: Taking the raw XML without any changes meant changing a lot of code on the Apple Vision Pro side, mainly the transformations between MuJoCo and Apple Vision Pro's RealityKit. These transformations gave us the most pain, since we had to define many of them between the two frameworks by hand. That complexity, coupled with figuring out the correct scale and offset of the mocap bodies for each teleoperator's hand, turned into what felt like infinite iteration.

This mismatch is what you get when you load the Shadow Hands directly from the MuJoCo Menagerie. You can fix it in code (what we did first) or by editing the XML.
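To give a sense of the hand-written transformations involved: MuJoCo's world frame is right-handed with Z up, while RealityKit's is right-handed with Y up. Below is a minimal sketch of one such conversion in Python (our actual code is Swift), assuming the two forward axes are aligned; the exact mapping depends on how the scene is laid out.

```python
import numpy as np

# Rotate MuJoCo's Z-up world into RealityKit's Y-up world: a -90 degree
# rotation about X, i.e. (x, y, z)_mj -> (x, z, -y)_rk.
R_MJ_TO_RK = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.0, -1.0, 0.0],
])

def position_mj_to_rk(p_mj):
    """Map a MuJoCo world position into the RealityKit world frame."""
    return R_MJ_TO_RK @ np.asarray(p_mj)

def quat_mj_to_rk(q_mj):
    """Map a MuJoCo (w, x, y, z) orientation into the RealityKit frame.

    Re-expressing an orientation in the rotated world frame means
    left-multiplying by the frame rotation's quaternion. The result is
    still (w, x, y, z); RealityKit's simd_quatf stores (ix, iy, iz, r).
    """
    w, x, y, z = q_mj
    c, s = np.cos(-np.pi / 4), np.sin(-np.pi / 4)  # -90 deg about X
    # Hamilton product (c, s, 0, 0) * (w, x, y, z)
    return np.array([c * w - s * x,
                     c * x + s * w,
                     c * y - s * z,
                     c * z + s * y])
```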


Bet 3: Welding mocap bodies to physical parts of the robot was a mistake, because the IK solver tries to match the origin of the body to the origin of the mocap, and the origin of a distal phalanx is not exactly at the fingertip. So we got coarse, imprecise movements, and it took teleoperators hours to find a pose that could grab a bolt.

To make matters worse, the Shadow Hands were moving sluggishly. We were asking too much of the Apple Vision Pro's M2 chip. We had rewritten Mink in Swift and achieved real-time teleoperation with other robots, but the Shadow Hand has far more DOFs than a gripper. If we couldn't solve this, we couldn't do real-time dexterous teleoperation, and we couldn't collect data.

What Ended Up Working

Bet 1: The solution was twofold. First, balance how many mocap bodies you follow against how many IK tasks you solve (e.g., we "grounded" the palm by adding mocap bodies and IK tasks for the knuckles). Second, find which task relationships work best: for some robots, IK tasks relative to the robot's own frame work best, but for the Shadow Hand we found that absolute tasks work better. A sketch follows below.
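Here is roughly what that balance looked like, again sketched with the Python mink API (site names and costs are hypothetical): fingertip tasks plus knuckle tasks that ground the palm, all targeted in absolute world coordinates taken from the mocap bodies.

```python
import mink

# Fingertip tasks plus knuckle tasks that "ground" the palm. All targets
# are absolute (world-frame) poses copied from the mocap bodies each frame.
def make_tasks():
    tips = [
        mink.FrameTask(frame_name=s, frame_type="site",
                       position_cost=1.0, orientation_cost=0.0)
        for s in ["th_tip", "ff_tip", "mf_tip", "rf_tip", "lf_tip"]
    ]
    knuckles = [
        mink.FrameTask(frame_name=s, frame_type="site",
                       position_cost=0.5, orientation_cost=0.0)
        for s in ["ff_knuckle", "lf_knuckle"]
    ]
    return tips + knuckles

# Per control step, each task gets an absolute target from its mocap body:
#   task.set_target(mink.SE3.from_rotation_and_translation(
#       mink.SO3(data.mocap_quat[i]), data.mocap_pos[i]))
```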

Bet 2: Apple Vision Pro renders objects with RealityKit, which uses a different coordinate frame than MuJoCo. Setting up the Shadow Hand's transformations in the XML so that its initial pose in MuJoCo matches its initial pose in RealityKit simplified the iteration loop. We had been juggling different conventions between MuJoCo and RealityKit, fingers that sometimes moved inversely to what we expected, and more than 50 DOFs; baking the pose into the XML eliminated that complexity from the codebase.

Shadow Hand in MuJoCo
Shadow Hand in RealityKit

Modifying the XML (right) simplified the code by eliminating many pose transformations. All that was left to solve in code was the scale and offset needed to match the hand-tracking data to the mocap bodies.
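The XML change itself is small. A sketch of the idea (body name and quaternion are illustrative, not the Menagerie values): rotate the hand's base body in the MJCF so its resting pose already matches what RealityKit renders.

```python
import mujoco

# Re-orient the base body in the MJCF so MuJoCo's initial pose lines up
# with RealityKit's, instead of converting every pose in code at runtime.
HAND_XML = """
<mujoco>
  <worldbody>
    <!-- quat = -90 degrees about X, aligning the Z-up MuJoCo scene with
         the Y-up RealityKit scene at startup (illustrative values). -->
    <body name="rh_forearm" pos="0 0.2 0" quat="0.7071 -0.7071 0 0">
      <geom type="box" size="0.02 0.02 0.1"/>
    </body>
  </worldbody>
</mujoco>
"""
model = mujoco.MjModel.from_xml_string(HAND_XML)
```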


Bet 3: Weld the mocap bodies to sites placed exactly at the fingertips, not to the distal phalanx bodies of the Shadow Hand. A site can sit exactly at the fingertip, which makes control far more intuitive for the teleoperator: we think about pinching with the very tips of our fingers, not with the base of our distal phalanges.
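A sketch of the fix (names, offsets, and geometry are hypothetical): define a site at the tip of the distal phalanx, offset from the body origin, and have the IK track the mocap with that site instead of the body frame.

```python
import mujoco
import mink

# A fingertip site offset from the distal phalanx's body origin, plus a
# mocap body for the operator's fingertip to drive.
XML = """
<mujoco>
  <worldbody>
    <body name="rh_ffdistal" pos="0 0 0.3">
      <joint name="FFJ1" axis="1 0 0"/>
      <geom type="capsule" size="0.009" fromto="0 0 0  0 0 0.026"/>
      <site name="ff_tip" pos="0 0 0.026" size="0.004"/>
    </body>
    <body name="ff_tip_mocap" mocap="true" pos="0 0 0.326">
      <site name="ff_tip_target" size="0.004"/>
    </body>
  </worldbody>
</mujoco>
"""
model = mujoco.MjModel.from_xml_string(XML)

# Track the mocap with the fingertip site, not the phalanx body origin.
tip_task = mink.FrameTask(frame_name="ff_tip", frame_type="site",
                          position_cost=1.0, orientation_cost=0.0)
```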


Welding the mocap bodies at the distal phalanges (left) gets the job done, but the operator manipulates while thinking about where the fingertips are. Welding the mocaps at the fingertips (right) results in more intuitive teleoperation and finer dexterity.


We also solved the sluggishness of teleoperating high-DOF robots like the Shadow Hand. Before, a single XML file contained both hands, and because of the way MuJoCo builds the Jacobian, the IK solver had to consume huge (and often very sparse) matrices, which slowed the loop to 10 Hz, well below real time. We optimized the IK solver (e.g., warm-starting the solution and splitting the solver between the left and right hands), which got us real-time teleoperation at 30 Hz.
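A sketch of the split and the warm start, in Python mink terms (file names hypothetical): each hand becomes its own small IK problem, and each solve starts from the previous frame's solution.

```python
import mujoco
import mink

# Two independent IK problems instead of one 50+ DOF problem, so each QP
# sees a small dense Jacobian rather than one huge sparse one.
left = mink.Configuration(mujoco.MjModel.from_xml_path("left_shadow_hand.xml"))
right = mink.Configuration(mujoco.MjModel.from_xml_path("right_shadow_hand.xml"))

def step(configuration, tasks, dt=1.0 / 30.0):
    # The configuration still holds last frame's solution, so the solver
    # is warm-started near the optimum and converges quickly.
    vel = mink.solve_ik(configuration, tasks, dt, solver="quadprog")
    configuration.integrate_inplace(vel, dt)
    return configuration.q  # joint targets for this hand, at 30 Hz
```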


From coarse to fine. On the right, you can see the Shadow Hand not only meet fingertip to fingertip, but do it in two different ways—curving the fingers, then keeping them straight.


Complete Setup

We currently have 5 Apple Vision Pros, which let us collect 40 hours of simulation data every day. That is more than any public dataset on the Shadow Hand, or on any highly dexterous robotic hand, and we can do it in a single day.

UR5e Shadow Hand Teleoperation Setup

$200K setup versus our $2K setup!


Final Shadow Hand Setup 1
Final Shadow Hand Setup 2