Dima Yanovsky, Ekaterina Tiukhtikova
Prox, MIT
yanovsky@mit.edu, katarint@mit.edu
Intro
Imagine you know nothing about robot learning and decide to get into the field. Within a few hours you will realize that there is basically no data to train models on. There is no GitHub for robotics. The next thing that comes to mind is: "How do I bypass the need to collect real robot data?", which will most likely lead you to the following answer: collect data in simulation. We ourselves bet on this idea, and we recount our learnings below.
Before diving in, let's play a quick guessing game. Here is a picture of our table with objects we use every single day. They don't look too complicated, right?
Let's put this table in simulation and teleoperate on it, then! Guess how many objects from it can be put in sim.
Now, once you've placed your bets, let's start.
How It Started
When we started, we already had a platform that ran MuJoCo locally on the Apple Vision Pro chip, and via some engineering around Apple's RealityKit a scene could be teleoperated in the Vision Pro. We already had several robots on the platform, with more being added. So what we needed was simply a lot of environments to teleoperate in, and we wanted to explore how hard that was and how quickly we could do it.
Making a Single Scene from Scratch
The web has a lot of 3D assets. You can find whatever object you can think of very quickly. So our thought was: Oh, if we essentially have all the 3D objects available to us, how hard can it even be to just drag and drop them into a scene? Just get an object from the internet, put it into simulation and teleoperate.
So our bet was: scene creation is hard, but it is simply an engineering challenge, and fundamentally this task is optimizable or even automatable.
So the next question was: what kind of tasks are even feasible?
We were only considering manipulation tasks, without any locomotion. One obvious limit is having to stick to rigid-body tasks. Limited by the fidelity of collision geometries, tasks that require extreme precision, like screwing on a lid or tightening a nut, are also infeasible. Plus, one can only use low-polygon meshes: MuJoCo takes very long to load models with large vertex counts. We tried manually reducing the vertex count for some interesting meshes, but it was very time-consuming and not remotely worth the effort.
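For scale, here is roughly what scripted decimation looks like. We did this step by hand, so Open3D, the file names, and the target triangle count below are illustrative assumptions, not our actual pipeline:

```python
# A minimal decimation sketch (illustrative; we reduced meshes manually).
import open3d as o3d

mesh = o3d.io.read_triangle_mesh("wrench.obj")  # hypothetical high-poly asset
print(f"before: {len(mesh.triangles)} triangles")

# Quadric decimation collapses edges until the target face count is reached.
low = mesh.simplify_quadric_decimation(target_number_of_triangles=2000)
low.remove_degenerate_triangles()
low.remove_unreferenced_vertices()
print(f"after: {len(low.triangles)} triangles")

o3d.io.write_triangle_mesh("wrench_lowpoly.obj", low)  # ready to reference in MJCF
```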
So, out of all tasks in the world (and their count is practically infinite), one can simulate only a small fraction.
We decided to start with very simple scenes: the first scene where we went through the whole pipeline was a toolbox. It consisted entirely of simple rigid bodies, and the task was simply to put some tools in a box, pick-and-place at its finest. It should have been the easiest thing to do, yet while creating this scene we ran into the fundamental problems of simulation. Instead of the several hours we expected, it took us several days to make the scene teleoperatable.
For collisions, MuJoCo by default approximates any imported mesh by its convex hull, so we had to handle collision shapes ourselves. We used the CoACD algorithm to decompose meshes into convex parts for collisions. It took forever to find an optimal split, it generated a huge number of parts, and yet the socket holes were not even close to round, and instruments did not fit into their designated holes. We had to iterate through our pipeline dozens of times, regenerating and manually adjusting collision geometries, and even deleting half of the socket holes from the toolbox.
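As a sketch of that decomposition step, using the `coacd` and `trimesh` Python packages (the threshold value and file names here are illustrative):

```python
# Convex decomposition of a mesh for MuJoCo collision geoms.
import coacd
import trimesh

raw = trimesh.load("toolbox.obj", force="mesh")  # hypothetical input asset
mesh = coacd.Mesh(raw.vertices, raw.faces)

# Lower threshold -> finer (and more numerous) convex parts. This was the
# knob we kept iterating on: coarse parts fill in the socket holes,
# fine parts explode the part count.
parts = coacd.run_coacd(mesh, threshold=0.05)

# Each convex part gets exported and referenced by its own collision
# <geom> in the scene's MJCF.
for i, (verts, faces) in enumerate(parts):
    trimesh.Trimesh(verts, faces).export(f"toolbox_col_{i}.obj")
print(f"{len(parts)} convex collision pieces")
```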
That's why the scene creation process could not really be automated: each iteration needed quality checks and corrections. We thought something like a VLM could assess how good the collision geometries were, but eventually decided that attempts to automate the process could even slow it down. So the pipeline that was supposed to look like this:
Turned out to look something like this (simplified!):
MuJoCo's performance also degrades with scene complexity: instabilities multiply, scenes with many objects take forever to load, and the code for each scene grows to 5000+ lines, making even minor edits tedious.
One of the later scenes we created was a kitchen; the task was sorting dishes from a dish rack. At just 30 freejoint objects, the simulation was already not real-time, which completely undermined the idea of running MuJoCo locally on the Apple Vision Pro.
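To make that concrete, here is a minimal benchmark sketch of how the real-time factor can be measured as the freejoint count grows. The boxes stand in for our dishes; sizes, counts, and the generated scene are illustrative, not our actual kitchen (whose mesh collision geoms made things far worse):

```python
# Measure wall-clock time per simulated second vs. number of freejoints.
import time
import mujoco

def scene_xml(n):
    bodies = "".join(
        f'<body pos="{0.1 * i} 0 0.5">'
        f'<freejoint/><geom type="box" size=".03 .03 .03"/></body>'
        for i in range(n)
    )
    return f'<mujoco><worldbody><geom type="plane" size="5 5 .1"/>{bodies}</worldbody></mujoco>'

for n in (5, 15, 30):
    model = mujoco.MjModel.from_xml_string(scene_xml(n))
    data = mujoco.MjData(model)
    steps = int(1.0 / model.opt.timestep)  # one simulated second
    t0 = time.perf_counter()
    for _ in range(steps):
        mujoco.mj_step(model, data)
    wall = time.perf_counter() - t0
    print(f"{n} freejoints: real-time factor {1.0 / wall:.1f}x")
```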
Sticking with just rigid bodies did not seem like a solution anymore — even they did not work well in big scenes. So we decided to take MuJoCo one big step further and explore deformable objects.
In Rigid-Body Prison
So we came to the conclusion that creating one scene, even one consisting entirely of rigid bodies, is incredibly laborious. But we still wanted to explore how we could take MuJoCo one step further and simulate deformable bodies. Our bet was: yes, it takes a lot of engineering effort to create one scene, but if we could actually create a scene with deformable bodies and get out of the pick-and-place rigid-body prison, it would be worth the effort.
First, we explored how MuJoCo approaches deformables. Because they are approximated as collections of small rigid bodies connected by joints, handling high-fidelity deformables like cloth or cables is a huge computational load.
41-piece cable
71-piece cable
91-piece cable
Cables are approximated as chains of capsules. The more capsules, the better a cable looks and the more realistic its physics. But at just 100 segments the simulation becomes unstable and slow: that is already too much load for MuJoCo, and the cable does not even look smooth yet. We managed to create a scene with a cable and teleoperate it, but it still was not smooth enough (a minimal model sketch follows the videos below).
Third-person POV
Wrist camera POV
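For reference, here is a minimal cable model sketch, adapted from MuJoCo's own elasticity-plugin example. The stiffness and damping numbers are illustrative, and `count` is the segment knob from the clips above:

```python
# A capsule-chain cable via MuJoCo's composite + elasticity plugin.
import mujoco

CABLE_XML = """
<mujoco>
  <extension>
    <plugin plugin="mujoco.elasticity.cable"/>
  </extension>
  <worldbody>
    <composite type="cable" curve="s" count="41 1 1" size="1"
               offset="0 0 .6" initial="none">
      <plugin plugin="mujoco.elasticity.cable">
        <config key="twist" value="1e7"/>
        <config key="bend" value="4e6"/>
      </plugin>
      <joint kind="main" damping=".015"/>
      <geom type="capsule" size=".005" rgba=".8 .2 .1 1"/>
    </composite>
  </worldbody>
</mujoco>
"""

model = mujoco.MjModel.from_xml_string(CABLE_XML)
data = mujoco.MjData(model)
print(model.nbody, "bodies,", model.nv, "DoFs")  # both grow with count
mujoco.mj_step(model, data)
```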
We stuck with MuJoCo on CPU and set ourselves the goal of making high-fidelity cables work. We spent several days playing with the model settings, optimizing how constraints were stored, and tuning the deformables' internal parameters. This once again showed what a time drain creating environments was.
Cloth was worse: it needed hundreds of subdivisions to look even somewhat smooth, and since the cloth is built from one body per grid point, a 30×30 grid already means 900 jointed bodies. This huge jointed nested structure blew up in simulation every time the robot touched it. There is no need to even talk about the speed of this sim.
It was obvious that MuJoCo running on the CPU could not handle it, so we turned to the GPU version, MuJoCo Warp. However, it is still under development and does not support some of the plugins we were using to simulate deformable objects; overall, it looks like handling deformables was never its main goal. So now we knew: MuJoCo can't handle big rigid-body scenes, and it can't handle deformables, at least not without huge engineering effort. But is that effort even worth it?
Working with Teleoperators
One of our big bets on simulation was price: to teleoperate on a real setup, you need to buy the hardware and set it up, which costs a lot. With simulation, the cost you pay is basically an Apple Vision Pro and the teleoperators' hourly wage.
But one hour of their work hardly equals one hour of data (especially high-quality data). Teleoperators tend to record worse data over time. They slam the robot's grippers against the table. They drop objects, which sends the scene haywire, or simply complete the task incorrectly. Sometimes the scene resets just because there was a collision MuJoCo could not handle, and all the work the teleoperator has done so far is immediately invalid.
Teleoperator dropping objects
Scene resetting upon collisions
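A minimal sketch of the kind of watchdog this implies, using MuJoCo's standard divergence warning; the reset policy and the toy model are our illustration, not what the platform actually ran:

```python
# Detect a blown-up state after a bad contact and discard the episode.
import mujoco

def step_with_watchdog(model, data):
    """Step once; report whether the episode is still valid."""
    mujoco.mj_step(model, data)
    # mjWARN_BADQACC fires when accelerations diverge, e.g. after a
    # collision the solver could not handle.
    if data.warning[mujoco.mjtWarning.mjWARN_BADQACC].number > 0:
        mujoco.mj_resetData(model, data)  # everything recorded so far is invalid
        return False
    return True

# Toy usage on a trivial model (hypothetical, just to show the call).
model = mujoco.MjModel.from_xml_string(
    "<mujoco><worldbody><body pos='0 0 1'><freejoint/>"
    "<geom type='sphere' size='.05'/></body></worldbody></mujoco>")
data = mujoco.MjData(model)
episode_alive = step_with_watchdog(model, data)
```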
Another issue was teleoperators complaining about each scene: here the wrench is hard to pick up, there the shelf is hard to reach. We had to manually adjust a whole lot of small things before they could finally collect data without getting distracted. Could this be optimized or automated? No way.
Bright Idea Meets Brutal Reality
After a month and a half of creating new scenes in sim, we see that its disadvantages strongly outweigh its advantages.
Simulation has a lot of advantages. Simulation is exciting. But if we sat at home and asked ourselves out of nowhere, "How do I put everything on this whole table into simulation?", we would most probably not be able to do it. Glasses, liquids, pieces of paper, soft tissues: things that we use every single day and never consider particularly complicated are an absolute nightmare to put into simulation.
Let's now return to our question from the very beginning. Check your guesses: can we put these objects into sim?
How many things from here can we put into simulation without spending days on each? Barely a couple.
Given the rigid-body limit and how hard it is to capture fine detail, all the tasks we can simulate right now are some combination of pick-and-place. Some of them may require planning. Some of them look cool and exciting. But fundamentally we cannot approach assembly, building, cooking, and so on. And even if deformable bodies and fluids were feasible, the time it takes to create and test one scene remains a huge limitation.
Right now, the technology is not there to continue with simulation. It is a bright idea. But today, that idea meets the brutal reality of what is actually possible.