Compressing Robot Vision into 8 Objects
We replaced 256 visual patch tokens with 8 learned object slots and trained a robot VLA from scratch. Slot compression improved training efficiency by 11%.
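To make the 256-to-8 reduction concrete, here is a minimal sketch of one way a fixed set of learned queries can pool patch tokens into slots. It uses plain dot-product attention pooling, which is an assumption: the summary above does not say which slot mechanism was used. The feature width `DIM`, the random stand-in data, and the function names are all hypothetical.

```python
import math
import random

NUM_PATCHES = 256  # visual patch tokens (figure from the post)
NUM_SLOTS = 8      # learned object slots (figure from the post)
DIM = 16           # hypothetical feature width, kept small for illustration


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def compress(patches, slot_queries):
    """Pool patch tokens into one vector per slot via dot-product attention.

    Assumption: simple attention pooling, not necessarily the method
    used in the post.
    """
    dim = len(slot_queries[0])
    scale = 1.0 / math.sqrt(dim)
    slots = []
    for q in slot_queries:
        # Score every patch against this slot's query.
        scores = [scale * sum(qi * pi for qi, pi in zip(q, p)) for p in patches]
        # Normalize over patches so the weights sum to 1.
        weights = softmax(scores)
        # The slot is the attention-weighted average of the patch features.
        slot = [sum(w * p[d] for w, p in zip(weights, patches)) for d in range(dim)]
        slots.append(slot)
    return slots


rng = random.Random(0)
# Random stand-ins for encoder outputs and for the trained slot queries.
patches = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_PATCHES)]
slot_queries = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_SLOTS)]

slots = compress(patches, slot_queries)
print(len(patches), "->", len(slots), "tokens")  # prints "256 -> 8 tokens"
```

In a trained model the slot queries would be learned parameters, and the downstream policy would receive the 8 slot vectors in place of the 256 patch tokens.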
We tried replacing SRPO's 1.1B-parameter V-JEPA with the VLA's own SigLIP encoder. Here's what we learned.