MIT’s New AI System Can Learn to See by Touching and Feel by Seeing

  • The AI model can imagine the feeling of touching an object just by looking at it.
  • The model could help future robots to more easily grasp and recognize objects.
  • The team used a KUKA robot arm with a special tactile sensor called GelSight to train the model.

Researchers from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) have developed a predictive artificial intelligence (AI) system that can learn to see by touching and feel by seeing.

The team’s system works in both directions: it can create realistic tactile signals from visual inputs, and it can predict which object, and which part of it, is being touched directly from tactile inputs.

Yunzhu Li, CSAIL PhD student and lead author on a paper about the system, says that the model can help future robots to more easily grasp and recognize objects.

He explains, “By looking at the scene, our model can imagine the feeling of touching a flat surface or a sharp edge. By blindly touching around, our model can predict the interaction with the environment purely from tactile feelings. Bringing these two senses together could empower the robot and reduce the data we might need for tasks involving manipulating and grasping objects.”

The team used a KUKA robot arm with a special tactile sensor called GelSight, designed by another group at MIT, to train the model.

Then, using a simple web camera, the team recorded nearly 200 objects being touched by the arm more than 12,000 times. Breaking those 12,000 video clips down into static frames, the team compiled “VisGel,” a dataset of more than 3 million visual/tactile-paired images.
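
To make the idea of visual/tactile-paired images concrete, here is a minimal Python sketch of how frames from two recordings of the same touch might be matched up. The article does not say how the team aligned its frames, so the Frame class, the pair_frames and frames_per_clip helpers, and the choice to match frames by clip number and frame position are all assumptions made for illustration, not the team's actual pipeline.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Frame:
    """Hypothetical record for one static frame cut from a video clip."""
    clip_id: int   # which touch recording the frame came from
    index: int     # position of the frame inside its clip
    source: str    # "camera" or "gelsight"


def pair_frames(camera: List[Frame], tactile: List[Frame]) -> List[Tuple[Frame, Frame]]:
    """Pair each camera frame with the tactile frame at the same clip and position.

    Assumption: both streams are already synchronized, so equal
    (clip_id, index) means the two frames show the same moment.
    """
    tactile_by_key = {(f.clip_id, f.index): f for f in tactile}
    pairs = []
    for cam_frame in camera:
        match = tactile_by_key.get((cam_frame.clip_id, cam_frame.index))
        if match is not None:  # skip camera frames with no tactile partner
            pairs.append((cam_frame, match))
    return pairs


def frames_per_clip(total_pairs: int, clips: int) -> float:
    """Average number of paired frames contributed by each clip."""
    return total_pairs / clips
```

Using the article's rounded figures, frames_per_clip(3_000_000, 12_000) works out to roughly 250 paired frames per clip. Since both numbers are reported as "more than," this is only a ballpark.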

Andrew Owens, a postdoc at the University of California at Berkeley, believes that this AI system can help robots decide how firmly they should grip an object.

“This is the first method that can convincingly translate between visual and touch signals. Methods like this have the potential to be very useful for robotics, where you need to answer questions like ‘is this object hard or soft?’, or ‘if I lift this mug by its handle, how good will my grip be?’ This is a very challenging problem, since the signals are so different, and this model has demonstrated great capability.”

Future plans

The current dataset contains only examples of interactions in a controlled environment. The team hopes to improve this by collecting data in more unstructured settings.

According to the team, this type of model could help in creating a more harmonious relationship between vision and robotics, especially for object recognition, grasping and better scene understanding. It could also help in developing seamless human-robot integration in an assistive or manufacturing setting.