Carnegie Mellon University

Image: Digitized drawing of the Seattle skyline

December 16, 2024

Combining (a Little) AI and Extended Reality

By Sarah Maenner


Two electrical and computer engineering students, Sruti Srinidhi and Edward Lu, won the Best Student Paper award at this year’s IEEE International Symposium on Mixed and Augmented Reality (ISMAR). Their project integrates large language models (LLMs) with an extended reality (XR) headset, using machine learning to help users complete tasks.

They call the project “XaiR” (pronounced “X-air”) because they “insert (a little) AI into XR experiences using the cloud,” integrating these two emerging technologies. LLMs such as ChatGPT can now process some multimodal information, allowing them to respond to user inputs that include images as well as text. Extended reality refers to combining the physical and digital worlds.

“We began the project by exploring what we can do with these two technologies and how we could bring them together,” Srinidhi said.

XaiR is meant to help users deal with confusing or unfamiliar situations, such as identifying a mystery implement and figuring out how to use it with the coffee machine, or making sense of instructions that are too vague or difficult to follow.

It runs on a virtual reality headset that records the user’s surroundings to form a three-dimensional model of their environment. That information is sent to an external server, where a “reality encoder” transforms it into a prompt that can be understood by LLMs; in particular, GPT-4V and Apple’s Ferret (although the system can handle as many LLMs as the user wishes to add). GPT-4V generates text instructions about the task, while Ferret provides the coordinates for where the text should appear in the environment.
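The article does not publish XaiR’s code, so the sketch below is only an illustration of what such a server-side step might look like in Python: packing a headset frame and the user’s question into a multimodal prompt and querying a vision-capable model. The function names, prompt format, and the query_ferret helper are assumptions; only the OpenAI chat-completions call reflects a real client library.

```python
import base64
from openai import OpenAI  # official OpenAI Python client

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def encode_reality(frame_jpeg: bytes, question: str) -> list:
    """'Reality encoder' sketch: turn a captured headset frame plus the
    user's question into a multimodal prompt an LLM can understand."""
    image_b64 = base64.b64encode(frame_jpeg).decode("utf-8")
    return [{
        "role": "user",
        "content": [
            {"type": "text", "text": question},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }]


def next_instruction(frame_jpeg: bytes, question: str) -> str:
    """Ask a vision-capable chat model (the article names GPT-4V) for the
    next step the user should take."""
    response = client.chat.completions.create(
        model="gpt-4o",  # any vision-capable chat model
        messages=encode_reality(frame_jpeg, question),
    )
    return response.choices[0].message.content


def query_ferret(frame_jpeg: bytes, instruction: str) -> tuple:
    """Hypothetical helper: a separately hosted Ferret model that returns
    a 2D bounding box for the object the instruction refers to."""
    raise NotImplementedError("Ferret serving is deployment-specific")
```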

Those instructions and coordinates are sent through a “reality decoder” back to the headset, where they are mapped onto the environment model that was created in an earlier step. In the end, the user sees text explaining the next steps they should take and arrows pointing to the relevant objects and locations.
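How the “reality decoder” maps a 2D location from Ferret onto the headset’s 3D environment model is not spelled out in the article; the sketch below shows one plausible approach, unprojecting the center of a bounding box through a pinhole camera model using the depth captured while scanning the environment. All names and parameters here are assumptions.

```python
import numpy as np


def decode_to_anchor(box, depth_map, fx, fy, cx, cy):
    """'Reality decoder' sketch: convert a 2D bounding box (pixel
    coordinates) into a 3D point in the headset camera frame, using the
    depth map from the environment scan and the camera intrinsics."""
    x_min, y_min, x_max, y_max = box
    u = int((x_min + x_max) / 2)      # box center, pixels
    v = int((y_min + y_max) / 2)
    z = float(depth_map[v, u])        # depth at the center, meters
    # Pinhole unprojection from pixel coordinates to camera coordinates.
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.array([x, y, z])        # anchor for the overlaid text/arrow
```

On the headset, such a point would then be transformed from the camera frame into world coordinates and used as the anchor for the text and arrows the user sees.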

At first, it was difficult to combine these technologies because LLMs require substantial computational resources, more than current XR devices such as the headset can provide. So, the team offloaded much of that computation to the server, creating the XR response pipeline through the reality encoder and decoder.
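A minimal sketch of the headset side of that offloading, assuming a simple HTTP interface: the device ships a captured frame and the user’s question to the server and gets back an instruction plus a placement hint. The endpoint, payload format, and reply fields are all illustrative assumptions, not XaiR’s actual protocol.

```python
import requests

SERVER_URL = "http://xair-server.local:8000/step"  # placeholder address


def request_next_step(frame_jpeg: bytes, question: str) -> dict:
    """Headset-side sketch: offload the heavy LLM work to the server and
    receive an instruction plus where to place it."""
    reply = requests.post(
        SERVER_URL,
        files={"frame": ("frame.jpg", frame_jpeg, "image/jpeg")},
        data={"question": question},
        timeout=30,
    )
    reply.raise_for_status()
    # e.g. {"instruction": "...", "box": [x_min, y_min, x_max, y_max]}
    return reply.json()
```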

They evaluated the project’s effectiveness by having users try it out. “The user tests were definitely the most fun part because we could see how different users responded to and perceived the same technology,” Srinidhi said. The current version of the project explored what was possible; now, the team is working to improve their design and create a more robust system.

Srinidhi and Lu presented their design and findings at ISMAR 2024, held in Seattle from October 21–25, where their paper was voted Best Student Paper out of all the student papers presented.

“I am really happy about the award,” Srinidhi commented. “It validates that my work is actually impactful and appreciated by the research community. Being my first paper, it also gives me a sense of confidence that I am doing good work.”

She is incredibly grateful to her advisor, Anthony Rowe, professor of electrical and computer engineering, and thanks him for his mentorship.