SYNERGAI: Perception Alignment for Human-Robot Collaboration

Yixin Chen*, Guoxi Zhang*, Yaowei Zhang*, Hongming Xu*, Peiyuan Zhi, Qing Li📧, Siyuan Huang📧
Beijing Institute for General Artificial Intelligence (BIGAI)
*Indicates Equal Contribution
TL;DR: We introduce SYNERGAI, an LLM-based system that can align perceptually and collaborate with humans on 3D reasoning tasks in a zero-shot manner.

Abstract

Recently, large language models (LLMs) have shown strong potential in facilitating human-robot interaction and collaboration. However, existing LLM-based systems often overlook the misalignment between human and robot perception, which hinders effective communication and real-world robot deployment. To address this issue, we introduce SYNERGAI, a unified system designed to achieve both perceptual alignment and human-robot collaboration. At its core, SYNERGAI employs a 3D Scene Graph (3DSG) as its explicit and innate representation. This enables the system to leverage an LLM to break down complex tasks and allocate appropriate tools in intermediate steps to extract relevant information from the 3DSG, modify its structure, or generate responses. Importantly, SYNERGAI incorporates an automatic mechanism that corrects perceptual misalignment with users by updating its 3DSG through online interaction. SYNERGAI achieves performance comparable to data-driven models on ScanQA in a zero-shot manner. Through comprehensive experiments across 10 real-world scenes, SYNERGAI demonstrates its effectiveness in establishing common ground with humans, realizing a success rate of 61.9% on alignment tasks. It also significantly improves the success rate from 3.7% to 45.68% on novel tasks by transferring the knowledge acquired during alignment.

Overview

Leveraging the 3DSG as its representation, SYNERGAI decomposes complex tasks with LLMs and acts with our designed tools at intermediate steps. It interacts with humans through natural language and non-verbal mouse clicks that disambiguate object references, facilitating human-robot collaboration and perceptual alignment by automatically modifying the data stored in the 3DSG.
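As a rough illustration of the non-verbal channel, the sketch below resolves a mouse click into an object node of the scene representation by back-projecting the clicked pixel and picking the object whose center is nearest; the function and field names are hypothetical and not taken from SYNERGAI's implementation.

```python
# Hypothetical sketch: resolving a mouse click to an object node.
# Names below are illustrative, not SYNERGAI's API.
import numpy as np

def backproject(u, v, depth, K, cam_pose):
    """Lift pixel (u, v) with its depth into a world-frame 3D point."""
    x = (u - K[0, 2]) * depth / K[0, 0]
    y = (v - K[1, 2]) * depth / K[1, 1]
    p_cam = np.array([x, y, depth, 1.0])
    return (cam_pose @ p_cam)[:3]          # camera-to-world transform

def resolve_click(u, v, depth, K, cam_pose, objects):
    """Return the object whose center is closest to the clicked 3D point.

    `objects` is a list of dicts such as {"id": 3, "label": "box", "center": np.array([...])},
    i.e., a minimal stand-in for object nodes in the 3DSG.
    """
    point = backproject(u, v, depth, K, cam_pose)
    return min(objects, key=lambda o: np.linalg.norm(o["center"] - point))
```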

System Design

SYNERGAI represents 3D scenes with 3DSGs and leverages LLMs to respond to user inputs. It is first prompted to generate a plan that decomposes the input task into sub-tasks solved sequentially. At each step, SYNERGAI selects a tool as its action based on the observation, which contains the results of the previous actions. In this example, the system identifies the correct object for the relationship “on the blue box” but incorrectly recognizes it as a book, which is where perceptual misalignment occurs.
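A minimal sketch of such a plan-and-act loop is shown below. It assumes a generic `llm` callable and a dictionary of tool functions; SYNERGAI's actual prompts, tool interface, and control flow are more elaborate, and the names here are placeholders.

```python
# Minimal plan-and-act loop over 3DSG tools (hypothetical interface).
def run_task(llm, task, tools, max_steps=10):
    """llm: callable(str) -> str; tools: dict mapping tool name -> callable(str) -> str."""
    # Step 1: ask the LLM for a plan that decomposes the task into sub-tasks.
    plan = llm(f"Decompose the task into sub-tasks:\n{task}")
    observation = ""
    for _ in range(max_steps):
        # Step 2: pick the next tool and its argument given the observations so far.
        decision = llm(
            f"Task: {task}\nPlan: {plan}\nObservation so far: {observation}\n"
            "Reply as '<tool_name>: <argument>' or 'FINISH: <answer>'."
        )
        name, _, argument = decision.partition(":")
        if name.strip() == "FINISH":
            return argument.strip()               # final response to the user
        result = tools[name.strip()](argument.strip())
        observation += f"\n{name.strip()} -> {result}"
    return observation
```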

Tool Design

The tools developed for SYNERGAI support accessing relevant information from the 3DSG (the top five), modifying its data (the following four), and generating responses to the user (the last two).
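The sketch below groups hypothetical tools into the three categories described above (query, modify, respond). The specific tool names and the plain-dict scene graph are assumptions for illustration, not SYNERGAI's actual tool set.

```python
# Hypothetical tool registry grouped by the three categories described above.
# Each tool closes over a shared scene graph (here a plain dict of object nodes).
def build_tools(scene_graph):
    def get_object_info(object_id):
        """Query: return the attributes stored for one object node."""
        return str(scene_graph["nodes"][int(object_id)])

    def update_label(argument):
        """Modify: overwrite an object's semantic label, e.g. '7, notebook'."""
        object_id, label = argument.split(",")
        scene_graph["nodes"][int(object_id)]["label"] = label.strip()
        return f"object {object_id.strip()} relabeled to {label.strip()}"

    def respond(message):
        """Respond: pass a natural-language answer back to the user."""
        return message

    return {
        "get_object_info": get_object_info,   # accessing information from the 3DSG
        "update_label": update_label,         # modifying its data
        "respond": respond,                   # generating responses to the user
    }
```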

Human-Robot Collaboration

We demonstrate SYNERGAI’s capability in high-level human-robot collaboration through its zero-shot performance on 3D reasoning tasks, including object captioning, scene captioning, question answering, and task planning. Quantitatively, it reaches performance comparable to data-driven methods on the ScanQA benchmark in a zero-shot manner.

Human-Robot Alignment

We systematically assess SYNERGAI’s capability in achieving perceptual alignment with humans across 10 real-world scenes sourced from the ScanNet dataset. The evaluation comprises two phases: alignment tasks and knowledge transfer.

Alignment Task List and Examples

Scene Reconstruction and 3DSG

From a sequence of posed RGB-D images, the 3D mesh of a scene can be reconstructed through depth fusion followed by the marching cubes algorithm, through point accumulation from depth images as in ConceptGraphs, or via neural rendering with state-of-the-art methods such as MonoSDF. We construct a 3D Scene Graph (3DSG) as the system's data structure, which encapsulates hierarchical topology and key information necessary for 3D reasoning.
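A minimal stand-in for the 3DSG structure is sketched below, assuming each object node stores a semantic label and a 3D bounding box, and relations are stored as edges; the field names and schema are illustrative, not the exact structure used by SYNERGAI.

```python
# Illustrative 3DSG skeleton: object nodes with attributes and relation edges.
# Field names are assumptions for illustration, not SYNERGAI's exact schema.
from dataclasses import dataclass, field

@dataclass
class ObjectNode:
    object_id: int
    label: str                  # semantic class, e.g. "blue box"
    center: tuple               # (x, y, z) of the bounding-box center
    size: tuple                 # (dx, dy, dz) bounding-box extents

@dataclass
class SceneGraph:
    nodes: dict = field(default_factory=dict)   # object_id -> ObjectNode
    edges: list = field(default_factory=list)   # (subject_id, relation, object_id)

    def add_object(self, node: ObjectNode):
        self.nodes[node.object_id] = node

    def add_relation(self, subj: int, relation: str, obj: int):
        self.edges.append((subj, relation, obj))

# Usage: two objects and one spatial relation.
scene = SceneGraph()
scene.add_object(ObjectNode(1, "blue box", (0.4, 1.2, 0.3), (0.3, 0.3, 0.2)))
scene.add_object(ObjectNode(2, "notebook", (0.4, 1.2, 0.5), (0.2, 0.15, 0.02)))
scene.add_relation(2, "on", 1)   # the notebook is on the blue box
```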

Prompts and Doc-strings

