John Atkinson*
AI-Empowered, Santiago, Chile
*Corresponding author: John Atkinson, AI-Empowered, Santiago, Chile
Submission: May 30, 2025; Published: July 17, 2025
ISSN: 2832-4463, Volume 4, Issue 5
Generative Artificial Intelligence (GenAI) is fundamentally reshaping robotics, moving the field beyond rigid, pre-programmed systems toward flexible, adaptive, and creative machines. Traditional robotics has long relied on precise control systems, detailed planning, and narrow task definitions, but GenAI, through technologies such as large vision-language models, diffusion models, and imitation learning, enables robots to learn from demonstrations, natural language, and online data. These advances are further amplified by collaborative efforts like Open X-Embodiment, which pool data from diverse robots to build scalable, generalist AI models. Despite these breakthroughs, significant challenges remain before robots can be fully integrated into everyday life. Issues such as safety, interpretability, data efficiency, and real-time performance continue to limit deployment in high-stakes or consumer-facing contexts. Moreover, robots still lack the general-purpose commonsense needed for complex, multi-step tasks in unstructured environments. Nonetheless, the future of robotics is being rapidly transformed by GenAI, with promising directions including open-ended skill acquisition, personalized user interactions, and integration with emerging technologies. Accordingly, this review discusses recent research, challenges, and applications of GenAI in robotics and their impact on real-life applications.
Keywords: Generative artificial intelligence; Intelligent robotics; Large vision-language models; LLM-driven robotics
Generative Artificial Intelligence (GenAI) has made significant strides in recent years, particularly in natural language processing, image synthesis, and multimodal learning. Its integration with robotics, an area traditionally dominated by deterministic control systems, perception algorithms, and classical planning, signals a paradigm shift toward more adaptable, data-driven, and creative robotic systems. Recent advancements in GenAI are reshaping the field of robotics, opening new possibilities for learning, adaptation, and interaction. Unlike traditional robotics, which often relies on pre-programmed behaviors and narrowly defined tasks, GenAI enables robots to reason, imagine, and create solutions in dynamic environments [1]. This shift marks a critical evolution in how robots are designed and deployed, with the potential to impact industries ranging from manufacturing and healthcare to education and the creative arts. Through the integration of large vision-language models, multimodal learning, and imitation from demonstrations, robots can now learn new tasks continuously from diverse sources, including online content, user instructions, and their own experiences [2,3]. This approach dramatically reduces the need for manual programming and expands the range of environments in which robots can operate effectively.
GenAI also improves human-robot interaction by making communication more natural and intuitive. Robots equipped with generative models can interpret spoken or written commands, understand context, and generate appropriate responses or actions. Moreover, the creative potential of GenAI is unlocking new roles for robots in art, design, and entertainment. To support these sophisticated functions, researchers are increasingly exploring new computational architectures, AI techniques, and collected robotics datasets, which make the deployment of advanced robotic systems more practical and scalable. Together, these developments suggest a future in which GenAI is a foundational element of next-generation robotics. This review explores the current trajectory and future prospects of GenAI in robotics.
Tasks that are cognitively or physically trivial for humans often pose substantial challenges for robotic systems, whereas tasks that demand sustained precision or endurance are relatively straightforward for machines to perform [4]. For instance, a robot can play chess or maintain a fixed grip on an object indefinitely with high reliability. In contrast, tasks such as tying shoelaces, intercepting a moving object, or engaging in natural language dialogue require sophisticated perceptual, motor, and cognitive integration. These challenges stem from several key limitations:
1. Imprecise motor control and coordination,
2. Constrained perceptual understanding due to dependency on limited-resolution sensor data, and
3. An absence of intuitive physical reasoning, which humans typically develop through embodied experience.
Traditionally, roboticists have addressed these limitations through model-based control and explicit motion planning. This approach typically involves the use of vision systems to detect and classify objects and environments, followed by the construction of detailed predictive models to estimate the consequences of specific motor commands. Based on these models, planners generate highly deterministic action sequences, which are rigorously tested and incrementally refined in controlled laboratory settings to ensure robustness and repeatability [5]. This approach has its limits: robots trained this way are strictly choreographed to work in one specific setting. Compared with other fields, such as computer vision, robotics has been in the dark ages. But that might not be the case for much longer, because the field is experiencing a big shake-up. Thanks to new approaches such as GenAI, the focus is now shifting from feats of physical dexterity to building general-purpose robot brains in the form of deep neural networks. Much as the human brain is adaptable and can control different aspects of the human body, these networks can be adapted to work in different robots and different scenarios.
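To make the contrast concrete, the following is a minimal sketch of the model-based pipeline just described, under the assumption of a toy point-mass dynamics model and an exhaustive planner; all names, parameters, and the scenario are illustrative and not drawn from any specific published system.

```python
# Minimal sketch of the classical pipeline: perceive state, predict with an
# explicit dynamics model, then commit to a deterministic action sequence.
import numpy as np

def dynamics(state, action, dt=0.1):
    """Hand-crafted predictive model: a point mass pushed by a force (assumption)."""
    pos, vel = state
    return np.array([pos + vel * dt, vel + action * dt])

def plan(state, goal, horizon=20):
    """Exhaustively score constant action sequences against the model."""
    candidates = [np.full(horizon, a) for a in np.linspace(-1.0, 1.0, 21)]
    best_seq, best_cost = None, np.inf
    for seq in candidates:
        s = state.copy()
        for a in seq:
            s = dynamics(s, a)
        cost = abs(s[0] - goal)          # distance to goal after the rollout
        if cost < best_cost:
            best_seq, best_cost = seq, cost
    return best_seq

# A deterministic plan for one fixed scenario; any change to the environment
# requires re-modelling by hand, which is the brittleness noted above.
action_sequence = plan(np.array([0.0, 0.0]), goal=1.0)
print(action_sequence[:5])
```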
Recent technological trends in GenAI and robotics are being driven by the integration of advanced generative models, particularly foundation models that combine multiple modalities such as vision, language, and motor control [6]. These large-scale, general-purpose models enable robots to generalize across diverse tasks and environments, representing a major step toward true embodied intelligence. Additionally, diffusion models are emerging as powerful tools for robotic planning, capable of generating high-quality action sequences, adaptive policies, and even full simulations, providing greater flexibility and robustness in decision-making [7,8].
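As a rough illustration of how a diffusion model can produce action sequences, the sketch below iteratively denoises a random trajectory conditioned on an observation. The denoiser is a stand-in for a trained noise-prediction network, and the shapes, step counts, and simplified update rule are assumptions made only for illustration.

```python
# Minimal diffusion-style action sampler: start from noise and iteratively
# denoise into a candidate action sequence, conditioned on an observation.
import numpy as np

HORIZON, ACTION_DIM, STEPS = 16, 7, 50  # e.g. 16 steps of 7-DoF arm commands (assumed)

def denoiser(noisy_actions, observation, t):
    """Placeholder for a learned noise-prediction network eps_theta(x_t, obs, t)."""
    return noisy_actions * 0.1  # assumption: a trained model would go here

def sample_action_sequence(observation):
    x = np.random.randn(HORIZON, ACTION_DIM)         # pure-noise trajectory
    for t in reversed(range(STEPS)):
        predicted_noise = denoiser(x, observation, t)
        x = x - predicted_noise                       # simplified reverse step
        if t > 0:
            x = x + 0.01 * np.random.randn(*x.shape)  # keep sampling stochastic
    return x

actions = sample_action_sequence(observation=np.zeros(10))
print(actions.shape)  # (16, 7): a candidate trajectory handed to the controller
```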
Another key area is simulation-to-real transfer, where generative models play a critical role in narrowing the gap between virtual training and real-world deployment. By generating realistic textures, physics behaviors, and sensor noise, these models make it easier to transfer skills learned in simulation to physical robots. At the same time, researchers are working on embodied agents that incorporate memory and reasoning capabilities, aiming to create robots that can understand context, recall relevant experiences, and reason symbolically. These trends collectively point toward a future of more autonomous, intelligent, and adaptable robotic systems. Thus, instead of the traditional painstaking planning and training, deep learning and neural networks have been used to create systems that learn from their environment on the go and adjust their behavior accordingly.
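One simple stand-in for the generative simulation-to-real approach described above is domain randomization: perturbing textures, physics parameters, and sensor noise across training episodes so a policy trained in simulation is less surprised by the real world. The sketch below shows the idea with illustrative, assumed parameter ranges; a generative model would synthesize these variations instead of sampling them from fixed intervals.

```python
# Minimal domain-randomization sketch for sim-to-real training.
import random

def randomized_episode_config():
    # All ranges and options below are illustrative assumptions.
    return {
        "floor_texture": random.choice(["wood", "tile", "carpet", "concrete"]),
        "object_friction": random.uniform(0.2, 1.2),
        "object_mass_kg": random.uniform(0.1, 2.0),
        "camera_noise_std": random.uniform(0.0, 0.05),
        "lighting_intensity": random.uniform(0.3, 1.5),
    }

# Each training episode runs in a differently perturbed simulator.
for episode in range(3):
    print(randomized_episode_config())
```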
At the same time, the emergence of low-cost hardware, such as commercially available components and affordable robotic platforms like Stretch, has significantly lowered the barrier to entry for conducting large-scale robotic experimentation. In general, current research leverages artificial intelligence and GenAI to train robotic systems via two state-of-the-art techniques [9]:
A. Reinforcement learning (RL): RL allows systems to improve through trial and error, so the robotic system can adapt its movements in new environments. It has been used to train robotic systems that can perform extreme tasks (e.g., parkour) with minimal pre-programming. This approach is inspired by human navigation, in which humans receive information about the surrounding world from their eyes, which helps them instinctively place one foot in front of the other to get around appropriately. In the same way, a robot can use a camera to look ahead, memorize what is in front of it for long enough to guide its leg placement, learn about the world in real time without internal maps, and adjust its behavior accordingly [9,10].
B. Imitation learning: a model learns to perform tasks by, for example, imitating the actions of a human tele-operating a robot or using a VR headset to collect data on a robot. This technique has recently become more popular with robots that perform manipulation tasks. By pairing it with GenAI methods such as Large Language Models (LLMs), Generative Adversarial Networks (GANs), Transformers, and diffusion models, researchers have been able to quickly teach robots many new tasks. This may extend the technology propelling GenAI from the realm of text, images, and videos into the domain of robot movements [11,12] (a minimal behavior-cloning sketch follows this list).
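The following behavior-cloning sketch illustrates the imitation-learning setting in (B): a small policy network is fit by supervised learning to logged teleoperation pairs of observations and actions. The data here is synthetic, and the network size, dimensions, and training loop are illustrative assumptions rather than any specific published system.

```python
# Minimal behavior cloning: learn a policy from (observation, action) demos.
import torch
import torch.nn as nn

OBS_DIM, ACT_DIM = 32, 7  # assumed observation and action dimensions

# Stand-in for logged teleoperation data (e.g. from a VR headset or leader arm).
demos_obs = torch.randn(1000, OBS_DIM)
demos_act = torch.randn(1000, ACT_DIM)

policy = nn.Sequential(nn.Linear(OBS_DIM, 128), nn.ReLU(), nn.Linear(128, ACT_DIM))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

for epoch in range(10):
    pred = policy(demos_obs)
    loss = nn.functional.mse_loss(pred, demos_act)  # imitate the demonstrator
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# At deployment the policy maps each new observation directly to an action,
# replacing a hand-written controller.
action = policy(torch.randn(1, OBS_DIM))
```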
A common approach begins with human teleoperation, where a human operator manually controls the robot to demonstrate target behaviors. These demonstrations serve as foundational data for training, which is subsequently leveraged by GenAI techniques such as diffusion models to enable the robot to learn complex skills autonomously from the provided data [5]. For instance, researchers have successfully trained robots to perform over 200 distinct tasks, including fine motor activities such as peeling vegetables and pouring liquids, with ongoing efforts aimed at scaling this capability to over 1,000 skills by year-end [13]. In parallel, industry efforts have advanced the development of multimodal robotic foundation models. A notable example is Covariant's RFM-1, which integrates diverse input modalities (text, images, video, robot command sequences, and sensor measurements) to facilitate flexible task specification and execution.
GenAI models not only enhance a robot's ability to interpret complex multimodal instructions but also enable the generation of contextual visual representations (e.g., task-related images or video simulations). A recent development by Stanford researchers, ALOHA (A Low-cost Open-source Hardware system for bimanual teleoperation), demonstrated that a robot could learn to perform tasks such as cooking shrimp using as few as 20 human demonstrations, supplemented by data from unrelated tasks (e.g., removing a paper towel or tape) [14]. These findings indicate that GenAI enables cross-task generalization, where training on a specific task can improve performance on others through shared representational learning and transferable skill acquisition.
Recent advancements suggest that GenAI has the potential to render many conventional robotics methodologies obsolete. This evolution is timely, as the robotics field, despite decades of rigorous algorithmic development and system engineering, continues to face significant limitations in core areas such as perception, motion planning, reasoning, grasping, manipulation, and human-robot interaction, particularly when operating in the unstructured, dynamic environments characteristic of the human world [15]. Deep learning-based approaches are increasingly demonstrating competitive performance relative to traditional, model-based techniques in both control and sensorimotor processing tasks. In particular, large language models (LLMs), when trained on sufficiently diverse and large-scale datasets, exhibit a compelling capacity to generalize across a wide range of tasks and situational contexts, offering a promising new paradigm for robotic autonomy and adaptability [16,17].
However, gathering training data for robots is costly and slow. Some estimates suggest that, to match the volume of data available for Natural Language Processing (NLP) from the streams of images and text produced by internet users, robotics training data would need to scale up by a factor of 27 million. A recent community effort named Open X-Embodiment has produced a dataset covering 22 robots, 527 skills, and 160,266 tasks, which is a sizeable start. However, the feasibility of ever gathering sufficient data to develop a general-purpose robotics model is questionable.
The complexity of real-world human-robot interactions requires exceptionally high standards of reliability and robustness. While zero-shot success rates of 50% to 75% may be considered notable achievements under controlled laboratory conditions, such performance levels remain insufficient for safety-critical or human-facing deployment scenarios. Beyond quantitative benchmarks, concerns related to the reliability and trustworthiness of general-purpose robotic models present significant challenges. Unlike language-based systems (e.g., ChatGPT or Gemini), where occasional factual inaccuracies or hallucinations may be tolerable, physical robotic systems operating in human environments must adhere to strict safety and dependability constraints. Consequently, robotics must continue to integrate models grounded in physical reasoning and embodied understanding of the environment.
To address these challenges, researchers have begun exploring the integration of Large Vision-Language Models (LVLMs) into robotic systems [18,19]. Early research suggests that LVLMs significantly enhance capabilities in scene understanding, human-robot interaction, and high-level action planning. Models such as GPT-4 and Gemini, having been trained on internet-scale multimodal data, exhibit a form of emergent commonsense knowledge that can potentially be leveraged for robotic reasoning and decision-making in open-world environments [20]. However, this commonsense representation remains fundamentally different from human-like understanding and continues to raise questions about reliability and interpretability. Nevertheless, the semantic priors embedded within LVLMs, particularly regarding everyday objects, actions, and interactions, offer a promising foundation for advancing robotic perception and interaction in complex, dynamic settings.
Nonetheless, significant challenges remain in addressing the complexities associated with operating in dynamic, unstructured environments. How robots can physically interact with their environment will depend on their bodies, and a next step is highlighted in the `SayCan' project [21], in which the PaLM model is grounded in the affordances of real-world mobile robots by combining two primary components (a minimal ranking sketch follows the list):
a) LLM: it uses language models such as GPT-4 that understand and generate natural language. This component is good at understanding contextual nuances, inferring implicit intents, and generating actionable plans based on natural language inputs (i.e., prompts).
b) Action model: it performs semantic grounding by translating natural language commands into executable low-level robotic actions. It evaluates the operational feasibility of candidate actions, ranks them according to task-specific and environmental context, and manages their sequential execution within the robot's control architecture.
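A minimal sketch of this SayCan-style decomposition is given below: a language-model score for how useful each candidate skill is toward the instruction, multiplied by an affordance score for whether that skill is currently feasible, determines the next action. Both scoring functions here are placeholders; a real system would query an LLM and a learned value function.

```python
# Minimal SayCan-style skill ranking: usefulness (from a language model)
# times feasibility (from an affordance model).
def llm_usefulness(instruction, skill):
    # Assumption: placeholder for an LLM's likelihood of "skill" as the next step.
    toy_scores = {"pick up sponge": 0.6, "go to sink": 0.3, "pick up knife": 0.1}
    return toy_scores.get(skill, 0.0)

def affordance(skill, robot_state):
    # Assumption: placeholder for a learned feasibility/value estimate.
    return 0.9 if skill in robot_state["reachable_skills"] else 0.05

def select_skill(instruction, skills, robot_state):
    scored = {s: llm_usefulness(instruction, s) * affordance(s, robot_state)
              for s in skills}
    return max(scored, key=scored.get)

state = {"reachable_skills": {"pick up sponge", "go to sink"}}
print(select_skill("clean the spill on the table",
                   ["pick up sponge", "go to sink", "pick up knife"], state))
```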
A related research direction is to develop LVLMs with an advanced, physical commonsense understanding of the world. An essential ingredient is curated data collection of examples from videos for a better understanding of physical properties of objects and physical effects in manipulating them [22]. Designing robotic systems that can safely and reliably work in the real world remains a challenging issue, but GenAI is injecting the field with fresh ideas.
Other efforts, such as Open X-Embodiment [23], aim at collaboratively developing generalist AI models for robots (the RT-X models) that can learn and adapt to various robots, tasks, and environments. This involves creating a large, open-source dataset of real robot trajectories, and providing standardized data formats and model checkpoints for research. The goal is to move beyond training separate models for each robot and task and to enable robots to leverage experience from diverse sources. The initiative has been able to partner with 34 research labs and about 150 researchers to collect data from 22 different robots. The resulting dataset consists of robots demonstrating 527 skills, such as picking, pushing, and moving. The initiative sought to establish a robot internet by aggregating robotic data from laboratories worldwide, thereby enabling access to larger, more scalable, and diverse datasets for the research community. This effort parallels the deep learning breakthrough catalyzed by the introduction of ImageNet, a large-scale online image dataset that significantly advanced computer vision and laid the foundation for modern generative AI. In this context, researchers developed two implementations of a robotic model named RT-X: one designed for local deployment on individual laboratory infrastructure, and another accessible remotely via web-based interfaces, facilitating distributed experimentation and collaboration.
The larger, web-accessible model was pretrained with internet data to develop a ‘visual commonsense’, or a baseline understanding of the world, from LLMs and image models. When the RT-X model was run on many different robotics platforms, robots were observed to learn skills 50% more successfully than with the systems each individual lab was developing. Such large robotic datasets and GenAI models, which are able to analyze image and language data, might offer robots important hints as to how the surrounding world works. These models provide high-level semantic representations of the world, which can support robotic systems in tasks involving reasoning, inference, and visual understanding. In order to evaluate this capability, researchers deployed a robot pre-trained on a large multimodal model and instructed it to identify a specific person's image. Despite the absence of explicit training data containing images of the individual, the robot successfully localized the target image, leveraging its web-scale, multimodal knowledge to infer its identity through contextual and semantic associations.
Novel LVLMs have been introduced for robots following this approach, such as RT-2. This model gets its general understanding of the world from the online text and images it has been trained on, as well as from its own interactions in the real world, and translates that data into robotic actions. Each robot has a slightly different way of translating English into action.
While robotic systems are advancing rapidly, significant challenges remain before they can be viably deployed in real-world, consumer-facing environments. Current platforms exhibit limited dexterity and reliability, making it difficult to justify their high cost for everyday users. Moreover, these systems generally lack robust commonsense reasoning capabilities, which constrains their ability to perform multitask operations or adapt to unstructured scenarios. Progress is still needed to transition from basic manipulation tasks, such as object grasping and placement, to more complex, goal-directed activities involving sequential and context-aware actions. For instance, tasks like reassembling a board game, packaging its components, and returning it to a designated storage location exemplify the level of functional autonomy yet to be achieved. Accordingly, several applications could be useful in the near future, including:
A. Motion and trajectory generation: Generative models like Variational Autoencoders (VAEs), GANs, and diffusion models are increasingly used to generate plausible movement trajectories for complex robotic systems (see the sketch after this list).
B. Grasp and manipulation planning: Generative models can create synthetic grasp configurations or infer manipulation strategies in high-dimensional spaces, often outperforming traditional planning methods in unstructured environments.
C. Scene understanding and simulation: GenAI can produce synthetic environments and simulate sensor data, which is useful for training robots in virtual worlds before deployment.
D. Language-to-action translation: LLMs combined with generative policies allow robots to interpret and act on natural language commands, enabling more intuitive human-robot interaction.
E. Design and prototyping: Generative design tools assist in the physical design of robotic components by creating novel, optimized shapes or mechanical architectures.
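As an illustration of item (A), the sketch below samples candidate joint-space trajectories from the decoder of a hypothetically pretrained VAE; the dimensions are arbitrary and the decoder weights here are untrained, so the output is only structurally representative.

```python
# Minimal generative trajectory sampling: draw latents from the prior and
# decode them into candidate joint-space trajectories.
import torch
import torch.nn as nn

LATENT_DIM, HORIZON, JOINTS = 8, 50, 6  # assumed sizes

decoder = nn.Sequential(
    nn.Linear(LATENT_DIM, 256), nn.ReLU(),
    nn.Linear(256, HORIZON * JOINTS),
)

def sample_trajectories(n):
    z = torch.randn(n, LATENT_DIM)                  # sample the latent prior
    with torch.no_grad():
        return decoder(z).view(n, HORIZON, JOINTS)  # candidate trajectories

candidates = sample_trajectories(16)
print(candidates.shape)  # (16, 50, 6): rank or filter these downstream
```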
As a consequence, generative models face several critical challenges that limit their deployment in real-world robotics. Safety and reliability remain major concerns, as these models are inherently stochastic and can produce unpredictable or unsafe outputs, which is particularly problematic in high-stakes domains like healthcare or manufacturing. Additionally, data efficiency is a barrier, as training such models typically requires large-scale datasets that are costly and impractical to obtain in physical environments; research is ongoing in self-supervised and few-shot learning to address this. Another demanding issue is interpretability: it is often unclear why a generative model made a particular decision, complicating debugging and eroding user trust, especially in settings that demand human-robot collaboration. Finally, the real-time performance of generative models poses a challenge due to their high computational demands, motivating efforts to optimize them for efficient inference on edge devices.
Based on these recent advances, future directions in robotics will be increasingly shaped by the integration of GenAI, paving the way for more adaptive, creative, and intelligent machines. One key trend is open-ended skill acquisition, where robots continually learn new tasks through interaction, web-based information, and human demonstrations, moving beyond pre-programmed behavior. This adaptability also supports the emergence of creative robotics, allowing machines to contribute to fields like art, architecture, and music. Additionally, generative AI enables personalized robotics, where systems tailor their actions to individual user preferences, which is especially impactful in domestic and healthcare settings.
GenAI is poised to revolutionize robotics by enabling systems that are not only reactive but also imaginative, adaptive, and creative. While significant challenges remain in safety, interpretability, and efficiency, the convergence of generative modeling and robotics opens the door to more intelligent, versatile, and collaborative machines. A major breakthrough is in open-ended learning, where robots leverage generative models to acquire new skills from human demonstrations, natural language, and large-scale internet data, moving away from rigid, pre-programmed instructions. This allows robots to generalize across tasks, adapt in real time, and handle more complex, unstructured scenarios. Generative AI also enhances human-robot interaction by allowing robots to interpret intent, generate natural language responses, and refine their behavior through continuous feedback. Furthermore, GenAI is pushing robotics into creative and personalized domains. Robots can now participate in artistic, architectural, and musical endeavors, suggesting a future where machines become collaborators in creative industries. In personal settings, generative models enable robots to tailor their behavior to individual users, which is especially valuable in assistive healthcare and home automation. Overall, GenAI will transform robotics from task-specific tools into adaptive, intelligent partners capable of evolving with human needs.
© 2025 John Atkinson. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and building upon your work non-commercially.