ChatGPT and the rise of large language models: the new AI-driven infodemic threat in public health

What are Large Language Models (LLMs)?

Large Language Models (LLMs) have recently gained widespread attention with the release of ChatGPT, a user-centered chatbot released by OpenAI. In this perspective article, we retrace the evolution of LLMs to understand the revolution that ChatGPT has brought to the artificial intelligence (AI) field.

Introduction to Large Language Models

The opportunities offered by LLMs in supporting scientific research are numerous, and various models have already been tested on Natural Language Processing (NLP) tasks in this domain.


The impact of ChatGPT has been huge for both the general public and the research community: many authors have used the chatbot to write parts of their articles, and some papers even list ChatGPT as an author. The use of LLMs raises alarming ethical and practical challenges, particularly in the medical field, because of the potential impact on public health. Infodemic is a trending topic in public health, and the ability of LLMs to rapidly produce vast amounts of text could amplify the spread of misinformation at an unprecedented scale, creating an “AI-driven infodemic,” a novel public health threat. Policies to counter this phenomenon need to be developed rapidly, and the inability to accurately detect artificial-intelligence-produced text remains an unresolved issue.

1. Introduction

“ChatGPT” is a large language model (LLM) trained by OpenAI, an artificial intelligence (AI) research and deployment company. It was released as a free research preview on November 30th, 2022, to gather users’ feedback and learn about its strengths and weaknesses (1). Previously developed LLMs were able to execute different natural language processing (NLP) tasks, but ChatGPT differs from its predecessors: it is an AI chatbot optimized for dialog, particularly good at interacting in a human-like conversation.

Given the incredibly fast spread of ChatGPT, which reached over one million users within five days of its release (2), many have tried out this tool to answer complex questions or to generate short texts. It is a small leap to infer that ChatGPT could be a valuable tool for composing scientific articles and research projects. But can these generated texts be considered plagiarism? (3, 4)

It took a while for editorial workflows to adopt systems that recognize potential plagiarism in scientific articles, but intercepting text generated by ChatGPT would be much more complicated.

In addition, the impact that this tool may have on research is set against a background profoundly affected by the COVID-19 pandemic (5). In particular, health research has been strongly influenced by the mechanisms through which information about SARS-CoV-2 was disseminated via preprint servers, which often allowed for rapid media coverage and a consequent impact on individual health choices (6, 7).

Even more than scientific literature, social media have been the ground of health information dissemination during the COVID-19 pandemic, with the rise of a phenomenon known as infodemic (8).

Starting from a background on the evolution of LLMs and the existing evidence on their use to support medical research, we focus on ChatGPT and speculate about its future impact on research and public health. The objective of this paper is to promote a debate on ChatGPT’s space in medical research and the possible consequences in corroborating public health threats, introducing the novel concept of “AI-driven infodemic.”

Principles of Multi-Modal Learning


2. The evolution of pre-trained large language models

The evolution of LLMs over the last five years has been exponential, and their performance across a plethora of different tasks has become impressive.

Before 2017, most NLP models were trained using supervised learning for particular tasks and could be used only for the task they were trained on (9).

To overcome those issues, the self-attention network architecture, also known as the Transformer (10), was introduced in 2017 and was used to develop two game-changing models in 2018: Bidirectional Encoder Representations from Transformers (BERT) and Generative Pretrained Transformer (GPT) (11, 12).

Both models achieved superior generalization capabilities, thanks to their semi-supervised approach. Using a combination of unsupervised pre-training and supervised fine-tuning, these models can apply pre-trained language representations to downstream tasks.
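To make that semi-supervised recipe concrete, here is a minimal sketch of the pattern using the Hugging Face transformers library: an encoder pre-trained with self-supervision is loaded and then fine-tuned on a small labelled downstream task. The model name, toy data, and labels are illustrative assumptions, not taken from the BERT or GPT papers.

# Minimal sketch: reuse a pre-trained Transformer encoder for a downstream task.
# Assumes the `transformers` and `torch` packages are installed; the model name
# and the tiny toy dataset are illustrative only.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "bert-base-uncased"  # pre-trained with a self-supervised objective
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Toy labelled data for supervised fine-tuning on a downstream task.
texts = ["The trial reported a significant reduction in mortality.",
         "The weather was nice during the conference."]
labels = torch.tensor([1, 0])  # 1 = biomedical claim, 0 = other (illustrative)

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for _ in range(3):  # a few gradient steps stand in for a full fine-tuning run
    outputs = model(**batch, labels=labels)
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()

The same pre-trained weights can be reused, with a different classification head, for many downstream tasks; that reuse is what gives these models their generalization advantage over task-specific training.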

GPT models rapidly evolved in different versions, being trained on a larger corpus of textual data and with a growing number of parameters.

The third version of GPT (GPT-3), with 175 billion parameters, is roughly 100 times bigger than GPT-2 and has approximately twice as many parameters as there are neurons in the human brain (13).

GPT-3 can generate text that is appropriate for a wide range of contexts, but unfortunately, it often expresses unintended behaviors such as making up facts, generating biased text, or simply not following user instructions (14).

This can be explained by the objective of many LLMs, including GPT-3: to predict the next element in a text, based on a large corpus of text data from the internet (15). As a consequence, LLMs learn to replicate the biases and stereotypes present in that data (16).
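As a concrete illustration of that objective, the short sketch below computes the next-token cross-entropy of a causal language model on a single sentence; GPT-2 and the example text are stand-ins chosen purely for illustration.

# Minimal sketch of the next-token-prediction objective GPT-style models are
# trained on. Assumes `transformers` and `torch`; GPT-2 stands in for any
# causal language model.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

text = "Large language models learn statistical patterns from web text."
inputs = tokenizer(text, return_tensors="pt")

# Passing the input ids as labels makes the model compute the average
# cross-entropy of predicting each token from the tokens that precede it.
outputs = model(**inputs, labels=inputs["input_ids"])
print(f"next-token cross-entropy: {outputs.loss.item():.2f}")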

This leads to the major problem of alignment: the difficulty of ensuring that an LLM behaves in a way that is aligned with human values and ethical principles.

Addressing the alignment problem for LLMs is an ongoing area of research and OpenAI developed a moderation system, trained to detect a broad set of categories of undesired content, including sexual and hateful content, violence, and other controversial topics (17).

ChatGPT incorporates a moderation system, but the true innovation lies in its user-centered approach, which was used to fine-tune the model from GPT-3 to follow the user instructions “helpfully and safely” (14).

This process started from InstructGPT, an LLM with “only” 1.3 billion parameters trained using reinforcement learning from human feedback (RLHF), which combines supervised learning, used to collect human feedback, with reinforcement learning that uses human preferences as the reward signal.

RLHF is used to adapt the pre-trained GPT-3 model to the specific task of following users’ instructions. ChatGPT was born from the optimization of InstructGPT for dialog.
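The reward-modelling step at the heart of RLHF can be sketched with a toy example: a small network is trained so that responses preferred by human labellers score higher than rejected ones, and that score later serves as the reward signal for the reinforcement-learning step. Everything below (features, data, network size) is an illustrative assumption, not OpenAI's implementation.

# Minimal, self-contained sketch of the reward-model step in RLHF: the model is
# trained so that responses humans preferred score higher than the ones they
# rejected (pairwise ranking loss). Toy features and data are illustrative.
import torch
import torch.nn as nn

reward_model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

# Each row is a feature vector standing in for an (instruction, response) pair;
# `chosen` was preferred by the human labeller over `rejected`.
chosen = torch.randn(32, 8)
rejected = torch.randn(32, 8)

for _ in range(100):
    r_chosen = reward_model(chosen)
    r_rejected = reward_model(rejected)
    # Pairwise loss: push the preferred response's reward above the other's.
    loss = -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# The learned reward then drives a reinforcement-learning step (e.g., PPO) that
# fine-tunes the language model to follow instructions helpfully and safely.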


Despite these advancements, ChatGPT still sometimes writes plausible-sounding but incorrect or nonsensical answers, owing to its inability to fact-check and to a knowledge cutoff in 2021 (1).

Lecture 9.1: Reinforcement Learning (Multimodal Machine Learning, Carnegie Mellon University)

Lecture 9.2: Multimodal RL (Multimodal Machine Learning, Carnegie Mellon University)

3. Large language models to support academic research

One potential application of LLMs is supporting academic research. With around 2.5 million papers published every year (20), the scientific literature has already grown beyond human handling capabilities.

AI could be a solution to tame the scientific literature and support researchers in collecting the available evidence, (21) by generating summaries or recommendations of papers, which could make it easier for researchers to quickly get the key points of a scientific result. Overall, AI tools have the potential to make the discovery, consumption, and sharing of scientific results more convenient and personalized for scientists. The increasing demand for accurate biomedical text mining tools for extracting information from the literature led to the development of BioBERT, a domain-specific language representation model pre-trained on large-scale biomedical corpora (22).

BioBERT outperforms previous models on biomedical text-mining tasks, including named entity recognition, relation extraction, and question answering.
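A hypothetical sketch of how such a domain-specific model could be applied to biomedical named entity recognition is shown below; the checkpoint path is a placeholder to be replaced with a BioBERT model fine-tuned for NER, and the example sentence is invented.

# Illustrative sketch of biomedical named-entity recognition with a
# domain-specific encoder. The checkpoint name below is a placeholder:
# substitute a BioBERT model that has been fine-tuned for NER.
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="path/to/biobert-ner-checkpoint",  # placeholder, not a real model id
    aggregation_strategy="simple",
)

text = "Treatment with metformin reduced HbA1c levels in patients with type 2 diabetes."
for entity in ner(text):
    print(entity["entity_group"], entity["word"], round(entity["score"], 2))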

Another possible approach is that of domain-specific foundation models, such as BioGPT and PubMedGPT 2.7B (23, 24), which were trained exclusively on biomedical abstracts and papers and are used for medical question answering and text generation.

Med-PaLM, an LLM trained using few-shot prompting, exceeds previous state-of-the-art models on MedQA, a medical question answering dataset consisting of United States Medical Licensing Exam (USMLE) style questions (25). The performance of ChatGPT on the USMLE was recently evaluated: it achieved around 50–60% accuracy across all examinations, near the passing threshold but still inferior to Med-PaLM (26).

GPT-4 exceeds the passing score on the USMLE by over 20 points and outperforms earlier LLMs. Nevertheless, there is a large gap between performing well on competency and proficiency examinations and successfully using LLMs in clinical applications (27).

In the NLP task of text summarization, GPT-2 was one of the best-performing models for summarizing COVID-19 research topics, using CORD-19, a database of over 500,000 research publications on COVID-19 (28, 29).

CORD-19 was also used for the training of CoQUAD, a question-answering system, designed to find the most recent evidence and answer any related questions (30).

A web-based chatbot that produces high-quality responses to COVID-19-related questions was also developed; this user-friendly approach was chosen to make the LLM more accessible to a general audience (31).

LLMs have also been used for abstract screening in systematic reviews; this allows the use of unlabelled data in the initial step of scanning abstracts, saving researchers' time and effort (32).
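As a hedged illustration of screening abstracts without labelled training data, the sketch below uses a zero-shot classification pipeline to triage toy abstracts against inclusion criteria. The NLI model shown is a commonly used public checkpoint; the abstracts and criteria are invented.

# Hypothetical sketch of using an LLM-style classifier to triage abstracts for
# a systematic review without labelled training data.
from transformers import pipeline

screener = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

abstracts = [
    "We conducted a randomized controlled trial of remdesivir in hospitalized adults...",
    "This essay reviews the history of hospital architecture in the 19th century...",
]
criteria = ["randomized controlled trial about COVID-19 treatment", "not relevant"]

for abstract in abstracts:
    result = screener(abstract, candidate_labels=criteria)
    print(result["labels"][0], "->", abstract[:60])

In practice, the top-ranked label would only be used to prioritize abstracts for human review, not to replace it.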

LLMs also enable advanced code generation for statistics and data analytics. Two large-scale AI-powered code generation tools have recently come into the spotlight: OpenAI Codex, a GPT language model fine-tuned on publicly available code from GitHub (33), and DeepMind AlphaCode, designed to address the main challenges of competitive programming (34).
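A hedged sketch of the general pattern, prompting a hosted model to generate analysis code, is shown below; the model name and exact client interface are assumptions that vary by provider and SDK version, and an API key is assumed to be configured.

# Hedged sketch of prompting a hosted LLM to generate analysis code.
from openai import OpenAI  # assumes the `openai` Python package and an API key

client = OpenAI()
prompt = (
    "Write an R script that fits a logistic regression of 30-day mortality on "
    "age, sex and treatment arm, and reports odds ratios with 95% CIs."
)
response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)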

On one hand, AI tools can make programmers’ jobs easier, aid in education and make programming more accessible (35).

On the other hand, the availability of AI-based code generation raises concerns: the main risk is users' over-reliance on the generated outputs, and non-programmers in particular may quickly become accustomed to auto-suggested solutions (36).

The above-described deskilling issue is not limited to coding. If we conceive a scenario in which AI is extensively used for scientific production, we must consider the risk of deskilling in researchers' writing abilities. Some have already raised concerns about the peril of seeing the conduct of research significantly shaped by AI, leading to a decline in authors' ability to meaningfully and substantively craft their objects of study (37).

Our reflections highlight a growing interest in the use of LLMs in academic research, and with the release of ChatGPT this interest has only increased (38).

Foundation models and the next era of AI

4. The revolution of ChatGPT and the potential impact on scientific literature production

The user-centered approach of ChatGPT is the paradigm shift that makes it different from previous LLMs. The revolutionary impact of ChatGPT does not lie in its technical content, which appears to be merely a different methodology for training, but in the different perspective that it is bringing. ChatGPT will probably be overtaken soon, but the idea of making AI accessible to the broader community and putting the user at the center will stand.

Google AI chief: The promise of multi-modal learning

The accessibility and user-friendly interface of ChatGPT could lead researchers to use it more extensively than previous LLMs. ChatGPT offers the opportunity to streamline the work of researchers, providing valuable support throughout the scientific process, from suggesting research questions to generating hypotheses. Its ability to write scripts in multiple programming languages, together with clear explanations of how the code works, makes it a useful asset for improving understanding and efficiency. (The original article includes a figure showing ChatGPT's output to a prompt asking it to demonstrate these abilities.)


ChatGPT can also be used to suggest titles, write drafts, and help to express complex concepts in fluent and grammatically correct scientific English. This can be particularly useful for researchers who may not have a strong background in writing or who are not native English speakers. By supplementing the work of researchers, rather than replacing it, automating many of the repetitive tasks, ChatGPT may help researchers focus their efforts on the most impactful aspects of their work.


The high interest of the scientific community in this tool is demonstrated by the rapid increase in the number of papers published on the topic shortly after its release. The use of ChatGPT in scientific literature production had already become a reality during the writing of this draft: many authors stated that they used ChatGPT to write at least part of their papers (39). This underlines how ChatGPT has already been integrated into the research process, even before ethical concerns have been addressed and common rules discussed. For example, ChatGPT has been listed as the first author of four papers (26, 40, 41, 42), without considering the possibility of “involuntary plagiarism” or the intellectual property issues surrounding the output of the model.

The number of pre-prints produced using ChatGPT indicates that the use of this technology is inevitable, and a debate within the research community is a priority (43).


5. Navigating the threats of ChatGPT in public health: AI-driven infodemic and research integrity

A potential concern related to the emergence of LLMs is the submissiveness in following users’ instructions. Despite the limitations imposed by programmers, LLMs can be easily tricked into producing text on controversial topics, including misinforming content (44).

The ability of LLMs to generate texts similar to those composed by humans could be used to create fake news articles or other seemingly legitimate but actually fabricated or misleading content, (45, 46) without the reader realizing that the text is produced by AI (47).

In response to this damaging trend, a counter-offensive is emerging: some authors highlight the importance of creating LLM detectors able to identify fake news (48), while others propose using LLMs themselves to enhance detector performance (49). Commonly used GPT-2 detectors proved flawed at recognizing AI-written text when it was generated by ChatGPT (50); new detectors were rapidly developed and released to address this gap, but these tools do not perform well at identifying GPT-4-generated text.
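For illustration, the detector approach amounts to running a classifier fine-tuned to flag machine-generated text, as in the hedged sketch below; the checkpoint shown is the publicly released GPT-2 output detector, whose weakness on newer models is exactly the limitation described above, and the output labels vary by checkpoint.

# Illustrative sketch of the detector approach: a classifier fine-tuned to
# flag machine-generated text. The checkpoint is OpenAI's older GPT-2 output
# detector, which performs poorly on text from newer models such as ChatGPT.
from transformers import pipeline

detector = pipeline("text-classification", model="roberta-base-openai-detector")

candidate = "Our study demonstrates a 47% reduction in all-cause mortality..."
print(detector(candidate))  # e.g., [{'label': 'Fake', 'score': ...}], labels vary by checkpoint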

As a result, there is a continuous and uneven race to improve detectors, which must keep pace with LLMs’ rapid advancement, leaving a window for malicious use.

The absence of accurate detectors calls for precautionary measures: for example, the International Conference on Machine Learning (ICML), in its 2023 call for papers, prohibited the use of LLMs such as ChatGPT in submitted drafts. However, ICML acknowledges that there is currently no tool to verify compliance with this rule; it is therefore relying on the discretion of participants while awaiting the development of shared policies within the scientific community.

Many scientific journals are grappling with this policy question, publishing editorials on the topic and updating their author guidelines (51).

For example, Springer Nature journals were the first to add rules to their guides for authors: to avoid accountability issues, LLMs cannot be listed as authors, and their use should be documented in the methods or acknowledgments sections (3). Elsevier has also created guidelines on the use of AI-assisted writing in scientific production, confirming the rules set by Springer and requiring authors to specify the AI tools employed and to give details on their use. Elsevier has declared its commitment to monitoring developments around generative AI and to refining the policy if necessary (52).

The misuse of ChatGPT in scientific research could lead to the production of fake scientific abstracts, papers, and bibliographies. In earlier versions of ChatGPT (up to the December 15th version), when asked to cite references in support of its statements, the output was a list of fake bibliographic references (e.g., the fabricated reference: Li, X., & Kim, Y. (2020). Fake academic papers generated by AI: A threat to the integrity of research. PLOS ONE, 15(3), e0231891).

Talk: What Makes Multi-modal Learning Better than Single (Provably)

The use of real authors' names, real journals, and plausible titles makes such fake references difficult to spot at a glance. This calls for preventive actions, such as the mandatory use of the digital object identifier (DOI) system, which could be used to rapidly and accurately identify fake references.
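A minimal sketch of such a check, using the public Crossref REST API to test whether a cited DOI resolves to a registered record, is shown below; the example DOI is a deliberate placeholder, and network access plus the `requests` package are assumed.

# Minimal sketch of the preventive check suggested above: verify that a cited
# DOI actually resolves to a record in the Crossref registry.
import requests

def doi_exists(doi: str) -> bool:
    """Return True if the DOI resolves to a record in Crossref."""
    response = requests.get(f"https://api.crossref.org/works/{doi}", timeout=10)
    return response.status_code == 200

print(doi_exists("10.1234/placeholder-doi"))  # a fabricated DOI returns False

A reference-list screen would simply loop over the DOIs extracted from a submitted manuscript and flag those that do not resolve for manual checking.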

In fields where fake information can endanger people's safety, such as medicine, journals may have to take a more rigorous approach to verifying that information is accurate (53). A combined evaluation by up-to-date AI-output detectors and human reviewers is necessary to identify AI-generated scientific abstracts and papers, though this process may be time-consuming and imperfect. We therefore suggest adopting a “detectable-by-design” policy: the release of new generative AI models to the public should be permitted only if the output generated by the AI is detectable and can thus be unequivocally identified as AI-produced.

The impact that false and potentially mystifying texts can have on people's health is huge. The issue of the dissemination of untruthful information has long been known: from the unforgettable Wakefield case and the disbelief it generated around vaccines supposedly causing autism (54), to the non-conservative behaviors observed in the various phases of the COVID-19 pandemic (55). In this context, it has become more evident than ever that junk and manipulative research, conducted through underperforming studies or with study designs unfit for the intended research objective, has had an impact on the behavior of the general population and, more worryingly, of health professionals (56).

The diffusion of misinformation through rapidly disseminating channels such as mass media and social networks can generate the phenomenon known as infodemic (57). The consequences for the scientific framework are considerable, with implications for healthcare choices that were already a determining factor in the recent pandemic (58). An infodemic can influence medical decision-making on treatments or preventive measures (59, 60); for example, some people used hydroxychloroquine as a treatment for COVID-19 based on false or unproven information endorsed by popular and influential people (61). The risk is that we may face a new health emergency in which information can be produced rapidly using LLMs to generate human-like texts ready to spread even if incorrect or manipulated.

The concept of infodemic was introduced in 2003 by Rothkopf as an “epidemic of information” (62) and evolved in 2020, after the COVID-19 pandemic, to integrate the element of rapid misinformation spreading (63). With the global diffusion of LLMs, the infodemic concept must evolve again into that of an “AI-driven infodemic.” Not only is it possible to rapidly disseminate misinformation via social media platforms and other outlets, but also to produce exponentially growing amounts of health-related information, regardless of one's knowledge, skills, and intentions. Given the nature of social media content diffusion, LLMs could be used to create content specifically designed for target population groups and crafted to go viral, fostering the spread of misinformation. We foresee a scenario in which human-like, AI-produced content will dramatically exacerbate every future health threat that can generate infodemics, which from now on will be AI-driven.

Social media and gray literature have already been the ground for infodemics (63), but the scientific literature could become a new and powerful vehicle for disinformation campaigns. The potential of LLMs, and in particular ChatGPT, to easily generate human-like texts could lead, without proper control, to excessive and low-quality scientific literature production in the health field. The abundance of predatory journals, which accept articles for publication without performing quality checks for issues such as plagiarism or ethical approval (64), could allow the flooding of the scientific literature with AI-generated articles on an unprecedented scale. The consequences for the integrity of the scientific process and the credibility of the literature would be dreadful (65).


6. Discussion

Large language models have already shown hints of their potential in supporting scientific research, and in the coming months we expect a growing number of papers discussing the use of ChatGPT in this field.

The accessibility and astonishing abilities of ChatGPT have made it popular across the world and mark a milestone, taking AI conversational tools to the next level.

But soon after its release, possible threats emerged. ChatGPT's ability to follow users' instructions is a double-edged sword: on one hand, this approach makes it great at interacting with humans; on the other hand, being submissive by design exposes it to misuse, for example to generate convincing human-like misinformation.

The field of medical research may be a great source for both opportunities and threats coming from this novel approach.

Given that the scientific community has not yet determined the principles to follow for a helpful and safe use of this disruptive technology, the risks coming from the fraudulent and unethical use of LLMs in the health context cannot be ignored and should be assessed with a proactive approach.

We define the novel concept of “AI-driven infodemic”: a public health threat arising from the use of LLMs to produce vast amounts of scientific articles, fake news, and misinformative content. The AI-driven infodemic is a consequence of LLMs' ability to write large amounts of human-like text in a short period of time, not only with malicious intent but, more generally, without any scientific grounding. Beyond text-based content, other AI tools, such as generative adversarial networks, can produce audio and video deepfakes that could be used to disseminate misinformation, especially on social media (66). Political deepfakes have already contributed to generalized indeterminacy and disinformation (67).

Reinforcement Learning 5: Function Approximation and Deep Reinforcement Learning

To address this public health threat, it is important to raise awareness and rapidly develop policies through a multidisciplinary effort, updating the current WHO public health research agenda for managing infodemics (68). Policy action is needed to ensure that the benefits of LLMs are not outweighed by the risks they pose. In this context, we propose the detectable-by-design approach, which involves building LLMs with features that make it easier to detect when they are being used to produce fake news or scientific articles. However, implementing this approach could slow down the development of LLMs and might therefore not be readily accepted by AI companies. The constitution of groups of experts inside international health agencies (e.g., WHO, ECDC) dedicated to monitoring the use of LLMs for producing fake news and scientific articles is needed, as the scenario is rapidly evolving and the AI-driven infodemic threat is forthcoming. Such groups could work closely with AI companies to develop effective strategies for detecting and preventing the use of LLMs for nefarious purposes. Additionally, greater regulation and oversight of the AI industry may be needed to ensure that LLMs are developed and used responsibly.

Recently, the President of the Italian Data Protection Authority (DPA) took action against OpenAI for serious breaches of European legislation on personal data processing and protection (69). The DPA imposed a temporary ban on ChatGPT in Italy due to the company's failure to provide adequate privacy information to its users and the lack of a suitable legal basis for data collection. The absence of a suitable legal basis for data collection raises serious concerns about the ethical implications of using personal data without consent or an adequate legal framework.

In the WHO agenda, AI is considered a possible ally in fighting infodemics, allowing automatic monitoring for misinformation detection; but the rise of LLMs, and in particular ChatGPT, raises concerns that AI could instead play the opposite role in this phenomenon.

LLMs will continue to improve and will rapidly become precious allies for researchers, but the scientific community needs to ensure that the advances made possible by ChatGPT and other AI technologies are not overshadowed by the risks they pose. All stakeholders should foster the development and deployment of these technologies in alignment with the values and interests of society. It is crucial to increase understanding of the AI challenges of transparency, accountability, and fairness in order to develop effective policies. A science-driven debate to develop shared principles and legislation is necessary to shape a future in which AI has a positive impact on public health; failing to have that conversation could result in a dangerous AI-fueled future (70).

Lecture 10.1: Fusion, co-learning, and new trend (Multimodal Machine Learning, CMU)

Introducing PaLM 2

When you look back at the biggest breakthroughs in AI over the last decade, Google has been at the forefront of so many of them. Our groundbreaking work in foundation models has become the bedrock for the industry and the AI-powered products that billions of people use daily. As we continue to responsibly advance these technologies, there’s great potential for transformational uses in areas as far-reaching as healthcare and human creativity.

Over the past decade of developing AI, we’ve learned that so much is possible as you scale up neural networks — in fact, we’ve already seen surprising and delightful capabilities emerge from larger sized models. But we’ve learned through our research that it’s not as simple as “bigger is better,” and that research creativity is key to building great models. More recent advances in how we architect and train models have taught us how to unlock multimodality, the importance of having human feedback in the loop, and how to build models more efficiently than ever. These are powerful building blocks as we continue to advance the state of the art in AI while building models that can bring real benefit to people in their daily lives.

Introducing PaLM 2

Building on this work, today we’re introducing PaLM 2, our next generation language model. PaLM 2 is a state-of-the-art language model with improved multilingual, reasoning and coding capabilities.

Multilinguality: PaLM 2 is more heavily trained on multilingual text, spanning more than 100 languages. This has significantly improved its ability to understand, generate and translate nuanced text — including idioms, poems and riddles — across a wide variety of languages, a hard problem to solve. PaLM 2 also passes advanced language proficiency exams at the “mastery” level.

Reasoning: PaLM 2’s wide-ranging dataset includes scientific papers and web pages that contain mathematical expressions. As a result, it demonstrates improved capabilities in logic, common sense reasoning, and mathematics.

Coding: PaLM 2 was pre-trained on a large quantity of publicly available source code datasets. This means that it excels at popular programming languages like Python and JavaScript, but can also generate specialized code in languages like Prolog, Fortran and Verilog.

A versatile family of models

Even as PaLM 2 is more capable, it’s also faster and more efficient than previous models — and it comes in a variety of sizes, which makes it easy to deploy for a wide range of use cases. We’ll be making PaLM 2 available in four sizes from smallest to largest: Gecko, Otter, Bison and Unicorn. Gecko is so lightweight that it can work on mobile devices and is fast enough for great interactive applications on-device, even when offline. This versatility means PaLM 2 can be fine-tuned to support entire classes of products in more ways, to help more people.

Powering over 25 Google products and features

At I/O today, we announced over 25 new products and features powered by PaLM 2. That means that PaLM 2 is bringing the latest in advanced AI capabilities directly into our products and to people — including consumers, developers, and enterprises of all sizes around the world. Here are some examples:

PaLM 2’s improved multilingual capabilities are allowing us to expand Bard to new languages, starting today. Plus, it’s powering our recently announced coding update.

Workspace features to help you write in Gmail and Google Docs, and help you organize in Google Sheets are all tapping into the capabilities of PaLM 2 at a speed that helps people get work done better, and faster.

Med-PaLM 2, trained by our health research teams with medical knowledge, can answer questions and summarize insights from a variety of dense medical texts. It achieves state-of-the-art results in medical competency, and was the first large language model to perform at “expert” level on U.S. Medical Licensing Exam-style questions. We're now adding multimodal capabilities to synthesize information like x-rays and mammograms to one day improve patient outcomes. Med-PaLM 2 will open up to a small group of Cloud customers for feedback later this summer to identify safe, helpful use cases.

Sec-PaLM is a specialized version of PaLM 2 trained on security use cases, and a potential leap for cybersecurity analysis. Available through Google Cloud, it uses AI to help analyze and explain the behavior of potentially malicious scripts, and better detect which scripts are actually threats to people and organizations in unprecedented time.

Since March, we've been previewing the PaLM API with a small group of developers. Starting today, developers can sign up to use the PaLM 2 model, or customers can use the model in Vertex AI with enterprise-grade privacy, security and governance. PaLM 2 is also powering Duet AI for Google Cloud, a generative AI collaborator designed to help users learn, build and operate faster than ever before.
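As a hedged sketch of the developer-facing side, the snippet below follows the documented pattern for calling a PaLM 2 text model through the Vertex AI Python SDK; the project ID, region, and model name are placeholders, and the SDK surface may have changed since this was written.

# Hedged sketch of calling a PaLM 2 text model through Vertex AI.
# Assumes the `google-cloud-aiplatform` package and configured credentials.
import vertexai
from vertexai.language_models import TextGenerationModel

vertexai.init(project="your-gcp-project", location="us-central1")  # placeholders
model = TextGenerationModel.from_pretrained("text-bison")  # placeholder model name
response = model.predict(
    "Summarize the main idea of reinforcement learning from human feedback.",
    temperature=0.2,
    max_output_tokens=256,
)
print(response.text)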

Advancing the future of AI

PaLM 2 shows us the impact of highly capable models of various sizes and speeds — and that versatile AI models reap real benefits for everyone. Yet just as we’re committed to releasing the most helpful and responsible AI tools today, we’re also working to create the best foundation models yet for Google.

Our Brain and DeepMind research teams have achieved many defining moments in AI over the last decade, and we’re bringing together these two world-class teams into a single unit, to continue to accelerate our progress. Google DeepMind, backed by the computational resources of Google, will not only bring incredible new capabilities to the products you use every day, but responsibly pave the way for the next generation of AI models.

We’re already at work on Gemini — our next model created from the ground up to be multimodal, highly efficient at tool and API integrations, and built to enable future innovations, like memory and planning. Gemini is still in training, but it’s already exhibiting multimodal capabilities never before seen in prior models. Once fine-tuned and rigorously tested for safety, Gemini will be available at various sizes and capabilities, just like PaLM 2, to ensure it can be deployed across different products, applications, and devices for everyone’s benefit.

Massive Update of Chat GPT! - Artificial Intelligence

PaLM-E: An embodied multimodal language model

Recent years have seen tremendous advances across machine learning domains, from models that can explain jokes or answer visual questions in a variety of languages to those that can produce images based on text descriptions. Such innovations have been possible due to the increase in availability of large scale datasets along with novel advances that enable the training of models on these data. While scaling of robotics models has seen some success, it is outpaced by other domains due to a lack of datasets available on a scale comparable to large text corpora or image datasets.

Today we introduce PaLM-E, a new generalist robotics model that overcomes these issues by transferring knowledge from varied visual and language domains to a robotics system. We began with PaLM, a powerful large language model, and “embodied” it (the “E” in PaLM-E), by complementing it with sensor data from the robotic agent. This is the key difference from prior efforts to bring large language models to robotics — rather than relying on only textual input, with PaLM-E we train the language model to directly ingest raw streams of robot sensor data. The resulting model not only enables highly effective robot learning, but is also a state-of-the-art general-purpose visual-language model, while maintaining excellent language-only task capabilities.

How ChatGPT is Trained

An embodied language model and a visual-language generalist

On the one hand, PaLM-E was primarily developed to be a model for robotics, and it solves a variety of tasks on multiple types of robots and for multiple modalities (images, robot states, and neural scene representations). At the same time, PaLM-E is a generally-capable vision-and-language model. It can perform visual tasks, such as describing images, detecting objects, or classifying scenes, and is also proficient at language tasks, like quoting poetry, solving math equations or generating code.

PaLM-E combines our most recent large language model, PaLM, together with one of our most advanced vision models, ViT-22B. The largest instantiation of this approach, built on PaLM-540B, is called PaLM-E-562B and sets a new state of the art on the visual-language OK-VQA benchmark, without task-specific fine-tuning, and while retaining essentially the same general language performance as PaLM-540B.

How does PaLM-E work?

Technically, PaLM-E works by injecting observations into a pre-trained language model. This is realized by transforming sensor data, e.g., images, into a representation through a procedure that is comparable to how words of natural language are processed by a language model.

Language models rely on a mechanism to represent text mathematically in a way that neural networks can process. This is achieved by first splitting the text into so-called tokens that encode (sub)words, each of which is associated with a high-dimensional vector of numbers, the token embedding. The language model is then able to apply mathematical operations (e.g., matrix multiplication) on the resulting sequence of vectors to predict the next, most likely word token. By feeding the newly predicted word back to the input, the language model can iteratively generate a longer and longer text.
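The sketch below makes this mechanism concrete: text is split into tokens, tokens are mapped to embedding vectors, and the model appends its most likely next token step by step (greedy decoding). GPT-2 is used purely for illustration; PaLM-E itself builds on PaLM, not GPT-2.

# Minimal sketch of the mechanism described above: tokenization, token
# embeddings, and iterative next-token prediction fed back into the input.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tokenizer("A language model predicts", return_tensors="pt")["input_ids"]
print("token embedding shape:", model.get_input_embeddings()(ids).shape)  # (1, seq_len, 768)

for _ in range(10):  # greedy decoding: append the most likely next token each step
    logits = model(ids).logits
    next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)
    ids = torch.cat([ids, next_id], dim=-1)

print(tokenizer.decode(ids[0]))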

The inputs to PaLM-E are text and other modalities — images, robot states, scene embeddings, etc. — in an arbitrary order, which we call "multimodal sentences". For example, an input might look like, "What happened between <img_1> and <img_2>?", where <img_1> and <img_2> are two images. The output is text generated auto-regressively by PaLM-E, which could be an answer to a question, or a sequence of decisions in text form.

PaLM-E model architecture, showing how PaLM-E ingests different modalities (states and/or images) and addresses tasks through multimodal language modeling.

The idea of PaLM-E is to train encoders that convert a variety of inputs into the same space as the natural word token embeddings. These continuous inputs are mapped into something that resembles "words" (although they do not necessarily form discrete sets). Since both the word and image embeddings now have the same dimensionality, they can be fed into the language model.
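A schematic sketch of this idea, not Google's implementation, is shown below: a learned linear projector maps continuous image features into the same space as word-token embeddings, so both can be concatenated into one "multimodal sentence" for the language model. All dimensions and tensors are illustrative.

# Schematic sketch of projecting continuous sensor features into the word-token
# embedding space so text and images can share one input sequence.
import torch
import torch.nn as nn

vocab_size, embed_dim = 32_000, 1_024           # illustrative sizes
word_embeddings = nn.Embedding(vocab_size, embed_dim)
image_projector = nn.Linear(2_048, embed_dim)    # maps vision features to "word-like" vectors

text_token_ids = torch.randint(0, vocab_size, (1, 6))   # e.g., "What happened between ... and ...?"
image_features = torch.randn(1, 2, 2_048)                # features for <img_1> and <img_2>

text_embeds = word_embeddings(text_token_ids)            # (1, 6, embed_dim)
image_embeds = image_projector(image_features)            # (1, 2, embed_dim)

# Concatenating the two gives one "multimodal sentence" that the language
# model can process with ordinary self-attention.
multimodal_sequence = torch.cat([text_embeds, image_embeds], dim=1)
print(multimodal_sequence.shape)  # (1, 8, embed_dim)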

We initialize PaLM-E for training with pre-trained models for both the language (PaLM) and vision components (Vision Transformer, a.k.a. ViT). All parameters of the model can be updated during training.

Transferring knowledge from large-scale training to robots

PaLM-E offers a new paradigm for training a generalist model, which is achieved by framing robot tasks and vision-language tasks together through a common representation: taking images and text as input, and outputting text. A key result is that PaLM-E attains significant positive knowledge transfer from both the vision and language domains, improving the effectiveness of robot learning.

Positive transfer of knowledge from general vision-language tasks results in more effective robot learning, shown for three different robot embodiments and domains.

Results show that PaLM-E can address a large set of robotics, vision and language tasks simultaneously without performance degradation compared to training individual models on individual tasks. Further, the visual-language data actually significantly improves the performance of the robot tasks. This transfer enables PaLM-E to learn robotics tasks efficiently in terms of the number of examples it requires to solve a task.

An introduction to Policy Gradient methods - Deep Reinforcement Learning

Results

We evaluate PaLM-E on three robotic environments, two of which involve real robots, as well as general vision-language tasks such as visual question answering (VQA), image captioning, and general language tasks. When PaLM-E is tasked with making decisions on a robot, we pair it with a low-level language-to-action policy to translate text into low-level robot actions.

In the first example below, a person asks a mobile robot to bring a bag of chips to them. To successfully complete the task, PaLM-E produces a plan to find the drawer and open it and then responds to changes in the world by updating its plan as it executes the task. In the second example, the robot is asked to grab a green block. Even though the block has not been seen by that robot, PaLM-E still generates a step-by-step plan that generalizes beyond the training data of that robot. 

PaLM-E controls a mobile robot operating in a kitchen environment. Left: The task is to get a chip bag. PaLM-E shows robustness against adversarial disturbances, such as putting the chip bag back into the drawer. Right: The final steps of executing a plan to retrieve a previously unseen block (green star). This capability is facilitated by transfer learning from the vision and language models.

In the second environment below, the same PaLM-E model solves very long-horizon, precise tasks, such as “sort the blocks by colors into corners,” on a different type of robot. It directly looks at the images and produces a sequence of shorter textually-represented actions — e.g., “Push the blue cube to the bottom right corner,” “Push the blue triangle there too.” — long-horizon tasks that were out of scope for autonomous completion, even in our own most recent models. We also demonstrate the ability to generalize to new tasks not seen during training time (zero-shot generalization), such as pushing red blocks to the coffee cup. 

PaLM-E controlling a tabletop robot to successfully complete long-horizon tasks.

The third robot environment is inspired by the field of task and motion planning (TAMP), which studies combinatorially challenging planning tasks (rearranging objects) that confront the robot with a very high number of possible action sequences. We show that with a modest amount of training data from an expert TAMP planner, PaLM-E is not only able to also solve these tasks, but it also leverages visual and language knowledge transfer in order to more effectively do so.

Deep RL Bootcamp Frontiers Lecture I: Recent Advances, Frontiers and Future of Deep RL

      

PaLM-E produces plans for a task and motion planning environment.

As a visual-language generalist, PaLM-E is a competitive model, even compared with the best vision-language-only models, including Flamingo and PaLI. In particular, PaLM-E-562B achieves the highest number ever reported on the challenging OK-VQA dataset, which requires not only visual understanding but also external knowledge of the world. Further, this result is reached with a generalist model, without fine-tuning specifically on only that task.

PaLM-E exhibits capabilities like visual chain-of-thought reasoning, in which the model breaks down its answering process into smaller steps, an ability that has so far only been demonstrated in the language-only domain. The model also demonstrates the ability to perform inference on multiple images despite being trained only on single-image prompts.

Conclusion

PaLM-E pushes the boundaries of how generally-capable models can be trained to simultaneously address vision, language and robotics while also being capable of transferring knowledge from vision and language to the robotics domain. There are additional topics investigated in further detail in the paper, such as how to leverage neural scene representations with PaLM-E and also the extent to which PaLM-E, with greater model scale, experiences less catastrophic forgetting of its language capabilities.

PaLM-E not only provides a path towards building more capable robots that benefit from other data sources, but might also be a key enabler to other broader applications using multimodal learning, including the ability to unify tasks that have so far seemed separate.

More Information:

https://wiki.pathmind.com/deep-reinforcement-learning

https://magazine.sebastianraschka.com/p/ahead-of-ai-7-large-language-models

https://primo.ai/index.php?title=Large_Language_Model_%28LLM%29