Hey Everyone,
Recently we have been seeing steady improvements in how LLMs are built and prompted to reduce hallucinations.
In a nutshell, this paper introduces Gorilla, a LLaMA model fine-tuned on thousands of API calls. The method reduces hallucination and outperforms GPT-4 at writing API calls. The authors also release APIBench, a comprehensive dataset of HuggingFace, TorchHub, and TensorHub APIs.
What is it?
Gorilla is an LLM that generates the appropriate API call for a given task. It is trained on three massive machine-learning hub datasets: Torch Hub, TensorFlow Hub, and HuggingFace. Zero-shot, Gorilla outperforms GPT-4, ChatGPT, and Claude, and it is extremely reliable, significantly reducing hallucination errors.
Is it a Milestone?
The world of LLMs is witnessing a new milestone with the introduction of Gorilla, from UC Berkeley and Microsoft Research: a fine-tuned LLaMA model designed explicitly for API calls that surpasses GPT-4 at writing them.
API Advancements
Despite the recent advancements in LLMs, their potential to effectively use tools via API calls has remained largely untapped. This is where Gorilla steps in: given a natural language query, Gorilla can generate semantically and syntactically correct calls to more than 1,600 APIs across Hugging Face, Torch Hub, and TensorFlow Hub.
I View this as an Important Paper
Gorilla's strength lies in its ability to adapt to API changes at inference time. It also significantly reduces hallucination by using a retrieval system that pulls the relevant API documentation from a large database.
Integrating the retrieval system with Gorilla showcases the potential for LLMs to use tools more accurately, keep up with frequently updated documentation, and consequently increase the reliability and applicability of their outputs.
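The paper evaluates off-the-shelf retrievers such as BM25 and a GPT-based retriever. As a rough, self-contained sketch of how a BM25-style retriever could rank API documentation against a user query (the scoring function and the toy doc corpus below are illustrative, not the paper's code):

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each doc against the query with BM25 (pure-Python sketch)."""
    tokenized = [d.lower().split() for d in docs]
    avg_len = sum(len(t) for t in tokenized) / len(tokenized)
    n_docs = len(tokenized)
    # document frequency: in how many docs does each term appear?
    df = Counter()
    for toks in tokenized:
        for term in set(toks):
            df[term] += 1
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            score += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(toks) / avg_len))
        scores.append(score)
    return scores

# Hypothetical mini-corpus of API doc summaries
api_docs = [
    "translation english to french text pipeline model",
    "image classification vision model resnet",
    "speech recognition audio transcription model",
]
scores = bm25_scores("translate english text to french", api_docs)
best = scores.index(max(scores))  # index of the best-matching API doc
```

In Gorilla's retrieval-aware mode, the top-ranked document would then be handed to the model alongside the user query, which is what lets it track documentation changes without retraining.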
To evaluate Gorilla's capabilities, a comprehensive dataset called APIBench was introduced, consisting of HuggingFace, TorchHub, and TensorHub APIs.
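APIBench checks whether a generated call matches a reference call via AST sub-tree matching. Here is a simplified sketch of that idea using Python's `ast` module; the `matches_reference` helper and its matching rule (same function, superset of the reference's keyword arguments) are my own simplification, not the benchmark's exact criterion:

```python
import ast

def call_signature(code):
    """Extract (function name, keyword-arg names) of the first call in `code`."""
    tree = ast.parse(code)
    for node in ast.walk(tree):
        if isinstance(node, ast.Call):
            func = ast.unparse(node.func)
            kwargs = {kw.arg for kw in node.keywords if kw.arg}
            return func, kwargs
    return None

def matches_reference(generated, reference):
    """A generated call 'matches' if it invokes the same function with at
    least the reference's keyword arguments (simplified AST sub-tree check)."""
    gen, ref = call_signature(generated), call_signature(reference)
    if gen is None or ref is None:
        return False
    return gen[0] == ref[0] and ref[1] <= gen[1]

ok = matches_reference(
    'pipeline(task="translation_en_to_fr", model="t5-base")',
    'pipeline(task="translation_en_to_fr")',
)
```

The appeal of an AST check over string comparison is that it tolerates reordered or extra arguments while still catching calls to nonexistent functions.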
Together with other agent-like libraries such as Hugging Face’s Transformers Agent, one can significantly extend an LLM’s abilities, automating entire workflows with minimal human intervention.
See the Demo Video
They Claim it Surpasses GPT-4
Gorilla picks from thousands of APIs to complete user tasks, surpassing even GPT-4! LLMs need to interact with the world through APIs, and Gorilla teaches LLMs how to use them. The team also presents a Gorilla-Spotlight demo.
Webpage: https://gorilla.cs.berkeley.edu - see the Tweet.
Abstract
Large Language Models (LLMs) have seen an impressive wave of advances recently, with models now excelling in a variety of tasks, such as mathematical reasoning and program synthesis. However, their potential to effectively use tools via API calls remains unfulfilled. This is a challenging task even for today's state-of-the-art LLMs such as GPT-4, largely due to their inability to generate accurate input arguments and their tendency to hallucinate the wrong usage of an API call. We release Gorilla, a finetuned LLaMA-based model that surpasses the performance of GPT-4 on writing API calls. When combined with a document retriever, Gorilla demonstrates a strong capability to adapt to test-time document changes, enabling flexible user updates or version changes. It also substantially mitigates the issue of hallucination, commonly encountered when prompting LLMs directly. To evaluate the model's ability, we introduce APIBench, a comprehensive dataset consisting of HuggingFace, TorchHub, and TensorHub APIs. The successful integration of the retrieval system with Gorilla demonstrates the potential for LLMs to use tools more accurately, keep up with frequently updated documentation, and consequently increase the reliability and applicability of their outputs.
The model and code of Gorilla are available at https://github.com/ShishirPatil/gorilla.
They also have a Discord.
Here is the HuggingFace page on this.
This was a collaboration between UC Berkeley and Microsoft Research.
Shishir Patil is a PhD student. He has been an intern at Google, Amazon and Microsoft. See on Google Scholar.
What can it do?
Gorilla enables LLMs to use tools by invoking APIs. Given a natural language query, Gorilla can write a semantically and syntactically correct API call to invoke. With Gorilla, we are the first to demonstrate how to use LLMs to invoke 1,600+ (and growing) API calls accurately while reducing hallucination. We also release APIBench, the largest collection of APIs, curated and easy to train on. Join us as we try to expand the largest API store and teach LLMs how to write API calls! Hop on our Discord, open a PR, or email us if you would like your API incorporated as well.
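In retrieval-aware mode, the core move is simply appending the retrieved API documentation to the user's instruction before querying the model. A minimal sketch, with a made-up prompt template (the paper's actual template differs):

```python
def build_prompt(query, retrieved_doc=None):
    """Assemble an instruction prompt for an API-writing model.
    If a retrieved API doc is supplied, append it so the model can
    ground its call on current documentation (illustrative format)."""
    prompt = f"Write a correct API call for this task: {query}"
    if retrieved_doc is not None:
        prompt += f"\nUse this API documentation for reference: {retrieved_doc}"
    return prompt

p = build_prompt(
    "translate English text to French",
    "pipeline(task='translation_en_to_fr') -- HuggingFace translation pipeline",
)
```

Because the documentation is injected at inference time, swapping in an updated doc is enough to track a version change, with no retraining of the model.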
TL;DR
How LLMs and APIs work together is quickly improving and shows a lot of potential. Gorilla is a fine-tuned LLaMA-based model that surpasses GPT-4 at writing API calls. When combined with a document retriever, Gorilla demonstrates a strong capability to adapt to test-time document changes, enabling flexible user updates or version changes. It also substantially mitigates the issue of hallucination, commonly encountered when prompting LLMs directly.
If this really reduces hallucination, this team could plausibly spin it into a startup, although the work is a UC Berkeley and Microsoft Research collaboration.
Figure 2: Accuracy vs. hallucination in four settings: zero-shot (i.e., without any retriever) and with retrievers. BM25 and GPT are commonly used retrievers, and the oracle retriever returns relevant documents 100% of the time, indicating an upper bound. Higher in the graph is better (higher accuracy), and further left is better (lower hallucination). Across the entire dataset, our model, Gorilla, improves accuracy while reducing hallucination.
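The paper distinguishes a hallucination (the model imagines an API that does not exist in the database at all) from an error (the model picks a real but wrong API). A toy sketch of that bookkeeping, with hypothetical predictions and a hypothetical API database:

```python
def evaluate(predictions, references, api_database):
    """Split model outputs into accurate / wrong-API error / hallucination.
    A hallucination names an API absent from the database entirely;
    an error picks a real API that doesn't match the reference."""
    accurate = errors = hallucinated = 0
    for pred, ref in zip(predictions, references):
        if pred == ref:
            accurate += 1
        elif pred in api_database:
            errors += 1
        else:
            hallucinated += 1
    n = len(predictions)
    return {
        "accuracy": accurate / n,
        "error_rate": errors / n,
        "hallucination_rate": hallucinated / n,
    }

# Hypothetical run: one correct call, one wrong-but-real API, one invented API
result = evaluate(
    predictions=["api_a", "api_b", "api_z"],
    references=["api_a", "api_c", "api_c"],
    api_database={"api_a", "api_b", "api_c"},
)
```

This separation is why Figure 2 plots accuracy against hallucination as two axes rather than a single score: a model can be wrong without inventing anything.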
They plan to release a Gorilla model under an Apache 2.0 license by Jun 5, 2023. It’s a fairly exciting project. What do you think?