EleutherAI's Aran Komatsuzaki on Open-Source Models' Future and Thought Cloning
Ep. 12, Unsupervised Learning
In this episode of Unsupervised Learning, we sat down with Aran Komatsuzaki, lead researcher at EleutherAI and prolific AI Twitter contributor. See the full episode below!
Aran provides insight into the founding principles of EleutherAI, a perspective on the role of open-source models, and his opinion on how to achieve artificial general intelligence. You can listen to the full conversation on Spotify, Apple, and YouTube, or read our highlights below!
⚡ Highlight 1: Building “The Pile”
Since its founding in 2020, EleutherAI has built or contributed to some of the most famous foundation models, including GPT-J, GPT-NeoX, BLOOM, and Stable Diffusion. However, its first, and arguably most important, contribution to the field so far has been a public large-scale (800GB) text corpus called “The Pile” for training large language models.
“We had about a dozen contributors for each subcomponent of the dataset. The Pile was simply to replicate the dataset used for training GPT-3 with additional components such as Stack Exchange and many different smaller sub-datasets to improve the diversity, which was slightly lacking in GPT-3’s original dataset.”
Since the publication of “The Pile,” Microsoft has used it to train its Megatron transformer model, while Meta has used it to train its OPT and Llama models.
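For readers who want to poke at the corpus themselves, here is a minimal sketch using the Hugging Face `datasets` library. It assumes a hosted copy is available under the `EleutherAI/pile` identifier (mirrors and availability have changed over time) and that records follow the original `text`/`meta` layout.

```python
# Minimal sketch: stream a few documents from The Pile without downloading
# the full ~800GB corpus. The "EleutherAI/pile" identifier and the
# text/meta field names are assumptions based on the original release.
from datasets import load_dataset

pile = load_dataset("EleutherAI/pile", split="train", streaming=True)

for i, doc in enumerate(pile):
    # "meta" identifies which sub-dataset (Stack Exchange, PubMed, etc.) the text came from.
    print(doc["meta"], doc["text"][:200])
    if i >= 2:
        break
```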
⚡ Highlight 2: Compute isn’t the only constraint for open-source models
Many AI experts have weighed in on the closed- vs. open-source LLM debate, with several arguing that open-source efforts will not be able to afford the cost of training 1T+ parameter models (SemiAnalysis reported that GPT-4 is a 1.8T-parameter model!). However, Aran points to additional factors beyond training compute that are helping closed-source models maintain their lead over open source.
“So I should also mention that you need a lot of budget for other components. Compute is obviously one of the major factors. Two other major factors are the number of top researchers and engineers you have, and how much you can spend on the dataset. More specifically, the supervised fine-tuning and RL component. Most importantly, the number of top researchers and engineers you have…there’s a clear gap between top players like OpenAI or Google and the smaller providers.”
However, Meta just released a fully open-sourced Llama 2, and Scale subsequently announced a platform to help enterprises fine-tune open-source models on their private data. The market’s appetite for open-source large language models could overcome these constraints, but only time will tell. See Aran talk about this here!
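As a rough illustration of what fine-tuning an open-source model on private data can look like, here is a hedged sketch of parameter-efficient supervised fine-tuning with Hugging Face `transformers` and `peft`. The model ID, the `private_data.jsonl` path, and the hyperparameters are placeholders, not a description of Scale’s platform or Meta’s training recipe.

```python
# Hedged sketch: LoRA supervised fine-tuning of an open-source causal LM on a
# private JSONL dataset. Model ID, file path, and hyperparameters are placeholders.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base_model = "meta-llama/Llama-2-7b-hf"  # assumed; any open causal LM works here
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base_model)

# Train only small low-rank adapter weights instead of the full set of parameters.
model = get_peft_model(model, LoraConfig(r=8, lora_alpha=16, task_type="CAUSAL_LM"))

# "private_data.jsonl" stands in for an enterprise's own records, one {"text": ...} per line.
data = load_dataset("json", data_files="private_data.jsonl", split="train")
data = data.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=512),
                remove_columns=data.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="llama2-private-sft",
                           per_device_train_batch_size=1,
                           num_train_epochs=1,
                           learning_rate=2e-4),
    train_dataset=data,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```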
⚡ Highlight 3: Video is the next modality for foundation models
With GPT-4 expanding beyond the language paradigm to include images, many think video is the next modality to be incorporated as an input. Aran discussed his work in this domain on the podcast.
“We are collecting a computer activity dataset of many different modalities. We are letting annotators use computers to do many kinds of tasks like playing video games, watching YouTube videos, browsing the internet, finding information, and using apps. We are recording those processes using video screenshots, audio, mouse and keyboard inputs, and even eye-movement tracking. Eye movement is actually often overlooked in machine learning today, and I think it's super important because the way humans visualize input is quite different from the vision input for a camera.”
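Aran doesn’t go into the storage format on the podcast, but as a purely hypothetical illustration, one timestamped record in a dataset like this might bundle the modalities he lists roughly as follows (the schema and field names are invented for the sketch, not taken from EleutherAI’s actual format).

```python
# Hypothetical sketch of one timestamped record in a multimodal computer-activity
# dataset; the schema and field names are invented for illustration only.
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class ActivityFrame:
    timestamp_ms: int                    # time since the start of the session
    screenshot_path: str                 # path to the captured screen frame
    audio_chunk_path: Optional[str]      # audio recorded over the same window
    mouse_xy: Tuple[int, int]            # cursor position in screen pixels
    key_events: List[str] = field(default_factory=list)   # keys pressed in this window
    gaze_xy: Optional[Tuple[int, int]] = None              # eye-tracker fixation, if available

# Example: a frame where the annotator typed a shortcut while looking near the cursor.
frame = ActivityFrame(
    timestamp_ms=15_000,
    screenshot_path="session_01/frame_0450.png",
    audio_chunk_path="session_01/audio_0450.wav",
    mouse_xy=(812, 430),
    key_events=["ctrl", "t"],
    gaze_xy=(790, 455),
)
```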