Scaling and Orchestrating Large Language Models: In Conversation With Databricks CTO Matei Zaharia
On the second episode of our Unsupervised Learning podcast, we chatted with Matei Zaharia, CTO and co-founder of Databricks and Professor at Stanford. Here are some highlights and the full transcript in case you missed it!
⚡ Highlight 1: Combining traditional retrieval-based NLP with generative models
“Right now these large language models combine both semantic understanding or reasoning and facts. But you'd like to decouple them. It's actually kind of hard if you train a model on tons of data to give it the semantic understanding. It's really hard to then make it forget those facts.”
Matei highlights a huge issue with LLMs: it can be hard to teach them new facts (for example, who is the president in 2022?). This is because their information is encoded in the weights of billions of parameters rather than in a simple field, as in a traditional search index. He discusses why using external indexes to store “knowledge” in combination with LLMs for “reasoning” is such a powerful paradigm.
LlamaIndex and others are helping to enable this pattern of a separate index holding knowledge and using that to query a big model in the cloud.
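To make the pattern concrete, here is a minimal sketch of retrieval-augmented generation. The `index.search` method is a hypothetical stand-in for whatever store you use (LlamaIndex, a vector database, or a plain keyword index); the model call uses OpenAI's chat completions API.

```python
# Minimal retrieval-augmented generation sketch. The index holds the
# facts; the LLM supplies the reasoning over whatever it is handed.
import openai

def answer(question, index):
    # 1. Look up relevant passages in an external index (`search` is a
    #    hypothetical method; swap in your retrieval system of choice).
    passages = index.search(question, top_k=3)
    context = "\n".join(p.text for p in passages)

    # 2. Ask the model to answer from the retrieved context only, so
    #    updating a fact means updating the index, not retraining weights.
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system",
             "content": "Answer using only the provided context."},
            {"role": "user",
             "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response["choices"][0]["message"]["content"]
```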
⚡ Highlight 2: Work is being pulled out of LLMs and into code
As workflows on top of LLMs get more complex, we need better ways to tell LLMs what to do.
“We want to use a language model for things it's uniquely good at, like approximate matching or things like that. But as much as possible, we want to retain programmatic control of what the system does.”
One of the research projects built by Omar Khattab and Matei’s research group at Stanford is the Demonstrate-Search-Predict framework. This allows engineers to be more explicit about the steps LLMs should take when answering their questions.
Fixie.ai, LangChain, and others are also helping to enable this trend of pulling work out of LLMs and into code, giving engineers more control. ChatML, launched by OpenAI, is another example of querying LLMs in a more structured, explicit way.
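As a rough illustration of that structured style, here is what a query looks like through OpenAI's chat API (the interface introduced alongside ChatML): instructions and user input are separate, typed messages rather than one concatenated prompt string. The model name and task here are just placeholders.

```python
import openai

# Each message has an explicit role, so the system instruction and the
# user input are delimited in code instead of mixed into one prompt blob.
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You extract dates from text."},
        {"role": "user", "content": "The launch slipped to March 1, 2023."},
    ],
    temperature=0,  # keep output as deterministic as possible for pipelines
)
print(response["choices"][0]["message"]["content"])
```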
⚡ Highlight 3: The downsides of scale for LLMs
We talk about how models improve with scale, but also how this need for scale shrinks the number of people with the resources to train models. Even in the couple of weeks since we recorded the episode, we've seen projects like Alpaca released that show you can achieve amazing performance even with smaller-scale models (7B parameters, small enough to run on your laptop).
“Today with language models, it's at the point where very few companies can even reproduce the largest scale ones. And for researchers, that's certainly a problem. It actually means I think that the research or the innovation will slow down because fewer people can be chipping away at it and trying out new ideas. So maybe there'll be some new thing that's easier for people to play with that leads to like a better descendant ideas and better long-term research.”
Maybe that’s Alpaca 👀
Transcript
Introduction
Patrick Chase:
This is Unsupervised Learning, Redpoint's AI podcast. Today we are so excited to welcome Matei Zaharia to the show. If there was a Mount Rushmore for AI technologists, Matei would definitely be on it. He is the creator of Apache Spark, one of the most widely used AI infrastructure projects, and is the founder and CTO of Databricks. Databricks famously built a data platform around Spark that has reached incredible scale. They passed a billion in ARR in August of last year, raised $3.5 billion in funding from A16Z, NEA, Coatue, and more. And their last round of funding put the company at a $38 billion valuation.
If that wasn't enough, Matei is also a professor at Stanford focusing on computer systems and machine learning, and one of the leaders of the Stanford DAWN Project, which is a five-year research project to democratize AI. Matei, thank you so, so much for joining us. We're really excited for this show.
Matei Zaharia:
Thanks a lot for having me. Yeah.
Building Databricks
Patrick Chase:
Maybe to start, it'd be awesome to talk about your journey building Databricks. It started as your research project with Apache Spark, and then now it's used by more than 7,000 organizations worldwide, including 40% of the Fortune 500, which is just really incredible scope that you guys have reached. How did you come to start Databricks?
Matei Zaharia:
Yeah, for sure. So it basically started out of some of my research as a PhD student, as you mentioned. So I was doing my PhD at UC Berkeley. And I was really interested in large scale data intensive computing, which was just starting to become a thing back then with web companies like Google and later Facebook, Yahoo, all these companies that were indexing the whole web and building search engines and other products. And I was excited about bringing this kind of technology to all users. Storing and collecting data isn't expensive, so why can't everyone do really large scale computations? And indeed there was this trend that in most areas of industry and most areas of science, you could start collecting lots of types of data.
And then I was also excited about letting people run more sophisticated algorithms on them, because that's what I saw with the early MapReduce users. You don't want to just do analytics, especially when you have so much data. It's perfect for machine learning, it's perfect for unsupervised learning, for deep learning when that became popular, and so on. Yeah. So we started first this open source project that very quickly grew. It was clear that there was demand for something. And we decided that we should start a company to really have an impact in this space, because you can only get so far by just having an open source project with no one backing it.
And the other bet we made that was a bit controversial at the time, but turned out to be right, is we decided to go cloud only. So you can only use Databricks in the cloud, on one of the three major cloud providers, as a SaaS service on top of them. And this was 2013, so it wasn't as obvious that banks and other really big enterprises would move to the cloud. But it turned out to be the right bet, and it really focused the company on delivering an excellent experience there, with things you can only do in the cloud. And it helped us land as one of the big options if you want to do these things there.
Patrick Chase:
Yeah, amazing. It's super cool just to see how far it's come. I remember when you were just getting started. I actually was on the machine learning team at LinkedIn, and we were some of the early adopters of Apache Spark. And I think you and Reynold came and did a talk. And it's just amazing the technology that you've built and then how far it's come.
And I remember a big part of it was for these sophisticated algorithms. You were moving to doing things in memory, so iterative computation would be a lot faster than on Hadoop. Was machine learning kind of a target use case for Spark from day zero, or was that something that over time started to really emerge as, like, this is great for training models and that sort of thing?
Matei Zaharia:
Yeah. No, it actually was one of the main use cases from the beginning. And actually, the first applications we built on it at Berkeley were machine learning ones. Because I was sitting in this lab that had a lot of the systems researchers at Berkeley, like Dave Patterson, for example, but they had set up this lab with Mike Jordan, who's a huge machine learning researcher doing both theoretical machine learning and applied ML in different areas, and so on. And so they wanted to look at the intersection of computer systems and machine learning: can we use ML to improve computer systems, but also can we use these emerging large scale systems to do machine learning better?
So I had seen people using MapReduce and other things in industry, but at Berkeley I wanted to find someone local that would actually try out my crazy new ideas for a programming model and, like, a [inaudible 00:05:33] system that still has a few bugs and stuff. And so the people around were the machine learning ones, and they were very excited to try it.
So for example, one of my classmates in the year I started was Lester Mackey. He went on to get a faculty job at Stanford, and then he went to Microsoft Research. So he became like a full-time AI researcher. And at the time he was part of a team entering the Netflix Prize competition to build a recommendation engine. And he told us, we really need scale, we need to be able to run all these ideas, all these training runs we have to do. We only have this many days before the competition, so we need scale. So he actually tried it out and was able to accelerate some things. And his team did really well in the contest. They basically placed second at the end.
And there were also other groups. So that's something we were excited about.
Databricks and MLFlow for MLOps
Jacob Effron:
Yeah. I mean, it's a good start to a tradition of, I guess, the company now supporting many companies building kind of cutting edge ML models on top. And I guess I'm curious. At Databricks, you set up this really interesting opportunity where you get to observe, I guess, a ton of folks that are building interesting applications on top. And so I'm wondering, I guess, what you've noticed trendwise in how folks are deploying some of these ML models in production. And then obviously you yourself have been very involved with MLflow and building some of the supporting tooling here. And so would love just, one, kind of the trends you've seen in that.
And then two, I imagine with ChatGPT and the kind of rise in all this attention around LLMs, the last few months might have been pretty interesting as well. So just anything you're kind of seeing on the ground and how folks are using these.
Matei Zaharia:
Yeah, definitely. So yeah, I think I'll call out two things. So first of all, I think just ML in the enterprise, maybe not necessarily deep learning, increasingly there's a lot of deep learning, but just, hey, ML in day-to-day business processes is established now. So it used to be... When we started out, there were many people who said, "Ah, would enterprises actually want to do machine learning? Do they even have the expertise? Is there anyone there who could help them? I believe it could help them, but is there anyone there and so on? Or will it be this thing for like only these weird web companies?"
But now, all major companies have hired data science and machine learning teams and they're building products. Their first one or two models are very likely in production. For some of them it's already hundreds of models. And they have the team, they have the connection to the business data, and they're looking to build more. And these things provide tremendous value in a lot of areas. For example, one customer that's talked a bunch about things they do with Databricks is Shell. Right? They produce all these chemicals, they have all these plants that can process them. Storing chemicals is expensive. You know? If you produce it and no one's going to use it, that's bad. All kinds of things can break down. If that happens, it's really bad for you.
And so anything you can do that helps you estimate demand, optimize conditions in the plant, monitor, detect anomalies, has many millions of dollars worth of impact. And it's so easy to just collect the fine-grained time series from these things, so why not do it? Right? And so they're at the point where they're basically doing a lot of use cases through fully automated like AutoML where the ML team only has to look at it when something's broken. And that's just one example, but it's anything. Like, companies that produce milk, companies that build houses, whatever it is, they're using it. So that's one change.
Now if you do something in the space, you can find lots of enterprises who have ML teams and are doing things. And of course the next wave of that that they're all asking is, "Okay, we did the first few use cases. It was a lot of work, but now we have ideas for a hundred more use cases. So how do we make it way easier to create new ones and bring them to production?" And I think whoever cracks that code is going to determine the next platform. And that's the kind of thing we're of course trying to do through open source projects like MLflow and through the SaaS services we have. So that's one thing.
And then the other thing, yeah, you mentioned ChatGPT, LLM. So deep learning is definitely on people's minds. And I think, again, within enterprises, I think people are especially excited about natural language just because there is so much data with that and the entire products you can make using it. I see virtually all software companies now, for example, thinking of including it in the UI in some form. But I also think other things like computer vision will also be increasingly important. We're starting to see some things like that.
And the really interesting thing with ChatGPT is I think it just opened up... Like, it got more people to try these foundation models. Basically GPT-3 had been there for years and could do these things. InstructGPT, which is easier for people to prompt, also had been there for a while. But there was no nice web interface where you could just try it out. So I think now a lot more developers are thinking about it and trying to build these applications, which is really cool.
MLOps tooling vs LLM tooling
Jacob Effron:
Right. Does the rise of more developers trying to build on top of large language models change at all how you think about the roadmap at MLflow or the tools you guys build at Databricks? Or is it kind of... Yeah. How are you thinking about that first world that you talked about, the kind of specific models for a company like Shell, versus the kind of let's build a support product on top of a large language model? How are you guys thinking about supporting enterprises across those?
Matei Zaharia:
Yeah, it's a good... So I think the LLMs out there, they're sometimes called foundation models. I'm actually part of the team at Stanford that tried to come up with a name [inaudible 00:11:46]-
Jacob Effron:
It's a very impactful term you guys coined. [inaudible 00:11:48]-
Matei Zaharia:
Yeah. I mean, there are pros and cons of this term. Some people don't like it because it sounds like you have to build on one of these. That wasn't really the intent. But the idea of this like pre-trained model that you can then adapt to use cases is very powerful.
At the same time though, I'm seeing that you need... For every company and every product, the application of it is different. And to get the best quality, you will want to customize it. And so people are trying to figure out what to do. I mean, some people are getting predictions from these models and then actually training their own model by fine-tuning these, or even by just training their own from scratch, like just using them to label stuff. Some people are looking at really clever ways to incorporate it in their products. For example, there's this nice post on GitHub Copilot, like how it decides what context to send with your [inaudible 00:12:45] so that it has the highest chance of knowing about your code and completing it. Some people are designing user interfaces that actually can collect reasonable feedback and make things happen.
So I think there's still a lot of engineering, but these models give you some basic capabilities for understanding language, intent, and fuzzy sort of semantics that you couldn't do before. And they're also super easy to try out, which is great.
Jacob Effron:
Yeah. No, it's interesting. I mean, I think with some of the MLOps companies that are emerging around the foundation models themselves, it feels like it's things like prompt engineering, or providing more context in prompts, or chaining models together. And one thing I'm curious about, and I'm sure you guys think about this, is it's hard to know how much of that is point in time, required for GPT-3 today, versus how much of this is always going to be a fixture of working with these models. And so you guys must think about this in terms of what you build. How do you think about, I guess, how many of these problems are point in time for where foundation models are right now versus permanent fixtures of an MLOps stack that you might need for them long term?
Matei Zaharia:
Yeah, great question. Yeah. I think, yeah, it's important to separate things that are very algorithm or technique specific from general things you need. And for Databricks as a company at least, we've focused on figuring out what are the platform pieces that are really general and will be useful later. So what do I mean by that? So for example, one thing is whatever you do with your model, you're going to deploy it and you're going to want to collect back data on how it's doing or maybe A/B different versions and easily analyze that data. Could be lots of data because you want to log everything you can to improve it. So making that workflow super easy.
So anytime you deploy the model, you just get a table. You know? Here's all the stuff that it served. Or you'll want to have different versions of it and compare them, or you'll want to have reproducibility of seeing what went into it. I think that'll be important whether you're running Scikit-learn or PyTorch or you're calling an API for the model with a prompt; you still need all these things. And they're not easy to do. It's not easy to figure out the right workflows. So there's a lot to do there.
Of course, you can also do something more specialized for one type of model or one application. And you can have really great products there. I think, again, some of the most successful ones are ones that are pretty broad. For example, things that let you look at computer vision more in general and maybe design, like easily build an object detector for a new kind of object or something like that. Those are likely to be widely used, but it remains to be seen.
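As a rough sketch of the deployment-logging workflow Matei describes above (not Databricks' actual implementation): log every request a deployed model serves into a table tagged with the model version, then compare versions offline. The `predict` callable and the log schema are assumptions; MLflow is used here only for run tracking.

```python
import time
import mlflow
import pandas as pd

served = []  # in production this would be a real table, e.g. in Delta Lake

def serve(model_version, predict, request):
    """Serve one request and log what the model saw and returned."""
    output = predict(request)
    served.append({
        "ts": time.time(),
        "model_version": model_version,
        "request": request,
        "response": output,
    })
    return output

serve("v1", lambda text: text.upper(), "hello")  # toy model for illustration

# Later: analyze the serving table and record per-version metrics,
# so an A/B comparison between model versions is just a groupby.
df = pd.DataFrame(served)
for version, group in df.groupby("model_version"):
    with mlflow.start_run(run_name=f"serving-eval-{version}"):
        mlflow.log_param("model_version", version)
        mlflow.log_metric("num_requests", len(group))
```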
Orchestrating LLMs and the Demonstrate-Search-Predict Framework
Matei Zaharia:
I should also say on this prompting aspect specifically, and I think maybe you folks would've prompted me to talk about it later too, but yeah, some of my research at Stanford is on this now, is on how to build complex applications using language models. So we have this programming model called Demonstrate Search and Predict or DSP. We have some open source code. And we're also looking at these things as a building block. And actually the approach we've taken that's maybe a bit different from some companies is we want to use a language model for things it's uniquely good at, like approximate matching or things like that. But as much as possible, we want to retain programmatic control of what the system does.
So we have something where you basically write a pipeline of steps in Python, and then some of the steps call into an LM. So you can also basically fine-tune it and so on. But we're trying to limit its scope to tasks that will be basically really easy for it, so that the overall accuracy as you combine many of these is still good. Because otherwise, if you bet on it doing one thing, say, come up with a plan to answer this question and then follow each step, and it comes up with the wrong plan, then you're toast. It's never going to succeed.
So there'll be different philosophies. I'm sure there'll be people doing the opposite too, but we're looking at it from an engineering point of view of like we want something that's 99% reliable. Like, what's the failure probability of each stage that we want to get it to work?
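This isn't the actual DSP API, but a minimal sketch of the philosophy Matei describes: the pipeline is ordinary Python that keeps programmatic control, and each LM call is scoped to a narrow subtask the model is very likely to get right. The `llm` and `search` callables are hypothetical stand-ins.

```python
from typing import Callable, List

def answer(question: str,
           llm: Callable[[str], str],
           search: Callable[[str], List[str]]) -> str:
    # Step 1: a narrow, easy LM task, rewriting the question as a search query.
    query = llm(f"Rewrite as a keyword search query: {question}")

    # Step 2: code, not the model, decides what happens next (retrieval here).
    passages = search(query)

    # Step 3: another narrow task, answering from the retrieved evidence only.
    context = "\n".join(passages)
    return llm(f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")
```

Because each step is easy in isolation, the failure probability of the combined pipeline stays low, which is the 99%-reliable engineering point of view described above.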
Patrick Chase:
Yeah, it's super cool. And DSP is definitely one of the projects that was on our radar. And it seems like one that people are really excited about. So we were going to prompt you on it later.
Jacob Effron:
No pun intended.
Patrick Chase:
No pun intended. But, yeah, it seems like a really interesting area of research in terms of how people are interfacing with LLMs and what work is done by the LLM versus in code. It seems like with DSP, it's actually pulling work out of the LLM and into code. Is that the right way to think about it when people are writing these Python scripts to orchestrate LLM calls?
Matei Zaharia:
Yeah, more generally, as I said, we want to have... Some steps you control in code, where if you're pretty sure you want to do one thing or another, you then use the model in a limited capacity. Like, say, generate synonyms for this word or whatever, and then I'll search for them or something like that, as opposed to asking it a very complicated question and hoping that it figures everything out. So it's just sort of a challenge, but it remains to be seen.
I mean, the other factor this is getting at that's maybe kind of interesting is right now these large language models bake together and combine both what I'll call semantic understanding or reasoning and facts, like knowledge about the world. So for example, who's the president of the US? Well, if you ask ChatGPT, it... I mean, today maybe it gives the right answer, but in four years it'll be wrong. Right? Actually, I don't remember what it does. Maybe they clarify it.
Patrick Chase:
Yeah.
Jacob Effron:
Unclear.
Matei Zaharia:
It used to say Trump for a long time. Maybe it's now good. But those are things you'd like to decouple. So we also want to look for a way to separate these. And it's actually kind of hard if you train a model on tons of data to give it the semantic understanding. It's really hard to then make it forget those facts. Like, now just imagine a world where suddenly this thing changed. Now can you answer my question? So yeah.
Bringing together LLMs and traditional Information Retrieval Systems
Patrick Chase:
I think that's super interesting, because it is a big problem that this knowledge is encoded in the model weights and isn't easy to update. Right? Do you think in the future there will be more of a traditional search index that is encoding kind of the knowledge, and then something that is more generative, generating the response? Or how do you see traditional information retrieval working with LLMs in the future?
Matei Zaharia:
Yeah, I think that's a good... That's an attractive architecture. And we need strong reasons not to do it. But I think at least in what I'd call knowledge intensive applications where there's some either public knowledge or even private knowledge, like support tickets in my company or stuff like that, it makes sense. And you want to be able to plug those in. And there's a question of how to best build these, but you can get very impressive demos and quality even in fairly easy ways of combining current tools. Yeah.
Patrick Chase:
Yeah. Very cool. The last thing I want to touch on at this point is there also seems to be some interesting things where you have your DSP program, but then you also have a language model telling you how to orchestrate a DSP type program. Do you see those being written by a language model in the future? Or do you think that's kind of the right level of abstraction where it's like human takes over writing this program and then interfacing with the LLMs that way?
Matei Zaharia:
No. You could have things produced by a language model. And I think one of the best new use cases for these is... They are good at generating long, kind of complicated answers or examples or whatever for something. That's why, for example, when people ask, "Hey, is ChatGPT like a Google replacement?" I would say it's not so much a Google replacement in its current form as a Stack Overflow or, like, query or whatever type of replacement. And even with ChatGPT augmented by a search engine, I think the most interesting bits are the ones where you're asking it to generate a lot of stuff.
So it is very powerful for teaching people programming. Actually, when we think about... Like, in Databricks, you can imagine we're also building ways to use these language models in our product. And one of the things we're most excited about is it'll make the product easier to learn for people. Right? We're all about expanding access to data to more people, non-experts. Like, let them do cool things. And hosting things as a cloud service is one approach, but just kind of like, hey, natural language that gives you something that's maybe usable or you can edit from is another one. So I think there'll be an interesting back and forth.
And the other thing I'll say about that is if you think about like, "Oh, okay, I'm going to tell a language model to generate some code for me and then I hope the code is correct, like it's Python or something, I'm just going to run it," if you don't know how to program Python, it's kind of hard to make that work. You know? It's good if you're learning, but you could also try getting it to generate something else. Like, what if it generated a drag and drop UI in a tool like Figma or something like that? Right? Then you see it. You can edit it. You don't need to be an expert to tell whether it's correct or wrong.
What if it generated one of these visual data flow pipelines? Right? Like, generate a pipeline to parse and transform these things, and then you can just click on it and view it. So it doesn't have to generate an inscrutable programming language to be useful.
Using LLMs at Databricks
Jacob Effron:
Yeah. So it's really interesting. You mentioned potentially using these models within the Databricks product to make it more accessible to folks. I guess what are some examples of how you're thinking about that or how you might actually end up incorporating these types of models?
Matei Zaharia:
Yeah. There are a lot of these. And many, many companies are doing this, but certainly any place you could... For example, any place you can type a query in Python or SQL, maybe you could type it in natural language. That's an example. But there are also others. What if I'm doing a search either over my data sets or over the documentation? Right? Why not let me do that through a better model? Another simple one that many companies are doing is just basically autofill or recommendation. So for example, when you share a document in Google Drive and you add one person, it suggests, "Hey, maybe you also want to share with these other people." That's really easy to do.
So some of the software companies we work with, even before ChatGPT, they were doing machine learning on the platform. They wanted to do a lot more and we asked them why. And they say, "Well, look, I have this application, it's got like a thousand text fields on it, and there are millions of users using this every day. I want every single text field in there to have autocomplete based on that customer's deployment. So how can I get that? It'll make my application better for everyone and all that."
So yeah, so I think there are quite a few. And I'm definitely hoping we get to a place where these quality of life things are very easy to add for any developer. Imagine I'm adding a text field in, like, HTML. I could just put learn equals based on these other fields, basically. And then for the ones where you really want to spend time hands-on, build the best quality thing you can, invest lots of engineering, you have nice tools where ML engineers can just focus on that. So when you're building, like, your self-driving car or whatever, you're not going to do that with just one-click AutoML. You want the best tools to keep cracking away at it and make sure it's really good 99.99% of the time. So yeah.
Training Large Scale Models
Patrick Chase:
Awesome. Would love to shift gears and talk a little bit about the latest in the LLM world and training those models. It seems like the transformer architecture is becoming the de facto architecture that's used across a bunch of industries. So then it's interesting, because it seems like people are using the same architecture, but then it's about how you can actually scale these. And you're a systems person, so you're the expert on scale. I guess, how do you think about the importance of modeling expertise versus systems expertise in how you scale the training and the serving and that sort of thing? How do you think the importance of those might change over time?
Matei Zaharia:
Yeah, it's a great question. I mean, they're both important. The really interesting thing about scale, and I think the reason a lot of folks went for it, is you don't have to maybe think that hard, I guess, to get gains. Right? I mean, the systems people building the system have to think. But if you built some model architecture, some training process like SGD or whatever, that seems to do well, and you can throw 10x more machines at it or 10x more data and it does even better, that's fantastic. So I think a lot of people are going down this route and saying, "Hey, I can get $X million in investment. I am doing a lot of things that involve thinking hard and trying to understand from the fundamentals, but why don't I also... Why don't we just buy more GPUs and put in more data and try to do this thing?"
So I think a lot of folks have gone down that way. Now, of course today with language models, it's at the point where very few companies can even reproduce the largest scale ones. And for researchers, that's certainly a problem. It actually means I think that the research or the innovation will slow down because fewer people can be chipping away at it and trying out new ideas. So maybe there'll be some new thing that's easier for people to play with that leads to like a better descendant ideas and better long-term research.
But it also means that there are enough people working on making it cheaper and making it easy for anyone to do stuff at scale that there'll be activity on that. And if this is the method that works well... For many applications, it's still very inexpensive. Like, even if serving a prediction costs you like a dollar or something of GPU time, in some applications it's totally worth it. So people are doing it.
What I found though is at least for very targeted applications where you want very high quality, say, answering a support ticket or something you don't want to be wrong, that becomes kind of a traditional ML engineering process where you have to think, "Do I have the right validation data? Do I have the right inputs? How do I check this version against this other version and so on?" So the foundation models are only one piece of that. And having the right tool and having the right process, the ops process and the right tools to see what's going on is also important.
Cost of LLMs
Patrick Chase:
Yeah. It makes a lot of sense. On the cost side, it's fascinating what you were saying around the training versus serving. And it seems like the training costs have gone down a lot over-
Matei Zaharia:
That's good. Yeah.
Patrick Chase:
Even just in the last couple years. I think in 2020, someone estimated that on compute alone it was $5 million to train GPT-3. And Mosaic just published a blog post saying you could now train a GPT-quality model, a little bit smaller, but a quality model, for less than $500K. So it's like a 10x reduction in two years. Do you think that trend will continue and we'll see another 10x reduction in the next two years? Or do you think there'll be some sort of limits that we hit up against? Because I bet there was some low-hanging fruit that people were going after early.
Matei Zaharia:
Yeah. I think there's still a lot of space to improve costs for the current models. So if you just want to take today's model architecture and do it more cheaply, there is space. There are all these companies working on accelerators that will presumably generate a lot of competition there, especially if it ends up being a lot of transformers. Right? It's risky for some of them to go down that path, because then if the architecture of choice changes, they're kind of stuck. But there will be some who do that.
And there are also new ideas or new developments about the training process itself, like having bigger data sets, maybe having cleaner data sets, having different penalties, the human feedback stuff. So it turns out like just a language model alone that just predicts the next word is kind of hard to use, but this instruction tuning thing makes it much easier to use. There's a lot of room to do algorithmic things that might give you similar quality.
And then there are also pretty general techniques like model compression. For example, with Stable Diffusion, they ended up compressing it to the point where you can run it on, like, a laptop or a phone or something, which wasn't anticipated, I think, at the beginning. So yeah, I think today's models will become way more practical to use. The question I have is how much better do they get if you put in way more parameters? And again, I think it's also tied to this knowledge versus semantic reasoning question. And if you could separate those two, then maybe you wouldn't need to put in more parameters.
Because one interesting thing is in computer vision, we don't have a GPT-3. We have excellent models. You can use them for pretty much anything you want that are sort of moderate sized. And we haven't had like... No one's adding 100x more parameters to make it better.
In language, we do have this, but how much of it is for the reasoning versus the knowledge stuff? Right? You as a human, I can tell you... Like, I can make up a story. I can make up a sci-fi universe with my own physics laws or my own factions and tell you the rules about them. And you can then read about them and tell me the rebels should definitely blow up the Death Star or whatever. You know? You don't need to spend lots of time learning about it. So the question is, how much of that basic reasoning capacity have we maxed out with the current models, or how much more are we going to get? Yeah.
Patrick Chase:
Yeah, it's really cool, the Stable Diffusion running on the iPhone. I think LLaMA just came out, or one of those you can run on a single GPU. And I can imagine there are just really interesting use cases, exactly what you were saying.
Matei Zaharia:
Yeah, LLaMA's like a smaller transformer, but with more care going into the training process, which makes the quality, at least on the standard academic benchmarks, similar to GPT-3. Yeah, yeah. Super cool.
Closed Source vs. Open Source
Jacob Effron:
Yeah. I'm curious, given this discussion we're having about how big these models need to get: one thing I'm struck by with the momentum in the space right now is obviously you have the large foundation model companies, to use your term, that are kind of closed source, and then you've got the open source alternatives. And obviously you've spent a lot of time in the open source world. Curious how you kind of see this playing out in terms of the extent of adoption of the open source solutions versus the closed source solutions.
Matei Zaharia:
Yeah, it's a great question. I think there are a few factors here. And again, it's really hard to make predictions. So I can give you both the arguments for and against the closed source service approach. So one of the arguments for would be that it's just easier to use because it is a SaaS thing. So for example, if I wanted to train my own model to do something, I would have to first acquire some GPUs. Not super easy even in the cloud to get a lot of them, or I can buy them and wait. I actually have a bunch of high-powered GPU servers at Stanford, and I ordered a few more a month ago, and they're still not here yet. So I just got to wait for them. So that's a pain.
I then have to install software. You know? I have to call it. If I'm not using the GPUs or whatever, it's an issue. So with something that's hosted as a service, you don't have any of those. I can just make a call and get an answer back. And then if they're working behind the scenes to improve its price performance, or upgrade to the latest version of CUDA, or even improve the model itself, the architecture, I just get those benefits. So that's one strong argument. That's like the SaaS versus packaged software type argument.
And then the other one would be if they can learn a lot across customers and get this sort of unbeatable data advantage by being really good at that. And I think it remains to be seen how much of that you need, because the whole point of these, in a sense, the advantage of them, is that it's basically like in-context learning. I don't need to see a lot of stuff. You can just apply me to your new thing. But then at the same time, that means maybe you don't need that much data. But that could be the reason. And I think there'll be some areas, and it could include language models, where the best ones will be by companies that have that data moat.
On the other hand, the counter arguments, one is the best models and the best training methods and stuff like that change very rapidly. And if your thing is closed, you don't get the benefit of that open innovation community that's happening. Right? So you might invest super hard in serving and [inaudible 00:35:53] and training a transformer, and then some grad student or undergrad somewhere these days publishes a new thing that's better and then everyone switches to that and no one... Maybe transformer was awesome too, but no one else is working on your architecture and you lose out.
So that's one of the risks. So I think it remains to be seen. And there's, I think, a tendency for more foundational things, like public datasets for example. All these things are trained in large part on public datasets. There's a tendency for them to become like a commons, where no organization wants to do that whole thing on their own dime. Okay, we'll collaborate on that and we'll do something else. That's also what open source is, basically. So yeah.
Jacob Effron:
Yeah, it's really interesting whether the underlying architectures stay the same, and, alluding to the point you were mentioning earlier, whether we can find ways to encode more reasoning into these models beyond just having to stuff every bit of knowledge in there as well. It seems to have implications for whether you're going to need to throw billions of dollars at compute to get a cutting edge model.
Matei Zaharia:
Yeah.
Stanford Research
Patrick Chase:
I guess we talked a little bit about the research that you were doing with DSP and the information retrieval and bringing those together. It'd be awesome to hear more about DAWN and what the charter is there and maybe some of the other projects that you and your team are working on in the research world.
Matei Zaharia:
Yeah, so DAWN was this bigger lab at Stanford where a bunch of machine learning and systems folks got together. Actually, DAWN has basically just wrapped up. It was meant to be a five-year project, so it's kind of wrapped up now. But we brought together a bunch of faculty looking at different aspects of this. And the goal was to look at democratizing machine learning and to create this environment where we can hear from people in industry. We had a bunch of workshops and retreats and events like that with folks from industry, and we can hear from other people at Stanford who are applying ML. And we can also exchange ideas on different topics.
So for example, two of the other professors in DAWN, Kunle Olukotun and Chris Ré, are involved in SambaNova, which is a hardware company. And then Chris is also involved... His lab started Snorkel, which is a data-oriented ML company. And there are actually a whole bunch of other startups or other connections with different people from that group. So it was really nice for seeing these different perspectives. You don't often have people who work on hardware, and people who work on parallelizing algorithms like my group, and then people who actually design the algorithms, in the same room.
Yeah. I think, to me, just trying to think of what's exciting. I mean, there are a number of, actually at this point, pretty large companies that were founded out of DAWN, including Snorkel and SambaNova. And I think there are also a whole bunch of open source projects that have reached some amount of use, or models that have reached use in industry. Actually ColBERT, which is the retrieval model we worked on that is used in DSP, was one of those: we put out this model architecture and now it's becoming one of the kind of standard things people consider for DNN-based information retrieval. Yeah.
Patrick Chase:
Yeah, there's been some incredible companies, as you were mentioning, Snorkel and a bunch of other ones that have come out of Stanford. And I feel like that's kind of following the footsteps of the AMPLab and what the folks at Berkeley did. Is there any rivalry-
Matei Zaharia:
My Stanford colleagues would say Stanford was also doing that all along, but yeah-
Patrick Chase:
That's true. Well-
Matei Zaharia:
Obviously I like the big lab model there. Yeah.
Patrick Chase:
I was curious about... Is there a rivalry between Stanford and Cal for research? You were at Cal and then you became a professor at Stanford. Are people pissed about that?
Matei Zaharia:
I think people want their PhD students to go somewhere else and to get to know the other folks and to get other ideas and so on. So they don't want inbreeding of ideas in the same place. So I think it's okay. A lot of people collaborate across it too, but it's definitely interesting because I think historically Stanford was seen as the main place for startups out of academia. But now I think Berkeley is... I see more people at Berkeley honestly interested in starting companies than here. And partly it's because there were a few large ones that started out of that like Databricks, and then there were alums that were really passionate about it.
So last year, for example, a few of the Databricks co-founders and other faculty there ran a course on basically doing a startup, a course for [inaudible 00:41:17] PhD graduate students. And I think there were like a hundred people in the course or something all thinking about it, which I have not seen here. There's no CS course here on doing a startup. You can go to other departments, but I don't think there's a CS course. So yeah.
Early Databricks Stories
Patrick Chase:
Yeah, totally. One other fun thing I wanted to ask you about. Reynold Xin, when we were asking him for a different piece, he said that there was a time when you had done a ton of work helping a startup integrate Spark, this was in the early days of Spark, into their stack. And I guess after working on it for weeks, they gave you a $50 Amazon gift card. What's the story there?
Matei Zaharia:
Yeah. There were a few that I helped early on because I was really excited to see people using the stuff we were building. And when you have a piece of software that's early on, there's also risk for them. Right? It might not work. They might have wasted weeks on something that doesn't work. And it's out of a university, it might be buggy. Maybe I go and do a different thing next and stop working on Spark. So there's some risk. So I was learning a lot by working with them. And there were a few of these. It was really cool. You know? This is also... Back then I feel there were more in-person meetups and things like that in the Bay Area. I think now they're starting to come back, but a lot of stuff is not exactly at the same level. So it was cool to have these small communities and meet people there.
We gave talks about Spark at some of the big data and machine learning meetups. And then later we actually started our own regular Spark meetup and companies would ask to host it and stuff because they wanted to recruit people who know Spark or who work on it or whatever. So it was fun. But yeah, there were all kinds of things. I also got free food and stuff. I didn't really ask for much. It was just good to get people's time and to see the cool apps and honestly to be able to talk about them to others.
Jacob Effron:
Totally. I mean, one thing I was struck by when you were talking earlier about your Databricks founding story is obviously the big decision you guys made early on was deciding you were going to do cloud only. Right? And obviously I'm sure at the time that that felt like a big risk. How did you make that decision? How did you get comfortable deciding that was the way you guys were going to go forward?
Matei Zaharia:
Yeah, I think there were a few things. So first of all, just from the point of strategy, like even the short term strategy of starting a company: you're more likely to do well, I think, with this type of product if you go after new workloads, or workloads that are being migrated, than if you go and try to displace an established thing that's tied into a million other processes in that enterprise. And so the cloud was good for that. In addition, the cloud allows you to have a very fast feedback loop and less divergence of the software out there. So you don't have to support 10 versions of the software because your old customers are refusing to upgrade or whatever, which was the issue for packaged software. In the cloud, you just have the version that's deployed now, and maybe there's a second version that's rolling out when you do an update. So you only really need two versions at a time.
So we thought, okay, fast feedback loop, which is needed to develop a great product, plus this opportunity to go after greenfield applications or migrations someone else drove. We didn't convince that company to move to the cloud. You know? Amazon did, or Microsoft did. But we said, "Look, if you're going to forklift, move everything, why don't you try this thing here for this piece?" Right? So we kind of piggybacked on that.
Yeah. Long term though, and we started with it and it was going well enough that we said, "Okay, we're just going to focus entirely on this instead of going to somewhere else." We could have also, in an alternate world, we could have used that to develop a really nice product and then sold that product somewhere else in another mode, but we decided not to do that.
But in the long term, I think so much stuff is moving to SaaS services. And both for customers and for providers, they're a good arrangement, because for customers, they just fundamentally have less stuff to worry about and to administer. And for providers, you get the very fast feedback. So it's going to be a major force. I think if you are launching a new software product that is not a SaaS thing, say it's just open source bits someone downloads or a thing you install in a Docker container or something, the big competitor is going to be a managed service that does a similar thing.
Fire Round
Jacob Effron:
Yeah. No, that makes sense. Well, we always like to end our interviews with a quickfire round where we ask you a few quick questions and get your thoughts. And so maybe to start, we'd just be curious: obviously there are lots of things that are talked about on AI Twitter and in the broader ecosystem. What's one thing that you think's overhyped and one thing that you think's underhyped?
Matei Zaharia:
Yeah. I don't know if I want to say anything is overhyped. I mean, I do think... I would say... Okay, if I am to say one thing, I would say demos that work once, but where the thing doesn't reliably happen, are risky, depending on what you're trying to do. Right? In many areas it's easy to make something that works 50% of the time, 60%, 70%, but it's really hard to close that gap. So you just have to be careful about it. But on the other hand, the demos show that there's potential, that if you then do the additional engineering and stuff, you might get it to 100%. But I've heard people asking whether different technologies have what they call the self-driving car problem, where it's easy to make a demo, but it's hard to make the car actually work. It's still hard to this day. So you've got to worry about that.
And underhyped, I do think data and incorporating knowledge, or incorporating even real-time interactions between your model and something else. It's definitely becoming more relevant with things like the search engines that use LLMs and stuff. But really thinking about what you feed in to get a great result, I think there's a lot left to do there. Yeah.
Patrick Chase:
Awesome. All right. Next fire round question. What's one thing you wish you knew when starting Databricks that you know now?
Matei Zaharia:
Oh man, there are many, many different things. Yeah. I think... I don't know. I mean, yeah, I think there are just so many areas of business that I didn't know about that we learn about including sales, marketing, product management, all these things. So yeah. And I mean, the other thing is just don't panic. Basically there'll be things along the way that are challenging, but if you have a great team and a solid strategic position, you'll do well. So yeah.
Jacob Effron:
Awesome. Well, I mean, just a fascinating wide-ranging conversation. I feel like there are so many different threads we could pull on, but I know you're a busy person, so we'll leave you back to your day-to-day. I guess I'm sure folks want to learn more about Databricks, the work you're doing at Stanford. What's kind of the best way for them to dig in further?
Matei Zaharia:
Yeah, I think if you just follow me on Twitter or LinkedIn, something like that, you'll see a lot of the things I think are cool. Yeah.
Patrick Chase:
Awesome. Well, Matei, thank you so, so much for joining us. This was great.
Jacob Effron:
Super interesting.
Matei Zaharia:
Thanks so much.