We keep using the term “open source” in the context of large language models (LLMs) like Llama 2, but it’s not clear what we mean. I’m not referring to the fight over whether Meta should use the term “shared source” to describe its licensing for Llama 2 instead of “open source,” as RedMonk’s Steve O’Grady and others have argued.
It turns out this isn’t the right question, because there’s a much more fundamental fight. Namely, what does open source even mean in a world where LLMs (and foundation models) are developed, used, and distributed in ways that are significantly different from software?
Mehul Shah, founder and CEO of Aryn, a stealth startup that aims to reimagine enterprise search through AI, first outlined the problem for me in an interview. Shah, who spearheaded AWS’ OpenSearch business, is betting big on AI and also on open source.
Do the two go together? It’s not as simple as it first appears. Indeed, just as the open source movement had to rethink key terms like “distribution” as software shifted to the cloud, we’ll likely need to grapple with the inconsistencies introduced by applying the Open Source Definition to floating-point numbers.
Different 1s and 0s
In a long post on the topic of open source and AI, Mike Linksvayer, head of developer policy at GitHub, buried the lede: “There is no settled definition of what open source AI is.” He’s right. Not only is it not settled, but it’s hardly discussed. That needs to change.
As Shah stressed in our interview, “AI models are ostensibly just software programs, but the way they are developed, used, and distributed is unlike software.” Despite this inconsistency, we keep casually referring to things like Llama 2 as open source or as definitely not open source. What do we mean?
Some want it to mean that the software is or isn’t licensed according to the Open Source Definition. But this misses the point. The point is floating-point numbers. Or weights. Or something that isn’t quite software in the way we’ve traditionally thought about it as it relates to open source licensing.
Look under the hood of these LLMs and they’re all deep neural network models which, despite their differences, all use roughly the same architecture. It’s called the transformer architecture. Within these models you have neurons, instructions on how they’re connected, and a specification of how many layers of neurons you need. Different models call for only decoders or only encoders, or different numbers of layers, but ultimately they’re all pretty similar architecturally.
The primary difference is the numbers that connect the neurons, otherwise known as weights. When you give the model some input, those numbers determine which neurons get activated and how those activations propagate through the network. Though it’s not clear, I suspect many people think these weights are the code that Meta and others are open sourcing.
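To make the role of weights concrete, here is a minimal sketch in Python. It is a toy, not a transformer: the `forward` function, the two-neuron layout, and both weight sets are invented for illustration. The only point is that one fixed architecture produces different behavior when nothing changes but the numbers.

```python
import math

def forward(weights, x):
    """Toy fixed architecture: 1 input -> 2 hidden neurons (tanh) -> 1 output."""
    w_hidden, w_out = weights
    # Each hidden neuron's activation depends on its incoming weight.
    hidden = [math.tanh(w * x) for w in w_hidden]
    # The output is a weighted sum of the hidden activations.
    return sum(w * h for w, h in zip(w_out, hidden))

# Two hypothetical weight sets for the SAME architecture.
# They differ only in the sign of one output weight.
model_a = ([1.0, -1.0], [0.5, 0.5])
model_b = ([1.0, -1.0], [0.5, -0.5])

out_a = forward(model_a, 1.0)  # the two hidden activations cancel out
out_b = forward(model_b, 1.0)  # the two hidden activations reinforce each other

print(out_a, out_b)
```

The code in `forward` is identical for both models; everything that distinguishes them lives in the lists of floats.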
Maybe. But this is where things get messy.
As Shah points out, “If you look at all the things that are in the definition of free and open source, some of those things apply and the other things don’t.” For one, you can’t modify the weights directly. You can’t go in and change a floating-point number. You have to recompile those from somewhere else.
“I want a license on the weights themselves that allows me to build products and further models on top of them with as few restrictions as possible,” Datasette creator Simon Willison stresses. That at least clarifies where the license should apply, but it doesn’t quite resolve Shah’s more fundamental question as to whether open sourcing the weights makes sense.
Where to apply the license?
In our conversation, Shah outlined a few different ways to think about “code” in the context of LLMs. The first is to think of curated training data like the source code of software programs. If we start there, then training (gradient descent) is like compilation of source code, and the deep neural network architecture of transformer models or LLMs is like the virtual hardware or physical hardware that the compiled program runs on. In this reading, the weights are the compiled program.
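Shah's analogy can be sketched in a few lines of Python. This is an illustration under loud assumptions: the "curated training data" is a made-up handful of points on the line y = 2x + 1, the "architecture" is a single linear unit, and `train` is plain gradient descent on mean squared error. No real LLM is trained this way at this scale, but the roles map the same way: data in, weights out.

```python
# Hypothetical "source code": curated training data sampled from y = 2x + 1.
training_data = [(x, 2.0 * x + 1.0) for x in [-2.0, -1.0, 0.0, 1.0, 2.0]]

def train(data, steps=2000, lr=0.05):
    """'Compilation': gradient descent on mean squared error for y = w*x + b."""
    w, b = 0.0, 0.0
    n = len(data)
    for _ in range(steps):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# The "compiled program": two floating-point numbers.
weights = train(training_data)
print(weights)  # approximately (2.0, 1.0)
```

Someone handed only `weights` can run the model, but cannot see the data or rerun the training that produced it, which is roughly the position of a user who receives a binary without source.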
This seems reasonable but immediately raises key questions. First, that curated data is often owned by someone else.
Second, although the licenses are on the weights today, this may not work well because those weights are just floating-point numbers. Is this any different from saying you’re licensing code, which is just a bunch of 1s and 0s? Should the license be on the architecture? Probably not, as the same architecture with different weights can give you a completely different AI.
Should the license then be on the weights and architecture? Perhaps, but it’s possible to modify the behavior of the program without access to the source code through fine-tuning and instruction tuning.
Then there’s the reality that developers often distribute deltas or differences from the original weights. Are the deltas subject to the same license as the original model? Can they have completely different licenses?
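A rough sketch of what distributing deltas means, again in Python with invented numbers: `make_delta` and `apply_delta` are hypothetical helpers, and real models hold billions of weights in tensors rather than three floats in a list. The arithmetic, though, is just element-wise subtraction and addition.

```python
def make_delta(base, tuned):
    """Difference between fine-tuned weights and the original weights."""
    return [t - b for b, t in zip(base, tuned)]

def apply_delta(base, delta):
    """Rebuild the fine-tuned weights from the original weights plus a delta."""
    return [b + d for b, d in zip(base, delta)]

# Hypothetical values, chosen to be exactly representable as floats.
base_weights = [0.5, -1.25, 2.0]   # the originally licensed model
tuned_weights = [0.75, -1.0, 2.0]  # the same model after fine-tuning

delta = make_delta(base_weights, tuned_weights)  # what gets distributed
restored = apply_delta(base_weights, delta)      # what the recipient rebuilds

print(delta)     # [0.25, 0.25, 0.0]
print(restored)  # [0.75, -1.0, 2.0]
```

The delta is useless without the base weights, yet it contains none of them, which is what makes the licensing question awkward.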
See the problems? All solvable, but not simple. It’s not as clear as declaring that an LLM is open source or not.
Perhaps a better way to think about open source in the context of weights is to think of weights as the source code of software, which seems to be Willison’s interpretation. In this world, the compilation of the software comes down to its interpretation on different hardware (CPUs and GPUs). But, does the license on the weights include the neural network architecture? What do we do about diffs and variants of the model after fine-tuning?
What do we do about the training data, which is arguably as important as the weights, if not more so? How about open sourcing the process of selecting the right set of data? That process is hugely important, but it’s not currently covered by how we use open source to describe an LLM.
These aren’t academic issues. Given the explosion in AI adoption, it’s important that developers, startups, and everyone else can use open source LLMs and know what that means. Willison, for example, tells me that he’d love to better understand his rights under a licensed LLM like Llama 2.
Also, what are its limitations? What do the restrictions “on using it to help train competing models—especially wrt fine-tuning” actually mean? Willison is way ahead of most of us in terms of adoption and use of LLMs to advance software development. If he has questions, we all should.
“We are in the age of data programming,” Shah declares. But for this age to have maximum impact, we need to figure out what we mean when we call it open source.