“I’m generally happy to see expansions of fair use, but I’m a little bitter when they end up benefiting massive corporations that extract value at scale from the work of smaller authors,” Woods says.
Neural networks can memorize their training data and reproduce copies of it. That risk exists whether the data contains personal information, medical secrets, or copyrighted code, explains Colin Raffel, a professor of computer science at the University of North Carolina who co-authored a preprint (not yet peer-reviewed) examining similar copying in OpenAI’s GPT-2. They found that getting a model trained on a large body of text to spit out its training data was fairly trivial. But it can be difficult to predict what a model will memorize and copy. “You really only find out when you throw it out into the world and people use it and abuse it,” Raffel says. Given this, he was surprised to see that GitHub and OpenAI had chosen to train their model on code that comes with copyright restrictions.
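The kind of extraction test Raffel describes can be pictured with a toy sketch. This is a hypothetical illustration, not the method from the preprint: sample text from a model, then flag any long word n-gram that also appears verbatim in the training corpus, since long verbatim overlaps suggest memorization rather than coincidence.

```python
def find_memorized_spans(generated: str, corpus: str, n: int = 8):
    """Return word n-grams of the generated text that appear verbatim in the corpus.

    A crude memorization check: any sufficiently long span shared word-for-word
    with the training data is unlikely to have been produced by chance.
    """
    words = generated.split()
    hits = set()
    for i in range(len(words) - n + 1):
        span = " ".join(words[i:i + n])
        if span in corpus:  # verbatim overlap with training text
            hits.add(span)
    return sorted(hits)
```

In practice, researchers run checks like this at scale against the full training set; the hard part, as Raffel notes, is predicting in advance which spans a model will regurgitate.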
According to GitHub’s internal tests, direct copying occurs in roughly 0.1 percent of Copilot’s output, which the company describes as a correctable error rather than an inherent flaw of the AI model. That’s enough to cause hiccups in the legal department of any for-profit entity (“non-zero risk” is just “risk” to a lawyer), but Raffel notes that this may not be all that different from employees copy-pasting restricted code. Humans break the rules regardless of automation. Ronacher, the open source developer, adds that most of Copilot’s copying appears relatively harmless: cases where simple solutions to common problems come up again and again, or oddities like the infamous Quake code, which people have (improperly) copied into many different codebases. “You can get Copilot to trigger hilarious things,” he says. “If it’s used as intended, I think it will be less of an issue.”
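The game-engine snippet Ronacher alludes to is the famous Quake III fast inverse square root, a routine so widely copied that it keeps resurfacing in unrelated codebases. The original is C; here is a Python transliteration of the trick for illustration, a sketch rather than the verbatim original:

```python
import struct

def fast_inv_sqrt(x: float) -> float:
    """Approximate 1/sqrt(x) using the Quake III bit-level trick."""
    # Reinterpret the float's bits as a 32-bit signed integer.
    i = struct.unpack(">i", struct.pack(">f", x))[0]
    # The "magic constant" manipulates the exponent bits to get a first guess.
    i = 0x5F3759DF - (i >> 1)
    y = struct.unpack(">f", struct.pack(">i", i))[0]
    # One iteration of Newton's method refines the guess.
    return y * (1.5 - 0.5 * x * y * y)
```

Snippets like this are exactly the gray zone Ronacher describes: technically covered by the original project’s license, yet so thoroughly diffused across the internet that a code-generating model is almost bound to absorb them.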
GitHub has also indicated that it has a possible solution in the works: a way to flag these verbatim outputs when they occur, so that developers and their lawyers know not to use them commercially. But building such a system is not as simple as it sounds, Raffel notes, and a bigger problem looms: What if the output is not verbatim, but a near copy of the training data? What if only the variable names were changed, or one line was expressed differently? In other words, how many changes are needed before the system is no longer making a copy? With code-generating software still in its infancy, the legal and ethical boundaries are not yet clear.
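The near-copy problem Raffel raises can be made concrete with a minimal sketch (hypothetical, not GitHub’s actual system): a trivial rename of variables defeats exact string matching, so any flagging mechanism has to normalize the code before comparing, and deciding how aggressively to normalize is exactly the “how many changes make it no longer a copy” question.

```python
import re

# Keywords that carry structure and should survive normalization.
KEYWORDS = {"def", "return", "for", "in", "if", "else", "while", "class"}

def normalize(code: str) -> str:
    """Replace every identifier with a positional token (id0, id1, ...),
    so that renaming variables alone does not hide a copy."""
    names: dict[str, str] = {}
    def repl(match: re.Match) -> str:
        word = match.group(0)
        if word in KEYWORDS:
            return word
        return names.setdefault(word, f"id{len(names)}")
    return re.sub(r"[A-Za-z_]\w*", repl, code)

def near_copy(a: str, b: str) -> bool:
    """True if two snippets are identical up to identifier renaming."""
    return normalize(a) == normalize(b)
```

Even this crude normalizer shows the slippery slope: match on raw text and you miss renamed copies; normalize too aggressively and every short, idiomatic snippet starts to look like every other.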
Many lawyers believe that AI developers have fairly wide latitude in choosing training data, explains Andy Sellars, director of Boston University’s Technology Law Clinic. “Fair use” of copyrighted material mostly comes down to whether it is “transformed” when it is reused. There are many ways to transform a work, such as using it for parody or criticism or summarizing it — or, as courts have repeatedly found, using it as fuel for algorithms. In one prominent case, a federal court dismissed a lawsuit brought by a publishing group against Google Books, holding that its process of scanning books and displaying snippets of text to let users search through them was an example of fair use. But how that translates to AI training data is not firmly settled, Sellars adds.
It’s a little odd to put code under the same regime as books and artwork, he notes. “We treat source code as a literary work even though it bears little resemblance to literature,” he says. We may think of code as comparatively utilitarian; the task it accomplishes matters more than how it is written. But in copyright law, the key is how an idea is expressed. “If Copilot spits out an output that does the same thing as one of its training inputs — similar parameters, similar result — but spits out different code, that probably won’t implicate copyright law,” he says.
Then there is the ethics of the situation. “There’s no guarantee that GitHub has the interests of independent coders at heart,” Sellars says. Copilot depends on the work of its users, including those who have explicitly tried to prevent their work from being reused for profit, and it may also reduce demand for those same coders by automating more of programming, he notes. “We should never forget that there is no cognition in the model,” he says. It is statistical pattern matching. The insight and creativity mined from the data are all human. Some scholars have argued that Copilot underlines the need for new mechanisms to ensure that the people who produce data for AI are fairly compensated.