
Daniel Kahn Gillmor: AI as a Compression Problem

A recent article in The Atlantic makes the case that
very large language models effectively contain much of the works they’re trained on.
This article is an attempt to popularize the insights in the recent academic paper
Extracting books from production language models, by Ahmed et al.
The authors of the paper demonstrate convincingly that
well-known copyrighted textual material can be extracted from
the chatbot interfaces of popular commercial LLM services.

The Atlantic article cites a podcast quote about the Stable Diffusion AI image-generator model, saying “We took 100,000 gigabytes of images and compressed it to a two-gigabyte file that can re-create any of those and iterations of those”.
By analogy, this suggests we might think of LLMs (which work on text, not the images handled by Stable Diffusion) as a form of lossy textual compression.

The entire text of Moby Dick, the canonical Big American Novel, is merely 1.2 MiB uncompressed (and less than 0.4 MiB losslessly compressed with bzip2 -9).
It’s not hard to imagine that a model
with hundreds of billions of parameters might contain copies of such works.
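
If you want to check those Moby Dick numbers yourself, here's a minimal Python sketch. It assumes Project Gutenberg's plain-text edition (ebook #2701); that URL and the exact byte counts may drift over time, so treat the output as illustrative:

    # Rough check of the Moby Dick figures above.
    # Assumes Project Gutenberg's plain-text edition (ebook #2701);
    # the URL and exact sizes may change.
    import bz2
    import urllib.request

    url = "https://www.gutenberg.org/cache/epub/2701/pg2701.txt"
    raw = urllib.request.urlopen(url).read()
    squeezed = bz2.compress(raw, compresslevel=9)  # equivalent to bzip2 -9

    print(f"uncompressed: {len(raw) / 2**20:.2f} MiB")
    print(f"bzip2 -9:     {len(squeezed) / 2**20:.2f} MiB")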

Warning: The next paragraph contains fuzzy math with no real concrete engineering practice behind it!

Consider a hypothetical model with 100 billion parameters,
where each parameter is stored as a 16-bit floating point value.
The model weights would take 200 GB of storage.
If you were to fill the parameter space only with losslessly compressed copies of books like Moby Dick,
you could still fit half a million books, more than anyone can read in a lifetime.
And lossy compression typically yields files orders of magnitude smaller than lossless compression,
so we’re talking about millions of works effectively encoded, at the cost of some
artifacts injected into the output.
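
Spelled out as arithmetic (still fuzzy math: the 2-byte parameters and the 0.4 MiB book come from the paragraphs above, and none of this reflects how model weights actually store anything):

    # The same back-of-envelope numbers, in round figures.
    params = 100_000_000_000            # hypothetical 100B-parameter model
    bytes_per_param = 2                 # 16-bit floats
    model_bytes = params * bytes_per_param
    print(model_bytes / 10**9, "GB")    # -> 200.0 GB

    book_bytes = 0.4 * 2**20            # Moby Dick, bzip2 -9 (~0.4 MiB)
    print(int(model_bytes // book_bytes), "books")  # -> roughly half a million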

I first encountered this “compression” view of AI nearly three years ago, in Ted Chiang’s insightful
ChatGPT is a Blurry JPEG of the Web.
I was surprised that The Atlantic article didn’t cite Chiang’s piece.
If you haven’t read Ted Chiang, I strongly recommend his work,
and this piece is a great place to start.

Chiang aside, the more recent writing that focuses on the idea of
compressed works being “contained” in the model weights
seems to be used by people interested in wielding
some sort of copyright claim against the AI companies that maintain or provide access to these models.
There are many, many problems with AI today, but attacking AI companies based on copyright concerns
seems similar to going after Al Capone for tax evasion.

We should be much more concerned with the effect these projects have on
cultural homogeneity,
mental health,
labor rights,
privacy, and
social control
than whether they’re violating copyright in some specific instance.