What’s the most important part of any digital system?

From the cloud revolution, through the big data age and into the AI era, it’s always been data. Data is the digital lifeblood of a business, and to a great extent, digital systems rise and fall based on the quality of the data they’re built on.

Now, as we move deeper into the AI era, experts are warning that we might actually run out of high-quality data to feed our enormous large language models and neural networks.

“All in all, we believe that there is about a 20% chance that the scaling (as measured in training compute) of ML models will significantly slow down by 2040 due to a lack of training data,” write the authors at Epoch AI.

To understand why this is, and how you conceivably run out of data in such a huge global playground as the Internet, we first have to understand the dimensions of what we’re talking about. For example, Stable Diffusion was reportedly built on 5.8 billion text-image pairs.

You might think: no problem! There are at least 5.8 billion images that can be scraped off the Internet easily enough. But then there’s the issue of quality.

Three Important Types of Data Quality to Consider

The first important type of data quality is detail and structure.

For example, a low-resolution, blurry image isn’t a high-value piece of information for a sophisticated AI system. And the higher-quality images that are clearer and more detailed are more likely to sit behind a paywall. (More on this later.)

What about social media? Can the system just use all of those billions of images that people are posting on sites like Facebook?

Maybe, but that points to a second criterion: bias. Scientists have identified bias problems in a lot of that data. If you remember the AI that started using racist language, and the other problems that cropped up from some early training approaches, you’ll see why social media isn’t the best source of data, either.

A third criterion is authenticity. The data involved needs to be real, verifiable, and genuinely useful to the AI engine itself.

That leads us to the next problem with freely available training data…

Data Sets Walled Off

In many ways, this new problem resembles one that human readers encountered earlier in the evolution of the Internet.

In fact, it’s quite similar, because computers aren’t the only systems that need training data. Whenever somebody surfs the web to make a decision, they’re drawing on the same sources of information, in similar ways and for similar purposes. And if you’re a human looking for information on current events, important tips for auto repair, or anything else, you run into the same problem: registration walls and paywalls.

In other words, the core information itself isn’t free. It sits in a walled garden, because the people who created it need to profit from it in some way, shape or form. Newspapers arguably learned this too late, and at their own peril, but much of their content is now behind paywalls. The same goes for magazine content. This is the most highly prized, high-quality data around, and increasingly neither humans nor automated systems can get to it, because its makers know its value.

Is Synthetic Data the Answer?

Some people have suggested that we might get around this problem by simply creating synthetic data…

First of all, what is synthetic data?

Well, essentially, it’s data that is generated automatically from other data. That gets somewhat recursive and has limited value, but there may be situations where you can better train large systems on a bigger set of data extrapolated from what’s already there. Suppose you have 100 health records from different patients, in a random sample that contains all the relevant demographics and other diversity that you need.

But say you want 1,000 health records in order to train the system. You might be able to extrapolate nine additional records from each of those original 100, and get the set you need that way (a rough sketch of this idea follows below). The problem, of course, is that those data are only as good as the core set they’re built from, and in a way, you’re just echoing your original sample.
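To make the idea concrete, here is a minimal sketch in Python of what that kind of extrapolation could look like. The field names and the simple numeric jittering are assumptions chosen for illustration, not a description of any particular production pipeline.

    import random

    # Hypothetical sketch: derive synthetic variants of one real record by
    # nudging numeric fields slightly and leaving other fields unchanged.
    # Field names (age, systolic_bp, diagnosis) are invented for illustration.
    def synthesize(record, n_variants=9, jitter=0.05):
        variants = []
        for _ in range(n_variants):
            new = dict(record)
            for key, value in record.items():
                if isinstance(value, (int, float)):
                    # Perturb numeric values by up to +/- 5 percent.
                    new[key] = round(value * (1 + random.uniform(-jitter, jitter)), 1)
            variants.append(new)
        return variants

    real = {"age": 54, "systolic_bp": 132, "diagnosis": "hypertension"}
    expanded = [real] + synthesize(real)  # one real record becomes ten in total
    print(len(expanded))                  # 10

Notice that every synthetic record is just a lightly perturbed copy of an original, which is exactly why this approach cannot add information that wasn’t in the source sample.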

So the value of synthetic data, in a lot of these use cases, may be limited.

Uses of Acquired Data Sets

So you can find authentic core data, use it to build synthetic data, or find unstructured data in strange places, and import it and clean it up. What are the systems using this data for?

These systems need large data sets for training, for validating outputs, and for different types of data experiments.

“An AI dataset assembles data points that teach algorithms to recognize patterns, make decisions, or predict future data points,” writes an author at Defined.ai. “For instance, to train a facial recognition system, you’d need thousands—or even millions—of face images. Each image in that collection, labeled with relevant information, forms a part of the dataset.”
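For illustration only (the file paths and labels below are invented), such a dataset is often nothing more exotic than a collection of labeled pairs that a training loop iterates over.

    # Hypothetical labeled dataset: each data point pairs an image with a label.
    dataset = [
        {"image": "faces/0001.jpg", "label": "person_a"},
        {"image": "faces/0002.jpg", "label": "person_b"},
        {"image": "faces/0003.jpg", "label": "person_a"},
    ]

    # A training loop would consume each labeled pair, using the label
    # as the supervision signal for the corresponding image.
    for example in dataset:
        print(example["image"], "->", example["label"])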

So whether it’s text, image or something else, the appetites of these systems for data are immense.

Now, some are suggesting that we will run out of high-quality, authentic training data within a few decades. This is worth thinking about now, because it helps us anticipate the trajectory of new AI systems as they grow more capable and approach artificial superintelligence.

Turning to the Archives

There’s one more proposed solution for a data crunch that’s getting a lot of attention.

“Developers are also searching for content outside the free online space, such as that held by large publishers and offline repositories,” writes Rita Matulionyte at The Conversation. “Think about the millions of texts published before the internet. Made available digitally, they could provide a new source of data for AI projects.”

If they can import these assets, that might help with data scarcity.

The Labor Game

In all of this hubbub over available training data, there’s one other component that we haven’t talked about – the labor required to aggregate that data in the first place.

Sometimes you can do this in an automated way, with AI systems themselves harvesting data directly. But in the early days of labeled data and supervised systems, you were employing armies of people to do that same work. Can you scale up in the same way when all of that is automated?

It’s a relevant question.

As innovators build out the next generation of AI networks, we’ll have to think about where data will come from, and how much will have been funneled into these systems in an authentic, structured way. Structured data may help more than synthetic data, but that remains to be seen. And archives certainly might come in handy. This remains an important aspect of how we as humans will co-exist with AI as we move forward.


Source: www.forbes.com…