
DeepMind: Why is AI so good at language? It's something in language itself



Can word frequency, and qualities such as polysemy, influence whether a neural network can suddenly solve tasks for which it was not specifically trained, known as "few-shot learning"? DeepMind says yes.

Tiernan Ray for ZDNet

How is a program such as OpenAI's GPT-3 neural network able to answer multiple-choice questions, or write a poem in a particular style, even though it was never programmed for those specific tasks?

According to new research by DeepMind, Google’s AI unit, it may be because human language has statistical properties that make neural networks expect the unexpected.

Natural languages, viewed from a statistical standpoint, have "non-uniform" qualities, such as words that can represent multiple things, known as polysemy, like the word "bank," meaning either a place where you put your money or a raised mound of earth. And words that sound the same can mean different things, such as "here" and "hear."

Those qualities of language are the focus of a paper posted on arXiv this month, "Data Distributional Properties Drive Emergent Few-Shot Learning in Transformers," by DeepMind scientists Stephanie C.Y. Chan, Adam Santoro, Andrew K. Lampinen, Jane X. Wang, Aaditya Singh, Pierre H. Richemond, Jay McClelland and Felix Hill.

Also: What is GPT-3? Everything your business needs to know about OpenAI’s groundbreaking AI language program

The authors begin by asking how programs such as GPT-3 can solve tasks when they are presented with kinds of queries for which they have not been explicitly trained, the phenomenon known as "few-shot learning."

For example, GPT-3 can answer multiple-choice questions without ever having been explicitly programmed for that type of question, simply by being prompted by a human user with an example of a multiple-choice question-and-answer pair.

"Large transformer-based language models are able to perform few-shot learning (also known as in-context learning) without having been explicitly trained for it," they write, referring to Google's wildly popular Transformer neural network, which is the basis of GPT-3 and Google's BERT language program.

As they explain, "We hypothesized that specific distributional properties of natural language might drive this emergent phenomenon."

The authors speculate that such large language-model programs are behaving like another class of machine learning program, known as meta-learning. Meta-learning programs, which DeepMind has explored in recent years, work by being able to model patterns of data that span different data sets. Such programs are trained to model not a single data distribution, but a distribution of data sets, as explained in prior research by team member Adam Santoro.
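
As a rough illustration of that distinction, here is a minimal, hypothetical sketch (not code from the paper): a standard supervised learner keeps sampling from one fixed dataset with one fixed item-label mapping, whereas a meta-learner is handed a brand-new mapping, in effect a new dataset, on every episode.

```python
import random

def sample_episode(num_classes=2, shots=4):
    """Hypothetical meta-learning episode: draw a fresh 'dataset' by pairing
    items with labels that are only valid within this episode."""
    items = random.sample(range(100), num_classes)    # novel items for this episode
    labels = random.sample(range(10), num_classes)    # labels reshuffled per episode
    mapping = dict(zip(items, labels))
    support = [(item, mapping[item]) for item in items for _ in range(shots)]
    query = random.choice(items)
    return support, query, mapping[query]

# Ordinary supervised training would reuse one fixed mapping forever;
# meta-training samples a new mapping (a new "dataset") each episode,
# so the model must learn *how to learn* the mapping from the support set.
for episode in range(3):
    support, query, answer = sample_episode()
    print(f"episode {episode}: query={query}, answer={answer}")
```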

Also: OpenAI’s massive GPT-3 hints at the limits of language models for AI

The key here is the idea of different data sets. All of language's irregularities, they surmise, such as polysemy, and the "long tail" of language, the fact that speech contains many words that occur with relatively low frequency, mean that each strange fact of language acts like a distinct data distribution.

In fact, language, they write, is something in between supervised training data, with its regular patterns, and meta-learning, with its varied data:

As in supervised training, items (words) recur and item-label mappings (e.g., word meanings) are somewhat fixed. At the same time, the long-tailed distribution ensures that there are many rare words that recur only infrequently across context windows, but can be bursty (recur repeatedly) within a context window. We can also see synonyms, homonyms, and polysemy as weaker versions of the completely unfixed item-label mappings used in few-shot meta-training, where the mappings change on every episode.
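
To make the "bursty" idea concrete, here is a small hypothetical sketch (not from the paper): a context window in which one otherwise rare class recurs several times, even though across the whole corpus it appears only infrequently.

```python
import random

def make_bursty_window(common_classes, rare_classes, length=16, burst_size=3):
    """Hypothetical 'bursty' context window: one rare class is repeated
    several times within the window, even though it is rare overall."""
    rare = random.choice(rare_classes)
    window = [rare] * burst_size                          # the burst
    window += random.choices(common_classes, k=length - burst_size)
    random.shuffle(window)
    return window

print(make_bursty_window(common_classes=list(range(10)),
                         rare_classes=list(range(1000, 1600))))
```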

To test the hypothesis, Chan and colleagues take a surprising approach: they don't actually work with language tasks. Instead, they train a Transformer neural network to solve a visual task, called Omniglot, introduced in 2016 by scholars from NYU, Carnegie Mellon, and MIT. Omniglot challenges a program to assign the appropriate categorical label to any of 1,623 handwritten characters.

[Image: DeepMind's modified Omniglot challenge, May 2022]

In the case of Chan et al., they turn the labeled Omniglot challenge into a few-shot task by randomly shuffling the glyphs' labels, so that the neural network learns on each "episode":

Unlike in training, where the labels were fixed across all sequences, the labels for these two image classes are randomly re-assigned for each sequence […] Because the labels are randomly re-assigned for each sequence, the model must use the context in the current sequence in order to make label predictions for the query image (a 2-way classification problem). Unless otherwise specified, few-shot learning was always evaluated on held-out image classes that were never seen in training.
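
As a rough sketch of what such an evaluation "episode" might look like (hypothetical code, not the authors' implementation), the labels for two held-out classes are scrambled anew for every sequence, so only the in-context pairs reveal the mapping:

```python
import random

def make_fewshot_sequence(class_a_images, class_b_images, shots=4):
    """Hypothetical few-shot evaluation sequence: two held-out image classes
    receive fresh, random labels that hold only for this sequence."""
    labels = [0, 1]
    random.shuffle(labels)                     # labels re-assigned per sequence
    label_a, label_b = labels

    context = []
    for img_a, img_b in zip(random.sample(class_a_images, shots),
                            random.sample(class_b_images, shots)):
        context += [(img_a, label_a), (img_b, label_b)]
    random.shuffle(context)                    # interleave the two classes

    # The query comes from one of the two classes; its correct label can only
    # be inferred from the context pairs above (a 2-way classification problem).
    if random.random() < 0.5:
        query, answer = random.choice(class_a_images), label_a
    else:
        query, answer = random.choice(class_b_images), label_b
    return context, query, answer

glyphs_a = [f"glyph_A_{i}" for i in range(10)]   # stand-ins for held-out images
glyphs_b = [f"glyph_B_{i}" for i in range(10)]
context, query, answer = make_fewshot_sequence(glyphs_a, glyphs_b)
```

Because the label assignment changes on every call, a model that had merely memorized fixed glyph-label pairs during training would be at chance here; it has to read the mapping out of the context.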

In this way, the authors are using visual data, the glyph strokes, to capture the non-uniform qualities of language. "At training time, we sequenced the Omniglot images and labels with various language-inspired distributional properties," they write. For example, they incrementally increase the number of class labels that can be assigned to a given glyph, to approximate the quality of polysemy.
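
A minimal sketch of that manipulation (hypothetical, assuming a simple integer "multiplicity" setting) might give each glyph class a pool of possible labels and sample one of them whenever the glyph appears in a training sequence:

```python
import random

def build_label_pools(num_classes, multiplicity):
    """Hypothetical polysemy-like setup: each glyph class is allowed
    `multiplicity` different labels instead of exactly one."""
    return {c: [c * multiplicity + k for k in range(multiplicity)]
            for c in range(num_classes)}

def sample_training_pair(label_pools):
    """Each time a glyph class appears, one of its allowed labels is drawn at
    random, so the glyph-to-label mapping is one-to-many, like a word with
    several meanings."""
    glyph_class = random.choice(list(label_pools))
    label = random.choice(label_pools[glyph_class])
    return glyph_class, label

pools = build_label_pools(num_classes=5, multiplicity=3)   # "polysemy factor" of 3
print(sample_training_pair(pools))
```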

"Then, in evaluation, we assess whether these properties give rise to few-shot learning."

What they discovered was that as they multiplied the number of labels for a given glyph, the neural network got better at few-shot learning. "We found that increasing this 'polysemy factor' (the number of labels assigned to each word) also increased few-shot learning," as Chan and colleagues put it.

"In other words, making the generalization problem harder actually made few-shot learning emerge more strongly."

At the same time, there is something about the particular structure of the Transformer neural network that helps it attain few-shot learning, Chan and colleagues found. They tested "a vanilla recurrent neural network," they write, and found that such a network never gains the ability to learn few-shot.

"Transformers show a significantly larger bias towards few-shot learning than recurrent models."

The authors conclude that both the qualities of the data, such as the long tails of language, and the nature of neural networks, such as the Transformer structure, matter. It is not one or the other but both.

The authors list several avenues for future exploration. One is the connection to human cognition, since human infants demonstrate what appears to be few-shot learning.

For example, infants rapidly learn the statistical properties of language. Could these distributional features help infants acquire the ability to learn rapidly, or serve as useful pre-training for later learning? And do similar non-uniform distributions in other domains of experience, such as vision, also play a role in that development?

It should be noted that the present work is not a test of language at all. Instead, it aims to simulate the presumed statistical properties of language by recreating that non-uniformity in visual data, the Omniglot images.

The authors do not address whether the translation from one modality to another has any effect on the significance of their work. Instead, they write that they look forward to extending their work to more aspects of language.

They write: "The above results suggest interesting lines of future research," including, "How do these data distributional properties interact with supervised losses versus reinforcement learning?" and what the effects are of "using symbolic inputs, training on predicting the next token or masked tokens, and having the meanings of words defined by their context?"


