PODCAST · technology

AI: Origins

by mcgrof

While we take AI for granted now, it's easy to forget its unique history and haphazard advances. This series reviews the principal concepts that modern neural networks are built upon, tracing them back to the earliest known related papers.


10

FIM: Filling in the Middle for Language Models

This 2022 academic paper explores Fill-in-the-Middle (FIM) capabilities in causal decoder-based language models, demonstrating that these models can learn to infill text effectively by simply rearranging parts of the training data. The authors propose a method where a middle section of text is moved to the end of a document during training, showing this data augmentation does not negatively impact the model's original left-to-right generative ability. The research highlights the efficiency of FIM training, suggesting it should be a default practice, and offers best practices and hyperparameters for optimal performance, particularly noting the superiority of character-level span selection and context-level FIM implementation. They also introduce new benchmarks to evaluate infilling performance, emphasizing the importance of sampling-based evaluations over traditional perplexity measures for gauging real-world utility.
Source: https://arxiv.org/pdf/2207.14255
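To make the rearrangement concrete, here is a minimal sketch of moving a middle span to the end of a document. The sentinel strings, the function names, and the default `fim_rate` are illustrative assumptions, not taken from the paper's code.

```python
import random

# Hypothetical sentinel strings; a real tokenizer would reserve special tokens.
PRE, SUF, MID = "<PRE>", "<SUF>", "<MID>"


def to_psm(doc, start, end):
    """Move doc[start:end] to the end: prefix, then suffix, then middle."""
    prefix, middle, suffix = doc[:start], doc[start:end], doc[end:]
    return PRE + prefix + SUF + suffix + MID + middle


def fim_augment(doc, rng, fim_rate=0.5):
    """Rearrange a fraction `fim_rate` of documents (assumed default)."""
    if rng.random() >= fim_rate:
        return doc  # left unchanged as an ordinary left-to-right example
    # Character-level span selection: two cut points drawn uniformly.
    start, end = sorted(rng.randint(0, len(doc)) for _ in range(2))
    return to_psm(doc, start, end)


if __name__ == "__main__":
    print(to_psm("def add(a, b): return a + b", 15, 21))
```

Because the middle ends up last, an ordinary left-to-right model can generate it after seeing both the prefix and the suffix.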

Aug 9, 2025

20m
9

BPE: Subword Units for Neural Machine Translation of Rare Words

This 2016 academic paper addresses the challenge of translating rare and unknown words in Neural Machine Translation (NMT), a common issue as NMT models typically operate with a fixed vocabulary while translation itself is an open-vocabulary problem. The authors propose a novel approach where rare and unknown words are encoded as sequences of subword units, eliminating the need for a back-off dictionary. They introduce an adaptation of the Byte Pair Encoding (BPE) compression algorithm for word segmentation, which allows for an open vocabulary using a compact set of variable-length character sequences. Empirical results demonstrate that this subword unit method significantly improves translation quality, particularly for rare and out-of-vocabulary words, for English-German and English-Russian language pairs. The paper compares various segmentation techniques, concluding that BPE offers a more effective and simpler solution for handling the open-vocabulary problem in NMT compared to previous word-level models and dictionary-based approaches.
Source: https://arxiv.org/pdf/1508.07909
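As a rough illustration of how BPE builds subword units, the sketch below repeatedly merges the most frequent adjacent symbol pair in a word-frequency table. The function name `learn_bpe` and the `</w>` end-of-word marker are assumptions made for this example; it is a simplified sketch, not the authors' released implementation.

```python
from collections import Counter


def learn_bpe(word_freqs, num_merges):
    """Return the list of merged symbol pairs, most frequent first."""
    # Each word starts as a tuple of characters plus an end-of-word marker.
    vocab = {tuple(w) + ("</w>",): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one symbol.
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_vocab[key] = new_vocab.get(key, 0) + freq
        vocab = new_vocab
    return merges


if __name__ == "__main__":
    print(learn_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, 5))
```

Frequent words end up as single symbols, while rare words remain split into smaller pieces that the model has already seen.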

Aug 9, 2025

16m
8

Distributed Word and Phrase Representations

This 2013 paper introduces advancements to the continuous Skip-gram model, a method for learning high-quality distributed vector representations of words. The authors present extensions like subsampling frequent words and negative sampling to enhance vector quality and training speed. A significant contribution is the method for identifying and representing idiomatic phrases as single tokens, improving the model's ability to capture complex meanings. The paper demonstrates that these word and phrase vectors exhibit linear relationships, allowing for precise analogical reasoning through simple vector arithmetic. Overall, the research highlights improved efficiency and accuracy in learning linguistic representations, especially with large datasets, by optimizing the Skip-gram architecture.
Source: https://arxiv.org/pdf/1310.4546

Aug 9, 2025

16m
7

Efficient Word Vectors for Large Datasets

This 2013 academic paper introduces two new model architectures, Continuous Bag-of-Words (CBOW) and Skip-gram, designed for efficiently computing continuous vector representations of words from vast datasets. The authors compare the quality and computational cost of these new models against existing neural network language models, demonstrating significant improvements in accuracy at a lower computational expense. A key focus is on preserving linear regularities between words, enabling the vectors to capture complex syntactic and semantic relationships that can be revealed through algebraic operations. The research highlights the scalability of these methods for large-scale parallel training, suggesting their potential to advance various Natural Language Processing (NLP) applications.
Source: https://arxiv.org/pdf/1301.3781

Aug 9, 2025

12m
6

A Neural Probabilistic Language Model

This paper, published in 2003, introduces a neural probabilistic language model designed to address the curse of dimensionality inherent in modeling word sequences. The authors propose learning a distributed representation for words, which enables the model to generalize from seen sentences to an exponential number of semantically similar, unseen sentences. This approach simultaneously learns word feature vectors and the probability function for word sequences using neural networks. The paper details the architecture of the neural network, the training process involving stochastic gradient ascent, and methods for parallel implementation to manage the computational demands of large datasets. Experimental results on two corpora demonstrate that this neural network approach significantly improves upon state-of-the-art n-gram models, particularly by leveraging longer word contexts.
Source: https://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf

Aug 8, 2025

6m
5

Softmax: Neural Networks and Maximum Mutual Information Estimation

The paper published in 1989, "Training Stochastic Model Recognition Algorithms as Networks can lead to Maximum Mutual Information Estimation of Parameters" by John S. Bridle, proposes a novel approach to pattern recognition, specifically improving Hidden Markov Models (HMMs) used in speech recognition. It focuses on discrimination-based training methods within neural networks (NNs). The paper demonstrates how modifying a multilayer perceptron's output layer to yield correct probability distributions, and replacing the standard squared error criterion with a probability-based score, is equivalent to Maximum Mutual Information (MMI) training. This method, when applied to a specially constructed network for stochastic model-based classifiers, offers a powerful way to train model parameters, exemplified by an HMM-based word discriminator called an "Alphanet." Ultimately, the research explores how NN architectures can embody the desirable traits of stochastic models and clarifies the relationship between discriminative NN training and MMI training of stochastic models.
Source: https://proceedings.neurips.cc/paper_files/paper/1989/file/0336dcbab05b9d5ad24f4333c7658a0e-Paper.pdf
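The two changes described above, an output layer that yields a proper probability distribution and a probability-based score in place of squared error, can be sketched in a few lines. The function names are hypothetical, and this is a generic softmax with a log-probability score rather than the paper's Alphanet.

```python
import math


def softmax(scores):
    """Map raw output scores to positive values that sum to 1."""
    m = max(scores)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def log_prob_score(scores, target):
    """Log-probability assigned to the correct class (higher is better)."""
    return math.log(softmax(scores)[target])


if __name__ == "__main__":
    outputs = [2.0, 1.0, 0.1]
    print(softmax(outputs))
    print(log_prob_score(outputs, target=0))
```

Training then raises `log_prob_score` for the correct class, which necessarily lowers the probability left over for the competing classes.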

Aug 8, 2025

11m
4

Back-Propagating Errors for Visual and Stereo Recognition

The paper on back-propagation was published in 1986. The source material presents back-propagation as a method for learning representations within neural networks. One document, "Learning representations by back-propagating errors," introduces the theoretical framework and mathematical underpinnings of this learning algorithm, explaining how connection weights in a network are adjusted based on the error between actual and desired outputs. The other text appears to be an excerpt from "Letters to Nature" titled "Bilateral amblyopia after a short period of reverse occlusion in kittens," which, while seemingly disparate in its title, likely contributes an applied example or a related biological context to the discussion of learning and neural pathways, possibly illustrating the plasticity of neural systems. Together, they offer insights into both the computational mechanics and potential real-world implications or biological analogues of back-propagation.
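As a toy illustration of adjusting weights from the output error, the sketch below trains a chain of one input, one hidden sigmoid unit, and one output sigmoid unit. The network shape, the squared-error measure, the learning rate, and all names are assumptions chosen for brevity.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_step(w1, w2, x, target, lr=0.5):
    """One forward and backward pass; returns updated weights and the error."""
    # Forward pass: input -> hidden unit -> output unit.
    h = sigmoid(w1 * x)
    y = sigmoid(w2 * h)
    error = 0.5 * (y - target) ** 2
    # Backward pass: the error signal at the output...
    delta_y = (y - target) * y * (1.0 - y)
    # ...is propagated back through w2 to the hidden unit.
    delta_h = delta_y * w2 * h * (1.0 - h)
    # Each weight moves against its own error gradient.
    return w1 - lr * delta_h * x, w2 - lr * delta_y * h, error


if __name__ == "__main__":
    w1, w2 = 0.5, -0.5
    for step in range(200):
        w1, w2, err = train_step(w1, w2, x=1.0, target=0.9)
    print(w1, w2, err)
```

The hidden unit never sees the target directly; it learns only from the error signal passed back through `w2`.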

Aug 8, 2025

13m
3

The Parallel Distributed Processing Perspective

This paper, published in 1986, introduces the concept of Parallel Distributed Processing (PDP) models, offering a new perspective on how human cognition works, contrasting it with traditional sequential processing. It explores how the brain handles complex tasks like perception, motor control, language understanding, and memory retrieval by simultaneously considering multiple, often ambiguous, pieces of information. The text provides concrete examples such as reaching for an object, skilled typing, stereoscopic vision, and word recognition to illustrate how interconnected processing units interact through excitatory and inhibitory signals to arrive at solutions. Furthermore, it touches upon the origins of PDP models, highlighting their physiological plausibility and their ability to learn and generalize spontaneously by adjusting connection strengths between units based on experience.
Source: https://stanford.edu/~jlmcc/papers/PDP/Chapter1.pdf

Aug 8, 2025

27m
2

The Perceptron: A Theory of Statistical Separability

This episode draws on two sources. The book Perceptrons, by Marvin L. Minsky and Seymour A. Papert, was first published in 1969; its expanded edition explores artificial intelligence, particularly pattern recognition and learning, through linear parallel predicates and the geometrical theory of linear inequalities. It discusses the historical development of neural networks and connectionism from the 1940s through the 1980s, providing mathematical strategy and analyzing perceptrons and pattern recognition. Complementing this, Frank Rosenblatt's 1958 article, "The Perceptron: A Theory of Statistical Separability in Cognitive Systems," introduces the concept of the perceptron as a brain model, focusing on sensory information storage and recognition. Rosenblatt differentiates his approach from symbolic logic and digital computer models, emphasizing a statistical approach to learning curves in neural networks. Both sources contribute significantly to the foundational understanding of early artificial intelligence and machine learning.
Sources:
1) https://rodsmith.nz/wp-content/uploads/Minsky-and-Papert-Perceptrons.pdf
2) https://www.ling.upenn.edu/courses/cogs501/Rosenblatt1958.pdf

Aug 8, 2025

19m
1

A Logical Calculus of Ideas Immanent in Nervous Activity

Perhaps the first of the papers that influenced the design of neural networks, published in 1943! "A Logical Calculus of the Ideas Immanent in Nervous Activity," by Warren S. McCulloch and Walter Pitts, is a foundational paper in fields such as cognitive science and artificial intelligence. This work models neural networks as discrete-time systems where neurons have "all-or-none" states (firing or not firing) and fire if input exceeds a threshold. The paper explores how logical propositions can represent neural activity, distinguishing between nets without and with circles (recurrent connections). It establishes that neural nets can compute Turing-computable numbers, linking the biological brain's operations to mathematical logic and computational theory.
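The all-or-none threshold unit can be sketched as a small function. The names below are hypothetical, and the code assumes a unit fires when the count of active excitatory inputs reaches the threshold and that any active inhibitory input blocks firing.

```python
def mp_neuron(excitatory, threshold, inhibitory=()):
    """Return 1 (fire) or 0 (silent) for binary inputs."""
    if any(inhibitory):
        return 0  # assumed absolute inhibition: one active input vetoes firing
    return 1 if sum(excitatory) >= threshold else 0


# Logical propositions expressed as single threshold units.
def AND(a, b):
    return mp_neuron([a, b], threshold=2)


def OR(a, b):
    return mp_neuron([a, b], threshold=1)


def NOT(a):
    # A constantly active excitatory input, vetoed whenever `a` fires.
    return mp_neuron([1], threshold=1, inhibitory=[a])


if __name__ == "__main__":
    for a in (0, 1):
        for b in (0, 1):
            print(a, b, AND(a, b), OR(a, b), NOT(a))
```

Wiring such units together, with or without loops, is what lets a net stand in for a logical expression.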

Aug 8, 2025

18m

