Artificial Discourse Podcast - All Episodes

41

Stronger Models are NOT Stronger Teachers for Instruction Tuning

This research paper investigates the impact of different large language models (LLMs) used as "teachers" to generate synthetic responses for instruction tuning. The authors demonstrate a surprising phenomenon they call the "Larger Models' Paradox," where larger and supposedly "stronger" teacher models do not always lead to improved instruction-following abilities in smaller base models. They propose a novel metric called Compatibility-Adjusted Reward (CAR) to better predict the effectiveness of teacher models, taking into account the compatibility between the teacher and the base model being fine-tuned. The study challenges the common assumption that larger LLMs are always better teachers and suggests that a more nuanced understanding of compatibility is needed for successful instruction tuning.

Nov 25, 2024

13m

40

Large Language Models Can Self-Improve in Long-context Reasoning

This research paper investigates the potential for large language models (LLMs) to self-improve in long-context reasoning, which involves processing and understanding complex information spread across long stretches of text. The authors propose a novel approach called SEALONG that leverages the LLMs' ability to generate multiple outputs for a given question and then scores these outputs using a method called Minimum Bayes Risk (MBR). The MBR approach prioritizes outputs that align better with each other, thereby filtering out outputs that might be incorrect or hallucinatory. SEALONG then uses these high-scoring outputs for further training, either through supervised fine-tuning or preference optimization. The authors demonstrate through extensive experiments that SEALONG significantly improves the long-context reasoning performance of LLMs without requiring expert model annotations or human labeling.
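The consensus idea behind MBR scoring can be sketched in a few lines: each sampled output is scored by how well it agrees with the other samples, and the highest-scoring one is kept. This is a minimal sketch, not the SEALONG implementation; the token-overlap `jaccard` similarity and the sample strings are assumptions chosen only to keep the example dependency-free.

```python
def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity (an assumed stand-in for a learned similarity)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def mbr_scores(outputs: list[str]) -> list[float]:
    """Score each output by its mean similarity to every other output."""
    n = len(outputs)
    if n < 2:
        return [0.0] * n
    return [
        sum(jaccard(outputs[i], outputs[j]) for j in range(n) if j != i) / (n - 1)
        for i in range(n)
    ]


# Hypothetical samples for one question; the third disagrees on the year.
samples = [
    "the treaty was signed in 1648",
    "it was signed in 1648",
    "the treaty was signed in 1848",
]
scores = mbr_scores(samples)
best = samples[scores.index(max(scores))]
```

Here `best` is the first sample, because it overlaps most with the other two. High-scoring outputs like this one would then serve as training targets, and low-scoring ones could serve as rejected examples for preference optimization.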

Nov 22, 2024

11m

39

LLaMA-Mesh: Unifying 3D Mesh Generation with Language Models

This paper introduces LLaMA-Mesh, a new method for generating 3D meshes using large language models (LLMs). The authors address the challenge of tokenizing 3D mesh data for LLMs by representing the mesh data as plain text using the OBJ file format, a standard text-based format for 3D models. This approach allows for direct integration with LLMs without modifying the vocabulary or tokenizers, minimizing additional training overhead. The resulting model, a fine-tuned LLaMA, can generate 3D meshes from textual prompts, produce interleaved text and 3D mesh outputs, and understand and interpret 3D meshes. LLaMA-Mesh achieves mesh generation quality comparable to that of models trained from scratch while maintaining strong text generation abilities, demonstrating the potential for LLMs to become universal generative tools for multiple modalities.

Nov 21, 2024

18m

38

LLaVA-o1: Let Vision Language Models Reason Step-by-Step

The researchers introduce LLaVA-o1, a vision language model designed to perform structured reasoning by breaking down problem-solving into four distinct stages: summary, caption, reasoning, and conclusion. They compiled a new dataset, LLaVA-o1-100k, and proposed a stage-level beam search method to improve model performance during inference. Experimental results demonstrate that LLaVA-o1 outperforms existing open-source and even some closed-source models on multimodal reasoning benchmarks, emphasizing the effectiveness of its structured reasoning approach.

Nov 20, 2024

10m

37

BlueLM-V-3B: Algorithm and System Co-Design for Multimodal Large Language Models on Mobile Devices

This paper presents BlueLM-V-3B, a multimodal large language model (MLLM) designed specifically for mobile devices. The researchers address the challenges of deploying large models on mobile phones, such as limited memory and processing power, by implementing a novel algorithm and system co-design approach. This includes a dynamic resolution scheme that optimizes image processing and a token downsampler that reduces the number of image tokens to improve inference speed. The paper emphasizes BlueLM-V-3B's superior performance compared to other models of similar size and its high deployment efficiency on mobile devices.

Nov 19, 2024

13m

36

CORAL: Benchmarking Multi-turn Conversational Retrieval-Augmentation Generation

This paper introduces CORAL, a novel benchmark dataset for evaluating Retrieval-Augmented Generation (RAG) systems in a multi-turn conversational setting. The authors highlight the limitations of existing datasets in assessing conversational RAG and detail CORAL's unique features, including open-domain coverage, knowledge intensity, free-form responses, topic shifts, and citation labeling. They explain how CORAL is derived from Wikipedia, automatically converting its content into conversational formats, and outline the three core tasks it supports: conversational passage retrieval, response generation, and citation labeling. The authors present a unified framework for evaluating conversational RAG methods and report on experiments conducted on CORAL, showcasing the performance of different conversational search and generation models.

Nov 13, 2024

27m

35

A Survey of Small Language Models

This research paper surveys small language models (SLMs) and explores their applications, design, training, and model compression techniques. The authors explain that while large language models (LLMs) have proven effective, their resource demands have led to the development of SLMs, which are more efficient and can be deployed on a wider range of devices. The paper examines various techniques to optimize SLMs, including lightweight model architectures, efficient self-attention mechanisms, and model compression strategies such as pruning, quantization, and knowledge distillation. The authors discuss the challenges associated with SLMs, such as hallucination, bias, and energy consumption, and offer suggestions for future research. The goal of this work is to provide a comprehensive resource for researchers and practitioners working with small language models.

Nov 12, 2024

21m

34

Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization

This research explores whether transformers, a type of neural network architecture, can learn to reason implicitly over knowledge. The authors find that transformers can learn to reason implicitly, but only through a phenomenon called grokking, where training extends far beyond overfitting. The study investigates two reasoning types: composition and comparison. They find that while the transformers generalize well on in-distribution examples for both types, they struggle with out-of-distribution generalization for composition but succeed for comparison. Through mechanistic analysis of the model’s internals, they discover that different circuits are formed during grokking for each reasoning type, which explains the varying levels of systematicity. The authors also demonstrate the potential of parametric memory for complex reasoning tasks with large search spaces, showing that a fully grokked transformer can achieve near-perfect accuracy, while state-of-the-art LLMs with non-parametric memory fail.

Nov 11, 2024

12m

33

The Llama 3 Herd of Models

This research paper details the development of Llama 3, a large language model with improved capabilities in language understanding, code generation, mathematical reasoning, and multimodality. The paper emphasizes the importance of high-quality data, scaling up compute power, and using simple, efficient methods to achieve optimal results. It also explores the integration of vision and speech capabilities into Llama 3, highlighting the benefits of a compositional approach. The paper concludes with a discussion of safety measures implemented in Llama 3 to mitigate potential risks and ensure responsible use of the model.

Nov 10, 2024

26m

32

Kolmogorov-Arnold Network (KAN)

This paper introduces Kolmogorov-Arnold Networks (KANs). Unlike traditional Multi-Layer Perceptrons (MLPs), which have fixed activation functions on nodes, KANs have learnable activation functions on edges. This seemingly simple change allows KANs to outperform MLPs in terms of accuracy and interpretability, particularly for small-scale artificial intelligence and scientific tasks. The paper explores the mathematical foundations of KANs, highlighting their ability to overcome the curse of dimensionality and achieve faster neural scaling laws than MLPs. Additionally, it showcases KANs' potential for scientific discovery by demonstrating their effectiveness in uncovering mathematical relations in knot theory and identifying phase transition boundaries in condensed matter physics.

Nov 9, 2024

15m

31

MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models

The document describes the development of MMIE, a large-scale benchmark designed to evaluate the performance of Large Vision-Language Models (LVLMs) in interleaved multimodal comprehension and generation tasks. MMIE comprises a dataset of 20,000 meticulously curated multimodal queries across various domains, including mathematics, coding, and literature, which are designed to challenge LVLMs to produce and interpret both images and text in arbitrary sequences. The authors also propose a reliable automated evaluation metric for MMIE, leveraging a scoring model fine-tuned with human-annotated data and systematic evaluation criteria. Extensive experiments demonstrate the effectiveness of the benchmark and metrics, revealing significant room for improvement in the development of interleaved LVLMs. The paper provides detailed insights into the benchmark's construction, evaluation methods, and error analysis, offering valuable guidance for future research in multimodal learning.

Nov 8, 2024

18m

30

Thinking LLMs: General Instruction Following with Thought Generation

This research paper proposes a novel method called Thought Preference Optimization (TPO) to train large language models (LLMs) to "think" before responding to user instructions. TPO utilizes a preference-based training framework where LLMs generate internal thoughts alongside their responses, and these thoughts are then optimized based on the quality of the resulting responses. The authors argue that this approach, unlike previous methods relying on direct supervision, allows LLMs to develop thinking abilities for a broader range of tasks beyond traditional reasoning and problem-solving. They demonstrate the effectiveness of TPO on benchmark datasets and observe that LLMs trained with TPO show improvements even in non-reasoning categories like language and translation, marketing, and health, highlighting the potential for thinking-based LLMs in diverse applications.

Nov 7, 2024

11m

29

VIT-LENS: Towards Omni-modal Representations

The paper, "VIT-LENS: Towards Omni-modal Representations," introduces a novel approach to enable Artificial Intelligence (AI) agents to perceive information from various modalities beyond just vision and language. It proposes a method that leverages a pre-trained visual transformer (ViT) to efficiently encode information from diverse modalities, such as 3D point clouds, depth, audio, tactile, and electroencephalograms (EEG). By aligning these modalities with a shared embedding space, VIT-LENS unlocks a range of capabilities for AI agents, including any-modality captioning, question answering, and image generation. The paper presents extensive experimental results demonstrating that VIT-LENS achieves state-of-the-art performance on various benchmark datasets and outperforms prior methods in understanding and interacting with diverse modalities.

Nov 6, 2024

17m

28

Parallelizing Linear Transformers with the Delta Rule over Sequence Length

This research paper proposes a new method for efficiently training linear transformers, which are a type of neural network that uses linear attention to process sequences of data. Unlike traditional transformers, which have quadratic complexity in sequence length, linear transformers can process long sequences in linear time, making them more efficient for certain tasks. However, existing linear transformers have been shown to struggle with tasks that require long-range dependencies or the ability to retrieve information from a large context. The authors address this limitation by introducing a novel algorithm called DeltaNet, which utilizes a delta rule-like update to improve associative recall over long contexts. DeltaNet is parallelized across sequence length using a memory-efficient representation for computing products of Householder matrices, making it suitable for training on modern hardware. The authors demonstrate that DeltaNet outperforms other linear-time baselines, particularly on recall-intensive tasks, and that DeltaNet can also be effectively combined with other types of attention mechanisms to create hybrid models that achieve even better performance.
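The delta-rule update mentioned above can be illustrated in its plain sequential form. This is a minimal sketch under stated assumptions: it shows only the recurrent update S <- S + beta * (v - S k) k^T on nested lists, not the paper's parallel algorithm over sequence length, and the helper names `delta_rule_step` and `read` are illustrative.

```python
def delta_rule_step(S, k, v, beta):
    """One update S <- S + beta * (v - S k) k^T, with S a d_v x d_k nested list."""
    # Value currently retrieved for key k.
    pred = [sum(S[i][j] * k[j] for j in range(len(k))) for i in range(len(S))]
    # Move the stored value toward v along the direction of k.
    return [
        [S[i][j] + beta * (v[i] - pred[i]) * k[j] for j in range(len(k))]
        for i in range(len(S))
    ]


def read(S, q):
    """Retrieve the value associated with query q: o = S q."""
    return [sum(row[j] * q[j] for j in range(len(q))) for row in S]


S = [[0.0, 0.0], [0.0, 0.0]]
k = [1.0, 0.0]                                    # unit-norm key
S = delta_rule_step(S, k, [3.0, -1.0], beta=1.0)  # store a first value
S = delta_rule_step(S, k, [5.0, 2.0], beta=1.0)   # same key, new value
out = read(S, k)
```

With a unit-norm key and `beta=1.0`, the second write replaces the first, so `out` is `[5.0, 2.0]`. A purely additive linear-attention update would instead return the sum of both values, which is one way to see why the delta rule helps associative recall.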

Nov 5, 2024

13m

27

How does Architecture Influence the Base Capabilities of Pre-trained Language Models? A Case Study Based on FFN-Wider and MoE Transformers

This research explores how the architecture of pre-trained language models influences their base capabilities, specifically focusing on the FFN-Wider Transformer architecture. The study identifies a key factor in model performance: the contribution ratio of the Multi-Head Attention (MHA) layer, which acts as a combination function that reflects the model's ability to combine linguistic features. The authors demonstrate that FFN-Wider Transformers reduce the contribution ratio of this combination function, leading to a decline in base capabilities. To address this issue, they propose a Combination Enhanced Architecture (CEA) that redistributes the wider FFN layer, enhancing the combination function and ultimately improving base capabilities. The effectiveness of CEA is further validated by its successful application to Mixture of Experts (MoE) Transformers, highlighting its potential for broader architecture improvement.

Nov 4, 2024

18m

26

RLEF: Grounding Code LLMs in Execution Feedback with Reinforcement Learning

This research paper proposes a new method called Reinforcement Learning from Execution Feedback (RLEF) to improve the ability of large language models (LLMs) to generate code that successfully completes tasks. The authors demonstrate the effectiveness of RLEF by training LLMs on a challenging competitive programming benchmark called CodeContests. RLEF trains the models to iteratively generate code based on the feedback received from running their code against test cases. The results show that RLEF significantly improves solve rates and reduces the number of code samples needed compared to previous approaches, achieving state-of-the-art performance. The paper also investigates the inference-time behavior of RLEF-trained LLMs, highlighting their ability to effectively learn from feedback and make targeted improvements over multiple code generations.

Nov 3, 2024

24m

25

Paraphrase Types Elicit Prompt Engineering Capabilities

This research paper investigates how variations in the phrasing of prompts impact the performance of large language models (LLMs) across 120 tasks and five models. The study systematically analyzes six families of paraphrase types, including morphology, syntax, lexicon, lexico-syntax, discourse, and others, to determine their influence on model outputs. The findings demonstrate a potential for significant performance gains when prompts are adapted using specific paraphrase types, particularly morphology and lexicon changes. The research also considers factors like prompt complexity, temperature, and proximity to training data, concluding that smaller models are more sensitive to paraphrase changes and can potentially achieve comparable performance to larger models through prompt engineering.

Nov 2, 2024

8m

24

LLaMA-Omni

This paper introduces LLaMA-Omni, a model designed to enable low-latency interaction between speech and large language models (LLMs). The model integrates a pretrained speech encoder, a speech adaptor, an LLM, and a streaming speech decoder, allowing it to generate text and speech responses directly from speech instructions with minimal latency. To enhance the model's performance, the authors create a speech instruction dataset called InstructS2S-200K containing 200,000 speech instructions and corresponding speech responses. Experimental results demonstrate that LLaMA-Omni provides better responses in both content and style than previous speech-language models, with a response latency as low as 226 milliseconds. Furthermore, the model's training process is efficient, requiring less than 3 days on 4 GPUs.

Nov 1, 2024

20m

23

Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models

This paper details the development of a new family of vision-language models (VLMs) called Molmo. Molmo is notable for its open-weight and open-data approach, meaning the model's weights, training data, and code are publicly available. This contrasts with the current trend of proprietary VLMs, which keep their models closed. Molmo achieves state-of-the-art performance by utilizing a novel image captioning dataset called PixMo, collected from human annotators using speech-based descriptions. This approach avoids reliance on synthetic data generated by proprietary systems, enabling the creation of performant VLMs without distilling closed models. The authors highlight Molmo's potential for various tasks, including question answering and image-based navigation.

Oct 31, 2024

18m

22

Low-Rank Adaptation (LoRA)

This technical paper proposes a novel technique called Low-Rank Adaptation (LoRA) for adapting large language models (LLMs) to specific downstream tasks. LoRA addresses the cost of full fine-tuning, which updates all model parameters, by freezing the pre-trained weights and injecting trainable low-rank decomposition matrices into each layer of the Transformer architecture. This significantly reduces the number of trainable parameters, resulting in a substantial decrease in storage requirements, memory usage, and training time. The paper shows that LoRA performs on par with or better than full fine-tuning on various tasks, including natural language understanding (NLU) and generation (NLG), while providing additional benefits such as efficient task switching and a lower hardware barrier to entry. The paper concludes by investigating the low-rank structure of model updates, providing insights into the effectiveness of LoRA and the underlying mechanisms of model adaptation.
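The parameter savings follow directly from the shapes involved: for a d x k weight matrix W, LoRA trains a d x r matrix B and an r x k matrix A in place of W itself. This is a minimal sketch, assuming illustrative dimensions (4096 x 4096, rank 8) and hypothetical helper names; it is not a substitute for a real fine-tuning library.

```python
def lora_param_counts(d: int, k: int, r: int) -> tuple[int, int]:
    """Return (full fine-tuning params, LoRA trainable params) for one d x k weight."""
    return d * k, r * (d + k)


def matvec(M, x):
    """Multiply a nested-list matrix by a vector."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


def lora_forward(W, A, B, x, alpha, r):
    """h = W x + (alpha / r) * B (A x); W is frozen, only A and B would be trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / r) * u for b, u in zip(base, update)]


# Illustrative sizes (assumed): one 4096 x 4096 projection adapted at rank 8.
full, lora = lora_param_counts(d=4096, k=4096, r=8)

# With B initialised to zero, the adapted layer starts out identical to the base layer.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]       # r x k with r = 1
B = [[0.0], [0.0]]     # d x r, zero-initialised
h = lora_forward(W, A, B, [2.0, 3.0], alpha=1.0, r=1)
```

For these assumed sizes, `full` is 16,777,216 and `lora` is 65,536, a 256-fold reduction in trainable parameters for that matrix. Because B starts at zero, `h` equals the base output `[2.0, 3.0]` before any training.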

Oct 30, 2024

16m

21

MCTrack: A Unified 3D Multi-Object Tracking Framework for Autonomous Driving

MCTrack is a new method for tracking multiple 3D objects, particularly designed for autonomous driving. The authors claim that MCTrack outperforms existing methods across popular datasets like KITTI, nuScenes, and Waymo. The paper also standardizes the format of perception results across various datasets, making it easier for researchers to focus on algorithm development. Additionally, the paper proposes novel evaluation metrics that assess the motion information output by tracking systems, such as velocity and acceleration, which is crucial for downstream tasks like trajectory prediction and planning.

Oct 29, 2024

15m

20

Synatra: Turning Indirect Knowledge into Direct Demonstrations for Digital Agents at Scale

This paper proposes Synatra, a system for generating large amounts of training data for digital agents. The goal is to overcome the problem of expensive human annotation by using indirect knowledge like online tutorials and random web pages as input. Synatra leverages LLMs to transform this indirect knowledge into direct demonstrations in the form of action sequences, which are then used to fine-tune an LLM for web navigation tasks. The paper presents empirical results showing that agents trained with Synatra outperform other models of comparable size, even surpassing GPT-3.5 on certain benchmarks. However, the authors also acknowledge limitations, such as the potential for overfitting to specific formats and the need to address computational costs.

Oct 28, 2024

10m

19

Inheritune: Training Smaller Yet More Attentive Language Models

This research paper investigates the phenomenon of "lazy layers" in large language models (LLMs). Lazy layers occur when deeper layers in LLMs lose the ability to learn meaningful information, leading to a decline in model performance. The authors introduce a new training technique called Inheritune, which addresses this issue by inheriting the initial layers of a larger, pre-trained model and gradually growing the smaller model until it matches or surpasses the performance of the original model. Experiments show that Inheritune effectively trains smaller, high-performing models, demonstrating its potential to make LLM training more efficient and accessible. The paper also analyzes the impact of Inheritune on various model sizes and data regimes, highlighting its efficiency and potential for developing high-quality models even in low-data settings.

Oct 27, 2024

15m

18

Geometric Structure and Polynomial-time Algorithm of Game Equilibria

This research paper proposes a polynomial-time approximation scheme (PTAS) for finding perfect equilibria in dynamic games. The authors present this as a significant contribution to game theory, because whether such an algorithm exists has long been an open question. They introduce a new geometric object called the "equilibrium bundle," which allows them to formalize perfect equilibria as zero points of its canonical section. The paper then presents a hybrid algorithm combining dynamic programming and an interior point method that iteratively searches for perfect equilibria on the equilibrium bundle. The algorithm achieves a weak approximation in fully polynomial time, meaning that it can find a policy that is close to an actual perfect equilibrium. According to the authors, this also implies that problems in the complexity class PPAD, previously believed to include intractable problems, can be solved efficiently.

Oct 26, 2024

23m

17

MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering

MLE-bench is a benchmark that evaluates the performance of AI agents on machine learning engineering tasks. The benchmark comprises 75 real-world Kaggle competitions, each with a dataset, description, and grading code. The authors evaluated various language models and agent frameworks on MLE-bench, finding that the best-performing agent achieved at least the level of a Kaggle bronze medal in 16.9% of the competitions. The paper discusses various ways to improve agent performance, such as increasing the number of attempts and the amount of compute available. It also explores potential contamination issues that might affect the benchmark's results. The benchmark is open-source and aims to promote research into the capabilities of agents for automating ML engineering.

Oct 25, 2024

12m

16

GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models

This research paper investigates the mathematical reasoning abilities of large language models (LLMs) and finds that their performance on mathematical problems is not as robust as initially thought. The authors introduce a new benchmark, GSM-Symbolic, which generates diverse versions of math problems to assess LLMs' reasoning skills more thoroughly. Their findings indicate that LLMs struggle to handle variations in numerical values, exhibit a performance decline with increased question complexity, and are vulnerable to irrelevant information within a problem, suggesting their reasoning capabilities might be based on pattern matching rather than true logical understanding. This highlights the limitations of current LLMs in performing genuine mathematical reasoning and emphasizes the need for further research to develop more robust and reliable models.

Oct 24, 2024

14m

15

Differential Transformer

The research paper introduces the Differential Transformer, a new architecture for large language models that aims to improve the performance of these models by reducing the amount of attention they pay to irrelevant information. This architecture accomplishes this through a differential attention mechanism that calculates attention scores as the difference between two separate attention maps. This process effectively cancels out noise in the attention scores, encouraging the model to focus on more relevant information. The paper highlights the potential benefits of this architecture through various experiments, showcasing its superior performance in tasks like long-context modeling, key information retrieval, and in-context learning, while also mitigating issues like hallucination and activation outliers.
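The subtraction at the heart of differential attention can be sketched on a single query's scores: two softmax maps are computed and the second is subtracted from the first, scaled by a factor. This is a minimal sketch, assuming a fixed scalar `lam` and hand-picked scores purely for illustration; in a trained model the two score sets would come from separate query and key projections.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]


def differential_weights(scores_a, scores_b, lam):
    """Per-query attention weights: softmax(scores_a) - lam * softmax(scores_b)."""
    pa, pb = softmax(scores_a), softmax(scores_b)
    return [a - lam * b for a, b in zip(pa, pb)]


# The first map peaks on token 2; the second is flat, so subtracting it
# lowers every weight by the same amount and the peak stands out more.
w = differential_weights([0.0, 0.0, 2.0, 0.0], [0.0, 0.0, 0.0, 0.0], lam=0.5)
```

The resulting weights sum to `1 - lam` and no longer form a probability distribution; weights on the background tokens can even turn slightly negative, which is how attention mass that both maps place on irrelevant tokens gets cancelled.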

Oct 23, 2024

17m

14

Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone

This technical report from Microsoft introduces phi-3, a new series of language models designed for various tasks. The core of the report focuses on phi-3-mini, a small but highly capable model that rivals models like Mixtral and GPT-3.5 in performance despite being compact enough to run locally on a smartphone. This achievement is attributed to the use of optimized training data that prioritizes knowledge quality and reasoning ability over raw data quantity. The report also presents larger models in the phi-3 series, including models with multilingual and long-context capabilities, as well as a multimodal model called phi-3.5-Vision, capable of processing both text and images. The report highlights the models' strong performance on various benchmarks and emphasizes their safety and alignment with Microsoft's Responsible AI principles.

Oct 4, 2024

12m

13

YOLO: You Only Look Once: Unified, Real-Time Object Detection

This paper introduces YOLO (You Only Look Once), a novel approach to object detection. The method frames object detection as a regression problem, allowing a single neural network to predict bounding boxes and associated class probabilities directly from full images in a single evaluation. YOLO is very fast, processing images at 45 frames per second, which makes it suitable for real-time applications. Compared to existing systems, YOLO exhibits a lower rate of false positives and demonstrates strong generalization to new domains, making it well suited to applications such as self-driving cars and assistive devices.

Oct 4, 2024

7m

12

Whisper: Robust Speech Recognition via Large-Scale Weak Supervision

This research paper introduces Whisper, a speech recognition system trained on a massive, weakly supervised dataset of 680,000 hours of audio. The paper argues that scaling weakly supervised training has been underappreciated in speech recognition and that Whisper's robust zero-shot performance demonstrates its ability to generalize well across different domains, languages, and tasks, approaching human accuracy and robustness. The authors explore the system's scaling properties in terms of both model size and dataset size, and analyze the impact of multitask and multilingual training. They also discuss Whisper's performance on language identification and its robustness to noise. The paper concludes with a discussion of potential limitations and areas for future work.

Oct 4, 2024

10m

11

WaveNet: A Generative Model for Raw Audio

This paper introduces WaveNet, a deep neural network designed to generate raw audio waveforms. The paper highlights WaveNet's ability to produce audio signals with unprecedented naturalness, surpassing the performance of existing text-to-speech systems. Key to WaveNet's success is the use of dilated causal convolutions, which enable the model to capture long-range temporal dependencies in audio data. The authors demonstrate WaveNet's versatility by showcasing its effectiveness in multi-speaker speech generation, music modeling, and speech recognition tasks. They also discuss the potential of WaveNet as a generic framework for tackling various audio generation applications.
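A dilated causal convolution can be written out directly: each output sample depends only on the current and earlier inputs, spaced `dilation` steps apart. This is a minimal single-channel sketch with hypothetical function names, not WaveNet's gated multi-channel layers; the kernel size of 2 and the dilation schedule 1, 2, 4, 8 are illustrative choices.

```python
def dilated_causal_conv(x, weights, dilation):
    """y[t] = sum_i weights[i] * x[t - i * dilation], with inputs before t=0 treated as zero."""
    y = []
    for t in range(len(x)):
        acc = 0.0
        for i, w in enumerate(weights):
            idx = t - i * dilation
            if idx >= 0:          # causal: never look at future samples
                acc += w * x[idx]
        y.append(acc)
    return y


def receptive_field(kernel_size, dilations):
    """Number of input samples that can influence one output of the stacked layers."""
    return 1 + (kernel_size - 1) * sum(dilations)


y = dilated_causal_conv([1, 2, 3, 4], weights=[1, 1], dilation=2)
rf = receptive_field(kernel_size=2, dilations=[1, 2, 4, 8])
```

Here `y` is `[1.0, 2.0, 4.0, 6.0]` and `rf` is 16: four layers with doubling dilation see 16 samples, whereas four undilated layers of the same kernel size would see only 5. This growth is what lets the model capture long-range dependencies with few layers.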

Oct 4, 2024

9m

10

LLaMA: Open and Efficient Foundation Language Models

The paper introduces LLaMA, a series of open-source foundation language models ranging in size from 7B to 65B parameters, trained on trillions of tokens from publicly available datasets. LLaMA-13B surpasses GPT-3 on most benchmarks, and LLaMA-65B is competitive with the best models, Chinchilla-70B and PaLM-540B. The authors demonstrate that training state-of-the-art models with publicly available data is possible, and argue that the release of these models will accelerate the development of LLMs. They further highlight the importance of responsible AI practices by examining the biases and toxicity encoded in their models.

Oct 4, 2024

9m

9

Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm

This study describes the AlphaZero algorithm, a general-purpose reinforcement learning algorithm that can achieve superhuman performance in challenging domains. It surpasses the capabilities of traditional game-playing programs in chess, shogi, and Go, demonstrating its ability to learn from scratch and master complex games without relying on handcrafted domain knowledge. AlphaZero utilizes deep neural networks and Monte-Carlo tree search (MCTS) to guide its gameplay, focusing on the most promising variations during its search. The study contrasts AlphaZero's approach to the widely used alpha-beta search, which focuses on evaluating large numbers of positions. Additionally, it analyzes AlphaZero's chess knowledge, finding that it independently discovered and frequently played common human openings, further solidifying its mastery of the game.

Oct 4, 2024

9m

8

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

This research paper introduces a new language representation model called BERT (Bidirectional Encoder Representations from Transformers). BERT's key innovation is its ability to learn deep bidirectional representations from unlabeled text, enabling it to outperform existing language models on a wide range of natural language processing tasks, including question answering, language inference, and sentiment analysis. The authors demonstrate that BERT achieves state-of-the-art results on eleven NLP benchmarks, outperforming previous models by substantial margins. They also perform ablation studies to investigate the contributions of different aspects of BERT's architecture and training process.

Oct 4, 2024

8m

7

Deep Residual Learning for Image Recognition

The authors demonstrate that deep residual learning addresses the degradation problem, in which accuracy saturates and then drops as plain networks grow deeper, by explicitly letting stacked layers fit a residual mapping. This technique enables the training of extremely deep networks, leading to significant accuracy gains in various tasks, including ImageNet classification, object detection on PASCAL VOC and MS COCO, and ImageNet localization. The paper provides comprehensive experimental evidence and analysis to support the effectiveness of the proposed approach, highlighting its potential impact on future research in deep learning.
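The residual mapping amounts to adding a block's input back to its output, so the stacked layers only have to learn the difference from the identity. This is a minimal sketch, assuming the block's layers are passed in as a function `f` on plain lists; real blocks use convolutions and normalization.

```python
def relu(v):
    """Element-wise rectified linear unit."""
    return [max(0.0, x) for x in v]


def residual_block(x, f):
    """Return relu(f(x) + x): the layers in f only need to model the residual."""
    return relu([fi + xi for fi, xi in zip(f(x), x)])


def zero_layers(x):
    """Stand-in for stacked layers whose weights have been driven to zero."""
    return [0.0] * len(x)


out = residual_block([1.0, 2.0], zero_layers)
```

When the layers output zero, the block passes its non-negative input through unchanged, so `out` is `[1.0, 2.0]`. An identity mapping is therefore easy to represent, which is the intuition for why adding more such blocks should not make a network harder to fit.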

Oct 4, 2024

7m

6

Playing Atari with Deep Reinforcement Learning

This paper focuses on training deep convolutional neural networks to play Atari 2600 games using reinforcement learning. The authors use a novel approach called Deep Q-learning, which combines Q-learning with experience replay, a technique that stores past transitions and lets the agent learn from random samples of them. The paper explores the ability of deep learning models to learn from raw visual inputs and overcome the challenges associated with reinforcement learning tasks, ultimately achieving performance that surpasses or approaches that of human experts on several Atari games.
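The two ingredients named above, experience replay and the Q-learning target, can be sketched without a neural network. This is a minimal sketch, not the paper's training loop: the class name `ReplayBuffer`, the capacity, and the discount factor are all assumptions for illustration.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of past transitions; the oldest entries are dropped first."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size, rng=random):
        # Uniform random minibatch, which breaks up correlations between consecutive frames.
        return rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


def q_target(reward, next_q_values, done, gamma=0.99):
    """y = r for terminal transitions, else r + gamma * max over next-state Q-values."""
    return reward if done else reward + gamma * max(next_q_values)


buf = ReplayBuffer(capacity=2)
for step in range(3):
    buf.push(state=step, action=0, reward=1.0, next_state=step + 1, done=False)
batch = buf.sample(2)
y = q_target(reward=1.0, next_q_values=[0.5, 2.0], done=False, gamma=0.5)
```

In a full agent, the network's prediction for the sampled state and action would be regressed toward `y` (here 2.0) for each transition in `batch`.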

Oct 4, 2024

11m

5

Generative Adversarial Nets by Goodfellow et al.

The research paper outlines a new approach to training generative models called Generative Adversarial Nets (GANs). GANs utilize a minimax two-player game where a generative model (G) learns to produce realistic data samples, while a discriminative model (D) learns to distinguish between real and generated samples. Through this competitive process, G improves its ability to create convincing data, and D becomes more adept at identifying fabricated samples. The paper delves into the theoretical underpinnings of GANs, proving their ability to recover the data distribution under ideal circumstances, and showcasing experimental results on image datasets like MNIST and CIFAR-10. The authors conclude by discussing the advantages and disadvantages of GANs compared to other generative models, and highlighting potential extensions for future research.
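
For reference, the minimax game described above is usually written as the following value function, where p_data is the data distribution, p_z is the noise prior fed to G, and p_g is the distribution of generated samples. The second line gives the optimal discriminator for a fixed generator, which is the step behind the paper's result that the global optimum is reached when p_g equals p_data.

```latex
\min_G \max_D V(D, G)
  = \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]

D^*_G(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)}
```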

Oct 4, 2024

8m

4

LeNet - Handwritten Digit Recognition with a Back-Propagation Network

This paper describes the application of a back-propagation network to handwritten digit recognition. The authors demonstrate how a network architecture, constrained by geometric knowledge, can achieve high accuracy in classifying digits without extensive preprocessing. The network, trained on a real-world dataset of handwritten zip codes, achieves a 1% error rate with a 9% rejection rate, showing promising results in this domain. The paper also highlights the network's efficient learning process and its potential for real-time implementation on commercial digital signal processing hardware.

Oct 4, 2024

9m

3

AlexNet - ImageNet Classification with Deep Convolutional Neural Networks

The research paper "ImageNet Classification with Deep Convolutional Neural Networks" details the development and training of a large-scale convolutional neural network (CNN) for image classification. The authors, Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton, present a groundbreaking architecture that achieved state-of-the-art results on the ImageNet dataset, a challenging benchmark in computer vision. The paper highlights several design choices, including the use of rectified linear units (ReLUs), data augmentation techniques, and a then recently developed regularization method called dropout. The authors also discuss the training process, including the use of multiple GPUs and a highly optimized GPU implementation of the convolution operation. This paper significantly advanced the field of deep learning, demonstrating the power of deep CNNs for object recognition.

Oct 4, 2024

10m

2

Were RNNs All We Needed?

This research paper revisits the traditional Recurrent Neural Networks (RNNs) – specifically, LSTMs and GRUs – and shows how to adapt them for modern parallel training. The authors demonstrate that by removing certain dependencies within the RNN structure, these models can be trained using the Parallel Scan algorithm, making them significantly faster than their traditional counterparts. The paper then compares the performance of these simplified LSTMs and GRUs (minLSTMs and minGRUs) to recent state-of-the-art sequence models in several tasks, including Selective Copying, Reinforcement Learning, and Language Modeling. The results show that the minLSTMs and minGRUs achieve comparable or better performance than other models while being far more efficient, suggesting that RNNs might be a viable option even in the era of Transformers.
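
The reason removing those dependencies helps can be sketched with a scalar toy version of minGRU. Once the gate and candidate depend only on the current input, the update has the form h_t = a_t * h_(t-1) + b_t, and composing such steps is associative, so a prefix scan can compute every h_t. The scalar weights `w_z` and `w_h` and all function names below are assumptions for illustration, not the authors' code.

```python
import math
from typing import List, Tuple

Step = Tuple[float, float]  # (a, b) representing the affine map h -> a * h + b


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def min_gru_steps(xs: List[float], w_z: float, w_h: float) -> List[Step]:
    """Gate z_t and candidate h~_t use only x_t, so a_t = 1 - z_t, b_t = z_t * h~_t."""
    steps = []
    for x in xs:
        z = sigmoid(w_z * x)
        h_tilde = w_h * x
        steps.append((1.0 - z, z * h_tilde))
    return steps


def sequential(steps: List[Step], h0: float) -> List[float]:
    """Reference: apply the recurrence one time step after another."""
    hs, h = [], h0
    for a, b in steps:
        h = a * h + b
        hs.append(h)
    return hs


def combine(first: Step, second: Step) -> Step:
    """Compose two affine steps (first, then second); this operator is associative."""
    a1, b1 = first
    a2, b2 = second
    return (a2 * a1, a2 * b1 + b2)


def scan(steps: List[Step], h0: float) -> List[float]:
    """Inclusive prefix scan in log2(n) rounds; updates within a round are independent."""
    prefix = list(steps)
    offset = 1
    while offset < len(prefix):
        prefix = [
            prefix[i] if i < offset else combine(prefix[i - offset], prefix[i])
            for i in range(len(prefix))
        ]
        offset *= 2
    return [a * h0 + b for a, b in prefix]


steps = min_gru_steps([0.5, -1.0, 2.0, 0.1, 0.7], w_z=1.5, w_h=0.8)
h_seq = sequential(steps, h0=0.0)
h_scan = scan(steps, h0=0.0)
```

This sketch runs the scan rounds in ordinary Python, so it is not faster here; it only shows that the scan and the step-by-step recurrence agree, which is what allows a parallel implementation on suitable hardware.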

Oct 4, 2024

8m

1

Attention is all you need

The Transformer is a network architecture based solely on attention mechanisms that excels in sequence transduction tasks such as machine translation. Unlike traditional recurrent models, the Transformer allows for parallelization during training, leading to faster training times, especially with longer sequences. Notably, the Transformer utilizes self-attention, which computes a representation of a sequence by relating different positions within the sequence itself. Multi-head attention lets the model jointly attend to information from different representation subspaces, and the short path between any two positions makes long-range dependencies easier to learn than with recurrent or convolutional layers. Empirical results demonstrate that the Transformer surpasses previous state-of-the-art models in translation quality while requiring less training cost. Moreover, the Transformer generalizes well, achieving competitive results in English constituency parsing, a task that poses unique challenges due to structural constraints and length discrepancies between input and output.
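
As a small illustration of self-attention, the sketch below implements scaled dot-product attention, softmax(QK^T / sqrt(d_k))V, in plain Python. It assumes a single head and uses the input directly as queries, keys, and values; the learned projection matrices of the real model are left out, and the function names are chosen for this example.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(q: Matrix, k: Matrix, v: Matrix) -> Matrix:
    """Scaled dot-product attention: each query mixes the value rows by similarity."""
    d_k = len(k[0])
    output = []
    for query in q:
        scores = [
            sum(a * b for a, b in zip(query, key)) / math.sqrt(d_k) for key in k
        ]
        weights = softmax(scores)  # one weight per position, summing to 1
        output.append(
            [sum(w * row[j] for w, row in zip(weights, v)) for j in range(len(v[0]))]
        )
    return output


# Three positions with two features each; every position attends to all others.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = self_attention(x, x, x)
```

Each output row depends on all input positions at once and on no previous output, which is why the computation for every position can run in parallel during training.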

Oct 4, 2024

15m

