Source : A Brief History of AI with Deep Learning | by LM Po | Sep, 2024 | Medium
Artificial intelligence (AI) and deep learning have seen remarkable progress over the past several decades, transforming fields like computer vision, natural language processing, and robotics. This article provides an overview of key milestones in the history of AI through the lens of deep learning, from early neural network models to modern large language models and multimodal AI systems.

1. The Birth of Artificial Intelligence (1956)
The concept of Artificial Intelligence (AI) has been around for centuries, but the modern field of AI as we know it today began to take shape in the mid-20th century. The term “Artificial Intelligence” was first coined in 1956 by John McCarthy, a computer scientist and cognitive scientist, at the Dartmouth Summer Research Project on Artificial Intelligence.
The Dartmouth conference is often considered the birthplace of AI as a field of research. The conference brought together a group of computer scientists, mathematicians, and cognitive scientists to discuss the possibility of creating machines that could simulate human intelligence. The attendees included notable figures such as Marvin Minsky, Nathaniel Rochester, and Claude Shannon.

1.1 The Evolution of AI: From Rule-Based Systems to Deep Learning
The evolution of AI began in the 1950s with the development of algorithms for tasks like chess and problem-solving, with the first AI program, the Logic Theorist, created in 1956. The 1960s and 1970s introduced rule-based expert systems, such as MYCIN, which could assist in complex decision-making processes. The 1980s saw the emergence of machine learning, which enabled AI systems to learn from data and improve over time, laying the foundation for modern deep learning techniques.

Today, the majority of cutting-edge AI technologies are driven by deep learning techniques, which have transformed the landscape of AI. Deep learning, a specialized branch of machine learning, leverages artificial neural networks with multiple layers to extract complex features from raw input data. In this article, we will explore the history of AI, highlighting the role of deep learning in its evolution.
2. Early Artificial Neural Networks (1940s — 1960s)
2.1 McCulloch-Pitts Neuron (1943)
The concept of neural networks dates back to 1943, when Warren McCulloch and Walter Pitts proposed the first artificial neuron model. The McCulloch-Pitts (MP) neuron model was a groundbreaking simplification of biological neurons. This model laid the foundation for artificial neural networks by aggregating binary inputs and making decisions based on this aggregation using a threshold activation function, resulting in a binary output {0, 1}.

This simplified model captures the essence of neuronal behavior: receiving multiple inputs, integrating them, and producing a binary output based on whether the integrated signal exceeds a threshold. Despite its simplicity, the MP neuron model was capable of implementing basic logical operations, demonstrating the potential of neural computation.
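To make this concrete, here is a minimal Python sketch of an MP neuron; the inputs are binary and the threshold is chosen by hand, which is essentially all the original model allows:

```python
def mp_neuron(inputs, threshold):
    """McCulloch-Pitts neuron: fires (returns 1) when the sum of its
    binary inputs meets or exceeds the threshold, otherwise returns 0."""
    return 1 if sum(inputs) >= threshold else 0

# AND gate: fires only when both inputs are 1 (threshold = 2)
print([mp_neuron([a, b], threshold=2) for a in (0, 1) for b in (0, 1)])  # [0, 0, 0, 1]

# OR gate: fires when at least one input is 1 (threshold = 1)
print([mp_neuron([a, b], threshold=1) for a in (0, 1) for b in (0, 1)])  # [0, 0, 1, 1]
```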
2.2 Rosenblatt’s Perceptron Model (1957)
In 1957, Frank Rosenblatt introduced the Perceptron, a single-layer neural network capable of learning and recognizing patterns. The Perceptron model is a more general computational model than the MP neuron, designed to process real-valued inputs and adjust weights to minimize classification errors.

Rosenblatt also developed a supervised learning algorithm for the Perceptron, which allows the network to learn directly from training data.

Rosenblatt’s ambitious claims about the Perceptron’s capabilities, including its potential to recognize individuals and translate speech between languages, generated considerable public interest in AI during that period. The Perceptron model and its associated learning algorithm marked significant milestones in the evolution of neural networks. However, a critical limitation soon became apparent: the Perceptron’s learning rule was unable to converge when presented with non-linearly separable training data.
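For illustration, here is a compact sketch of the perceptron learning rule on a linearly separable toy problem (the OR function); the learning rate, epoch count, and data are illustrative choices, not taken from Rosenblatt’s original work:

```python
import numpy as np

def train_perceptron(X, y, epochs=10, lr=0.1):
    """Classic perceptron rule: predict with a step function, then nudge
    the weights by lr * (target - prediction) * x for each example."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for x, target in zip(X, y):
            pred = 1 if np.dot(w, x) + b > 0 else 0
            w += lr * (target - pred) * x
            b += lr * (target - pred)
    return w, b

# Linearly separable toy data: the OR function
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 1])
w, b = train_perceptron(X, y)
print([(1 if np.dot(w, x) + b > 0 else 0) for x in X])  # [0, 1, 1, 1]
```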
2.3 ADALINE (1959)
In 1959, Bernard Widrow and Marcian Hoff introduced ADALINE (Adaptive Linear Neuron), trained with the delta rule (also known as the Widrow-Hoff or least mean squares rule), as an improvement over the Perceptron learning rule. ADALINE addressed limitations such as binary-only outputs and sensitivity to noise, and its learning rule converges even when the training data are not linearly separable, a major step forward in neural network development.

Key features of ADALINE include:
- Linear Activation Function: Unlike the Perceptron’s step function, ADALINE uses a linear activation function, making it suitable for regression tasks and continuous outputs.
- Least Mean Squares (LMS) Algorithm: ADALINE employs the LMS algorithm, which minimizes the mean squared error between predicted and actual outputs, providing a more efficient and stable learning process (see the sketch after this list).
- Adaptive Weights: The LMS algorithm adjusts the weights adaptively based on the error in the output, enabling ADALINE to learn and converge effectively, even in the presence of noise.
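A minimal sketch of the delta (LMS) rule described above; note that the update is driven by the continuous error of the linear output rather than by a thresholded prediction. The learning rate and epoch count below are illustrative:

```python
import numpy as np

def train_adaline(X, y, epochs=50, lr=0.05):
    """Delta / LMS rule: the weight update is proportional to the error of the
    linear output (no thresholding), which minimizes mean squared error."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        for x, target in zip(X, y):
            error = target - (np.dot(w, x) + b)   # continuous error signal
            w += lr * error * x                   # adaptive weight update
            b += lr * error
    return w, b

# For classification, the linear output can be thresholded afterwards,
# e.g. predict 1 when np.dot(w, x) + b >= 0.5, else 0.
```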
The introduction of ADALINE marked the start of the First Golden Age of Neural Networks, overcoming limitations of Rosenblatt’s Perceptron learning rule. This breakthrough enabled efficient learning, continuous outputs, and adaptation to noisy data, sparking a wave of innovation and rapid progress in the field.

However, like the Perceptron, ADALINE was still limited to linearly separable problems and could not solve more complex and non-linear tasks. This limitation would later be highlighted by the XOR problem, leading to the development of more advanced neural network architectures.
2.4 The XOR Problem (1969)
In 1969, Marvin Minsky and Seymour Papert highlighted a critical limitation of the single-layer Perceptron in their book “Perceptrons.” They demonstrated that the Perceptron was incapable of solving the Exclusive OR (XOR) problem, a simple binary classification task, due to its linear decision boundary. The XOR problem is not linearly separable, meaning no single linear boundary can correctly classify all input patterns.

This revelation underscored the need for more complex neural network architectures capable of learning non-linear decision boundaries. The exposure of the Perceptron’s limitations led to a loss of confidence in neural networks and a shift towards symbolic AI methods, marking the start of the “First Dark Age of Neural Networks” from the early 1970s to the mid-1980s.

However, the insights gained from grappling with the XOR problem led researchers to recognize the need for more complex models that could capture non-linear relationships. This realization ultimately led to the development of the Multilayer Perceptron and other advanced neural network models, setting the stage for the resurgence of neural networks and deep learning in later decades.
3. The Multilayer Perceptron (1960s)
The Multilayer Perceptron (MLP) was introduced in the 1960s as an improvement over the single-layer Perceptron. It consists of multiple layers of interconnected neurons, allowing it to address the limitations of the single-layer model. Soviet scientists A. G. Ivakhnenko and V. Lapa made significant contributions to the development of the MLP, building upon the foundational work of the Perceptron.

3.1 Hidden Layers
The addition of hidden layers allows the MLP to capture and represent complex, non-linear relationships in the data. These hidden layers significantly enhance the network’s learning capabilities, enabling it to solve problems that are not linearly separable, such as the XOR problem.
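As a concrete illustration, the sketch below hard-wires a two-neuron hidden layer (weights chosen by hand rather than learned) that composes two linear boundaries into the non-linear XOR decision:

```python
import numpy as np

def step(z):
    """Threshold activation: 1 where the input is positive, else 0."""
    return (z > 0).astype(int)

def xor_mlp(x):
    """Tiny hand-wired MLP: hidden unit h1 acts like OR, h2 like AND;
    the output computes 'h1 AND NOT h2', which is exactly XOR."""
    W1 = np.array([[1.0, 1.0],     # h1: fires if x1 + x2 > 0.5  (OR)
                   [1.0, 1.0]])    # h2: fires if x1 + x2 > 1.5  (AND)
    b1 = np.array([-0.5, -1.5])
    h = step(W1 @ x + b1)
    return int(h[0] - 2 * h[1] - 0.5 > 0)   # output: h1 AND NOT h2

print([xor_mlp(np.array(p)) for p in [(0, 0), (0, 1), (1, 0), (1, 1)]])  # [0, 1, 1, 0]
```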

3.2 Historical Context and Challenges of MLPs
The MLP marked a significant advancement in neural network research, demonstrating the potential of deep learning architectures for solving complex problems. In the 1960s and 1970s, however, the development of MLPs was hindered by several challenges:
- Lack of Training Algorithms: Early MLP models lacked efficient training algorithms that could effectively adjust the weights of the network. The absence of backpropagation made it difficult to train deep networks with multiple layers.
- Computational Limitations: The computational power available at the time was insufficient to handle the complex calculations required for training deep neural networks. This limitation slowed the progress of MLP research and development.
The first Dark Age of Neural Networks ended in 1986 with the rediscovery and publication of the backpropagation algorithm, starting the Second Golden Age of Neural Networks.

4. Backpropagation (1970s-1980s)
In 1969, the XOR problem highlighted the limitations of perceptrons (single-layer neural networks). Researchers realized that multi-layer neural networks could overcome these limitations, but they lacked a practical algorithm to train such complex networks. It took another 17 years for the backpropagation algorithm to be popularized, making it practical to train multilayer networks that can, in theory, approximate any function. Interestingly, it was later discovered that the algorithm had actually been invented before its publication. Today, backpropagation is a fundamental component of deep learning, having undergone significant advancements and refinements since its inception in the 1960s and 1970s.

4.1 Early Developments (1970s)
- Seppo Linnainmaa (1970): Introduced the concept of automatic differentiation, which is a key component of the backpropagation algorithm.
- Paul Werbos (1974): Proposed using the chain rule of calculus to compute the gradient of the error function with respect to the network’s weights, enabling the training of multilayer neural networks.
4.2 Refinement and Popularization (1980s)
- David Rumelhart, Geoffrey Hinton, and Ronald Williams (1986): Presented backpropagation as a practical and efficient method for training deep neural networks, demonstrating its application to various problems.

4.3 Key Features of Backpropagation
- Gradient Descent: Backpropagation is used in conjunction with gradient descent to minimize the error function. The algorithm computes the gradient of the error with respect to each weight in the network, allowing the weights to be updated iteratively to reduce the error.
- Chain Rule: The core of the backpropagation algorithm is the application of the chain rule of calculus. This rule allows the gradient of the error to be decomposed into a series of partial derivatives, which can be computed efficiently through a backward pass through the network.
- Layered Computation: Backpropagation operates in a layer-by-layer manner, starting from the output layer and working backward to the input layer. This layered computation ensures that the gradients are propagated correctly through the network, enabling the training of deep architectures.
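The sketch below ties these ideas together: a forward pass through one hidden layer, a backward pass that applies the chain rule layer by layer, and gradient-descent updates, trained here on the XOR problem. The initialization, learning rate, and iteration count are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)           # XOR targets

W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)             # input -> hidden
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)             # hidden -> output
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
lr = 0.5

for _ in range(5000):
    # Forward pass, layer by layer
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)

    # Backward pass: chain rule, from the output layer back to the input layer
    d_out = (out - y) * out * (1 - out)          # error signal at the output layer
    d_h = (d_out @ W2.T) * h * (1 - h)           # propagated back to the hidden layer

    # Gradient descent on every weight and bias
    W2 -= lr * (h.T @ d_out); b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * (X.T @ d_h);   b1 -= lr * d_h.sum(axis=0)

print(out.round(2).ravel())   # typically approaches [0, 1, 1, 0]
```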
4.4 Universal Approximation Theorem (1989)
The Universal Approximation Theorem, proposed by George Cybenko in 1989, provided a mathematical foundation for the capabilities of multilayer neural networks. The theorem states that a feedforward neural network with a single hidden layer can approximate any continuous function to an arbitrary degree of accuracy, given sufficient neurons and using non-linear activation functions. This theorem underscores the power and flexibility of neural networks, making them suitable for a wide range of applications.
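Stated informally in Cybenko’s setting (a sigmoidal activation σ and a continuous target f on the unit cube), the theorem says that for any tolerance ε > 0 there is a finite single-hidden-layer network whose output stays within ε of f everywhere:

```latex
% Universal approximation (Cybenko, 1989), informal statement
\[
\forall \varepsilon > 0 \;\; \exists N,\; v_i, b_i \in \mathbb{R},\; w_i \in \mathbb{R}^n :
\quad
\left| \sum_{i=1}^{N} v_i \, \sigma\!\left(w_i^{\top} x + b_i\right) - f(x) \right| < \varepsilon
\quad \text{for all } x \in [0,1]^n .
\]
```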

4.5 Second Golden Age (Late 1980s — Early 1990s)
The development of Backpropagation and the Universal Approximation Theorem (UAT) marked the beginning of a second golden age for neural networks. Backpropagation provided an efficient method for training multilayer neural networks, enabling researchers to train deeper and more complex models. The UAT provided theoretical justification for the use of multilayer neural networks and bolstered confidence in their ability to solve complex problems. This period, spanning the late 1980s and early 1990s, saw a resurgence of interest and significant advancements in the field.

4.6 Second Dark Age (Early 1990s — Early 2000s)
However, the field of neural networks experienced a “second dark age” between the early 1990s and early 2000s due to several factors:
- The rise of Support Vector Machines (SVMs), which offered a mathematically elegant approach to classification and regression tasks.
- Computational limitations, as training deep neural networks was still time-consuming and hardware-intensive.
- Overfitting and generalization issues, where early neural networks performed well on training data but poorly on unseen data, making them less reliable for practical applications.
These challenges led many researchers to shift their focus away from neural networks, contributing to a period of stagnation in the field.
4.7 Resurgence as Deep Learning (Late 2000s — Present)
The field of neural networks experienced a resurgence in the late 2000s and early 2010s, driven by advancements in:
- Deep learning architectures (CNNs, RNNs, Transformers, Diffusion Models)
- Hardware (GPUs, TPUs, LPUs)
- Large-scale datasets (ImageNet, COCO, OpenWebText, WikiText, etc.)
- Training algorithms (SGD, Adam, dropout)
These advancements led to significant breakthroughs in computer vision, natural language processing, speech recognition, and reinforcement learning. The Universal Approximation Theorem, combined with practical advancements, has paved the way for the widespread adoption and success of deep learning techniques.
5. Convolutional Neural Networks (1980s – 2010s)
Convolutional Neural Networks (CNNs) have dramatically transformed the landscape of deep learning, particularly in the fields of computer vision and image processing. Their evolution from the 1980s to the 2010s reflects significant advancements in architecture, training techniques, and applications.

5.1 Early Developments (1989–1998)
The concept of CNNs was first introduced in the 1980s by Kunihiko Fukushima, who proposed the Neocognitron, a hierarchical neural network that mimicked the structure of the human visual cortex. This pioneering work laid the foundation for the development of CNNs. In the late 1980s and early 1990s, Yann LeCun and his team further developed CNNs, introducing the LeNet-5 architecture, which was specifically designed for handwritten digit recognition.

5.2 Key Components of CNNs
CNNs are built from three key components, illustrated in the code sketch that follows this list:
- Convolutional Layers: These layers automatically learn spatial hierarchies of features from input images by applying a set of learnable filters.
- Pooling Layers: Pooling layers reduce the spatial dimensions of the input, enhancing robustness to variations and decreasing computational load.
- Fully Connected Layers: Following convolutional and pooling layers, fully connected layers are used for classification tasks, integrating features learned from previous layers.
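A minimal PyTorch sketch (assuming torch is available) that wires these three components together in a LeNet-style network for 28×28 grayscale inputs; the layer sizes are illustrative rather than a faithful reproduction of LeNet-5:

```python
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    """Convolution -> pooling -> fully connected, in the spirit of LeNet-5."""
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5, padding=2),  # convolutional layer: learnable filters
            nn.ReLU(),
            nn.MaxPool2d(2),                            # pooling layer: 28x28 -> 14x14
            nn.Conv2d(6, 16, kernel_size=5),
            nn.ReLU(),
            nn.MaxPool2d(2),                            # 10x10 -> 5x5
        )
        self.classifier = nn.Sequential(                # fully connected layers for classification
            nn.Flatten(),
            nn.Linear(16 * 5 * 5, 120),
            nn.ReLU(),
            nn.Linear(120, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))

logits = TinyCNN()(torch.randn(1, 1, 28, 28))   # -> shape [1, 10]
```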
5.3 Key Features of CNNs
- Local Receptive Fields: CNNs use local receptive fields to capture local patterns in the input data, making them highly effective for image and visual tasks.
- Shared Weights: The use of shared weights in convolutional layers reduces the number of parameters in the network, making it more efficient and easier to train.
- Translation Invariance: Pooling layers introduce translation invariance, allowing the network to recognize patterns regardless of their position in the input image.
5.4 The Rise of CNNs: AlexNet’s Impact (2012)
In 2012, a major milestone was reached in the development of CNNs when AlexNet won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) by a wide margin, marking a breakthrough in image classification.

The ILSVRC is an annual image recognition benchmark that evaluates algorithms on a subset of the ImageNet dataset containing roughly 1.2 million annotated training images across 1,000 classes. AlexNet’s innovations included:
- ReLU Activation Functions: Introduced to overcome issues with traditional saturating activation functions, ReLU enabled faster training and improved performance.
- Dropout Regularization: This technique reduced overfitting by randomly dropping units during training.
- Data Augmentation: Enhancements to the training dataset improved generalization by artificially increasing the diversity of the training data.
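The three ideas above typically show up in modern code roughly as follows; this PyTorch/torchvision sketch is an illustration of the techniques, not AlexNet’s actual training recipe:

```python
import torch.nn as nn
from torchvision import transforms

# ReLU activations and dropout regularization inside a classifier head
head = nn.Sequential(
    nn.Linear(4096, 4096),
    nn.ReLU(inplace=True),   # ReLU: fast, non-saturating activation
    nn.Dropout(p=0.5),       # dropout: randomly zero units during training
    nn.Linear(4096, 1000),
)

# Data augmentation: artificially diversify the training images
augment = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])
```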
AlexNet’s success marked a turning point in CNN development, paving the way for further advancements in image classification and object detection.

AlexNet Unleashes the Third Golden Age of Neural Networks
The current golden age (2010s-present) is marked by the convergence of deep learning, big data, and powerful computing platforms. This era has seen remarkable breakthroughs in image recognition, natural language processing, and robotics. Ongoing research continues to push the boundaries of AI capabilities.

5.5 Subsequent Architectures
Following AlexNet, several influential architectures emerged:
- VGGNet (2014): Developed by the Visual Geometry Group at Oxford, VGGNet emphasized deeper architectures with smaller convolutional filters (3×3), achieving remarkable accuracy.
- GoogLeNet/Inception (2014): Introduced inception modules that allowed the network to capture multi-scale features efficiently.
- ResNet (2015): Residual Networks introduced skip connections, enabling the training of very deep networks while mitigating the vanishing gradient problem.
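To make the skip-connection idea concrete, here is a minimal residual block sketch in PyTorch: the block learns a residual F(x) and adds the input back, which helps gradients flow through very deep stacks. Channel counts and layer choices are illustrative:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """y = relu(x + F(x)): the block only needs to learn the residual F."""
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(x + self.body(x))   # skip connection adds the input back

y = ResidualBlock(64)(torch.randn(1, 64, 32, 32))  # same shape in and out
```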

5.6 Applications of CNNs
The advancements in CNNs have revolutionized various fields:
- Computer Vision: CNNs have become the backbone of modern computer vision, enabling breakthroughs in image classification, object detection, and semantic segmentation.
- Medical Imaging: CNNs are utilized for tasks such as disease diagnosis, tumor detection, and image-guided surgery, significantly improving diagnostic accuracy.
- Autonomous Vehicles: CNNs are integral to the perception systems of self-driving cars, allowing them to interpret and respond to their surroundings.

The journey of CNNs from their inception to their current status as a cornerstone of deep learning illustrates their profound impact on AI. The success of CNNs has also paved the way for further advancements in deep learning and has inspired the development of other specialized neural network architectures, such as RNNs and Transformers. The theoretical foundations and practical innovations in CNNs have contributed significantly to the widespread adoption and success of deep learning techniques across various domains.
6. Recurrent Neural Networks (1986–2017)
Recurrent Neural Networks (RNNs) emerged as a powerful architecture for handling sequential and temporal data. Unlike feedforward neural networks, RNNs are designed to process sequences of inputs, making them particularly effective for tasks such as language modeling, time series forecasting, and speech recognition.
6.1 Early Developments (1980s-1990s)
The concept of RNNs dates back to the 1980s, with pioneers like John Hopfield, Michael I. Jordan, and Jeffrey L. Elman contributing to the development of these networks. The Hopfield network, introduced by John Hopfield in 1982, laid the groundwork for understanding recurrent connections in neural networks. Jordan networks and Elman networks, proposed in the 1980s and 1990s, respectively, were early attempts to capture temporal dependencies in sequential data.

6.2 LSTM, GRU and Seq2Seq Models (1997 — 2014)
- Long Short-Term Memory (LSTM) Networks (1997): Sepp Hochreiter and Jürgen Schmidhuber introduced the Long Short-Term Memory (LSTM) network, which addressed the vanishing gradient problem in traditional RNNs. LSTMs use a gating mechanism to control the flow of information, allowing them to capture long-term dependencies in sequential data.
- Gated Recurrent Units (GRUs) (2014): Kyunghyun Cho et al. proposed Gated Recurrent Units (GRUs), a simplified version of LSTMs that also use a gating mechanism to control the flow of information. GRUs have fewer parameters than LSTMs and are often faster to train.

- Sequence-to-Sequence Models (Seq2Seq) (2014): Ilya Sutskever and his team introduced the Seq2Seq model, which uses an encoder-decoder architecture to map input sequences to output sequences. This model has been widely used for tasks such as machine translation, speech recognition, and text summarization.

6.3 Key Features of RNNs
- Recurrent Connections: RNNs use recurrent connections to maintain a hidden state that captures information from previous time steps. This allows the network to model temporal dependencies in sequential data.
- Backpropagation Through Time (BPTT): RNNs are trained using a variant of backpropagation called Backpropagation Through Time (BPTT), which unfolds the recurrent network over time and applies the standard backpropagation algorithm to the unfolded network.
- Gating Mechanisms: Advanced RNN architectures, such as LSTMs and GRUs, use gating mechanisms to control the flow of information, helping to mitigate the vanishing gradient problem and enable the network to capture long-term dependencies.
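A brief PyTorch sketch showing an LSTM carrying a hidden state (and cell state) across a toy sequence; the dimensions are arbitrary, and training such a model against a loss on its outputs is what performs backpropagation through time:

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
x = torch.randn(4, 10, 8)          # batch of 4 sequences, 10 time steps, 8 features each
output, (h_n, c_n) = lstm(x)       # output: hidden state at every time step
print(output.shape, h_n.shape)     # torch.Size([4, 10, 16]) torch.Size([1, 4, 16])
# A loss computed on `output` and backpropagated unrolls the recurrence over
# the 10 time steps, i.e. backpropagation through time (BPTT).
```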

6.4 RNN Applications
RNNs have had a significant impact on various fields, including:
- Natural Language Processing: RNNs have revolutionized the field of natural language processing, enabling significant advancements in tasks such as language modeling, machine translation, sentiment analysis, and text generation.
- Speech Recognition: RNNs are widely used in speech recognition systems, where they model the temporal dependencies in spoken language to convert speech signals into text.
- Time Series Forecasting: RNNs are effective for time series forecasting, where they model the temporal dependencies in sequential data to predict future values.

6.5 Challenges of RNNs
Despite their success, RNNs face several challenges:
- Vanishing and Exploding Gradients: Traditional RNNs struggle with these issues, although LSTMs and GRUs provide some solutions.
- Computational Complexity: Training RNNs can be resource-intensive, especially with large datasets.
- Parallelization: The sequential nature of RNNs complicates parallel training and inference processes.
The success of RNNs has paved the way for further advancements in deep learning and has inspired the development of other specialized neural network architectures, such as Transformers, which have achieved state-of-the-art performance in various sequential data tasks. The theoretical foundations and practical innovations in RNNs have contributed significantly to the widespread adoption and success of deep learning techniques across various domains.
7. Transformers (2017-Present)
Transformers have transformed the landscape of deep learning with their superior ability to handle sequential data, becoming pivotal in many fields from natural language processing (NLP) to computer vision.
7.1 Introduction of Transformers (2017)
The Transformer model was introduced by Vaswani et al. (2017) in the seminal paper “Attention is All You Need.” This model abandoned traditional sequential processing of RNNs for a self-attention mechanism, allowing for parallel processing and better handling of long-range dependencies.

7.2 Key Features of Transformers
- Self-Attention Mechanism: Allows each position in the sequence to attend to all positions, capturing context with greater flexibility than RNNs or LSTMs.
- Parallelization: Enhances training speed by processing all input data simultaneously, a stark contrast to the sequential nature of RNNs.
- Encoder-Decoder Structure: Both encoder and decoder stacks utilize layers of self-attention and feed-forward neural networks, with positional encodings to maintain sequence order.
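The self-attention computation at the core of the Transformer can be sketched in a few lines of PyTorch; this follows the scaled dot-product formulation, with illustrative tensor sizes and randomly chosen projection matrices:

```python
import math
import torch

def self_attention(x, Wq, Wk, Wv):
    """Scaled dot-product self-attention: every position attends to every
    position, so the whole sequence is processed in parallel."""
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    scores = Q @ K.transpose(-2, -1) / math.sqrt(Q.size(-1))  # pairwise attention scores
    weights = torch.softmax(scores, dim=-1)                   # each row sums to 1
    return weights @ V                                        # weighted mix of value vectors

d = 64
x = torch.randn(10, d)                       # sequence of 10 token embeddings
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))
print(self_attention(x, Wq, Wk, Wv).shape)   # torch.Size([10, 64])
```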


7.3 Transformer-based Language Models (2017 — Present)
- BERT (2018): Bidirectional Encoder Representations from Transformers, an encoder-only transformer, revolutionized NLP with pre-training on masked language modeling and next sentence prediction.
- T5 (2019): Text-to-Text Transfer Transformer, an encoder-decoder transformer, reframed NLP tasks into a text-to-text format, simplifying model architecture and training.

OpenAI’s GPT Series:
- GPT (2018): Generative Pre-trained Transformer, an autoregressive decoder-only transformer, introduced by OpenAI, focused on predicting the next word in text sequences, demonstrating impressive language understanding and generation capabilities.
- GPT-2 (2019): Significantly larger than its predecessor, it showed emergent capabilities like zero-shot task performance, raising discussions on AI’s potential misuse due to its ability to generate coherent, though sometimes misleading, text.
- GPT-3 (2020): With 175 billion parameters, GPT-3 further expanded the scope of what’s possible with language models, excelling in tasks with minimal fine-tuning, known as few-shot learning. As a decoder-only transformer, GPT-3’s autoregressive architecture enables it to generate text one token at a time, conditioned on the previous tokens in the sequence.

- ChatGPT (2022): A fine-tuned version of a model in the GPT-3.5 series, optimized for conversational engagement, demonstrating the power of instruction tuning to align model responses with user intent.

7.4 Other Well-known Large Language Models (LLMs)
The landscape of large language models (LLMs) has been significantly enriched by various prominent models, each offering unique capabilities and advancements in artificial intelligence. Here’s an updated overview of some well-known LLMs:
- Anthropic’s Claude (2022): Prioritizes safety and ethical considerations in AI outputs, aiming to align with human values.
- Meta’s LLaMA (2023): Offers models of varying sizes for different computational needs, with impressive results on natural language processing benchmarks.
- Mistral AI’s Mistral (2023): Balances high performance and resource efficiency, ideal for real-time applications, with a focus on open-source AI solutions.
- Alibaba’s Qwen (2023): Creates high-quality bilingual AI models for English and Chinese, facilitating cross-lingual applications and encouraging innovation.
- Microsoft’s Phi (2023): Emphasizes versatility and integration across various applications, with advanced training techniques for contextual understanding and user interaction.
- Google’s Gemma Series (2024): Lightweight, state-of-the-art open models for diverse applications, including text generation, summarization, and extraction, with a focus on performance and efficiency.


8. Multimodal Models (2023-Present)
8.1 GPT-4V (2023) and GPT-4o (2024)
- GPT-4V (2023) marked a significant step in AI development by integrating multimodal capabilities into the already powerful text-based model. It can process and generate content not only from text but also from images, laying the groundwork for more comprehensive AI interactions.

- GPT-4o (2024), an evolution from GPT-4V, brings enhanced multimodal integration with sophisticated contextual understanding. It improves upon its predecessor by offering better coherence across different media, advanced image generation from text prompts, and refined reasoning based on visual inputs. Additionally, GPT-4o includes advanced training mechanisms for ethical alignment, ensuring that its outputs are not only accurate but also responsible and aligned with human values.
Live demo of GPT-4o real-time translation.
8.2 Google’s Gemini (2023-present)
- Gemini Pro (2023): Google’s Gemini introduces a family of models designed for multimodal tasks, integrating text, images, audio, and video processing. Gemini Pro, in particular, stands out for its scalability and efficiency, making advanced AI accessible for various applications from real-time analytics to complex content generation across different media formats.
- Gemini’s Multimodal Capabilities: Gemini models, including Ultra and Nano versions for different scale applications, are engineered to perform tasks that require understanding across multiple data types. They excel in tasks like video summarization, multimodal translation, and interactive learning environments, demonstrating Google’s commitment to advancing AI’s role in multimedia contexts.
The capabilities of multimodal AI | Gemini Demo
8.3 Claude 3.0 and Claude 3.5 (2023-present)
- Claude 3.0 (2023) introduced by Anthropic, this model focuses on enhancing the safety and reliability of AI responses, with improvements in contextual understanding and ethical considerations. It’s designed to be more conversational and helpful while maintaining strict adherence to avoiding harmful or biased outputs.
- Claude 3.5 (2024) further refines the capabilities of Claude 3.0, offering better performance in complex tasks, increased efficiency in processing, and even more nuanced handling of user requests. This version also emphasizes multimodal interactions, although it primarily excels in textual and logical tasks, with emerging capabilities in handling visual or other sensory inputs for a more integrated user experience.
8.4 LLaVA (2023)
- LLaVA (Large Language and Vision Assistant) represents an innovative approach to multimodal AI, combining language understanding with visual processing. Developed in 2023, LLaVA can interpret images and relate them to textual content, enabling it to answer questions about images, describe visual content, or even generate text based on visual cues. Its architecture leverages the strengths of transformer models to achieve state-of-the-art performance in tasks that require both visual and linguistic understanding. This model is particularly noted for its open-source nature, encouraging further research and development in multimodal AI applications.

These models collectively signify a shift towards AI systems that not only understand and generate text but also interpret and create content across various modalities, mirroring human cognitive abilities more closely. This evolution in AI models fosters applications that are more interactive, intuitive, and capable of handling real-world scenarios with a blend of different sensory inputs, thereby expanding the horizon of what AI can achieve in daily life, research, and industry applications.
9. Diffusion Models (2015-Present)
Diffusion models have risen as an influential category of generative models, providing a fresh methodology for creating high-fidelity samples from intricate data distributions. Their approach contrasts with traditional models like GANs and VAEs by employing a progressive denoising technique, which has excelled in numerous applications.
9.1 Introduction of Diffusion Models (2015)
The groundwork was laid by Sohl-Dickstein et al. (2015) with their paper introducing diffusion models. They conceptualized a generative process where reversing a gradual noise addition could transform noise back into structured data.

9.2 Key Features of Diffusion Models
- Denoising Process: These models add noise in steps (forward process) and learn to reverse this (backward process), effectively denoising to produce samples.
- Markov Chain: Both processes are structured as Markov chains, with each forward step adding Gaussian noise, which the model learns to remove in reverse.
- Training Objective: The aim is to minimize the difference between predicted and actual noise at each step, optimizing a form of the evidence lower bound (ELBO).
- Stability and Robustness: They offer better stability than GANs, avoiding issues like mode collapse, thus consistently generating diverse, high-quality outputs.
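A compact sketch of one DDPM-style training step, assuming an image-shaped batch x0 and a hypothetical noise-prediction network model(x_t, t) standing in for a U-Net; the linear beta schedule and the dummy model below are purely illustrative:

```python
import torch
import torch.nn.functional as F

def diffusion_training_step(model, x0, alphas_cumprod):
    """One DDPM-style step: add Gaussian noise at a random timestep (forward
    process) and train the model to predict that noise (learned reverse process)."""
    t = torch.randint(0, len(alphas_cumprod), (x0.size(0),))
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)             # cumulative noise level per sample
    noise = torch.randn_like(x0)
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise    # noisy sample from q(x_t | x_0)
    return F.mse_loss(model(x_t, t), noise)                 # simplified ELBO objective

# Illustrative usage: linear beta schedule and a dummy stand-in for a U-Net
betas = torch.linspace(1e-4, 0.02, 1000)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)
dummy_model = lambda x_t, t: torch.zeros_like(x_t)          # hypothetical noise predictor
loss = diffusion_training_step(dummy_model, torch.randn(8, 3, 32, 32), alphas_cumprod)
```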

9.3 Advancements of Diffusion Models (2020-Present)
- Denoising Diffusion Probabilistic Models (DDPM) (2020): Refined the diffusion process, setting new benchmarks in image synthesis.
- Denoising Diffusion Implicit Models (DDIM) (2021): Enhanced efficiency with non-Markovian sampling, making the generative process more flexible.
- Score-Based Generative Modeling through Stochastic Differential Equations (2021): Utilized stochastic differential equations for efficient sample generation.
- Latent Diffusion Models (2022): Became the foundation for popular text-to-image generation systems like Stable Diffusion, significantly advancing the field of AI-generated imagery and paving the way for more accessible and efficient generative AI tools.

9.4 Text-to-Image Generation
Models like DALL-E 3 and Stable Diffusion 3 excel in generating high-quality images from textual descriptions, with DALL-E 3 providing detailed and accurate visuals and Stable Diffusion offering an open-source alternative that democratizes access to image generation technology.

- FLUX.1 (2024): Black Forest Labs has unveiled FLUX.1, an advanced diffusion model for AI image generation, offering exceptional speed, quality, and prompt adherence. Available in three versions — Schnell, Dev, and Pro — FLUX.1 leverages innovative techniques such as Rectified Flow Transformers to produce highly photorealistic images. FLUX.1 can also render legible text within images and handle fine details such as fingers and toes, areas where earlier image generators often struggled.

- DreamBooth (2022): Enables training diffusion models on a few images of a specific subject, allowing for personalized image generation.
- LoRA (2022): Stands for Low-Rank Adaptation, a technique that allows fine-tuning diffusion models with minimal additional parameters, making it easier to adapt models to specific tasks or datasets (a minimal sketch of the idea follows this list).

- ControlNet (2023): Conditions diffusion models on additional input like sketches or depth maps, providing more control over the generated images.

Animatediff with Multi ControlNet | Stable Diffusion.
- Multi-SBoRA (2024): Multi-SBoRA is a new method for customizing diffusion models for multiple concepts. It uses orthogonal standard basis vectors to construct low-rank matrices for fine-tuning, allowing for regional and non-overlapping weight updates that reduce cross-concept interference. This approach preserves the pre-trained model’s knowledge, reduces computational overhead, and enhances model flexibility. Experimental results show that Multi-SBoRA achieves optimal performance in multi-concept customization while maintaining independence and mitigating crosstalk effects.
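Returning to LoRA mentioned above, the low-rank idea itself is easy to sketch: the pretrained weight stays frozen and only a small update B·A is learned. The class below is a generic illustration of the technique, not the API of any particular library:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen linear layer with a trainable low-rank update: y = W x + (alpha/r) B A x."""
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():                      # pretrained weights stay frozen
            p.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: starts as a no-op
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())

# Only A and B (a small fraction of the parameters) are trained; the base layer is untouched.
adapted = LoRALinear(nn.Linear(768, 768), rank=8)
```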

The trajectory of diffusion model research indicates a promising future, with potential for integrated models that combine strengths from various AI architectures while optimizing for speed and quality.
9.5 Text-to-Video: OpenAI Sora (2024)
OpenAI Sora is a new text-to-video generation model that expands the capabilities of OpenAI’s multimodal AI offerings. This model allows users to create videos from textual descriptions, effectively bridging the gap between text and dynamic visual content. Sora’s integration into the multimodal framework enhances the potential for creative applications, enabling users to generate rich multimedia content with minimal input. This development signifies a significant step toward more intuitive and interactive AI systems that can understand and generate complex forms of media.
OpenAI Sora in Action: Tokyo Walk
10. Conclusion
The history of AI and deep learning is marked by significant progress and transformative innovations. From early neural networks to sophisticated architectures like Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), Transformers, and Diffusion Models, the field has revolutionized various domains.
Recent advancements have led to the development of Large Language Models (LLMs) and Large Multimodal Models (LMMs) such as OpenAI’s GPT-4o, Google’s Gemini Pro, Anthropic’s Claude 3.5 Sonnet, and Meta’s Llama 3.1, which demonstrate impressive natural language and multimodal capabilities. Additionally, breakthroughs in generative AI, including Text-to-Image and Text-to-Video generation models like Midjourney, DALL-E 3, Stable Diffusion, FLUX.1, and Sora, have expanded the creative potential of AI.
Diffusion models have also emerged as powerful generative models with diverse applications. As research continues to focus on developing more efficient, interpretable, and capable models, the impact of AI and deep learning on society and technology will only grow. These advancements are driving innovation in traditional fields and creating new possibilities for creative expression, problem-solving, and human-AI collaboration.
