diff --git a/article-neural-networks/README.md b/article-neural-networks/README.md new file mode 100644 index 0000000..356d812 --- /dev/null +++ b/article-neural-networks/README.md @@ -0,0 +1,46 @@ +# ANN, CNN, RNN, and Transformers — A Simple Guide + +This is a small article I wrote to explain how deep learning models grew over time, in a simple way that anyone can understand. + +The idea is simple: every new model was created because the old one had a real problem. Once you see this chain of problems and fixes, picking the right model for your task becomes much easier. + +## What's inside + +The article walks through the main neural network families step by step: + +- **Activation functions** → why they matter, and why ReLU and GELU changed everything +- **CNNs** → how they made image learning possible +- **RNNs** → how they added memory for sequences (and why they forget too fast) +- **LSTM and GRU** → how gates fixed the memory problem +- **Transformers** → why "attention" changed the whole field +- **BERT** → a quick look at how big pretrained models work today + +You also get: + +- 2 comparison tables (activations + architectures) so you can pick a model fast +- 3 small code examples (PyTorch + HuggingFace) showing how each idea looks in real production code +- 7 images to help visualize the concepts + +## Who is this for + +- Engineers who want to understand *why* each model exists, not just memorize names +- Students who are just starting with deep learning +- Anyone who wants a clean, simple overview before diving into papers or courses + +You don't need a math background. The article uses simple words, short sentences, and real examples. + + +## Main takeaway + +Deep learning didn't appear all at once. 
Each model fixed a real weakness of the one before it: + +- ANNs were too heavy for images → CNNs +- CNNs ignored order → RNNs +- RNNs forgot too fast → LSTM and GRU +- LSTMs were too slow → Transformers + +If you keep this chain in mind, you'll always know which model fits your task — and why. + +You can also read my Medium article on this topic here: + +[Open Medium Article](https://medium.com/p/6ad62a95f98f?postPublishedType=initial) diff --git a/article-neural-networks/article.md b/article-neural-networks/article.md new file mode 100644 index 0000000..34bea71 --- /dev/null +++ b/article-neural-networks/article.md @@ -0,0 +1,460 @@ +# ANN, CNN, RNN, and Transformers: Evolution and Key Innovations in Deep Learning + +## Introduction: Evolving Neural Network Architectures + +Artificial Neural Networks (ANNs) have changed a lot over time because people wanted models that work well with different types of data and solve older problems. + +The first common type was the feed-forward neural network (also called a multilayer perceptron). It uses layers of connected neurons to learn patterns, but it had two big weaknesses: + +- It was bad with big inputs like images (too many values to process). +- It struggled to remember long patterns over time, like in long sentences or long sequences of numbers. + +Because of these limits, new types of neural networks appeared: + +- **CNNs (Convolutional Neural Networks)** — for images and visual data, because they focus on small parts of an image and understand shapes, edges, and objects more efficiently. +- **RNNs (Recurrent Neural Networks)** — for sequences like text, speech, or time series, because they can use information from previous steps. +- **Transformers** — newer and faster, because they handle sequences without reading them step-by-step. + +At the same time, big progress in training methods and powerful hardware like GPUs and TPUs made it possible to train much larger and smarter models. 
+ +In the next sections, we look at each architecture in a simple way, why activation functions matter, how LSTM and GRU improved RNNs, and why Transformers became a turning point with models like BERT. + +![Introduction](images/all.png) + +--- + +## Activation Functions: Fueling Deep Networks (and Vanishing Gradients) + +Activation functions help a neural network learn more than simple straight-line patterns. Without them, even a deep network would act like a basic model and wouldn't understand complex data. + +But the activation you choose can strongly affect how well and how fast training works. + +### Why old activations caused problems (Sigmoid and Tanh) + +In early neural networks, people often used sigmoid or tanh. The problem is that these functions can "flatten out" when the input is very large or very small. When that happens, the network receives almost no learning signal during training. + +So in deep networks, the learning signal becomes smaller and smaller as it moves backward through the layers. This leads to the famous vanishing gradient problem: + +- the first layers learn very slowly +- training becomes hard +- deep networks don't improve much + +### ReLU changed everything + +A big improvement came with ReLU (Rectified Linear Unit). ReLU is extremely simple: + +- if the input is negative → output is 0 +- if the input is positive → output is the same number + +When the input is positive, the learning signal stays strong, so the network can train much deeper without getting stuck. ReLU also makes many values become exactly 0, so the model ignores useless signals and calculations are faster. + +You can think of a ReLU neuron like a switch: + +- if something important is detected → it turns ON +- if not → it stays OFF (outputs 0) + +This works especially well in CNNs, where neurons detect things like edges, corners, shapes, and parts of objects. 
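The vanishing-gradient argument above can be made concrete with a few lines of plain Python. This is a minimal sketch under simplifying assumptions: weights are ignored, every layer is assumed to see the same pre-activation value, and the helper names (`sigmoid_grad`, `relu_grad`, `backprop_signal`) are made up for this illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    # Derivative of sigmoid; its maximum is 0.25, reached at x = 0.
    s = sigmoid(x)
    return s * (1.0 - s)

def relu_grad(x):
    # Derivative of ReLU: 1 for positive inputs, 0 otherwise.
    return 1.0 if x > 0 else 0.0

def backprop_signal(grad_fn, pre_activation, depth):
    """Multiply the local gradient once per layer, as the chain rule does.

    Simplification: weights are ignored and each layer sees the same input.
    """
    signal = 1.0
    for _ in range(depth):
        signal *= grad_fn(pre_activation)
    return signal

depth = 20
# Sigmoid in its best case (x = 0) still shrinks the signal by 4x per layer.
print(backprop_signal(sigmoid_grad, 0.0, depth))  # roughly 9.1e-13
# An active ReLU passes the signal through unchanged.
print(backprop_signal(relu_grad, 2.0, depth))     # 1.0
# An inactive ReLU blocks it completely (the "dying ReLU" case).
print(backprop_signal(relu_grad, -2.0, depth))    # 0.0
```

Even in sigmoid's most favorable case the signal after 20 layers is close to zero, while an active ReLU keeps it at full strength.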
+
+### ReLU has one weakness: dying ReLU
+
+Sometimes a neuron gets stuck outputting only 0: it never turns ON and stops learning. This is called dying ReLU. Once a neuron dies, it may never recover, because its gradient becomes 0.
+
+### Fixing dying ReLU: Leaky ReLU and others
+
+- **Leaky ReLU**: instead of outputting 0 for negative inputs, it outputs a small fraction of the input (for example 0.01 × the input), so the gradient is never exactly 0 and the neuron still learns a little.
+- **PReLU**: same idea, but the small negative slope is learned automatically.
+- **ELU / SELU**: smooth the negative side, which can make training faster and more stable.
+
+### GELU: the activation used in BERT
+
+In modern Transformer models like BERT, a popular activation is GELU (Gaussian Error Linear Unit). You can think of GELU as a smoother and smarter ReLU. Instead of a hard ON/OFF switch, GELU behaves like:
+
+- strong positive values pass through
+- values near zero pass partly
+- negative values are reduced but not fully blocked
+
+So it gives the network a more gentle decision-making style, and it keeps some learning signal even for slightly negative inputs. This helps Transformers train more smoothly and often improves results.
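The differences between ReLU, Leaky ReLU, and GELU are easiest to see on a few sample inputs. The sketch below uses only the standard library. The slope `0.01` is an assumed, commonly used value for Leaky ReLU, and `gelu` uses the exact form x · Φ(x), where Φ is the standard normal CDF.

```python
import math

def relu(x):
    # Hard switch: negatives become 0, positives pass through.
    return max(0.0, x)

def leaky_relu(x, negative_slope=0.01):
    # Negatives are scaled down instead of zeroed (slope value is an assumption).
    return x if x > 0 else negative_slope * x

def gelu(x):
    # Exact GELU: x * Phi(x), with Phi the standard normal CDF.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

for x in (-3.0, -0.5, 0.0, 0.5, 3.0):
    print(f"x={x:+.1f}  relu={relu(x):.4f}  "
          f"leaky_relu={leaky_relu(x):.4f}  gelu={gelu(x):.4f}")
```

For slightly negative inputs, ReLU returns exactly 0, Leaky ReLU returns a tiny negative number, and GELU returns a small negative number that fades toward 0 as the input gets more negative.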
+
+### Final idea
+
+- Sigmoid and tanh often caused weak learning in deep networks
+- ReLU made deep training much easier and faster
+- ReLU variants fixed common weaknesses
+- GELU became popular in modern models like BERT because it is smoother and more stable
+
+![Activation functions](images/activation_functions.png)
+
+### Quick comparison: Activation functions
+
+*A side-by-side look at the activations above, so you can pick one fast.*
+
+| Activation | Output range | Main strength | Main weakness | Where to use |
+|---|---|---|---|---|
+| Sigmoid | 0 to 1 | Smooth, easy to read as probability | Vanishing gradient in deep nets | Output layer for binary classification |
+| Tanh | -1 to 1 | Zero-centered, stronger gradients than sigmoid | Still saturates → vanishing gradient | Old RNNs, some hidden layers |
+| ReLU | 0 to ∞ | Fast, no saturation on positive side, sparse outputs | Dying ReLU (neurons stuck at 0) | Default for CNNs and most hidden layers |
+| Leaky ReLU | -∞ to ∞ (negatives scaled by a small fixed slope) | Fixes dying ReLU with a small slope | Extra hyperparameter to tune | When ReLU neurons keep dying |
+| PReLU | -∞ to ∞ (negatives scaled by a learned slope) | Slope is learned by the network | Slightly more parameters | When you want the model to learn the slope |
+| ELU / SELU | ≈ -1 to ∞ (ELU), ≈ -1.76 to ∞ (SELU) | Smooth negatives, can train faster and more stably | More compute than ReLU | Deep networks needing stable training |
+| GELU | ≈ -0.17 to ∞ | Smoother than ReLU, keeps small signal for negatives | Heavier to compute than ReLU | Modern Transformers (BERT, GPT) |
+
+---
+
+## Convolutional Neural Networks (CNNs): Local Patterns and Visual Understanding
+
+As neural networks started working with images and other grid-like data, CNNs appeared as a much better solution. They made training faster, cheaper, and far more accurate.
+
+### Why normal ANNs are not good for images
+
+A normal fully-connected network treats an image like a huge list of numbers.
Every neuron connects to every pixel, so it needs an enormous number of weights. + +Example: a 1000×1000 RGB image has about 3 million pixel values. If just one neuron connects to all of them, it needs 3 million weights — impossible when we want many neurons and many layers. + +Even worse, fully-connected layers don't understand the structure of an image. Nearby pixels are related, edges are formed locally, and position matters a lot. A fully-connected model is heavy and also kind of blind to how images actually work. + +### What CNNs do differently + +CNNs solve the problem with 2 simple but powerful tricks. + +### 1) Look at small areas, not the whole image + +Instead of connecting a neuron to the entire picture, CNNs look at small patches like 3×3 or 5×5. This is smart because important patterns (like edges) are usually local. So a CNN learns small edges, corners, and curves. + +### 2) Reuse the same pattern detector everywhere + +CNNs reuse the same set of weights across the image. This is called weight sharing. In simple words: one filter learns a pattern once (example: vertical edge) and then checks for it everywhere in the image. + +A 5×5 filter only needs 25 weights, and those 25 weights are reused across the whole image. That's why CNNs are so efficient. + +### Why CNNs are good when objects move + +In real life, the same object can appear on the left, right, top, or bottom. CNNs handle this naturally because filters slide across the image. So if a CNN learns "this looks like a wheel", it will detect it anywhere in the picture. + +Many CNNs also use pooling, which slightly reduces image size and makes the network more stable to small shifts. 
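The weight counts and the "same detector everywhere" idea above can be checked with a short stdlib-only sketch. The function names are hypothetical helpers for this illustration, the 1D signal stands in for one row of an image, and biases are not counted. Note that the 25-weight figure is for a single input channel. On an RGB input, a 5×5 filter spans all 3 channels and has 75 weights.

```python
def dense_neuron_weights(height, width, channels):
    """Weights needed by ONE fully-connected neuron that sees every pixel value."""
    return height * width * channels

def conv_filter_weights(kernel_size, in_channels=1):
    """Weights in ONE square convolution filter (bias not counted)."""
    return kernel_size * kernel_size * in_channels

def slide_filter(signal, kernel):
    """Apply the same kernel at every position of a 1D signal (weight sharing)."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]

print(dense_neuron_weights(1000, 1000, 3))     # 3000000
print(conv_filter_weights(5))                  # 25
print(conv_filter_weights(5, in_channels=3))   # 75

edge_detector = [-1, 1]              # responds where the signal steps up
step_on_left = [0, 1, 1, 1, 1, 1]
step_on_right = [0, 0, 0, 0, 1, 1]
print(slide_filter(step_on_left, edge_detector))    # [1, 0, 0, 0, 0]
print(slide_filter(step_on_right, edge_detector))   # [0, 0, 0, 1, 0]
```

The two-weight detector fires wherever the step occurs, and taking the maximum of each response (a crude form of pooling) gives the same value, 1, for both inputs.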
+
+### How CNNs learn from simple to complex
+
+- early layers learn simple things (edges, lines)
+- middle layers learn shapes and textures
+- deep layers learn object parts (eyes, wheels, doors)
+- final layers learn full objects (car, face, cat)
+
+### Why CNNs became a breakthrough
+
+CNNs made deep learning on images realistic because they:
+
+- use far fewer weights than dense networks
+- understand local image patterns
+- handle object movement better
+- learn features automatically (you don't need to hand-code them)
+
+When GPUs became widely used, CNNs became even stronger. One famous moment was AlexNet (2012), which won the ImageNet competition (ILSVRC) by a large margin and made the whole world take deep learning seriously.
+
+### Real-world applications of CNNs
+
+- Image classification (what is in the photo?)
+- Object detection (where is the object?)
+- Segmentation (outline objects pixel by pixel)
+- Face recognition
+- Medical scans (X-ray, MRI)
+- Self-driving cars (lanes, people, traffic signs)
+- Security, satellite analysis, robotics
+
+In healthcare, CNNs can sometimes match expert performance in specific tasks, like spotting signs of disease in medical images.
+
+### CNN in practice
+
+A small example using a pretrained ResNet from `torchvision`. This is the same pattern used in production for image classification (e.g., product tagging, content moderation): load a pretrained model, preprocess the image, run inference, return the top class.
+
+```python
+import torch
+from torchvision import models
+from PIL import Image
+
+# Load ImageNet-pretrained weights and switch to inference mode.
+model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT).eval()
+# Use the resize/crop/normalize steps that match these weights.
+preprocess = models.ResNet50_Weights.DEFAULT.transforms()
+
+# Preprocess the image and add a batch dimension.
+img = preprocess(Image.open("cat.jpg").convert("RGB")).unsqueeze(0)
+with torch.no_grad():
+    probs = model(img).softmax(dim=1)
+
+# Index of the most likely ImageNet class.
+top = probs.argmax(dim=1).item()
+```
+
+![CNN architecture](images/CNN.png)
+
+---
+
+## Recurrent Neural Networks (RNNs): Remembering Sequences
+
+CNNs are great for images, but they don't solve problems where order matters. For text or audio, the sequence is the whole point. That's why RNNs were created.
+
+### Why we need RNNs
+
+In many real tasks, the data comes in a sequence:
+
+- **Language**: the meaning of a word depends on the words before it
+- **Speech**: sounds make sense only when heard in order
+- **Time series** (weather, stock prices, sensor data): the next value depends on past values
+
+### What makes an RNN different
+
+A normal neural network looks at data once and moves forward. An RNN is different because it has a kind of memory: a hidden state, updated step by step.
+
+At every moment t, the RNN uses:
+
+- the current input xₜ (example: the current word)
+- what it remembers from the past hₜ₋₁
+
+And produces a new memory hₜ, which is basically: *"What I know so far from the sequence."* So RNNs are like reading a sentence word-by-word while remembering what you read before.
+
+### Simple example
+
+Sentence: *"I grew up in Moldova, and now I live in Finland."*
+
+If the model reads the word "Finland", it should still remember who is speaking ("I") and what country was mentioned earlier ("Moldova"). That's exactly what the hidden state helps with.
+
+### The big problem: vanilla RNNs forget too fast
+
+In theory, RNNs can remember information across long sequences. In real life, basic RNNs work well only for short sequences.
When the sequence gets long, they start to forget old information, so they often focus only on what happened recently. + +Why do they forget? To train an RNN, we look back through the whole sequence. The learning signal becomes smaller and smaller as it goes back in time, so the network can't learn long connections. This is the same vanishing gradient problem, just happening across time steps instead of layers. + +![RNN unrolled](images/RNN.png) + +--- + +## LSTM and GRU: Tackling Long-Term Dependencies in RNNs + +LSTM (Long Short-Term Memory) was created to fix the biggest weakness of basic RNNs: they forget important information too quickly. Instead of using only one memory value, an LSTM adds a real memory system, controlled by smart gates. It was introduced in 1997 by Hochreiter and Schmidhuber. + +### The main idea of LSTM + +An LSTM carries two things forward: + +- **Cell state (Cₜ)** → long-term memory. Like a notebook that keeps important information for a long time. +- **Hidden state (hₜ)** → short-term memory / current output. What the model is focusing on right now. + +The big win: the cell state can keep information almost unchanged for many steps, so the model can remember long context. + +### LSTM gates (basically decisions) + +Inside every LSTM step, there are 3 gates. Each gate decides what to do with information. + +### 1) Forget Gate (what to erase) + +This gate decides: do we still need this old information, or is it useless now? It outputs values between 0 and 1: 0 = delete, 1 = keep fully. So the model learns when to forget. Example: if the topic of a paragraph changes, the model should forget old details. + +### 2) Input Gate (what new info to save) + +This gate decides what new information from the current step is important enough to store. 
Example: when reading "Paris" in a sentence, the network may decide *"this is important — save it."* + +### 3) Output Gate (what to show now) + +This gate decides what part of the memory should affect the current output. So the model doesn't output everything it knows — it outputs only what is useful right now. + +### Why LSTM remembers better than RNN + +- It can keep memory stable +- It updates only when needed +- Learning signals don't disappear so fast during training + +This helped LSTMs learn patterns over dozens — sometimes hundreds — of steps, something basic RNNs could not do. + +### Why LSTMs became popular + +From the 2000s to early 2010s, LSTMs were the best solution for many sequence problems: speech recognition, language translation, text prediction, sentiment analysis, music generation. They worked well because they could keep meaning across longer text. + +### The downside: LSTMs are heavy + +LSTMs are powerful, but also slower, bigger, more complex, and use more parameters than a simple RNN. Training takes more time and memory. + +### GRU: a simpler and faster version (2014) + +To make things easier, researchers introduced GRU (Gated Recurrent Unit). A GRU keeps the same goal — remember long context, forget useless info — but with fewer parts: + +- **Update Gate** → decides how much old info to keep vs replace +- **Reset Gate** → decides how much past info to ignore when making new memory + +Advantages: fewer gates, fewer parameters, faster training. In many tasks, GRUs work almost as well as LSTMs — sometimes even better, especially when the dataset is smaller. + +### The big limitation of both LSTM and GRU + +Even though LSTM and GRU are strong, they share one major weakness: they read sequences step by step. + +- step 2 waits for step 1 +- step 3 waits for step 2 + +So you can't fully speed it up using parallel computing. This becomes a big problem when sequences are very long, models are very deep, or data is huge. 
Even LSTMs still struggle to keep everything important across extremely long sequences.
+
+Because of these limits, researchers wanted a new solution. That next big step was Transformers.
+
+### LSTM in practice
+
+A minimal PyTorch LSTM classifier. Same skeleton used in production for things like log-line anomaly detection, short-text intent classification, and time-series forecasting on sensor data: embed the input, run an LSTM, take the last hidden state, project to class scores.
+
+```python
+import torch
+import torch.nn as nn
+
+class LSTMClassifier(nn.Module):
+    def __init__(self, vocab_size, embed_dim=128, hidden_dim=256, num_classes=2):
+        super().__init__()
+        # Token id 0 is reserved for padding.
+        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
+        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
+        self.fc = nn.Linear(hidden_dim, num_classes)
+
+    def forward(self, x):
+        e = self.embed(x)          # (batch, seq_len, embed_dim)
+        _, (h, _) = self.lstm(e)   # h: (num_layers, batch, hidden_dim)
+        return self.fc(h[-1])      # final hidden state of the last layer
+
+model = LSTMClassifier(vocab_size=20000)
+# Dummy batch: 4 sequences of 50 token ids (ids start at 1 to skip padding).
+logits = model(torch.randint(1, 20000, (4, 50)))
+```
+
+![LSTM cell](images/LSTM_GRU.png)
+
+---
+
+## Transformers: "Attention Is All You Need"
+
+In 2017, everything changed in sequence learning because of a paper with a bold title: **"Attention Is All You Need."** This paper introduced Transformers, which quickly became the new standard for working with text and other sequences.
+
+### Why Transformers were a big step forward
+
+Before Transformers, models like RNNs, LSTMs, and GRUs processed data step by step. If you have 1000 words, the model must go through 1000 steps in order. That's slow. Transformers removed this problem.
+
+### The main idea: Self-Attention
+
+Self-attention means each word can look directly at any other word in the sentence. Instead of reading the sentence one word at a time, the Transformer sees everything at once and decides:
+
+- which words matter most for this word?
+- what should I focus on?
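The two questions above are answered by attention weights. Here is a minimal stdlib-only sketch of scaled dot-product self-attention. It makes one big simplification that should be read as an assumption: queries, keys, and values are the raw input vectors themselves, whereas a real Transformer first passes them through learned projection matrices. The toy 2-dimensional "word vectors" are invented for the example.

```python
import math

def softmax(scores):
    # Subtract the max for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Scaled dot-product self-attention with Q = K = V = vectors.

    Simplification: no learned projections, a single head.
    Returns (outputs, attention_weights).
    """
    d = len(vectors[0])
    outputs, all_weights = [], []
    for q in vectors:
        # How strongly does this word match every word (including itself)?
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in vectors]
        weights = softmax(scores)
        # New representation: weighted average of all word vectors.
        out = [sum(w * v[i] for w, v in zip(weights, vectors))
               for i in range(d)]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Toy vectors: word 0 and word 2 point in similar directions, word 1 does not.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.1]]
outputs, weights = self_attention(tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` sums to 1 and says how much one word "looks at" every other word. Every row is computed independently of the others, which is why the whole thing can run in parallel.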
+ +Example sentence: *"The car that was parked near the house is red."* When the model reads "is red", it needs to know what it refers to → the car. Self-attention helps the model connect those words directly, even if they are far apart. + +### Why this solves the long-distance problem + +In RNN-based models, long-distance connections are hard because the information must pass through many steps. In Transformers, word 1 can connect to word 50 instantly, with no memory loss from passing through many time steps. So long sentences become much easier to understand. + +### Transformers are faster because they work in parallel + +With an RNN, you must compute step 1 → step 2 → step 3 — it cannot be fully parallel. With a Transformer, all words are processed in one shot, and modern GPUs can compute attention using large matrix operations efficiently. So training becomes dramatically faster, especially on long texts. + +### What a Transformer is made of (simple view) + +- **Encoder** — reads the input sequence and builds a strong representation +- **Decoder (optional)** — generates the output, used in tasks like translation + +Inside each layer you basically have self-attention, a small feed-forward network, and skip connections (which help stability). + +### Multi-head attention (why it's useful) + +Transformers don't use only one attention view. They use multiple attention heads: + +- one head might focus on grammar structure +- another might connect related words far away +- another might focus on meaning and context + +So the model learns multiple views of the same sentence. + +### The biggest problems Transformers solved + +- They don't forget long-range relationships like RNNs often do +- They train much faster, because they don't work step-by-step + +### Why Transformers became dominant + +Once Transformers showed strong results, researchers scaled them up. 
That led to BERT (great at understanding text), GPT models (great at generating text), Vision Transformers for images, audio models, and mixed text + image + audio models. + +### One downside (still important) + +Self-attention checks relationships between every pair of words, which can become expensive. For very long texts (thousands of tokens), it can require a lot of computation. That's why researchers keep improving Transformer efficiency. + +![Transformer](images/tranformers_1.png) + +--- + +## BERT: Bidirectional Transformers and Language Understanding + +BERT is a popular Transformer model made by Google in 2018. It became famous because it helped computers understand text much better, not just generate it. + +### What makes BERT special? + +BERT is **bidirectional**, meaning it reads a sentence using both the words before a word and the words after a word. So it understands context from both sides at the same time. + +Example: the word "bank" can mean a river bank or a financial bank. BERT figures out the correct meaning by looking at the full sentence. + +### How BERT is trained (in a simple way) + +BERT learns by doing two main tasks: + +- **Masked words** — some words are hidden, and BERT must guess them. Example: *"The ___ sailed across the ocean"* → BERT predicts "ship". +- **Next sentence check** — BERT gets two sentences and must decide if the second one is a real continuation of the first. + +This training helps BERT learn grammar, meaning, and connections between sentences. + +### Why BERT mattered + +Before BERT, many models read text only left-to-right. BERT showed that reading in both directions gives much stronger understanding. Also, developers don't need to train models from zero anymore: train BERT once on huge text, then fine-tune it for tasks like question answering, sentiment analysis, or named entity recognition. 
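The masked-word task described in the training section above can be sketched in plain Python. This is a simplified illustration, not BERT's actual preprocessing. The function name `mask_tokens` is hypothetical, whitespace-split words stand in for BERT's subword tokens, and every selected token is replaced by `[MASK]` (the original recipe selects about 15% of tokens and leaves some of the selected ones unchanged or swaps them for random tokens).

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Hide a random subset of tokens and remember the hidden originals.

    Returns (masked_tokens, labels). labels[i] is the original token where
    position i was masked, and None where the token was left visible.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(tok)      # the model must predict this word
        else:
            masked.append(tok)
            labels.append(None)     # no prediction needed here
    return masked, labels

sentence = "the ship sailed across the ocean".split()
# A higher rate than 15% is used here only so a short sentence gets masked.
masked, labels = mask_tokens(sentence, mask_prob=0.3, seed=1)
print(masked)
print(labels)
```

During pretraining, the model sees `masked` as input and is scored only on the positions where `labels` is not `None`.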
+
+### BERT in practice
+
+The HuggingFace `pipeline` API is a common first choice in production for NLP tasks like sentiment, classification, NER, or QA. Under the hood it loads a pretrained Transformer (here, DistilBERT, a smaller distilled version of BERT, fine-tuned for sentiment on SST-2), tokenizes the input, runs inference, and returns the label and confidence score.
+
+```python
+from transformers import pipeline
+
+classifier = pipeline("sentiment-analysis",
+                      model="distilbert-base-uncased-finetuned-sst-2-english")
+
+# Returns one dict per input, with a "label" and a "score".
+result = classifier("This guide made neural networks easy to understand.")
+```
+
+![BERT](images/transformers_2.png)
+
+---
+
+
+## Quick comparison: ANN vs CNN vs RNN vs LSTM vs GRU vs Transformer
+
+*Each model was made to fix a real weakness of the one before it. This table is a fast cheat sheet.*
+
+| Model | Best for | Key idea | Main strength | Main weakness |
+|---|---|---|---|---|
+| ANN (MLP) | Tabular / small structured data | Fully-connected layers of neurons | Simple, easy to build, works on basic tasks | Too many weights for big inputs, no sense of order |
+| CNN | Images, video, grid-like data | Small filters + weight sharing + pooling | Few parameters, learns local patterns, position-tolerant | Not made for sequences or order |
+| RNN | Short sequences (text, speech, time series) | Hidden state passed step by step | Understands order, remembers recent context | Forgets long context (vanishing gradient over time) |
+| LSTM | Long sequences with long-range meaning | Cell state + 3 gates (forget, input, output) | Remembers important info for many steps | Slow, heavy, still sequential |
+| GRU | Sequences when you want LSTM-quality but lighter | 2 gates (update, reset) | Faster and simpler than LSTM, similar quality | Still sequential, can't fully use GPU parallelism |
+| Transformer | Long text, translation, large datasets, multimodal | Self-attention over the whole sequence at once | Long-range context, parallel training, scales massively | O(n²) cost on very long sequences |
+
+---
+
+## Conclusion
+
+The journey from simple ANNs to CNNs, RNNs, LSTMs/GRUs,
and finally Transformers represents a continual quest to build neural networks that learn more efficiently and capture richer structure in data. + +Each new model type solved a real limitation: + +- **CNNs** made image learning possible by focusing on small areas and reusing the same filters across the image. +- **LSTMs and GRUs** improved sequence learning by adding memory and smart gates, so the model doesn't forget important information too quickly. +- **Transformers** changed everything by using attention, letting the model connect any part of a sequence with any other, and training much faster because it doesn't process data step by step. + +Other improvements helped all these models work better, like better activation functions (ReLU and later GELU), stronger training methods, and faster hardware like GPUs and TPUs. + +### Why this matters for developers + +For developers, this isn't just history — it's very useful in practice. When you understand why these models were created, you can: + +- choose the right model for your task +- train models more efficiently +- debug problems faster +- build better systems and pipelines + +Today, Transformers are the most dominant models in many areas, especially with big pre-trained models like BERT and GPT. But the field is still moving forward. Researchers are now working on: + +- making Transformers cheaper and faster (especially for long documents) +- building models that have better memory +- creating hybrid models that combine the best ideas from CNNs, RNNs, and Transformers + +Deep learning keeps evolving because every "best model" has limits — and the next breakthrough usually comes from solving those limits. 
diff --git a/article-neural-networks/images/CNN.png b/article-neural-networks/images/CNN.png
new file mode 100644
index 0000000..afcf1ee
Binary files /dev/null and b/article-neural-networks/images/CNN.png differ
diff --git a/article-neural-networks/images/LSTM_GRU.png b/article-neural-networks/images/LSTM_GRU.png
new file mode 100644
index 0000000..a405fa8
Binary files /dev/null and b/article-neural-networks/images/LSTM_GRU.png differ
diff --git a/article-neural-networks/images/RNN.png b/article-neural-networks/images/RNN.png
new file mode 100644
index 0000000..15184ec
Binary files /dev/null and b/article-neural-networks/images/RNN.png differ
diff --git a/article-neural-networks/images/activation_functions.png b/article-neural-networks/images/activation_functions.png
new file mode 100644
index 0000000..a2ffbb6
Binary files /dev/null and b/article-neural-networks/images/activation_functions.png differ
diff --git a/article-neural-networks/images/all.png b/article-neural-networks/images/all.png
new file mode 100644
index 0000000..7d7ace2
Binary files /dev/null and b/article-neural-networks/images/all.png differ
diff --git a/article-neural-networks/images/tranformers_1.png b/article-neural-networks/images/tranformers_1.png
new file mode 100644
index 0000000..e2b7650
Binary files /dev/null and b/article-neural-networks/images/tranformers_1.png differ
diff --git a/article-neural-networks/images/transformers_2.png b/article-neural-networks/images/transformers_2.png
new file mode 100644
index 0000000..af20fc6
Binary files /dev/null and b/article-neural-networks/images/transformers_2.png differ