Understanding Neural Networks & Modern AI Systems

An interactive journey through neural networks, LLMs, and the revolutionary thinking models of 2025

Individual Neuron Processing

Explore how a single neuron processes inputs and weights

What's happening here?

A neuron is like a tiny calculator that takes inputs (data), multiplies each by a weight (importance), adds a bias (adjustment), and produces an output. Think of it like deciding whether to go outside:

  • Inputs: Weather temperature, rain probability
  • Weights: How much you care about each factor
  • Bias: Your general preference for staying indoors
  • Output: Final decision (0-1 scale, where 1 means "definitely go out")

The sigmoid function squashes the weighted sum into a smooth number between 0 and 1, making the decision gradual rather than all-or-nothing.
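The neuron described above fits in a few lines of Python. This is a minimal sketch; the input and weight values mirror the demo's sliders, but the pairing of values is illustrative rather than the demo's exact configuration.

```python
import math

def neuron(inputs, weights, bias):
    """Weighted sum of inputs plus bias, squashed by a sigmoid."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))  # sigmoid: smooth output in (0, 1)

# "Go outside?" decision: temperature and rain signals, weighted by preference.
output = neuron(inputs=[1.0, 0.5], weights=[0.8, 0.3], bias=0.0)
print(round(output, 2))
```

Try nudging a weight up or down and rerunning: because the sigmoid is smooth, the output shifts gradually instead of flipping between 0 and 1.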

[Interactive demo: adjust the inputs (1.0, 0.5), weights (0.8, 0.3), and bias (0.0); the neuron's output reads 0.69]

Forward Propagation

Watch data flow through network layers

What's happening here?

Forward propagation is like an assembly line in a factory. Data enters at one end and gets processed through multiple layers:

  • Input Layer: Raw data enters (like ingredients)
  • Hidden Layers: Each layer processes and transforms the data (like cooking steps)
  • Output Layer: Final result emerges (like the finished meal)

Each neuron in a layer receives outputs from the previous layer, processes them, and passes results forward. The animation speed lets you slow down or speed up this process to see how information flows through the network.
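The layer-by-layer flow can be sketched as a loop that applies each layer's weights and biases in turn. The 2-3-1 network shape and all numbers below are made up for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, layers):
    """Pass an input vector through each (weights, biases) layer in order."""
    activation = x
    for weights, biases in layers:
        # Each neuron takes the previous layer's outputs and produces one value.
        activation = [
            sigmoid(sum(a * w for a, w in zip(activation, row)) + b)
            for row, b in zip(weights, biases)
        ]
    return activation

# Toy 2-3-1 network: 2 inputs, one hidden layer of 3 neurons, 1 output.
hidden = ([[0.2, -0.4], [0.7, 0.1], [-0.5, 0.3]], [0.0, 0.1, -0.1])
output = ([[0.6, -0.2, 0.8]], [0.05])
print(forward([1.0, 0.5], [hidden, output]))
```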

Backpropagation

See how errors propagate backward to adjust weights

What's happening here?

Backpropagation is how neural networks learn from their mistakes. Think of it like getting feedback on a presentation:

  • Target Output: What the network should have predicted (the correct answer)
  • Error: Difference between what it predicted and what was correct
  • Learning Rate: How much to adjust based on the error (like how much you trust the feedback)
  • Weight Updates: The network adjusts its internal parameters to do better next time

The error flows backward through the network, telling each neuron how much it contributed to the mistake. Smaller learning rates mean more careful, gradual learning.
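For a single sigmoid neuron, one learning step works out to a few lines: compute the error, push it back through the sigmoid with the chain rule, and nudge the weight and bias. The input, target, and learning rate below are arbitrary illustration values.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_step(x, w, b, target, lr):
    """One gradient-descent update for a single sigmoid neuron (squared error)."""
    pred = sigmoid(x * w + b)
    error = pred - target                # how wrong the prediction was
    delta = error * pred * (1.0 - pred)  # chain rule through the sigmoid
    w -= lr * delta * x                  # weight update, scaled by learning rate
    b -= lr * delta                      # bias update
    return w, b, error

w, b = 0.5, 0.0
for _ in range(1000):
    w, b, error = train_step(x=1.0, w=w, b=b, target=0.8, lr=0.1)
print(abs(error))  # error shrinks toward 0 over many steps
```

Raising `lr` makes each correction larger but riskier; lowering it gives the careful, gradual learning described above.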

[Interactive demo: set the target output (0.8) and learning rate (0.1); the error readout starts at 0.00]

Parameter Tuning in Modern LLMs

Visualize how hundreds of billions of parameters get adjusted through gradient descent and mixture of experts

What's happening here?

Modern language models like ChatGPT have billions of parameters (weights and biases). Think of this like tuning a massive orchestra:

  • Total Parameters: Every "knob" in the model that can be adjusted (like 671 billion musicians)
  • Active Parameters (MoE): Only some musicians play at once (37 billion in this example)
  • Gradient Descent: The process of finding the best settings (like a conductor fine-tuning each section)
  • Loss Landscape: Visual representation of how "good" different parameter settings are

Mixture of Experts (MoE) is like having specialist musicians - only the relevant experts are activated for each piece of text, making the model more efficient.
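The routing idea can be sketched with a toy top-k gate: a router scores every expert, but only the highest-scoring few actually run. The four "experts" here are trivial scaling functions and the gate scores are made up; real models learn both.

```python
def top_k(scores, k):
    """Indices of the k highest-scoring experts."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

def moe_forward(x, experts, gate_scores, k=2):
    chosen = top_k(gate_scores, k)
    total = sum(gate_scores[i] for i in chosen)
    # Weighted combination of only the selected experts' outputs;
    # the unselected experts do no work at all on this input.
    y = sum(gate_scores[i] / total * experts[i](x) for i in chosen)
    return y, chosen

experts = [lambda x, s=s: s * x for s in (0.5, 1.0, 2.0, 3.0)]  # 4 toy "experts"
gate_scores = [0.1, 0.4, 0.3, 0.2]   # router's affinity for each expert
y, used = moe_forward(5.0, experts, gate_scores, k=2)
print(used)  # only 2 of the 4 experts were activated
```

Scaled up, this is how a 671B-parameter model can run only 37B parameters per token.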

  • Total Parameters: 671B
  • Active Parameters (MoE): 37B
  • Updated This Step: 2.1M
  • Loss: 2.34

How Training Data Shapes Parameters

Observe parameter evolution across training epochs

What's happening here?

Training a neural network is like learning to recognize patterns by seeing many examples. Think of learning to identify cats:

  • Epochs: How many times the model has seen the entire dataset (like study sessions)
  • Dataset Size: How many examples the model learns from (1K = 1,000 images, 100K = 100,000 images)
  • Progress Bar: Shows how far through training the model is
  • Loss Curve: Shows how the model's accuracy improves over time

More data usually means better performance, but also longer training time. The model gradually adjusts its parameters to better recognize patterns in the training data.
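The epoch-by-epoch process is just a nested loop: each epoch passes over the whole dataset, adjusting the parameters a little per example. The threshold task, dataset size, and hyperparameters below are invented for illustration.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy dataset: learn to output 1 when the input exceeds 0.5.
random.seed(0)
data = [(x, 1.0 if x > 0.5 else 0.0) for x in [random.random() for _ in range(100)]]

w, b, lr = 0.0, 0.0, 0.5
for epoch in range(100):                  # one epoch = one pass over the dataset
    loss = 0.0
    for x, target in data:
        pred = sigmoid(w * x + b)
        error = pred - target
        loss += error ** 2
        grad = error * pred * (1 - pred)  # gradient of the squared error
        w -= lr * grad * x
        b -= lr * grad
    if epoch % 25 == 0:
        print(epoch, round(loss / len(data), 3))  # loss falls across epochs
```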

[Interactive demo: training progress, Epoch 0 of 100]

From Neural Networks to Transformers

Explore the evolution to transformer architecture

What's happening here?

Transformers are the architecture behind ChatGPT, Claude, and most modern AI. Think of the evolution like transportation methods:

  • Simple NN: Like walking - processes one piece at a time
  • RNN: Like a bicycle - can handle sequences but slowly, one step at a time
  • Attention: Like having GPS - can "look" at all parts of the input simultaneously
  • Transformer: Like a highway system - parallel processing with multiple attention "lanes"

Multi-head attention is like having multiple experts each focusing on different aspects of the text (grammar, meaning, context, etc.) all at the same time.
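The "look at all parts of the input simultaneously" idea is scaled dot-product attention, sketched below for a single head. The three "token" vectors are made-up numbers; real models learn separate query, key, and value projections.

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention: every position attends to every position."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)  # how strongly q attends to each position
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

# Three "token" vectors attend to each other.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
result = attention(tokens, tokens, tokens)
print([[round(v, 2) for v in row] for row in result])
```

Multi-head attention runs several copies of this in parallel with different learned projections, then concatenates the results, which is the "multiple lanes" in the highway analogy.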

The Era of Thinking Models

How modern AI systems reason through problems step-by-step

What's happening here?

Thinking models represent the latest breakthrough in AI - they can "think" before responding, like a human pausing to consider a complex question:

  • Chain-of-Thought: The AI breaks down complex problems into smaller steps
  • Extended Reasoning: Models can "think" for thousands of tokens before answering
  • Tool Integration: Can use external tools (search, calculators) while reasoning
  • Self-Reflection: Models evaluate their own reasoning and correct mistakes

The performance metrics (like SWE-bench scores) show how well these models solve real-world coding and reasoning tasks. Higher scores mean better problem-solving ability.

2025's Leading Thinking Models

Claude Opus 4.1

  • 64K token thinking capacity
  • Tool use during reasoning
  • 7+ hour autonomous operation
  • SWE-bench: 72.5%

GPT-5 (OpenAI)

  • Unified reasoning + fast response
  • Real-time routing system
  • 4.8% hallucination rate
  • SWE-bench: 74.9%

o3 (OpenAI)

  • Chain-of-thought reasoning
  • Codeforces Elo: 2727
  • Reinforcement learning refined
  • SWE-bench: 71.7%

Gemini 2.5 Pro

  • Deep Think mode
  • Massive context windows
  • Advanced reasoning chains
  • Multi-modal thinking

DeepSeek-R1 (New)

  • Built on DeepSeek-V3's MoE base
  • Reasoning refined via reinforcement learning; smaller dense models distilled from it
  • Specialized for complex reasoning
  • Cost-effective deployment

How Thinking Models Work

  1. Problem Analysis: Break down complex queries into components
  2. Reasoning Chains: Generate multiple solution pathways
  3. Self-Reflection: Evaluate and refine reasoning steps
  4. Tool Integration: Use external tools during thinking
  5. Synthesis: Combine insights into coherent response
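The five steps above can be sketched as a loop around a model and a tool. This is a schematic, not any real model's API: `mock_llm` stands in for a model call and `calculator` for an external tool, both invented for illustration.

```python
def calculator(expr):
    """Toy external tool: evaluates arithmetic only (no builtins exposed)."""
    return eval(expr, {"__builtins__": {}})

def mock_llm(step, context):
    # Stand-in for a model call; a real system would generate text here.
    return f"[{step}] considered with context: {context}"

def thinking_loop(question):
    trace = []
    trace.append(mock_llm("analysis", question))     # 1. break the problem down
    trace.append(mock_llm("reasoning", trace[-1]))   # 2. generate a solution path
    trace.append(mock_llm("reflection", trace[-1]))  # 3. check the reasoning
    result = calculator("17 * 24")                   # 4. call a tool mid-thought
    trace.append(f"[tool] 17 * 24 = {result}")
    return mock_llm("synthesis", trace[-1])          # 5. answer from the trace

print(thinking_loop("What is 17 * 24?"))
```

The key structural point is that tool results flow back into the reasoning trace before the final answer is synthesized.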

Interactive Parameter Playground

Experiment with parameters and see real-time effects

What's happening here?

This is your sandbox for building neural networks! Like using LEGO blocks to build different structures:

  • Hidden Layers: How many processing stages your network has (more layers = more complex patterns)
  • Neurons per Layer: How many "workers" in each stage (more neurons = more capacity to learn)
  • Learning Rate: How quickly the network adapts (too fast = unstable, too slow = learns slowly)
  • Batch Size: How many examples to look at before updating (like studying multiple questions before checking answers)

Watch the Loss decrease and Accuracy increase as your network learns! Try different settings to see what works best.
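The learning-rate trade-off in the bullets above shows up even in a one-parameter toy problem, sketched below. The target value, rates, and step count are arbitrary illustration choices.

```python
import math

def train(lr, steps=200):
    """Fit sigmoid(w) toward target 0.9; returns the final absolute error."""
    w = 0.0
    for _ in range(steps):
        pred = 1.0 / (1.0 + math.exp(-w))
        grad = (pred - 0.9) * pred * (1 - pred)  # gradient of squared error
        w -= lr * grad
    return abs(1.0 / (1.0 + math.exp(-w)) - 0.9)

for lr in (0.01, 1.0, 100.0):
    print(lr, round(train(lr), 3))
```

A tiny rate barely moves in 200 steps, a huge rate overshoots into the sigmoid's flat region and stalls, and a moderate rate lands close to the target, which is exactly the "too fast = unstable, too slow = learns slowly" trade-off.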

Network Architecture

Network Architecture

  • Hidden Layers: 2
  • Neurons per Layer: 4

Training Parameters

  • Learning Rate: 0.1
  • Batch Size: 16

Live Training

  • Loss: 0.00
  • Accuracy: 0%

Modern Model Architectures & Scaling

From dense models to mixture of experts: efficiency meets scale

What's happening here?

The 2024-2025 breakthrough in AI isn't just about scale - it's about intelligent efficiency. Compare DeepSeek-V3 achieving GPT-4-level performance at roughly one-tenth the training cost:

Dense Architecture (Llama 3.1)

  • All parameters active: 405B working constantly
  • 128K context window: Up from 8K in Llama 3
  • GQA attention: Grouped Query Attention
  • Edge deployment: More predictable performance
  • Training cost: $50-100M for 405B model

MoE Architecture (DeepSeek-V3)

  • Smart activation: 37B of 671B parameters
  • 256+1 experts: 256 routed plus 1 shared expert per MoE layer (the first 3 layers are dense)
  • MLA attention: Multi-head Latent Attention
  • FP8 training: First successful at this scale
  • Revolutionary cost: Only $5.6M training cost

Key Insight: MoE isn't just bigger - it's fundamentally different. Like having specialist doctors instead of one generalist, each expert focuses on what they do best, achieving superior results with dramatically lower costs.
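A quick back-of-envelope check, using the figures quoted above, makes the efficiency gap concrete:

```python
# Parameter counts from the comparison above.
llama_dense = 405e9      # dense: all parameters active on every token
deepseek_total = 671e9   # MoE: total parameters
deepseek_active = 37e9   # MoE: parameters actually active per token

active_fraction = deepseek_active / deepseek_total
print(f"{active_fraction:.1%} of DeepSeek-V3's parameters are active per token")
print(f"Dense Llama 405B activates {llama_dense / deepseek_active:.1f}x more "
      "parameters per token than DeepSeek-V3")
```

Only about 5.5% of the MoE model's parameters do work on any given token, which is why it can be both larger than the dense model and cheaper to run.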

Model Comparison

  • Llama 3.1 8B: 8B parameters
  • Llama 3.1 70B: 70B parameters
  • Llama 3.1 405B: 405B parameters
  • DeepSeek-V3 (MoE): 671B total (37B active)

[Chart compares each model on Reasoning Ability, Language Understanding, and Code Generation]