
From Raw Genius to Useful Intelligence
How AI Really Learns
Today’s AI language models appear to be technological marvels: they write essays, program software, solve mathematical problems, and hold natural conversations. But behind this seemingly effortless interface lies a complex, two-stage development process. This article decodes the central dichotomy of AI training, pre-training and post-training: two fundamentally different phases that together enable the modern AI revolution.
The Fundamental Architecture: The Transformer
Before diving into training methods, it’s important to understand the underlying architecture. Modern language models are based on the Transformer – a neural network architecture introduced by Google in 2017. Simply put, a Transformer consists of repeated blocks of two main components:
- Attention mechanisms: Allow the model to establish relationships between words across arbitrary distances
- Feed-forward networks: Dense, fully connected neural networks that process information
This architecture has proven to be exceptionally scalable: performance improves predictably as parameters, data, and compute grow, following empirical “scaling laws”. This is one of the main reasons why models keep growing in size.
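A commonly cited parametric form of these scaling laws (the fit from the Chinchilla analysis, shown here purely as an illustration) models the expected training loss L as a function of the parameter count N and the number of training tokens D:

L(N, D) ≈ E + A / N^α + B / D^β

Here E is the irreducible loss of the data itself, while the other two terms shrink as the model grows and sees more tokens, which is why scaling either dimension keeps paying off.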
Phase 1: Pre-Training – The Foundation of Intelligence
What is Pre-Training?
At its core, pre-training is remarkably simple: the model learns to predict the next word (or “token”) in a sequence. This conceptual simplicity contrasts with the enormous resource intensity of the process.
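To make this concrete, here is a minimal sketch of the pre-training objective in PyTorch. The names (`model`, `token_ids`) are placeholders for any causal language model and a batch of tokenized text; real training loops add batching, parallelization, and optimization on top of this core loss.

```python
import torch
import torch.nn.functional as F

# Minimal sketch of the self-supervised next-token objective.
# `model` is assumed to map token IDs to next-token logits.
def next_token_loss(model, token_ids):
    inputs = token_ids[:, :-1]    # the model sees tokens 0..n-1
    targets = token_ids[:, 1:]    # and must predict tokens 1..n
    logits = model(inputs)        # shape: (batch, seq_len, vocab_size)
    # Cross-entropy between the predicted distribution and the actual next token
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
    )
```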
Pre-training can be defined as:
- Autoregressive prediction: The model predicts the next element in a sequence
- Self-supervised learning: The model generates its own training signals from the data
- Massive data processing: Training on trillions of tokens from the internet
The Dimensions of Pre-Training
Pre-training occurs along several dimensions:
- Data volume: Modern models are trained on trillions of tokens, equivalent to petabytes of text data
- Model size: The number of parameters (adjustable weights) in the model, which can range from millions to trillions
- Computational power: Measured in floating-point operations (FLOPs), often on the order of 10²⁵ for leading models (a rough back-of-the-envelope estimate follows below)
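A widely used rule of thumb puts training compute at roughly 6 FLOPs per parameter per training token. The figures below are illustrative placeholders, not numbers for any particular model:

```python
# Rule of thumb: training FLOPs ≈ 6 × parameters × training tokens
n_params = 70e9    # a hypothetical 70-billion-parameter model
n_tokens = 15e12   # a hypothetical 15 trillion training tokens

flops = 6 * n_params * n_tokens
print(f"{flops:.1e} FLOPs")  # ≈ 6.3e+24, approaching the 10²⁵ scale mentioned above
```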
The Technical Challenges
Pre-training presents enormous technical challenges:
- Parallelization: Training must be distributed across thousands of GPUs
- Stability: Training must remain stable, without “loss spikes” – sudden deteriorations in model performance
- Optimization: Hyper-parameters must be carefully tuned in advance, because the final training is typically a “YOLO run” (You Only Live Once): a single, committed large-scale run that can last months
Architectural Innovations in Pre-Training
Recent advances in pre-training focus on efficiency. Two significant innovations stand out:
- Mixture of Experts (MoE) (see the routing sketch after this list):
  - Instead of activating all parameters for each token, only certain “experts” are activated
  - Example: DeepSeek-V3 has roughly 670 billion total parameters, but only about 37 billion are active for any given token
  - The sparsity factor (how many experts are activated) is crucial; DeepSeek activates 8 out of 256 experts
  - This dramatically reduces computational requirements but requires complex routing mechanisms
- Multi-Head Latent Attention (MLA):
  - An innovation by DeepSeek that reduces memory requirements for the attention mechanism
  - Compresses the key-value (KV) cache into a low-rank latent representation, cutting its memory footprint by roughly 80-90%
  - Enables longer context windows and more efficient training
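The following is a deliberately simplified sketch of top-k expert routing, the mechanism at the heart of MoE layers. The names (`moe_layer`, `router`, `experts`) are placeholders, and production implementations add load-balancing losses and far more efficient dispatching:

```python
import torch
import torch.nn.functional as F

# Illustrative sketch of top-k expert routing (not DeepSeek's actual code).
# `experts` is a list of small feed-forward networks; only `k` of them run
# per token, which is what keeps the active parameter count low.
def moe_layer(x, router, experts, k=2):
    # x: (num_tokens, hidden_dim)
    scores = router(x)                          # (num_tokens, num_experts)
    weights, chosen = scores.topk(k, dim=-1)    # pick the k best experts per token
    weights = F.softmax(weights, dim=-1)        # normalize their mixing weights

    out = torch.zeros_like(x)
    for slot in range(k):
        for e, expert in enumerate(experts):
            mask = chosen[:, slot] == e         # tokens routed to expert e in this slot
            if mask.any():
                out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
    return out
```

With 8 active experts out of 256, as in DeepSeek, only a small fraction of the feed-forward parameters is touched per token, which is exactly where the compute savings come from.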
The Result: The Base Model
What emerges from pre-training is what’s known as a “base model” – a model that:
- Has extensive linguistic and world knowledge
- Can complete texts
- Can retrieve information
- But is not specifically aligned to be helpful, harmless, or honest
This base model can be viewed as a form of undirected intelligence – potentially powerful, but without specific purpose. Without further refinement, it’s not directly suitable for most practical applications.
Phase 2: Post-Training – The Refinement of Intelligence
The Evolution of Post-Training
Post-training has changed dramatically: it began with simple supervised fine-tuning and has grown into complex reinforcement learning methods. This area is developing faster than pre-training and is often where the most exciting innovations happen.
The Three Main Approaches in Post-Training
1. Instruction Tuning / Supervised Fine-Tuning (SFT)
The most basic post-training technique is teaching the model to follow instructions and respond in a helpful format.
- Methodology: The model is shown example queries and ideal responses (a minimal sketch of the training objective follows this list)
- Data source: Often human-created question-answer pairs
- Goal: The model learns to understand requests and generate useful answers
- Result: An “instruction-tuned” model that can perform basic assistant tasks
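Conceptually, SFT reuses the same next-token loss as pre-training but computes it only on the response tokens, so the model is graded on how it answers rather than on reproducing the prompt. A minimal sketch, with `model`, `prompt_ids`, and `response_ids` as placeholders:

```python
import torch
import torch.nn.functional as F

IGNORE = -100  # label value that cross_entropy skips

# Minimal sketch of supervised fine-tuning on one (prompt, response) pair.
def sft_loss(model, prompt_ids, response_ids):
    input_ids = torch.cat([prompt_ids, response_ids], dim=-1).unsqueeze(0)
    # Only the response tokens contribute to the loss: prompt positions are masked.
    labels = torch.cat([
        torch.full_like(prompt_ids, IGNORE),
        response_ids,
    ], dim=-1).unsqueeze(0)
    logits = model(input_ids)                 # (1, seq_len, vocab_size)
    return F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        labels[:, 1:].reshape(-1),
        ignore_index=IGNORE,
    )
```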
2. Preference Fine-Tuning and RLHF
The next step is teaching the model to understand and fulfill human preferences – a technique that enabled the breakthrough for ChatGPT.
- Methodology (a minimal reward-model sketch follows this list):
  - Collecting human preferences between alternative model answers
  - Training a “reward model” that captures these preferences
  - Training the main model with reinforcement learning to produce outputs the reward model scores highly
- Innovation: Anthropic’s “Constitutional AI” extends this approach by using AI feedback to evaluate other AI outputs
- Criticism: Excessive RLHF can lead to “overly cautious” models that withhold useful information
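The reward model itself is typically trained with a simple pairwise preference loss (a Bradley-Terry style objective). The sketch below assumes a `reward_model` that maps a response to a scalar score; all names are placeholders:

```python
import torch
import torch.nn.functional as F

# Minimal sketch of the pairwise preference loss for a reward model.
def preference_loss(reward_model, chosen, rejected):
    score_chosen = reward_model(chosen)      # scalar score for the preferred response
    score_rejected = reward_model(rejected)  # scalar score for the dispreferred response
    # The loss pushes the preferred response to score higher than the rejected one.
    return -F.logsigmoid(score_chosen - score_rejected).mean()
```

The main model is then optimized, for example with PPO, to produce responses that this reward model scores highly.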
3. Reinforcement Learning with Verifiable Rewards (RLVR)
The newest and most exciting development in post-training focuses on learning complex reasoning abilities through objectively verifiable tasks.
- Methodology (a minimal sketch of verifiable rewards follows this list):
  - The model generates multiple solution attempts for problems with verifiable answers
  - The attempts are checked for correctness (e.g., by executing code or comparing against known mathematical solutions)
  - The model is rewarded for solution strategies that lead to correct answers
- Application areas: Particularly effective for:
  - Mathematical problems
  - Programming
  - Logical reasoning
  - Multi-step problem solving
- Result: “Reasoning” models that can actively think through problems
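What makes these rewards “verifiable” is that they come from an objective check rather than from human judgment. The helpers below are hypothetical illustrations of two such checks, one for math answers and one for generated code:

```python
# Hypothetical verifiable-reward helpers (not any lab's actual pipeline).
def math_reward(model_answer: str, reference_answer: str) -> float:
    """Return 1.0 if the model's final numeric answer matches the reference, else 0.0."""
    try:
        return float(abs(float(model_answer) - float(reference_answer)) < 1e-6)
    except ValueError:
        return 0.0  # unparsable answers earn no reward

def code_reward(candidate_source: str, tests: list) -> float:
    """Reward = fraction of unit tests the generated program passes."""
    namespace = {}
    try:
        exec(candidate_source, namespace)   # run the generated code (sandbox this in practice!)
    except Exception:
        return 0.0
    passed = 0
    for test in tests:
        try:
            test(namespace)                 # each test inspects the resulting namespace
            passed += 1
        except Exception:
            pass
    return passed / len(tests)
```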
The Emergence of Thinking: Chain-of-Thought Reasoning
One of the most fascinating developments in post-training is the emergence of Chain-of-Thought Reasoning – the ability of a model to think about a problem, check assumptions, step back, and explore different approaches.
Remarkably, this ability appears to be emergent – it isn’t explicitly taught to the model. Instead, it arises as a byproduct of reinforcement learning with verifiable rewards. The model discovers on its own that:
- Breaking down problems into steps helps
- Checking intermediate steps improves accuracy
- Different solution paths should be tried
This emergent ability is comparable to AlphaGo’s famous “Move 37” – a moment when AI systems develop strategies not directly learned from humans.
Resource Requirements: Pre-Training vs. Post-Training
A critical difference between pre-training and post-training lies in their resource requirements:
Pre-Training:
- Computational effort: Extremely high, requiring thousands of GPUs over weeks or months
- Data volume: Trillions of tokens from the entire internet
- Costs: Hundreds of millions of dollars for leading models
- Frequency: Relatively rare, large models are trained only every few months
Post-Training:
- Computational effort: Significantly lower, can be performed on a few hundred GPUs
- Data volume: Millions of examples, often of higher quality
- Costs: Millions instead of hundreds of millions
- Frequency: Can be continuously refined with new data
This asymmetry has important implications for the AI research landscape. Post-training innovations are more accessible to smaller teams and organizations, enabling a broader innovation space. This partly explains why breakthrough models like DeepSeek-R1 can be developed by companies with comparatively limited resources.
The Continuous Evolution of Training Methods
The boundary between pre-training and post-training is increasingly blurring. New approaches combine elements from both:
New Paradigms in Training
- Continuous Training:
  - Models are no longer viewed as completed entities
  - Instead, they improve continuously through ongoing data collection and training
  - Example: Auto-GPT and similar self-improving systems
- Transfer of Reasoning:
  - Reasoning abilities learned in one domain transfer to others
  - A model that learns to think mathematically also improves its abilities in other areas
  - Open question: How far can this transfer go?
- Sandboxed Learning (a minimal interaction-loop sketch follows this list):
  - Models learn through interaction with simulated environments
  - Similar to how children learn through play
  - Potential for exponential growth of capabilities
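At its core, sandboxed learning follows the classic agent-environment loop from reinforcement learning. The sketch below uses hypothetical `env` and `agent` objects rather than any specific framework:

```python
# Minimal sketch of learning from a simulated environment (hypothetical API).
def collect_episode(env, agent, max_steps=100):
    observation = env.reset()
    trajectory = []
    for _ in range(max_steps):
        action = agent.act(observation)                # the agent proposes an action
        observation, reward, done = env.step(action)   # the environment responds
        trajectory.append((action, reward))
        if done:
            break
    return trajectory  # later used to update the agent toward higher reward
```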
The Future of Training
The future of AI training points to a shift in priorities:
- From Imitation to Exploration:
  - Less focus on mimicking human texts
  - More focus on independent exploration and learning
- From Human Data to Self-Play:
  - Human data becomes limiting
  - Self-play and interaction become dominant
- From Isolated Training Runs to Continuous Learning:
  - The idea of the single, definitive training run becomes obsolete
  - Continuous improvement becomes the standard
The Philosophical Implications
The dichotomy between pre-training and post-training raises profound philosophical questions:
1. The Nature of Knowledge vs. Reason
Pre-training imparts “knowledge” – facts, associations, and patterns from the world. Post-training imparts “reason” – the ability to apply this knowledge meaningfully. This distinction reflects an ancient philosophical debate: Is knowledge without the ability to apply it even valuable? Can reason exist without underlying knowledge?
2. Emergence vs. Design
The emergent reasoning abilities of modern AI systems challenge our understanding of design. We didn’t explicitly teach these systems how to think – they figured it out themselves. This points to a fundamental truth: complex abilities need not be meticulously constructed but can emerge from simpler underlying principles.
3. The Bitter Lesson
AI research has repeatedly experienced the “Bitter Lesson”: simple, scalable methods outperform clever, handcrafted approaches in the long run. This lesson applies to both pre-training and post-training – the most successful techniques are often those that contain the fewest human assumptions and scale best with more data and computational power.
4. Is This Really Thinking?
The chain-of-thought processes of modern reasoning models superficially resemble human thinking – but are they truly comparable? This question leads us to the limits of our understanding of cognition itself. If a system methodically explores problems, compares different solution paths, and corrects its own mistakes – is this fundamentally different from human thinking, or merely a different instance of the same basic process?
The Practical Implications
The distinction between pre-training and post-training has concrete practical implications:
1. Resource Allocation
Organizations must decide how to allocate their limited resources between these two phases. Pre-training offers fundamental improvements but is extremely expensive. Post-training can deliver major improvements at a fraction of the cost.
2. Specialization
Post-training enables specialization. A single base model can be refined into dozens of specialized models:
- Models for programming
- Medical assistants
- Creative writing assistants
- Reasoning-focused systems
3. Safety and Alignment
Post-training is the primary mechanism for aligning AI with human values and safety concerns. The balance between usefulness and safety is primarily achieved in post-training.
4. Democratization of AI
While pre-training remains in the hands of a few resource-rich organizations, post-training enables broader participation. This creates a more dynamic innovation space where smaller teams can also make significant contributions.
Conclusion: The Symbiosis of Pre-Training and Post-Training
Pre-training and post-training form a symbiotic relationship in AI development. Pre-training creates the raw potential – the undirected intelligence that emerges from massive amounts of data. Post-training channels and refines this potential, shaping it into useful, safe, and purposeful systems.
This dichotomy reflects, in some ways, human development itself: the early years are characterized by absorbing enormous amounts of information – language, cultural norms, basic facts about the world. Later development focuses on applying this knowledge purposefully, connecting it with values, and using it to solve meaningful problems.
As we look to the future, the boundaries between these phases will likely blur. Continuous learning, self-improvement, and emergent abilities will increasingly define what it means to be an intelligent system – whether human or artificial.
This article uses concepts and insights from current AI research, particularly in the context of recent developments such as DeepSeek-V3, DeepSeek-R1, OpenAI’s o1, and related models.