The Self-Assembling Brain: How Neural Networks Grow Smarter
Neural networks can be designed to grow their own structure during training, adding neurons, layers, or connections as they learn, rather than running a fixed architecture from start to finish. When this works well, the network ends up with more capability than it started with and, ideally, genuinely better generalization. But 'self-assembling' here is a technical concept, not a biological one. It means the architecture or connectivity is dynamic and shaped by the training process itself, not that silicon is somehow mimicking brain tissue. This guide breaks down exactly how that growth happens, what limits it, and how you can build and test it yourself.
What 'self-assembling brain' actually means for neural networks
The phrase borrows from biology, so it is worth being upfront about where the analogy holds and where it breaks down. Real brains self-assemble through guided cell division, axon pathfinding, synaptic pruning, and activity-dependent plasticity, a process that operates under very different constraints than anything described here. Neural networks borrow the spirit of that idea: instead of fixing every architectural decision before training starts, you let the training process discover or construct the right structure. Think of it less like embryonic development and more like a garden that trims its own dead branches while adding new shoots where sunlight hits.
In practice, 'self-assembly' covers several distinct things: growing the number of neurons or layers over time, discovering which connections should exist via pruning and regrowth cycles, searching over possible architectures automatically, and reallocating capacity toward the parts of the problem that need it most. None of these are literally biological, but they do share something important with how neurons grow and how nervous systems reorganize, which is that structure and function co-evolve rather than being designed separately in advance.
The metaphor earns its keep because it points at something real: static architectures are a design choice, not a law of nature, and relaxing that constraint opens up a genuinely interesting class of methods. Just do not expect a one-to-one mapping to neuroscience. Where the biology and the math overlap, this guide will say so. Where they diverge, it will say that too.
How networks actually grow: the core mechanisms

There are four main families of methods that let a neural network change its own structure during or between training runs. They operate at different levels of the architecture and have different computational costs and stability profiles.
Function-preserving expansion
Net2Net and the broader Network Morphism family let you take a trained network and morph it into a wider or deeper version while exactly preserving the function it computes at the moment of expansion. The idea is that you start small, train to convergence, then expand without throwing away what you learned. Net2WiderNet duplicates neurons with appropriate weight rescaling so the output does not change. Net2DeeperNet inserts identity-initialized layers in the same spirit. After expansion, continued training can then push beyond what the smaller network could represent. This is one of the cleanest practical approaches to incremental growth because the transition point is mathematically stable.
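To make the widening operator concrete, here is a minimal Net2WiderNet-style sketch for two stacked fully connected layers in PyTorch. The random unit mapping and the `widen_linear` helper are illustrative assumptions, not the original Net2Net code; convolutional layers and batch norm need additional care.

```python
# Minimal sketch of Net2WiderNet-style widening for two stacked nn.Linear layers.
# Illustrative only; layer names and the random replication mapping are assumptions.
import torch
import torch.nn as nn

def widen_linear(layer1: nn.Linear, layer2: nn.Linear, new_width: int):
    """Return widened copies of (layer1, layer2) that compute the same function."""
    old_width = layer1.out_features
    assert new_width > old_width
    # Choose which existing units to replicate (random mapping, as in Net2Net).
    mapping = torch.cat([torch.arange(old_width),
                         torch.randint(0, old_width, (new_width - old_width,))])
    # How many times each original unit now appears.
    counts = torch.bincount(mapping, minlength=old_width).float()

    wider1 = nn.Linear(layer1.in_features, new_width)
    wider2 = nn.Linear(new_width, layer2.out_features)
    with torch.no_grad():
        wider1.weight.copy_(layer1.weight[mapping])        # duplicate incoming rows
        wider1.bias.copy_(layer1.bias[mapping])
        # Divide outgoing columns by replication count so the summed output is unchanged.
        wider2.weight.copy_(layer2.weight[:, mapping] / counts[mapping])
        wider2.bias.copy_(layer2.bias)
    return wider1, wider2
```

A quick way to confirm the operator is correct is to pass the same batch through the original and widened pair and check that the outputs match to within floating-point tolerance before resuming training.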
Dynamic sparse connectivity

Dynamic sparse training (DST) methods keep a sparse connectivity mask throughout training but update which connections are active at regular intervals. Sparse Evolutionary Training (SET) removes the weakest connections by magnitude and regrows new ones at random, iterating this grow-and-prune cycle during training. RigL goes further: it uses gradient information to decide where to regrow connections, making smarter choices about which new edges are worth adding. Critically, RigL enforces a fixed parameter count and fixed computational cost throughout training. The topology changes, but the budget does not grow. This is a key design principle: growth in structure does not have to mean growth in cost.
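As a rough illustration of one grow-and-prune cycle, the sketch below updates the mask for a single weight tensor: prune the weakest active connections by magnitude, then regrow the same number either at random (SET-style) or where the dense gradient magnitude is largest (RigL-style). The function name, drop fraction, and boolean-mask convention are assumptions for illustration, not code from either paper's repository.

```python
# One illustrative prune-and-regrow step for a single weight tensor and boolean mask.
import torch

def update_mask(weight, mask, drop_fraction=0.3, grad=None):
    """Prune the weakest active weights, then regrow the same number elsewhere."""
    n_active = int(mask.sum())
    k = int(drop_fraction * n_active)
    if k == 0:
        return mask

    # 1) Prune: drop the k active connections with the smallest magnitude.
    magnitude = weight.abs().masked_fill(~mask, float('inf'))
    drop_idx = torch.topk(magnitude.flatten(), k, largest=False).indices
    new_mask = mask.clone().flatten()
    new_mask[drop_idx] = False

    # 2) Regrow: activate k currently inactive connections.
    scores = torch.rand_like(weight) if grad is None else grad.abs()  # SET vs RigL-style
    scores = scores.flatten()
    scores[new_mask] = -float('inf')          # never regrow an already-active edge
    grow_idx = torch.topk(scores, k, largest=True).indices
    new_mask[grow_idx] = True
    new_mask = new_mask.view_as(mask)

    # Zero out dropped weights; regrown entries are already zero under the usual
    # DST convention of keeping inactive weights at zero.
    with torch.no_grad():
        weight *= new_mask.to(weight.dtype)
    return new_mask
```

Swapping the `grad` argument in and out is also a convenient way to run the random-versus-gradient regrowth ablation discussed later in this guide.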
Architecture search and differentiable growth
Neural Architecture Search (NAS) automates the discovery of architecture itself. DARTS (Differentiable Architecture Search) makes this tractable by turning the discrete choice of 'which operation goes here' into a continuous relaxation: candidate operations are mixed through a softmax over learned architecture parameters, so gradient descent can optimize the architecture and the weights simultaneously. The result is that the network selects its own structure during training. This is arguably the closest thing to genuine self-assembly in the engineering sense. The catch is that DARTS is known to suffer from performance collapse when the continuous relaxation is discretized, and several follow-up papers exist specifically to stabilize it.
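A minimal sketch of the continuous relaxation, assuming a toy three-operation candidate set rather than the full DARTS search space, looks like this in PyTorch:

```python
# Minimal sketch of a DARTS-style mixed operation for a single edge.
# The candidate set is a toy assumption; real DARTS cells use a larger one.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.ops = nn.ModuleList([
            nn.Identity(),                                     # skip connection
            nn.Conv2d(channels, channels, 3, padding=1),       # 3x3 convolution
            nn.AvgPool2d(3, stride=1, padding=1),              # average pooling
        ])
        # Architecture parameters (alphas), trained by gradient descent.
        self.alpha = nn.Parameter(torch.zeros(len(self.ops)))

    def forward(self, x):
        weights = F.softmax(self.alpha, dim=0)
        return sum(w * op(x) for w, op in zip(weights, self.ops))

    def discretize(self):
        """Keep only the strongest operation, as done after the search ends."""
        return self.ops[int(self.alpha.argmax())]
```

In full DARTS the architecture parameters are updated on a separate validation split in a bilevel scheme, and the `discretize` step at the end is exactly where the known collapse problem shows up.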
Progressive and modular expansion for continual learning
Progressive Neural Networks add an entirely new 'column' network for each new task and connect it to previous columns via lateral connections. Old columns are frozen, so previous knowledge cannot be overwritten. Dynamically Expandable Networks (DEN) take a softer approach: the network selectively grows neurons in response to new tasks during lifelong learning, deciding dynamically how much capacity each new task needs. Both methods frame structural growth as a solution to catastrophic forgetting, which is the tendency for a model to lose old knowledge when learning something new. The architecture change is the memory mechanism, not an add-on to it.
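A minimal two-column sketch, assuming single-hidden-layer columns and one linear lateral adapter (the paper uses deeper columns and per-layer adapters), shows the core mechanics:

```python
# Illustrative two-column Progressive Network; sizes and the single lateral
# adapter are assumptions, not the exact architecture from the paper.
import torch
import torch.nn as nn

class TwoColumnProgressive(nn.Module):
    def __init__(self, in_dim, hidden, out_dim):
        super().__init__()
        # Column 1: trained on task 1, then frozen.
        self.col1_hidden = nn.Linear(in_dim, hidden)
        self.col1_out = nn.Linear(hidden, out_dim)
        # Column 2: trained on task 2, fed laterally by column 1's features.
        self.col2_hidden = nn.Linear(in_dim, hidden)
        self.lateral = nn.Linear(hidden, hidden)
        self.col2_out = nn.Linear(hidden, out_dim)

    def freeze_column1(self):
        for p in [*self.col1_hidden.parameters(), *self.col1_out.parameters()]:
            p.requires_grad = False

    def forward_task1(self, x):
        return self.col1_out(torch.relu(self.col1_hidden(x)))

    def forward_task2(self, x):
        h1 = torch.relu(self.col1_hidden(x)).detach()   # frozen column-1 features
        h2 = torch.relu(self.col2_hidden(x) + self.lateral(h1))
        return self.col2_out(h2)
```

After training on task 1, call `freeze_column1()` and train only the second column's parameters on task 2; the frozen column guarantees task 1 performance cannot degrade.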
Gradual depth introduction

Gradual DropIn of Layers starts training a deep network as if it were shallow. Newly added layers are initially bypassed entirely, skipped in both the forward pass and backpropagation, and are then progressively incorporated as training proceeds, so the optimization never faces the instability of suddenly training a much deeper network from a bad initialization. As the authors put it, an 'untrainable deep network starts as a trainable shallow network.' This is a loose parallel to how development in living organisms staggers the introduction of new structures rather than building everything at once, though biological development relies on many interacting processes (cell division, differentiation, signaling, wiring) that have no counterpart in training a network.
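A minimal sketch of the idea, assuming a shape-preserving layer and a simple linear gate schedule (the original work schedules layer inclusion differently), might look like:

```python
# Illustrative drop-in wrapper: a scalar gate blends between a pure bypass and
# the new layer's output, opening over training. The linear schedule is an
# assumption for illustration.
import torch.nn as nn

class DropInLayer(nn.Module):
    def __init__(self, layer: nn.Module):
        super().__init__()
        self.layer = layer          # must preserve the input shape (residual-style)
        self.gate = 0.0             # 0.0 = fully bypassed, 1.0 = fully active

    def set_gate(self, epoch: int, start: int, ramp: int):
        """Open the gate linearly between `start` and `start + ramp` epochs."""
        self.gate = min(max((epoch - start) / ramp, 0.0), 1.0)

    def forward(self, x):
        if self.gate == 0.0:
            return x                # skipped in both forward pass and backprop
        return (1.0 - self.gate) * x + self.gate * self.layer(x)
```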
How growth translates to 'smarter': from early formation to generalization
Adding capacity is not the same as getting smarter. A network can grow larger and overfit harder, which is the opposite of what you want. What actually produces better generalization is a combination of factors that growth can enable but does not automatically deliver.
Early in training, a small network is forced to learn the most compressible, generalizable features because it does not have room for anything else. When you expand the network at this point using function-preserving morphism or gradual layer introduction, you give the optimizer a head start: the new capacity is initialized in a state that already solves the problem approximately, so continued training refines rather than restarts. This is fundamentally different from training a large network from scratch, where early optimization can get stuck in regions that overfit.
The Lottery Ticket Hypothesis adds another angle. It suggests that inside any large network, there exist sparse subnetworks ('winning tickets') that, when reset to their initial values and retrained in isolation, match the performance of the full network. Iterative magnitude pruning (IMP) finds these by cycling through train, prune, and rewind steps. The insight for growth is that the right structure matters enormously, and discovering that structure, whether by searching, pruning, or dynamic regrowth, is where capability actually comes from.
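A rough sketch of the iterative magnitude pruning loop, assuming a user-supplied `train` function that applies the masks during optimization, looks like this:

```python
# Illustrative lottery-ticket-style IMP loop: train, prune the smallest-magnitude
# weights, rewind survivors to their initial values, repeat. `train` is a
# placeholder for your own masked training loop.
import copy
import torch

def find_winning_ticket(model, train, rounds=5, prune_frac=0.2):
    init_state = copy.deepcopy(model.state_dict())          # values to rewind to
    masks = {n: torch.ones_like(p) for n, p in model.named_parameters()}

    for _ in range(rounds):
        train(model, masks)                                  # train with masks applied
        for name, p in model.named_parameters():
            alive = p.detach().abs() * masks[name]
            k = int(prune_frac * int(masks[name].sum()))
            if k == 0:
                continue
            threshold = torch.kthvalue(alive[masks[name].bool()], k).values
            masks[name] = (alive > threshold).float() * masks[name]
        # Rewind surviving weights to their original initialization.
        with torch.no_grad():
            for name, p in model.named_parameters():
                p.copy_(init_state[name] * masks[name])
    return masks
```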
Dynamic Capacity Networks add yet another mechanism: they adaptively assign the network's capacity across different parts of the input, so high-complexity regions get more processing and simple regions get less. This is capacity allocation, not just capacity addition, and it can improve generalization on tasks where difficulty varies across inputs.
The honest summary is that growth helps when it enables the optimizer to find better-structured solutions than it would find in a fixed architecture. It hurts when it just adds parameters that soak up noise in the training data.
What stops networks from growing indefinitely
This is where the parallel to biological growth science is most direct. Just as cells cannot divide forever without running into physical limits, and organisms stop growing when resource and structural constraints kick in, neural networks hit hard walls that make unlimited expansion counterproductive or impossible. Understanding these limits is as important as understanding the growth mechanisms themselves.
- Compute and memory budgets: Every added parameter costs FLOPs at training and inference time. Hardware memory caps mean you literally cannot fit an arbitrarily large model on a GPU or TPU. This is why methods like RigL enforce a fixed parameter count throughout training rather than letting the network grow without bound.
- Data scarcity: A larger network can represent more functions, including functions that fit the noise in a small dataset perfectly. Without enough data, growth leads to overfitting, not generalization. The network learns the training set rather than the underlying pattern.
- Double descent: Generalization error does not decrease monotonically as you grow a model. There is an 'interpolation threshold' region where test error peaks before falling again in the overparameterized regime. Growing a network through this zone naively can temporarily make things worse before they get better.
- Optimization instability: Adding neurons or layers mid-training can destabilize the loss landscape. Gradients can vanish in new deep layers or explode at transition points. Gradual DropIn and function-preserving morphism exist specifically to manage this.
- Catastrophic forgetting: When a growing network learns new tasks, it can overwrite the weights that encoded old tasks, destroying previous capability. Progressive Networks and DEN address this architecturally, but the problem does not go away by itself.
- DARTS performance collapse: Differentiable architecture search can select architectures that look good under the continuous relaxation but collapse in performance when discretized. This is a structural failure mode of the growth-via-search paradigm.
- Regularization constraints: Without explicit regularization (dropout, weight decay, early stopping, sparsity penalties), a growing network has every incentive to use its added capacity in ways that reduce training loss but harm generalization.
Building self-assembling models today: tools and workflow
Here is a practical workflow you can follow right now, organized from simplest to most involved. You do not need to start with the most complex approach.
Start with progressive widening or deepening
If you are new to dynamic architectures, the most beginner-friendly entry point is function-preserving expansion. Train a small baseline model, then use Net2Net-style operators to widen or deepen it before continuing training. You can implement this in PyTorch manually by duplicating weight rows or inserting identity layers, or use existing libraries that implement Network Morphism operators. The key discipline is to checkpoint before expansion, verify that pre- and post-expansion predictions match on a held-out batch, and then continue training while monitoring validation metrics.
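A minimal version of that verification step, assuming `small_model` and `grown_model` are the pre- and post-expansion networks and `batch` is a held-out input batch, is just a tolerance check:

```python
# Illustrative function-preservation check after a Net2Net-style expansion.
import torch

def check_function_preserving(small_model, grown_model, batch, atol=1e-5):
    """Verify the grown model still computes the same function on a held-out batch."""
    small_model.eval()
    grown_model.eval()
    with torch.no_grad():
        before = small_model(batch)
        after = grown_model(batch)
    # A tight tolerance catches weight-rescaling bugs; batch norm left in train
    # mode is a common reason this check fails.
    assert torch.allclose(before, after, atol=atol), "expansion is not function-preserving"
```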
Try dynamic sparse training with RigL
Google Research's open-source RigL repository provides end-to-end training code for dynamic sparse networks. You specify a target sparsity level, and RigL handles the grow-and-prune scheduling automatically. Because the parameter count stays fixed, your compute budget is predictable. A good first experiment is to train a dense baseline, train a static sparse baseline at the same sparsity (random mask set at initialization), and then train with RigL at the same sparsity. The static sparse baseline usually underperforms significantly, because fixing a random mask over random weights at initialization constrains optimization from the start; RigL typically closes or eliminates that gap. This trio of runs is an excellent controlled experiment.
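If you want to run the static sparse baseline without extra tooling, a minimal sketch (illustrative, not the rigl repo's code) is to fix random masks at initialization and re-apply them after every optimizer step:

```python
# Illustrative static-sparse baseline: a random mask fixed at initialization.
import torch

def apply_random_static_masks(model, sparsity=0.9):
    """Fix a random mask at initialization; the topology never changes."""
    masks = {}
    for name, p in model.named_parameters():
        if p.dim() > 1:                               # mask weight matrices, not biases
            masks[name] = (torch.rand_like(p) > sparsity).float()
            p.data *= masks[name]
    return masks

def reapply_masks(model, masks):
    """Call after every optimizer step so pruned connections stay at zero."""
    with torch.no_grad():
        for name, p in model.named_parameters():
            if name in masks:
                p.data *= masks[name]
```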
Use NAS toolkits for architecture search
For differentiable architecture search, the ENAS-pytorch repository implements Efficient NAS with parameter sharing over a super-network, and DyNAS-T from Intel Labs provides a multi-objective NAS toolkit supporting accuracy, latency, and model size objectives with distributed search. Microsoft's NNI (Neural Network Intelligence) is a broader AutoML toolkit that wraps several NAS algorithms and handles the experiment management, which is useful if you want to try multiple search strategies without reimplementing each one. Start with a small search space and a toy dataset before scaling up, because NAS experiments are expensive to debug at full scale.
Continual learning with progressive or expandable networks
If your use case involves a sequence of tasks rather than a single dataset, Progressive Networks or DEN are the right starting point. Implement a simple two-task version first: freeze the first column after training on task 1, add a second column with lateral connections from the first, and train on task 2. Then evaluate on both tasks. This gives you a direct demonstration of the 'immune to forgetting' property that Progressive Networks claim, and you can measure it concretely with per-task accuracy before and after each expansion.
Recommended tools at a glance
| Method | Use case | Where to start |
|---|---|---|
| Net2Net / Network Morphism | Growing a trained model wider or deeper | Implement manually in PyTorch; verify function preservation on a held-out batch |
| RigL (google-research/rigl) | Dynamic sparse training with fixed compute budget | Clone the google-research/rigl repo; run the CIFAR-10 example first |
| Gradual DropIn | Adding depth stably mid-training | Implement bypass/skip logic with a sigmoid gate that opens over epochs |
| DARTS / ENAS-pytorch | Searching for architecture during training | Use carpedm20/ENAS-pytorch on CIFAR-10 with a small search space |
| DyNAS-T | Multi-objective NAS with latency/accuracy tradeoffs | Use IntelLabs/DyNAS-T with a MobileNet-style super-network |
| NNI | Managing NAS experiments across methods | Follow the NNI quickstart docs; use it as an experiment wrapper |
| Progressive Networks | Continual learning across task sequences | Implement two-column version manually; benchmark per-task accuracy |
| SRigL (condensed-sparsity) | Structured dynamic sparse training | Use calgaryml/condensed-sparsity for structured sparsity variants of RigL |
Testing whether growth is actually working

This is the part most people skip, and it is where a lot of 'self-assembling' experiments quietly fail. You need to distinguish between 'the network got bigger and validation accuracy went up' and 'the growth mechanism genuinely contributed something beyond just adding parameters.'
The baselines you must run
- Dense baseline at target size: Train the final (grown) architecture from scratch with random initialization. If the growth approach does not beat this, the whole pipeline is adding complexity for nothing.
- Static sparse baseline (for DST experiments): Set a random sparsity mask at initialization and train without updating it. This isolates the contribution of dynamic regrowth from the contribution of sparsity alone.
- Small-to-large baseline without function preservation: Expand the network mid-training but with random initialization for the new parameters rather than using Net2Net-style transfer. This tests whether the initialization transfer matters or if any expansion schedule would work.
- Compute-matched baseline: Make sure you are comparing at the same FLOP budget, not just the same final accuracy. A dense model trained for half as many steps at the grown model's cost is a fairer comparison than a dense model trained to convergence at the small model's cost.
Metrics and what they tell you
Track validation accuracy or loss throughout training, not just at the end, to see whether growth events cause instability spikes. For continual learning experiments, track per-task accuracy after every new task, not just average accuracy at the end. Systematic reviews of continual learning literature have noted that average accuracy can mask forgetting dynamics entirely: a network can maintain high average accuracy while completely forgetting its earliest tasks. For sparse training, report sparsity level, FLOP count, and parameter count explicitly alongside accuracy. For NAS experiments, report the final discretized architecture's performance, not just the performance under the continuous relaxation.
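A minimal sketch of per-task tracking, assuming a user-supplied `evaluate(model, loader)` function, records a row of accuracies after each task and reports forgetting as the drop from each task's best accuracy so far:

```python
# Illustrative per-task tracking for continual learning experiments.
def continual_report(model, task_loaders, evaluate, history):
    """Append a row of per-task accuracies and return per-task forgetting."""
    row = [evaluate(model, loader) for loader in task_loaders]
    history.append(row)
    best_so_far = [max(col) for col in zip(*history)]
    forgetting = [best - current for best, current in zip(best_so_far, row)]
    return forgetting

# Usage after finishing training on each task:
# history = []
# ... train on task t ...
# forgetting = continual_report(model, seen_task_loaders, evaluate, history)
```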
Ablations to run
- Remove the growth schedule entirely but keep the final architecture: does the result change?
- Fix the growth timing to different points in training: does early vs. late expansion matter?
- Disable lateral connections in a Progressive Network: how much transfer actually occurs?
- Replace RigL's gradient-based regrowth with random regrowth: is the gradient signal for topology selection doing real work?
- Run DARTS and report both the relaxed (continuous) performance and the discretized performance: the gap between these is a measure of how much DARTS's known collapse problem affects your specific experiment.
Common failure modes to watch for

- The grown model beats the small baseline but not the large baseline trained from scratch: growth helped warm-start but did not find a better structure.
- Validation accuracy improves but training accuracy improves much more: the added capacity is overfitting, not generalizing.
- DARTS selects skip-connections or parameter-free operations heavily: a known sign of performance collapse in the continuous relaxation.
- Per-task accuracy in continual learning looks fine on average but early tasks have degraded: catastrophic forgetting is happening and average accuracy is hiding it.
- DST training is unstable near the grow-and-prune schedule update steps: the regrowth interval or magnitude threshold needs tuning.
Where the 'self-assembling brain' framing breaks down
It is worth being direct about the limits of this framing before you invest deeply in it. Real biological self-assembly, including how neurons grow, how nervous systems develop, and how synaptic connections are refined by activity, involves mechanisms that have no current equivalent in machine learning. Axon pathfinding uses chemical gradients, not gradient descent. Synaptic pruning is driven by activity patterns and metabolic costs that are completely different from weight magnitude or gradient magnitude. Biological neural development is far more intricate and locally organized than any current ML training procedure.
What machine learning borrows from biology is an inspiration, not a mechanism. Dynamic sparse training is inspired by the sparsity of biological neural networks and the observation from network science about sparse connectivity, but the actual algorithms are engineering solutions, not simulations of biology. DARTS does not model axon guidance. Progressive Networks do not model synaptic consolidation. When the marketing around these methods leans heavily on the brain analogy, treat it as a useful metaphor for intuition, not a mechanistic claim.
The analogy also breaks down in terms of scale and locality. Biological brains grow with local rules: each neuron responds to its immediate environment. Current neural network training is inherently global: backpropagation requires computing gradients through the entire network simultaneously. Truly local, biologically plausible learning rules are an active research area but are not yet competitive with backpropagation on practical tasks. So when you design a 'self-assembling' ML system, you are designing a system where growth decisions are made using global information, which is both a strength (you can optimize globally) and a fundamental departure from the biological analogy.
One more honest note: the field of continual or lifelong learning, which is where the biological analogy to brain plasticity is most directly invoked, still struggles with catastrophic forgetting in ways that real brains generally do not. Progressive Networks sidestep forgetting by refusing to let old weights change, which is more like fossilization than plasticity. DEN grows its way around the problem. Neither solution resembles how a brain actually consolidates and generalizes across a lifetime of experience. That gap is real, it is widely acknowledged, and it is an open research problem.
None of this means the methods are not useful. They are. Dynamic sparse training, function-preserving morphism, and progressive expansion are all practical tools with real engineering value. Just build with them as the engineering approaches they are, and save the biological framing for intuition and analogy rather than mechanistic justification.
FAQ
How do I tell whether the model improved because of “self-assembly” and not just because it became larger?
Use a matched-capacity baseline. Compare against a model trained with the same final parameter count and similar compute, but with a fixed architecture (no growth events). Then check whether validation improves around the same training steps where growth occurs, not only at the end.
What can cause function-preserving expansion (Net2WiderNet/Net2DeeperNet) to fail in practice?
The main failure modes are incorrect weight rescaling (off by normalization factors) and mismatched layer semantics (for example, inserting layers with incompatible activations, batch norm behavior, or dimensionality assumptions). Always verify on a held-out batch that pre-expansion and post-expansion predictions match within a tight tolerance before resuming training.
For dynamic sparse training, how do I choose the target sparsity level without breaking learning?
Start with a small sweep of sparsity values (for example, from modest sparsity up to your desired target), and keep the total parameter budget constant across runs. If you see training loss spike right after regrow/prune updates, reduce the update frequency or move to lower sparsity until training stabilizes, then tighten toward your target.
Why do some dynamic sparse methods look worse than dense baselines even when they match parameter count?
Sparse masks can harm optimization even at the same parameter budget, because gradient flow depends on which connections exist at which time. Fixes include warming up with a better initialization strategy (or matching the dense training schedule more closely), tuning regrow strategy (random regrowth versus gradient-guided regrowth), and ensuring you use the same optimizer and learning-rate schedule across dense and sparse.
What’s the “performance collapse” issue in DARTS, and how can I detect it early?
Collapse often appears when the continuous relaxation is discretized, causing the chosen discrete architecture to perform much worse than the relaxed model. Detect it by periodically discretizing to the current best architecture during the search and evaluating that discrete candidate on a validation set, not just on the relaxed weights.
Can NAS methods be used with strict compute or latency constraints from the beginning?
Yes, but you need to include those constraints in the search objective (or as hard constraints in a multi-objective setup). If you only optimize accuracy during search and measure latency afterward, you frequently end up with architectures that do not meet deployment constraints. Plan the search space so candidate operations map cleanly to your target hardware constraints.
How do progressive networks differ from regular fine-tuning for continual learning?
Progressive networks freeze earlier columns to prevent overwriting, and they add new trainable columns plus lateral connections. Standard fine-tuning updates all weights, so it directly trades off old performance for new task performance. A practical test is to measure per-task accuracy after each new task, not just average accuracy.
What is a safe way to implement progressive expansion for more than two tasks?
Use a clear policy: freeze all previous columns after each task, add one new column per task, and define lateral connections from all earlier columns (or a capped subset) into the new one. Evaluate compute growth and memory, because lateral connectivity can make later tasks progressively more expensive.
In continual learning, why can average accuracy hide forgetting?
A model can maintain a high average by doing well on recent tasks while catastrophically failing on early ones. Track per-task metrics after every task and report forgetting explicitly, for example, as the drop in each task’s accuracy relative to its best value.
What metrics should I report for sparse growth experiments beyond accuracy?
Report sparsity, parameter count, and an appropriate compute metric such as FLOPs or effective FLOPs for your mask, plus training stability indicators like gradient norms or loss spikes around regrow/prune steps. Also state whether the sparsity pattern is fixed at inference or changes through training only.
What common mistake leads to “self-assembling” experiments that don’t replicate?
Hidden randomness and evaluation timing. Growth methods often depend on initialization, pruning/regrowth schedules, and discretization timing (for NAS). Use controlled seeds where possible, rerun multiple times, and log the exact step when the architecture changes so you can compare growth-related instabilities.
Can you combine these approaches, like NAS plus dynamic sparse training?
In principle, yes, but it multiplies the number of moving parts. A practical approach is to stage it: first validate the NAS pipeline with dense training, then add sparsity with a simple baseline (static sparse) before introducing dynamic regrowth. Otherwise, it’s hard to tell which component caused instability or performance gains.
Does “self-assembling” imply local or biologically plausible learning rules?
No. Many growth decisions use global signals like backpropagated gradients or validation loss. If you want biologically plausible locality, you need to adopt or research local learning rules (which are often not as effective as backprop on standard benchmarks); most self-assembly engineering today is still gradient-informed and global.