Neuroevolution + PPO Transfer
Genetic Models as Experts for On-Policy RL
We explore how a population-based genetic algorithm (GA) can serve as an effective "expert" or demonstrator for an on-policy reinforcement learning agent (PPO). In short, we use a basic GA to evolve neural controllers in a 3D game until they reach decent performance. We then take the best-performing genetic model (the champion) and transfer its weights into a policy gradient method, PPO, so that PPO starts learning from a near-expert initialization instead of from scratch.
Genetic Algorithm
We first simulate a large population of neural nets, each controlling a runner in a voxel-based parkour game. Each net takes inputs like "LIDAR" scans (to sense platforms) and produces outputs that determine jumping, left/right movement, and velocity factors. Over many generations, we breed the top performers to create children via random crossover and mutation. Eventually we get a single champion net that performs well zero-shot, purely from genetic search with no gradient updates.
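To make the breeding step concrete, here is a minimal TypeScript sketch of one generation's selection, crossover, and mutation over flattened weight vectors. The helper names, mutation rate, and elite fraction are illustrative assumptions, not the exact values used in the demo.

```typescript
// Minimal neuroevolution step on flat weight vectors.
// Fitness evaluation (running each net in the voxel world) happens elsewhere.

type Genome = Float32Array;

function crossover(a: Genome, b: Genome): Genome {
  // Uniform crossover: each weight comes from either parent at random.
  const child = new Float32Array(a.length);
  for (let i = 0; i < a.length; i++) {
    child[i] = Math.random() < 0.5 ? a[i] : b[i];
  }
  return child;
}

function mutate(g: Genome, rate = 0.05, scale = 0.1): Genome {
  // Perturb a small fraction of weights with bounded random noise.
  const out = new Float32Array(g);
  for (let i = 0; i < out.length; i++) {
    if (Math.random() < rate) out[i] += (Math.random() * 2 - 1) * scale;
  }
  return out;
}

function nextGeneration(
  population: Genome[],
  fitness: number[],        // e.g. distance survived per runner
  eliteFraction = 0.2
): Genome[] {
  // Rank by fitness and keep the top slice as parents.
  const ranked = population
    .map((g, i) => ({ g, f: fitness[i] }))
    .sort((x, y) => y.f - x.f);
  const parents = ranked.slice(0, Math.max(2, Math.floor(population.length * eliteFraction)));

  // Carry the parents over unchanged (elitism), then refill by breeding random pairs.
  const next: Genome[] = parents.map(p => p.g);
  while (next.length < population.length) {
    const a = parents[Math.floor(Math.random() * parents.length)].g;
    const b = parents[Math.floor(Math.random() * parents.length)].g;
    next.push(mutate(crossover(a, b)));
  }
  return next;
}
```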
In this example, the game is a "voxel endless runner" that spawns platforms at random positions. The GA's objective is to keep the runner alive as long as possible by jumping and shifting left or right at the right moments. The net architecture is straightforward:
- Inputs: 220 LIDAR occupancy cells + runner state
- Hidden layer: 8 neurons, each with random weights/bias
- Outputs: jumpProb, leftProb, rightProb, rawJumpSpeed, rawMoveSpeed
Above is an example that runs fully in your browser (via tensorflow.js)! It spawns a population of runners, each with random network weights. They gradually evolve to survive longer—at which point, we have a high-scoring "expert" net from genetic search. You can adjust the speed in the top right to see it train faster or slower.
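For concreteness, the 220-input, 8-hidden, 5-output controller listed above could be built in tensorflow.js roughly as follows. The activations, the handling of the extra runner-state inputs, and the use of getWeights()/setWeights() as the genome interface are assumptions for illustration, not the demo's exact code.

```typescript
import * as tf from '@tensorflow/tfjs';

// 220 LIDAR occupancy cells (plus runner state) in, 5 action signals out.
// Sigmoid outputs keep the probabilities and speed factors in [0, 1].
function buildRunnerNet(inputSize = 220): tf.Sequential {
  const model = tf.sequential();
  model.add(tf.layers.dense({ inputShape: [inputSize], units: 8, activation: 'tanh' }));
  model.add(tf.layers.dense({ units: 5, activation: 'sigmoid' })); // jumpProb, leftProb, rightProb, rawJumpSpeed, rawMoveSpeed
  return model;
}

// The GA never calls model.fit(); it only reads and writes weights,
// so getWeights()/setWeights() act as the genome interface.
const net = buildRunnerNet();
const genome = net.getWeights().map(w => w.dataSync()); // one Float32Array per layer tensor
```

In the generation loop sketched earlier, each runner's Genome would simply be these per-layer arrays concatenated into one flat vector.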
Proximal Policy Optimization (PPO)
After the GA saturates (or hits a generation limit), we take the best network's weights and feed them into a standard on-policy PPO algorithm. PPO normally starts from a random initialization, which makes it slow to learn from sparse rewards, but now it gets a huge head start. This lets the policy gradient approach quickly refine its weights toward more stable, nuanced control. Typically, this kind of "demonstration" or "expert init" drastically reduces the PPO training time needed to reach a given performance level.
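The transfer step itself is mostly a weight copy: because the PPO actor shares the champion's topology, the weight tensors line up one-to-one, while the value head starts from scratch since the GA never learned one. A rough sketch under those assumptions (the separate critic network and its layer sizes are our choice, not something PPO prescribes):

```typescript
import * as tf from '@tensorflow/tfjs';

// Same 220 -> 8 -> 5 topology as the GA controller, so weight tensors line up one-to-one.
function buildPolicyNet(): tf.Sequential {
  const m = tf.sequential();
  m.add(tf.layers.dense({ inputShape: [220], units: 8, activation: 'tanh' }));
  m.add(tf.layers.dense({ units: 5, activation: 'sigmoid' }));
  return m;
}

// championNet would be loaded with the best genome's weights from the GA run.
const championNet = buildPolicyNet();

// The PPO actor starts from the champion instead of a random initialization.
const actor = buildPolicyNet();
actor.setWeights(championNet.getWeights());

// The GA never learned a value function, so the PPO critic is initialized from scratch.
const critic = tf.sequential();
critic.add(tf.layers.dense({ inputShape: [220], units: 8, activation: 'tanh' }));
critic.add(tf.layers.dense({ units: 1 })); // state-value estimate V(s)
```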
Essentially, the genetic approach does the coarse search over a massive parameter space (in parallel), while PPO then performs fine-grained gradient updates to squeeze out further gains. The result is a working trade-off between an evolutionary, gradient-free method and a standard RL method that relies on backpropagated gradients.
Put Simply
- Genetic: Run many neural nets in parallel, pick the top performers, breed them, and mutate. Fitness is based on distance survived.
- Demonstration: We store the champion net and convert its weights into the shape PPO expects (a feedforward policy network).
- PPO: The PPO policy's initial weights are the champion's weights. PPO then runs standard environment interaction and gradient updates (advantage estimation, the clipped objective sketched below, etc.). Thanks to the expert init, PPO quickly matches or surpasses the champion's performance!
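For reference, the clipped objective mentioned in the PPO bullet has the standard form below. This is the textbook surrogate loss written with tensorflow.js ops, not code lifted from our training loop, and the log-probability and advantage tensors are assumed to come from an ordinary rollout buffer.

```typescript
import * as tf from '@tensorflow/tfjs';

// PPO clipped surrogate loss for one minibatch.
// logProbNew / logProbOld: log pi(a|s) under the current and rollout-time policies.
// advantages: advantage estimates (e.g. GAE) from the rollout buffer.
function ppoClipLoss(
  logProbNew: tf.Tensor1D,
  logProbOld: tf.Tensor1D,
  advantages: tf.Tensor1D,
  clipEps = 0.2
): tf.Scalar {
  return tf.tidy(() => {
    const ratio = tf.exp(tf.sub(logProbNew, logProbOld));           // r_t = pi_new / pi_old
    const unclipped = tf.mul(ratio, advantages);
    const clipped = tf.mul(tf.clipByValue(ratio, 1 - clipEps, 1 + clipEps), advantages);
    // Maximize the surrogate objective, i.e. minimize its negation.
    return tf.neg(tf.mean(tf.minimum(unclipped, clipped))) as tf.Scalar;
  });
}
```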
Conclusion
Zero-shot or minimal-step transfer is a compelling technique in environments where gradient-based RL might struggle in the early phase (e.g. large state/action spaces, sparse rewards; think open-ended games like Minecraft). Evolutionary methods can find partial solutions through random exploration alone, and by marrying those solutions with PPO's local gradient optimization, we can drastically reduce training time and converge faster than either approach on its own.
We effectively show that a "weak but broad" solver (genetic search) plus a "focused gradient" solver (PPO) can be the best of both worlds. We hope to extend this approach to more complex environments and tasks beyond Minecraft parkour!