An educational drone reinforcement learning platform that teaches students RL through reward-function editing, PPO agent training, and autonomous drone simulation — no deep AI background required.
DeepFlyer reimagines AWS DeepRacer for drone autonomy. Instead of racing an RC car around a flat 2D track, students train a drone in simulation to fly through hoops, avoid obstacles, and navigate a 3D course — all by modifying reward functions in a browser-based editor.
The platform reframes reinforcement learning as something physical and visual. If the reward function is poorly shaped, the drone behaves poorly. If it's better designed, the drone becomes safer, smoother, and more efficient. Students can see that relationship directly — no lectures about abstract RL theory required.
The early prototype established the simulation stack, drone model, React reward editor, backend API, dynamic reward switching, and initial PPO training results. Future milestones target SLAM, event cameras, XAI overlays, and sim-to-real deployment.
“Reward functions are not just math. They encode behavior. The best way to teach that is to let students see a drone learn — and fail — in real time.”
The core concepts of RL are abstract. Students tune parameters, watch plots, and read reward curves, but rarely develop the physical intuition for why reward design matters.
Students are taught that the reward function drives behavior, but never experience the consequences of a bad reward directly.
RL assignments produce plots and numbers. They don't produce a drone that crashes because the penalty for going off-course was too low.
Without a physical system to anchor understanding, reward hacking is a concept. With a drone, it's immediately obvious.
Why does a policy that scores 99% in simulation fail on hardware? The abstraction of code-only experiments never makes this tangible.
Safety boundaries and emergency stops are afterthoughts in most RL coursework. In real robotics, they are foundational design decisions.
DeepRacer is excellent for 2D car navigation but limited in scope. Drone autonomy requires 3D reasoning, altitude control, and richer state representations.
A fixed 4–5 hoop circuit with multiple laps at ~0.8 m altitude. Visual targets provide immediate student feedback. Best for public demos and intuitive reward-behavior observation.
A straight path from start to finish with static obstacles requiring lateral maneuvering. Fixed altitude. Easier to compare across reward functions. Best for controlled benchmarking.
Students don't write RL training code. They modify reward weights through the editor and observe how incentive changes produce different drone behaviors.
Higher values make the drone prioritize passing through hoops aggressively, potentially sacrificing smoothness.
Rewards cleaner hoop passage. Students observe the drone gradually centering its approach path over training.
Can produce faster flight but less precise hoop targeting. A classic speed vs. accuracy tradeoff.
Increasing this makes the policy more conservative — students directly observe the agent "choosing" to avoid obstacles.
Reduces jerk in motor commands. Students notice the drone trajectory becoming more continuous over episodes.
Hard constraint on staying within the safety zone. Essential for real-hardware deployment with safety netting.
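As a rough illustration of how the reward components above might combine, here is a minimal sketch of a weighted per-step reward. Every name, field, and default value in it (`RewardWeights`, `StepInfo`, `compute_reward`, the weights themselves) is a hypothetical stand-in, not the platform's actual editor API. The safety-zone term is modeled as a large penalty for simplicity; a true hard constraint would also end the episode.

```python
from dataclasses import dataclass


@dataclass
class RewardWeights:
    # Hypothetical weight names and defaults, for illustration only.
    hoop_passage: float = 10.0      # bonus for flying through a hoop
    centering: float = 2.0          # extra bonus for passing near the hoop centre
    speed: float = 0.5              # reward per m/s of forward speed
    collision_penalty: float = 5.0  # subtracted on any collision
    smoothness: float = 0.1         # penalty per unit of command change
    boundary_penalty: float = 20.0  # subtracted when leaving the safety zone


@dataclass
class StepInfo:
    # Hypothetical per-step signals the simulator would report.
    passed_hoop: bool
    center_offset_m: float    # distance from hoop centre at passage, in metres
    forward_speed_mps: float
    collided: bool
    action_delta: float       # magnitude of change in command since last step
    out_of_bounds: bool


def compute_reward(info: StepInfo, w: RewardWeights) -> float:
    """Combine the per-step signals into one scalar reward."""
    reward = 0.0
    if info.passed_hoop:
        reward += w.hoop_passage
        # Centering bonus shrinks linearly and vanishes at 1 m offset.
        reward += w.centering * max(0.0, 1.0 - info.center_offset_m)
    reward += w.speed * info.forward_speed_mps
    if info.collided:
        reward -= w.collision_penalty
    reward -= w.smoothness * info.action_delta
    if info.out_of_bounds:
        reward -= w.boundary_penalty
    return reward
```

Under these assumptions, raising `collision_penalty` lowers the return of any trajectory that touches an obstacle, which is the kind of change that pushes a trained policy toward more conservative flight.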
The observation space teaches students that RL agents don't learn from magic; they learn from state representations. Better observations tend to produce better policies.
Progress spanned simulation infrastructure, drone model, backend, frontend, RL training, and experiment tracking — all in three weeks.
PPO success within 100 steps (Distance-to-Goal preset)
Gazebo cold start / ~0.3s warm start
Reward API median latency (20ms P95)
Collision-detection false-positive rate
X500 URDF mesh optimization
Mass/inertia validation error
Students learn reward design most effectively when the platform hides unnecessary complexity. Every API endpoint exposed in the UI should have a direct observable effect on drone behavior.
No other RL concept maps as directly from abstract theory to physical outcome as reward design. Students who adjust a collision penalty and watch the drone become more conservative have had a genuine insight.
A sub-1s startup and reliable contact sensor behavior matter more than impressive visuals. Students who encounter simulation bugs stop learning RL and start debugging ROS.
The 10ms Reward API response time was a deliberate design target. If reward updates feel sluggish, students disengage from the feedback loop.
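A latency target like this is usually checked against both a median and a tail percentile. The sketch below shows one way to summarize sampled response times. It assumes the samples have already been collected in milliseconds; the helper names (`percentile`, `summarize_latency`) are hypothetical, and the nearest-rank percentile definition is one choice among several.

```python
import math
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize_latency(samples_ms):
    """Return the median and P95 of response times given in milliseconds."""
    return {
        "median_ms": statistics.median(samples_ms),
        "p95_ms": percentile(samples_ms, 95),
    }
```

Reporting P95 next to the median matters here because a feedback loop feels sluggish when occasional slow responses occur, even if the typical response is fast.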
MLflow was integrated in Week 3 and immediately revealed the convergence plateau at 9e5 steps. Without experiment tracking, finding that plateau would have required re-running experiments.
Students should reason about rewards and actions, not attitude stabilization and PID tuning. The flight controller abstraction boundary is a pedagogical design decision as much as a technical one.
Reward design, PPO training, ROS 2 simulation, and a future path to real drone hardware — all accessible through a browser.