Infoscience

research article

Reinforcement Learning Using a Continuous Time Actor-Critic Framework with Spiking Neurons

Frémaux, Nicolas • Sprekeler, Henning • Gerstner, Wulfram
2013
PLOS Computational Biology

Animals repeat rewarded behaviors, but the physiological basis of reward-based learning has only been partially elucidated. On one hand, experimental evidence shows that the neuromodulator dopamine carries information about rewards and affects synaptic plasticity. On the other hand, the theory of reinforcement learning provides a framework for reward-based learning. Recent models of reward-modulated spike-timing-dependent plasticity have made first steps towards bridging the gap between the two approaches, but faced two problems. First, reinforcement learning is typically formulated in a discrete framework, ill-adapted to the description of natural situations. Second, biologically plausible models of reward-modulated spike-timing-dependent plasticity require precise calculation of the reward prediction error, yet it remains to be shown how this can be computed by neurons. Here we propose a solution to these problems by extending the continuous temporal difference (TD) learning of Doya (2000) to the case of spiking neurons in an actor-critic network operating in continuous time, and with continuous state and action representations. In our model, the critic learns to predict expected future rewards in real time. Its activity, together with actual rewards, conditions the delivery of a neuromodulatory TD signal to itself and to the actor, which is responsible for action choice. In simulations, we show that such an architecture can solve a Morris water-maze-like navigation task, in a number of trials consistent with reported animal performance. We also use our model to solve the acrobot and the cartpole problems, two complex motor control tasks. Our model provides a plausible way of computing reward prediction error in the brain. Moreover, the analytically derived learning rule is consistent with experimental evidence for dopamine-modulated spike-timing-dependent plasticity.
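For orientation, the continuous-time TD error of Doya (2000) that the abstract refers to has the form δ(t) = r(t) − V(t)/τ + dV/dt, where V is the critic's estimate of discounted future reward, r is the reward rate, and τ is the discount time constant. The sketch below only illustrates that formula on sampled traces; it is not the authors' spiking-neuron implementation. The function name `td_error`, the forward-difference derivative, and the sample values are assumptions made for illustration.

```python
def td_error(values, rewards, dt, tau):
    """Forward-difference estimate of delta(t) = r(t) - V(t)/tau + dV/dt.

    values  -- sampled value estimates V(t_k) (hypothetical critic output)
    rewards -- sampled reward rates r(t_k), same length as values
    dt      -- sampling interval
    tau     -- discount time constant
    Returns one TD error per interval, i.e. len(values) - 1 entries.
    """
    if len(values) != len(rewards):
        raise ValueError("values and rewards must have the same length")
    deltas = []
    for k in range(len(values) - 1):
        # Approximate dV/dt with a forward difference over one time step.
        dv_dt = (values[k + 1] - values[k]) / dt
        deltas.append(rewards[k] - values[k] / tau + dv_dt)
    return deltas


# Consistency check: under a constant reward rate r0, the discounted value
# is V = tau * r0, so a perfect critic yields zero TD error.
print(td_error([2.0, 2.0, 2.0], [1.0, 1.0, 1.0], dt=0.1, tau=2.0))  # [0.0, 0.0]
```

A nonzero δ signals a mismatch between predicted and received reward. In the model described above, that signal is delivered as a neuromodulatory factor to both the critic and the actor.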

Type: research article
DOI: 10.1371/journal.pcbi.1003024
Web of Science ID: WOS:000318069800024
Author(s): Frémaux, Nicolas; Sprekeler, Henning; Gerstner, Wulfram
Date Issued: 2013
Publisher: Public Library of Science
Published in: PLOS Computational Biology
Volume: 9
Issue: 4
Start page: 1
End page: 21
URL: http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1003024
Editorial or Peer reviewed: REVIEWED
Written at: EPFL
EPFL units: LCN
Available on Infoscience: April 16, 2013
Use this identifier to reference this record: https://infoscience.epfl.ch/handle/20.500.14299/91512
Infoscience is a service managed and provided by the Library and IT Services of EPFL. © EPFL, all rights reserved