# Reinforcement learning, non-Markov environments, and memory

I recently finished reading a reinforcement learning textbook (by Barto and Sutton), and throughout it I was constantly bothered by one major assumption that underlies all of the algorithms: the Markov property, stating that a succeeding event depends only on the one before it, and not on ones far in the past. This kind of restriction is a useful tool both practically and theoretically, but it causes problems when dealing with a fairly natural situation where there is long-term dependence between events. The book doesn't address this, so I tried to, and found out it's a real problem with actual research behind it. Here's what I did.

And here's the Rust code, which includes several algorithm and environment implementations taken from the book and also my approach to this dependence problem: https://github.com/corazza/reinforcement-learning

But first, a quick explanation of the basic concepts:

## The basics of reinforcement learning

The goal of RL algorithms is to learn a policy (for achieving some goal) from interacting with an environment. A policy can be thought of as a mapping from states to actions (or, a mapping from actions to probabilities of taking them in some particular state), which is followed by an RL agent. The interaction consists of the agent taking actions, which cause the environment to undergo state transitions and provide the agent with a reward signal evaluating the agent's performance in achieving the goal. The point isn't chasing after immediate rewards, but maximizing the overall cumulative reward (or return) from the interaction.

Many tasks can naturally be formulated in this framework: for example, a game of chess where all moves give a reward of 0, except a win resulting in a reward of 1 and a loss resulting in a reward of -1 (a draw also being 0).

The algorithms achieve policy improvement indirectly by estimating the value of the environment's states (or `(state, action)` pairs), and modifying the policy to better reflect that knowledge. Value is defined as the return (cumulative reward) that can be expected following a state (or state-action pair). If you are in state `S_t` with available actions `A1` and `A2`, then knowing that the state-action pair `(S_t, A1)` is more valuable than `(S_t, A2)` should be reflected in your policy so that the probability of taking `A1` is no less than the probability of taking `A2`.

This incremental process is called Generalized Policy Iteration and is a strong contender for the core idea of reinforcement learning. It essentially consists of two competing processes:

1. Policy evaluation: for a given policy, estimate the values of states or state-action pairs for the agent following that policy (i.e. learn a value function)
2. Policy improvement: for a given value function, change the policy so that it is more likely to lead to more valuable behavior (e.g. by making it greedy with respect to the learned value function)
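To make the improvement step concrete, here is a minimal Rust sketch of making a policy greedy with respect to a value table. The table layout (`q[state][action]` as nested vectors) and the names `greedy_action` and `improve_policy` are assumptions made for this illustration; they are not taken from the repository.

```rust
/// Index of the highest-valued action in one row of a Q-table.
/// Ties go to the lowest index; an empty row yields 0.
fn greedy_action(action_values: &[f64]) -> usize {
    let mut best = 0;
    for (i, &v) in action_values.iter().enumerate() {
        if v > action_values[best] {
            best = i;
        }
    }
    best
}

/// Policy improvement: build a deterministic policy (state index -> action index)
/// that is greedy with respect to the given value estimates.
fn improve_policy(q: &[Vec<f64>]) -> Vec<usize> {
    q.iter().map(|row| greedy_action(row)).collect()
}
```

Calling `improve_policy` on a table with rows `[0.0, 1.5]` and `[2.0, -1.0]` picks action 1 in the first state and action 0 in the second. Evaluation would then re-estimate the values under this new policy, and the two steps alternate.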

A very important theoretical result that enables this is the policy improvement theorem, which justifies the policy improvement step. It assures that the resulting policy is either strictly better, or that the original policy is already optimal. (There's some very interesting math here derived from Bellman optimality equations, but I won't get into that here.)

Another important idea is that the crucial problem all RL algorithms solve can be thought of as credit assignment, i.e. giving credit for future outcomes to past behavior.

## The Markov property and breaking it

The interaction process is a sequence of random variables: `(S_0, A_0, R_1, S_1, A_1, R_2, ...)` called a Markov decision process. This is where the Markov property comes into play: it is a restriction on the environment/interaction dynamics stating that the probability of the next state and reward is a function of the previous state and action taken. The only information that determines the probabilities of transitioning into `S_{t+1}` with reward `R_{t+1}` is the pair `(S_t, A_t)` (these probabilities usually aren't known, however).
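Written out as a formula, the property says that conditioning on the whole history tells you nothing beyond conditioning on the latest state and action. The placeholders `s'` and `r` for a particular next state and reward are the usual textbook notation, not symbols defined elsewhere in this post:

```latex
\Pr\{S_{t+1} = s',\ R_{t+1} = r \mid S_0, A_0, R_1, \ldots, S_t, A_t\}
  = \Pr\{S_{t+1} = s',\ R_{t+1} = r \mid S_t, A_t\}
```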

The restriction is useful: it underlies the convergence proofs for estimating values and guarantees that policy iteration is possible and finds the optimal policy in the limit. Many environments satisfy the Markov property, so you can still get far. It also guarantees that it is possible to encode information about values of states and actions "locally": that we can expect the return following a taken action to really be due to that action, and not some past one. The relation to the credit assignment problem is evident.

But it is also a real restriction that bothered me while reading: it's not hard to come up with meaningful tasks/environments where the Markov property does not hold. After finishing the book I wanted to know how hard it would be to change the algorithms to deal with credit assignment in these non-Markov environments. First, here's an example of such a case:

The task here is simple. There is a corridor that ends with two choices: up or down. On each trial, one of the paths is randomly trapped, meaning that taking it results in a large negative reward (the other path gives a positive reward). All other actions give a small negative reward to incentivize moving forward. The crucial part is that the location of the trap is observable, but only at the beginning of the corridor. Afterwards, this information is unavailable.
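The task can be sketched as a tiny Rust environment. Everything named here (`Trap`, `Obs`, `Move`, `CorridorEnv`) is hypothetical and written for this illustration, not copied from the repository. The step reward of -5 and the goal reward of 100 match the sample episodes later in the post; the trap penalty of -100 and the "illegal moves leave you in place" rule are assumptions. The trapped side is passed in by the caller so the sketch needs no random-number crate.

```rust
const STEP_REWARD: f64 = -5.0;
const GOAL_REWARD: f64 = 100.0;
const TRAP_REWARD: f64 = -100.0; // assumed value, the post only says "large negative"

/// Which branch is trapped in the current trial.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Trap { Upper, Lower }

/// What the agent sees. Only `Observe` reveals the trap location.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Obs { Start, Observe(Trap), Corridor(usize), Split, Done }

#[derive(Clone, Copy, Debug, PartialEq)]
enum Move { Forward, Up, Down }

struct CorridorEnv { trap: Trap, length: usize, obs: Obs }

impl CorridorEnv {
    /// `length` is the number of corridor cells and is assumed to be at least 1.
    fn new(trap: Trap, length: usize) -> Self {
        CorridorEnv { trap, length, obs: Obs::Start }
    }

    fn branch_reward(&self, chosen: Trap) -> f64 {
        if chosen == self.trap { TRAP_REWARD } else { GOAL_REWARD }
    }

    /// Apply one move and return (next observation, reward).
    fn step(&mut self, mv: Move) -> (Obs, f64) {
        let (next, reward) = match (self.obs, mv) {
            (Obs::Done, _) => (Obs::Done, 0.0),
            (Obs::Start, Move::Forward) => (Obs::Observe(self.trap), STEP_REWARD),
            (Obs::Observe(_), Move::Forward) => (Obs::Corridor(1), STEP_REWARD),
            (Obs::Corridor(i), Move::Forward) if i < self.length => {
                (Obs::Corridor(i + 1), STEP_REWARD)
            }
            (Obs::Corridor(_), Move::Forward) => (Obs::Split, STEP_REWARD),
            (Obs::Split, Move::Up) => (Obs::Done, self.branch_reward(Trap::Upper)),
            (Obs::Split, Move::Down) => (Obs::Done, self.branch_reward(Trap::Lower)),
            // Assumption: any other move keeps the agent where it is.
            (stay, _) => (stay, STEP_REWARD),
        };
        self.obs = next;
        (next, reward)
    }
}
```

Note that `Obs::Split` carries no trace of which side was trapped. That missing piece of information is exactly what makes the task non-Markov from the agent's point of view.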

Why does this break the Markov property? Because in the final state, i.e. the "split" where the up/down decision must be made, the reward probabilities depend on the starting state, i.e. on the observation made there. The choice "up" doesn't have a fixed set of probabilities for receiving rewards; rather, the probabilities are a function of what the agent had seen sometime in the past. The information about the value of available actions can't be encoded locally at the split state alone.

This is the general form of many of these Markov-breaking examples: long-term dependence of future events on past events. And none of the algorithms from the book can solve this problem, even though it seems to correspond to a reasonable task.

I came up with a simple solution and it really works, although only if certain requirements are met by the algorithm. Later I found out that people had already worked on this problem (of course), but I'm still glad I independently identified it and had a working solution.

The solution is giving agents a few bits of memory and actions for mutating them. Actually, for the corridor problem, a single bit will suffice. But this must be done transparently, with minimal changes to the algorithm itself. More precisely, the agent must have its `(state, action)` pairs extended at each step: the state `S_t` becomes `(S_t, MS_t)` and the action `A_t` becomes `(A_t, MA_t)`, where `MS` and `MA` denote memory-state and memory-action. Effectively this means you can take a conventional RL algorithm and give it, in each step of the interaction, the ability to read and write to memory `M` (which can in general be a bit array), and `M` becomes a part of the environment state from the perspective of the agent; this is how the "transparency" is achieved. The reading and writing of memory must be entirely subject to learning.

How can this solve the corridor problem? Well, for one, now it is at least possible to have a policy that solves it! The policy is simple: in the beginning, depending on the observation that tells the agent where the trap lies, either flip the bit or don't (it starts at 0, always). Afterwards, don't touch the bit (noop). Then, at the split where the up/down decision must be made, your state is no longer `S_t=Split`, but `(S_t, M_t)=(Split, m)`, and `m` tells you the observation from the past! The policy at this stage has encoded the value for going up or down that depends on the memory state set at the beginning: we've dealt with the temporal dependency between the observation event and the decision event.
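Here is a rough Rust sketch of that extension for a single memory bit. The names (`MemAction`, `AugState`, `AugAction`, `step_with_memory`) are made up for this example and the repository may structure things differently. The underlying environment is passed in as a closure, so the wrapper works with any environment and the learning algorithm only ever sees the augmented pairs.

```rust
/// Extra action component that operates on the one-bit memory.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
enum MemAction { Noop, Flip }

/// State as the learning algorithm sees it: (environment state, memory bit).
type AugState<S> = (S, bool);

/// Action as the learning algorithm sees it: (environment action, memory action).
type AugAction<A> = (A, MemAction);

/// Perform one step of the augmented interaction.
/// `memory` is the current bit, `env_step` performs the real environment
/// transition and returns (next environment state, reward).
fn step_with_memory<S, A>(
    memory: bool,
    action: AugAction<A>,
    env_step: impl FnOnce(A) -> (S, f64),
) -> (AugState<S>, f64) {
    let (env_action, mem_action) = action;
    // The memory action takes effect in the next augmented state.
    let next_memory = match mem_action {
        MemAction::Noop => memory,
        MemAction::Flip => !memory,
    };
    let (next_state, reward) = env_step(env_action);
    ((next_state, next_memory), reward)
}
```

Because the bit is just another component of the state and the flip is just another component of the action, a tabular algorithm needs no changes: its table simply has twice as many states and twice as many actions.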

But the harder question is: can this policy actually be learned?

## The limitation

To understand when this policy is learnable and when it is not, and why, it is best to look at the simplest RL control algorithm, SARSA, named after the basic segment in a Markov decision process `(S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1})`. There's no need to spell it out in detail; it only has two crucial parts, corresponding to the two competing processes in Generalized Policy Iteration. Here, `Q(s, a)` denotes the state-action value function that the algorithm is learning, and epsilon-greedy means choosing an action that's best according to `Q` most of the time (with `1 - epsilon` probability) and a random action some of the time (with `epsilon` probability).
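As a small illustration of epsilon-greedy selection, here is a sketch in Rust. To stay within the standard library, the two random numbers are passed in by the caller; that interface, and the name `epsilon_greedy`, are assumptions of this example rather than how the repository does it.

```rust
/// Epsilon-greedy choice over the Q-values of one state.
/// `explore_sample` and `pick_sample` are uniform random numbers in [0, 1)
/// supplied by the caller. `action_values` is assumed to be non-empty.
fn epsilon_greedy(
    action_values: &[f64],
    epsilon: f64,
    explore_sample: f64,
    pick_sample: f64,
) -> usize {
    if explore_sample < epsilon {
        // Explore: map the second sample to a uniformly chosen action index.
        let idx = (pick_sample * action_values.len() as f64) as usize;
        idx.min(action_values.len() - 1)
    } else {
        // Exploit: take the action with the highest estimated value.
        let mut best = 0;
        for (i, &v) in action_values.iter().enumerate() {
            if v > action_values[best] {
                best = i;
            }
        }
        best
    }
}
```

With `epsilon = 0.1`, roughly one call in ten ignores `Q` and explores, which is what lets the agent stumble on memory actions it has no reason to value yet.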

For each step of the episode:

1. From `S_t` take action `A_t`, observe the transition to `S_{t+1}` with reward `R_{t+1}`
2. Choose the next action `A_{t+1}` so that it is epsilon-greedy with respect to `Q` (policy improvement)
3. Update `Q(S_t, A_t)` a little towards a new target value `R_{t+1} + Q(S_{t+1}, A_{t+1})` (policy evaluation)
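Step 3 can be sketched in Rust as follows. This is an illustration under assumptions, not the repository's implementation: `Q` is stored in a `HashMap`, unseen pairs count as 0, "a little" is expressed with a step size `alpha`, and there is no discounting, matching the target as written above.

```rust
use std::collections::HashMap;
use std::hash::Hash;

/// One SARSA update: move Q(s, a) a fraction `alpha` of the way
/// towards the target r + Q(s_next, a_next).
/// Pairs that are not in the table yet are treated as having value 0.
fn sarsa_update<S, A>(
    q: &mut HashMap<(S, A), f64>,
    s: S,
    a: A,
    r: f64,
    s_next: S,
    a_next: A,
    alpha: f64,
) where
    S: Eq + Hash,
    A: Eq + Hash,
{
    let next_value = *q.get(&(s_next, a_next)).unwrap_or(&0.0);
    let target = r + next_value;
    let entry = q.entry((s, a)).or_insert(0.0);
    *entry += alpha * (target - *entry);
}
```

For example, if `Q(split, up)` is currently 100 and the agent moves forward from the last corridor cell with reward -5, an update with `alpha = 0.5` moves that cell's value from 0 to 47.5. Only the pair immediately before the transition is touched.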

The update target is the crucial part: `R_{t+1} + Q(S_{t+1}, A_{t+1})` is a new estimate of the return following `S_t` after `A_t` is taken, and what's interesting is that the only reward it uses directly is the immediate one, not the rewards that come afterwards. But return isn't just the immediate reward: it is the cumulative reward, i.e. the sum of all rewards. Correcting for that is the role of the `Q(S_{t+1}, A_{t+1})` term, the value of the succeeding state-action pair. As I mentioned in the introduction, value functions represent the return following a state-action pair.

My Rust code for SARSA is here; it has more details.

Can this algorithm learn a policy that solves the corridor problem? Well, no. Too bad! Why? Because of the credit assignment problem. The algorithm can't give credit to the action of remembering the observation because it only gives credit one step backwards! Here's an illustration:

The agent moves from `S` into either `U` or `L`, and this state represents the observation. Then it moves into the corridor `C(1)` (there can be several corridor steps afterwards, `C(2), ..., C(n)`), and on this move it is free to record the observation into `m`. Finally it reaches `Sp` (split), where it has to make its choice and receive the final reward depending on its choice and its past observation. SARSA simply isn't able to credit that crucial transition `U/L -> C(1)` where the recording could've taken place. So its performance will sadly always be 50/50.

## Not all is lost

Fortunately, there are RL algorithms that are smarter about credit assignment! There are Monte Carlo methods, which wait for the whole episode to finish before updating estimates for all state-action pairs. They also have the benefit of knowing the actual return that followed each one. One of their drawbacks is that they're incapable of online learning.

There's also n-SARSA! It's the SARSA algorithm, but expanded so that it records the last `n` transitions and assigns credit accordingly, not just using information from the last step. The book points out how SARSA and Monte Carlo are really just extremes of the n-SARSA continuum.
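The difference shows up in the update target. Below is a hedged Rust sketch of the n-step target, following the textbook definition without discounting (as in the rest of this post). The function name and the slice-based layout are assumptions for this example: `rewards[k]` holds `R_{k+1}` and `q_values[k]` holds the current estimate of `Q(S_k, A_k)` along one recorded episode.

```rust
/// n-step target for the pair visited at time `t`:
/// R_{t+1} + ... + R_{t+n} + Q(S_{t+n}, A_{t+n}).
/// If the episode ends within n steps there is nothing to bootstrap from,
/// and the target is the actual (Monte Carlo) return from time t.
/// Assumes `t < rewards.len()` and `q_values.len() == rewards.len()`.
fn n_step_target(rewards: &[f64], q_values: &[f64], t: usize, n: usize) -> f64 {
    let end = (t + n).min(rewards.len());
    let reward_sum: f64 = rewards[t..end].iter().sum();
    // Bootstrap only if step t + n still lies inside the episode.
    let bootstrap = if t + n < rewards.len() { q_values[t + n] } else { 0.0 };
    reward_sum + bootstrap
}
```

Take an episode with eight rewards of -5 followed by a final 100, and an all-zero `Q`. For the observation step (`t = 1`), `n = 1` gives a target of -5, which says nothing about the final outcome, while `n = 8` gives 65, so the final reward reaches the step where the bit was (or wasn't) flipped. With `n = 1` this is plain SARSA, and with `n` at least the episode length it is the Monte Carlo return.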

That's just what's needed: an algorithm that looks into the past far enough to be able to credit the decisions made back then.

I implemented n-SARSA too, Rust code is here, and that's the one I used for this problem. Here's a sample episode after the learning was done:

```
S: (Start, 0), A: (Forward, Noop), R: -5
S: (ObserveL, 0), A: (Forward, Flip), R: -5      --- Lower is trapped (flip the bit)
S: (Corridor(1), 1), A: (Forward, Noop), R: -5
S: (Corridor(2), 1), A: (Forward, Flip), R: -5
S: (Corridor(3), 0), A: (Forward, Flip), R: -5
S: (Corridor(4), 1), A: (Forward, Flip), R: -5
S: (Corridor(5), 0), A: (Forward, Flip), R: -5
S: (Corridor(6), 1), A: (Forward, Flip), R: -5
S: (Split, 0), A: (Up, Noop), R: 100             --- So go up
Return: 60
```

And here's one where `U` was observed:

```
S: (Start, 0), A: (Forward, Noop), R: -5
S: (ObserveU, 0), A: (Forward, Noop), R: -5      --- Upper is trapped (do not flip the bit)
S: (Corridor(1), 0), A: (Forward, Noop), R: -5
S: (Corridor(2), 0), A: (Forward, Flip), R: -5
S: (Corridor(3), 1), A: (Forward, Flip), R: -5
S: (Corridor(4), 0), A: (Forward, Flip), R: -5
S: (Corridor(5), 1), A: (Forward, Flip), R: -5
S: (Corridor(6), 0), A: (Forward, Flip), R: -5
S: (Split, 1), A: (Down, Noop), R: 100           --- So go down
```

It really works! Every time. Here's the full value function learned by n-SARSA.

The core idea here is to make the algorithm able to interact with its memory as well as with the environment. A single bit is a simple, boring solution, but it really works on a simple decision problem with temporal dependence. I googled around and found a paper that trained LSTMs for this purpose: I haven't read it yet but I believe that it corresponds to this same general idea; it's just a much more sophisticated architecture.