TensorForce: A TensorFlow library for applied reinforcement learning – reinforce.io
This post is about a practical question: How can the applied reinforcement learning community move from collections of scripts and individual examples closer to an API for reinforcement learning (RL) — a ‘tf-learn’ or ‘scikit-learn’ for RL? Before walking through the TensorForce framework, we will lay out the observations and thoughts that motivated the project. Feel free to skip this part if you are just interested in the API walkthrough. We want to emphasize that this post is not an introduction to deep RL itself, nor does it present a new model or discuss the latest state-of-the-art algorithms, so the content may be of limited interest to pure researchers.
Say you are a researcher in computer systems, natural language processing, or some other applied domain. You have a basic understanding of RL and are interested in exploring deep RL to control some aspect of your system.
There are a number of blog posts with introductions to deep RL, DQN, vanilla policy gradients, A3C, and so forth (we like Karpathy’s, in particular for its great description of the intuition behind policy gradient methods). There is also a lot of code out there to help with getting started, e.g. the OpenAI starter agents, rllab, and many GitHub projects implementing specific algorithms.
However, we observe a significant gap between these research frameworks and using RL for practical applications. Here are a few potential issues when moving to applied domains:
- Tight coupling of RL logic with simulation handles: Simulation environment APIs are very convenient; for instance, they make it possible to create an environment object and then use it somewhere in a for loop that also manages internal update logic (e.g. by collecting output features). This makes sense if the goal is to evaluate an RL idea, but it makes it harder to disentangle the RL code from the simulation environment. It also touches on the question of control flow: Can the RL code call the environment when it is ready, or does the environment call the RL agent when it requires a decision? For RL library implementations to be applicable in a wide range of domains, we often need the latter.
- Fixed network architectures: Most example implementations contain hardcoded neural network architectures. This is usually not a big problem, as it is straightforward to plug in or remove different network layers as necessary. Nonetheless, it would be better for an RL library to provide this functionality as a declarative interface, without having to modify library code. In addition, there are cases where modifying the architecture is (unexpectedly) more difficult, for instance, if internal states need to be managed (see below).
- Incompatible state/action interface: A lot of early open-source code using the popular OpenAI Gym environments follows the simple interface of a flat state input and a single discrete or continuous action output. DeepMind Lab, however, uses a dictionary format for, in general, multiple states and actions, while OpenAI Universe uses named key events. Ideally, we want an RL agent to be able to handle any number of states and actions, with potentially different types and shapes. For example, one of the TensorForce authors is using RL in NLP and wants to handle multimodal input, where a state conceptually contains two inputs, an image and a corresponding caption.
- Opaque execution settings and performance issues: When writing TensorFlow code, it is natural to focus on the logic first. This can lead to a lot of duplicate or unnecessary operations being created, or to intermediate values being materialized unnecessarily. Further, distributed/asynchronous/parallel reinforcement learning is a bit of a moving target, and distributed TensorFlow requires a fair amount of hand-adjusting to a particular hardware setting. Again, it would be neat to eventually have an execution configuration that could just declare the available devices or machines and have everything else managed internally, e.g. declaring two machines with given IPs that are supposed to run asynchronous VPG.
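To make the control-flow point above concrete, here is a minimal sketch of the inverted setup: the application owns the loop and calls the agent back whenever it needs a decision. The `Agent` class and its `act`/`observe` methods are illustrative names for this sketch, not a definitive library API.

```python
class Agent:
    """Stub agent: picks an action from the state and records feedback.

    A real agent would run a policy network in act() and store/learn
    from transitions in observe(); this stub only illustrates the
    call direction (application -> agent, never the reverse).
    """

    def __init__(self, num_actions):
        self.num_actions = num_actions
        self.history = []

    def act(self, state):
        # Deterministic toy policy standing in for a network forward pass.
        return hash(tuple(state)) % self.num_actions

    def observe(self, reward, terminal):
        # A real agent would record the transition and possibly update here.
        self.history.append((reward, terminal))


def application_loop(agent):
    # The application (e.g. a live system, not a simulator) drives
    # the loop and calls the agent only when a decision is required.
    state = [0.0, 1.0]
    for step in range(3):
        action = agent.act(state)
        reward = 1.0 if action == 0 else 0.0  # toy reward signal
        agent.observe(reward, terminal=(step == 2))


agent = Agent(num_actions=2)
application_loop(agent)
print(len(agent.history))  # one feedback record per decision
```

The key design point is that the agent never holds a handle to the environment; it only exposes callbacks, so the same agent code can be embedded in a simulator loop or a production system.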
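The declarative-architecture point can likewise be sketched as data rather than code: the network is a list of layer specifications that a library translates into actual layers. The spec format and the toy builder below are assumptions for illustration, not TensorForce's actual implementation.

```python
# Architecture as data: swapping layers means editing this list,
# not modifying library code.
network_spec = [
    {"type": "dense", "size": 64, "activation": "relu"},
    {"type": "dense", "size": 32, "activation": "relu"},
]


def build_network(spec):
    # Toy builder: translate each spec entry into a layer description.
    # A real library would construct TensorFlow ops here instead.
    layers = []
    for layer in spec:
        if layer["type"] == "dense":
            layers.append(("dense", layer["size"], layer.get("activation", "linear")))
        else:
            raise ValueError(f"unknown layer type: {layer['type']}")
    return layers


print(build_network(network_spec))
```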
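For the state/action interface, the multimodal NLP example above (an image plus a caption) suggests dictionaries of named, typed, shaped components rather than a single flat state. The spec format and the `validate_state` helper here are hypothetical, shown only to illustrate the idea.

```python
# Named state components with individual types and shapes, so an agent
# can consume any number of inputs (here: image + caption token ids).
states_spec = {
    "image": {"type": "float", "shape": (4, 4, 3)},
    "caption": {"type": "int", "shape": (20,)},
}

# Multiple named actions are declared the same way.
actions_spec = {
    "answer": {"type": "int", "num_values": 10},
    "confidence": {"type": "float", "shape": ()},
}


def validate_state(spec, state):
    # Check that a concrete observation provides every declared component.
    missing = [name for name in spec if name not in state]
    if missing:
        raise KeyError(f"missing state components: {missing}")
    return True


state = {
    "image": [[[0.0] * 3] * 4] * 4,  # dummy 4x4 RGB image
    "caption": [0] * 20,             # dummy token ids
}
print(validate_state(states_spec, state))  # True
```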
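Finally, the execution-configuration point: declaring machines and letting the library derive the distributed setup. The sketch below shows what such a declaration and its expansion into a cluster description could look like; the keys and the `to_cluster_spec` helper are invented for illustration, and real distributed TensorFlow setups need more than this.

```python
# Declare what is available and what should run; everything else
# (cluster wiring, device placement) would be derived internally.
execution_config = {
    "mode": "asynchronous",
    "algorithm": "vpg",
    "machines": ["10.0.0.1:2222", "10.0.0.2:2222"],
}


def to_cluster_spec(config):
    # Expand the declaration into a worker list, one task per machine.
    return {"worker": list(config["machines"])}


print(to_cluster_spec(execution_config))
```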
To be sure, none of these points is meant as criticism of research code, since there is usually no intent for that code to be used as an API for other applications in the first place. Here we are presenting the perspective of researchers who want to apply RL in different domains.