By Planning in the Latent Space of a Learned World Model. The world model Director builds from pixels allows effective planning in a latent space. To anticipate future model states given future actions, the world model first maps pictures to model states. Director optimizes two policies based on the model states’ anticipated trajectories: Every predetermined number of steps, the management selects a new objective, and the employee learns to accomplish the goals using simple activities. The direction would have a difficult control challenge if they had to choose plans directly in the high-dimensional continuous representation space of the world model. To reduce the size of the discrete codes created by the model states, they instead learn a goal autoencoder. The goal autoencoder then transforms the discrete codes into model states and passes them as goals to the worker after the manager has chosen them.
Deep reinforcement learning advancements have accelerated the study of decision-making in artificial agents. Artificial agents may actively affect their environment by moving a robot arm based on camera inputs or clicking a button in a web browser, in contrast to generative ML models like GPT-3 and Imagen. Although artificial intelligence has the potential to aid humans more and more, existing approaches are limited by the necessity for precise feedback in the form of often given rewards to acquire effective techniques. For instance, even robust computers like AlphaGo are restricted to a certain number of moves before earning their next reward while having access to massive computing resources.
Contrarily, complex activities like preparing a meal necessitate decision-making at all levels, from menu planning to following directions to the shop to buy supplies to properly executing the fine motor skills required at each stage along the way based on high-dimensional sensory inputs. Artificial agents can complete tasks more independently with scarce incentives thanks to hierarchical reinforcement learning (HRL), which automatically breaks down complicated tasks into achievable subgoals. Research on HRL has, however, been difficult because there is no universal answer, and existing approaches rely on manually defined target spaces or subtasks.