Using Human Input to Enhance Learning in Repeated Stochastic Games

  • Malik Hashem Altakrori

Student thesis: Master's Thesis


To alleviate the effects of global warming, several initiatives have been undertaken to reduce CO2 emissions worldwide. Because electricity generated by traditional power plants contributes heavily to these emissions, smart power grids that reduce electricity consumption (i.e., save energy) by improving consumers' consumption patterns can help lower them. The nature of the smart power grid requires frequent interactions between suppliers and consumers in the electricity market. Since customers are not available at all times, automated systems are needed to make decisions on their behalf. These systems are called agents. In this research, we analyze the effectiveness of two learning algorithms, called Imitator and MCRL–LbD, that an agent can use to learn efficient behavior in a multi-agent environment called the Multi-Stage Prisoner's Dilemma (MSPD), which abstractly models an electricity market. While traditional learning algorithms fail to learn effective behavior in the MSPD, both Imitator and MCRL–LbD leverage demonstrations of example behavior from a user to learn effective behavior under various circumstances. According to our results, MCRL–LbD outperforms Imitator and shows stable, desirable performance under different conditions. While Imitator helped the agent learn the desired actions when the demonstrations were informed, an agent using MCRL–LbD learned the desired actions regardless of the degree of knowledge provided by the demonstrator. Additionally, Imitator's performance is affected more heavily than MCRL–LbD's by changes in the number or frequency of demonstrations. On the other hand, Imitator converges more quickly than MCRL–LbD. These results demonstrate the usefulness of incorporating learning-from-demonstration techniques in multi-agent settings such as future power grids.
Future work should continue to study and improve these techniques to allow users to easily configure smart devices for such power grids according to their needs and preferences.
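The abstract does not describe how Imitator or MCRL–LbD work internally, nor the MSPD's payoff structure. As a minimal sketch of the general idea of learning from demonstration in a repeated Prisoner's Dilemma, assuming textbook stage-game payoffs (T=5, R=3, P=1, S=0) and a state defined as the opponent's previous action, the hypothetical `DemonstrationImitator` below plays whichever action a user demonstrated most often in each state and otherwise falls back to a default. It is an illustration only, not the thesis's Imitator or MCRL–LbD algorithm.

```python
from collections import Counter, defaultdict

# Hypothetical stage-game payoffs for a standard Prisoner's Dilemma (T > R > P > S).
# These are illustrative textbook values, not the MSPD payoffs used in the thesis.
PAYOFFS = {
    ("C", "C"): (3, 3),  # mutual cooperation (R, R)
    ("C", "D"): (0, 5),  # sucker's payoff vs. temptation (S, T)
    ("D", "C"): (5, 0),  # temptation vs. sucker's payoff (T, S)
    ("D", "D"): (1, 1),  # mutual defection (P, P)
}


class DemonstrationImitator:
    """Hypothetical imitation learner: in each state, play the action the
    demonstrator chose most often; in unseen states, play a default action."""

    def __init__(self, default_action="D"):
        self.default_action = default_action
        self.counts = defaultdict(Counter)  # state -> Counter of demonstrated actions

    def observe(self, state, action):
        """Record one demonstrated (state, action) pair."""
        self.counts[state][action] += 1

    def act(self, state):
        """Return the most frequently demonstrated action for this state."""
        demos = self.counts.get(state)
        if not demos:
            return self.default_action
        return demos.most_common(1)[0][0]


def play(agent_a, agent_b, rounds, start="C"):
    """Repeated PD in which each agent's state is the opponent's previous action.
    Returns the two agents' total payoffs."""
    last_a, last_b = start, start
    total_a = total_b = 0
    for _ in range(rounds):
        a = agent_a.act(last_b)
        b = agent_b.act(last_a)
        pay_a, pay_b = PAYOFFS[(a, b)]
        total_a += pay_a
        total_b += pay_b
        last_a, last_b = a, b
    return total_a, total_b


if __name__ == "__main__":
    agent = DemonstrationImitator()
    # The (hypothetical) demonstrator behaves like tit-for-tat:
    # it repeats the opponent's previous action.
    for s in ["C", "C", "D", "C", "D"]:
        agent.observe(s, s)
    print(agent.act("C"), agent.act("D"), agent.act("start"))  # C D D
    print(play(agent, agent, 10))  # (30, 30): sustained mutual cooperation
    print(play(DemonstrationImitator(), DemonstrationImitator(), 10))  # (10, 10)
```

With no demonstrations, both agents defect every round and earn the mutual-defection payoff; after a few tit-for-tat demonstrations they sustain cooperation. This mirrors, in a very simplified way, why user demonstrations can help agents find more effective behavior in a social dilemma.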
Date of Award: Dec 2011
Original language: American English
Supervisor: Jacob Crandall


