It’s been ten years since the inception of the Mario AI research community, but work in this space is still as engaging and exciting as it’s ever been. Today I’m going to look at a variety of research applying machine learning to Super Mario level generation since the competition ceased in 2012. I’ll be looking at the kinds of levels they’re generating, how these algorithms go about building a Mario level and the opportunities that still lie ahead for this research field. It’s time to meet the new Super Mario Makers.
Machine Learning and Procedural Generation
Before we look at the varying projects and systems in earnest, let’s cover some of the history of the field and a bit of background knowledge on the changes that have happened in recent years. The Mario AI competition introduced several strands of research involving everyone’s favourite plumber. While some of the challenges set by the competition, such as writing bots, were overcome pretty quickly courtesy of Robin Baumgarten’s A* player, the challenge of procedurally generating Mario levels was only just getting started. The competition tasked researchers not just with creating a system that made levels inspired by Super Mario Bros., but also with customising them based on some simple telemetry data about the user that was playing the game.
The resulting systems over the next couple of years were pretty diverse both in terms of how they operated and the subsequent levels they created. While this article is focussed on more recent work replicating Mario level design, if you’re interested in learning about the more formative research in player-driven PCG for Mario, I’d highly recommend exploring work by Dr Noor Shaker. But looking at more contemporary research, one of the largest transitions is the move away from building levels that adapt to player telemetry and towards levels that seek to mimic the original designs from Mario titles. What prevented this from occurring up until now was that these earlier research projects all embedded their own design knowledge about what a Super Mario level actually is.
Now this is an interesting area of study all in itself: what makes a platforming level a Super Mario level? The thing is that while there are certainly aspects of the overall aesthetic that influence the process, what is more critical is the logic and structure to the layout of even the simplest Mario levels. If you give people a piece of paper and a pencil and ask them to sketch you a Super Mario level, the results would probably differ in many respects due to each person’s knowledge of the franchise, but there would no doubt be some common elements. There would be brick blocks, maybe even question blocks, pipes, goombas and koopa troopas. There might be a pit to fall into and die, or a stairwell that would head towards the flag at the end of the level.
We have a collective understanding of these core elements, but often lack the intimate knowledge of how these things should link together and the relationships between them that make the Super Mario series a continually fun and engaging franchise. Heck, Nintendo is so confident in their mastery of this concept they’ve developed not one but two editions of Super Mario Maker where they give you a fairly flexible tool for building your own Mario levels to play on a Nintendo device without any real concern of whether it would impact the sales of the main series.
So bringing this back to AI research, Super Mario level generation research has continued to thrive. However, the biggest shift in recent years is the effort to understand Super Mario level design. To address this, more recent research avoids having researchers directly inject their own interpretations of Mario level design, given that, as mentioned, there is no absolute and complete set of rules dictating how Mario levels are built that we can lift from. Even when given the tools to do so in Super Mario Maker, we ourselves not only struggle to reproduce the ‘Mario Method’, but we often do something else entirely, a point that I’ll come back to later. Now the most recent research using machine learning has sidestepped this issue by letting the algorithm study levels itself.
As we’ll see in some of the projects I’m about to explore, in each case the system is fed Super Mario levels to allow it to build its own internal models. The manner in which this data is fed ranges from complete symbolic representations of tile grids, to reading in level graphics, all the way to watching YouTube videos. In each case, the systems infer their own logic of how tiles should be grouped together, what tiles are used in certain contexts and in some instances what textures should be applied to these tiles themselves. This process has enabled these recent level generation systems to more accurately interpret and reproduce aspects of Super Mario level design.
Generating from Design Patterns
First up, let’s take a look at the work of Dr Steve Dahlskog, a lecturer at Malmö University in Sweden. Steve’s work sought to assess Super Mario games using design patterns: identifying common elements in the layout or structure of levels or mechanics in games. His argument is that by establishing rules of how game elements are constructed, AI systems could adopt that knowledge as a means to make intelligent decisions on how to make new levels or even games. This research has led to him exploring procedural generation of dungeon levels as well as player experience evaluation systems, but much of his research, particularly in the earlier years of his PhD, revolved around Super Mario Bros.
Dahlskog’s research starts in (Dahlskog & Togelius, 2012) by identifying 23 unique design patterns within 20 land-based levels from the original Super Mario Bros., omitting underwater levels and castle levels. These patterns range from micro level behaviour, such as chains of goombas or how gaps are presented in the level, to macro behaviour where level construction is more abstract, such as multiple paths or stairways constructed using pipes with gaps in between.
Using these patterns as a means to identify properties that would reflect real game levels, the second phase of research (Dahlskog & Togelius, 2013) employed evolutionary computation to build levels and then assess them based on the number of patterns found in each example. This is achieved by breaking levels up into 200 vertical slices and then examining how those slices connect to one another and what patterns exist within subsets of these slices. The resulting levels reflect some of the classic Mario design patterns, but fail to respect the pacing of these patterns to create a more natural flow.
This was followed in 2014 with two projects. The first rebuilt the level evaluation to record micro and meso level patterns distinct from one another, which resulted in increased level variety. The second was a new level generation process, the multi-level level generator, that built levels at varying levels of abstraction: starting at macro pattern level, then working down to meso and micro.
His final body of research in this space was published in 2014 with assistance from Mark Nelson. While it’s not machine learning, it’s worth checking out, given it calculates contiguous subsequences, or n-grams, of vertical slices in order to construct a Markov model of the level creation process. Markov models are a smart approach here, given they’re designed to predict the next step in a sequence based on the probability of what has followed similar steps before. By using a collection of the most commonly found vertical slices used in Mario levels, the system can analyse a set of one or more levels, build the Markov model and then attempt to produce levels that will have similar n-grams within them. The resulting levels not only appear to reflect specific designs from existing Mario levels, but concatenate them in ways not previously considered.
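To make the n-gram idea a little more concrete, here is a minimal sketch in Python. It is not Dahlskog's implementation: the toy level, the tile symbols and the function names are all my own assumptions. It cuts a level into vertical slices, counts which slice follows each pair of slices (a trigram model), then samples a new sequence from those counts.

```python
import random
from collections import Counter, defaultdict

# Toy level, one string per row from top to bottom. The symbols are
# illustrative only: '-' air, 'X' ground, '?' question block, 'E' enemy.
LEVEL_ROWS = [
    "------?-------?-----",
    "--------------------",
    "----E------E-----E--",
    "XXXXXXXX--XXXXXX--XX",
]

def to_slices(rows):
    """Turn a row-major level into a list of vertical slices (column strings)."""
    return ["".join(col) for col in zip(*rows)]

def build_ngram_model(slices, n=3):
    """Count which slice follows each context of (n - 1) consecutive slices."""
    model = defaultdict(Counter)
    for i in range(len(slices) - n + 1):
        context = tuple(slices[i:i + n - 1])
        model[context][slices[i + n - 1]] += 1
    return model

def generate(model, seed, length, rng):
    """Sample slices one at a time, weighted by how often each followed the context."""
    out = list(seed)
    k = len(seed)
    while len(out) < length:
        options = model.get(tuple(out[-k:]))
        if not options:  # context never seen in training: stop early
            break
        choices, weights = zip(*options.items())
        out.append(rng.choices(choices, weights=weights)[0])
    return out

def to_rows(slices):
    """Convert vertical slices back into printable rows."""
    return ["".join(row) for row in zip(*slices)]

slices = to_slices(LEVEL_ROWS)
model = build_ngram_model(slices, n=3)
new_slices = generate(model, seed=slices[:2], length=30, rng=random.Random(0))
print("\n".join(to_rows(new_slices)))
```

Every run of three slices in the output already exists somewhere in the training level, which is why the results look familiar, but the order in which those runs are stitched together can be new.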
Generating from Memory
Next up let’s swing over to the United States and the work of Adam Summerville, who completed his PhD at UC Santa Cruz in 2018 and is at the time of writing an Assistant Professor at California State Polytechnic University. Adam has worked on a variety of fun research during his time as a grad student, but he tried out two distinct approaches to Mario level generation using machine learning.
First up is a project published in (Summerville et al, 2015) that shares some similarities with the final project by Dahlskog in that it also uses Markov Chains, but tries a different approach at validating the outcome. One element of the n-gram levels that can be problematic is that there’s no absolute guarantee that resulting levels are playable. It’s at the mercy of the Markov Chain producing something that makes sense. Hence Summerville’s paper takes a similar Markov Chain approach, but the decisions made by the Markov Chain are validated using Monte Carlo Tree Search that checks the decisions being made will not only be playable but tailors them against specific design parameters.
To get started, the Markov Chain is generated in a similar manner to Dahlskog’s work by examining vertical slices of each level, identifying tiles that are solid, breakable, enemies, pipes, coins, question blocks and hidden power-ups. Once the Markov Chain is established, every step of the level generation process relies on the MCTS algorithm to validate the quality of each possible action it can make. If the Markov Chain says there are three commonly used follow-ups to the current vertical slice, the MCTS scores them in a number of ways, with the system taking the one deemed most suitable. The MCTS evaluations are multifaceted, given that they’re designed not only to ensure the resulting levels are playable, but also to account for additional metrics that the user can parameterise based on their own personal interests. First each level is assessed for completeness by taking Baumgarten’s A* bot mentioned earlier and using it to test the level. In addition to having the bot test the level, there are independent parameters for the desirability of certain level features such as the number of gaps, enemies, coins and power-ups added. This hand-tweaking allows the designer to have more control of what types of levels are made, all the while relying on MCTS to ensure the resulting levels make sense.
The second approach by Summerville was to use a variant of recurrent artificial neural networks known as a long short-term memory or LSTM network. LSTM networks date back to the late 1990s and are great for handling sequences of data given they have a memory component, allowing them not just to read in fresh information, but to dictate when to remember or forget data from previous input cycles. As a result LSTM networks are often used in speech and video recognition, given both deal with continuous sequences of data.
As detailed in (Summerville and Mateas, 2016), this second approach trained an LSTM against 15 levels from the original Super Mario Bros. as well as 24 levels from the original Japanese release of Super Mario Bros. 2, often referred to in Western markets as Super Mario Bros.: The Lost Levels. Different configurations of networks were experimented with to enable the system to read and generate the corresponding output. The best configuration, known as snaking-path-depth, not only changes direction from column to column when generating levels (up one column, down the next, then up again), but also embeds special characters into the generated level that reflect the potential path a player could take and the current column it is generating, thus giving the system an understanding of how far into the level it is.
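As a rough illustration of scoring the Markov Chain's suggestions before committing to one, here is a small Python sketch. To be clear about what it is not: it uses flat random rollouts rather than a full Monte Carlo Tree Search, and it swaps the A* bot for a crude "no gap wider than the player can jump" check. The training slices, the weights and `MAX_JUMP` are all made-up values.

```python
import random
from collections import Counter, defaultdict

# Vertical slices written top to bottom; the last character is the ground row.
# '-' air, 'X' solid, 'E' enemy, 'o' coin. Data and weights are illustrative.
TRAINING = ["--X", "--X", "-EX", "--X", "---", "---", "--X", "o-X", "--X",
            "---", "--X", "-EX", "--X", "---", "---", "---", "--X", "--X"]
MAX_JUMP = 2  # assumed widest gap the player can clear (stand-in for the A* bot)
WEIGHTS = {"enemies": 1.0, "coins": 0.5, "gaps": 0.5}  # designer preferences

def build_chain(slices):
    """First-order Markov chain: how often each slice follows another."""
    chain = defaultdict(Counter)
    for current, nxt in zip(slices, slices[1:]):
        chain[current][nxt] += 1
    return chain

def widest_gap(slices):
    """Length of the longest run of slices with no ground tile."""
    widest = run = 0
    for s in slices:
        run = run + 1 if s[-1] != "X" else 0
        widest = max(widest, run)
    return widest

def score(slices):
    """Weighted desirability, with a heavy penalty for unplayable levels."""
    if widest_gap(slices) > MAX_JUMP:
        return -100.0
    gaps = sum(1 for a, b in zip(slices, slices[1:]) if a[-1] == "X" and b[-1] != "X")
    enemies = sum(s.count("E") for s in slices)
    coins = sum(s.count("o") for s in slices)
    return WEIGHTS["gaps"] * gaps + WEIGHTS["enemies"] * enemies + WEIGHTS["coins"] * coins

def rollout(chain, level, depth, rng):
    """Extend the level with random samples from the chain."""
    level = list(level)
    for _ in range(depth):
        options = chain.get(level[-1])
        if not options:
            break
        choices, weights = zip(*options.items())
        level.append(rng.choices(choices, weights=weights)[0])
    return level

def choose_next(chain, level, rng, rollouts=20, depth=5):
    """Pick the follow-up slice whose rollouts score best on average."""
    best, best_value = None, float("-inf")
    for candidate in chain.get(level[-1], {}):
        if widest_gap(level + [candidate]) > MAX_JUMP:
            continue  # skip slices that break playability immediately
        values = [score(rollout(chain, level + [candidate], depth, rng))
                  for _ in range(rollouts)]
        value = sum(values) / len(values)
        if value > best_value:
            best, best_value = candidate, value
    return best

def generate(chain, start, length, rng):
    level = [start]
    while len(level) < length:
        nxt = choose_next(chain, level, rng)
        if nxt is None:
            break
        level.append(nxt)
    return level

chain = build_chain(TRAINING)
level = generate(chain, "--X", 30, random.Random(0))
print("\n".join("".join(row) for row in zip(*level)))
```

Note that the training data contains a three-tile gap, so the chain on its own can happily produce one. It is the scoring step that keeps it out of the generated level, while the weights nudge the output towards more enemies, coins or gaps.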
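The snaking idea is easier to see in code. The following Python sketch flattens a tile grid into a single string column by column, reversing the reading direction on alternate columns, and drops in a marker character every few columns as a crude progress signal. The marker symbol, its spacing and the function names are my own assumptions for illustration, and the player-path characters used in the paper are left out.

```python
# Toy level, one string per row from top to bottom ('-' air, 'X' ground,
# '?' question block). Symbols are illustrative.
ROWS = [
    "--?-",
    "----",
    "XX-X",
]

def snake_encode(rows, marker_every=2, marker="|"):
    """Flatten a grid column by column, alternating the reading direction."""
    columns = ["".join(col) for col in zip(*rows)]  # each column reads top to bottom
    parts = []
    for i, col in enumerate(columns):
        if i and i % marker_every == 0:
            parts.append(marker)  # hypothetical progress marker between column groups
        # Even columns are read bottom to top, odd columns top to bottom.
        parts.append(col[::-1] if i % 2 == 0 else col)
    return "".join(parts)

def snake_decode(text, height, marker="|"):
    """Invert snake_encode for a level of known height."""
    tiles = text.replace(marker, "")
    columns = [tiles[i:i + height] for i in range(0, len(tiles), height)]
    columns = [c[::-1] if i % 2 == 0 else c for i, c in enumerate(columns)]
    return ["".join(row) for row in zip(*columns)]

encoded = snake_encode(ROWS)
print(encoded)  # X----X|--?--X
print(snake_decode(encoded, height=len(ROWS)) == ROWS)  # True
```

The appeal of snaking is that the end of one column sits right next to the start of the following one in the string, so a sequence model never has to leap from the top of the level back down to the ground.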
Having trained the snaking-path-depth network to within a certain level of confidence, it is then tasked with generating new levels as we can see now. These levels are assessed against a variety of metrics. These metrics — as Summerville himself states — are not intended to assess the believability of the level coming from an original Mario game, but instead enable new levels to be created that share similar properties but are nonetheless novel. The evaluation considers not just whether the level is completable — once again using an A* driven bot — but the percentage of empty space in the level, the negative space of the level — which is the empty space the player can actually reach, the number of interesting tiles placed, the number of jumps involved in playing the level optimally as well as measurements of how linear and lenient the level is to play.
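A couple of these metrics are simple enough to sketch. The Python below computes the percentage of empty space, a count of interesting tiles, and a basic linearity measure: how far the ground height strays from a straight line fitted through it. The exact definitions in the paper may well differ, so treat the tile sets and formulas here as assumptions.

```python
# Toy level, rows from top to bottom. '-' air, 'X' ground, '?' question block.
ROWS = [
    "----------",
    "---?------",
    "------XX--",
    "XXXX--XXXX",
]
GROUND = {"X"}
INTERESTING = {"?", "E", "o"}  # assumed set: question blocks, enemies, coins

def empty_space_pct(rows):
    """Percentage of tiles in the level that are air."""
    tiles = "".join(rows)
    return 100.0 * tiles.count("-") / len(tiles)

def interesting_count(rows):
    """Number of tiles that are neither air nor plain ground."""
    return sum(1 for tile in "".join(rows) if tile in INTERESTING)

def column_heights(rows):
    """Height of the highest ground tile in each column, 0 for a pit."""
    height = len(rows)
    return [max((height - r for r, tile in enumerate(col) if tile in GROUND), default=0)
            for col in zip(*rows)]

def linearity_error(heights):
    """Mean absolute distance from a least-squares line; 0 means perfectly linear."""
    n = len(heights)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(heights) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, heights)) / var_x
             if var_x else 0.0)
    intercept = mean_y - slope * mean_x
    return sum(abs(y - (intercept + slope * x)) for x, y in zip(xs, heights)) / n

print(empty_space_pct(ROWS))     # 72.5
print(interesting_count(ROWS))   # 1
print(column_heights(ROWS))      # [1, 1, 1, 1, 0, 0, 2, 2, 1, 1]
print(linearity_error(column_heights(ROWS)))
```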
Generating from Video
The third strand of research to examine is by Matthew Guzdial, who at the time of writing is close to completing his PhD at Georgia Tech. Matthew’s work is arguably the most famous of that discussed here, given it’s appeared all across the web. One of the big reasons for this is the element of novelty employed in capturing Mario level data. As we’ve seen already, each researcher is inputting Mario level information in different ways. Dahlskog annotated levels with design patterns, while Summerville fed tile data from levels. But Guzdial’s initial work aimed to capture not just level information, but an understanding of how players navigate these levels in real-time. So it learns about Mario levels by watching people play them on YouTube!
So how does watching a YouTube video result in an AI making Mario levels? As detailed in (Guzdial and Riedl, 2016), the project adopts OpenCV, an open source computer vision toolkit, to process each frame of a given playthrough video. The project aims to achieve two distinct things: first determining what can be learned about level layout from scraping video footage, and second working out how to represent the design knowledge hidden within the gameplay footage such that it can be reproduced for the purposes of level generation.
To learn what it can of level layout, it runs a process to identify and categorise what Guzdial refers to as ‘high interaction areas’ — a segment of the level in which the player spends more relative time in compared to others. This can refer to areas with jumping puzzles, floating coins in the map or question blocks and hidden items such as power-ups. These high interaction areas are identified in the video footage then analysed to understand how sprites are placed within the sequence. This is more easily achieved in the original Super Mario Bros. given there are only 102 sprites the system needs to consider. Videos are broken up into distinct sections of level by assessing the differences in contents of each frame. The resulting interaction areas are then clustered such that they can be categorised effectively, resulting in 21 clusters where video segments ranging 2 to 250 frames in size carry specific interesting properties such as being underground or in the treetops. In order to prevent the death of Mario — which causes a black screen — from interrupting this analysis process, only video footage where players don’t die is used.
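The notion of a high interaction area boils down to counting how long the player lingers in each part of the level. Here is a small Python sketch of that idea. It assumes the hard part is already done, namely that we have a horizontal camera position for every frame of the video, and the chunk width and threshold factor are arbitrary values I've picked for illustration rather than anything from the paper.

```python
from collections import Counter

def high_interaction_chunks(camera_x_per_frame, chunk_width=16, factor=1.5):
    """Return indices of level chunks where the player spent unusually many frames."""
    frames_per_chunk = Counter(x // chunk_width for x in camera_x_per_frame)
    mean = sum(frames_per_chunk.values()) / len(frames_per_chunk)
    return sorted(chunk for chunk, frames in frames_per_chunk.items()
                  if frames > factor * mean)

# Hypothetical trace: the player pauses briefly at the start, runs right,
# spends 40 frames around x = 64 (say, working out a jump), then runs on.
trace = [0] * 5 + list(range(0, 64, 4)) + [64] * 40 + list(range(64, 128, 4))
print(high_interaction_chunks(trace))  # [4]
```

Only the chunk covering x = 64 to 79 stands out in this trace; the short pause at the start isn't long enough to clear the threshold.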
The second phase is then to build a probabilistic model that is based on the clusters pulled from the video data. This model encodes design rules whereby Level nodes, marked (L) in the diagram above, can be generated in a variety of different ways from the data learned from the clusters of a specific style. This requires the G and D nodes to be properly calibrated. The G node of the model represents all of the geometric information about a given shape in the world that is comprised of one or more sprites. The tree bark in the treetop segments is a common shape that the system recognises and learns to generalise across a variety of different permutations. Then there’s the D node of the model, which stores the relational information of a G node to all other G nodes in a given level section. This is essentially encoding design knowledge of how objects are placed relative to one another. So the system has effectively grabbed all of this video information, parsed out interesting shapes that use specific sprites, then learned how those shapes relate to one another. The interesting part of this is that it’s not learning how to make a Mario level, nor does the system really understand what Mario is. Instead it’s learning how sprites are positioned relative to one another in segments of video footage that just so happen to come from a video game.
When the model is ready, it can generate segments of level by considering the shapes that would typically appear in that segment, which in turn need to consider not just the sprites they would draw on screen, but their position on screen and their relationship with other shapes that would appear on screen. The resulting levels are astoundingly accurate for a system that is only learning from video footage. This project was but the start of a longer body of work explored in subsequent years about how video footage of games can be harnessed not just to understand how sprites can be put together for level designs, but how it can recreate actual in-game behaviour such as collisions, jump physics and scoring.
Learning from Deception
Last but most certainly not least is more recent research by Dr Vanessa Volz: a PhD graduate from the Technical University of Dortmund who is currently a research associate at Queen Mary University of London.
Volz’s research, detailed in (Volz et al, 2018), explored how to build levels using a technique known as generative adversarial networks, or GANs for short. GANs are a form of unsupervised learning that has proven popular since 2014, but builds on existing research on adversarial learning between neural networks that dates back to the early 1990s. A generative adversarial network is a deep learning technique comprised of two distinct convolutional neural networks known as the generator and the discriminator. The generator creates solutions to a given problem while the discriminator evaluates their quality.
To make this happen, the discriminator learns to recognise a specific set of samples from a dataset while the generator learns to create samples that fool the discriminator into thinking what it has created is authentic. Over time each network becomes increasingly better at its respective task, with the discriminator becoming better at recognising authentic data, while the generator becomes better at fooling the discriminator into believing its output is authentic. As a bonus, the output of the generator improves in quality. This approach has resulted in significant improvements in fake AI-generated imagery and style transfer of photographs. Arguably the biggest impact it’s had on gaming thus far is the recent work in modding communities using machine learning to up-scale textures to 4K resolution for games such as Elder Scrolls: Morrowind, DOOM, Metroid Prime and Final Fantasy VII.
But bringing it back to Mario, how did Volz and her fellow authors get this running for Mario levels? The process is broken down into two distinct phases, the first of which is training the desired generator network. The discriminator is trained against one level of Super Mario Bros. taken from the same corpus used in Summerville’s work, while the generator is trained to start learning how to fool the discriminator. Once this process is completed, the generator can create new levels that fool the discriminator, which enables the second phase of learning to begin. The second phase uses a process known as Covariance Matrix Adaptation Evolution Strategy (or CMA-ES for short) to search the inputs fed to the generator such that it builds levels that reflect specific design properties such as the number of ground tiles and enemies placed. Levels are also evaluated based on whether Baumgarten’s A* bot can complete them and the number of jumps it requires to do so.
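To give a feel for that second phase, here is a heavily simplified Python sketch. It uses a basic elitist evolution strategy with Gaussian mutation as a stand-in for CMA-ES (which additionally adapts a covariance matrix and step size), and `stub_generator` is a placeholder for the trained generator network, not a real GAN. The fitness function, the latent size and the target tile count are invented for illustration.

```python
import random

LATENT_SIZE = 8  # assumed length of the vector fed to the generator

def stub_generator(z):
    """Placeholder for the trained generator: maps a latent vector to a row of tiles."""
    return "".join("X" if value > 0 else "-" for value in z)

def fitness(z, target_ground=6):
    """Lower is better: distance from the desired number of ground tiles."""
    return abs(stub_generator(z).count("X") - target_ground)

def evolve(rng, generations=50, offspring=10, sigma=0.5):
    """Mutate the latent vector, keeping any child that is at least as good."""
    parent = [rng.uniform(-1, 1) for _ in range(LATENT_SIZE)]
    parent_fit = fitness(parent)
    for _ in range(generations):
        for _ in range(offspring):
            child = [max(-1.0, min(1.0, value + rng.gauss(0, sigma)))
                     for value in parent]
            child_fit = fitness(child)
            if child_fit <= parent_fit:
                parent, parent_fit = child, child_fit
    return parent, parent_fit

best, best_fit = evolve(random.Random(1))
print(stub_generator(best), best_fit)
```

The key point is that the generator itself is left alone in this phase. The search only changes the vector that goes in, which is part of why producing a new level is so cheap once training is done.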
This then results in levels such as this one here that is based on snippets generated by the system, but ordered in progressively more difficult segments. While it isn’t always perfect — and requires the encoding used by the system to be properly calibrated — the real benefit of this system is that levels can be generated very quickly, to the point it could in theory generate new levels for the player while they’re in the middle of trying out existing ones! You can watch the level generator in action via the video below, plus if you’re keen on trying out this generator yourself, the code is available up on GitHub.
The New Mario AI Framework
And funnily enough, just as I was putting the finishing touches to this piece, a new and improved version of the Mario AI framework was made public and ready for a new generation of researchers and hobbyists to use! The framework is being built by Ahmed Khalifa, currently a PhD candidate at New York University. It not only integrates many of the features from the 2009 original, but also adds built-in AI players and level generators, contains thousands of generated levels from previous competitions and aims to better support continuing research in the future. All that and it’s using the original Mario art too! Head on over to MarioAI.org to learn more and download the latest version.
References & Related Work
Steve Dahlskog and Julian Togelius (2012): Patterns and Procedural Content Generation. Proceedings of the FDG Workshop on Design Patterns in Games (DPG).
Steve Dahlskog and Julian Togelius (2013): Patterns as Objectives for Level Generation. Proceedings of the Workshop on Design Patterns in Games at FDG.
Steve Dahlskog and Julian Togelius (2014): A Multi-level Level Generator. Proceedings of the IEEE Conference on Computational Intelligence and Games (CIG).
Steve Dahlskog, Julian Togelius and Mark J. Nelson (2014): Linear levels through n-grams. Proceedings of Academic MindTrek.
Summerville, A.J., Philip, S., & Mateas, M. (2015). MCMCTS PCG 4 SMB : Monte Carlo Tree Search to Guide Platformer Level Generation.
Summerville, A., & Mateas, M. (2016). Super Mario as a String: Platformer Level Generation Via LSTMs.
Guzdial, M.J., & Riedl, M.O. (2016). Toward Game Level Generation from Gameplay Videos. CoRR, abs/1602.07721. https://arxiv.org/ftp/arxiv/papers/1602/1602.07721.pdf
Volz, V., Schrum, J., Liu, J., Lucas, S.M., Smith, A.D., & Risi, S. (2018). Evolving mario levels in the latent space of a deep convolutional generative adversarial network. GECCO.