I’m doing this analysis in the hope of providing aspiring game developers with useful information on how to price their games, which genres perform best, and which variables play the greatest role in a game’s success.
However, there are a few caveats to keep in mind:
The data set includes almost all the games on Steam in 2023. This is a drawback, as a significant proportion of them are low-effort listings, which aren’t helpful data points for an aspiring game developer.
The data set also has an inherent survivorship bias when looking at older listings. Many unsuccessful older titles get removed from the platform, so the older titles that remain may make the earlier years appear more successful than they were.
AAA developers also dominate certain categories within the data set, so we have to be mindful of this and take the scope of your own game into consideration.
Within the analysis, I’ll do my best to take these factors into account and deal with the data properly. Regardless, I think the findings are interesting.
The raw data has several ghost columns that are entirely NA, outlier values, and numeric fields such as prices, revenues, and review scores stored as strings.
library(tidyverse) # provides read_csv(), glimpse(), and the dplyr verbs used below
# Path to dataset
steam_raw <- read_csv("data/Steam_Trends_2023_by_evlko_and_Sadari.csv")
glimpse(steam_raw)
## Rows: 65,111
## Columns: 14
## $ `App ID` <dbl> 730, 578080, 570, 271590, 359550, 105600, 4000, …
## $ Title <chr> "Counter-Strike: Global Offensive", "PUBG: BATTL…
## $ `Reviews Total` <dbl> 7382695, 2201296, 2017009, 1322782, 978762, 9277…
## $ `Reviews Score Fancy` <chr> "88%", "57%", "82%", "89.85%", "86%", "97%", "96…
## $ `Release Date` <date> 2012-08-21, 2017-12-21, 2013-07-09, 2015-04-13,…
## $ `Reviews D7` <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ `Reviews D30` <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ `Reviews D90` <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ `Launch Price` <chr> "$14.99", "$29.99", "$29.99", "$29.99", "$59.99"…
## $ Tags <chr> "FPS, Shooter, Multiplayer, Competitive, Action,…
## $ name_slug <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ `Revenue Estimated` <chr> "$110,666,598.05", "$66,016,867.04", "$60,490,09…
## $ `Modified Tags` <chr> "FPS_, Shooter_, Multiplayer_, Competitive_, Act…
## $ `Steam Page` <chr> "https://store.steampowered.com/app/730", "https…
The code below handles this.
# Clean into a new `steam` data frame
steam <- steam_raw %>%
  # The default names in the data set weren't nice...
  rename(
    launch_price = `Launch Price`,
    revenue_est = `Revenue Estimated`,
    review_score_fancy = `Reviews Score Fancy`,
    release_date = `Release Date`
  ) %>%
  # Remove the ghost columns (entirely NA)
  select(-`Reviews D7`, -`Reviews D30`, -`Reviews D90`, -name_slug) %>%
  # Convert the string columns to useful numeric values
  mutate(
    launch_price = readr::parse_number(launch_price),
    revenue_est = readr::parse_number(revenue_est),
    review_score = readr::parse_number(review_score_fancy),
    release_date = as.Date(release_date),
    release_month = lubridate::month(release_date, label = TRUE)
  )
glimpse(steam)
## Rows: 65,111
## Columns: 12
## $ `App ID` <dbl> 730, 578080, 570, 271590, 359550, 105600, 4000, 252…
## $ Title <chr> "Counter-Strike: Global Offensive", "PUBG: BATTLEGR…
## $ `Reviews Total` <dbl> 7382695, 2201296, 2017009, 1322782, 978762, 927752,…
## $ review_score_fancy <chr> "88%", "57%", "82%", "89.85%", "86%", "97%", "96%",…
## $ release_date <date> 2012-08-21, 2017-12-21, 2013-07-09, 2015-04-13, 20…
## $ launch_price <dbl> 14.99, 29.99, 29.99, 29.99, 59.99, 9.99, 9.99, 39.9…
## $ Tags <chr> "FPS, Shooter, Multiplayer, Competitive, Action, Te…
## $ revenue_est <dbl> 110666598, 66016867, 60490100, 39670232, 58715932, …
## $ `Modified Tags` <chr> "FPS_, Shooter_, Multiplayer_, Competitive_, Action…
## $ `Steam Page` <chr> "https://store.steampowered.com/app/730", "https://…
## $ review_score <dbl> 88.00, 57.00, 82.00, 89.85, 86.00, 97.00, 96.00, 87…
## $ release_month <ord> Aug, Dec, Jul, Apr, Dec, May, Nov, Feb, May, Nov, N…
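To make the string-to-number conversion concrete, here is a minimal sketch of what `readr::parse_number()` does with values shaped like the ones in the raw columns. The vector `raw_values` is just an illustration I made up from the glimpse output above, not a separate part of the pipeline.

```r
library(readr)

# Example strings in the same shape as the raw price, revenue, and score columns
raw_values <- c("$110,666,598.05", "$14.99", "89.85%")

# parse_number() drops the currency symbol, grouping commas, and percent sign,
# and returns a plain double vector
parsed <- parse_number(raw_values)
print(parsed)
```

Note that the percent sign is simply dropped, so a review score of "88%" becomes 88 rather than 0.88.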
Why should we care about optimising how we go about creating games for Steam? Here are a couple of graphs that show some top-level data on what’s going on in the Steam games market, and how imbalanced the revenue distribution is.
Revenue from game sales is highly unbalanced. A small number of games, regardless of genre, take home the large majority of the revenue and the popularity.
We will define estimated revenue as our metric for success. This may not be the perfect statistic, as the success of a game means something different to each developer, but for this analysis that will be the assumption.
Below is an analysis of a few variables that may be relevant to a game’s success and that game designers should be thinking about.
Steam doesn’t categorize each game into a single genre. Instead it uses a tagging system, where the developer gives their game a set of tags, stored comma-separated in the Tags column. Non-discrete genres are a problem for our analysis, so I’ve decided to map the tags onto a fixed set of genres.
We also need a tag priority. The order matters here, as we’re sorting each game into the first relevant tag on the list. I’m showing my attempt below, which can be modified depending on your opinion of which tags belong in each genre.
genre_map <- list(
  Shooter = c("Shooter", "FPS", "Third-Person Shooter", "Bullet Hell", "Arena Shooter"),
  Survival = c("Survival", "Open World Survival Craft", "Crafting", "Resource Management"),
  Horror = c("Horror", "Survival Horror", "Psychological Horror", "Gore"),
  RPG = c("RPG", "JRPG", "Action RPG", "CRPG", "Party-Based RPG"),
  Strategy = c("Strategy", "RTS", "Turn-Based Strategy", "Grand Strategy", "Real Time Tactics"),
  Simulation = c("Simulation", "Life Sim", "Farming Sim", "Automobile Sim", "Flight"),
  Platformer = c("Platformer", "2D Platformer", "3D Platformer", "Precision Platformer"),
  Puzzle = c("Puzzle", "Logic", "Match 3", "Puzzle Platformer"),
  Sports = c("Sports", "Racing", "Football", "Basketball", "Vehicular Combat"),
  Adventure = c("Adventure", "Story Rich", "Exploration", "Narrative", "Walking Simulator"),
  Action = c("Action", "Combat", "Hack and Slash", "Beat 'em up", "Fast-Paced"),
  Indie = c("Indie", "Experimental", "Minimalist", "Cozy", "Cute")
)
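As a rough sketch of how a priority-ordered map like this can turn a comma-separated tag string into a single genre: the helper `assign_genre()` and the small `demo_map` below are hypothetical names for illustration, and I am assuming that "priority" means walking the genres in list order and taking the first one that shares a tag with the game. The actual assignment code may differ.

```r
# Hypothetical helper: return the first genre in `map` (in list order) that
# shares at least one tag with the game, or NA if nothing matches.
assign_genre <- function(tags, map) {
  # Split the comma-separated tag string and trim surrounding whitespace
  game_tags <- trimws(strsplit(tags, ",", fixed = TRUE)[[1]])
  for (genre in names(map)) {
    if (any(game_tags %in% map[[genre]])) {
      return(genre)
    }
  }
  NA_character_
}

# A cut-down map in the same shape as genre_map, for illustration only
demo_map <- list(
  Shooter = c("Shooter", "FPS"),
  Horror = c("Horror", "Survival Horror"),
  Action = c("Action", "Combat")
)

demo_tags <- c(
  "FPS, Shooter, Multiplayer, Action",
  "Action, Horror, Gore",
  "Casual, Relaxing"
)

demo_genres <- vapply(demo_tags, assign_genre, character(1),
                      map = demo_map, USE.NAMES = FALSE)
print(demo_genres)
```

Under this assumption the second game lands in Horror rather than Action, even though Action is its first tag, because Horror sits higher in the map. Games with no matching tag get NA and would need to be dropped or placed in an "Other" bucket.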
We also filter out the games that made $1,000 or less in total estimated revenue, which removes the low-effort listings that I’m considering irrelevant.
steam_genre_filtered <- steam_genre_clean %>%
  filter(revenue_est > 1000)
Let’s now look at each genre’s popularity and its median revenue_est.
In broad strokes, it seems the number of games in a genre correlates with how much the median game in the genre makes.
Survival and Shooter look to be the best-performing categories, with Survival having fewer listings.
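For readers who want to reproduce this kind of per-genre comparison, a minimal sketch of the underlying summary follows. The data in `toy_games` is invented, and the `genre` column name is my assumption about how the genre assignment is stored.

```r
library(dplyr)

# Made-up games, only to show the shape of the calculation
toy_games <- tibble(
  genre = c("Horror", "Horror", "Horror", "Shooter", "Shooter", "Puzzle"),
  revenue_est = c(5000, 20000, 80000, 3000, 9000, 1500)
)

# Number of listings and median estimated revenue per genre
genre_summary <- toy_games %>%
  group_by(genre) %>%
  summarise(n_games = n(), median_revenue = median(revenue_est)) %>%
  arrange(desc(median_revenue))

print(genre_summary)
```

The median is used rather than the mean because a handful of very large titles would otherwise dominate each genre’s figure.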
Here’s another way of viewing the same data, this time giving a feel for the distribution of revenues within each genre.
We can see in both graphs that the Survival and Horror genres produce the best average revenues. We can also see that Shooter is the genre with the most listings, but before we move on, let’s get a sense of how saturated any given genre is.
Each line shows, for an average game in a genre, the probability of making a given revenue.
It looks like Horror and Survival are the best genres for generating revenue across all revenue levels. One thing to note is that at the $100,000 revenue mark, many genres still have a probability greater than 0%.
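One way to compute curves like these is as an empirical exceedance probability: the share of games in a genre whose revenue is at least a given amount. That reading is my assumption about how the lines were built, and the data and column names below are invented for illustration.

```r
library(dplyr)

# Made-up games, only to show the calculation
toy_revenue <- tibble(
  genre = c("Horror", "Horror", "Horror",
            "Shooter", "Shooter", "Shooter", "Shooter"),
  revenue_est = c(5000, 20000, 150000, 3000, 9000, 12000, 40000)
)

# Share of games in each genre that reach at least $10,000 and $100,000
exceedance <- toy_revenue %>%
  group_by(genre) %>%
  summarise(
    p_10k = mean(revenue_est >= 10000),
    p_100k = mean(revenue_est >= 100000)
  )

print(exceedance)
```

Evaluating the same expression over a fine grid of revenue thresholds, instead of just two, would give one curve per genre.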
Here we will look at how games are priced on Steam.
We are going to bucket prices into tiers, as this reflects how many people think about pricing their games: prices cluster around the $4.99, $9.99, … marks. Let’s see how revenue distributions change depending on the pricing of a game.
We can easily see that fewer games are priced at the higher tiers, but those that are have a higher median revenue. However, this doesn’t mean you should price your game at $50-$60, as the statistic is skewed by the relatively high proportion of AAA games priced in this bracket.
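For anyone wanting to try the bucketing themselves, here is a minimal sketch using base R’s `cut()`. The break points in `price_breaks` are hypothetical (I am not reproducing the exact tiers used for the graphs), and the toy prices and revenues are invented.

```r
library(dplyr)

# Made-up games, only to show the bucketing
toy_prices <- tibble(
  launch_price = c(0.99, 4.99, 9.99, 19.99, 59.99),
  revenue_est = c(2000, 4000, 15000, 60000, 900000)
)

# Hypothetical tier boundaries; right = FALSE makes each tier [lower, upper),
# so a $9.99 game falls in the $5-$10 tier
price_breaks <- c(0, 5, 10, 20, 30, 40, 50, 60, Inf)

bucketed <- toy_prices %>%
  mutate(price_tier = cut(launch_price, breaks = price_breaks, right = FALSE))

# Listings and median estimated revenue per tier
tier_summary <- bucketed %>%
  group_by(price_tier) %>%
  summarise(n_games = n(), median_revenue = median(revenue_est))

print(tier_summary)
```

Left-closed intervals are a deliberate choice here: since prices cluster just below round numbers, they keep $4.99 and $9.99 games in the tier most people would expect.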
Using the same price buckets as before, let’s take a look at how the games are distributed across price points.
We can see most games are priced around the $2.50 to $10 price range.
Another way we can slice the data is by looking at games that fall in specific buckets of both Launch Price and Review Score (a measure of how well made a game is). This shows how the combination of quality and pricing relates to the resulting median revenue.
As we can see, the sweet spot for maximizing revenue seems to be the $50-$60 bracket. However, as noted before, this is probably skewed, as only developers that produce AAA games price their games in this bracket.
Our goal with the Random Forest is to see how changing each variable affects the expected revenue a given game generates.
The Y axis (Predicted Revenue) uses the same scale across graphs for comparison purposes.
Here we can see that, unsurprisingly, Launch Price and Review Score have the biggest impact on the overall revenue of a given game: Review Score for obvious reasons, and Launch Price because only AAA developers release games in the $60 range.
Release Year and Release Month have little to no impact.
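The general technique behind plots like these is partial dependence: fix one variable at a value, predict for every game, and average the predictions. Below is a minimal sketch on simulated data. The use of the `randomForest` package, the log-revenue target, the helper name `partial_dependence()`, and the simulated relationship are all my assumptions for illustration, not the model behind the graphs.

```r
library(randomForest)

set.seed(42)
n <- 300

# Simulated games: log-revenue depends on price and review score, not on month
toy <- data.frame(
  launch_price = runif(n, 0, 60),
  review_score = runif(n, 40, 100),
  release_month = sample(1:12, n, replace = TRUE)
)
toy$log_revenue <- 6 + 0.03 * toy$launch_price +
  0.04 * toy$review_score + rnorm(n, sd = 0.5)

rf_fit <- randomForest(
  log_revenue ~ launch_price + review_score + release_month,
  data = toy, ntree = 200
)

# Partial dependence by hand: set `var` to each grid value for every row,
# predict, and average the predictions
partial_dependence <- function(fit, data, var, grid) {
  sapply(grid, function(v) {
    data[[var]] <- v
    mean(predict(fit, newdata = data))
  })
}

price_grid <- seq(5, 55, by = 10)
pd_price <- partial_dependence(rf_fit, toy, "launch_price", price_grid)
print(round(pd_price, 2))
```

Running the same helper for `release_month` on this simulated data should give a nearly flat line, which is the pattern to look for when a variable has little effect.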
Horror and Survival punch above their weight, though most genres follow the trend of saturation correlating with median revenue.
Higher prices only work for games that already look and feel like AAA titles. For most developers, it seems the $10 range is the best.
Review score seems to be the strongest correlate of revenue, i.e. quality matters more than timing, price, and genre.