A/B Testing: Data from 90,000 Mobile Game Players (Python)
To answer the question, ‘Does setting the gate at level 30 or level 40 result in higher player retention rates?’ I analyzed A/B test data from the mobile game ‘Cookie Cats.’ This study used nonparametric statistics and bootstrapping to address the non-normality of the samples. The results revealed that setting the gate at level 30 resulted in higher player retention rates, although there was no significant difference in the number of game rounds played by players.

‘Cookie Cats’ is a game similar to Candy Crush, where players match identical blocks to eliminate them and earn points to progress through levels. After reaching a certain level, players encounter a ‘gate’: if they wish to continue playing, they must either watch ads or spend in-game currency to bypass it. In the original version, the gate was placed at level 30. The company planned to move the gate from level 30 to level 40 but was uncertain how this decision would affect user retention rates.

This ‘gate’ mechanism requires users to take a break or spend money to continue playing, so such decisions may not only affect user retention rates but also revenue.
This is a project from DataCamp, and the project description is as follows: ‘…To complete this project, you should be comfortable working with pandas DataFrames and with using the pandas plot method. You should also have some understanding of hypothesis testing and bootstrap analysis.’
Understanding the Data
The data is from 90,189 players that installed the game while the A/B test was running. The variables are:
- userid: a unique number that identifies each player.
- version: whether the player was put in the control group (gate_30, a gate at level 30) or the test group (gate_40, a gate at level 40).
- sum_gamerounds: the number of game rounds played by the player during the first week after installation.
- retention_1: did the player come back and play 1 day after installing?
- retention_7: did the player come back and play 7 days after installing?
When a player installed the game, he or she was randomly assigned to either gate_30 or gate_40.

After initial data observation, we raise the most important question: ‘Will the player retention rate be higher when the gate is set at level 30 or level 40?’ Below is the analysis.
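All of the code below assumes the usual PyData stack and that the dataset has already been loaded into a DataFrame named data. A minimal setup sketch follows; the file name cookie_cats.csv is an assumption, so adjust it to wherever the CSV actually lives.
# Assumed setup: imports used throughout and the (assumed) CSV file name
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import shapiro, mannwhitneyu
from statsmodels.stats.proportion import proportions_ztest
data = pd.read_csv('cookie_cats.csv')  # file name is an assumption
print(data.head())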
Data Cleaning
Checking for duplicate values.
# Check duplicated row & userid
print(data.duplicated().any())
print(data["userid"].duplicated().any())
Result: both checks return False. No duplicate rows or duplicate user IDs were found.
Checking for null values.
#check missing value
data.isnull().sum()
Result: every column has a count of zero. No null values were found.
Checking for outliers.
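Before plotting anything, a quick look at the summary statistics flags suspicious values (a minimal sketch, reusing the data DataFrame from the setup above):
# Summary statistics; the maximum of sum_gamerounds is the value discussed below
print(data['sum_gamerounds'].describe())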

Using the describe() function, we can quickly spot potential outliers. Among the columns, ‘sum_gamerounds’ represents the number of game rounds played by players, and its maximum value is an unexpected 49,854 rounds. The game’s advertisements emphasize ‘800+ levels,’ so this value is clearly erroneous. Therefore, I plotted a boxplot to check whether there is more than one such outlier.
plt.figure(figsize=(8, 6))
sns.boxplot(x='version', y='sum_gamerounds', data=data)
plt.title('Boxplot of sum_gamerounds by version')
plt.xlabel('Version')
plt.ylabel('Sum Gamerounds')
plt.show()

The boxplot suggests this extreme value is isolated. To remove it rigorously, I used the IQR method. The interquartile range (IQR) is defined as Q3 - Q1; a value is considered an outlier when it lies more than 1.5 times the IQR below Q1 or above Q3.
# Removing outliers using IQR method
Q1 = data['sum_gamerounds'].quantile(0.25)
Q3 = data['sum_gamerounds'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
data_no_outliers = data[(data['sum_gamerounds'] > lower_bound) & (data['sum_gamerounds'] < upper_bound)]
After removing outliers with this rule, the boxplot below looks much more reasonable. Data cleaning is complete.

Exploratory Data Analysis (EDA)
Q: How many rounds did players play?
# Group by sum_gamerounds and count unique userids for each gameround number
gamerounds_count = data.groupby('sum_gamerounds').size().reset_index(name='count_of_players')
# Calculate the total number of players
total_players = data['userid'].count()
# Calculate the percentage of players for each gameround
gamerounds_count['percentage'] = (gamerounds_count['count_of_players'] / total_players) * 100
# Display the result
print(gamerounds_count)

The data shows that 3,994 people didn’t even play a single round! (Approximately 4.42% of the total players) Upon careful consideration, this is actually quite common. After downloading the game, players might open it, only to realize that the graphics or music are not to their liking, leading them to delete the game or simply not play it at all.
Additionally, it can be observed that 21,725 people played only the first four rounds and then stopped, accounting for approximately 24% of the total players! Perhaps they found the game uninteresting (it could be too simple, too boring, or too difficult), indicating that the enjoyment of games is highly subjective.
Q: How many people are in the control group (gate_30) and the experimental group (gate_40)? What is the average number of rounds played?
# Group by version and calculate the metrics
metrics = data.groupby('version').agg(
    average_sum_gamerounds=('sum_gamerounds', 'mean'),
    user_count=('userid', 'count')
)
# Calculate the percentage of count id for each version
metrics['percentage_count_id'] = (metrics['user_count'] / metrics['user_count'].sum()) * 100
print(metrics)

The control group (gate_30) consists of 44,700 people, accounting for 49.56% of the total, with an average of 52.45 rounds played. The experimental group (gate_40) consists of 45,489 people, accounting for 50.43% of the total, with an average of 51.29 rounds played. It can be seen that the control and experimental groups are evenly distributed.
Q: What are the retention rates for the control group (gate_30) and the experimental group (gate_40)?
# Group by version and calculate the metrics
metrics2 = data.groupby('version').agg(
    average_retention_1_rate=('retention_1', 'mean'),
    average_retention_7_rate=('retention_7', 'mean')
)
print(metrics2)

The one-day retention rate for the control group (gate_30) is 44.81%, while for the experimental group (gate_40) it is 44.22%. The seven-day retention rate for the control group (gate_30) is 19.02%, and for the experimental group (gate_40) it is 18.02%. From the descriptive statistics, it appears that the retention rates for the control group (gate_30) are higher. However, it is not appropriate to conclude that the control group is better, as it could simply be due to chance in sampling favoring the control group.
A/B Testing 1: Is there a significant difference in the number of rounds played between the control group and the experimental group?
We now enter the A/B testing analysis proper. I plotted the distributions below to get a clearer picture of how gamerounds are distributed in each group.
# Plotting distribution plot
plt.figure(figsize=(10, 6))
sns.kdeplot(data=data_no_outliers, x='sum_gamerounds', hue='version', fill=True)
plt.title('Distribution Plot of sum_gamerounds by Version')
plt.xlabel('Sum Gamerounds')
plt.ylabel('Density')
# seaborn's hue parameter adds the version legend automatically
plt.show()

It can be observed that the number of rounds played (gamerounds) by the control group and the experimental group overlaps considerably, and both distributions are right-skewed, unlike the normal distribution. Therefore, it is necessary to perform a Shapiro-Wilk test to check for normality.
If the data is normally distributed, a Levene test is conducted to assess homogeneity of variances. If both groups show homogeneity of variances, a standard t-test is applied. In case of heterogeneous variances between the two groups, a Welch t-test is utilized.
If the data does not follow a normal distribution, nonparametric analysis is conducted.
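As a sketch of this decision flow, the branching could be written as a small helper (the function name and the alpha default are my own; this analysis only ends up using the Shapiro-Wilk and Mann-Whitney branches):
# Sketch of the test-selection logic described above (helper name is illustrative)
from scipy.stats import shapiro, levene, ttest_ind, mannwhitneyu
def compare_two_groups(a, b, alpha=0.05):
    # Shapiro-Wilk: are both groups plausibly normal?
    _, p_a = shapiro(a)
    _, p_b = shapiro(b)
    if p_a > alpha and p_b > alpha:
        # Levene: standard t-test if variances are homogeneous, Welch's t-test otherwise
        _, p_var = levene(a, b)
        return ttest_ind(a, b, equal_var=(p_var > alpha))
    # Otherwise fall back to the nonparametric Mann-Whitney U test
    return mannwhitneyu(a, b)
In this dataset the normality check fails, so the Mann-Whitney branch is the one that applies, as the actual tests below show.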
# Extract sum_gamerounds for each version
sum_gamerounds_gate_30 = data[data['version'] == 'gate_30']['sum_gamerounds']
sum_gamerounds_gate_40 = data[data['version'] == 'gate_40']['sum_gamerounds']
# Perform Shapiro-Wilk test
statistic_30, p_value_30 = shapiro(sum_gamerounds_gate_30)
statistic_40, p_value_40 = shapiro(sum_gamerounds_gate_40)
# Print the results
print(f'Shapiro-Wilk Test for gamerounds(gate_30) - Statistic: {statistic_30}, p-value: {p_value_30:.5f}')
print(f'Shapiro-Wilk Test for gamerounds(gate_40) - Statistic: {statistic_40}, p-value: {p_value_40:.5f}')
Shapiro-Wilk Test for gamerounds(gate_30) - Statistic: 0.088, p-value: 0.00000
Shapiro-Wilk Test for gamerounds(gate_40) - Statistic: 0.482, p-value: 0.00000
Both tests reject, at the 5% significance level, the null hypothesis that the data follow a normal distribution. Thus, the distribution of rounds played by players is not normal, and nonparametric analysis is needed. In this case, I opt for the commonly used Mann-Whitney U test. According to the referenced article (see References), the assumptions of this test are as follows:
- The variable must be continuous, as this test ranks observations for each group.
- The data is assumed to be non-normally distributed or skewed.
- Despite the non-normal distribution assumption for both groups, it is assumed that the shapes of the distributions are similar.
- The data should consist of two randomly selected independent samples, meaning there is no relationship between these groups.
- To conduct an effective test, a sufficient sample size is required, typically with more than 5 observations per group.

H0 : Two populations are equal
H1 : Two populations are not equal
# Perform Mann-Whitney U test
statistic, p_value = mannwhitneyu(sum_gamerounds_gate_30, sum_gamerounds_gate_40)
# Print the results
print(f'Mann-Whitney U Test - Statistic: {statistic}, p-value: {p_value}')
Mann-Whitney U Test - Statistic: 1024331250.5, p-value: 0.0502
With a p-value of 0.0502, just above the 5% significance level, we do not reject the null hypothesis that the two groups are equal. In other words, there is no significant difference in the number of gamerounds played between the control and experimental groups.
A/B Testing 2: Is there a significant difference in retention rates between the control and experimental groups? (Using Z-test)

H0: p_gate30 = p_gate40 (there is no difference in retention rates between the control and experimental groups)
H1: p_gate30 ≠ p_gate40 (the retention rates of the two groups differ)
Q: Is there a significant difference in one-day retention rates?
# Convert 'retention_1' to binary (0 for False, 1 for True)
data['retention_1_binary'] = data['retention_1'].astype(int)
# Count number of successes (retention_1) and total trials for each version
r1successes_gate_30 = data[data['version'] == 'gate_30']['retention_1_binary'].sum()
r1trials_gate_30 = len(data[data['version'] == 'gate_30'])
r1successes_gate_40 = data[data['version'] == 'gate_40']['retention_1_binary'].sum()
r1trials_gate_40 = len(data[data['version'] == 'gate_40'])
# Perform z-test for proportions
r1count = np.array([r1successes_gate_30, r1successes_gate_40])
r1nobs = np.array([r1trials_gate_30, r1trials_gate_40])
r1z_score, r1p_value = proportions_ztest(r1count, r1nobs)
# Print the results
print(f'Z-score_r1: {r1z_score}, p-value_r1: {r1p_value}')
Z-score_r1: 1.7840, p-value_r1: 0.0744
Under the Z-test, the p-value is 7.44%. At the 5% significance level, we do not reject the null hypothesis of no difference in one-day retention rates. In other words, there is no significant difference in one-day retention rates between the control and experimental groups.
Q: Is there a significant difference in seven-day retention rates?
# Convert 'retention_7' to binary (0 for False, 1 for True)
data['retention_7_binary'] = data['retention_7'].astype(int)
# Count number of successes (retention_7) and total trials for each version
r7successes_gate_30 = data[data['version'] == 'gate_30']['retention_7_binary'].sum()
r7trials_gate_30 = len(data[data['version'] == 'gate_30'])
r7successes_gate_40 = data[data['version'] == 'gate_40']['retention_7_binary'].sum()
r7trials_gate_40 = len(data[data['version'] == 'gate_40'])
# Perform z-test for proportions
r7count = np.array([r7successes_gate_30, r7successes_gate_40])
r7nobs = np.array([r7trials_gate_30, r7trials_gate_40])
r7z_score, r7p_value = proportions_ztest(r7count, r7nobs)
print(f'Z-score_r7: {r7z_score}, p-value_r7: {r7p_value}')
Z-score_r7: 3.1643, p-value_r7: 0.0015
Under the Z-test, the p-value is 0.15%. At the 5% significance level, we reject the null hypothesis of no difference in seven-day retention rates. Combined with the descriptive statistics (19.02% vs. 18.02%), this means the control group (gate at level 30) has a higher seven-day retention rate than the experimental group (gate at level 40).
A/B Testing 3: Is there a significant difference in retention rates between the control and experimental groups? (Using the Bootstrapping method)

Bootstrapping is a method of resampling with replacement from an observed dataset: each draw is put back, so the same observation can appear several times in one resample. In other words, it builds an approximate sampling distribution of a statistic by repeatedly resampling a finite dataset, without assuming normality.
Once a statistic has been computed for each bootstrapped sample, you can plot the distribution of those statistics to understand the shape of your data and calculate standard deviations, variances, hypothesis tests, and confidence intervals. Since each bootstrapped sample is drawn from data that were themselves randomly sampled from the population, we can use it to make inferences about the population.
Still confused? Let me give an example (which I asked ChatGPT to implement). Suppose we have a population of 500 students’ exam grades, quantitative data ranging from 0 to 100, and I want the population to follow the right-skewed distribution commonly seen in practice: when an exam is difficult, most people score low and only a few score high. ChatGPT constructs a Beta distribution with a = 2, b = 5, giving a mean of 28.7 (shown in the blue distribution plot).
Next, I apply bootstrapping with a sample size of 30: each time, I resample from this dataset to create a new sample and calculate its mean, repeating the process 1,000 times. The mean of the bootstrapped sample means is 29.42 (shown in the green distribution plot), with a 95% confidence interval of [24.3, 34.2].
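A minimal sketch of this grade example in NumPy follows; the random seed is arbitrary, so the exact numbers will differ slightly from the figures described above.
# Sketch of the grade example: bootstrap the mean of a right-skewed population
rng = np.random.default_rng(42)  # seed is arbitrary
population = rng.beta(2, 5, size=500) * 100  # 500 grades on a 0-100 scale
boot_means = np.array([rng.choice(population, size=30, replace=True).mean()
                       for _ in range(1000)])
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
print(f'Population mean: {population.mean():.2f}')
print(f'Bootstrap mean: {boot_means.mean():.2f}, 95% CI: [{ci_low:.1f}, {ci_high:.1f}]')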

To facilitate comparison of the means of the two distributions, I overlayed the plots for comparison.

It can be seen that the means of both distributions are very close.
As the retention rates in this study are proportions, let me give a second example with proportion data (again implemented with ChatGPT’s help). Suppose we have data on whether 500 students receive tutoring. Based on my intuition, most students probably do, so say 372 students in the population receive tutoring and 128 do not; the tutoring rate is therefore 74% (shown in the left plot). Strictly speaking, this follows a Bernoulli distribution, with only two possible outcomes: success (= 1) and failure (= 0).
Next, I apply bootstrapping with a sample size of 30 and repeat the resampling 1,000 times. The resulting mean sample proportion is 74.22%, with a 95% confidence interval of [0.56, 0.9] (shown in the right plot).

It can be observed that the population proportion (74%) and the proportion estimated by bootstrapping (74.22%) are very close.
Returning to the topic, to compare whether there is a significant difference in retention rates between the control and experimental groups, I resampled 5000 times using the Bootstrapping method and obtained the estimated retention rate distribution as follows:
# Number of bootstrap samples
n_bootstrap = 5000
# Initialize empty arrays to store bootstrapped means for each A/B group
boot_1d = np.zeros((n_bootstrap, 2))
boot_7d = np.zeros((n_bootstrap, 2))
# Bootstrapping
for i in range(n_bootstrap):
    # Generate a bootstrap sample with replacement
    bootstrap_sample = data.sample(frac=1, replace=True)
    # Calculate mean retention rates for each version (sorted: gate_30, gate_40)
    boot_mean_1 = bootstrap_sample.groupby('version')['retention_1'].mean().values
    boot_mean_7 = bootstrap_sample.groupby('version')['retention_7'].mean().values
    # Store bootstrapped means in arrays
    boot_1d[i] = boot_mean_1
    boot_7d[i] = boot_mean_7
# Plot Kernel Density Estimate of the bootstrap distributions
fig, (ax1, ax2) = plt.subplots(1, 2, sharey=True, figsize=(13, 5))
sns.kdeplot(boot_1d[:, 0], color='lightcoral', ax=ax1, label="Gate 30")
sns.kdeplot(boot_1d[:, 1], color='lightblue', ax=ax1, label="Gate 40")
sns.kdeplot(boot_7d[:, 0], color='lightcoral', ax=ax2, label="Gate 30")
sns.kdeplot(boot_7d[:, 1], color='lightblue', ax=ax2, label="Gate 40")
ax1.set_xlabel("Retention rate", size=12)
ax1.set_ylabel("Density", size=12)
ax1.set_title("1 day retention rate distribution", fontweight="bold", size=14)
ax2.set_xlabel("Retention rate", size=12)
ax2.set_title("7 days retention rate distribution", fontweight="bold", size=14)
# Add legend
ax1.legend()
ax2.legend()
plt.show()

Above, we can see that the bootstrapped retention-rate distributions sit higher for gate_30 (gate placed at level 30) for both metrics. But is this difference statistically significant? I’ve visualized their difference in the following plot:
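The code for this difference plot is not shown above; what follows is a minimal sketch of how the differences and the probabilities quoted below can be derived from boot_1d and boot_7d (the plot styling is my own choice):
# Difference in bootstrapped retention rates: gate_30 minus gate_40
diff_1d = boot_1d[:, 0] - boot_1d[:, 1]
diff_7d = boot_7d[:, 0] - boot_7d[:, 1]
# Share of resamples in which gate_30 beats gate_40
print(f'P(diff > 0), 1-day retention: {(diff_1d > 0).mean():.1%}')
print(f'P(diff > 0), 7-day retention: {(diff_7d > 0).mean():.1%}')
# Plot the two difference distributions side by side
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(13, 5))
sns.kdeplot(diff_1d, fill=True, ax=ax1)
ax1.axvline(0, color='gray', linestyle='--')
ax1.set_title('Difference in 1-day retention (gate_30 - gate_40)')
sns.kdeplot(diff_7d, fill=True, ax=ax2)
ax2.axvline(0, color='gray', linestyle='--')
ax2.set_title('Difference in 7-day retention (gate_30 - gate_40)')
plt.show()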

We can see that the probability of the difference being 0 (no difference between the two groups) is very low.
In the plot for the one-day retention rate on the left, the proportion of bootstrap resamples in which gate_30 exceeds gate_40 is 96.6%, so only 3.4% of resamples show the opposite, which falls within the rejection region at the 5% significance level. Therefore, I conclude that placing the gate at level 30 leads to a higher one-day retention rate.
In the plot for the seven-day retention rate on the right, the proportion of resamples in which gate_30 exceeds gate_40 is 99.9%, again falling within the rejection region at the 5% significance level. Hence, I conclude that placing the gate at level 30 leads to a higher seven-day retention rate.
Conclusion
This study used a nonparametric method, the Mann-Whitney U test, and found no significant difference in the number of rounds played between gate_30 and gate_40.
To compare retention rates, both the Z-test and bootstrapping were employed. The Z-test indicated no significant difference in one-day retention rates, while the seven-day retention rate was significantly higher for gate_30.
Bootstrapping found that both the one-day and seven-day retention rates were significantly higher for gate_30. The conclusion drawn is therefore that placing the gate at level 30 leads to higher retention rates. One interpretation is that players may be more satisfied when they get to take breaks during mobile game sessions.
However, this seemingly clean conclusion harbors a caveat. With very large samples, even tiny effects can be statistically significant, leading to results that are ‘statistically significant, but not practically significant.’ For instance, a statistically significant weight loss of 30 g from taking a diet pill may be meaningless to consumers.
Returning to the topic, the bootstrapping-estimated one-day retention rates are as follows:
Gate 30: 44.82%, Gate 40: 44.23%
The bootstrapping-estimated seven-day retention rates are as follows:
Gate 30: 19.01%, Gate 40: 18.21%
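To put the size of these differences in perspective, here is a minimal sketch of the absolute and relative lift in seven-day retention, using the bootstrap estimates quoted above:
# Absolute and relative lift in 7-day retention, gate_30 vs gate_40
gate_30_r7, gate_40_r7 = 0.1901, 0.1821
abs_lift = gate_30_r7 - gate_40_r7
rel_lift = abs_lift / gate_40_r7
print(f'Absolute lift: {abs_lift:.2%}, relative lift: {rel_lift:.1%}')
The absolute lift is under one percentage point, which is exactly why the question of practical significance above matters.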
The evidence suggests that placing the gate at level 30 leads to higher retention rates. However, how much revenue can this slight difference in retention rates actually generate for the company? I believe this is what companies are most concerned about. Therefore, having access to additional data such as player spending behavior could lead to more insightful analyses.
References
- Master’s in Data Science, “What Is Bootstrapping?”
- Technology Networks, “Mann-Whitney U Test: Assumptions and Example”
- GitHub, @Roberto Aguilar, “A/B Testing Study on the Impact of Time Gate Positioning in Cookie Cats”