Predicting Recurrence in Thyroid Cancer with Machine Learning Models (Python)
This is my final presentation for the course ‘Introduction to Big Data Analytics’, which I took during my graduate studies. The analysis is based on data from 383 real-world thyroid cancer patients, and I addressed class imbalance through undersampling. Several models were built in Python to predict patient recurrence. Among these, XGBoost performed best, with an AUC of 0.97 under k-fold cross-validation. In the final feature importance analysis, all models identified "response to cancer treatment" as the most significant feature; however, this may be an overly strong predictor.
Data Resource
The Differentiated Thyroid Cancer Recurrence dataset comes from the UC Irvine Machine Learning Repository; it was released in October 2023 and has not yet been widely used.
The dataset consists of 383 observations of thyroid cancer patients with 13 clinicopathological features, all of which are categorical variables (a minimal loading sketch follows the list below).
- ‘Thyroid Function’: Normal, Hyperthyroidism or Hypothyroidism.
- ‘Adenopathy’: No lymph node enlargement, Enlargement on the right side, Enlargement on the left side, Enlargement on both sides, Enlargement in the posterior area, etc.
- ‘Pathology’: Papillary carcinoma cells, Follicular carcinoma cells, Hurthle cells, etc.
- ‘T’: Abbreviation for Tumor, categorized into 7 grades based on the size and extent of the tumor.
- ‘N’: Abbreviation for Node, categorized into 3 grades based on the severity of lymph node metastasis.
- ‘M’: Abbreviation for Metastasis, divided into 2 categories: ‘No distant metastasis’ and ‘Distant metastasis’.
- ‘Response’: After cancer treatment, patients may show an ‘Excellent response’, an ‘Indeterminate response’, or one of the other response categories.
- Finally, the outcome variable is ‘Recurred’, with two values: ‘Recurrence’ and ‘Non-recurrence’.
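For readers who want to follow along, a minimal loading step might look like the sketch below (the file name is hypothetical; the CSV can be downloaded from the UCI repository page):

import pandas as pd

# Hypothetical file name; download the CSV from the UCI repository page first
data = pd.read_csv('Thyroid_Diff.csv')
print(data.shape)
print(data['Recurred'].value_counts())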
Univariate Analysis
After obtaining the data, it is advisable to start with a simple exploratory data analysis (EDA) to quickly assess the distribution of the data and the relationships between variables; a short pandas sketch for these summaries follows the list below.
- Age: a right-skewed distribution with a peak between 30 and 35 years old.
- Gender: female predominance (81.5%).
- Cancer stage: patients in stage one account for 86%.
- Response to cancer treatment: 54% of patients show an excellent response after receiving cancer treatment.
- Recurrence status: ultimately, 72% of patients show no recurrence.
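A few lines of pandas and matplotlib are enough to reproduce these univariate summaries (a sketch, assuming the data frame loaded above):

import matplotlib.pyplot as plt

# Distribution of the only continuous variable
data['Age'].plot(kind='hist', bins=20, title='Age distribution')
plt.show()

# Frequency tables for selected categorical variables
for col in ['Gender', 'Stage', 'Response', 'Recurred']:
    print(data[col].value_counts(normalize=True).round(3), '\n')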
Bivariate Analysis
The heatmap above shows the correlation coefficients. In the bottom row, we can observe the five variables most correlated with the target variable ‘Recurred’ (a sketch of how such a heatmap can be generated follows the list):
- Response: 0.76
- Risk: 0.73
- N (Node): 0.63
- Adenopathy: 0.61
- T (Tumor): 0.56
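Because all predictors are categorical, they must be label-encoded before a numeric correlation matrix can be computed. A minimal sketch (the exact coefficients depend on the encoding used):

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Label-encode every categorical column so correlations can be computed
encoded = data.apply(lambda col: pd.factorize(col)[0] if col.dtype == 'object' else col)

plt.figure(figsize=(12, 10))
sns.heatmap(encoded.corr(), annot=True, fmt='.2f', cmap='coolwarm')
plt.show()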
Feature Selection
Not all the features need to be included in the model. Including too many features can lead to overfitting or introduce noise that interferes with the model. Therefore, irrelevant variables are removed in this process.
After fitting a Linear Probability Model (LPM) and a Logit model of the target variable on ‘Thyroid Function’ and ‘Physical Examination’, the p-values of the corresponding dummy variables did not reach conventional significance levels, so these variables were judged not to be directly related to recurrence (a sketch of this screening step follows).
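A minimal sketch of this screening step with statsmodels (the column and outcome labels are assumptions based on the dataset description; an LPM would use sm.OLS in the same way):

import pandas as pd
import statsmodels.api as sm

# Dummy-encode one candidate feature and regress the outcome on it
dummies = pd.get_dummies(data['Thyroid Function'], drop_first=True).astype(float)
target = (data['Recurred'] == 'Yes').astype(int)   # assuming 'Yes'/'No' coding

logit_model = sm.Logit(target, sm.add_constant(dummies)).fit(disp=0)
print(logit_model.summary())   # inspect the p-value of each dummy variable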
In addition, there is a variable named ‘Risk’ (see figure below), which I interpreted as representing the risk of recurrence. Using ‘Risk’ to predict ‘Recurred’ would inevitably yield very impressive results. This is something our professor always emphasizes in class: when the model looks too good, you need to be extremely cautious, because you might be predicting Y with Y. Therefore, it was also removed from the model.
Additionally, there is a feature in the graph called ‘Stage’ (cancer staging), ranging from stage one to stage four. Cancer staging, however, is determined by the grades of T, N, and M, which together constitute the TNM Staging System commonly used in medical practice.
One may think of ‘Stage’ as a weighted summary of T, N, and M, but only one of the two should be kept. ‘Stage’ merely sorts cancer patients into four levels, whereas the TNM Staging System provides a more detailed classification.
For instance (see figure above), all patients are categorized as stage one, but upon closer inspection, their T (tumor size), N (extent of lymph node involvement), and M (presence of distant metastasis) vary. Therefore, it is advisable to exclude the coarser ‘Stage’ variable and keep T, N, and M.
Data Processing
Since there are only 108 individuals who experienced recurrence, accounting for 28% of the entire sample (383 individuals), this dataset is characterized by class imbalance.
To avoid models predicting only ‘no recurrence,’ we employed the Random Undersampling method. After randomly removing data, there are now 108 instances of both ‘no recurrence’ and ‘recurrence,’ resulting in a total of 216 instances. Out of these, 80% are allocated for training (172 instances), and 20% for testing (44 instances).
Additionally, for feature transformation, the categorical variables were encoded using one-hot encoding. A sketch of the undersampling step is shown below; the one-hot encoding itself appears in the modeling code.
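The undersampling can be done directly with pandas (a sketch, assuming the outcome is coded 'Yes'/'No'; imbalanced-learn's RandomUnderSampler is an equivalent alternative):

# Randomly downsample the majority class to the size of the minority class
recurred = data[data['Recurred'] == 'Yes']
not_recurred = data[data['Recurred'] == 'No'].sample(n=len(recurred), random_state=42)
balanced_data = pd.concat([recurred, not_recurred]).sample(frac=1, random_state=42)
print(balanced_data['Recurred'].value_counts())   # 108 / 108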
Model Prediction
Logistic Regression
Logistic regression can be used not only for classification but also for variable interpretation. The results are as follows:
Age is a continuous variable, so each one-year increase in age multiplies the odds of recurrence by e^0.078 ≈ 1.08.
In the “Response” variable, the results listed are for Excellent Response, with Indeterminate Response as the baseline. This means that patients with an excellent response to cancer treatment have a lower probability of recurrence compared to those with an indeterminate response.
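As a quick arithmetic check, the coefficient-to-odds-ratio conversion is a single exponential:

import numpy as np

# Converting the Age coefficient above into an odds ratio
age_coef = 0.078
print(round(np.exp(age_coef), 2))   # 1.08: each extra year multiplies the odds by ~1.08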
Next, let’s proceed with prediction. The Python code for logistic regression and the ROC curve results are as follows:
# Required imports (scikit-learn, pandas, numpy)
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, KFold, cross_validate, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (precision_score, recall_score, f1_score, accuracy_score,
                             confusion_matrix, roc_auc_score, make_scorer)

# 'Recurred' is the target column
selected_cols = ['Gender', 'Smoking', 'Hx Smoking', 'Hx Radiothreapy', 'Adenopathy',
                 'Pathology', 'Focality', 'T', 'N', 'M', 'Response', 'Recurred']
# Keep only the selected columns from the undersampled data
data_selected = balanced_data[selected_cols]
# Split into features and target, one-hot encoding the categorical features
X = pd.get_dummies(data_selected.drop('Recurred', axis=1), drop_first=True)
# Encode the target as 0/1 (assuming it is stored as 'No'/'Yes')
y = data_selected['Recurred'].map({'No': 0, 'Yes': 1})
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Initialize and fit the logistic regression model
logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)
# Predict on the training and testing sets
y_pred_train = logreg.predict(X_train)
y_pred_test = logreg.predict(X_test)
# Evaluate the model
def evaluate_model(y_true, y_pred):
    precision = precision_score(y_true, y_pred)
    recall = recall_score(y_true, y_pred)
    f1 = f1_score(y_true, y_pred)
    accuracy = accuracy_score(y_true, y_pred)
    cm = confusion_matrix(y_true, y_pred)
    return precision, recall, f1, accuracy, cm
# Evaluate on training set
precision_train, recall_train, f1_train, accuracy_train, cm_train = evaluate_model(y_train, y_pred_train)
y_pred_proba_train = logreg.predict_proba(X_train)[:, 1]
auc_train = roc_auc_score(y_train, y_pred_proba_train)
# Evaluate on testing set
precision_test, recall_test, f1_test, accuracy_test, cm_test = evaluate_model(y_test, y_pred_test)
# Calculate AUC for testing set
y_pred_proba_test = logreg.predict_proba(X_test)[:, 1]
auc_test = roc_auc_score(y_test, y_pred_proba_test)
# Feature importance
feature_importance = pd.DataFrame({'Feature': X_train.columns, 'Importance': logreg.coef_[0]})
feature_importance.sort_values(by='Importance', ascending=False, inplace=True)
print("Training Set Metrics:")
print(f"AUC for Training Set: {auc_train:.4f}")
print(f"Precision: {precision_train}")
print(f"Recall: {recall_train}")
print(f"F1 Score: {f1_train}")
print(f"Accuracy: {accuracy_train}")
print(f"Confusion Matrix:\n{cm_train}")
print("\nTesting Set Metrics:")
print(f"Precision: {precision_test}")
print(f"AUC for Testing Set: {auc_test:.4f}")
print(f"Recall: {recall_test}")
print(f"F1 Score: {f1_test}")
print(f"Accuracy: {accuracy_test}")
print(f"Confusion Matrix:\n{cm_test}")
print("\nFeature Importance:")
print(feature_importance)
# K-fold cross-validation
# Define the metrics you want to evaluate
scoring = {
    'precision': make_scorer(precision_score),
    'recall': make_scorer(recall_score),
    'f1_score': make_scorer(f1_score),
    'accuracy': make_scorer(accuracy_score)
}
# Define the k-fold cross-validation
kf = KFold(n_splits=10, shuffle=True, random_state=42)
# Perform cross-validation and compute scores for each metric
cv_results = cross_validate(logreg, X, y, cv=kf, scoring=scoring)
# Calculate AUC across k-fold cross-validation
auc_scores = cross_val_score(logreg, X, y, cv=kf, scoring='roc_auc')
mean_auc_cv = np.mean(auc_scores)
std_auc_cv = np.std(auc_scores)
# Calculate mean scores for each metric
mean_precision = np.mean(cv_results['test_precision'])
mean_recall = np.mean(cv_results['test_recall'])
mean_f1_score = np.mean(cv_results['test_f1_score'])
mean_accuracy = np.mean(cv_results['test_accuracy'])
# Print the mean scores
print("\nK-fold cross validation:")
print(f"Mean AUC across k-fold CV: {mean_auc_cv:.4f} ± {std_auc_cv:.4f}")
print(f"Mean Precision: {mean_precision:.2f}")
print(f"Mean Recall: {mean_recall:.2f}")
print(f"Mean F1 Score: {mean_f1_score:.2f}")
print(f"Mean Accuracy: {mean_accuracy:.2f}")
The average k-fold (k=10) cross-validation AUC is 0.97.
SVM (Support Vector Machine)
SVM is a supervised learning method that finds a decision boundary maximizing the margin between the two classes.
SVM has two hyperparameters to adjust: Gamma and C.
Gamma = How far the influence of a single training example reaches.
If gamma is small, the influence range of data points is relatively far, and even distant data points have influence on the hyperplane. In other words, a low gamma value means that the model considers a broader range of points to define the decision boundary, thus outlining a smoother decision boundary.
If gamma is large, the influence range of data points is relatively close, and for two points to be considered as belonging to the same group, they need to be very close. A high gamma value leads the model to focus more on the details near each data point, resulting in more complex and irregular decision boundaries, which can lead to overfitting.
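The role of gamma is easiest to see in the RBF kernel itself; a minimal sketch:

import numpy as np

def rbf_kernel(x1, x2, gamma):
    # Similarity decays with squared distance; a large gamma makes it decay
    # quickly (local influence), a small gamma makes it decay slowly (broad influence)
    return np.exp(-gamma * np.sum((x1 - x2) ** 2))

a, b = np.array([0.0, 0.0]), np.array([1.0, 1.0])
print(rbf_kernel(a, b, gamma=0.1))   # ~0.82: distant points are still 'similar'
print(rbf_kernel(a, b, gamma=10))    # ~2e-9: influence vanishes almost immediately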
C = How much you want to punish your model for each misclassified point.
A larger C imposes a greater penalty on misclassifications, demanding the model to fit the training data more closely, hence minimizing classification errors. However, this may lead to overfitting, reducing the model’s ability to generalize.
A smaller C results in a softer margin, allowing for more classification errors. This may lead to underfitting, but it can improve the model’s generalization ability.
The combination and adjustment of the hyperparameters gamma and C affect the model’s fitting degree. As shown in the figure below, if gamma and C are too large (as seen in the bottom right corner), it can lead to SVM overfitting.
How to choose the best Gamma and C: Grid Search
# Imports for the SVM section
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.inspection import permutation_importance

# Base SVM model with an RBF kernel (GridSearchCV fits it during the search)
svm_model = SVC(probability=True, kernel='rbf')
# Define the parameters for tuning
param_grid = {
    'C': [0.01, 0.1, 1, 10, 100],  # Range of regularization parameter C
    'gamma': [0.1, 1, 10]          # Range of gamma values
}
# Perform grid search with cross-validation
grid_search = GridSearchCV(svm_model, param_grid, cv=10, scoring='accuracy')
# Fit the model for parameter tuning
grid_search.fit(X, y)
# Get the best parameters and the best score
best_params = grid_search.best_params_
best_score = grid_search.best_score_
print("Best Parameters:", best_params)
print("Best Score (Accuracy):", best_score)
In the cross-validation results, accuracy is best when C=1 and gamma=0.1, as shown below:
Best Parameters: {'C': 1, 'gamma': 0.1}
Best Score (Accuracy): 0.9445887445887446
Next, let’s proceed with prediction. The SVM Python code and ROC Curve results are as follows:
# Train SVM model
svm_model = SVC(probability=True, C=1, kernel='rbf', gamma=0.1)
svm_model.fit(X_train, y_train)
# Predict on training and testing data
y_pred_train_svm = svm_model.predict(X_train)
y_pred_test_svm = svm_model.predict(X_test)
# Predict probabilities for AUC calculation
y_pred_proba_train_svm = svm_model.predict_proba(X_train)[:, 1]
y_pred_proba_test_svm = svm_model.predict_proba(X_test)[:, 1]
# Evaluate the model
def evaluate_model_svm(y_true, y_pred):
    precision = precision_score(y_true, y_pred)
    recall = recall_score(y_true, y_pred)
    f1 = f1_score(y_true, y_pred)
    accuracy = accuracy_score(y_true, y_pred)
    cm = confusion_matrix(y_true, y_pred)
    return precision, recall, f1, accuracy, cm
# Evaluate on training set
precision_train_svm, recall_train_svm, f1_train_svm, accuracy_train_svm, cm_train_svm = evaluate_model_svm(y_train, y_pred_train_svm)
# Evaluate on testing set
precision_test_svm, recall_test_svm, f1_test_svm, accuracy_test_svm, cm_test_svm = evaluate_model_svm(y_test, y_pred_test_svm)
# Calculate AUC for training set
auc_train_svm = roc_auc_score(y_train, y_pred_proba_train_svm)
# Calculate AUC for testing set
auc_test_svm = roc_auc_score(y_test, y_pred_proba_test_svm)
print("Support Vector Machine Metrics:")
print(f"AUC for Training Set: {auc_train_svm:.4f}")
print(f"Precision: {precision_train_svm}")
print(f"Recall: {recall_train_svm}")
print(f"F1 Score: {f1_train_svm}")
print(f"Accuracy: {accuracy_train_svm}")
print(f"Confusion Matrix:\n{cm_train_svm}")
print("\nTesting Set Metrics:")
print(f"AUC for Testing Set: {auc_test_svm:.4f}")
print(f"Precision: {precision_test_svm}")
print(f"Recall: {recall_test_svm}")
print(f"F1 Score: {f1_test_svm}")
print(f"Accuracy: {accuracy_test_svm}")
print(f"Confusion Matrix:\n{cm_test_svm}")
# Feature importance
perm_importance = permutation_importance(svm_model, X_test, y_test, n_repeats=30, random_state=42)
# Get feature importance scores
importance_scores = perm_importance.importances_mean
# Create a DataFrame for feature importance
feature_importance = pd.DataFrame({'Feature': X_train.columns, 'Importance': importance_scores})
feature_importance.sort_values(by='Importance', ascending=False, inplace=True)
print("\nFeature Importance:")
print(feature_importance)
# K-fold cross-validation for SVM
cv_results_svm = cross_validate(svm_model, X, y, cv=kf, scoring=scoring)
# Calculate mean AUC across k-fold cross-validation for SVM
auc_scores_svm = cross_val_score(svm_model, X, y, cv=kf, scoring='roc_auc')
mean_auc_cv_svm = np.mean(auc_scores_svm)
std_auc_cv_svm = np.std(auc_scores_svm)
# Calculate mean scores for each metric
mean_precision_svm = np.mean(cv_results_svm['test_precision'])
mean_recall_svm = np.mean(cv_results_svm['test_recall'])
mean_f1_score_svm = np.mean(cv_results_svm['test_f1_score'])
mean_accuracy_svm = np.mean(cv_results_svm['test_accuracy'])
# Print the mean scores for SVM
print("\nSVM K-fold cross validation:")
print(f"Mean AUC: {mean_auc_cv_svm:.4f} ± {std_auc_cv_svm:.4f}")
print(f"Mean Precision: {mean_precision_svm:.2f}")
print(f"Mean Recall: {mean_recall_svm:.2f}")
print(f"Mean F1 Score: {mean_f1_score_svm:.2f}")
print(f"Mean Accuracy: {mean_accuracy_svm:.2f}")
The average k-fold (k=10) cross-validation AUC is 0.95.
Random Forest
Random Forest builds multiple independent decision trees, each trained on a random sample of the training data and a random subset of the features; the final prediction is determined by a voting mechanism (mode for classification, average for regression). A sketch of how it plugs into the same pipeline is shown below, followed by the ROC Curve results:
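A minimal sketch (not necessarily the exact settings used), reusing the balanced features X, y, the train/test split, and the k-fold splitter kf defined earlier; n_estimators is an illustrative choice:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rf_model = RandomForestClassifier(n_estimators=200, random_state=42)
rf_model.fit(X_train, y_train)

# AUC across the same 10-fold cross-validation used above
rf_auc_scores = cross_val_score(rf_model, X, y, cv=kf, scoring='roc_auc')
print(f"Random Forest mean CV AUC: {rf_auc_scores.mean():.2f} ± {rf_auc_scores.std():.2f}")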
The average k-fold (k=10) cross-validation AUC is 0.98.
XGBoost (Extreme Gradient Boosting)
The XGBoost algorithm works by repeatedly adding trees, each grown by splitting on features; every new tree learns a function that fits the residual of the previous prediction.
XGBoost is defined by its author as:
"a scalable machine learning system for tree boosting."
The key here is ‘boosting’, which, like bagging, is an ensemble learning method. Boosting starts from the observation that earlier classifiers misclassify some data points in the training set; if we simply keep training on all the data in the same way, those misclassifications tend to persist. Instead, the learner focuses on the misclassified points, so that the new classifiers trained on them produce better results for these hard cases.
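A minimal sketch (assuming the xgboost package and the same X, y, X_train, y_train, and kf as above; the hyperparameters are illustrative):

from xgboost import XGBClassifier
from sklearn.model_selection import cross_val_score

xgb_model = XGBClassifier(n_estimators=200, learning_rate=0.1, eval_metric='logloss')
xgb_model.fit(X_train, y_train)

xgb_auc_scores = cross_val_score(xgb_model, X, y, cv=kf, scoring='roc_auc')
print(f"XGBoost mean CV AUC: {xgb_auc_scores.mean():.2f} ± {xgb_auc_scores.std():.2f}")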
Below are the ROC Curve results:
The average k-fold (k=10) cross-validation AUC is 0.97.
Conclusion
This study utilized data from 383 thyroid cancer patients and excluded features such as ‘thyroid function,’ ‘physical examination,’ ‘risk of recurrence,’ and ‘cancer stage’.
The categories were balanced using random undersampling, resulting in a final sample of 216 records, with 108 each for recurrence and non-recurrence.
Under 10-fold cross-validation, both logistic regression and the machine learning models achieved AUC values ranging from 0.93 to 0.97, with XGBoost performing best. However, these unusually strong predictive results can be attributed to a very strong predictor variable, as explained below.
Feature Importance
In the ranking of feature importance, ‘Response to cancer treatment’ consistently ranks first. Why?
According to an article titled ‘Interpretation of the Treatment Response Evaluation System for Differentiated Thyroid Cancer’ from Peking Union Medical College Hospital in Beijing, patients with an excellent response to cancer treatment have achieved clinical remission, with a recurrence rate of only 1% to 4%.
On the other hand, patients with a structurally incomplete response will continue to have, or develop new, metastases in the neck. Without treatment, 50% to 85% will remain in a state of persistent disease, with a 50% probability of distant metastasis.
In our dataset of 383 samples, there were 208 patients with an excellent response, among whom only 1 experienced recurrence. On the contrary, there were 91 patients with a structurally incomplete response, among whom 89 experienced recurrence.
Such an extreme distribution shows that this feature is an excessively strong predictor, which makes the models' apparent predictive ability misleadingly high.
I remember during my final presentation, the professor immediately pointed out that the model was ‘too good to be true.’ In other words, in such cases, there’s no need to use machine learning methods; simply running logistic regression would yield excellent results.
Additionally, the professor mentioned that when dealing with imbalanced data, undersampling discards samples. Instead, weighting methods should be considered, where different weights are assigned to the majority and minority classes to achieve balance (a sketch follows).
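With scikit-learn, for example, this reweighting can be done without discarding any of the 383 patients (a sketch, not the code used in this project; X_full and y_full stand for the encoded features and target of the full, non-undersampled data):

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# class_weight='balanced' reweights classes inversely to their frequencies
weighted_logreg = LogisticRegression(max_iter=1000, class_weight='balanced')
weighted_scores = cross_val_score(weighted_logreg, X_full, y_full, cv=10, scoring='roc_auc')
print(f"Weighted logistic regression mean CV AUC: {weighted_scores.mean():.2f}")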
I noticed that others often encounter the problem of overly optimistic models as well. This is a common misconception when using machine learning — thinking that fancy models are always better, which is not necessarily true.
If such issues arise, it suggests the presence of excessively strong predictors. In such cases, a simple classifier with about three or four layers of if-else statements would suffice (see the sketch below); machine learning may not be suitable for the dataset at hand, so it is better to try another dataset.
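A hypothetical illustration of such a rule-based classifier, built around the dominant 'Response' feature (the category labels are assumptions, not a validated clinical rule):

def simple_rule(patient):
    # Excellent responders almost never recur in this dataset
    if patient['Response'] == 'Excellent':
        return 'No recurrence'
    # Structural incomplete responders almost always recur
    if patient['Response'] == 'Structural Incomplete':
        return 'Recurrence'
    # For the intermediate responses, fall back to lymph node involvement
    return 'Recurrence' if patient['N'] != 'N0' else 'No recurrence'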
References
- Xin-Hui Tan (2023), "Detecting Intradialytic Hypotension Using Machine Learning", Master's thesis, National Taiwan University, CSIE.
- Nan Miles Xi (2022), "Improving the diagnosis of thyroid cancer by machine learning and clinical data", Scientific Reports.
- scikit-learn, "RBF SVM parameters".
- Visually Explained, "The Kernel Trick in Support Vector Machine (SVM)".