Introduction
Hello everyone! Today I would like to share some clinical and research notes written by a radiation oncologist who often posts valuable content on X. Radiation oncology really does produce a lot of note-taking enthusiasts, and this one genuinely treats X as his notebook. Most of the time no one is obligated to teach you, but you still have to learn; and mastering something well takes both a broad perspective and teaching it to others. If you need a Word or PDF version, reply “Broad Perspective” to this public account.
Having multiple skills is always an advantage, and I highly recommend that everyone learn about machine learning prediction models. Sharing what you learn deepens your own understanding.
The job of an assistant professor is like an iceberg. What you see is just the tip, while the bulk of effort & struggle remains hidden beneath the surface. Do you agree?
What we see:
- Lecturing with poise
- Enviable long holidays
- Prestigious publications
- Innovative research projects
But that is just the tip of the iceberg. Beneath the surface there is a hidden world:
- Endless marking
- The sting of rejections
- Juggling multiple projects
- The constant chase for grants
- Late-night committee meetings
- One-to-one student supervision
- The relentless search for funding
- Meticulous peer-review processes
- Course development and refinement
- Administrative tasks that pile up, unnoticed
- Community outreach that stretches beyond the campus
- Hours of teaching preparation, far from the lectern’s spotlight
Overview of mlr3

mlr3 is an object-oriented, extensible machine learning framework focused on regression, classification, survival analysis, and other machine learning tasks. It is the successor to mlr and provides efficient model building and comparison.
Some key features of mlr3 include:
- Object-oriented design: Implemented in R6 for a clean object-oriented design.
- Optimized data operations: Uses data.table for faster and more convenient data operations.
- Unified containers and results: Returns results in data.table format, simplifying the API.
- Defensive programming and type safety: user input is validated with checkmate, and mechanisms in base R that can silently lose information are avoided.
- Reduced dependencies: The number of dependencies for mlr3 has been greatly reduced for easier maintenance.
- Support for a wide range of algorithms: The mlr3verse currently includes 138 learners, covering the binary classification and regression tasks commonly used in medicine as well as unsupervised clustering and survival analysis. (https://mlr3extralearners.mlr-org.com/articles/learners/list_learners.html)
- mlr3 graph flow (pipeline) system: mlr3 views the entire machine learning process as a graph, where each node represents an operation such as a feature engineering step, a learner, copying, branching, or merging. Data flows along these nodes to form a complete machine learning workflow.
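As a quick illustration of this unified, object-oriented API, here is a minimal sketch of a full train/predict/evaluate cycle (not part of the original announcement; it assumes the mlr3 and rpart packages are installed):

```r
library(mlr3)

task    = tsk("sonar")                                   # built-in binary classification task
learner = lrn("classif.rpart", predict_type = "prob")    # decision tree learner

split = partition(task, ratio = 0.7)                     # simple train/test split
learner$train(task, row_ids = split$train)
prediction = learner$predict(task, row_ids = split$test)

prediction$score(msr("classif.auc"))                     # evaluate with AUC
```

The same task + learner + measure pattern carries through every later part of the course.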
Course Objectives
- Compared with learning individual algorithm packages one by one, a framework lets users master the use of many algorithms quickly, which makes benchmarking and model selection much easier.
- Learning within a single system makes it easier to extend to downstream questions; for example, the model-interpretability tools in the mlr3 ecosystem connect seamlessly with DALEX.
- Most existing courses are short-term trainings, and a machine learning paper also requires many other foundational statistics and charting skills; no comprehensive course currently addresses all of this. The purpose of this course is therefore not only to build a solid machine learning foundation but also to bridge into model-interpretability research, so that students can move on seamlessly to advanced interpretability tools such as DALEX.
Instructors
1. Flexible Fatty
PhD in Oncology from a “double first-class” university, currently working at one of the top five cancer centers in China. His research focuses on real-world studies, bioinformatics analysis, and artificial intelligence. He has published more than ten SCI papers as first or co-first author, with a cumulative impact factor of 50+, and currently collaborates on research with several universities and hospitals in China. Together with the translation team, he produced the first complete Chinese translation of the mlr3 book and published it on a public account.
2. Rio
MD and practicing clinician. Has published more than ten articles in Chinese and English. R and Python enthusiast. Translated approximately 50,000 words of the mlr3 book.
Course Directory and Schedule
Part One: Basics of R Language and Tidyverse System
- 1. The necessity of learning R and preparation (environment setup and package installation)
- 2. General requirements for tidy data and data organization (tidydata)
- 3. Basics of R language (1) – One-dimensional variables
- 4. Basics of R language (2) – Two-dimensional and high-dimensional variables
- 5. General usage of functions and solutions to errors
- 6. Introduction to tidyverse system
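To give a flavour of the tidyverse style covered in Part One, here is a minimal sketch using only a base dataset (an illustration, not the course's own material):

```r
library(tidyverse)

# Tidy-style data manipulation: one row per observation, one column per variable
mtcars %>%
  as_tibble(rownames = "car") %>%
  filter(cyl %in% c(4, 6)) %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg), n = n(), .groups = "drop")
```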
Part Two: Traditional Clinical Basic Statistical Chart Production
- 1. Quick production of baseline tables between groups and statistical considerations
- 2. Batch implementation of univariate analysis and statistical considerations
- 3. Summary of methods for selecting cut-off values for continuous variables (including survival data)
- 4. Application and quick implementation of directed acyclic graphs
- 5. Batch implementation of multivariate analysis and sensitivity analysis for adjusting covariates
- 6. Some methods for identifying key factors (P-value method, machine learning methods, effect size change method, etc.)
- 7. Organizing survival data and conventional survival analysis methods (Kaplan-Meier, Cox regression, survival curves, and cumulative risk curve plotting)
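For orientation, the conventional survival analyses listed above are typically run with the standard survival package; a minimal sketch on its built-in lung dataset (the course's own examples and packages may differ):

```r
library(survival)

# Kaplan-Meier estimate by group
km_fit = survfit(Surv(time, status) ~ sex, data = lung)
summary(km_fit, times = c(180, 365))                 # survival at roughly 6 and 12 months

# Multivariable Cox regression
cox_fit = coxph(Surv(time, status) ~ age + sex + ph.ecog, data = lung)
summary(cox_fit)                                     # hazard ratios with 95% confidence intervals
```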
Part Three: Overview of mlr3 Basics
- 1. How can mlr3 help us?
- 2. Machine learning in R language
- 3. Introduction to mlr3 package and DALEX package
- 4. Overview of DALEX package and model-agnostic interpretability solutions
- 5. Installing and loading mlr3 and DALEX, DALEXtra packages
- 6. Basic knowledge of mlr3 – sugar functions
- 7. Basic knowledge of mlr3 – graph
- 8. Principles for resolving encountered errors
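Installation and loading (item 5 above) amounts to a few lines; a minimal sketch, assuming the packages are installed from CRAN:

```r
# Run once: mlr3verse bundles mlr3, mlr3learners, mlr3pipelines, mlr3tuning, ...
install.packages(c("mlr3verse", "DALEX", "DALEXtra"))

# Load at the start of each analysis session
library(mlr3verse)
library(DALEX)
library(DALEXtra)
```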
Part Four: Initial Exploration of mlr3 Overall Process and Detailed Tasks and Learners
- 1. Tasks – classification and regression tasks: built-in example tasks, creating tasks from external data, and task attributes and methods
- 2. Learners – categories of learners and their attributes and methods
- 3. Preliminary introduction to evaluation
- 4. Introduction to commonly used learners – logistic regression; linear regression; decision trees; random forests; support vector machines; XGBoost; K-nearest neighbors; K-means clustering; neural networks; Cox regression for survival analysis; deep learning survival analysis (DeepSurv); deep learning survival analysis (DeepHit); naive Bayes
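A minimal sketch of creating a task from your own data and constructing learners by their dictionary keys (the iris data merely stands in for an external dataset; classif.rpart and classif.ranger come from mlr3 and mlr3learners):

```r
library(mlr3verse)

# Build a classification task from an ordinary data.frame
task = as_task_classif(iris, target = "Species")
task$nrow                 # number of rows
task$feature_names        # feature columns

# Construct learners by key; predict_type = "prob" gives class probabilities
lrn_tree = lrn("classif.rpart",  predict_type = "prob")   # decision tree
lrn_rf   = lrn("classif.ranger", predict_type = "prob")   # random forest (needs the ranger package)
lrn_tree$train(task)
```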
Part Five: Evaluation, Resampling, and Benchmarking
- 1. Resampling strategies: leave-one-out, cross-validation, bootstrap resampling, and subsampling
- 2. Attributes and usage of resampling objects
- 3. Benchmarking
- 4. Detailed evaluation – commonly used attributes and methods
- 5. Nested resampling
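A minimal sketch of resampling and benchmarking with the sugar functions rsmp(), resample(), benchmark_grid(), and benchmark() (assuming the ranger package is installed for the random forest):

```r
library(mlr3verse)

task = tsk("sonar")

# 5-fold cross-validation of a single learner
rr = resample(task, lrn("classif.rpart", predict_type = "prob"), rsmp("cv", folds = 5))
rr$aggregate(msr("classif.auc"))

# Benchmark several learners under the same resampling scheme
design = benchmark_grid(
  tasks       = task,
  learners    = lrns(c("classif.rpart", "classif.ranger"), predict_type = "prob"),
  resamplings = rsmp("cv", folds = 5)
)
bmr = benchmark(design)
bmr$aggregate(msr("classif.auc"))
```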
Part Six: Hyperparameter Tuning
- 1. The importance of hyperparameter tuning in machine learning
- 2. Model optimization: learners and search space; terminators; instantiating tuning objects using ti() function; black-box optimization problems and their algorithms
- 3. Tuning sugar functions – tune(), auto_tuner()
- 4. Expanding the search space
- 5. Simple application of data.table package
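A minimal sketch of the tuning workflow outlined in Part Six, using to_tune() to declare the search space and auto_tuner() so that tuning runs inside its own resampling (cp and minsplit are hyperparameters of the rpart learner; argument names follow current mlr3tuning versions):

```r
library(mlr3verse)

task = tsk("sonar")

# Declare the search space directly on the learner's hyperparameters
learner = lrn("classif.rpart",
              cp       = to_tune(1e-4, 0.1, logscale = TRUE),
              minsplit = to_tune(2, 64))

# auto_tuner() wraps the learner; tuning is performed with nested resampling
at = auto_tuner(
  tuner      = tnr("random_search"),
  learner    = learner,
  resampling = rsmp("cv", folds = 3),
  measure    = msr("classif.ce"),
  term_evals = 20
)
at$train(task)        # tune, then refit with the best configuration
at$tuning_result      # the selected hyperparameters
```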
Part Seven: Feature Selection
- 1. Overview of feature selection
- 2. Filter methods: computing filter scores, feature importance, and selecting the filtered features
- 3. Embedded methods and feature selection after embedding
- 4. Wrapper methods: simple forward selection, introduction to the FSelectInstance class, various feature selection algorithms, feature selection optimizing multiple performance metrics, and automatic feature selection with AutoFSelector (so that feature selection itself can be nested inside resampling)
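A minimal sketch of the filter and wrapper approaches above, using mlr3filters and mlr3fselect (fs("sequential") would give the forward selection mentioned in item 4; random search is used here only to keep the example fast):

```r
library(mlr3verse)

task = tsk("sonar")

# Filter method: score every feature, then keep the best ones
filter = flt("importance", learner = lrn("classif.rpart"))
filter$calculate(task)
head(as.data.table(filter), 10)      # top-scoring features

# Wrapper method: search over feature subsets, judged by resampled performance
instance = fselect(
  fs("random_search"),               # wrapper strategy; fs("sequential") = forward selection
  task,
  lrn("classif.rpart"),
  rsmp("cv", folds = 3),
  msr("classif.ce"),
  term_evals = 20
)
instance$result_feature_set          # selected feature subset
```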
Part Eight: Sequential Pipeline
- 1. Introduction to graph flow system
- 2. Introduction to sequential graph flow methods
- 3. Building and using graph learners
- 4. Hyperparameter tuning for graph learners
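A minimal sketch of a sequential pipeline: PipeOps chained with %>>% and wrapped into a graph learner that behaves like any other learner:

```r
library(mlr3verse)

# Preprocessing steps and a learner chained into one graph
graph = po("scale") %>>% po("pca") %>>% lrn("classif.rpart")

# The graph becomes an ordinary learner: it can be trained, resampled, and tuned
glrn = as_learner(graph)
rr = resample(tsk("sonar"), glrn, rsmp("cv", folds = 3))
rr$aggregate(msr("classif.ce"))
```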
Part Nine: Non-Sequential Pipeline
- 1. Introduction to non-sequential graph flow methods
- 2. Bagging method for building new learners
- 3. Stacking method for building new learners
- 4. Hyperparameter tuning and path selection in non-sequential graph flow systems
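A minimal sketch of the bagging idea in item 2, built as a non-sequential graph: one "subsample + tree" path is replicated and the copies' predictions are averaged (ppl("greplicate") and po("classifavg") come from mlr3pipelines):

```r
library(mlr3verse)

# One path: draw a subsample, then fit a decision tree on it
single_path = po("subsample", frac = 0.8) %>>% lrn("classif.rpart")

# Replicate the path 10 times and average the predictions of the copies
graph  = ppl("greplicate", single_path, n = 10) %>>% po("classifavg", innum = 10)
bagged = as_learner(graph)

rr = resample(tsk("sonar"), bagged, rsmp("cv", folds = 3))
rr$aggregate(msr("classif.ce"))
```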
Part Ten: Data Preprocessing
- 1. Data cleaning
- 2. Constructing dummy variables
- 3. Handling missing values
- 4. Using ppl("robustify") to make a pipeline robust to common data problems
- 5. Feature transformation
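A minimal sketch of a hand-built preprocessing pipeline (imputation, dummy encoding, scaling) and of ppl("robustify"), which assembles a sensible default preprocessing graph in one call:

```r
library(mlr3verse)

# Hand-built preprocessing: impute, dummy-encode factors, scale, then learn
graph =
  po("imputemedian") %>>%                   # numeric missing values -> median
  po("imputemode")   %>>%                   # categorical missing values -> mode
  po("encode", method = "treatment") %>>%   # dummy (treatment) coding
  po("scale") %>>%
  lrn("classif.rpart")
glrn = as_learner(graph)

# Or let ppl("robustify") assemble a default pipeline that guards against
# missing values, new factor levels, and similar problems
rlearner = lrn("classif.rpart")
robust   = as_learner(ppl("robustify", learner = rlearner) %>>% rlearner)
```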
Part Eleven: Model Interpretability
- 1. Introduction to model-agnostic interpretability DALEX system
- 2. Introduction to Shapley value principles and applications with visualization
- 3. Introduction to LIME principles and applications with visualization
- 4. Methodology for variable importance based on evaluation metrics
- 5. Preliminary introduction to other methodologies
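A minimal sketch of wrapping a trained mlr3 learner in a DALEX explainer and computing permutation importance and Shapley values (explain_mlr3() comes from DALEXtra; the ranger package is assumed to be installed):

```r
library(mlr3verse)
library(DALEX)
library(DALEXtra)

task    = tsk("sonar")
learner = lrn("classif.ranger", predict_type = "prob")$train(task)

# Wrap the trained learner in a model-agnostic explainer
explainer = explain_mlr3(
  learner,
  data  = task$data(cols = task$feature_names),
  y     = as.numeric(task$data(cols = task$target_names)[[1]]) - 1,
  label = "ranger"
)

# Permutation variable importance and Shapley values for one observation
vip  = model_parts(explainer)
shap = predict_parts(explainer,
                     new_observation = task$data(rows = 1, cols = task$feature_names),
                     type = "shap")
plot(vip)
plot(shap)
```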
Part Twelve: Overall Process of Building and Validating Binary Classification Models Based on mlr3
- 1. Introduction to the overall process of binary classification prediction models
- 2. Building binary classification predictions
- 3. External validation of binary classification prediction models
- 4. DCA curves, calibration curves, and probability calibration of binary classification prediction models
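A minimal sketch of external validation of a binary classification model. Here dev_data and ext_data are hypothetical development and external cohorts (not from the announcement) that share the same columns, including an observed binary outcome column named outcome with positive level "1":

```r
library(mlr3verse)

# dev_data / ext_data are placeholders for your own development and external cohorts
task_dev = as_task_classif(dev_data, target = "outcome", positive = "1")
learner  = lrn("classif.ranger", predict_type = "prob")
learner$train(task_dev)

# External validation: predict on the independent cohort, then assess
# discrimination (AUC) and overall probability accuracy (Brier score)
pred_ext = learner$predict_newdata(ext_data)
pred_ext$score(msrs(c("classif.auc", "classif.bbrier")))
```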
Part Thirteen: Overall Process of Building and Validating Survival Models Based on mlr3
- 1. Introduction to the overall process of survival prediction models
- 2. Building survival predictions
- 3. External validation of survival prediction models
- 4. DCA curves and calibration curves of survival prediction models
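A minimal sketch of a survival model in the same framework; survival tasks, learners, and measures come from the mlr3proba extension, which is assumed to be installed (it is distributed outside CRAN):

```r
library(mlr3verse)
library(mlr3proba)          # provides survival tasks, learners, and measures

task = tsk("rats")          # built-in survival task (time + status outcome)
cox  = lrn("surv.coxph")    # Cox proportional hazards model

rr = resample(task, cox, rsmp("cv", folds = 5))
rr$aggregate(msr("surv.cindex"))    # Harrell's concordance index
```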
Part Fourteen: Establishment and Evaluation of Unsupervised Clustering Systems Based on mlr3
- 1. Introduction to the overall process of unsupervised clustering prediction models
- 2. Building unsupervised clustering models
- 3. Internal validation of unsupervised clustering models – clinical and basic correlations
- 4. External validation of unsupervised clustering models – optimal model method
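A minimal sketch of unsupervised clustering in the same framework; clustering tasks, learners, and internal validation measures come from the mlr3cluster extension (assumed installed):

```r
library(mlr3verse)
library(mlr3cluster)        # clustering tasks, learners, and measures

task = tsk("usarrests")                       # built-in clustering task
km   = lrn("clust.kmeans", centers = 3)       # k-means with 3 clusters

km$train(task)
pred = km$predict(task)
pred$score(msr("clust.silhouette"), task = task)   # internal validation (silhouette width)
```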
Teaching Format and Schedule
Teaching Format: Remote online live teaching.
Teaching Schedule: The course starts in June 2024 and comprises no fewer than 45 hours of teaching, delivered in 4-6 hour weekend sessions; all content is expected to be covered within 8-10 weeks.
Q&A Support: A dedicated WeChat group will be set up for the course, with free Q&A on course content for one year.
Video Replay: Free unlimited replays within one year.
Course Price and After-Sales Guarantee
Course Price: 4800 RMB
For corporate bank transfers and similar payment procedures, please contact the teaching assistant in advance.
Organizing Company: Tianqi Zhuli (Tianjin) Productivity Promotion Co., Ltd.
After-Sales Guarantee: Unconditional free refund within two weeks after the course officially starts.
Reward Policy: Students who publish an article with an impact factor of 10+ using the course content can have their tuition refunded (consult the teaching assistant for specific requirements and procedures).
Registration Consultation
You can contact my teaching assistant for inquiries.

Teaching Assistant Contact Number: 18502623993
Spirit of the Dragon Horse~!