10-742: Machine Learning in Healthcare

Fall 2024

Course Overview

Machine learning (ML) is experiencing explosive growth in healthcare, and is now top of mind for leaders at hospitals, insurance companies, and pharmaceutical firms. This course offers a survey of ML in healthcare today. Students will gain firsthand experience working with electronic health records, time-series medical data, health insurance ("administrative") data, and many other healthcare data sources. The course will cover how ML (and AI more generally) is impacting healthcare financing, operations, and care delivery, with select 'deep dives' into specific verticals such as radiology, pathology, and ophthalmology. Students will learn how to apply ML methods to varied problems in healthcare, such as predicting disease onset and forecasting how long a patient will remain in the hospital. The course will address the challenges of working responsibly with healthcare data, including potential biases, inconsistencies, and confounders, and provide strategies for identifying and mitigating these issues.

The course assumes strong competency in Python/pandas/Jupyter, and hands-on experience building models such as xgboost, logistic regression, and neural networks. It also requires mathematical maturity that includes college-level probability, statistics, and discrete math. No background in healthcare is expected. The class is open to graduate students in SCS. It is also open to qualified, motivated undergraduates from all majors in SCS, and to other students who fulfill the above requirements.

The course consists of twice-weekly lectures with assigned readings, problem sets, and a final project. There will be no exams.

Staff


Instructor Adam Berger adam@andrew.cmu.edu Office hours: Tues 5-6PM at GHC 9115
TA Andrew Wang azwang2@andrew.cmu.edu Office hours: Wed 11AM-12PM GHC 9112
TA Raehash Shah raehashs@andrew.cmu.edu Office hours: Mon 4-5PM GHC 9112
Final Project Advisor Venkat Sivaraman vsivaram@andrew.cmu.edu Office hours: Fri 4:30-5:30PM NSH A408

Tools

You will find all assignments on the course Canvas site. This is also where you will submit your assignments.

We will communicate with one another using the course Piazza site. You are MUCH more likely to get a timely response from course staff if you post your note on Piazza, as opposed to other channels (email/Slack/text/etc). You may post questions related to assignments on Piazza, but do not post source code, answers, or even hints to Piazza, or any other location where other students may be able to see them.

Schedule

Lectures are Tues/Thurs 3:30-4:50PM ET at Scaife Hall 236.
Date Topic (Presenter) Description
SEGMENT 1: ESCAPE VELOCITY
Part 1: Getting Oriented
1

Tues Aug 27
Course Overview
(Adam Berger)
Download slides
Course syllabus, policies and logistics. The US healthcare system and its oddities. Why Pittsburgh? The last 50 years and the next 50 years. Hinton vs. Langlotz.

Assignment 0 Out

Prerequisite readings: None
2

Thurs Aug 29
Healthcare entities, data, and systems
(Adam Berger)
Download slides
Cast of characters in healthcare. Delegated risk. The flaw at the core of the US healthcare system. Healthcare data and systems. Clinical data is a weak and untrustworthy proxy for ground truth.

Assignment 1 Out

Prerequisite readings:
Optional readings:
3

Tues Sep 3
Risk Stratification
(Adam Berger)
Download slides
What is risk stratification? Traditional approaches. Why we seek parsimonious models, and how to build them. Evaluation. Leaky labels. Models that "look over the physician’s shoulder." Intervention-tainted outcomes. Interpretable models.

Prerequisite readings:
Optional readings:
4

Thurs Sep 5
Regulation
(Adam Berger)
Download slides
Role of government in healthcare in general and ML+healthcare in particular. What HIPAA is, and what it is not. When an IRB is required. ML approaches to DeID (de-identification), including conditional random fields. HITECH Act. Meaningful Use. Cures Act.

Assignment 0 Due

Prerequisite readings: None
Part 2: Using (and Misusing) Data
5

Tues Sep 10
Statistics you need to know
(Adam Berger)
Download slides
Key probability distributions in healthcare. Hypothesis testing. Survival Analysis. Evaluation metrics. Missing data and how to address it.

Prerequisite readings: None

Optional readings:
6

Thurs Sep 12
Lies, Damn Lies, And Healthcare Data
(Adam Berger)
Download slides
Common types of bias in healthcare data. Debiasing techniques. Process bias: Learning the "wrong thing" from healthcare data. Non-stationarity of healthcare data. Transfer Learning.

Assignment 1 Due
Assignment 2 Out

Prerequisite readings:
Optional readings:
Part 3: Paying for Healthcare
7

Tues Sep 17
Paying for Healthcare 1
(Pamela Peele)
Download slides
What is health insurance? Underwriting, rate making, risk adjustment. Adverse selection. Moral hazard. Utilization management. Cost-sharing: deductibles, co-insurance. Providers taking risk; ACOs. Commercial insurance. Public insurance. MLR. Measuring total cost of care. Care management. Medicare: Parts A, B, C, D. Regulating bodies: CMS/HHS. Quality metrics. MA risk delegation. Gaming the system.

Prerequisite readings:
    None
8

Thurs Sep 19
Paying for Healthcare 2
(Pamela Peele)
Download slides
Surprisingly difficult problems in population health: Care management. Provider attribution. Clustering encounters. Fraud/Waste/Abuse. Network Optimization. HCC coding.

Prerequisite readings:
    TBD
SEGMENT 2: APPLYING ML IN HEALTHCARE
Part 4: ML in Clinical Care
9

Tues Sep 24
ML + Radiology
(Shandong Wu)
Radiology practice, radiological imaging modalities (X-ray, CT, MRI, and Ultrasound). How AI/machine learning can augment radiological imaging acquisition and interpretation to empower radiologists for diagnosis and decision making.

Prerequisite readings:
10

Thurs Sep 26
ML + Pathology
(Liron Pantanowitz)
Download slides
What is computational pathology? The feasibility of employing AI in Pathology practice. There are many unmet needs in the practice of pathology, like workload increase and workforce shortage, that could potentially be solved by leveraging AI/ML. Reviewing the different applications of traditional and generative AI in pathology. The impact of pre-imaging factors, business use case in different practice settings, prerequisites for deployment, and method for clinical validation of AI.

Assignment 2 Due
Assignment 3 Out

Prerequisite readings:
11

Tues Oct 1
ML + Ophthalmology
(Jay Chhablani)
Download slides
Brief intro to modern ophthalmology. OCT image analysis. Diagnosing AMD, diabetic retinopathy, and glaucoma from OCT and retinal fundus. Grading cataracts. Applications to third-world medicine. Challenges with generalization across imaging devices, populations, and clinical settings. Regulatory constraints.

Prerequisite readings:
12

Thurs Oct 3
Clinical Decision Support
(Shyam Visweswaran)
Download slides
AI-based clinical decision support (AI-CDS) systems in healthcare. Why healthcare requires AI-CDS. Stages in the development and evaluation of AI-CDS. Examples of real-world AI-CDS projects. Racial bias in AI-CDS.

Prerequisite readings:
13

Tues Oct 8
Dynamic Treatment Strategies
(Adam Berger)
Download slides
Reinforcement learning (RL) in healthcare, focusing on one case study: using RL for treating sepsis. Defining the state-space configuration. Why off-policy learning is important in medicine, and what its complexities are.

Assignment 3 Due
Assignment 4 Out

Prerequisite readings:
Optional readings:
14

Thurs Oct 10
Human-centered ML in Healthcare
(Venkatesh Sivaraman)
Download slides
What is human-centered design? Case studies in injecting ML into clinician workflows. How to incorporate clinicians into the modeling process.

Final Project Milestone 1

Prerequisite readings: none
Part 5: Secondary Usages of EHRs
Week of Oct 14 Fall Break - No Classes
15

Tues Oct 22
Text Processing in Healthcare
(Adam Berger)
Download slides
Why do we have clinical notes? Peculiar challenges of clinical text. Classical and modern NLP approaches. Secondary uses of EHR text. ClinicalBERT.

Assignment 4 Due

Prerequisite readings:
Optional readings:
16

Thurs Oct 24
Cancer Phenotyping
(Harry Hochheiser)
Download slides
What is cancer phenotyping? Why is it a difficult problem, and how can unstructured clinical data help? SEER, NCDB, TCGA. Extracting temporal relations. Extracting structured data (e.g. tumor site, histologic grade) from pathology reports. Applications to treatment toxicity, pharmacovigilance. The saga of Watson for Oncology.

Prerequisite readings:
17

Tues Oct 29
Clinical trial matching using LLMs
(Michael Wornow)
Download slides
A third of clinical trials fail simply because they can't enroll enough patients. Why? How patients are matched to trials today and why this is challenging technically. Prior work in rule-based and deep learning methods for matching patients to trials. Recent work involving LLMs. Practical considerations for deployment, limitations, and opportunities for future work.

Prerequisite readings:
SEGMENT 3: ADVANCED TOPICS
Part 6: Precision Medicine
18

Thurs Oct 31
Omics 1
(Carl Kingsford)
Download slides
Overview of challenges and issues in precision medicine (PM). Typical data types, endpoints, issues, and challenges. How PM fits into the drug development pipeline. Select commercial and academic use cases. Applying ML technologies such as SVMs, random forests, knowledge graphs, and deep learning.

Prerequisite readings:
Optional readings:
Tues Nov 5 Democracy Day - No Class (Go Vote!)
19

Thurs Nov 7
Omics 2
(Carl Kingsford)
Download slides
A deeper dive into the AI, ML, and computational techniques for precision medicine using molecular measurements. Various feature extraction algorithms for sequencing data. Computation techniques for translating biological measurements into actionable insights.

Final Project Milestone 2

Prerequisite readings:
Optional readings:
Part 7: Study Design and Forecasting
20

Tues Nov 12
Causal inference
(Colin Gray)
Download slides
Why predictive modeling is insufficient for understanding causality. A primer on propensity scores, matching, regression adjustments, and experimentation as applied to healthcare. How ML techniques are used in modern causal modeling.

Prerequisite readings: none

Optional readings:
21

Thurs Nov 14
Bayesian Clinical Trials
(Andrew King)
An introduction to clinical trial design. Like a mechanical watch, trials can be embellished with complications that add functionality. A review of Bayesian statistics. Focusing on the REMAP-CAP trial.

Prerequisite readings:
22

Tues Nov 19
Epidemiological modeling and forecasting
(Roni Rosenfeld)
Health care vs. public health vs. population health. Fundamentals of epidemiology. Epidemics, pandemics and endemics. Why epidemics come in waves. Approaches to epidemic modeling. Forecasting, nowcasting and backcasting. Approaches to epidemic forecasting.

Prerequisite readings: none
Part 8: Final Sprint
23

Thurs Nov 21
Real World Case studies
(Adam Berger)
Download slides
Case studies of real-world healthcare companies whose business leverages machine learning.

Prerequisite readings: none
24

Tues Nov 26
Special office hours for final project
(All teaching staff)
Teaching staff will be at the lecture hall and available for students to drop in and get feedback/guidance on their final project.


Thurs Nov 28
Thanksgiving - no class
25

Tues Dec 3
Final project - poster presentations
(Adam Berger)

Final Project Milestone 3

NOTE: Location for this class will be GHC 4300 Commons

Class members will present their final project to the public in a traditional conference poster session setting.

Course Policies

Attendance

We expect that students will attend class in person. We do not support remote participation in lectures and will not make recordings available. In an age of virtual / asynchronous learning, this may seem retrograde. But there’s a good reason for it. Pittsburgh is the epicenter of health technology research, and we’ve managed to convince some of the leading minds in healthcare/AI to deliver lectures on their topics of expertise. We can't have them presenting to an empty lecture hall. We will take attendance at all lectures, and attendance is an important ingredient in your final course grade. We understand that people have high-priority conflicts and get ill, and in recognition of that, we consider four absences during the semester as equivalent to full attendance.

Collaboration and Use of Third Party Tools/Libraries/Repos

For the assignments, including the final project, you are welcome to use any generative AI tool to generate or check your source code or accompanying text. You are also welcome to fork an existing GitHub repo, so long as that repo does not belong to a fellow class member or a student from a previous version of this class or any other healthcare ML class. Having said that, any work you turn in (whatever its provenance) is completely your responsibility. You will learn much more from working through any problem than from copying its answer from a tool or other source. Much good can come from using generative AI, but you should never accept the output of such a tool without completely understanding it.

You may work alone or in groups for the assignments. Each student must write up and turn in their own assignment. On every assignment, you must identify your collaborators. You must acknowledge all sources (e.g. Wikipedia, a website, a genAI tool) in your assignment. It is a violation of this policy to submit a problem solution that you cannot explain to a member of the course staff. Plagiarism and other dishonest behavior cannot be tolerated in any academic environment. If you have any questions about the collaboration policy, or if you feel that you may have violated the policy, please talk to one of the course staff.

Do not store your answers anywhere that others can easily access them. Your answers should not be accessible from the public internet, or any file system or cloud repository where other students (today or in the future) may be able to access them.

For the final project, you may work alone or in groups of two students at most. Groups should create a single poster/writeup.

Late Policy

This course's late policy is simple: hand in all your assignments on time. There is no grace period for submitting answers to the prerequisite readings for each lecture; the entire purpose of those questions is to get you prepared for the lecture so you can get the most out of it. For the problem sets, we allow submissions up to two days after the due date, with a penalty of 25% of the total points available for one day late and 50% for two days late. We won't accept submissions more than 48 hours after the deadline, since that's when we'll start grading the assignments. You will receive zero credit for submitting the final project proposal or final project itself after the deadline.
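
As a rough illustration of the problem-set late penalty described above, here is a minimal Python sketch. The function name `late_adjusted_score` and its parameters are hypothetical, and the sketch makes several assumptions that the policy does not spell out: lateness is rounded up to whole days, the penalty is subtracted from the earned score and floored at zero, and a submission more than 48 hours late (which is not accepted) is scored as zero. The course staff's actual procedure may differ.

```python
import math


def late_adjusted_score(raw_score, total_points, hours_late):
    """Apply the problem-set late penalty to a raw score.

    Assumptions (not stated in the policy): lateness rounds up to whole
    days, and the adjusted score cannot drop below zero.
    """
    if hours_late <= 0:
        return float(raw_score)          # on time: no penalty
    if hours_late > 48:
        return 0.0                       # past 48 hours: not accepted
    days_late = math.ceil(hours_late / 24)   # 1 or 2
    penalty = 0.25 * days_late * total_points  # 25% per day of points available
    return max(0.0, raw_score - penalty)


# Example: a 10-point problem set on which the student earned 8 points.
for hours in (0, 5, 30, 49):
    print(hours, late_adjusted_score(8, 10, hours))
```

Note that the penalty is a fraction of the total points available, not of the points earned, so a low-scoring submission can lose all of its credit after a single late day.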

Pedagogical Sequencing

We are intentionally front-loading the problem sets in the semester so that you can focus on your final projects in the second half of the semester, which means that some course material will appear in a problem set before it is presented in a lecture. That is a feature, not a bug. Especially when we expect you are seeing material for the first time, we will make every effort to explain the material carefully and completely in the problem set.

Pass-Fail

You are allowed to take this course as Pass/Fail; instructor permission is not required. The letter-grade cutoff for a Pass depends on your specific program. We do not decide whether you pass; we compute your letter grade the same way as for everyone else in the class, and your program converts that letter grade to a Pass or Fail according to its own cutoff. Be sure to check with your program/department as to whether you can count a Pass/Fail course towards your degree requirements.

Reasonable Person Principle

The course staff subscribes to the ancient SCS Reasonable Person Principle and we will strive to accommodate students who face exceptional circumstances during the semester. Our overriding goal is that students learn, find themselves challenged, and grow into ML+Healthcare practitioners. The lectures, problem sets, readings, final project, and grades are all a means to that end.

Grading

We will calculate your final grade using the following weighting:

  • Problem sets (5 total): 35 points (point allocation marked on each problem set)
  • Final project proposal (Milestone 1): 3 points
  • Final project progress report (Milestone 2): 3 points
  • Final project (poster and live presentation): 19 points
  • Answers to reading assignments: 10 points (you get full credit for up to 3 missed reading assignments)
  • Attendance at lectures: 20 points (missing up to 4 lectures still counts as full attendance)
  • Participation: 10 points
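
The weighting above sums to 100 points. As a sanity check, the following Python sketch encodes the weights and combines per-component fractions into a point total. The names `WEIGHTS`, `forgiven_fraction`, and `final_points` are hypothetical. The linear proration applied once the forgiven absences or missed readings are used up is purely an assumption for illustration; the syllabus only states that up to 4 missed lectures and up to 3 missed reading assignments still earn full credit.

```python
# Point weights taken from the grading list above.
WEIGHTS = {
    "problem_sets": 35,
    "milestone_1": 3,
    "milestone_2": 3,
    "final_project": 19,
    "readings": 10,
    "attendance": 20,
    "participation": 10,
}


def forgiven_fraction(missed, total, forgiven):
    """Fraction of credit earned when the first `forgiven` misses are free.

    Assumption: misses beyond the forgiven ones reduce credit linearly.
    """
    if total <= forgiven:
        return 1.0
    counted = max(0, missed - forgiven)
    return max(0.0, 1.0 - counted / (total - forgiven))


def final_points(fractions):
    """Combine per-component fractions (each in [0, 1]) into points out of 100."""
    return sum(WEIGHTS[name] * fractions[name] for name in WEIGHTS)


print(sum(WEIGHTS.values()))        # total points available
print(forgiven_fraction(4, 24, 4))  # four absences still earn full attendance credit
```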

    Final Project

    The final project is an opportunity to follow your curiosity, go deep in a particular healthcare domain and dataset, and build something interesting. You are responsible for coming up with a project idea and delivering a 1-page proposal to the course staff by Milestone 1 (see below), and iterating on the proposal until you receive approval from the instructor. You will then implement the project and present it to the course staff and a broader CMU audience in a public poster session, similar to what you would experience at an academic conference.

    It is NOT required that you develop something completely novel and publishable. A perfectly acceptable, full-credit final project would be to replicate (perhaps with your own twists) an existing result from the literature. Before submitting your proposal to the course staff, be sure to ask yourself these questions:

    1. Does this project address a meaningful problem in healthcare? (The first third of the course will help you make an informed assessment.)
    2. Does this project rely on machine learning (supervised, unsupervised, or RL)? Healthcare analytics projects are important and interesting, but this is an ML class.
    3. Does this project leverage a healthcare dataset?
    4. Is this project non-trivial? (E.g. downloading and running the code and data from a public github repo associated with a published paper would not pass this criterion)
    While we don't want to be overly prescriptive about the project writeup, you will want to include at least some of the typical elements of papers published in the last five years in the area of machine learning in healthcare, such as:
    • EDA (exploratory data analysis) and a "Table 1" describing your data in detail
    • Pointer to source code (for reproducibility)
    • Why you selected the model form(s) you did
    • Some details on model building: train/test split, approach to smoothing, number of iterations, etc.
    • Where and how your work fits relative to the academic literature
    • Shortcomings of your work and recommendations for future work
    Sample ideas

    Here are some example ideas for a project, to get you calibrated on the scope and complexity we expect.

    • Pneumonia Detection Challenge: Participate retrospectively in the 2018 RSNA Pneumonia challenge.
    • Detecting abnormal heartbeat: Participate retrospectively in the 2016 PhysioNet challenge to classify recordings of heartbeats. PhysioNet runs a cardiology-related challenge every year, and the previous years' challenges, along with data, are available here.
    • Automated Medical Coding: Using MIMIC or another dataset, build a model that assigns ICD-9/ICD-10 codes based on the clinical information (including notes) generated during a visit. See here, among many other sources.
    • Question answering (Q/A) on EMRs: The typical EMR patient record contains a vast amount of information, especially for older patients with chronic diseases. A high-quality Q/A system can allow physicians to quickly retrieve specific information from a patient chart by asking natural language questions, instead of manually searching through pages and pages of information. Use a base model like ClinicalBERT and attach a Q/A head on top of this model, which you will probably want to train using an annotated clinical question-answering dataset (e.g. emrQA).
    • Patient embedding: Loosely following the work described in the DeepPatient paper, build an embedding model for patients and demonstrate its performance on some healthcare application.
    • Quality of care assessment: Use CMS data to build a model that predicts the quality of care provided by healthcare providers and institutions. (You will need to define ‘quality’). You may wish to look at datasets containing, for example, hospital readmissions, mortality rates, and patient satisfaction data. What features are most associated with high-quality care?
    • Multimodal AI: There are more and more public multimodal healthcare datasets becoming available, consisting of images and associated metadata. One example is the ISIC dataset, which contains images of skin lesions along with corresponding clinical information. Use a pre-trained convolutional neural network (CNN) model, such as ResNet, to extract visual features from the dermatoscopic images. In parallel, use a pre-trained language model, such as BERT or BioBERT, to extract textual features from the clinical notes. Explore different techniques for fusing the visual and textual features.
    • For other ideas for a project, have a look at this survey paper.
    Some popular, public healthcare datasets you may wish to consider:

    1. MIMIC
    2. The Cancer Imaging Archive
    3. The National Cancer Institute ‘SEER’ data
    4. Human Connectome Project (HCP)
    5. CDC data
    6. CMS data
    7. Human Mortality Database
    8. CheXpert chest x-ray data
    9. Kaggle also has many healthcare datasets and so does this.
    Some of these datasets may require paperwork/approval to get access, so plan ahead.

    Milestones

    The Final Project will have three milestones during the semester:

    • Milestone 1 (Oct 10): Submit 1-page project proposal, which describes the problem you are addressing, a brief outline of your approach, and the data you intend to use.
    • Milestone 2 (Nov 7): Submit 1-page progress report, describing work completed and pending.
    • Milestone 3 (Dec 3): Poster session / live demo. Submit your poster as a pdf within Canvas before the poster session.

    Credit

    This course derives inspiration from MIT 6.7930, a course in machine learning that MIT has offered since 2017. To our knowledge, that course is the first and (until now) only course on ML+Healthcare offered for credit to students at any US university. MIT has kindly put the 6.7930 course material under their OCW (OpenCourseWare) license, and we have drawn selectively from that material in constructing lectures and problem sets for this course. We thank Professors Peter Szolovits and David Sontag for their permission (granted directly and through OCW) to adapt the 6.7930 material for this course. Any errors in this (CMU) course material are, of course, the responsibility of the instructor and not anyone else.
