10-742: Machine Learning in Healthcare

Fall 2024

Course Overview

Machine learning (ML) is experiencing explosive growth in healthcare, and is now top of mind for leaders at hospitals, insurance companies, and pharmaceutical firms. This course offers a survey of ML in healthcare today. Students will gain firsthand experience working with electronic health records, time-series medical data, health insurance ("administrative") data, and many other healthcare data sources. The course will cover how ML (and AI more generally) is impacting healthcare financing, operations, and care delivery, with select 'deep dives' into specific verticals such as radiology, pathology, and ophthalmology. Students will learn how to apply ML methods to varied problems in healthcare, such as predicting disease onset and forecasting how long a patient will remain in the hospital. The course will address the challenges of working responsibly with healthcare data, including potential biases, inconsistencies, and confounders, and provide strategies for identifying and mitigating these issues.

The course assumes a strong competency in Python/pandas/jupyter, and hands-on experience building models such as xgboost, logistic regression, and neural networks. It also requires a mathematical maturity that includes college-level probability, statistics, and discrete math. No background in healthcare is expected. The class is open to graduate students in SCS. It is also open to qualified, motivated undergraduates from all majors in SCS, and to other students who fulfill the above requirements.

The course consists of twice-weekly lectures with assigned readings, problem sets, and a final project. There will be no exams.

Staff

Instructor	Adam Berger	adam@andrew.cmu.edu	Office hours: Tues 5-6PM at GHC 9115
TA	Andrew Wang	azwang2@andrew.cmu.edu	Office hours: Wed 11AM-12PM GHC 9112
TA	Raehash Shah	raehashs@andrew.cmu.edu	Office hours: Mon 4-5PM GHC 9112
Final Project Advisor	Venkat Sivaraman	vsivaram@andrew.cmu.edu	Office hours: Fri 4:30-5:30PM NSH A408

Tools

You will find all assignments on the course Canvas site. This is also where you will submit your assignments.

We will communicate with one another using the course Piazza site. You are MUCH more likely to get a timely response from course staff if you post your note on Piazza, as opposed to other channels (email/Slack/text/etc.). You may post questions related to assignments on Piazza, but do not post source code, answers, or even hints to Piazza, or any other location where other students may be able to see them.

Schedule

Lectures are Tues/Thurs 3:30-4:50PM ET at Scaife Hall 236.

Date	Topic (Presenter)	Description
SEGMENT 1: ESCAPE VELOCITY
Part 1: Getting Oriented
1 Tues Aug 27	Course Overview (Adam Berger) Download slides	Course syllabus, policies and logistics. The US healthcare system and its oddities. Why Pittsburgh? The last 50 years and the next 50 years. Hinton vs. Langlotz. Assignment 0 Out Prerequisite readings: None
2 Thurs Aug 29	Healthcare entities, data, and systems (Adam Berger) Download slides	Cast of characters in healthcare. Delegated risk. The flaw at the core of the US healthcare system. Healthcare data and systems. Clinical data is a weak and untrustworthy proxy for ground truth. Assignment 1 Out Prerequisite readings: Healthcare System Overview - Khan Academy Optional readings: Commonwealth Fund Overview of US Healthcare System
3 Tues Sep 3	Risk Stratification (Adam Berger) Download slides	What is risk stratification? Traditional approaches. Why we seek parsimonious models, and how to build them. Evaluation. Leaky labels. Models that "look over the physician’s shoulder." Intervention-tainted outcomes. Interpretable models. Prerequisite readings: Population-Level Prediction of Type 2 Diabetes From Claims Data and Analysis of Risk Factors Optional readings: Improving Palliative Care with Deep Learning
4 Thurs Sep 5	Regulation (Adam Berger) Download slides	Role of government in healthcare in general and ML+healthcare in particular. What HIPAA is, and what it's not. When an IRB is required. ML approaches to DeID, including conditional random fields. HITECH Act. Meaningful Use. Cures Act. Assignment 0 Due Prerequisite readings: None
Part 2: Using (and Misusing) Data
5 Tues Sep 10	Statistics you need to know (Adam Berger) Download slides	Key probability distributions in healthcare. Hypothesis testing. Survival Analysis. Evaluation metrics. Missing data and how to address it. Prerequisite readings: None Optional readings: Statistics in Medicine
6 Thurs Sep 12	Lies, Damn Lies, And Healthcare Data (Adam Berger) Download slides	Common types of bias in healthcare data. Debiasing techniques. Process bias: Learning the "wrong thing" from healthcare data. Non-stationarity of healthcare data. Transfer Learning. Assignment 1 Due Assignment 2 Out Prerequisite readings: EHR Safari: Data is Contextual Dissecting racial bias in an algorithm used to manage the health of populations Optional readings: Why Is My Classifier Discriminatory? Underdiagnosis bias of artificial intelligence algorithms applied to chest radiographs in under-served patient populations Datasheets for Datasets
Part 3: Paying for Healthcare
7 Tues Sep 17	Paying for health care 1 (Pamela Peele) Download slides	What is health insurance? Underwriting, rate making, risk adjustment. Adverse selection. Moral hazard. Utilization management. Cost-sharing: deductibles, co-insurance. Providers taking risk; ACOs. Commercial insurance. Public insurance. MLR. Measuring total cost of care. Care management. Medicare: Parts A, B, C, D. Regulating bodies: CMS/HHS. Quality metrics. MA risk delegation. Gaming the system. Prerequisite readings: None
8 Thurs Sep 19	Paying for Healthcare 2 (Pamela Peele) Download slides	Surprisingly difficult problems in population health: Care management. Provider attribution. Clustering encounters. Fraud/Waste/Abuse. Network Optimization. HCC coding. Prerequisite readings: TBD
SEGMENT 2: APPLYING ML IN HEALTHCARE
Part 4: ML in Clinical Care
9 Tues Sep 24	ML + Radiology (Shandong Wu)	Radiology practice, radiological imaging modalities (X-ray, CT, MRI, and Ultrasound). How AI/machine learning can augment radiological imaging acquisition and interpretation to empower radiologists for diagnosis and decision making. Prerequisite readings: The Current and Future State of AI Interpretation of Medical Images Diagnostic Accuracy of Digital Screening Mammography With and Without Computer-Aided Detection
10 Thurs Sep 26	ML + Pathology (Liron Pantanowitz) Download slides	What is computational pathology? The feasibility of employing AI in Pathology practice. There are many unmet needs in the practice of pathology, like workload increase and workforce shortage, that could potentially be solved by leveraging AI/ML. Reviewing the different applications of traditional and generative AI in pathology. The impact of pre-imaging factors, business use case in different practice settings, prerequisites for deployment, and method for clinical validation of AI. Assignment 2 Due Assignment 3 Out Prerequisite readings: (Skim sections 1-3; focus on section 4) Revolutionizing Digital Pathology With the Power of Generative Artificial Intelligence and Foundation Models (Read entire paper plus 'Methods' section) Learning to predict RNA sequence expressions from whole slide images with applications for search and classification
11 Tues Oct 1	ML + Ophthalmology (Jay Chhablani) Download slides	Brief intro to modern ophthalmology. OCT image analysis. Diagnosing AMD, diabetic retinopathy, and glaucoma from OCT and retinal fundus. Grading cataracts. Applications to third-world medicine. Challenges with generalization across imaging devices, populations, and clinical settings. Regulatory constraints. Prerequisite readings: Artificial intelligence and deep learning in ophthalmology A foundation model for generalizable disease detection from retinal images
12 Thurs Oct 3	Clinical Decision Support (Shyam Visweswaran) Download slides	AI-based clinical decision support (AI-CDS) systems in healthcare. Why healthcare requires AI-CDS. Stages in the development and evaluation of AI-CDS. Examples of real-world AI-CDS projects. Racial bias in AI-CDS. Prerequisite readings: Using machine learning to selectively highlight patient information Outlier-based detection of unusual patient-management actions: An ICU study
13 Tues Oct 8	Dynamic Treatment Strategies (Adam Berger) Download slides	Reinforcement learning (RL) in healthcare, focusing on one case study: using RL for treating sepsis. Defining the state-space configuration. Why off-policy learning is important in medicine and what are its complexities. Assignment 3 Due Assignment 4 Out Prerequisite readings: The Artificial Intelligence Clinician learns optimal treatment strategies for sepsis in intensive care (Be sure to also read the Methods section at the end) Optional readings: Does the “Artificial Intelligence Clinician” learn optimal treatment strategies for sepsis in intensive care? Understanding the Artificial Intelligence Clinician and optimal treatment strategies for sepsis in intensive care (For an RL refresher) Reinforcement Learning: An Introduction and/or Stanford CS221 lecture on MDPs
14 Thurs Oct 10	Human-centered ML in Healthcare (Venkatesh Sivaraman) Download slides	What is human-centered design? Case studies in injecting ML into clinician workflows. How to incorporate clinicians into the modeling process. Final Project Milestone 1 Prerequisite readings: none
Part 5: Secondary Usages of EHRs
Week of Oct 14	Fall Break - No Classes
15 Tues Oct 22	Text Processing in Healthcare (Adam Berger) Download slides	Why do we have clinical notes? Peculiar challenges of clinical text. Classical and modern NLP approaches. Secondary uses of EHR text. ClinicalBERT. Assignment 4 Due Prerequisite readings: Clinical Concept Embeddings Learned from Massive Sources of Multimodal Medical Data - PMC (Required if you haven't already read it) Efficient Estimation of Word Representations in Vector Space Optional readings: Deep Patient: An Unsupervised Representation to Predict the Future of Patients from the Electronic Health Records ClinicalBERT: Modeling Clinical Notes and Predicting Hospital Readmission
16 Thurs Oct 24	Cancer Phenotyping (Harry Hochheiser) Download slides	What is cancer phenotyping? Why is it a difficult problem, and how can unstructured clinical data help? SEER, NCDB, TCGA. Extracting temporal relations. Extracting structured data (e.g. tumor site, histologic grade) from pathology reports. Applications to treatment toxicity, pharmacovigilance. The saga of Watson for Oncology. Prerequisite readings: From patterns to patients: Advances in clinical machine learning for cancer diagnosis, prognosis, and treatment Use of Natural Language Processing to Extract Clinical Cancer Phenotypes from Electronic Medical Records
17 Tues Oct 29	Clinical trial matching using LLMs (Michael Wornow) Download slides	A third of clinical trials fail simply because they can't enroll enough patients. Why? How patients are matched to trials today and why this is challenging technically. Prior work in rule-based and deep learning methods for matching patients to trials. Recent work involving LLMs. Practical considerations for deployment, limitations, and opportunities for future work. Prerequisite readings: PRISM: Patient Records Interpretation for Semantic Clinical Trial Matching using Large Language Models Zero-Shot Clinical Trial Patient Matching with LLMs
SEGMENT 3: ADVANCED TOPICS
Part 6: Precision Medicine
18 Thurs Oct 31	Omics 1 (Carl Kingsford) Download slides	Overview of challenges and issues in precision medicine (PM). Typical data types, endpoints, issues, and challenges. How PM fits into the drug development pipeline. Select commercial and academic use cases. Applying ML technologies such as SVMs, random forests, knowledge graphs, and deep learning. Prerequisite readings: Building a knowledge graph to enable precision medicine GenomicKB: a knowledge graph for the human genome Optional readings: Learning Embeddings from Knowledge Graphs With Numeric Edge Attributes
Tues Nov 5	Democracy Day - No Class (Go Vote!)
19 Thurs Nov 7	Omics 2 (Carl Kingsford) Download slides	A deeper dive into the AI, ML, and computational techniques for precision medicine using molecular measurements. Various feature extraction algorithms for sequencing data. Computational techniques for translating biological measurements into actionable insights. Final Project Milestone 2 Prerequisite readings: Salmon provides fast and bias-aware quantification of transcript expression Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2 Optional readings: SQUID: transcriptomic structural variation detection from RNA-seq
Part 7: Study Design and Forecasting
20 Tues Nov 12	Causal inference (Colin Gray) Download slides	Why predictive modeling is insufficient for understanding causality. A primer on propensity scores, matching, regression adjustments, and experimentation as applied to healthcare. How ML techniques are used in modern causal modeling. Prerequisite readings: none Optional readings: Causality Handbook Causal ML Book
21 Thurs Nov 14	Bayesian Clinical Trials (Andrew King)	An introduction to clinical trial design. Like a mechanical watch, trials can be embellished with complications that add functionality. A review of Bayesian statistics. Focusing on the REMAP-CAP trial. Prerequisite readings: Bayesian Clinical Trials The REMAP-CAP (Randomized Embedded Multifactorial Adaptive Platform for Community-acquired Pneumonia) Study. Rationale and Design
22 Tues Nov 19	Epidemiological modeling and forecasting (Roni Rosenfeld)	Health care vs. public health vs. population health. Fundamentals of epidemiology. Epidemics, pandemics and endemics. Why epidemics come in waves. Approaches to epidemic modeling. Forecasting, nowcasting and backcasting. Approaches to epidemic forecasting. Prerequisite readings: none
Part 8: Final Sprint
23 Thurs Nov 21	Real World Case studies (Adam Berger) Download slides	Case studies of real-world healthcare companies whose business leverages machine learning. Prerequisite readings: none
24 Tues Nov 26	Special office hours for final project (All teaching staff)	Teaching staff will be at the lecture hall and available for students to drop in and get feedback/guidance on their final project.
Thurs Nov 28	Thanksgiving - no class
25 Tues Dec 3	Final project - poster presentations (Adam Berger)	Final Project Milestone 3 NOTE: Location for this class will be GHC 4300 Commons. Class members will present their final project to the public in a traditional conference poster session setting.

Course Policies

Attendance

We expect that students will attend class in person. We do not support remote participation in lectures and will not make recordings available. In an age of virtual / asynchronous learning, this may seem retrograde. But there’s a good reason for it. Pittsburgh is the epicenter of health technology research, and we’ve managed to convince some of the leading minds in healthcare/AI to deliver lectures on their topics of expertise. We can't have them presenting to an empty lecture hall. We will take attendance at all lectures, and attendance is an important ingredient in your final course grade. We understand that people have high-priority conflicts and get ill, and in recognition of that, we consider four absences during the semester as equivalent to full attendance.

Collaboration and Use of Third Party Tools/Libraries/Repos

For the assignments, including the final project, you are welcome to use any generative AI tool to generate or check your source code or accompanying text. You are also welcome to fork an existing GitHub repo, so long as that repo does not belong to a fellow class member or a student from a previous version of this class or any other healthcare ML class. Having said that, any work you turn in (whatever its provenance) is completely your responsibility. You will learn much more from working through any problem than from copying its answer from a tool or other source. Much good can come from using generative AI, but you should never accept the output of such a tool without completely understanding it.

You may work alone or in groups for the assignments. Each student must write up and turn in their own assignment. On every assignment, you must identify your collaborators. You must acknowledge all sources (e.g. Wikipedia, a website, a genAI tool) in your assignment. It is a violation of this policy to submit a problem solution that you cannot explain to a member of the course staff. Plagiarism and other dishonest behavior cannot be tolerated in any academic environment. If you have any questions about the collaboration policy, or if you feel that you may have violated the policy, please talk to one of the course staff.

Do not store your answers anywhere that others can easily access them. Your answers should not be accessible from the public internet, or any file system or cloud repository where other students (today or in the future) may be able to access them.

For the final project, you may work alone or in groups of two students at most. Groups should create a single poster/writeup.

Late Policy

This course's late policy is simple: hand in all your assignments on time. There is no grace period for submitting answers to the prerequisite readings for each lecture; the entire purpose of those questions is to get you prepared for the lecture so you can get the most out of it. For the problem sets, we allow submissions up to two days after the due date, with a penalty of 25% of the total points available for one day late and 50% for two days late. We won't accept submissions more than 48 hours after the deadline, since that's when we'll start grading the assignments. You will receive zero credit for submitting the final project proposal or final project itself after the deadline.
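
As a rough illustration of the problem-set penalty arithmetic above, here is a minimal Python sketch. The function name and the hour-based thresholds are assumptions made for this example: it treats any lateness up to 24 hours as "one day late" and up to 48 hours as "two days late", floors the adjusted score at zero, and returns zero for anything later than 48 hours. The course staff's actual bookkeeping may differ.

```python
def late_adjusted_score(raw_score, max_points, hours_late):
    """Sketch of the problem-set late penalty (hypothetical helper).

    Assumptions not stated in the policy: lateness is measured in hours,
    up to 24 hours counts as one day late, up to 48 hours as two days late,
    and the adjusted score is never negative.
    """
    if hours_late <= 0:
        return float(raw_score)      # on time: no penalty
    if hours_late > 48:
        return 0.0                   # submissions past 48 hours are not accepted
    # The penalty is a fraction of the total points AVAILABLE, not of the score earned.
    penalty_fraction = 0.25 if hours_late <= 24 else 0.50
    return max(0.0, raw_score - penalty_fraction * max_points)


# Example: a problem set worth 10 points, with 8 points earned.
for hours in (0, 5, 30, 49):
    print(hours, late_adjusted_score(8, 10, hours))
```

Because the penalty is taken from the points available rather than the points earned, a two-day-late submission can earn at most half credit.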

Pedagogical Sequencing

We are intentionally front-loading the problem sets in the semester so that you can focus on your final projects in the second half of the semester, which means that some course material will appear in a problem set before it is presented in a lecture. That is a feature, not a bug. Especially when we expect you are seeing material for the first time, we will make every effort to explain the material carefully and completely in the problem set.

Pass-Fail

You are allowed to take this course as Pass/Fail; instructor permission is not required. The letter grade that serves as the cutoff for a Pass depends on your specific program: we compute your letter grade the same way as for everyone else in the class, and your program converts that letter grade to a Pass or Fail according to its own cutoff. Be sure to check with your program/department as to whether you can count a Pass/Fail course towards your degree requirements.

Reasonable Person Principle

The course staff subscribes to the ancient SCS Reasonable Person Principle and we will strive to accommodate students who face exceptional circumstances during the semester. Our overriding goal is that students learn, find themselves challenged, and grow into ML+Healthcare practitioners. The lectures, problem sets, readings, final project, and grades are all a means to that end.

Grading

We will calculate your final grade using the following weighting:

Problem sets (5 total): 35 points (point allocation marked on each problem set)

Final project proposal (Milestone 1): 3 points

Final project progress report (Milestone 2): 3 points

Final project (poster and live presentation): 19 points

Answers to reading assignments: 10 points (you get full credit for up to 3 missed reading assignments)

Attendance at lectures: 20 points (missing up to 4 lectures still counts as full attendance)

Participation: 10 points
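
The weighting above can be sanity-checked with a short Python sketch. The component names and the `final_points` helper are hypothetical, introduced only for this example. The sketch assumes you already know the fraction of each component's points you earned, with the forgiveness rules for missed readings and lectures already applied; how absences beyond four or missed readings beyond three reduce those fractions is not specified here and is not modeled.

```python
# Points per component, copied from the weighting listed above.
WEIGHTS = {
    "problem_sets": 35,
    "proposal_milestone_1": 3,
    "progress_report_milestone_2": 3,
    "final_project": 19,
    "reading_answers": 10,
    "attendance": 20,
    "participation": 10,
}

# The components should account for exactly 100 points.
assert sum(WEIGHTS.values()) == 100


def final_points(fraction_earned):
    """Return total course points out of 100 (hypothetical helper).

    fraction_earned maps a component name to the fraction (0.0 to 1.0)
    of that component's points earned; missing components count as 0.
    """
    return sum(WEIGHTS[name] * fraction_earned.get(name, 0.0) for name in WEIGHTS)


# Example: half credit on problem sets, full credit everywhere else.
example = {name: 1.0 for name in WEIGHTS}
example["problem_sets"] = 0.5
print(final_points(example))
```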

Final Project

The final project is an opportunity to follow your curiosity, go deep in a particular healthcare domain and dataset, and build something interesting. You are responsible for coming up with a project idea and delivering a 1-page proposal to the course staff by Milestone 1 (see below), and iterating on the proposal until you receive approval from the instructor. You will then implement the project and present it to the course staff and a broader CMU audience in a public poster session, similar to what you would experience at an academic conference.

It is NOT required that you develop something completely novel and publishable. A perfectly acceptable, full-credit final project would be to replicate (perhaps with your own twists) an existing result from the literature. Before submitting your proposal to the course staff, be sure to ask yourself these questions:

Does this project address a meaningful problem in healthcare? (The first third of the course will help you make an informed assessment.)
Does this project rely on machine learning (supervised, unsupervised, or RL)? Healthcare analytics projects are important and interesting, but this is an ML class.
Does this project leverage a healthcare dataset?
Is this project non-trivial? (E.g. downloading and running the code and data from a public GitHub repo associated with a published paper would not pass this criterion.)

While we don't want to be overly prescriptive about the project writeup, you will want to include at least some of the typical elements of most papers published in the last five years in the area of machine learning in healthcare, such as:

EDA (exploratory data analysis) and a "Table 1" describing your data in detail
Pointer to source code (for reproducibility)
Why you selected the model form(s) you did
Some details on model building: train/test split, approach to smoothing, number of iterations, etc.
Where and how your work fits relative to the academic literature
Shortcomings of your work and recommendations for future work

Sample ideas

Here are some example ideas for a project, to get you calibrated on the scope and complexity we expect.

Pneumonia Detection Challenge: Participate retrospectively in the 2018 RSNA Pneumonia challenge.
Detecting abnormal heartbeat: Participate retrospectively in the 2016 PhysioNet challenge, to classify recordings of heartbeats. PhysioNet runs a cardiology-related challenge every year, and previous years' challenges, along with data, are available here.
Automated Medical Coding: Using MIMIC or another dataset, build a model that assigns ICD-9/ICD-10 codes based on the clinical information (including notes) generated during a visit. See here, among many other sources.
Question answering (Q/A) on EMRs: The typical EMR patient record contains a vast amount of information, especially for older patients with chronic diseases. A high-quality Q/A system can allow physicians to quickly retrieve specific information from a patient chart by asking natural language questions, instead of manually searching through pages and pages of information. Use a base model like ClinicalBERT and attach a Q/A head on top of this model, which you will probably want to train using an annotated clinical question-answering dataset (e.g. emrQA).
Patient embedding: Loosely following the work described in the DeepPatient paper, build an embedding model for patients and demonstrate its performance on some healthcare application.
Quality of care assessment: Use CMS data to build a model that predicts the quality of care provided by healthcare providers and institutions. (You will need to define ‘quality’). You may wish to look at datasets containing, for example, hospital readmissions, mortality rates, and patient satisfaction data. What features are most associated with high-quality care?
Multimodal AI: There are more and more public multimodal healthcare datasets becoming available, consisting of images and associated metadata. One example is the ISIC dataset, which contains images of skin lesions along with corresponding clinical information. Use a pre-trained convolutional neural network (CNN) model, such as ResNet, to extract visual features from the dermatoscopic images. In parallel, use a pre-trained language model, such as BERT or BioBERT, to extract textual features from the clinical notes. Explore different techniques for fusing the visual and textual features.
For other ideas for a project, have a look at this survey paper.

Some popular, public healthcare datasets you may wish to consider:

MIMIC
The Cancer Imaging Archive
The National Cancer Institute ‘SEER’ data
Human Connectome Project (HCP)
CDC data
CMS data
Human Mortality Database
CheXpert chest x-ray data
Kaggle also has many healthcare datasets and so does this.

Some of these datasets may require paperwork/approval to get access, so plan ahead.

Milestones

The Final Project will have three milestones during the semester:

Milestone 1 (Oct 10): Submit 1-page project proposal, which describes the problem you are addressing, a brief outline of your approach, and the data you intend to use.
Milestone 2 (Nov 7): Submit 1-page progress report, describing work completed and pending.
Milestone 3 (Dec 3): Poster session / live demo. Submit your poster as a pdf within Canvas before the poster session.

Credit

This course derives inspiration from MIT 6.7930, a course in machine learning that MIT has offered since 2017. To our knowledge, that course is the first and (until now) only course on ML+Healthcare offered for credit to students at any US university. MIT has kindly put the 6.7930 course material under their OCW (OpenCourseWare) license, and we have drawn selectively from that material in constructing lectures and problem sets for this course. We thank Professors Peter Szolovits and David Sontag for their permission, granted directly and through OCW, to adapt the 6.7930 material for this course. Any errors in this (CMU) course material are, of course, the responsibility of the instructor and no one else.
