Can deep learning tell you when you’ll die?

Posted on December 1, 2017



Improving Palliative Care with Deep Learning

This is an artificial intelligence study. The researchers took electronic health records,

  • age
  • gender
  • diagnoses
  • treatments
    • scans
    • drugs

to predict whether you’ll be dead in 3-12 months. If so, you’re a candidate for palliative care, rather than continuing e.g. aggressive cancer treatment.

The authors have good intentions, but start with some worrisome assumptions (bolding mine):

“This frees the palliative care team from manual chart review of every admission and helps counter the potential biases of treating physicians by providing an objective recommendation based on the patient’s EHR.”

Do palliative care teams really examine every admission?

“Our approach uses deep learning to screen patients admitted to the hospital to identify those who are most likely to have palliative care needs.”

I’ve been to the hospital quite a few times. I can’t imagine the palliative care team was looking at my 21-year-old-bloody-chin-from-basketball-chart.

Second, “objective recommendation” is a joke. Again, the AI is based on,

  • age
  • gender
  • diagnoses
  • treatments
    • scans
    • drugs

Who did the above? The healthcare team.

You’re a 45-year-old female diagnosed with some type of cancer, receiving various treatments.

Every part of that biases a doctor. The fact you’re 45 may cause them to lean toward more aggressive treatment. So might the particular type of cancer you have. Being at one hospital may cause a lean toward certain approaches. Having partial ownership of the MRI machine may cause more MRIs. Owning the surgery center may cause extra surgery.

-> This is real talk. Doctors can own the treatment they’re giving you. Dolla dolla bills y’all.

Which, of course, could all be wrong.

Not to mention the researchers are picking which traits should be included in the algorithm (based on what humans have decided should be on a health record). It might be less biased, but “objective” is a fraudulent leap.

-> I have no desire to crap on these people’s work. They are well intentioned. But a counterweight to hype-men is needed, which AI is currently full of.

IBM’s Watson is having this very problem. Much of its training has been done at one hospital. When its conclusions are applied in other countries, they haven’t been well received, because those hospitals can tell it’s biased toward where it was trained.

IBM pitched its Watson supercomputer as a revolution in cancer care. It’s nowhere close

We’ve been through this before-

Pessimism regarding upcoming artificially intelligent personal trainers

First takeaway: AI often requires a ton of human work and input

This AI could not have worked without humans. Human diagnosis, human data entry, human interpretation. In this example, we’re a far cry from “AI is destroying jobs.” All this thing did was try to lessen the workload of a person referring someone to palliative care. It doesn’t replace any job.

How good was the AI?

The authors evaluate their model with a pair of statistical measures called precision and recall. Google search is an analogy worth starting with-

  • You search for “hips don’t lie” in Google.
  • In the first five links, all five are relevant to finding out if your hips, indeed, do lie
    • We say the first five results have high precision
  • In the next five links, none are relevant
    • These five have a poor precision, which brings the average precision down
      • BUT, average precision is counted in a way where being accurate early on helps. You prefer to be precise in the first five links compared to the 25th link. When you’re precise can matter.
  • In the next 10 links, again none are relevant, but because you’ve done your homework on hips lying, you know there are 10 links in the world Google is not finding
    • This diminishes recall. There are relevant links not being found.

Precision is in a sense, accuracy. Recall is completeness.
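To make the analogy concrete, here’s a minimal sketch in Python. The numbers are the made-up ones from the hips-don’t-lie example above, nothing from the paper:

```python
# Twenty links returned: the first five relevant, the remaining fifteen not.
# Ten more relevant links exist out in the world that Google never surfaced.
relevant_returned = 5
total_returned = 20
relevant_missed = 10

# Precision: of what we returned, how much was relevant?
precision = relevant_returned / total_returned                      # 0.25
# Recall: of everything relevant out there, how much did we find?
recall = relevant_returned / (relevant_returned + relevant_missed)  # ~0.33
```

Accurate in the first five links, terrible after, and a chunk of what exists is never found at all.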

When predicting whether someone should go to palliative care, we want to be precise. If the model flags someone for palliative care, they really should be a candidate, e.g. someone who is going to die within 12 months.

We also want solid recall. We don’t want to miss cases.

That is, we could be extremely precise, but have poor recall. If Google only shows one link and it’s relevant for hips not lying, that gives high precision, but because it’s missing so many other links, its recall is poor. We want more than one source of information for whether those hips are filthy liars or not.

There tends to be an inverse relationship between precision and recall. Initially, Google may be good at showing you what you want to find, but not all of it (high precision, low recall). However, as you keep going, Google might find more and more (recall starts going up), but not in the best order, or not properly labeling them as relevant (precision goes down).

The authors focus on .9 precision. When their algorithm is very precise, its recall is .32. When it’s very good at identifying relevant cases for palliative care, it’s not very good at catching all the cases that should be included.

Then we can see as recall goes up, as we find more and more cases, precision goes down- we don’t properly label those cases.

The average precision was .65. Based on this interpretation of average precision (I’m getting outside my domain now), that means roughly 2 out of every 3 flagged palliative care cases are properly identified.
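For the curious, one common (retrieval-style) way to compute average precision is to average the precision at each rank where a relevant item shows up, which is exactly what rewards being precise early on. A sketch assuming that definition; the paper may compute it differently:

```python
def average_precision(ranked_relevance):
    """Average of precision@k at each rank k holding a relevant item.
    Being right early in the ranking counts for more."""
    hits, running = 0, 0.0
    for k, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            running += hits / k
    return running / hits if hits else 0.0

# Same three relevant items; only their position in the ranking differs.
early = average_precision([True, True, True, False, False, False])  # 1.0
late = average_precision([False, False, False, True, True, True])   # ~0.38
```

Same items found, very different score, purely because of where they land in the ranking.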

-> Ironically, there is a fair amount of arguing in the statistics world about interpretation. For instance, some think the ROC curve mentioned below is worthless. I have statisticians who read this site. Feel free to help if needed!

Another way to look at this is true positives vs false positives.

What we want in any predictive model is to spend most of our time in the upper left corner. We want to identify all the true positives while not misidentifying any negatives.

What typically happens though is some type of curve to the right. The more you identify every positive, the more you tend to identify false positives too.

[ROC curve figure] Credit: http://gim.unmc.edu/dxtests/roc3.htm

We can see at about .8 true positive, we identify .1 falsely.

As we get to identifying all the positives, we get to about a .75 false positive rate.
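Those points on the curve come straight from counts of hits and misses. A toy sketch with made-up numbers (100 real positives, 100 real negatives; not the paper’s data):

```python
def rates(tp, fp, fn, tn):
    """True positive rate (fraction of real positives we caught) and
    false positive rate (fraction of real negatives we wrongly flagged)."""
    return tp / (tp + fn), fp / (fp + tn)

# Catch 80 of 100 real positives while wrongly flagging 10 of 100 negatives:
tpr, fpr = rates(tp=80, fp=10, fn=20, tn=90)  # (0.8, 0.1)
```

That’s the “.8 true positive, .1 false” point described above; push tp higher and fp tends to climb with it.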

Second takeaway: AI is not omnipotent

It needs to be pointed out this paper was done by Andrew Ng, who is a god of the AI world. He’s a big part of the recent resurgence. He injected deep learning life into Google and Baidu, and this paper was done with colleagues out of Stanford, a top program. It’s hard to imagine getting more high-level than this.

-> Ng is also doling out healthcare advice left and right, such as pushing med students not to go into radiology because AI will soon do the job. The demise of radiologists has been wrongly predicted before. (I’m sure Ng would have also predicted ATMs would replace bank tellers, but they haven’t.)

Beware of tech people giving healthcare advice. “Software is eating the world” while people would still rather sit on the toilet after a night of spicy food than deal with an electronic health record. Which, despite being DATA ENTRY, is a top source of burnout for doctors.

Yet still, this is a predictive value many would shrug their shoulders at. Yes, in the world of statistics, that curve is excellent. However, in the real world, at best, we’re missing 20% of people, and falsely identifying 10%. If you reveal that to a patient, that leaves A LOT of room for interpretation.

Tons of cancer conversations go like,

“Well, there’s X% chance of this if we go this route. Y% that route. Yada yada.”

“Uh, ok. So what would you do?”

This is one of the enormous oversights of techies and engineers when they try to view human health. They think humans can be classical mechanics, but really, we’re quantum.

  • Classical means do X, get Y
  • Quantum means do X, there’s this % Y happens, that % Z happens, and % T happens

Not everything is easily predicted. Humans obviously are not. It’s important to reiterate: methodologically, we ain’t getting much better than this. That’s simply the reality when you’re dealing with probabilistic questions.

Third takeaway / reminder: humans often don’t care about evidence

There is endless, ENDLESS, data showing active investing loses to passive. Putting your money into an algorithm beats putting your money with a human. Yet human investing makes billions a year.

Statistically, every year some people aren’t aware passive investing exists. Some active investors go on a hot streak, beating the market for five years, before reverting to the mean. Or people think of Warren Buffett: “Beating the market can be done.” Lo and behold, people fall for it, and give their money to a human. Even though Warren Buffett recommends everyday people use passive investing!

Medicine is no different. We’ve known for decades algorithms can beat humans at diagnosis. Nobel Prizes have been given for showing how flawed human intuition is. In fact, the same types of researchers showing this are those who’ve shown algorithms win at investing!

-See The Undoing Project and Thinking, Fast and Slow

Daniel Kahneman, Nobel Prize recipient for showing human intuition’s flaws, has talked to investors. He once showed a firm, while giving a talk in their own building, using their own data, that their investing was no better than rolling dice. He’s recounted how, after the talk, an investor came up to him and more or less said, “I’m different.”

Do we really think a doctor isn’t going to say “I know the algorithm says I should refer you to palliative care, but I think we have a different approach here”? And patients won’t fall into it? Why does alternative medicine make billions a year?

I’m actually a long-term bull on AI. I laugh when people say self-driving cars are a lifetime away. But AI proponents have no clue the amount of inertia they’re dealing with in healthcare.

-> When the deep learning algorithm misidentifies a case, who is responsible? Who can the patient sue? Algorithms may be able to fuck up our elections without repercussions, but have you seen how people respond when you mess with their healthcare?!

This notion that doctors are going to be replaced soon, that AI is going to transform diagnosis, that robots will be doing surgery, is a mirage. AI will infiltrate healthcare, but it will be a brutally slow slog. (It already has been. Deep learning has been around for decades.) And much like investing, there’s no guarantee humans fall out completely.

Wait, was deep learning even better than other methods?

The authors, amazingly, didn’t even compare their results to other approaches. While machine learning predictive models are all the rage, predictive models have been around forever.

They do quote various other approaches, but don’t say how effective they are.

You can surmise this is too convenient. If you’re not comparing yourself to others, it reeks of not performing any better, if not worse.

Feel free to take a deeper look if you want (the paper is surprisingly accessible), but just looking at the abstracts, you can see their numbers largely match up with those they cited. Again, if there were some drastic leap in predictive ability, you’d imagine they’d mention it (and more media outlets would have picked this story up; only one did).

-> For the more technical reader, their ROC curve was basically the same as here and here. By the way, these papers are at least a few years old. Back to that healthcare inertia problem- implementing an approach can be harder than coming up with the approach. How do we get people to actually use an algorithm?

There was something noteworthy

One of the biggest impediments to deep learning, compared to other predictive models, is it can be very hard to tell someone why a result was spit out.

For instance, in many predictive models, the importance of each factor is out in the open (sometimes humans set it directly). So you can easily tell a patient or doctor, “These two variables are the most important. They’re predominantly why we recommend this approach.”

With deep learning you can’t always do that, because you don’t always even know what the hell the thing is looking at.

There is one little trick these authors did to get around this. After inputting,

  • age
  • gender
  • diagnoses
  • treatments
    • scans
    • drugs

and getting a result, they could go back and change, say, the number of treatments, and see how that changed the result. So they could get a picture of what the algorithm was weighing as most and least important. A patient might ask,

“Is it because of how many drugs we’ve tried?”

And they can change the number of drugs from e.g. 5 to 1, and see how the result changes.
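That perturb-and-rerun trick is easy to sketch. Everything below is hypothetical (`toy_model` is a stand-in for the trained network, not the paper’s actual model):

```python
def sensitivity(model, patient, field, new_value):
    """Re-run the model with one field changed; return how much the
    predicted score moves relative to the original input."""
    perturbed = dict(patient, **{field: new_value})
    return model(perturbed) - model(patient)

# Hypothetical stand-in: mortality score rises with drugs tried and age.
def toy_model(p):
    return 0.1 * p["num_drugs"] + 0.02 * p["age"]

patient = {"num_drugs": 5, "age": 45}
# "Is it because of how many drugs we've tried?" Drop 5 drugs to 1 and look:
delta = sensitivity(toy_model, patient, "num_drugs", 1)  # score falls by 0.4
```

Wiggle one input at a time and the black box starts to show which knobs it actually cares about.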

Not Earth-shattering, but this is a very important step toward giving advice to doctors and patients.

“The algorithm says so”

isn’t going to suffice as an explanation to put someone in end of life care.
