The diagnosis of sleep disorders is currently performed in sleep clinics using nocturnal polysomnography (PSG), a procedure that involves hooking patients up to many electrodes and having them sleep in a specialized facility. In other cases, a simpler hook-up that primarily measures breathing (home sleep apnea testing) is prescribed for use at home. Results are then interpreted by technicians and doctors through visual inspection and scoring of sleep stages (depth of sleep, and dreaming versus non-dreaming sleep, summarized in a “hypnogram”), respiratory events (related to sleep apnea), and periodic leg movements (leg kicks). The process is subject to human error and depends on the quality of the scorer. It is also not fully reproducible, in the sense that the same technician will not score the same sleep recording exactly the same way twice. In a recent article in Nature Communications, Stephansen et al. show that this process can be replaced by machine learning procedures. These can do better than human scorers, not only automating the results but also producing an improved hypnogram that gives a probability for each sleep stage, a representation we call the hypnodensity. Even better, the manuscript shows that these sleep stage probabilities can be used to diagnose specific sleep disorders such as narcolepsy, potentially replacing a longer test that requires napping during the day, called the Multiple Sleep Latency Test (MSLT). We predict that within a few years human scoring of sleep studies will be replaced by machine learning based scoring.
Sleep disorders are diagnosed using polysomnography sleep studies
Every year, millions of nocturnal sleep studies are conducted all over the world to test for sleep problems such as sleep apnea (gasping, snoring, and stopping breathing during sleep) and narcolepsy (severe sleepiness and the ability to go quickly into dreaming sleep). Currently, sleep clinics evaluate patients using a gold standard method called nocturnal polysomnography (PSG). The PSG allows the detection of sleep and wake, the type of sleep [Rapid Eye Movement (REM) sleep or non-REM sleep], and sleep apnea or hypopnea events (see “Sleep Apnea – Hard to Watch” https://www.youtube.com/watch?v=mjQdAf9cQBo). It also detects arousals (waking up) and movements that should not occur during sleep.
During a full, in-laboratory PSG, the following physiological signals are typically included:
Electroencephalogram (EEG): This is used to detect sleep onset, all sleep stages, and how much sleep is disturbed by other factors. For example, it can measure brief (4 second) arousal reactions secondary to sleep apnea or movements. Waking too often, even for a few seconds, can make you tired without your realizing it.
Electrooculogram (EOG): This is used to detect Rapid Eye Movements (REMs), which are quick movements of the eyes, and slow-rolling eye movements. These occur during REM sleep (when we dream) and light sleep, respectively. REM sleep is important to detect for narcolepsy (patients with narcolepsy fall into REM sleep faster). In addition, during REM sleep the muscles are paralyzed, and this makes the back of the airway more flaccid, leading to more severe sleep apnea. Sometimes sleep apnea only occurs in REM sleep. Detecting REM sleep can also be important for diagnosing REM behavior disorder, a condition that can precede Parkinson’s disease, wherein patients act out their dreams, frequently hurting themselves or their bed partners.
Electromyogram (EMG): This is used to detect the loss of all muscle tone and the paralysis that occur normally during REM sleep. The paralysis prevents us from acting out our dreams.
Electrocardiogram (ECG): It detects arrhythmias of the heart, in case the heart is affected by sleep apnea.
Leg EMG sensors: These detect leg movement activity in sleep-related movement disorders such as Periodic Leg Movements (PLMs) and parasomnias (such as sleepwalking). PLMs are common in patients who have Restless Leg Syndrome, characterized by funny feelings in the legs during the evening at rest that force patients to move their legs and interfere with falling asleep (see “Restless Leg Syndrome – Time Elapsed Video” https://www.youtube.com/watch?v=Pqke9lRn1h8).
Snoring microphone: This microphone detects snoring and helps confirm sleep apnea events.
Airflow sensor: It detects sleep apnea, characterized by a complete cessation of airflow through the nose and mouth, and milder events called hypopneas. It can also be used to distinguish between mouth breathing and apnea events.
Respiratory effort sensors: These detect thoracic and abdominal movement as a measure of respiratory effort. Effort to breathe differentiates between central and obstructive sleep apnea. In obstructive apnea, the patient still tries to breathe while choking, so the chest and abdomen continue trying to move to suck the air in, even as the upper airway is blocked. Obstructive apnea is associated with increased body weight. Central apnea is a brain problem where people just stop breathing. It can occur with heart disease and certain drugs, such as opioid medications.
Pulse oximetry: The blood oxygen saturation drops when apnea events are long, and over the long term this can induce high blood pressure and increase the risk of heart problems (including heart attack and heart failure) and strokes.
Body position sensor: It detects position-dependent (supine, lateral, or prone) sleep disturbances, commonly observed in obstructive sleep apnea. For example, people often have more sleep apnea or snoring on their back because the lower jaw and tongue can shift backward, especially when the mouth is open, pressing on the back of the airway and making it narrower. Some patients only have “positional” sleep apnea, meaning it occurs primarily on their back.
In some cases, “portable” sleep monitoring is conducted to identify clear sleep apnea, and in most of these cases only breathing and oxygen saturation are recorded.
As can be seen, the information gathered with sleep studies is quite complex and requires experts to look over the collected data to make a diagnosis. This is typically done by technicians and supervised by medical doctors. Amazingly, PSGs are still reviewed and scored page by page (or, more precisely nowadays, screen by screen) through the night in 30-second segments to extract simple features such as sleep latency and REM sleep latency (how long one takes to fall asleep, or to go from sleeping to dreaming), sleep stage proportions, the number of sleep apnea events (tallied as the Apnea-Hypopnea Index or AHI), and the number of periodic leg movements (tallied as the Periodic Leg Movement Index or PLMI). This is time-consuming, and the result varies between individual scorers. Studies have shown that the percent agreement between two technicians (or scorers) looking over the same sleep study is generally 65-80 percent.
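The summary features listed above can be derived mechanically once a hypnogram exists. The following is a minimal sketch, not the procedure used in clinics or in the study: the stage labels ("W", "N1", "N2", "N3", "REM"), the function name summarize_hypnogram, and the simple "first non-wake epoch" definition of sleep onset are all assumptions made for illustration. The AHI is computed here as respiratory events per hour of sleep.

```python
from collections import Counter

EPOCH_SECONDS = 30  # scoring epoch length described in the text


def summarize_hypnogram(stages, n_respiratory_events):
    """Compute simple summary features from a per-epoch hypnogram.

    stages: list of labels, one per 30-second epoch, using the
            hypothetical encoding "W", "N1", "N2", "N3", "REM".
    n_respiratory_events: count of apneas plus hypopneas for the night.
    """
    sleep_epochs = [i for i, s in enumerate(stages) if s != "W"]
    if not sleep_epochs:
        raise ValueError("no sleep epochs in recording")

    # Assumption: sleep onset is simply the first non-wake epoch.
    sleep_onset = sleep_epochs[0]
    rem_epochs = [i for i, s in enumerate(stages) if s == "REM"]

    total_sleep_hours = len(sleep_epochs) * EPOCH_SECONDS / 3600
    counts = Counter(stages[i] for i in sleep_epochs)

    return {
        "sleep_latency_min": sleep_onset * EPOCH_SECONDS / 60,
        # REM latency: time from sleep onset to the first REM epoch.
        "rem_latency_min": (
            (rem_epochs[0] - sleep_onset) * EPOCH_SECONDS / 60
            if rem_epochs
            else None
        ),
        # Proportion of sleep time spent in each stage.
        "stage_proportions": {
            stage: n / len(sleep_epochs) for stage, n in counts.items()
        },
        # Apnea-Hypopnea Index: events per hour of sleep.
        "ahi": n_respiratory_events / total_sleep_hours,
    }


# Made-up example night: 10 minutes awake, then 2 hours of sleep.
night = ["W"] * 20 + ["N1"] * 10 + ["N2"] * 110 + ["N3"] * 60 + ["REM"] * 60
summary = summarize_hypnogram(night, n_respiratory_events=30)
```

Clinical scoring rules define sleep onset and respiratory events more carefully than this sketch does, so the numbers it produces are only illustrative.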
A typical PSG hypnogram of a normal night of sleep is shown in the figure (top panel). During a normal night of sleep, patients first fall into light sleep known as stage 1 (during which the EEG slows down), then proceed to stage 2 (with special features called K complexes and sleep spindles), and finally progress to stage 3, a stage of sleep where slow waves develop (also called Slow Wave Sleep or SWS), reflecting maximal relaxation (stages 1-3 are collectively called non-REM or NREM sleep). This process lasts about 60 to 90 minutes, after which a new state appears called REM sleep. In REM sleep, the brain is very active (the EEG looks similar, though not identical, to that of wakefulness) and dreaming occurs together with rapid saccadic eye movements (REMs on the EOG). The muscle tone (EMG) is very low during REM sleep, and the body is paralyzed to avoid acting out one’s dreams. After a first period of REM sleep, the dreamer often wakes up or goes back into light NREM sleep to restart the cycle of light sleep to deep SWS to REM sleep, with a similar 90-minute period. More and more REM sleep takes over toward the end of the night, when the body temperature is lowest, an effect driven by the circadian clock.
Supervised machine learning to the rescue
A task performed by a human can be automated by supervised machine learning, wherein a computer learns to map a set of inputs to the corresponding outputs. A typical supervised machine learning task transforming society today is facial recognition, which can be found in modern smartphones. In this example, a mathematical model consisting of perhaps millions of adjustable parameters is shown a picture of a human face and has to decide whether the face is that of the smartphone owner or not. Through repeated exposure to example pictures, the model “learns” to perform this task automatically by adjusting its parameters accordingly.
Since the model is designed to mimic, in a simplified way, how our brain cells, or neurons, learn to recognize pictures or patterns, it is known as a neural network. When the input is a two-dimensional picture, a specific type of neural network called a convolutional neural network (CNN) is typically applied, since this kind of model is further modeled after how our visual system works. Neural networks can also take into account a time dimension, keeping in memory what was seen up to time t to make a determination at time t+1, in a video for example. In this case, the model is called a recurrent neural network (RNN). Supervised machine learning continues to progress at high speed, notably incorporating elements of natural selection to improve performance and enable self-learning (not unlike how a human learns), and the components described above are only the simplest building blocks of supervised machine learning methods.
Machine learning works best when it has a lot of “labelled” or “annotated” data. Therefore, it is ideal for the interpretation of sleep studies, which are simply successions of signals that have been annotated by humans. These data include sleep stage time series and the presence of events recognized by humans based on specific signal characteristics (including sleep apnea or leg movements). In this study, the authors transformed the various signals described above (EEG, EOG, and EMG) into more easily interpretable representations that better show the structure of the data, for example using frequency analysis, and fed these to a network together with the sleep stage assigned by a sleep technician to each 30-second epoch. Thousands of annotated studies were fed to the network for it to find all the ways specific sleep stages can be recognized, and in the end it outputs a probability for each stage at any given time, a representation called the hypnodensity. The hypnodensity can be transformed into a hypnogram simply by taking the stage with the highest probability in each 30-second epoch. However, the hypnodensity graph gives more information than the hypnogram, as confidence in the prediction is shown by how dominant a given stage is at any moment.
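The step from hypnodensity to hypnogram described above amounts to picking the most probable stage in each epoch. Here is a minimal sketch under stated assumptions: the stage order in STAGES and the function name hypnodensity_to_hypnogram are chosen for illustration and are not taken from the study, and the probabilities are made up.

```python
# Assumed column order for the per-epoch probabilities.
STAGES = ("W", "N1", "N2", "N3", "REM")


def hypnodensity_to_hypnogram(hypnodensity):
    """Collapse per-epoch stage probabilities into a hypnogram.

    hypnodensity: list of rows, one per 30-second epoch, each row
                  holding one probability per entry in STAGES.
    Returns (hypnogram, confidence): the most probable stage per
    epoch and the probability assigned to that stage.
    """
    hypnogram, confidence = [], []
    for probs in hypnodensity:
        if len(probs) != len(STAGES):
            raise ValueError("expected one probability per stage")
        best = max(range(len(STAGES)), key=lambda i: probs[i])
        hypnogram.append(STAGES[best])
        confidence.append(probs[best])
    return hypnogram, confidence


# Three made-up epochs: a clear wake epoch, an ambiguous N1/N2
# transition, and a fairly clear N3 epoch.
example = [
    [0.90, 0.05, 0.03, 0.01, 0.01],
    [0.10, 0.45, 0.40, 0.02, 0.03],
    [0.02, 0.03, 0.15, 0.78, 0.02],
]
stages, conf = hypnodensity_to_hypnogram(example)
```

In the second epoch the hypnogram records only "N1", while the hypnodensity row shows that stage 2 was almost as likely. That is the extra information the hypnogram discards.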
How do you prove that the machine does better than humans?
One of the obvious benefits of our ML-based scoring is that a given input always produces the same scoring, whereas a human would never score a study exactly the same way twice. Simply said, it is more reproducible. Since it is trying to imitate humans, one may however wonder how it is possible to show that it can also do better than human scorers. One simple way is to have the same sleep study scored by multiple independent scorers. These scorers each have a slightly different interpretation of each 30-second epoch (for example, four scorers may say stage 1 and another may say stage 2), or make occasional mistakes. As a result, a better scored study can be assembled by taking the majority vote of all the scorers for each 30-second interval (a so-called consensus score). Simply put, it is as if the sleep study were seen by five different scorers and the group decided for each epoch which sleep stage is the best decision, creating an even better estimate of what many experts would think. This consensus is only better than what a single expert would think provided the experts are all at a similar level. Once this is done, it becomes possible to compare machine learning scored hypnograms with the improved five-scorer consensus hypnogram and with the hypnograms made by each individual scorer independently. Doing this, we found that the ML-predicted hypnograms agreed with the consensus better than any of the individual scorers did, indicating that the ML did better than a single human.
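The consensus comparison described above can be sketched in a few lines. This is an illustration under assumptions, not the evaluation code of the study: the function names consensus and agreement are hypothetical, the scorings are made up, and ties in the vote are resolved in favor of the label that appears first among the scorers.

```python
from collections import Counter


def consensus(scorings):
    """Majority vote per epoch across several scorers.

    scorings: list of hypnograms (lists of stage labels), all of
              the same length, one per scorer.
    Ties go to the label encountered first among the scorers,
    which is an arbitrary choice made for this sketch.
    """
    return [
        Counter(epoch_labels).most_common(1)[0][0]
        for epoch_labels in zip(*scorings)
    ]


def agreement(a, b):
    """Fraction of epochs on which two hypnograms agree."""
    if len(a) != len(b) or not a:
        raise ValueError("hypnograms must be non-empty and equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Five made-up human scorings of the same five epochs.
scorers = [
    ["W", "N1", "N2", "N2", "REM"],
    ["W", "N1", "N2", "N3", "REM"],
    ["W", "N2", "N2", "N3", "REM"],
    ["W", "N1", "N1", "N3", "REM"],
    ["N1", "N1", "N2", "N3", "W"],
]
reference = consensus(scorers)

# Made-up ML output for the same epochs.
ml_hypnogram = ["W", "N1", "N2", "N3", "REM"]

human_scores = [agreement(s, reference) for s in scorers]
ml_score = agreement(ml_hypnogram, reference)
```

Note that each human scorer here contributes to the consensus they are compared against, which favors the humans slightly. A stricter comparison might rebuild the consensus without the scorer being evaluated.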
Even more interestingly, the epochs where the machine learning had difficulty deciding between stages, for example in the transition between stage 2 and stage 3, paralleled those where humans disagreed. The hypnodensity approach is therefore attractive, especially in cases where scoring might be difficult, such as in the transition between stage 1 and stage 2, or during brief moments of awakening, also called arousals. We propose that hypnodensity charts can be used as supplements to classic hypnograms to represent sleep stage distributions in future sleep studies.
Beyond human imitation: application to the diagnosis of narcolepsy
An advantage of machine learning is that it never forgets. A disadvantage is that if it has not been taught something, it will do its best approximation (like a human being). As an example, if an epileptic seizure occurs in the sleep data, the EEG pattern would be novel and the ML program will try to find the best fit, which may be deep sleep or wake. To check that the ML programs worked well in the context of sleep disorders, we analyzed thousands of normal studies or studies of patients with sleep apnea, leg movements, insomnia, and narcolepsy. Whereas the correlation between human scorers and the ML-derived sleep stages was equally good in all other pathologies (seizures were not studied, but would also likely show strange probability distributions), we found that many more disagreements occurred between ML and human scoring in the case of type 1 narcolepsy.
Type 1 narcolepsy is a strange pathology. It is caused by the loss of approximately 20,000 hypothalamic neurons containing the neuropeptide hypocretin (orexin). The cause of this loss is an autoimmune reaction where the body destroys these cells, mistaking them (in most cases) for influenza A virus particles. The most problematic symptom in narcolepsy is sleepiness, although sudden weight gain, sometimes dramatic, is also frequently associated. As hypocretin (orexin) helps to promote wakefulness, patients with type 1 narcolepsy are very sleepy and fall asleep suddenly during the day, needing to nap many times every day. Another symptom is cataplexy, sudden episodes of muscle weakness triggered by emotions, most typically knee buckling or jaw opening when telling or hearing a joke. Other frequently occurring symptoms include sleep paralysis (waking up paralyzed from REM sleep), hypnagogic hallucinations (dreaming while half awake and thinking it is real), and disturbed nocturnal sleep. Many of the symptoms of narcolepsy are believed to be due to the unusual ability of patients to transition rapidly from wakefulness into REM sleep and to experience dissociated REM sleep events, in which the patient seems half awake and half in REM sleep. Patients with narcolepsy also have an abnormal sleep cycle, entering REM sleep within minutes of falling asleep instead of after the classic 90-minute cycle.
To our surprise, when plotting hypnodensity graphs of untreated patients with type 1 narcolepsy, we quickly realized that the ML program often had trouble concluding whether the sleeper was awake, in stage 2 (light sleep), or in REM sleep, an unusual pattern. This immediately gave us the idea of testing whether this feature could be used to diagnose narcolepsy. Indeed, by extracting various parameters from the hypnodensity graphs, we were able to build a model that diagnoses narcolepsy with roughly 95% sensitivity and 95% specificity, similar to the MSLT, which requires daytime napping. As a nighttime sleep study is performed much more commonly than an MSLT, typically to detect sleep apnea, it is now possible to automatically analyze the data to check whether the patient could also have narcolepsy.
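For readers unfamiliar with the two figures quoted above, sensitivity is the fraction of true narcolepsy cases that the model flags, and specificity is the fraction of non-narcolepsy cases that it correctly leaves unflagged. A minimal sketch of the arithmetic follows. The function name sensitivity_specificity and the ten example patients are made up for illustration and have no connection to the study's data.

```python
def sensitivity_specificity(truth, predicted):
    """Compute sensitivity and specificity of a binary classifier.

    truth, predicted: equal-length lists of booleans, where True
    means "has the condition" (here, narcolepsy).
    """
    tp = fn = tn = fp = 0
    for actual, guess in zip(truth, predicted):
        if actual and guess:
            tp += 1  # true positive: case correctly flagged
        elif actual and not guess:
            fn += 1  # false negative: case missed
        elif not actual and not guess:
            tn += 1  # true negative: non-case correctly cleared
        else:
            fp += 1  # false positive: non-case wrongly flagged
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative case")
    return tp / (tp + fn), tn / (tn + fp)


# Made-up example: 4 patients with narcolepsy, 6 without.
truth = [True] * 4 + [False] * 6
predicted = [True, True, True, False,
             False, False, False, False, False, True]
sens, spec = sensitivity_specificity(truth, predicted)
```

In this toy example three of four cases are caught (sensitivity 0.75) and five of six non-cases are cleared (specificity about 0.83).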
The future: at-home brain health check-ups
The rise of analytical methods such as machine learning is paralleled by rapid improvements in hardware miniaturization and in the ability to sense EEG and other signals. Although at-home monitoring is right now represented only by simple actigraphy monitors that are not very informative or predictive, this will change as new sensors are added, analyzed, and validated using machine learning. For this reason, we believe that PSGs will soon become a simplified procedure conducted at home, with signals automatically analyzed by ML and sent to a doctor for interpretation. This will greatly facilitate the diagnosis and monitoring of sleep disorders such as narcolepsy and sleep apnea. Most importantly, it will also be possible to monitor and adjust responses to treatment at home. We also predict that monitoring sleep and associated physiology during sleep will become commonplace as it becomes more comfortable, and will be used to conduct regular “brain health” checks. Indeed, sleep is a window to brain functioning, uncontaminated by the external world. Disturbances undetectable to the human eye are likely to be found, possibly predicting the development of psychiatric disorders (including depression) or neurological problems (Parkinson’s disease or dementia) before they manifest in wake, allowing prediction, if not outright prevention through at-home therapies. The future is bright for the sleep field if people work together and if governments, the private sector, and foundations (all necessary pieces) invest more funding in this underfunded area.