Data-Driven Road Trip Stop 3: Predicting Old Faithful
Reading time: 5 minutes
This blog post series highlights data investigations sparked by Analyst Lily Brennan’s cross-country road trip from New Hampshire to Joshua Tree. Since we can’t all hit the road, we’re doing the next best thing: analyzing data related to some of the stops Lily’s making on her trip.
This week, Lily made her way to the country’s first and oldest National Park – Yellowstone! While waiting for Old Faithful to erupt, her Analyst mind started wondering if she could use a predictive model to know exactly when she could expect the geyser to blow. Lily writes in from the road on geysers, data quality, and just how ‘faithful’ her predictive model turned out to be.
Founded in 1872, Yellowstone National Park was the first defined National Park in the world. The recreation area, which spans parts of Montana, Wyoming, and Idaho, sits proudly on top of a volcanic hot spot. As a result, the park’s most famous feature is its variety of hot springs and geysers – including my third stop of the trip: Old Faithful!
Old Faithful erupts reliably around every 60-110 minutes. As my boyfriend and I arrived at the geyser, I wondered: should we get comfortable and camp out for 90 minutes? Or are we going to miss the show if we run into the gift shop first?
A cue from the crowd tipped me off: there were a number of tourists already in position to view the next eruption. If Old Faithful had just erupted, the crowd who had just watched the exciting show would probably disperse and move on to other attractions. However, if an eruption was approaching, you could expect a larger crowd in position to get the best view.
Since we didn’t have enough cell service to google “When does Old Faithful erupt next?”, I surveyed the crowd of people and started making bets that it would go within the next 15 minutes. Sure enough, it happened 12 minutes later. And although my ‘size of crowd/number of tourists’ gauge was very scientific… my data analyst mind started wondering if I could instead create a predictive model to anticipate more accurately when the next eruption would be.
Building a Model to Predict Old Faithful
When we returned to cell service, I did a little research. I learned that Yellowstone National Park’s Rangers can predict the next eruption within about 10 minutes with 90% confidence. The rangers use a combination of the interval of time between eruptions, the duration of each eruption, and the height of the eruption to determine the time of the next eruption. A longer eruption expels more water, therefore taking longer to refill the hot water reservoir below (a full reservoir = an eruption) and extending the interval of time until the next eruption.
To get started on my model, I first needed to find some data. I found that R has a built-in dataset called “faithful” containing both “eruption duration” and “time between eruptions”. However, there was no information included on when this data was recorded. In an attempt to find more recent records, I came across geysertimes.org, the best source I found for data containing date, duration, and interval wait time of eruptions, although even this data source was limited. The NPS clearly keeps detailed records, but up-to-date data is a bit more difficult to find. Because of this, I decided to build the model on the data included in R, which was the most consistent, albeit not as recent.
First Attempt: Model 1 applied to geysertimes.org data
Before I built the model in Predict, I took a quick look at the visualization tab to see the relationship between the length of the eruption and the length of the wait time. The relationship appeared close to linear, which I expected based on what we know about how geysers work. Next, I built the model in Predict. Finally, as a test, I decided to apply the formula to the more recent (and limited) data I found from October 21st, 2020 that contained a couple of records of eruption duration and wait time to see how the predictions lined up with the true wait time interval.
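For readers who want to try the same idea outside of Predict, here is a minimal sketch of a one-variable linear fit in Python. It is not the Predict workflow itself: the function names `fit_line` and `predict_wait` are hypothetical, and the four data points are made up for illustration rather than taken from the R dataset or geysertimes.org.

```python
def fit_line(durations, waits):
    """Ordinary least squares fit of: wait = slope * duration + intercept."""
    n = len(durations)
    mean_x = sum(durations) / n
    mean_y = sum(waits) / n
    # Sum of squared deviations of x, and of x-y cross deviations.
    sxx = sum((x - mean_x) ** 2 for x in durations)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(durations, waits))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict_wait(duration, slope, intercept):
    """Predicted wait (minutes) after an eruption of the given duration."""
    return slope * duration + intercept


# Made-up (duration, wait) pairs in minutes, NOT real Old Faithful records.
durations = [2.0, 3.0, 4.0, 4.5]
waits = [55.0, 65.0, 75.0, 80.0]

slope, intercept = fit_line(durations, waits)
print(f"wait = {slope:.2f} * duration + {intercept:.2f}")
print(f"predicted wait after a 3.5-minute eruption: "
      f"{predict_wait(3.5, slope, intercept):.1f} minutes")
```

Swapping in real duration and wait columns would give a formula that can then be applied to newer records, which is the test described above.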
My predictions consistently wound up low by about 20 minutes when applied to current eruptions. I was hoping to create a model that would produce predictions far closer to the true wait time than this. I set out to improve the model. Since it only contained one independent variable, my first instinct was to add additional independent variables into the pool. However, this brought me back to the challenge of finding reliably recorded data. Luckily, I thought of a way to create additional variables from the existing variable of eruption duration using Construct.
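A quick way to tell "consistently low" apart from "just noisy" is to compare the signed error with the absolute error. The sketch below assumes you have paired lists of predicted and actual wait times; the function names are hypothetical and the numbers are illustrative stand-ins, not the actual scored records.

```python
def mean_signed_error(predicted, actual):
    """Average of (predicted - actual). Negative means predictions run low."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    return sum(p - a for p, a in zip(predicted, actual)) / len(predicted)


def mean_absolute_error(predicted, actual):
    """Average size of the miss, ignoring direction."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)


# Illustrative values in minutes, NOT the real October 2020 records.
predicted = [70.0, 72.0, 68.0]
actual = [90.0, 93.0, 87.0]

bias = mean_signed_error(predicted, actual)
mae = mean_absolute_error(predicted, actual)
print(f"mean signed error: {bias:.1f} min, mean absolute error: {mae:.1f} min")
```

When the signed error is nearly as large in magnitude as the absolute error, the misses all point the same way, which suggests a systematic shift rather than random scatter.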
Second Attempt: Model 2 applied to geysertimes.org data
In Construct, I used the Transform node to create a new variable that averaged the wait time from the last three eruptions. However, even after adding this new variable and re-scoring, the predictions still significantly differed from what I hoped for. The Rangers had me beat, but I had one last idea.
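The same kind of derived variable can be sketched in Python. This is an approximation of the idea, not the Construct Transform node: the function name is hypothetical, and it assumes the average should use only the three waits before each eruption, so the value being predicted never leaks into its own feature.

```python
def rolling_mean_of_previous(waits, window=3):
    """For each record, average the `window` waits that came before it.

    Records without enough history get None instead of a partial average.
    """
    features = []
    for i in range(len(waits)):
        if i < window:
            features.append(None)
        else:
            previous = waits[i - window:i]
            features.append(sum(previous) / window)
    return features


# Illustrative wait times in minutes, in chronological order.
waits = [60, 90, 90, 75, 80]
print(rolling_mean_of_previous(waits))
```

The first three records have no complete history, so they would need to be dropped or filled in before modeling.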
Third Attempt: Model 2 applied to R data
Building the model on old data and trying to score it on new data did not stand the test of Nature’s Timepiece. I decided to take a different approach. I kept 72 records as a holdout sample from the original “faithful” dataset. Then I built the model on a random sample of the other 200 records. Finally, I scored the 72 validation records with this new model. The resulting predictions proved to be much closer to the actual wait times (plus or minus 5 minutes). The average difference between the expected and actual wait times was only 0.302 minutes, or about 18 seconds.
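A holdout split like this one can be sketched in a few lines of Python. The helper name `split_holdout` and the seed are assumptions for illustration, and the records here are just stand-in row numbers (200 + 72 = 272 rows, matching the split described above), not the actual data.

```python
import random


def split_holdout(records, n_train, seed=0):
    """Shuffle a copy of the records and split into (train, holdout)."""
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    shuffled = list(records)
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]


# Stand-in row ids; replace with real (duration, wait) records.
records = list(range(272))
train, holdout = split_holdout(records, n_train=200)
print(len(train), len(holdout))
```

The model is fit only on `train`, and the error is then measured only on `holdout`, so the score reflects records the model has never seen.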
The results of model 2 applied to the R data were really good, even rivaling those of the Rangers. But how is it that the model performed so poorly on the more recent geysertimes.org data, yet so well on the timeless R data? The answer to that lies not in the model but underground…
Shifting Eruption Times
As it turns out, it is well known that the interval between eruptions at Old Faithful is changing. Since a small earthquake in 1998, Old Faithful has erupted less frequently than it did before. The average interval lengthened from about 76 minutes to about 80 minutes. So it is no surprise that our model built on outdated data didn’t stand up to current data. And it brings up a good lesson on the importance of updating your predictive models regularly: base them on the most recent historical data to get the best results.
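To see why a shift like this hurts, here is a small Python sketch under loudly stated assumptions: the data is made up, the "new" waits are simply the "old" waits plus 4 minutes (mirroring the 76-to-80 change), and the function names are hypothetical. A line fit to the old records runs low on the new ones, while refitting on the new records removes that bias.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of: y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_signed_error(slope, intercept, xs, ys):
    """Average of (predicted - actual). Negative means predictions run low."""
    return sum((slope * x + intercept) - y for x, y in zip(xs, ys)) / len(xs)


# Made-up records in minutes, NOT real Old Faithful data.
durations = [2.0, 3.0, 4.0, 4.5]
old_waits = [55.0, 65.0, 75.0, 80.0]
new_waits = [w + 4.0 for w in old_waits]  # assumed uniform 4-minute shift

old_model = fit_line(durations, old_waits)
stale_bias = mean_signed_error(*old_model, durations, new_waits)

new_model = fit_line(durations, new_waits)
fresh_bias = mean_signed_error(*new_model, durations, new_waits)

print(f"old model on new data: {stale_bias:+.1f} min")
print(f"refit model on new data: {fresh_bias:+.1f} min")
```

Real drift is rarely a clean constant offset, so this only illustrates the direction of the problem, not its actual size.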
Ultimately, I couldn’t find any faithful data, and perhaps the model didn’t blow your mind. But it was a fun experiment! In the end, I learned two things:
- Old Faithful eruptions are in fact changing
- It’s surprisingly difficult to find consistent data around the geyser
Was this analysis way more than my boyfriend was hoping for when we visited Yellowstone National Park? Maybe. But he should know he is travelling with a Data Analyst! And to all you other analyst minds out there I say this: if you’re ever in Yellowstone National Park, it might be best to stick to the Ranger predictions… your family will thank you.
See you on the road,
-Lily, Rapid Insight Data Analyst