Recreating the US birthday analysis with Prophet
Background
In this post I will try to model the number of births in the US, using seasonal effects, day of week effects and special holiday effects. The original analysis used Gaussian Processes, it was included in BDA3 (on the cover actually). In the first version of the book the model was too big to be fit with Stan, so it was fit with a different package called GPStuff
. Here is the code that created the materials of the book.
Since then Stan has improved further and it looks like it is now possible to fit the original model Stan, as explained in this recent post by Aki Vehtari.
As described in the book, and a series of posts such as this on Andrew Gelman’s blog, the question was whether there are excess births on Valentine’s day and fewer births on Halloween. This question provided an excuse to examine the effect of any holiday in the US, and more generally any day of the year. It also served as an excuse to demonstrate the use of Gaussian Processes as a time series decomposition technique.
Here is the headline chart from that analys (recreated here from the gpstuff page):
Plan for this post
Here I try to the same analysis but using a simpler additive model based on Fourier Transforms, instead of Gaussian Processes. I am also using it as an excuse to learn how Prophet works, a package for time series forecasting implemented by a team at facebook.
Tha package is pretty simple to use and fast to fit. I was able to get most of the paterns right with little experimentation, though not completely right. The ease of use and the ability to add custom components makes Prophet a great tool. The most obvious drawback is that it’s made to accommodate daily data and there is no easy way to generalize to other frequencies.
Setup
Load packages
import pandas as pd
from prophet import Prophet
import altair as alt
import holidays
from prophet.plot import add_changepoints_to_plot
alt.data_transformers.enable('default', max_rows=None)
Get data
df = pd.read_csv('https://raw.githubusercontent.com/avehtari/casestudies/master/Birthdays/data/births_usa_1969.csv')
df['ds'] = pd.to_datetime(df.year.astype(str)+'-'+df.month.astype(str)+'-'+df.day.astype(str))
df.rename({'births':'y'}, axis=1, inplace=True)
df = df[['ds','y']]
df.head()
ds | y | |
---|---|---|
0 | 1969-01-01 | 8486 |
1 | 1969-01-02 | 9002 |
2 | 1969-01-03 | 9542 |
3 | 1969-01-04 | 8960 |
4 | 1969-01-05 | 8390 |
Model Definition
Define the standard model and add a custom effect that I call “each_day”. It has period
365.25 because I want to model each day of the year. It also has a high fourier_order
to make it very flexible. I experimented with lower values and found that I need a fairly high value to pick up the signals I am looking for. More experimentation is needed to pick this value though.
m = Prophet(changepoint_prior_scale = 0.01, yearly_seasonality=10)
## add effect for day of the year effect for more long term patters
m.add_seasonality(name='each_day_long_term', period=365.25, fourier_order=4)
## add effect for day of the year:
m.add_seasonality(name='each_day', period=365.25, fourier_order=100)
m.fit(df)
17:03:30 - cmdstanpy - INFO - Chain [1] start processing
17:03:59 - cmdstanpy - INFO - Chain [1] done processing
<prophet.forecaster.Prophet at 0x7fe099e03910>
Fit the model to data
future = m.make_future_dataframe(periods=1)
future.tail()
ds | |
---|---|
7301 | 1988-12-28 |
7302 | 1988-12-29 |
7303 | 1988-12-30 |
7304 | 1988-12-31 |
7305 | 1989-01-01 |
Generate estimates and forecasts
forecast = m.predict(future)
forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail()
ds | yhat | yhat_lower | yhat_upper | |
---|---|---|---|---|
7301 | 1988-12-28 | 11680.646441 | 11239.300965 | 12094.109632 |
7302 | 1988-12-29 | 11724.885805 | 11294.168795 | 12215.288371 |
7303 | 1988-12-30 | 11502.343208 | 11071.599362 | 11957.702909 |
7304 | 1988-12-31 | 9078.021852 | 8622.992957 | 9533.391236 |
7305 | 1989-01-01 | 7960.215673 | 7521.253713 | 8409.581555 |
Create the basic plot of the underlying trend. I chose to visualize the change points for more clarity. The exacat number of points is a choice that requires callibration but I think it does not affect my problem too much so I move on.
fig1 = m.plot(forecast)
a = add_changepoints_to_plot(fig1.gca(), m, forecast)
Create the plot of all the different components. This is an automatic output from Prophet
and it is fairly close to the headline chart of the original analysis. But we are missing the key chart of the day of year (see below).
fig2 = m.plot_components(forecast)
And finally here we create the day of the year plot we were aiming for.
day_of_year_effect = forecast[['ds','each_day']].copy()
day_of_year_effect['indicative_day'] = '1999' +'-' + day_of_year_effect['ds'].dt.month.astype(str) +'-' + day_of_year_effect['ds'].dt.day.astype(str)
day_of_year_effect = day_of_year_effect.groupby('indicative_day')[['each_day']].mean().reset_index()
day_of_year_effect
indicative_day | each_day | |
---|---|---|
0 | 1999-1-1 | -1.301688 |
1 | 1999-1-10 | -0.163488 |
2 | 1999-1-11 | -0.017994 |
3 | 1999-1-12 | -0.053865 |
4 | 1999-1-13 | -0.081956 |
... | ... | ... |
361 | 1999-9-5 | 0.077351 |
362 | 1999-9-6 | 0.131178 |
363 | 1999-9-7 | 0.119668 |
364 | 1999-9-8 | 0.314853 |
365 | 1999-9-9 | 0.463697 |
366 rows × 2 columns
fig3 = alt.Chart(day_of_year_effect).mark_line().encode(
x='indicative_day:T',
y='each_day'
)
holiday_data = pd.DataFrame.from_dict(holidays.US(years=forecast['ds'].dt.year.values).items())
holiday_data.columns = ['ds', 'holiday']
holiday_data['ds'] = pd.to_datetime(holiday_data['ds'])
ignore_observed_dates = True
ignore_index = holiday_data['holiday'][holiday_data['holiday'].str.contains('Observed')].index
holiday_data_without_observed = holiday_data[~holiday_data.index.isin(ignore_index)]
if ignore_observed_dates:
holiday_data = holiday_data_without_observed
holiday_effect = forecast[['ds', 'each_day']].merge(holiday_data, on='ds', how='outer').groupby('holiday').agg({'ds':'first', 'each_day':'mean'}).reset_index()
holiday_effect.rename({'each_day':'effect size'}, axis=1, inplace=True)
holiday_effect['indicative_day'] = '1999' +'-' + holiday_effect['ds'].dt.month.astype(str) +'-' + holiday_effect['ds'].dt.day.astype(str)
fig4 = alt.Chart(holiday_effect).mark_text().encode(
x='indicative_day:T',
y='effect size',
text='holiday'
)
(fig3 + fig4 ).properties(width=600, height=250)