Background

In this post I will try to model the number of births in the US, using seasonal effects, day of week effects and special holiday effects. The original analysis used Gaussian Processes, it was included in BDA3 (on the cover actually). In the first version of the book the model was too big to be fit with Stan, so it was fit with a different package called GPStuff. Here is the code that created the materials of the book.

Since then Stan has improved further and it looks like it is now possible to fit the original model Stan, as explained in this recent post by Aki Vehtari.

As described in the book, and a series of posts such as this on Andrew Gelman’s blog, the question was whether there are excess births on Valentine’s day and fewer births on Halloween. This question provided an excuse to examine the effect of any holiday in the US, and more generally any day of the year. It also served as an excuse to demonstrate the use of Gaussian Processes as a time series decomposition technique.

Here is the headline chart from that analys (recreated here from the gpstuff page):

Plan for this post

Here I try to the same analysis but using a simpler additive model based on Fourier Transforms, instead of Gaussian Processes. I am also using it as an excuse to learn how Prophet works, a package for time series forecasting implemented by a team at facebook.

Tha package is pretty simple to use and fast to fit. I was able to get most of the paterns right with little experimentation, though not completely right. The ease of use and the ability to add custom components makes Prophet a great tool. The most obvious drawback is that it’s made to accommodate daily data and there is no easy way to generalize to other frequencies.

Setup

Load packages

import pandas as pd
from prophet import Prophet
import altair as alt
import holidays
from prophet.plot import add_changepoints_to_plot
alt.data_transformers.enable('default', max_rows=None)

Get data

df = pd.read_csv('https://raw.githubusercontent.com/avehtari/casestudies/master/Birthdays/data/births_usa_1969.csv')
df['ds'] = pd.to_datetime(df.year.astype(str)+'-'+df.month.astype(str)+'-'+df.day.astype(str))
df.rename({'births':'y'}, axis=1, inplace=True)
df = df[['ds','y']]
df.head()
ds y
0 1969-01-01 8486
1 1969-01-02 9002
2 1969-01-03 9542
3 1969-01-04 8960
4 1969-01-05 8390

Model Definition

Define the standard model and add a custom effect that I call “each_day”. It has period 365.25 because I want to model each day of the year. It also has a high fourier_order to make it very flexible. I experimented with lower values and found that I need a fairly high value to pick up the signals I am looking for. More experimentation is needed to pick this value though.

m = Prophet(changepoint_prior_scale = 0.01, yearly_seasonality=10)
## add effect for day of the year effect for more long term patters
m.add_seasonality(name='each_day_long_term', period=365.25, fourier_order=4)
## add effect for day of the year: 
m.add_seasonality(name='each_day', period=365.25, fourier_order=100)
m.fit(df)
17:03:30 - cmdstanpy - INFO - Chain [1] start processing
17:03:59 - cmdstanpy - INFO - Chain [1] done processing





<prophet.forecaster.Prophet at 0x7fe099e03910>

Fit the model to data

future = m.make_future_dataframe(periods=1)
future.tail()
ds
7301 1988-12-28
7302 1988-12-29
7303 1988-12-30
7304 1988-12-31
7305 1989-01-01

Generate estimates and forecasts

forecast = m.predict(future)
forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail()
ds yhat yhat_lower yhat_upper
7301 1988-12-28 11680.646441 11239.300965 12094.109632
7302 1988-12-29 11724.885805 11294.168795 12215.288371
7303 1988-12-30 11502.343208 11071.599362 11957.702909
7304 1988-12-31 9078.021852 8622.992957 9533.391236
7305 1989-01-01 7960.215673 7521.253713 8409.581555

Create the basic plot of the underlying trend. I chose to visualize the change points for more clarity. The exacat number of points is a choice that requires callibration but I think it does not affect my problem too much so I move on.


fig1 = m.plot(forecast)
a = add_changepoints_to_plot(fig1.gca(), m, forecast)

Create the plot of all the different components. This is an automatic output from Prophet and it is fairly close to the headline chart of the original analysis. But we are missing the key chart of the day of year (see below).

fig2 = m.plot_components(forecast)

And finally here we create the day of the year plot we were aiming for.

day_of_year_effect = forecast[['ds','each_day']].copy()
day_of_year_effect['indicative_day'] = '1999' +'-' + day_of_year_effect['ds'].dt.month.astype(str) +'-' + day_of_year_effect['ds'].dt.day.astype(str)
day_of_year_effect = day_of_year_effect.groupby('indicative_day')[['each_day']].mean().reset_index()
day_of_year_effect
indicative_day each_day
0 1999-1-1 -1.301688
1 1999-1-10 -0.163488
2 1999-1-11 -0.017994
3 1999-1-12 -0.053865
4 1999-1-13 -0.081956
... ... ...
361 1999-9-5 0.077351
362 1999-9-6 0.131178
363 1999-9-7 0.119668
364 1999-9-8 0.314853
365 1999-9-9 0.463697

366 rows × 2 columns

fig3 = alt.Chart(day_of_year_effect).mark_line().encode(
    x='indicative_day:T',
    y='each_day'
)
holiday_data = pd.DataFrame.from_dict(holidays.US(years=forecast['ds'].dt.year.values).items())
holiday_data.columns = ['ds', 'holiday']
holiday_data['ds'] = pd.to_datetime(holiday_data['ds'])

ignore_observed_dates = True
ignore_index = holiday_data['holiday'][holiday_data['holiday'].str.contains('Observed')].index
holiday_data_without_observed = holiday_data[~holiday_data.index.isin(ignore_index)]
if ignore_observed_dates:
    holiday_data = holiday_data_without_observed
holiday_effect = forecast[['ds', 'each_day']].merge(holiday_data, on='ds', how='outer').groupby('holiday').agg({'ds':'first', 'each_day':'mean'}).reset_index()
holiday_effect.rename({'each_day':'effect size'}, axis=1, inplace=True)
holiday_effect['indicative_day'] = '1999' +'-' + holiday_effect['ds'].dt.month.astype(str) +'-' + holiday_effect['ds'].dt.day.astype(str)

fig4 = alt.Chart(holiday_effect).mark_text().encode(
    x='indicative_day:T',
    y='effect size',
    text='holiday'
)
(fig3 + fig4 ).properties(width=600, height=250)