Facebook Time Series Analysis
This post contains the analysis of a Messenger group chat with multiple participants for the time window from 2016-08-04 11:03:40 to 2022-02-15 14:38:42, which is a total time span of 2021 days. All personal data has been stripped out, anonymized and replaced by names from Family Guy without affecting the data itself (especially the time series). The analysis below consists of the following:
- Data Import (combine json files and load them to Python)
- Descriptive Statistics (exploratory data analysis, along with text mining)
- Time Series Analysis (deep dive into time series techniques)
- Time Series Predictions (apply different methods for time series prediction: Simple, Double and Triple Exponential Smoothing; ARIMA, including Seasonal ARIMA (SARIMA) and Seasonal ARIMA with an exogenous variable (SARIMAX); and the Facebook Prophet model)
- Conclude the Predictions (Compare the evaluation of Time Series Models)
- Forecast the Future (retrain all the models on the whole dataset and forecast the future from Feb 2022 until Feb 2023, so the results can be validated in a couple of months).
To reduce the length of the report, most of the code generating the analysis was removed. If you are interested in viewing the whole analysis and Python code, you can find the complete Jupyter Notebook on my Github.
I. Data Import
Once you request the data from Facebook, it can be delivered in multiple .json files depending on the size of the conversation. Below we start by reading the combined .csv file; if you are interested in the function which merges all .json files into a single .csv, refer to my Github.
Read the .csv for analysis.
# Imports (assumed; the full import list is in the notebook)
import pandas as pd
import numpy as np

# Read data
df = pd.read_csv('output_anonym.csv', names=['Date', 'Name', 'Text'])
# Convert the Date column to datetime and set it as the index
df['Date'] = pd.to_datetime(df['Date'])
df.set_index('Date', inplace=True)
print(df.head().to_markdown())
Anonymized data:
| Date | Name | Text |
|:--------------------|:--------------|:---------------|
| 2022-02-15 14:38:42 | Brian Griffin | blah blah blah |
| 2022-02-14 23:53:50 | Peter Griffin | blah blah blah |
| 2022-02-14 23:51:39 | Peter Griffin | blah blah blah |
| 2022-02-14 23:12:25 | Adam West | blah blah blah |
| 2022-02-14 22:51:19 | Lois Griffin | blah blah blah |
Data contains:
- Date: timestamp - the date and time when a message was sent
- Name: who sent a message (anonymized)
- Text: a string of the message (anonymized)
For further analysis, the data is resampled on the Name variable and counted over different time periods (days, weeks, months, years), giving the number of messages sent per resampled period. For example, if resampled daily, each row stores the number of messages sent by a particular person on that day; a minimal sketch of this is shown below.
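A minimal sketch of the daily variant, assuming only the df loaded above (the full resampling code is in the notebook):

# Per-person daily message counts: one row per day, one column per participant
daily_per_person = (df.groupby('Name')['Name']
                      .resample('D')
                      .count()
                      .unstack(level='Name', fill_value=0))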
II. Descriptive Statistics
Analyzed data is captured for the following time period:
print('From: ', df.index.min())
print('To: ', df.index.max())
print('In total, captured: ', df.index.max() - df.index.min())
From: 2016-08-04 11:03:40
To: 2022-02-15 14:38:42
In total, captured: 2021 days 03:35:02
High-Level Summary (see the sketch below):
- The Count is the number of messages sent.
- The Unique indicates the number of unique messages sent.
- The Top is the most common value/message.
- The Freq is the most common value's frequency.
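These four statistics correspond to pandas' describe() on the non-numeric columns; a minimal sketch, assuming the df loaded earlier:

# count / unique / top / freq for the categorical columns
print(df[['Name', 'Text']].describe().to_markdown())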
Plotting
For the purpose of plotting, the data is resampled daily.
df_day = df['Name'].resample(rule='D').count().to_frame('Count')
print(df_day.head().to_markdown())
| Date | Count |
|:--------------------|--------:|
| 2016-08-04 00:00:00 | 36 |
| 2016-08-05 00:00:00 | 146 |
| 2016-08-06 00:00:00 | 0 |
| 2016-08-07 00:00:00 | 0 |
| 2016-08-08 00:00:00 | 0 |
Overall Messages Sent
df_day['Count'].plot(figsize=(15,6), title = 'Number of Messages sent (Daily)');
Messages Per Weekday
# Derive the weekday name from the index (assumed; this step was removed from the post)
df_day['Day'] = df_day.index.day_name()
# Barplot
df_day.groupby('Day')['Count'].sum().plot.bar();
# Boxplot
ax = df_day.boxplot(by='Day', grid=False, figsize=(15,6));
The barplot shows that Friday has the highest total number of messages sent, while the boxplots' green lines show the median for each day, with dots marking outliers (there was a Saturday with more than 1400 messages sent).
Hour Activity
# Resample by Hour
df_hour = df['Name'].resample(rule='H').count().to_frame('Count')
# Derive the hour of day from the index (assumed; this step was removed from the post)
df_hour['Hour'] = df_hour.index.hour
# Barplot
df_hour.groupby('Hour')['Count'].sum().plot.bar();
# Boxplot
ax = df_hour.boxplot(by='Hour', grid=False, figsize=(15,6));
Separate the data
Since the data consists of multiple participants, it can be split into multiple time series and analyzed separately. The analysis is done on data resampled monthly - refer to the table below.
# Resample by month and count messages per person
df_sep = df[['Name']].groupby('Name').resample('M').count()
# The count column inherits the name 'Name'; rename it and flatten the index for pivoting
df_sep = df_sep.rename(columns={'Name': 'Count'}).reset_index()
# Pivot
df_sep = pd.pivot_table(df_sep, values='Count', index=['Date'],
                        columns=['Name'], aggfunc=np.sum, fill_value=0)
# Show a sample
print(df_sep.iloc[:,[0,2,6,9,10]].head().to_markdown())
| Date | Adam West | Brian Griffin | John Herbert | Peter Griffin | Stewie Griffin |
|:--------------------|------------:|----------------:|---------------:|----------------:|-----------------:|
| 2016-08-31 00:00:00 | 184 | 148 | 50 | 100 | 142 |
| 2016-09-30 00:00:00 | 210 | 388 | 228 | 28 | 242 |
| 2016-10-31 00:00:00 | 670 | 1278 | 578 | 300 | 798 |
| 2016-11-30 00:00:00 | 1428 | 2352 | 1192 | 210 | 1024 |
| 2016-12-31 00:00:00 | 880 | 1702 | 760 | 356 | 708 |
Members' activity - number of messages sent per person
ax = df_sep.boxplot();
Daily Activity
# Build the hourly activity table (assumed setup; this step was removed from the post)
df_radar_plot = df.copy()
df_radar_plot['Hour'] = df_radar_plot.index.hour
# Groupby and count
df_radar_plot = df_radar_plot.groupby(['Name','Hour'])['Hour'].count().reset_index(
    name='count')
# Pivot (the table below was additionally rescaled by a step omitted from the post)
df_radar_plot = pd.pivot_table(df_radar_plot, values='count', index=['Hour'],
                               columns=['Name'], fill_value=0)
| Hour | Adam West | Brian Griffin | Glenn Quagmire | John Herbert | Peter Griffin | Stewie Griffin |
|-------:|------------:|----------------:|-----------------:|---------------:|----------------:|-----------------:|
| 0 | 1.50853 | 1.22412 | 1.15415 | 0.755972 | 2.78212 | 0.613305 |
| 1 | 0.285149 | 0.266563 | 0.331653 | 0.0907167 | 0.982438 | 0.156455 |
| 2 | 0.124178 | 0.0698758 | 0.132661 | 0.0302389 | 0.460789 | 0.0750986 |
| 3 | 0.193166 | 0.14234 | 0.119395 | 0.0755972 | 0.660755 | 0.0187746 |
| 4 | 0.0413926 | 0.0388199 | 0.198992 | 0 | 0.321683 | 0.0125164 |
Cluster Analysis
Hierarchical clustering groups similar objects into clusters based on the Euclidean distance between them; here, each participant's monthly message counts form one observation.
# Data format required for the algorithm
print(df_sep.T.iloc[:,[0,1,2,3]].head(3).to_markdown())
| Name | 2016-08-31 00:00:00 | 2016-09-30 00:00:00 | 2016-10-31 00:00:00 | 2016-11-30 00:00:00 |
|:---------------|----------------------:|----------------------:|----------------------:|----------------------:|
| Adam West | 184 | 210 | 670 | 1428 |
| Bonnie Swanson | 0 | 0 | 172 | 268 |
| Brian Griffin | 148 | 388 | 1278 | 2352 |
import scipy.cluster.hierarchy as shc

# Transpose so each participant is one observation (rows = people, columns = months)
hc_df = df_sep.T
# Choose linkage: complete linkage on Euclidean distance
link = shc.linkage(hc_df, method='complete', metric='euclidean')
# Create Dendrogram
dend = shc.dendrogram(link, labels=hc_df.index.tolist(), leaf_rotation=70)
The algorithm divided group members into three main clusters (grey dashed line indicates main clusters).
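As a hedged follow-up (not part of the original post), the same three-cluster assignment could be read off programmatically from the linkage matrix:

# Cut the dendrogram into three clusters and map each participant to a cluster id
from scipy.cluster.hierarchy import fcluster

cluster_labels = fcluster(link, t=3, criterion='maxclust')
print(dict(zip(hc_df.index.tolist(), cluster_labels)))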
Text Mining
Word Cloud
There is a plethora of techniques for analyzing text data. To avoid analyzing sensitive information, we will focus on a word cloud only and assume the group chat consists of Shakespeare's Sonnets, so the text looks as follows:
shakespeare_txt = pd.read_fwf('Shakespeare.txt')
#shakespeare_txt
print(shakespeare_txt.head(10).to_markdown())
| | THE SONNETS |
|---:|:------------------------------------------------------|
| 0 | From fairest creatures we desire increase, |
| 1 | That thereby beauty's rose might never die, |
| 2 | But as the riper should by time decease, |
| 3 | His tender heir might bear his memory: |
| 4 | But thou contracted to thine own bright eyes, |
| 5 | Feed'st thy light's flame with self-substantial fuel, |
| 6 | Making a famine where abundance lies, |
| 7 | Thy self thy foe, to thy sweet self too cruel: |
| 8 | Thou that art now the world's fresh ornament, |
| 9 | And only herald to the gaudy spring, |
First of all, we need to remove the stopwords from the whole text.
# from stopwords_pl import list_words_pl -> if analysing messenger data
# list_words_en: English stopword list (its import was removed from the post)
list_words_en[:10]
["'ll", "'tis", "'twas", "'ve", '10', '39', 'a', "a's", 'able', 'ableabout']
# Convert to plain text
text = " ".join(review for review in shakespeare_txt['THE SONNETS'])
# Generate a word cloud image (assumed: the stopwords are built from the list above)
from wordcloud import WordCloud
stopwords = set(list_words_en)
wordcloud = WordCloud(stopwords=stopwords, background_color="white",
                      width=800, height=400).generate(text)
The size of the word indicates how often it appears in the text.
The most common words used
# word_count: dataframe of words and their counts, built from the cleaned text
# (its construction was removed from the post); choose the Top 20
word_count_top = word_count[:20]
# Barplot
ax = word_count_top.set_index("Word").plot.bar();
III. Time Series Analysis
Visual Analysis
Resampling with Simple Moving Average
Investigate the data by looking at different frequencies of the time series observations, with a Simple Moving Average (SMA) overlaid.
# Daily
df_day = df['Name'].resample(rule='D').count().to_frame('Count')
df_day['90 days of SMA'] = df_day['Count'].rolling(window = 90).mean()
# Weekly
df_wkl = df['Name'].resample(rule='W').count().to_frame('Count')
df_wkl['20 Weeks of SMA'] = df_wkl['Count'].rolling(window = 20).mean()
# Monthly
df_month = df['Name'].resample(rule='M').count().to_frame('Count')
df_month['6 months of SMA'] = df_month['Count'].rolling(window = 6).mean()
# Yearly (plotted on the 2x2 subplot grid whose setup was removed from the post)
df_year = df['Name'].resample(rule='Y').count().to_frame('Count')
df_year.plot.line(title='Resampled Yearly');
The rest of the time series analysis will be based on weekly resampled data.
Expanding Window
The expanding window shows, for each point in time, the cumulative number of messages sent up to that point, so the last point equals the total for the whole dataset. The plot rises steadily until 2021, when the curve starts to flatten.
df_wkl['Count'].expanding().sum().plot.line(figsize=(15,5));
ETS Decomposition
# Resample by week
df_ETS = df['Name'].resample(rule='W').count().to_frame('Count')
# ETS
seasonal_decompose(df_ETS['Count']).plot();
A couple of observations:
- Data is non-stationary.
- Trend is decreasing; it was flat until the end of 2018, started to go down in 2019, and declined further from 2020.
- Seasonality is present. Activity is relatively low at the beginning of each year, while it has a peak at the end of each year.
- Data contains a lot of noise, which can be explained neither by the trend nor seasonality, meaning external factors contribute as well.
ETS Decomposition: Zoom-in to the seasonal plot for one year only
# Zoom-in the one year
result.seasonal.loc['2017-01-01':'2017-12-31'].plot(figsize=(15,5));
Lag Plot
from pandas.plotting import lag_plot

df_test = df['Name'].resample(rule='W').count().to_frame('Count')
ax = lag_plot(df_test['Count'], lag=1)
The lag plot shows the number of messages sent at time t on the X-axis versus time t+1 on the Y-axis. A slight linear pattern suggests autocorrelation in the data, i.e. the data is not random.
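The same relationship can be checked numerically; a minimal sketch (not in the original post) computing the lag-1 autocorrelation of the weekly counts:

# Pearson correlation between the series and itself shifted by one week
print(df_test['Count'].autocorr(lag=1))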
Statistical Analysis
Beyond the visual representation of time series, we investigate it using statistical tests.
Test for Stationarity
Test whether the time series we analyze is stationary, using the function below.
from statsmodels.tsa.stattools import adfuller

def adf_test(series, title=''):
    print(f'Augmented Dickey-Fuller Test: {title}')
    result = adfuller(series.dropna(), autolag='AIC')
    labels = ['ADF test statistic', 'p-value', '# lags used', '# observations']
    out = pd.Series(result[0:4], index=labels)
    for key, val in result[4].items():
        out[f'critical value ({key})'] = val
    print(out.to_string())
    if result[1] <= 0.05:
        print("Strong evidence against the null hypothesis")
        print("Reject the null hypothesis")
        print("Data has no unit root and is stationary")
    else:
        print("Weak evidence against the null hypothesis")
        print("Fail to reject the null hypothesis")
        print("Data has a unit root and is non-stationary")
adf_test(df_wkl['Count'])
Augmented Dickey-Fuller Test:
ADF test statistic -2.478960
p-value 0.120678
# lags used 6.000000
# observations 283.000000
critical value (1%) -3.453670
critical value (5%) -2.871808
critical value (10%) -2.572241
Weak evidence against the null hypothesis
Fail to reject the null hypothesis
Data has a unit root and is non-stationary
The statistical test confirmed what we could see on the ETS plot - the data is non-stationary.
Test for Causality
Since the data can be split into multiple time series, we can test the causality between them (check whether lagged x-values explain the variation in y). We will use the Granger Causality Test to determine whether one time series has a causal impact on the other. First, however, we have to make sure that the time series are stationary.
Choose two time series (Stewie and Peter) to test for causality and stationarity.
# Select the two weekly series (df_sep_wkl is the weekly analogue of df_sep;
# its construction was removed from the post)
gct_df = df_sep_wkl[['Stewie Griffin', 'Peter Griffin']]
gct_df.head()
| Date | Stewie Griffin | Peter Griffin |
|:--------------------|-----------------:|----------------:|
| 2016-08-07 00:00:00 | 28 | 8 |
| 2016-08-14 00:00:00 | 114 | 92 |
| 2016-08-21 00:00:00 | 0 | 0 |
| 2016-08-28 00:00:00 | 0 | 0 |
| 2016-09-04 00:00:00 | 0 | 0 |
# Stewie Griffin
gct_df_one = gct_df.iloc[0:, :1]
# Peter Griffin
gct_df_two = gct_df.iloc[0:, 1:]
# Plot
ax = gct_df_one.plot(figsize=(15,4));
gct_df_two.plot(figsize=(15,4), ax=ax);
Check whether Stewie's time series is stationary
print('First Person - Stewie Griffin')
adf_test(gct_df_one)
First Person - Stewie Griffin
Augmented Dickey-Fuller Test:
ADF test statistic -2.661844
p-value 0.080877
# lags used 5.000000
# observations 284.000000
critical value (1%) -3.453587
critical value (5%) -2.871771
critical value (10%) -2.572222
Weak evidence against the null hypothesis
Fail to reject the null hypothesis
Data has a unit root and is non-stationary
Check whether Peter's time series is stationary
print('\n Second Person - Peter Griffin')
adf_test(gct_df_two)
Second Person - Peter Griffin
Augmented Dickey-Fuller Test:
ADF test statistic -3.104510
p-value 0.026221
# lags used 5.000000
# observations 284.000000
critical value (1%) -3.453587
critical value (5%) -2.871771
critical value (10%) -2.572222
Strong evidence against the null hypothesis
Reject the null hypothesis
Data has no unit root and is stationary
Because Stewie's time series is non-stationary, we need to find the order of differencing that makes both time series stationary.
# First-order differencing (k_diff=1)
from statsmodels.tsa.statespace.tools import diff
gct_df_diff = diff(gct_df, k_diff=1)
print('First Person - Stewie Griffin')
adf_test(gct_df_diff.iloc[0:, :1])
print('\n Second Person - Peter Griffin')
adf_test(gct_df_diff.iloc[0:, 1:])
First Person - Stewie Griffin
Augmented Dickey-Fuller Test:
ADF test statistic -1.290186e+01
p-value 4.231711e-24
# lags used 4.000000e+00
# observations 2.840000e+02
critical value (1%) -3.453587e+00
critical value (5%) -2.871771e+00
critical value (10%) -2.572222e+00
Strong evidence against the null hypothesis
Reject the null hypothesis
Data has no unit root and is stationary
Second Person - Peter Griffin
Augmented Dickey-Fuller Test:
ADF test statistic -1.194527e+01
p-value 4.445467e-22
# lags used 4.000000e+00
# observations 2.840000e+02
critical value (1%) -3.453587e+00
critical value (5%) -2.871771e+00
critical value (10%) -2.572222e+00
Strong evidence against the null hypothesis
Reject the null hypothesis
Data has no unit root and is stationary
After differencing the time series once, both are stationary. With the time series prepared, we can define the statistical hypotheses. Note that statsmodels' grangercausalitytests checks whether the series in the second column Granger-causes the series in the first column, so the first test below examines whether Stewie drives Peter.
Null Hypothesis (H0): Person 1 (Stewie Griffin) does not Granger-cause person 2 (Peter Griffin). Alternative Hypothesis (H1): Person 1 (Stewie Griffin) Granger-causes person 2 (Peter Griffin).
# Test whether Stewie (second column) Granger-causes Peter (first column)
from statsmodels.tsa.stattools import grangercausalitytests
grangercausalitytests(gct_df_diff.loc[:, ['Peter Griffin', 'Stewie Griffin']],
                      maxlag=3);
Granger Causality
number of lags (no zero) 1
ssr based F test: F=11.3455 , p=0.0009 , df_denom=285, df_num=1
ssr based chi2 test: chi2=11.4649 , p=0.0007 , df=1
likelihood ratio test: chi2=11.2426 , p=0.0008 , df=1
parameter F test: F=11.3455 , p=0.0009 , df_denom=285, df_num=1
Granger Causality
number of lags (no zero) 2
ssr based F test: F=7.8552 , p=0.0005 , df_denom=282, df_num=2
ssr based chi2 test: chi2=15.9889 , p=0.0003 , df=2
likelihood ratio test: chi2=15.5594 , p=0.0004 , df=2
parameter F test: F=7.8552 , p=0.0005 , df_denom=282, df_num=2
Granger Causality
number of lags (no zero) 3
ssr based F test: F=5.2989 , p=0.0014 , df_denom=279, df_num=3
ssr based chi2 test: chi2=16.2955 , p=0.0010 , df=3
likelihood ratio test: chi2=15.8482 , p=0.0012 , df=3
parameter F test: F=5.2989 , p=0.0014 , df_denom=279, df_num=3
The p-values are < 0.05 at all tested lags, so we reject H0 and conclude that messages sent by person 1 (Stewie Griffin) Granger-cause the replies of person 2 (Peter Griffin).
Let's test the same in the opposite direction. Null Hypothesis (H0): Person 2 (Peter Griffin) does not Granger-cause person 1 (Stewie Griffin). Alternative Hypothesis (H1): Person 2 (Peter Griffin) Granger-causes person 1 (Stewie Griffin).
# Test whether Peter (second column) Granger-causes Stewie (first column)
grangercausalitytests(gct_df_diff.loc[:, ['Stewie Griffin', 'Peter Griffin']],
                      maxlag=3);
Granger Causality
number of lags (no zero) 1
ssr based F test: F=2.2611 , p=0.1338 , df_denom=285, df_num=1
ssr based chi2 test: chi2=2.2849 , p=0.1306 , df=1
likelihood ratio test: chi2=2.2759 , p=0.1314 , df=1
parameter F test: F=2.2611 , p=0.1338 , df_denom=285, df_num=1
Granger Causality
number of lags (no zero) 2
ssr based F test: F=1.3225 , p=0.2681 , df_denom=282, df_num=2
ssr based chi2 test: chi2=2.6919 , p=0.2603 , df=2
likelihood ratio test: chi2=2.6793 , p=0.2619 , df=2
parameter F test: F=1.3225 , p=0.2681 , df_denom=282, df_num=2
Granger Causality
number of lags (no zero) 3
ssr based F test: F=0.5723 , p=0.6336 , df_denom=279, df_num=3
ssr based chi2 test: chi2=1.7601 , p=0.6237 , df=3
likelihood ratio test: chi2=1.7547 , p=0.6248 , df=3
parameter F test: F=0.5723 , p=0.6336 , df_denom=279, df_num=3
This time the p-values are well above 0.05, so we do not have enough evidence to reject H0: there is no indication that Peter's messages Granger-cause Stewie's replies.
IV. Predictions
This section compares several time series models and evaluates which one performs best.
We will work on weekly resampled data and model the number of messages sent by the group per week. The whole dataset is presented below.
# Assign the data
df_pred = df['Name'].resample(rule='W').count().to_frame('Count')
# Plot
df_pred.plot(figsize=(15,5));
Data Preparation
Train/Test Split
We will evaluate the models on the last 52 weeks of data (the last ~1 year serves as the test set and the preceding data as the training set).
range_pred = 52
# Train/Test split
train = df_pred[:-range_pred]
test = df_pred[-range_pred:]
# Plot
train['Count'].plot(legend = True, label='Train', figsize=(15,5));
test['Count'].plot(legend = True, label='Test', figsize=(15,5));
print('Train Set Dates')
print(train.index.min())
print(train.index.max())
print('Test Set Dates')
print(test.index.min())
print(test.index.max())
Train Set Dates
2016-08-07 00:00:00
2021-02-21 00:00:00
Test Set Dates
2021-02-28 00:00:00
2022-02-20 00:00:00
Check for zero values
Weeks with zero messages will be replaced by the previous week’s value.
# Replace 0 with the previous row's value
train['Count'] = train['Count'].replace(to_replace=0, method='ffill')
train[train['Count'] == 0].count()
Count 0
dtype: int64
Build and Evaluate Models
Several models will be built and evaluated based on MAE (mean absolute error) and RMSE (root mean squared error), computed by the function below (the formulas are given after it).
from sklearn.metrics import mean_absolute_error, mean_squared_error

def eval_models(true_test, pred_test, model_name):
    # Calculate Metrics
    MAE = mean_absolute_error(true_test, pred_test)
    RMSE = np.sqrt(mean_squared_error(true_test, pred_test))
    # Save as a one-row dataframe
    data = [{'Model': model_name,
             'MAE': MAE,
             'RMSE': RMSE}]
    models_results = pd.DataFrame(data)
    return models_results
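For reference, over a test set of n weeks with actual counts y_t and predictions ŷ_t, the two metrics are:

$$\mathrm{MAE}=\frac{1}{n}\sum_{t=1}^{n}\left|y_t-\hat{y}_t\right|, \qquad \mathrm{RMSE}=\sqrt{\frac{1}{n}\sum_{t=1}^{n}\left(y_t-\hat{y}_t\right)^2}$$

RMSE penalizes large errors more heavily than MAE.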
0. Baseline - train set average
Further models will be compared to this baseline (the average of the training set): without any modeling, we would simply use the historical average to predict the future.
# Check the mean of train set
test['Train mean'] = train['Count'].mean()
train['Count'].mean()
1161.0504201680671
Plot Results:
Evaluate:
eval_models(test['Count'], test['Train mean'], "Baseline - train mean")
| | Model | MAE | RMSE |
|---:|:----------------------|--------:|--------:|
| 0 | Baseline - train mean | 966.743 | 972.358 |
1. Simple Exponential Smoothing - Single, Double and Triple
Building four models:
- SES with a weight parameter (alpha). A larger alpha (i.e., closer to 1) gives more weight to the more recent observations (see the update equation below).
- DES, where additionally a trend or seasonal component is added.
- TES, where both trend and seasonal components are included in the model. (When a seasonal component is included, a Box-Cox transformation is used to avoid negative predictions.)
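As a reminder of how alpha acts as a weighting factor, the basic SES recursion is

$$\hat{y}_{t+1}=\alpha\,y_t+(1-\alpha)\,\hat{y}_t,$$

so each forecast blends the latest observation with the previous forecast, and older observations receive exponentially decaying weights.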
from statsmodels.tsa.holtwinters import ExponentialSmoothing

#1 SES
fitted_model_1 = ExponentialSmoothing(train['Count']).fit()
#2 DES (seasonal component)
fitted_model_2 = ExponentialSmoothing(train['Count'], seasonal='add').fit(
    use_boxcox=True)
#3 DES (trend component, with 'damped' True)
fitted_model_3 = ExponentialSmoothing(train['Count'], trend='add',
                                      damped=True).fit()
#4 TES (trend + seasonal components)
fitted_model_4 = ExponentialSmoothing(train['Count'], trend='add',
                                      seasonal='add').fit(use_boxcox=True)
Plot Results:
Zoom-in to the test set only:
Evaluate:
| | Model | MAE | RMSE |
|---:|:------------------------------------|--------:|--------:|
| 0 | Baseline - train mean | 966.743 | 972.358 |
| 0 | SES | 106.043 | 125.102 |
| 0 | SES with seasonal component | 209.201 | 260.403 |
| 0 | SES with trend component | 115.664 | 134.36 |
| 0 | SES with trend + seasonal component | 170.096 | 206.173 |
All Exponential Smoothing techniques performed much better than the baseline approach. The simplest variant, plain SES with a single smoothing parameter (alpha), performed best.
2. ARIMA
Let's check one of the most popular models for time series forecasting - ARIMA. Parameters are selected with the auto_arima function based on AIC, and the chosen model is then used for prediction.
from pmdarima import auto_arima

fitted_model_5 = auto_arima(train['Count'], start_p=0, start_q=0,
                            seasonal=True, trace=True, m=range_pred)
Best model: ARIMA(1,1,1)(0,0,0)[52]
# Model Prediction
fitted_model_5_fcast = fitted_model_5.predict(n_periods=len(test))
Plot Results:
Evaluate:
print(all_results.to_markdown())
| | Model | MAE | RMSE |
|---:|:------------------------------------|--------:|--------:|
| 0 | Baseline - train mean | 966.743 | 972.358 |
| 0 | SES | 106.043 | 125.102 |
| 0 | SES with seasonal component | 209.201 | 260.403 |
| 0 | SES with trend component | 115.664 | 134.36 |
| 0 | SES with trend + seasonal component | 170.096 | 206.173 |
| 0 | ARIMA | 252.765 | 273.249 |
Surprisingly, ARIMA performed poorly compared to the other models (high RMSE). One reason could be that auto_arima selected a plain ARIMA rather than a Seasonal ARIMA, despite the seasonality present in our data. Weekly data can be challenging to model because the seasonal period (the number of weeks in a year) varies from year to year, and it had to be fixed in auto_arima via the m parameter.
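A possible follow-up, not run in the original analysis: fix the order of seasonal differencing so that auto_arima cannot fall back to a purely non-seasonal model (the variable name below is hypothetical, and the search is considerably slower with m=52):

from pmdarima import auto_arima

# D=1 imposes one round of seasonal differencing at period m=52
seasonal_candidate = auto_arima(train['Count'], start_p=0, start_q=0,
                                seasonal=True, m=52, D=1, trace=True,
                                suppress_warnings=True, error_action='ignore')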
3. Facebook Prophet
Let's test Facebook's Prophet library for forecasting. It requires the data to be converted into a specific two-column format (ds for the timestamp, y for the value) before modeling.
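A minimal sketch of that conversion, assuming the train/test split from above (the original preparation code was removed):

train_propeth = train['Count'].reset_index().rename(columns={'Date': 'ds', 'Count': 'y'})
test_propeth = test['Count'].reset_index().rename(columns={'Date': 'ds', 'Count': 'y'})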
print(train_propeth.head().to_markdown())
| | ds | y |
|---:|:--------------------|----:|
| 0 | 2016-08-07 00:00:00 | 182 |
| 1 | 2016-08-14 00:00:00 | 548 |
| 2 | 2016-08-21 00:00:00 | 548 |
| 3 | 2016-08-28 00:00:00 | 548 |
| 4 | 2016-09-04 00:00:00 | 6 |
print(test_propeth.head().to_markdown())
| | ds | y |
|---:|:--------------------|----:|
| 0 | 2021-02-28 00:00:00 | 150 |
| 1 | 2021-03-07 00:00:00 | 224 |
| 2 | 2021-03-14 00:00:00 | 100 |
| 3 | 2021-03-21 00:00:00 | 280 |
| 4 | 2021-03-28 00:00:00 | 216 |
Build the model:
from prophet import Prophet  # `from fbprophet import Prophet` in older versions

# Build a model
m = Prophet(seasonality_mode='multiplicative')
# Fit to train data
m.fit(train_propeth)
# Assign periods to predict and freq
future = m.make_future_dataframe(periods=range_pred, freq='W')
# Forecast
forecast = m.predict(future)
Additional Capabilities of Prophet:
Prophet has multiple built-in functions for investigating the time series. Let's check out two of them - the components plot and the changepoints plot.
# Components Plot
m.plot_components(forecast);
Changepoint Plot - shows how the trend changed; the red lines indicate where significant trend changes occurred.
from prophet.plot import add_changepoints_to_plot

fig = m.plot(forecast)
a = add_changepoints_to_plot(fig.gca(), m, forecast)
Plot Results - Built-in Prophet capabilities:
m.plot(forecast);
Plot Results - Zoom in to the test set only using our previous plotting approach:
Evaluate:
print(all_results.to_markdown())
| | Model | MAE | RMSE |
|---:|:------------------------------------|--------:|--------:|
| 0 | Baseline - train mean | 966.743 | 972.358 |
| 0 | SES | 106.043 | 125.102 |
| 0 | SES with seasonal component | 209.201 | 260.403 |
| 0 | SES with trend component | 115.664 | 134.36 |
| 0 | SES with trend + seasonal component | 170.096 | 206.173 |
| 0 | ARIMA | 252.765 | 273.249 |
| 0 | Prophet | 221.184 | 257.256 |
4. ARIMAX - with exogenous variable
The last method we want to test is ARIMA again, but with an exogenous variable: holidays. The assumption is that the number of messages sent during a week could be affected by the number of holidays falling in that week.
Prepare the data:
# holiday.csv contains Polish holidays for the years 2001 to 2025
holiday_df_all = pd.read_csv('holiday.csv', parse_dates=['Date'], index_col='Date')
# Filter holidays for the dates in our dataset
holiday_df = holiday_df_all.loc['2016-08-04':'2022-02-15']
| Date | Holiday_Name |
|:--------------------|:---------------------------------------|
| 2016-08-15 00:00:00 | Wniebowziecie Najswietszej Marii Panny |
| 2016-11-01 00:00:00 | Uroczystosc Wszystkich Swietych |
| 2016-11-11 00:00:00 | Narodowe Swieto Niepodleglosci |
| 2016-12-25 00:00:00 | pierwszy dzien Bozego Narodzenia |
| 2016-12-26 00:00:00 | drugi dzien Bozego Narodzenia |
Add the holiday column to our dataset.
# df_sarimax is a copy of the raw message data (assumed; this step was removed from the post)
df_sarimax = df.copy()
# Resample daily
df_sarimax_res_d = df_sarimax['Name'].resample(rule='D').count().to_frame('Count')
# Attach the holiday column to the daily-resampled data
df_sarimax_res_d = df_sarimax_res_d.join(holiday_df)
# Flag holidays: 1 if the day is a holiday, 0 otherwise
df_sarimax_res_d['Holiday_Name_fill'] = np.where(
    df_sarimax_res_d['Holiday_Name'].isnull(), 0, 1)
| Date | Count | Holiday_Name | Holiday_Name_fill |
|:--------------------|--------:|---------------:|--------------------:|
| 2016-08-04 00:00:00 | 36 | nan | 0 |
| 2016-08-05 00:00:00 | 146 | nan | 0 |
| 2016-08-06 00:00:00 | 0 | nan | 0 |
| 2016-08-07 00:00:00 | 0 | nan | 0 |
| 2016-08-08 00:00:00 | 0 | nan | 0 |
Plot the daily data and overlay the holidays (a grey vertical line marks each holiday). A couple of the blue peaks on the plot coincide with holidays.
ax = df_sarimax_res_d['Count'].plot()
# Overlap if holiday
for x in df_sarimax_res_d.query('Holiday_Name_fill>=1').index:
ax.axvline(x=x, color='k', alpha = 0.3);
Now, prepare the data for prediction (resampled weekly). We need to count the number of holidays in each week and attach this information to the train and test datasets.
# Resample Weekly
df_sarimax_final = df_sarimax_res_d['Count'].resample(rule='W').sum().to_frame('Count')
# Sum number of holidays per week
df_sarimax_final['Count_Holiday'] = df_sarimax_res_d['Holiday_Name_fill'].resample(
rule='W').sum().to_frame('Count_Holiday')
| Date | Count | Count_Holiday |
|:--------------------|--------:|----------------:|
| 2016-08-07 00:00:00 | 182 | 0 |
| 2016-08-14 00:00:00 | 548 | 0 |
| 2016-08-21 00:00:00 | 0 | 1 |
| 2016-08-28 00:00:00 | 0 | 0 |
| 2016-09-04 00:00:00 | 6 | 0 |
Plot the weekly data and overlay the holidays (grey vertical lines mark the holidays).
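Before modeling, the weekly dataset with the exogenous column is split into train and test the same way as before; a minimal sketch, since the original split code was removed:

# Same 52-week holdout as for the earlier models
train_ex = df_sarimax_final[:-range_pred]
test_ex = df_sarimax_final[-range_pred:]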
Build ARIMAX model:
fitted_model_5_ex = auto_arima(train_ex['Count'],
exogenous = train_ex[['Count_Holiday']],
seasonal = True, trace=False, m=range_pred)
ARIMA(order=(1, 1, 1), scoring_args={}, seasonal_order=(0, 0, 0, 52),
suppress_warnings=True, with_intercept=False)
ARIMAX Predictions:
fitted_model_5_ex_fcast = fitted_model_5_ex.predict(
n_periods=len(test_ex), exogenous = test_ex[['Count_Holiday']])
Plot Results:
Interestingly, the holiday variable slightly decreased the predictions during holiday periods.
Evaluate:
print(all_results_plot.to_markdown())
| Model | MAE | RMSE |
|:------------------------------------|--------:|--------:|
| Baseline - train mean | 966.743 | 972.358 |
| SES | 106.043 | 125.102 |
| SES with seasonal component | 209.201 | 260.403 |
| SES with trend component | 115.664 | 134.36 |
| SES with trend + seasonal component | 170.096 | 206.173 |
| ARIMA | 252.765 | 273.249 |
| Prophet | 221.184 | 257.256 |
| ARIMA Exogenous | 238.963 | 259.187 |
ARIMA with an exogenous variable performed slightly better than plain ARIMA, but its RMSE is still relatively high compared to the other methods.
V. Conclusion
all_results_plot[['RMSE', 'MAE']].sort_values(ascending=True, by='RMSE').plot(
kind = 'bar', rot=60);
All models performed significantly better than the baseline. The best results (lowest RMSE) were achieved by the simplest model: Simple Exponential Smoothing (SES) with just one smoothing parameter (alpha).
VI. Forecast the Unknown Future
In this last part, we make forecasts into the unknown future, beyond the data in the dataset. We retrain the models on the whole dataset (train + test) and predict another 52 weeks (from the week of 2022-02-27 until the week of 2023-02-19). We won't evaluate these models, since the future is unknown, but will do so in a couple of weeks or months.
Define the data
We use the same dataframe df_pred, which already holds the weekly resampled data, but this time we build the models on the whole dataset.
df_pred_forecast = df_pred
print('From week: ', df_pred_forecast.index.min())
print('To week: ', df_pred_forecast.index.max())
From week: 2016-08-07 00:00:00
To week: 2022-02-20 00:00:00
Rebuild Models
1. Simple Exponential Smoothing - Single, Double and Triple
#1 SES
fitted_model_1_future = ExponentialSmoothing(
    df_pred_forecast['Count']).fit()
#2 DES (seasonal component)
fitted_model_2_future = ExponentialSmoothing(
    df_pred_forecast['Count'], seasonal='add').fit(use_boxcox=True)
#3 DES (trend component, with 'damped' True)
fitted_model_3_future = ExponentialSmoothing(
    df_pred_forecast['Count'], trend='add', damped=True).fit()
#4 TES (trend + seasonal components)
fitted_model_4_future = ExponentialSmoothing(
    df_pred_forecast['Count'], trend='add', seasonal='add').fit(use_boxcox=True)
2. ARIMA
# AutoARIMA
fitted_model_5_future = auto_arima(df_pred_forecast['Count'], start_p=0,
start_q=0, seasonal=True, trace=False, m=range_pred)
ARIMA(order=(1, 1, 1), scoring_args={}, seasonal_order=(0, 0, 0, 52),
suppress_warnings=True, with_intercept=False)
3. Facebook Prophet Model
In this dataset, the very last data points approach zero, and Prophet forecasts negative values for the future. We therefore apply a Box-Cox transformation before prediction and invert the transformation for plotting.
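A hedged sketch of that transformation (the original code was removed, so everything except the Prophet_full name used below is an assumption):

from scipy.stats import boxcox
from scipy.special import inv_boxcox

# Prophet input on the full dataset: columns 'ds' and 'y', with y Box-Cox transformed
Prophet_full = df_pred_forecast['Count'].reset_index().rename(
    columns={'Date': 'ds', 'Count': 'y'})
Prophet_full['y'], lmbda = boxcox(Prophet_full['y'] + 1)  # +1 because Box-Cox needs y > 0

# After forecasting (below), the predictions are inverted for plotting, e.g.:
# future_future_fst_6['yhat'] = inv_boxcox(future_future_fst_6['yhat'], lmbda) - 1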
# Build a model
m_future_model_6 = Prophet(seasonality_mode= 'multiplicative')
# Fit to data
m_future_model_6.fit(Prophet_full)
# Assign periods to predict and freq
future_future_fst_6= m_future_model_6.make_future_dataframe(
periods = range_pred, freq='W')
# Forecast
future_future_fst_6 = m_future_model_6.predict(future_future_fst_6)
4. ARIMAX - with exogenous variable
Build the model:
# Combine train and test data used in previous ARIMAX model
df_pred_forecast_ex = train_ex[['Count', 'Count_Holiday']].append(
test_ex[['Count', 'Count_Holiday']])
# AutoARIMA with the same settings as the previous ARIMAX model
fitted_model_7_ex = auto_arima(df_pred_forecast_ex['Count'],
exogenous = df_pred_forecast_ex[['Count_Holiday']],
seasonal = True, trace=False, m=range_pred)
ARIMA(order=(2, 1, 2), scoring_args={}, seasonal_order=(0, 0, 0, 52),
suppress_warnings=True, with_intercept=False)
Define the data for predictions:
Since we use an exogenous variable, we need to attach a column with holiday counts for the future dates to support the predictions.
# Daily date range covering the 365 days after the last observation
future_days = pd.DataFrame(index=pd.date_range(start=df.index.max().strftime('%Y-%m-%d'),
                                               periods=365, freq='D'))
# Attach the holiday column and flag holidays
future_days = future_days.join(holiday_df_all)
future_days['Holiday_flag'] = np.where(future_days['Holiday_Name'].isnull(), 0, 1)
# Aggregate to weekly holiday counts to match the weekly model
# (assumed step; the original aggregation code was not shown)
dates_to_predict = future_days['Holiday_flag'].resample('W').sum().to_frame('Count_Holiday')
Predictions:
fitted_model_7_ex_future = fitted_model_7_ex.predict(
n_periods=len(dates_to_predict),
exogenous = dates_to_predict[['Count_Holiday']])
Combine Forecasts and Plot the future
# forecast_future is a dataframe indexed by the 52 future weeks, and the
# *_fcast_future series below are each model's forecast over those weeks
# (their construction was removed from the post for brevity)
# 1. SES
forecast_future['SES (FUTURE)'] = fitted_model_1_fcast_future
forecast_future['SES with seasonal component (FUTURE)'] = fitted_model_2_fcast_future
forecast_future['SES with trend component (FUTURE)'] = fitted_model_3_fcast_future
forecast_future['SES with trend + seasonal component (FUTURE)'] = fitted_model_4_fcast_future
# 2. ARIMA
forecast_future['ARIMA (FUTURE)'] = fitted_model_5_fcast_future
# 3. Prophet
forecast_future = pd.merge(forecast_future, forecast_future_to_store_m_6,
left_index=True, right_on='Date')
forecast_future = forecast_future.set_index('Date')
# 4. ARIMAX
forecast_future['ARIMA Exogenous (FUTURE)'] = fitted_model_7_ex_future
# Plot
forecast_future.plot(figsize=(15,10));
In a couple of weeks or months, we will come back to these forecasted values and see which method came closest to the truth.