Facebook Time Series Analysis
This post contains the analysis of a Messenger group chat with multiple participants for the time window from 2016-08-04 11:03:40 to 2022-02-15 14:38:42, which is a total time span of 2021 days. All personal data has been stripped out, anonymized and replaced by names from Family Guy without affecting the data itself (especially the time series). The analysis below consists of the following:
- Data Import (combine json files and load them to Python)
- Descriptive Statistics (exploratory data analysis, along with text mining)
- Time Series Analysis (deep dive into time series techniques)
- Time Series Predictions (apply different methods for time series prediction: Simple, Double and Triple Exponential Smoothing; ARIMA, including Seasonal ARIMA (SARIMA) and Seasonal ARIMA with an exogenous variable (SARIMAX); and the Facebook Prophet model)
- Conclude the Predictions (Compare the evaluation of Time Series Models)
- Forecast the Future (retrain all the models on the whole dataset and forecast the future from Feb 2022 until Feb 2023, so the results can be validated in a couple of months).
To reduce the length of the report, most of the code generating the analysis was removed. If you are interested in viewing the whole analysis and Python code, you can find the complete Jupyter Notebook on my Github.
I. Data Import
Once you request the data from Facebook, it can be delivered in multiple .json files depending on the size of the conversation. Below we start by reading the combined .csv file; if you are interested in the function which merges all .json files into a single .csv, refer to my Github.
Read the .csv for analysis.
# Imports (assumed; the full import list is in the notebook)
import pandas as pd
import numpy as np

# Read data
df = pd.read_csv('output_anonym.csv', names=['Date', 'Name', 'Text'])
# Convert the Date column to datetime and set it as the index
df['Date'] = pd.to_datetime(df['Date'])
df.set_index('Date', inplace=True)
print(df.head().to_markdown())
Anonymized data:
| Date | Name | Text |
|:--------------------|:--------------|:---------------|
| 2022-02-15 14:38:42 | Brian Griffin | blah blah blah |
| 2022-02-14 23:53:50 | Peter Griffin | blah blah blah |
| 2022-02-14 23:51:39 | Peter Griffin | blah blah blah |
| 2022-02-14 23:12:25 | Adam West | blah blah blah |
| 2022-02-14 22:51:19 | Lois Griffin | blah blah blah |
Data contains:
- Date: timestamp - the date and time when a message was sent
- Name: who sent a message (anonymized)
- Text: a string of the message (anonymized)
For further analysis, the data is resampled on the Name variable and counted over different time periods (days, weeks, months, years), giving the number of messages sent per resampled period. For example, if resampled daily, each row stores the number of messages sent by a particular person on that day; a minimal sketch of this is shown below.
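A minimal sketch of the daily variant, assuming only the df loaded above (the full resampling code is in the notebook):

# Per-person daily message counts: one row per day, one column per participant
daily_per_person = (df.groupby('Name')['Name']
                      .resample('D')
                      .count()
                      .unstack(level='Name', fill_value=0))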
II. Descriptive Statistics
Analyzed data is captured for the following time period:
print('From: ', df.index.min())
print('To: ', df.index.max())
print('In total, captured: ', df.index.max() - df.index.min())
From: 2016-08-04 11:03:40
To: 2022-02-15 14:38:42
In total, captured: 2021 days 03:35:02
High-Level Summary (see the sketch below):
- The Count is the number of messages sent.
- The Unique indicates the number of unique messages sent.
- The Top is the most common value/message.
- The Freq is the most common value's frequency.
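These four statistics correspond to pandas' describe() on the non-numeric columns; a minimal sketch, assuming the df loaded earlier:

# count / unique / top / freq for the categorical columns
print(df[['Name', 'Text']].describe().to_markdown())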
Plotting
For the purpose of plotting, the data is resampled daily.
df_day = df['Name'].resample(rule='D').count().to_frame('Count')
print(df_day.head().to_markdown())
| Date | Count |
|:--------------------|--------:|
| 2016-08-04 00:00:00 | 36 |
| 2016-08-05 00:00:00 | 146 |
| 2016-08-06 00:00:00 | 0 |
| 2016-08-07 00:00:00 | 0 |
| 2016-08-08 00:00:00 | 0 |
Overall Messages Sent
df_day['Count'].plot(figsize=(15,6), title = 'Number of Messages sent (Daily)');
Messages Per Weekday
# Derive the weekday name from the index (assumed; this step was removed from the post)
df_day['Day'] = df_day.index.day_name()
# Barplot
df_day.groupby('Day')['Count'].sum().plot.bar();
# Boxplot
ax = df_day.boxplot(by='Day', grid=False, figsize=(15,6));
The barplot shows that Friday has the highest total number of messages sent, while the boxplots' green lines show the median for each day, with dots marking outliers (there was a Saturday with more than 1400 messages sent).
Hour Activity
# Resample by Hour
df_hour = df['Name'].resample(rule='H').count().to_frame('Count')
# Derive the hour of day from the index (assumed; this step was removed from the post)
df_hour['Hour'] = df_hour.index.hour
# Barplot
df_hour.groupby('Hour')['Count'].sum().plot.bar();
# Boxplot
ax = df_hour.boxplot(by='Hour', grid=False, figsize=(15,6));
Separate the data
Since the data consists of multiple participants, it can be split into multiple time series and analyzed separately. The analysis is done on data resampled monthly - refer to the table below.
# Resample by month and count messages per person
df_sep = df[['Name']].groupby('Name').resample('M').count()
# The count column inherits the name 'Name'; rename it and flatten the index for pivoting
df_sep = df_sep.rename(columns={'Name': 'Count'}).reset_index()
# Pivot
df_sep = pd.pivot_table(df_sep, values='Count', index=['Date'],
                        columns=['Name'], aggfunc=np.sum, fill_value=0)
# Show a sample
print(df_sep.iloc[:,[0,2,6,9,10]].head().to_markdown())
| Date | Adam West | Brian Griffin | John Herbert | Peter Griffin | Stewie Griffin |
|:--------------------|------------:|----------------:|---------------:|----------------:|-----------------:|
| 2016-08-31 00:00:00 | 184 | 148 | 50 | 100 | 142 |
| 2016-09-30 00:00:00 | 210 | 388 | 228 | 28 | 242 |
| 2016-10-31 00:00:00 | 670 | 1278 | 578 | 300 | 798 |
| 2016-11-30 00:00:00 | 1428 | 2352 | 1192 | 210 | 1024 |
| 2016-12-31 00:00:00 | 880 | 1702 | 760 | 356 | 708 |
Members' activity - number of messages sent per person
ax = df_sep.boxplot();
Daily Activity
# Build the hourly activity table (assumed setup; this step was removed from the post)
df_radar_plot = df.copy()
df_radar_plot['Hour'] = df_radar_plot.index.hour
# Groupby and count
df_radar_plot = df_radar_plot.groupby(['Name','Hour'])['Hour'].count().reset_index(
    name='count')
# Pivot (the table below was additionally rescaled by a step omitted from the post)
df_radar_plot = pd.pivot_table(df_radar_plot, values='count', index=['Hour'],
                               columns=['Name'], fill_value=0)
| Hour | Adam West | Brian Griffin | Glenn Quagmire | John Herbert | Peter Griffin | Stewie Griffin |
|-------:|------------:|----------------:|-----------------:|---------------:|----------------:|-----------------:|
| 0 | 1.50853 | 1.22412 | 1.15415 | 0.755972 | 2.78212 | 0.613305 |
| 1 | 0.285149 | 0.266563 | 0.331653 | 0.0907167 | 0.982438 | 0.156455 |
| 2 | 0.124178 | 0.0698758 | 0.132661 | 0.0302389 | 0.460789 | 0.0750986 |
| 3 | 0.193166 | 0.14234 | 0.119395 | 0.0755972 | 0.660755 | 0.0187746 |
| 4 | 0.0413926 | 0.0388199 | 0.198992 | 0 | 0.321683 | 0.0125164 |
Cluster Analysis
Hierarchical clustering groups similar objects into clusters based on the Euclidean distance between them; here, each participant's monthly message counts form one observation.
# Data format required for the algorithm
print(df_sep.T.iloc[:,[0,1,2,3]].head(3).to_markdown())
| Name | 2016-08-31 00:00:00 | 2016-09-30 00:00:00 | 2016-10-31 00:00:00 | 2016-11-30 00:00:00 |
|:---------------|----------------------:|----------------------:|----------------------:|----------------------:|
| Adam West | 184 | 210 | 670 | 1428 |
| Bonnie Swanson | 0 | 0 | 172 | 268 |
| Brian Griffin | 148 | 388 | 1278 | 2352 |
import scipy.cluster.hierarchy as shc

# Transpose so each participant is one observation (rows = people, columns = months)
hc_df = df_sep.T
# Choose linkage: complete linkage on Euclidean distance
link = shc.linkage(hc_df, method='complete', metric='euclidean')
# Create Dendrogram
dend = shc.dendrogram(link, labels=hc_df.index.tolist(), leaf_rotation=70)
The algorithm divided group members into three main clusters (grey dashed line indicates main clusters).
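As a hedged follow-up (not part of the original post), the same three-cluster assignment could be read off programmatically from the linkage matrix:

# Cut the dendrogram into three clusters and map each participant to a cluster id
from scipy.cluster.hierarchy import fcluster

cluster_labels = fcluster(link, t=3, criterion='maxclust')
print(dict(zip(hc_df.index.tolist(), cluster_labels)))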
Text Mining
Word Cloud
There is a plethora of techniques for analyzing text data. To avoid analyzing sensitive information, we will focus on a word cloud only and assume the group chat consists of Shakespeare's Sonnets, so the text looks as follows:
shakespeare_txt = pd.read_fwf('Shakespeare.txt')
#shakespeare_txt
print(shakespeare_txt.head(10).to_markdown())
| | THE SONNETS |
|---:|:------------------------------------------------------|
| 0 | From fairest creatures we desire increase, |
| 1 | That thereby beauty's rose might never die, |
| 2 | But as the riper should by time decease, |
| 3 | His tender heir might bear his memory: |
| 4 | But thou contracted to thine own bright eyes, |
| 5 | Feed'st thy light's flame with self-substantial fuel, |
| 6 | Making a famine where abundance lies, |
| 7 | Thy self thy foe, to thy sweet self too cruel: |
| 8 | Thou that art now the world's fresh ornament, |
| 9 | And only herald to the gaudy spring, |
First of all, we need to remove the stopwords from the whole text.
# from stopwords_pl import list_words_pl -> if analysing messenger data
# list_words_en: English stopword list (its import was removed from the post)
list_words_en[:10]
["'ll", "'tis", "'twas", "'ve", '10', '39', 'a', "a's", 'able', 'ableabout']
# Convert to plain text
text = " ".join(review for review in shakespeare_txt['THE SONNETS'])
# Generate a word cloud image (assumed: the stopwords are built from the list above)
from wordcloud import WordCloud
stopwords = set(list_words_en)
wordcloud = WordCloud(stopwords=stopwords, background_color="white",
                      width=800, height=400).generate(text)
The size of the word indicates how often it appears in the text.
The most common words used
# word_count: dataframe of words and their counts, built from the cleaned text
# (its construction was removed from the post); choose the Top 20
word_count_top = word_count[:20]
# Barplot
ax = word_count_top.set_index("Word").plot.bar();
III. Time Series Analysis
Visual Analysis
Resampling with Simple Moving Average
Investigate the data by looking at different frequencies of the time series observations, with a Simple Moving Average (SMA) overlaid.
# Daily
df_day = df['Name'].resample(rule='D').count().to_frame('Count')
df_day['90 days of SMA'] = df_day['Count'].rolling(window = 90).mean()
# Weekly
df_wkl = df['Name'].resample(rule='W').count().to_frame('Count')
df_wkl['20 Weeks of SMA'] = df_wkl['Count'].rolling(window = 20).mean()
# Monthly
df_month = df['Name'].resample(rule='M').count().to_frame('Count')
df_month['6 months of SMA'] = df_month['Count'].rolling(window = 6).mean()
# Yearly (plotted on the 2x2 subplot grid whose setup was removed from the post)
df_year = df['Name'].resample(rule='Y').count().to_frame('Count')
df_year.plot.line(title='Resampled Yearly');
The rest of the time series analysis will be based on weekly resampled data.
Expanding Window
The expanding window shows, for each point in time, the cumulative number of messages sent up to that point, so the last point equals the total for the whole dataset. The plot rises steadily until 2021, when the curve starts to flatten.
df_wkl['Count'].expanding().sum().plot.line(figsize=(15,5));
ETS Decomposition
# Resample by week
df_ETS = df['Name'].resample(rule='W').count().to_frame('Count')
# ETS
seasonal_decompose(df_ETS['Count']).plot();
A couple of observations:
- Data is non-stationary.
- Trend is decreasing; it was flat until the end of 2018, started to go down in 2019, and declined further from 2020.
- Seasonality is present. Activity is relatively low at the beginning of each year, while it has a peak at the end of each year.
- Data contains a lot of noise, which can be explained neither by the trend nor seasonality, meaning external factors contribute as well.
ETS Decomposition: Zoom-in to the seasonal plot for one year only
# Zoom-in the one year
result.seasonal.loc['2017-01-01':'2017-12-31'].plot(figsize=(15,5));
Lag Plot
from pandas.plotting import lag_plot

df_test = df['Name'].resample(rule='W').count().to_frame('Count')
ax = lag_plot(df_test['Count'], lag=1)
The lag plot shows the number of messages sent at time t on the X-axis versus time t+1 on the Y-axis. A slight linear pattern suggests autocorrelation in the data, i.e. the data is not random.
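The same relationship can be checked numerically; a minimal sketch (not in the original post) computing the lag-1 autocorrelation of the weekly counts:

# Pearson correlation between the series and itself shifted by one week
print(df_test['Count'].autocorr(lag=1))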
Statistical Analysis
Beyond the visual representation of time series, we investigate it using statistical tests.
Test for Stationarity
Test whether the time series we analyze is stationary, using the function below.
from statsmodels.tsa.stattools import adfuller

def adf_test(series, title=''):
    print(f'Augmented Dickey-Fuller Test: {title}')
    result = adfuller(series.dropna(), autolag='AIC')
    labels = ['ADF test statistic', 'p-value', '# lags used', '# observations']
    out = pd.Series(result[0:4], index=labels)
    for key, val in result[4].items():
        out[f'critical value ({key})'] = val
    print(out.to_string())
    if result[1] <= 0.05:
        print("Strong evidence against the null hypothesis")
        print("Reject the null hypothesis")
        print("Data has no unit root and is stationary")
    else:
        print("Weak evidence against the null hypothesis")
        print("Fail to reject the null hypothesis")
        print("Data has a unit root and is non-stationary")
adf_test(df_wkl['Count'])
Augmented Dickey-Fuller Test:
ADF test statistic -2.478960
p-value 0.120678
# lags used 6.000000
# observations 283.000000
critical value (1%) -3.453670
critical value (5%) -2.871808
critical value (10%) -2.572241
Weak evidence against the null hypothesis
Fail to reject the null hypothesis
Data has a unit root and is non-stationary
The statistical test confirmed what we could see on the ETS plot - the data is non-stationary.
Test for Causality
Since the data can be split into multiple time series, we can test the causality between them (check whether lagged x-values explain the variation in y). We will use the Granger Causality Test to determine whether one time series has a causal impact on the other. First, however, we have to make sure that the time series are stationary.
Choose two time series (Stewie and Peter) to test for causality and stationarity.
# Select the two weekly series (df_sep_wkl is the weekly analogue of df_sep;
# its construction was removed from the post)
gct_df = df_sep_wkl[['Stewie Griffin', 'Peter Griffin']]
gct_df.head()
| Date | Stewie Griffin | Peter Griffin |
|:--------------------|-----------------:|----------------:|
| 2016-08-07 00:00:00 | 28 | 8 |
| 2016-08-14 00:00:00 | 114 | 92 |
| 2016-08-21 00:00:00 | 0 | 0 |
| 2016-08-28 00:00:00 | 0 | 0 |
| 2016-09-04 00:00:00 | 0 | 0 |
# Stewie Griffin
gct_df_one = gct_df.iloc[0:, :1]
# Peter Griffin
gct_df_two = gct_df.iloc[0:, 1:]
# Plot
ax = gct_df_one.plot(figsize=(15,4));
gct_df_two.plot(figsize=(15,4), ax=ax);
Check whether Stewie's time series is stationary
print('First Person - Stewie Griffin')
adf_test(gct_df_one)
First Person - Stewie Griffin
Augmented Dickey-Fuller Test:
ADF test statistic -2.661844
p-value 0.080877
# lags used 5.000000
# observations 284.000000
critical value (1%) -3.453587
critical value (5%) -2.871771
critical value (10%) -2.572222
Weak evidence against the null hypothesis
Fail to reject the null hypothesis
Data has a unit root and is non-stationary
Check whether Peter's time series is stationary
print('\n Second Person - Peter Griffin')
adf_test(gct_df_two)
Second Person - Peter Griffin
Augmented Dickey-Fuller Test:
ADF test statistic -3.104510
p-value 0.026221
# lags used 5.000000
# observations 284.000000
critical value (1%) -3.453587
critical value (5%) -2.871771
critical value (10%) -2.572222
Strong evidence against the null hypothesis
Reject the null hypothesis
Data has no unit root and is stationary
Because Stewie's time series is non-stationary, we need to find the order of differencing that makes both time series stationary.
# First-order differencing (k_diff=1)
from statsmodels.tsa.statespace.tools import diff
gct_df_diff = diff(gct_df, k_diff=1)
print('First Person - Stewie Griffin')
adf_test(gct_df_diff.iloc[0:, :1])
print('\n Second Person - Peter Griffin')
adf_test(gct_df_diff.iloc[0:, 1:])
First Person - Stewie Griffin
Augmented Dickey-Fuller Test:
ADF test statistic -1.290186e+01
p-value 4.231711e-24
# lags used 4.000000e+00
# observations 2.840000e+02
critical value (1%) -3.453587e+00
critical value (5%) -2.871771e+00
critical value (10%) -2.572222e+00
Strong evidence against the null hypothesis
Reject the null hypothesis
Data has no unit root and is stationary
Second Person - Peter Griffin
Augmented Dickey-Fuller Test:
ADF test statistic -1.194527e+01
p-value 4.445467e-22
# lags used 4.000000e+00
# observations 2.840000e+02
critical value (1%) -3.453587e+00
critical value (5%) -2.871771e+00
critical value (10%) -2.572222e+00
Strong evidence against the null hypothesis
Reject the null hypothesis
Data has no unit root and is stationary
After differencing the time series once, both are stationary. With the time series prepared, we can define the statistical hypotheses. Note that statsmodels' grangercausalitytests checks whether the series in the second column Granger-causes the series in the first column, so the first test below examines whether Stewie drives Peter.
Null Hypothesis (H0): Person 1 (Stewie Griffin) does not Granger-cause person 2 (Peter Griffin). Alternative Hypothesis (H1): Person 1 (Stewie Griffin) Granger-causes person 2 (Peter Griffin).
# Test whether Stewie (second column) Granger-causes Peter (first column)
from statsmodels.tsa.stattools import grangercausalitytests
grangercausalitytests(gct_df_diff.loc[:, ['Peter Griffin', 'Stewie Griffin']],
                      maxlag=3);
Granger Causality
number of lags (no zero) 1
ssr based F test: F=11.3455 , p=0.0009 , df_denom=285, df_num=1
ssr based chi2 test: chi2=11.4649 , p=0.0007 , df=1
likelihood ratio test: chi2=11.2426 , p=0.0008 , df=1
parameter F test: F=11.3455 , p=0.0009 , df_denom=285, df_num=1
Granger Causality
number of lags (no zero) 2
ssr based F test: F=7.8552 , p=0.0005 , df_denom=282, df_num=2
ssr based chi2 test: chi2=15.9889 , p=0.0003 , df=2
likelihood ratio test: chi2=15.5594 , p=0.0004 , df=2
parameter F test: F=7.8552 , p=0.0005 , df_denom=282, df_num=2
Granger Causality
number of lags (no zero) 3
ssr based F test: F=5.2989 , p=0.0014 , df_denom=279, df_num=3
ssr based chi2 test: chi2=16.2955 , p=0.0010 , df=3
likelihood ratio test: chi2=15.8482 , p=0.0012 , df=3
parameter F test: F=5.2989 , p=0.0014 , df_denom=279, df_num=3
The p-values are < 0.05 at all tested lags, so we reject H0 and conclude that messages sent by person 1 (Stewie Griffin) Granger-cause the replies of person 2 (Peter Griffin).
Let's test the same in the opposite direction. Null Hypothesis (H0): Person 2 (Peter Griffin) does not Granger-cause person 1 (Stewie Griffin). Alternative Hypothesis (H1): Person 2 (Peter Griffin) Granger-causes person 1 (Stewie Griffin).
# Test whether Peter (second column) Granger-causes Stewie (first column)
grangercausalitytests(gct_df_diff.loc[:, ['Stewie Griffin', 'Peter Griffin']],
                      maxlag=3);
Granger Causality
number of lags (no zero) 1
ssr based F test: F=2.2611 , p=0.1338 , df_denom=285, df_num=1
ssr based chi2 test: chi2=2.2849 , p=0.1306 , df=1
likelihood ratio test: chi2=2.2759 , p=0.1314 , df=1
parameter F test: F=2.2611 , p=0.1338 , df_denom=285, df_num=1
Granger Causality
number of lags (no zero) 2
ssr based F test: F=1.3225 , p=0.2681 , df_denom=282, df_num=2
ssr based chi2 test: chi2=2.6919 , p=0.2603 , df=2
likelihood ratio test: chi2=2.6793 , p=0.2619 , df=2
parameter F test: F=1.3225 , p=0.2681 , df_denom=282, df_num=2
Granger Causality
number of lags (no zero) 3
ssr based F test: F=0.5723 , p=0.6336 , df_denom=279, df_num=3
ssr based chi2 test: chi2=1.7601 , p=0.6237 , df=3
likelihood ratio test: chi2=1.7547 , p=0.6248 , df=3
parameter F test: F=0.5723 , p=0.6336 , df_denom=279, df_num=3
This time the p-values are well above 0.05, so we do not have enough evidence to reject H0: there is no indication that Peter's messages Granger-cause Stewie's replies.
IV. Predictions
This section compares several time series models and evaluates which one performs best.
We will work on weekly resampled data and model the number of messages sent by the group per week. The whole dataset is presented below.
# Assign the data
df_pred = df['Name'].resample(rule='W').count().to_frame('Count')
# Plot
df_pred.plot(figsize=(15,5));
Data Preparation
Train/Test Split
We will evaluate the models on the last 52 weeks of data (the last ~1 year serves as the test set and the preceding data as the training set).
range_pred = 52
# Train/Test split
train = df_pred[:-range_pred]
test = df_pred[-range_pred:]
# Plot
train['Count'].plot(legend = True, label='Train', figsize=(15,5));
test['Count'].plot(legend = True, label='Test', figsize=(15,5));
print('Train Set Dates')
print(train.index.min())
print(train.index.max())
print('Test Set Dates')
print(test.index.min())
print(test.index.max())
Train Set Dates
2016-08-07 00:00:00
2021-02-21 00:00:00
Test Set Dates
2021-02-28 00:00:00
2022-02-20 00:00:00
Check for zero values
Weeks with zero messages will be replaced by the previous week’s value.
# Replace 0 with the previous row's value
train['Count'] = train['Count'].replace(to_replace=0, method='ffill')
train[train['Count'] == 0].count()
Count 0
dtype: int64
Build and Evaluate Models
Several models will be built and evaluated based on MAE (mean absolute error) and RMSE (root mean squared error), computed by the function below (the formulas are given after it).
from sklearn.metrics import mean_absolute_error, mean_squared_error

def eval_models(true_test, pred_test, model_name):
    # Calculate Metrics
    MAE = mean_absolute_error(true_test, pred_test)
    RMSE = np.sqrt(mean_squared_error(true_test, pred_test))
    # Save as a one-row dataframe
    data = [{'Model': model_name,
             'MAE': MAE,
             'RMSE': RMSE}]
    models_results = pd.DataFrame(data)
    return models_results
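For reference, over a test set of n weeks with actual counts y_t and predictions ŷ_t, the two metrics are:

$$\mathrm{MAE}=\frac{1}{n}\sum_{t=1}^{n}\left|y_t-\hat{y}_t\right|, \qquad \mathrm{RMSE}=\sqrt{\frac{1}{n}\sum_{t=1}^{n}\left(y_t-\hat{y}_t\right)^2}$$

RMSE penalizes large errors more heavily than MAE.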
0. Baseline - train set average
Further models will be compared to this baseline (the average of the training set): without any modeling, we would simply use the historical average to predict the future.
# Check the mean of train set
test['Train mean'] = train['Count'].mean()
train['Count'].mean()
1161.0504201680671
Plot Results:
Evaluate:
eval_models(test['Count'], test['Train mean'], "Baseline - train mean")
| | Model | MAE | RMSE |
|---:|:----------------------|--------:|--------:|
| 0 | Baseline - train mean | 966.743 | 972.358 |
1. Simple Exponential Smoothing - Single, Double and Triple
Building four models:
- SES with a weight parameter (alpha). A larger alpha (i.e., closer to 1) gives more weight to the more recent observations (see the update equation below).
- DES, where additionally a trend or seasonal component is added.
- TES, where both trend and seasonal components are included in the model. (When a seasonal component is included, a Box-Cox transformation is used to avoid negative predictions.)
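As a reminder of how alpha acts as a weighting factor, the basic SES recursion is

$$\hat{y}_{t+1}=\alpha\,y_t+(1-\alpha)\,\hat{y}_t,$$

so each forecast blends the latest observation with the previous forecast, and older observations receive exponentially decaying weights.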
from statsmodels.tsa.holtwinters import ExponentialSmoothing

#1 SES
fitted_model_1 = ExponentialSmoothing(train['Count']).fit()
#2 DES (seasonal component)
fitted_model_2 = ExponentialSmoothing(train['Count'], seasonal='add').fit(
    use_boxcox=True)
#3 DES (trend component, with 'damped' True)
fitted_model_3 = ExponentialSmoothing(train['Count'], trend='add',
                                      damped=True).fit()
#4 TES (trend + seasonal components)
fitted_model_4 = ExponentialSmoothing(train['Count'], trend='add',
                                      seasonal='add').fit(use_boxcox=True)
Plot Results:
Zoom-in to the test set only:
Evaluate:
| | Model | MAE | RMSE |
|---:|:------------------------------------|--------:|--------:|
| 0 | Baseline - train mean | 966.743 | 972.358 |
| 0 | SES | 106.043 | 125.102 |
| 0 | SES with seasonal component | 209.201 | 260.403 |
| 0 | SES with trend component | 115.664 | 134.36 |
| 0 | SES with trend + seasonal component | 170.096 | 206.173 |
All Exponential Smoothing techniques performed much better than the baseline approach. The simplest variant, plain SES with a single smoothing parameter (alpha), performed best.
2. ARIMA
Let's check one of the most popular models for time series forecasting - ARIMA. Parameters are selected with the auto_arima function based on AIC, and the chosen model is then used for prediction.
from pmdarima import auto_arima

fitted_model_5 = auto_arima(train['Count'], start_p=0, start_q=0,
                            seasonal=True, trace=True, m=range_pred)
Best model: ARIMA(1,1,1)(0,0,0)[52]
# Model Prediction
fitted_model_5_fcast = fitted_model_5.predict(n_periods=len(test))
Plot Results:
Evaluate:
print(all_results.to_markdown())
| | Model | MAE | RMSE |
|---:|:------------------------------------|--------:|--------:|
| 0 | Baseline - train mean | 966.743 | 972.358 |
| 0 | SES | 106.043 | 125.102 |
| 0 | SES with seasonal component | 209.201 | 260.403 |
| 0 | SES with trend component | 115.664 | 134.36 |
| 0 | SES with trend + seasonal component | 170.096 | 206.173 |
| 0 | ARIMA | 252.765 | 273.249 |
Surprisingly, ARIMA performed poorly compared to the other models (high RMSE). One reason could be that auto_arima selected a plain ARIMA rather than a Seasonal ARIMA, despite the seasonality present in our data. Weekly data can be challenging to model because the seasonal period (the number of weeks in a year) varies from year to year, and it had to be fixed in auto_arima via the m parameter.
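A possible follow-up, not run in the original analysis: fix the order of seasonal differencing so that auto_arima cannot fall back to a purely non-seasonal model (the variable name below is hypothetical, and the search is considerably slower with m=52):

from pmdarima import auto_arima

# D=1 imposes one round of seasonal differencing at period m=52
seasonal_candidate = auto_arima(train['Count'], start_p=0, start_q=0,
                                seasonal=True, m=52, D=1, trace=True,
                                suppress_warnings=True, error_action='ignore')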
3. Facebook Prophet
Let's test Facebook's Prophet library for forecasting. It requires the data to be converted into a specific two-column format (ds for the timestamp, y for the value) before modeling.
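A minimal sketch of that conversion, assuming the train/test split from above (the original preparation code was removed):

train_propeth = train['Count'].reset_index().rename(columns={'Date': 'ds', 'Count': 'y'})
test_propeth = test['Count'].reset_index().rename(columns={'Date': 'ds', 'Count': 'y'})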
print(train_propeth.head().to_markdown())
| | ds | y |
|---:|:--------------------|----:|
| 0 | 2016-08-07 00:00:00 | 182 |
| 1 | 2016-08-14 00:00:00 | 548 |
| 2 | 2016-08-21 00:00:00 | 548 |
| 3 | 2016-08-28 00:00:00 | 548 |
| 4 | 2016-09-04 00:00:00 | 6 |
print(test_propeth.head().to_markdown())
| | ds | y |
|---:|:--------------------|----:|
| 0 | 2021-02-28 00:00:00 | 150 |
| 1 | 2021-03-07 00:00:00 | 224 |
| 2 | 2021-03-14 00:00:00 | 100 |
| 3 | 2021-03-21 00:00:00 | 280 |
| 4 | 2021-03-28 00:00:00 | 216 |
Build the model:
from prophet import Prophet  # `from fbprophet import Prophet` in older versions

# Build a model
m = Prophet(seasonality_mode='multiplicative')
# Fit to train data
m.fit(train_propeth)
# Assign periods to predict and freq
future = m.make_future_dataframe(periods=range_pred, freq='W')
# Forecast
forecast = m.predict(future)
Additional Capabilities of Prophet:
Prophet has multiple built-in functions for investigating the time series. Let's check out two of them - the components plot and the changepoints plot.
# Components Plot
m.plot_components(forecast);
Changepoint Plot - shows how the trend changed; the red lines indicate where significant trend changes occurred.
from prophet.plot import add_changepoints_to_plot

fig = m.plot(forecast)
a = add_changepoints_to_plot(fig.gca(), m, forecast)
Plot Results - Built-in Prophet capabilities:
m.plot(forecast);
Plot Results - Zoom in to the test set only using our previous plotting approach:
Evaluate:
print(all_results.to_markdown())
| | Model | MAE | RMSE |
|---:|:------------------------------------|--------:|--------:|
| 0 | Baseline - train mean | 966.743 | 972.358 |
| 0 | SES | 106.043 | 125.102 |
| 0 | SES with seasonal component | 209.201 | 260.403 |
| 0 | SES with trend component | 115.664 | 134.36 |
| 0 | SES with trend + seasonal component | 170.096 | 206.173 |
| 0 | ARIMA | 252.765 | 273.249 |
| 0 | Prophet | 221.184 | 257.256 |
4. ARIMAX - with exogenous variable
The last method we want to test is ARIMA again, but with an exogenous variable: holidays. The assumption is that the number of messages sent during a week could be affected by the number of holidays falling in that week.
Prepare the data:
# holiday.csv contains Polish holidays for the years 2001 to 2025
holiday_df_all = pd.read_csv('holiday.csv', parse_dates=['Date'], index_col='Date')
# Filter holidays for the dates in our dataset
holiday_df = holiday_df_all.loc['2016-08-04':'2022-02-15']
| Date | Holiday_Name |
|:--------------------|:---------------------------------------|
| 2016-08-15 00:00:00 | Wniebowziecie Najswietszej Marii Panny |
| 2016-11-01 00:00:00 | Uroczystosc Wszystkich Swietych |
| 2016-11-11 00:00:00 | Narodowe Swieto Niepodleglosci |
| 2016-12-25 00:00:00 | pierwszy dzien Bozego Narodzenia |
| 2016-12-26 00:00:00 | drugi dzien Bozego Narodzenia |
Add the holiday column to our dataset.
# df_sarimax is a copy of the raw message data (assumed; this step was removed from the post)
df_sarimax = df.copy()
# Resample daily
df_sarimax_res_d = df_sarimax['Name'].resample(rule='D').count().to_frame('Count')
# Attach the holiday column to the daily-resampled data
df_sarimax_res_d = df_sarimax_res_d.join(holiday_df)
# Flag holidays: 1 if the day is a holiday, 0 otherwise
df_sarimax_res_d['Holiday_Name_fill'] = np.where(
    df_sarimax_res_d['Holiday_Name'].isnull(), 0, 1)
| Date | Count | Holiday_Name | Holiday_Name_fill |
|:--------------------|--------:|---------------:|--------------------:|
| 2016-08-04 00:00:00 | 36 | nan | 0 |
| 2016-08-05 00:00:00 | 146 | nan | 0 |
| 2016-08-06 00:00:00 | 0 | nan | 0 |
| 2016-08-07 00:00:00 | 0 | nan | 0 |
| 2016-08-08 00:00:00 | 0 | nan | 0 |
Plot the daily data and overlay the holidays (a grey vertical line marks each holiday). A couple of the blue peaks on the plot coincide with holidays.
ax = df_sarimax_res_d['Count'].plot()
# Overlap if holiday
for x in df_sarimax_res_d.query('Holiday_Name_fill>=1').index:
ax.axvline(x=x, color='k', alpha = 0.3);
Now, prepare the data for prediction (resampled weekly). We need to count the number of holidays in each week and attach this information to the train and test datasets.
# Resample Weekly
df_sarimax_final = df_sarimax_res_d['Count'].resample(rule='W').sum().to_frame('Count')
# Sum number of holidays per week
df_sarimax_final['Count_Holiday'] = df_sarimax_res_d['Holiday_Name_fill'].resample(
rule='W').sum().to_frame('Count_Holiday')
| Date | Count | Count_Holiday |
|:--------------------|--------:|----------------:|
| 2016-08-07 00:00:00 | 182 | 0 |
| 2016-08-14 00:00:00 | 548 | 0 |
| 2016-08-21 00:00:00 | 0 | 1 |
| 2016-08-28 00:00:00 | 0 | 0 |
| 2016-09-04 00:00:00 | 6 | 0 |
Plot the weekly data and overlay the holidays (grey vertical lines mark the holidays).
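Before modeling, the weekly dataset with the exogenous column is split into train and test the same way as before; a minimal sketch, since the original split code was removed:

# Same 52-week holdout as for the earlier models
train_ex = df_sarimax_final[:-range_pred]
test_ex = df_sarimax_final[-range_pred:]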
Build ARIMAX model:
fitted_model_5_ex = auto_arima(train_ex['Count'],
exogenous = train_ex[['Count_Holiday']],
seasonal = True, trace=False, m=range_pred)
ARIMA(order=(1, 1, 1), scoring_args={}, seasonal_order=(0, 0, 0, 52),
suppress_warnings=True, with_intercept=False)
ARIMAX Predictions:
fitted_model_5_ex_fcast = fitted_model_5_ex.predict(
n_periods=len(test_ex), exogenous = test_ex[['Count_Holiday']])
Plot Results:
Interestingly, the holiday variable slightly decreased the predictions during holiday periods.
Evaluate:
print(all_results_plot.to_markdown())
| Model | MAE | RMSE |
|:------------------------------------|--------:|--------:|
| Baseline - train mean | 966.743 | 972.358 |
| SES | 106.043 | 125.102 |
| SES with seasonal component | 209.201 | 260.403 |
| SES with trend component | 115.664 | 134.36 |
| SES with trend + seasonal component | 170.096 | 206.173 |
| ARIMA | 252.765 | 273.249 |
| Prophet | 221.184 | 257.256 |
| ARIMA Exogenous | 238.963 | 259.187 |
ARIMA with an exogenous variable performed slightly better than plain ARIMA, but its RMSE is still relatively high compared to the other methods.
V. Conclusion
all_results_plot[['RMSE', 'MAE']].sort_values(ascending=True, by='RMSE').plot(
kind = 'bar', rot=60);
All models performed significantly better than the baseline. The best results (lowest RMSE) were achieved by the simplest model: Simple Exponential Smoothing (SES) with just one smoothing parameter (alpha).
VI. Forecast the Unknown Future
In this last part, we make forecasts into the unknown future, beyond the data in the dataset. We retrain the models on the whole dataset (train + test) and predict another 52 weeks (from the week of 2022-02-27 until the week of 2023-02-19). We won't evaluate these models, since the future is unknown, but will do so in a couple of weeks or months.
Define the data
We use the same dataframe df_pred, which already holds the weekly resampled data, but this time we build the models on the whole dataset.
df_pred_forecast = df_pred
print('From week: ', df_pred_forecast.index.min())
print('To week: ', df_pred_forecast.index.max())
From week: 2016-08-07 00:00:00
To week: 2022-02-20 00:00:00
Rebuild Models
1. Simple Exponential Smoothing - Single, Double and Triple
#1 SES
fitted_model_1_future = ExponentialSmoothing(
    df_pred_forecast['Count']).fit()
#2 DES (seasonal component)
fitted_model_2_future = ExponentialSmoothing(
    df_pred_forecast['Count'], seasonal='add').fit(use_boxcox=True)
#3 DES (trend component, with 'damped' True)
fitted_model_3_future = ExponentialSmoothing(
    df_pred_forecast['Count'], trend='add', damped=True).fit()
#4 TES (trend + seasonal components)
fitted_model_4_future = ExponentialSmoothing(
    df_pred_forecast['Count'], trend='add', seasonal='add').fit(use_boxcox=True)
2. ARIMA
# AutoARIMA
fitted_model_5_future = auto_arima(df_pred_forecast['Count'], start_p=0,
start_q=0, seasonal=True, trace=False, m=range_pred)
ARIMA(order=(1, 1, 1), scoring_args={}, seasonal_order=(0, 0, 0, 52),
suppress_warnings=True, with_intercept=False)
3. Facebook Prophet Model
In this dataset, the very last data points approach zero, and Prophet forecasts negative values for the future. We therefore apply a Box-Cox transformation before prediction and invert the transformation for plotting.
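A hedged sketch of that transformation (the original code was removed, so everything except the Prophet_full name used below is an assumption):

from scipy.stats import boxcox
from scipy.special import inv_boxcox

# Prophet input on the full dataset: columns 'ds' and 'y', with y Box-Cox transformed
Prophet_full = df_pred_forecast['Count'].reset_index().rename(
    columns={'Date': 'ds', 'Count': 'y'})
Prophet_full['y'], lmbda = boxcox(Prophet_full['y'] + 1)  # +1 because Box-Cox needs y > 0

# After forecasting (below), the predictions are inverted for plotting, e.g.:
# future_future_fst_6['yhat'] = inv_boxcox(future_future_fst_6['yhat'], lmbda) - 1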
# Build a model
m_future_model_6 = Prophet(seasonality_mode= 'multiplicative')
# Fit to data
m_future_model_6.fit(Prophet_full)
# Assign periods to predict and freq
future_future_fst_6= m_future_model_6.make_future_dataframe(
periods = range_pred, freq='W')
# Forecast
future_future_fst_6 = m_future_model_6.predict(future_future_fst_6)
4. ARIMAX - with exogenous variable
Build the model:
# Combine train and test data used in previous ARIMAX model
df_pred_forecast_ex = train_ex[['Count', 'Count_Holiday']].append(
test_ex[['Count', 'Count_Holiday']])
# AutoARIMA with the same settings as the previous ARIMAX model
fitted_model_7_ex = auto_arima(df_pred_forecast_ex['Count'],
exogenous = df_pred_forecast_ex[['Count_Holiday']],
seasonal = True, trace=False, m=range_pred)
ARIMA(order=(2, 1, 2), scoring_args={}, seasonal_order=(0, 0, 0, 52),
suppress_warnings=True, with_intercept=False)
Define the data for predictions:
Since we use an exogenous variable, we need to attach a column with holiday counts for the future dates to support the predictions.
# Daily date range covering the 365 days after the last observation
future_days = pd.DataFrame(index=pd.date_range(start=df.index.max().strftime('%Y-%m-%d'),
                                               periods=365, freq='D'))
# Attach the holiday column and flag holidays
future_days = future_days.join(holiday_df_all)
future_days['Holiday_flag'] = np.where(future_days['Holiday_Name'].isnull(), 0, 1)
# Aggregate to weekly holiday counts to match the weekly model
# (assumed step; the original aggregation code was not shown)
dates_to_predict = future_days['Holiday_flag'].resample('W').sum().to_frame('Count_Holiday')
Predictions:
fitted_model_7_ex_future = fitted_model_7_ex.predict(
n_periods=len(dates_to_predict),
exogenous = dates_to_predict[['Count_Holiday']])
Combine Forecasts and Plot the future
# forecast_future is a dataframe indexed by the 52 future weeks, and the
# *_fcast_future series below are each model's forecast over those weeks
# (their construction was removed from the post for brevity)
# 1. SES
forecast_future['SES (FUTURE)'] = fitted_model_1_fcast_future
forecast_future['SES with seasonal component (FUTURE)'] = fitted_model_2_fcast_future
forecast_future['SES with trend component (FUTURE)'] = fitted_model_3_fcast_future
forecast_future['SES with trend + seasonal component (FUTURE)'] = fitted_model_4_fcast_future
# 2. ARIMA
forecast_future['ARIMA (FUTURE)'] = fitted_model_5_fcast_future
# 3. Prophet
forecast_future = pd.merge(forecast_future, forecast_future_to_store_m_6,
left_index=True, right_on='Date')
forecast_future = forecast_future.set_index('Date')
# 4. ARIMAX
forecast_future['ARIMA Exogenous (FUTURE)'] = fitted_model_7_ex_future
# Plot
forecast_future.plot(figsize=(15,10));
In a couple of weeks or months, we will come back to these forecasted values and see which method came closest to the truth.