The latest documentation can be found at https://gtbpy.readthedocs.io/en/latest/.
GTBpy is a toolbox developed for the thesis "Predicting Housing Prices in Canada Using News Sentiments, Bubble Indicators, and Google Trends," authored by Mohammad Ebrahimi under the supervision of Simon van Norden and Vincent Grégoire (2024, HEC Montréal). This toolbox provides statistical tests for detecting financial bubbles and comparing forecast performance. Additionally, it introduces a novel predictive index derived from Google Trends data for a target variable.
The financial bubble detection tests implemented in GTBpy are based on the following academic studies:
- PWY (SADF) test: Peter C.B. Phillips, Yangru Wu, and Jun Yu (2011). Explosive behavior in the 1990s Nasdaq: when did exuberance escalate stock values?, International Economic Review, 52, 201-226.
- PSY (GSADF) test: Peter C.B. Phillips, Shuping Shi, and Jun Yu (2015). Testing for Multiple Bubbles: Limit Theory of Real Time Detectors, International Economic Review, 56, 1079-1134.
- WHL test: Emily J. Whitehouse, David I. Harvey, and Stephen J. Leybourne (2023). Real-Time Monitoring of Bubbles and Crashes, Oxford Bulletin of Economics and Statistics, 85(3), 482-513.
GTBpy also includes statistical tests for comparing the performance of competing forecasts, based on the following papers:
- Equal MAFE/MSFE test: David I. Harvey, Stephen J. Leybourne, Paul Newbold (1997). Testing the equality of prediction mean squared errors, International Journal of Forecasting, 13(2), 281-291.
- Forecast Encompassing test: David I. Harvey, Stephen J. Leybourne, Paul Newbold (1998). Tests for Forecast Encompassing, Journal of Business & Economic Statistics, 16(2), 254-259.
In addition to the above statistical tools, GTBpy introduces a new predictive index constructed from Google Trends data for a target variable. This methodology is inspired by the following study:
- Google Trends: Stig V. Møller, Thomas Pedersen, Erik C. M. Schutte, Allan Timmermann (2023). Search and Predictability of Prices in the Housing Market, Management Science 70(1), 415-438.
In this documentation, we first demonstrate the auto-regressive model GTBpy uses to forecast the target variable. Second, we detail the computation of the bubble indicators, the forecast-comparison tests, and the Google Trends predictive index.
The forecasting model can be specified as an autoregressive model with exogenous variables (ARX), where the real estate price index is regressed on its own lagged values and lagged values of the exogenous variable. The aim is to improve the forecasting power of a simple AR model by adding the exogenous variable: this specification allows the model to leverage both historical price data and the predictor indices to predict future price movements. Our model also includes a constant and seasonal dummies. We forecast the HPI for 1, 3, 6, and 12 months ahead:

$$Y_{t+h} = \alpha + \sum_{i=1}^{m} \beta_i Y_{t-i+1} + \sum_{j=1}^{n} \gamma_j X_{t-j+1} + \sum_{s=1}^{11} \delta_s D_{s,t} + \varepsilon_{t+h} \tag{1}$$

where $Y_{t+h}$ is the forecasted HPI (or preprocessed HPI), $X_t$ is the exogenous predictor, $D_{s,t}$ are monthly seasonal dummies, $m$ and $n$ are the numbers of HPI and predictor lags, and $h \in \{1, 3, 6, 12\}$ is the forecast horizon.
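As a minimal, illustrative sketch (plain numpy on simulated data; variable names are ours, and the seasonal dummies of the full model are omitted for brevity — this is not the GTBpy implementation), an ARX specification of this form can be estimated by OLS:

```python
import numpy as np

rng = np.random.default_rng(0)
T, m, n, h = 240, 3, 1, 1

# Simulated (preprocessed) HPI y and one exogenous predictor x.
x = rng.standard_normal(T)
y = np.zeros(T)
for t in range(1, T):
    y[t] = 0.7 * y[t - 1] + 0.3 * x[t - 1] + 0.1 * rng.standard_normal()

# h-step-ahead ARX design: constant + m lags of y + n lags of x.
p = max(m, n)
t_idx = np.arange(p, T - h)                # rows where all lags and the target exist
lags_y = np.column_stack([y[t_idx - i] for i in range(m)])
lags_x = np.column_stack([x[t_idx - j] for j in range(n)])
X = np.column_stack([np.ones(len(t_idx)), lags_y, lags_x])
target = y[t_idx + h]                      # Y_{t+h}

beta, *_ = np.linalg.lstsq(X, target, rcond=None)
fitted = X @ beta                          # in-sample h-step forecasts
```

With enough data the OLS coefficients recover the simulated dynamics (here roughly 0.7 on the first HPI lag and 0.3 on the predictor lag).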
The Generalized Supremum Augmented Dickey-Fuller (GSADF) test is an extension of the Supremum Augmented Dickey-Fuller (SADF) test, which is used to identify the presence of explosive behavior or bubbles in time series data. The GSADF test enhances the flexibility and robustness of the SADF test by using a backward recursive approach, allowing for the detection of multiple bubbles within a single time series. We outline the steps and formulas involved in calculating the BADF and SADF test statistics proposed by Phillips, Wu & Yu (2011) and the BSADF and GSADF test statistics proposed by Phillips, Shi & Yu (2015).
All mentioned test statistics are based on the Augmented Dickey-Fuller (ADF) unit-root test, which is in turn based on the following regression, estimated on the subsample running from fraction $r_1$ to fraction $r_2$ of the sample:

$$\Delta y_t = \hat{\alpha}_{r_1,r_2} + \hat{\beta}_{r_1,r_2}\, y_{t-1} + \sum_{i=1}^{k} \hat{\psi}_{r_1,r_2}^{i}\, \Delta y_{t-i} + \hat{\varepsilon}_t$$

In the context of this model, $y_t$ is the series under examination, $k$ is the lag order, and the ADF statistic, $ADF_{r_1}^{r_2}$, is the t-ratio of $\hat{\beta}_{r_1,r_2}$. The unit-root null hypothesis corresponds to $\beta = 0$, tested against the right-tailed alternative of explosive behavior, $\beta > 0$.
The SADF test, introduced by Phillips, Wu, and Yu (2011), is a right-tailed unit root test applied recursively over a sequence of forward-expanding windows. The SADF test statistic is the supremum value of the Augmented Dickey-Fuller (ADF) statistic over these windows. The SADF test is defined as:

$$SADF(r_0) = \sup_{r_2 \in [r_0, 1]} ADF_{0}^{r_2}$$

where $r_0$ is the minimum window size expressed as a fraction of the sample, and each window starts at the first observation and ends at fraction $r_2$. The sequence of statistics $ADF_{0}^{r_2}$ is the BADF sequence.
The GSADF test extends the SADF test by incorporating a backward recursive mechanism. For each endpoint $r_2 \in [r_0, 1]$, the starting point $r_1$ is also allowed to vary over $[0, r_2 - r_0]$:

$$GSADF(r_0) = \sup_{r_2 \in [r_0, 1]} \; \sup_{r_1 \in [0,\, r_2 - r_0]} ADF_{r_1}^{r_2}$$

where the minimum window size is indicated by $r_0$. For a fixed endpoint $r_2$, the inner supremum over the starting points, $BSADF_{r_2}(r_0) = \sup_{r_1 \in [0,\, r_2 - r_0]} ADF_{r_1}^{r_2}$, is the backward sup ADF (BSADF) statistic.
The BSADF test statistic is used to identify periods of explosive behavior or bubbles in the time series. If the BSADF statistic exceeds the critical value from the right-tailed unit root distribution, it indicates the presence of an explosive period within the time series. Multiple such periods can be identified by applying the BSADF test statistic recursively over different sub-samples of the data.
The GSADF test provides a more flexible and powerful method for detecting bubbles in time series data compared to the traditional SADF test. By allowing for multiple starting points and applying a backward recursive mechanism, the GSADF test can identify periods of explosive behavior more effectively, making it a valuable tool in econometric analysis (PSY, 2015).
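The BADF/SADF recursion can be sketched in plain numpy (zero-lag ADF regressions on simulated series; an illustration of the mechanics only — GTBpy's `PWY_SADF` and `PSY_GSADF` classes implement the full procedures with critical values):

```python
import numpy as np

def adf_stat(y):
    """t-ratio of beta in: diff(y)_t = alpha + beta * y_{t-1} + e_t (lag order k = 0)."""
    dy, ylag = np.diff(y), y[:-1]
    X = np.column_stack([np.ones(len(ylag)), ylag])
    beta, *_ = np.linalg.lstsq(X, dy, rcond=None)
    resid = dy - X @ beta
    s2 = resid @ resid / (len(dy) - 2)
    se = np.sqrt(s2 * np.linalg.inv(X.T @ X)[1, 1])
    return beta[1] / se

def sadf(y, r0=0.2):
    """Sup of ADF statistics over forward-expanding windows anchored at the start (PWY)."""
    w0 = int(np.floor(r0 * len(y)))
    badf = np.array([adf_stat(y[:t]) for t in range(w0, len(y) + 1)])
    return badf.max(), badf

rng = np.random.default_rng(1)
walk = np.cumsum(rng.standard_normal(200))   # pure random walk: no bubble
expl = np.empty(100)                         # mildly explosive AR(1): a bubble
expl[0] = 1.0
for t in range(1, 100):
    expl[t] = 1.05 * expl[t - 1] + 0.01 * rng.standard_normal()

sadf_walk, _ = sadf(walk)
sadf_expl, _ = sadf(expl)                    # far larger: explosiveness is detected
```

The explosive series yields a very large supremum statistic, while the random walk stays near typical unit-root values; in practice the statistic is compared with simulated critical values, as in `critical_values(m=500)` below.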
import pandas as pd
import numpy as np
from GTBpy.bubble_indicators import PWY_SADF, PSY_GSADF, WHL
from GTBpy.google_trends_functions import GoogleTrend, plot_improvement
data_psy = pd.read_excel('SP_DV.xlsx')
data_whl = pd.read_csv('US price:rent.csv', parse_dates=[5])[['TIME', 'Value']]
data_whl = data_whl.set_index('TIME')

pwy_obj = PWY_SADF(data_psy, njit=True)
sadf, badfs = pwy_obj.stats()
cv_sadf, cv_badf = pwy_obj.critical_values(m=500)
pwy_obj.plot()

psy_obj = PSY_GSADF(data_psy, njit=True)
gsadf, bsadfs = psy_obj.stats()
cv_gsadf, cv_bsadfs = psy_obj.critical_values(m=500)
psy_obj.plot()

The detection of financial bubbles and crashes is critical for mitigating economic damage. Whitehouse, Harvey, and Leybourne (2023) propose real-time monitoring procedures to detect such phenomena. This section focuses on the calculation of two key statistics:
The
where
In practice, the
where
The
Consider the model expressed in the first differences near the endpoint of the explosive regime:
Here,
This statistic is standardized to account for possible variance changes, resulting in:
The critical value for this statistic,
Monitoring for a stationary collapse begins after the detection of an explosive regime. A collapse is detected if:
Whitehouse et al. demonstrate the efficacy of these statistics through
simulations and empirical application to the US housing market. The
The
Overall, while both sets of indicators aim to identify bubble periods, the PSY indicators provide a more stable and gradual representation, potentially capturing longer-term bubble dynamics. The WHL indicators, with their high volatility, seem to be more responsive to short-term market movements. The choice between these methods may depend on whether the focus is on capturing broad, sustained bubbles or detecting frequent, short-term deviations.
whl_obj = WHL(data_whl['Value'])
whl_obj.plot()

In the following, we construct multiple versions of the House Search Index (HSI) and select the best one to use as an exogenous variable in the forecasting model. To this end, we obtain data for 30 relevant search queries from Google Trends, including those suggested by Moller et al. (2023) and those obtained by following the paper's approach for Canada. The data contain monthly observations from 2004-01 to 2023-11. We use only data through 2019-12 to preprocess the data and train the model.
Some of the related terms may have more noise in their measurement than others. To eliminate the noise and more precisely estimate latent demand, Moller et al. (2023) employ a targeted principal component analysis (PCA) method that ensures only the most relevant search indices are included in calculating the latent demand factor. Specifically, their approach follows Bai and Ng (2008) as they use the elastic net estimator of Zou and Hastie (2005) to select the 10 most relevant search indices and then apply principal component analysis to summarize the most important information from these indices into one common component. Moller et al. (2023) interpret this principal component as a summary measure for housing search and refer to it as the HSI.
Moller et al. (2023) admit their main goal is to produce a simple and easy-to-interpret index of housing search, which is why they used a simple targeted PCA approach. Moreover, they claim the predictive results that they reported are generally highly robust to using more advanced machine learning techniques. Like Moller et al. (2023), we adopt the targeted principal component analysis (PCA) method to calculate the HSI. While Moller et al. (2023) construct one HSI, we construct various potential HSIs from different lags and combinations of the Google Trend series and select the best HSI for forecasting purposes.
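The targeted-PCA construction can be sketched as follows (plain numpy on simulated data; for brevity a simple correlation screen stands in for the elastic-net selection step of Moller et al. (2023), and all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
T, N = 180, 30
f = rng.standard_normal(T)              # latent housing-search demand factor
X = rng.standard_normal((T, N))         # 30 preprocessed search indices
X[:, :10] += f[:, None]                 # only the first 10 load on the factor
y = f + 0.5 * rng.standard_normal(T)    # target co-moves with the factor

# Step 1 (selection): keep the 10 indices most related to the target.
Z = (X - X.mean(0)) / X.std(0)
corr = np.abs(Z.T @ (y - y.mean()) / (T * y.std()))
top10 = np.argsort(corr)[-10:]

# Step 2 (summary): first principal component of the selected indices.
U, S, Vt = np.linalg.svd(Z[:, top10], full_matrices=False)
hsi = U[:, 0] * S[0]                    # PC1 scores: the (sketch) HSI
```

The screen recovers exactly the 10 factor-loaded indices, and their first principal component tracks the latent demand that drives the target (up to the usual sign indeterminacy of PCA).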
To find the best HSI, we take the following steps: First, we preprocess Google Trends search volume data and the HPI, including removing seasonal and trend factors. Second, we compute several HSIs by extracting the first principal component of search indices volumes. Third, we find the best lags for each HSI in the forecasting model using information criteria BIC. Finally, we pick the best HSI based on the cross-validation MSE in the forecasting model over the training sample.
We first convert the HPI series to its natural logarithm to study its proportional changes. It is well known that house prices display strong seasonal variation, with high prices during spring and summer and low prices during fall and winter. We therefore remove seasonality and trend by regressing the log HPI on monthly dummy variables and a linear time trend. We then study the residuals from this regression.
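The dummy-and-trend regression can be sketched as follows (plain numpy on simulated monthly data; illustrative only):

```python
import numpy as np

rng = np.random.default_rng(2)
T = 192                                  # 16 years of monthly observations
month = np.arange(T) % 12
trend = np.arange(T, dtype=float)

# Simulated log-HPI with a linear trend, monthly seasonality, and noise.
log_hpi = (0.003 * trend
           + 0.05 * np.sin(2 * np.pi * month / 12)
           + 0.01 * rng.standard_normal(T))

# Regress on a constant, 11 monthly dummies, and the trend; keep the residuals.
dummies = (month[:, None] == np.arange(1, 12)[None, :]).astype(float)
X = np.column_stack([np.ones(T), dummies, trend])
beta, *_ = np.linalg.lstsq(X, log_hpi, rcond=None)
resid = log_hpi - X @ beta               # deseasonalized, detrended series
```

The residuals carry only the noise component: the dummies absorb any pattern that repeats by calendar month, and the trend term absorbs the linear drift.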
We transform the search indices as follows: First, to be consistent with the treatment of the HPI, we remove seasonality and trend in the same way.
Second, while the HPI evolves relatively smoothly over time, search volume data are very volatile. To address this, we smooth the search volumes by converting them into their 12-month moving averages.
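The smoothing step is a trailing 12-month moving average, e.g. with pandas (stand-in data; illustrative only):

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2004-01-01", periods=24, freq="MS")
raw = pd.Series(np.arange(24, dtype=float), index=idx)  # stand-in search-volume series

# Trailing 12-month moving average; the first 11 values are undefined (NaN).
smooth = raw.rolling(window=12).mean()
```

Note that the first 11 observations of each smoothed series are lost (NaN), which shortens the usable sample accordingly.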
Google Trends Preprocessing Graphs for h=12
GT_object = GoogleTrend(relevant_df, '2019-12-31', verbose=1, seed=1, pca_lag=True)
GT_object.prepare_gtrends(detrend=True, deseasonal=True, method='reg', regression='ct', log=False, smooth=12)
GT_object.prepare_hpi(detrend=True, deseasonal=True, method='reg', regression='ct', log=True)
GT_object.plot_gtrends()

After preprocessing "house for sale" search volume and before preprocessing HPI

After preprocessing "house for sale" search volume and HPI

Computing HSIs consists of several steps and multiple layers. We
constitute a pool of input data, naming it
In layer k, we adopt the following procedure. First, for each forecast
horizon h, we try to predict HPI by regressing HPI on each series
Second, for each forecast horizon
Third, for each forecast horizon
Finally, we check if the minimum
After we finish creating HSIs, we define a set of best candidate
predictors for each forecast horizon. In this regard, we constitute a
set
To select the best predictor to use in the forecasting model, we first
need to specify the lags in the forecasting model (1). We find the best
combination of HPI lags and HSI lags by Bayesian Information Criterion
(BIC) for each candidate predictor in
After specifying the lags of each predictor of
Although each HSI is intended for a specific horizon, we consider all
candidate HSIs and search indices in
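The BIC-based search over lag pairs (m, n) described above can be sketched as follows (plain numpy on simulated data; the `bic_arx` helper is illustrative, not part of the GTBpy API):

```python
import numpy as np

def bic_arx(y, x, m, n, h=1):
    """BIC of an ARX(m, n) model forecasting y[t+h] (constant included, no dummies)."""
    p = max(m, n, 1)
    t_idx = np.arange(p, len(y) - h)
    cols = [np.ones(len(t_idx))]
    cols += [y[t_idx - i] for i in range(m)]
    cols += [x[t_idx - j] for j in range(n)]
    X = np.column_stack(cols)
    beta, *_ = np.linalg.lstsq(X, y[t_idx + h], rcond=None)
    resid = y[t_idx + h] - X @ beta
    nobs, k = len(t_idx), X.shape[1]
    # Note: a careful comparison would estimate every model on the same sample.
    return nobs * np.log(resid @ resid / nobs) + k * np.log(nobs)

rng = np.random.default_rng(3)
T = 300
x = rng.standard_normal(T)
y = np.zeros(T)
for t in range(2, T):
    y[t] = 0.5 * y[t - 1] + 0.3 * y[t - 2] + 0.4 * x[t - 1] + 0.1 * rng.standard_normal()

# Evaluate every (m, n) combination and keep the pair with the lowest BIC.
grid = {(m, n): bic_arx(y, x, m, n) for m in (1, 2, 3) for n in (0, 1, 2)}
best_m, best_n = min(grid, key=grid.get)
```

With the simulated ARX(2, 1) dynamics, the true specification attains a much lower BIC than an underfitted AR(1) without the exogenous variable.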
This flowchart illustrates the process of generating Ultimate Housing
Search Indices (UHSIs) and their integration into the forecasting model.
The process begins with the preprocessing of Google Trends (GTs) data,
followed by the addition of lagged features to the pool. HSIs are then
created through weighted random draws from the pool of GTs and their
lags, followed by bivariate regression across multiple forecast horizons
(h = 1, 3, 6, 12). The HSIs are generated as the first principal
component of the selected features. Layers are iteratively added to the
procedure until the minimum cross-validated (CV) error of the HSIs in
the most recent layer begins to increase, indicating the optimal depth
of the model. Finally, the "Ultimate HSI" (UHSI) is derived from a set
of best HSIs for each horizon h (Bh) and used as a predictor in the
final forecasting model.
Forecasting by UHSIs Graphs
GT_object.compute_hsi(layer=1, input_pool=True, auto_layer=False, n_input=10, n_hsi=20, max_lag=3, random='rnd', var_select='CV', selection_h_list=[0,1,3,6,12])
GT_object.compute_hsi(layer=2, input_pool=True, auto_layer=True, n_input=10, n_hsi=20, max_lag=1, random='rnd', var_select='CV', selection_h_list=[0,1,3,6,12])
GT_object.lag_setting(lag_select='IC')
GT_object.forecast(h_list=[1,3,6,12], seasonal=True, hsi_CV_select=True, fit_intercept=True, cv=5, n_iter=20, original_scale=False)
GT_object.plot_forecast('UHSI_1')
GT_object.plot_forecast('UHSI_3')
GT_object.plot_forecast('UHSI_6')
GT_object.plot_forecast('UHSI_12')

To assess the forecast accuracy and encompassing ability of competing models, we employ the Modified Diebold-Mariano (MDM) statistic developed by Harvey, Leybourne, and Newbold (1997, 1998). The MDM test, an extension of the original Diebold-Mariano (DM) test, is particularly suited for comparing the predictive accuracy of two competing forecasts, as it applies a finite-sample correction to the test's distribution. This methodology ensures a robust comparison by accounting for potential autocorrelation and heteroskedasticity in the forecast errors.
The following steps outline the process: First, we generate
out-of-sample forecasts from two competing models,
Similarly, for the null hypothesis
Finally, for the null hypothesis
The test statistic MDM is then defined as:

$$MDM = \left[\frac{n + 1 - 2h + n^{-1} h (h-1)}{n}\right]^{1/2} \frac{\bar{d}}{\sqrt{\hat{V}(\bar{d})}}$$

where h is the forecast horizon, $n$ is the number of out-of-sample forecasts, $\bar{d}$ is the sample mean of the loss differential, and $\hat{V}(\bar{d})$ is an estimator of its variance that accounts for the serial correlation induced by multi-step forecasts. Following Harvey, Leybourne, and Newbold (1997), the MDM statistic is compared with critical values from the Student-t distribution with $n - 1$ degrees of freedom.
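The MDM comparison under squared-error loss can be sketched in numpy/scipy (simulated forecast errors; the `mdm_test` helper is illustrative, not the GTBpy API):

```python
import numpy as np
from scipy import stats

def mdm_test(e1, e2, h=1):
    """Modified DM test (Harvey-Leybourne-Newbold) for equal MSFE of two error series."""
    d = e1 ** 2 - e2 ** 2                      # loss differential under squared-error loss
    n = len(d)
    dbar = d.mean()
    dc = d - dbar
    # Long-run variance estimate: gamma_0 + 2 * (autocovariances up to lag h-1).
    gammas = [dc @ dc / n] + [dc[k:] @ dc[:-k] / n for k in range(1, h)]
    var_dbar = (gammas[0] + 2.0 * sum(gammas[1:])) / n
    dm = dbar / np.sqrt(var_dbar)
    mdm = np.sqrt((n + 1 - 2 * h + h * (h - 1) / n) / n) * dm
    pval = 2 * stats.t.sf(abs(mdm), df=n - 1)  # two-sided p-value, t(n-1)
    return mdm, pval

rng = np.random.default_rng(4)
e_base = 1.5 * rng.standard_normal(100)        # base-model forecast errors (noisier)
e_alt = rng.standard_normal(100)               # competing-model forecast errors
stat, p = mdm_test(e_base, e_alt, h=1)         # positive stat: the competitor wins
```

Swapping the squared-error loss for absolute error gives the equal-MAFE version of the test; the one-sided variants used for the hypotheses in the tables only change how the p-value is read off the t distribution.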
We evaluate the forecasting power of the selected predictors. We establish an autoregressive model without any exogenous variables as the base model. Then, we assess the impact of incorporating UHSIs derived from Google Trends into the base model. These indices are based on the HSI introduced by Moller et al. (2023), who demonstrated that HSI effectively forecasts detrended and deseasonalized HPI. Accordingly, we apply the same preprocessing techniques to HPI in the third section.
We also report the Mean Absolute Forecast Error (MAFE) and Mean Squared Forecast Error (MSFE) for all forecast models. Subsequently, we quantify the improvement in MAFE or MSFE achieved by each forecast model with predictors, expressed as a percentage relative to the base model. Additionally, we employ the Modified Diebold-Mariano (MDM) test to compare all forecasting models against the base model under the null hypotheses $H_{0,1}$, $H_{0,2}$, and $H_{0,3}$ shown in the tables, and we report the corresponding p-values.
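The accuracy measures and percentage improvements can be computed as follows (one plausible definition of "improvement", with illustrative numbers):

```python
import numpy as np

def mafe(errors):
    """Mean Absolute Forecast Error."""
    return np.mean(np.abs(errors))

def msfe(errors):
    """Mean Squared Forecast Error."""
    return np.mean(np.asarray(errors) ** 2)

base_err = np.array([0.02, -0.01, 0.03, -0.02])   # base-model forecast errors
model_err = np.array([0.01, -0.01, 0.02, -0.01])  # model-with-predictor errors

# Percentage improvement relative to the base model (positive = better).
mafe_impr = 100 * (1 - mafe(model_err) / mafe(base_err))
msfe_impr = 100 * (1 - msfe(model_err) / msfe(base_err))
```

Because MSFE squares the errors, the MSFE improvement typically exceeds the MAFE improvement when the competing model avoids the base model's largest misses, a pattern visible throughout the tables below.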
Google Trends data are available from January 2004, and our sample spans January 2004 to November 2023. We use data until the end of 2019 to estimate the model and data from January 2020 to November 2023 to compute forecast accuracy measures such as MAFE and MSFE. To find the lags for the predictors on the right-hand side of the forecasting model, we consider all combinations of consecutive lags of the HPI and the exogenous variable and choose the combination with the lowest BIC.
First, we start forecasting with a set named G1, which consists of search queries relevant to the real estate market that a typical homebuyer uses while searching for a home, representing market demand. Second, to check whether the results are reliable, we repeat the forecasting using a set of irrelevant keywords called G2 and examine how the new results differ. Finally, we investigate the HSI algorithm's performance in picking the most informative search data: we combine all relevant and irrelevant keywords and use them to generate HSIs for forecasting. We call this last set G3.
G1_rel = GoogleTrend(relevant_df, '2019-12-31', verbose=1, seed=1, pca_lag=True)
G1_rel.prepare_gtrends(detrend=True, deseasonal=True, method='reg', regression='ct', log=False, smooth=12)
G1_rel.prepare_hpi(detrend=True, deseasonal=True, method='reg', regression='ct', log=True)
G1_rel.compute_hsi(layer=1, input_pool=True, auto_layer=False, n_input=10, n_hsi=20, max_lag=3, random='rnd', var_select='CV', selection_h_list=[0,1,3,6,12])
G1_rel.compute_hsi(layer=2, input_pool=True, auto_layer=True, n_input=10, n_hsi=20, max_lag=1, random='rnd', var_select='CV', selection_h_list=[0,1,3,6,12])
G1_rel.lag_setting(lag_select='IC')
G1_rel.forecast(h_list=[1,3,6,12], seasonal=True, hsi_CV_select=True, fit_intercept=True, cv=5, n_iter=20, original_scale=False)
G1_rel.results(results_table=True, head=True)

Table 1: Forecast by UHSIs using relevant keywords
| Forecast Horizon | Predictor | m | n | MAFE | MSFE | MAFE Improvement | MSFE Improvement | p-value (H0,1) | p-value (H0,2) | p-value (H0,3) |
|---|---|---|---|---|---|---|---|---|---|---|
| h=1 | None | 3 | 0 | 0.00850 | 0.00010 | -- | -- | -- | -- | -- |
| | UHSI_1 | 3 | 3 | 0.00825 | 0.00009 | 2.9% | 12.7% | 0.6684 | 0.2044 | 0.3378 |
| | UHSI_3 | 3 | 1 | 0.00812 | 0.00009 | 4.5% | 9.8% | 0.0782 | 0.0162 | 0.0474 |
| | UHSI_6 | 3 | 1 | 0.00813 | 0.00009 | 4.3% | 9.3% | 0.0814 | 0.0190 | 0.0517 |
| | UHSI_12 | 3 | 0 | 0.00850 | 0.00010 | -- | -- | -- | -- | -- |
| h=3 | None | 3 | 0 | 0.02292 | 0.00082 | -- | -- | -- | -- | -- |
| | UHSI_1 | 3 | 3 | 0.02061 | 0.00063 | 10.1% | 23.1% | 0.4451 | 0.2999 | 0.5129 |
| | UHSI_3 | 3 | 3 | 0.02088 | 0.00064 | 8.9% | 21.4% | 0.4952 | 0.3325 | 0.5867 |
| | UHSI_6 | 3 | 1 | 0.02131 | 0.00070 | 7.0% | 14.1% | 0.2896 | 0.1956 | 0.2725 |
| | UHSI_12 | 3 | 1 | 0.02089 | 0.00067 | 8.9% | 17.5% | 0.2977 | 0.1969 | 0.2878 |
| h=6 | None | 3 | 0 | 0.03862 | 0.00240 | -- | -- | -- | -- | -- |
| | UHSI_1 | 3 | 1 | 0.03065 | 0.00158 | 20.7% | 34.1% | 0.3136 | 0.3540 | 0.4495 |
| | UHSI_3 | 3 | 1 | 0.03040 | 0.00155 | 21.3% | 35.5% | 0.3156 | 0.3454 | 0.4458 |
| | UHSI_6 | 3 | 1 | 0.03388 | 0.00180 | 12.3% | 25.1% | 0.4149 | 0.3334 | 0.4259 |
| | UHSI_12 | 3 | 1 | 0.03114 | 0.00160 | 19.4% | 33.2% | 0.3231 | 0.3353 | 0.4257 |
| h=12 | None | 3 | 0 | 0.05580 | 0.00488 | -- | -- | -- | -- | -- |
| | UHSI_1 | 3 | 1 | 0.03981 | 0.00206 | 28.7% | 57.8% | 0.5903 | 0.5231 | 0.6550 |
| | UHSI_3 | 3 | 1 | 0.03982 | 0.00207 | 28.6% | 57.6% | 0.5879 | 0.5193 | 0.6546 |
| | UHSI_6 | 3 | 1 | 0.04693 | 0.00311 | 15.9% | 36.2% | 0.6471 | 0.5287 | 0.6594 |
| | UHSI_12 | 3 | 1 | 0.04121 | 0.00223 | 26.1% | 54.2% | 0.5914 | 0.5158 | 0.6426 |
Note: This table reports the optimal number of lags (m, n), the MAFE, and the MSFE for predicting the logarithm of the HPI, adjusted for seasonality and time trend, in Canada using autoregressive models. The models evaluated include a base model with no external predictors and several models that incorporate different UHSIs derived from relevant-keyword search data as external predictors. Additionally, the table shows the percentage improvement in forecast accuracy (MAFE and MSFE) of the models with UHSIs relative to the base model. The p-values associated with the null hypotheses test whether the inclusion of each UHSI significantly improves the model's forecasting performance compared to the base model.
We employ the Bayesian Information Criterion (BIC) to determine the best lag structure. However, the selection process indicated that no lag of UHSI_12 was optimal at the h=1 horizon (n=0 in Table 1), so that specification collapses to the base model and no comparison is reported for it.
As demonstrated in the tables, HSIs provide greater predictive improvements over extended horizons. However, while the MAFE and MSFE improvements increase with longer horizons, the p-values of the null hypotheses of equal MAFE or MSFE increase too. As a result, we can only reject some null hypotheses at the 0.05 and 0.1 significance levels at the shortest horizon.
The data we used so far to generate HSIs were search frequencies for a set of relevant keywords typically used by individuals seeking homes. The accuracy of the predictions was notably high, suggesting a strong correlation between search trends and real estate market movements. To validate the reliability of these results, we conduct a comparative analysis using predictions based on irrelevant keywords. This approach assesses whether the predictive accuracy observed with the relevant keywords genuinely reflects market dynamics or is merely coincidental.
For the forecast horizon
G2_irr = GoogleTrend(irrelevant_df, '2019-12-31', verbose=1, seed=1, pca_lag=True)
G2_irr.prepare_gtrends(detrend=True, deseasonal=True, method='reg', regression='ct', log=False, smooth=12)
G2_irr.prepare_hpi(detrend=True, deseasonal=True, method='reg', regression='ct', log=True)
G2_irr.compute_hsi(layer=1, input_pool=True, auto_layer=False, n_input=10, n_hsi=20, max_lag=3, random='rnd', var_select='CV', selection_h_list=[0,1,3,6,12])
G2_irr.compute_hsi(layer=2, input_pool=True, auto_layer=True, n_input=10, n_hsi=20, max_lag=1, random='rnd', var_select='CV', selection_h_list=[0,1,3,6,12])
G2_irr.lag_setting(lag_select='IC')
G2_irr.forecast(h_list=[1,3,6,12], seasonal=True, hsi_CV_select=True, fit_intercept=True, cv=5, n_iter=20, original_scale=False)
G2_irr.results(results_table=True, head=True)

Table 2: Forecast by UHSIs using irrelevant keywords
| Forecast Horizon | Predictor | m | n | MAFE | MSFE | MAFE Improvement | MSFE Improvement | p-value (H0,1) | p-value (H0,2) | p-value (H0,3) |
|---|---|---|---|---|---|---|---|---|---|---|
| h=3 | None | 3 | 0 | 0.02292 | 0.00082 | -- | -- | -- | -- | -- |
| | UHSI_1 | 3 | 1 | 0.02270 | 0.00085 | 1.0% | -4.7% | 0.8983 | 0.7457 | 0.5785 |
| | UHSI_3 | 3 | 1 | 0.02274 | 0.00086 | 0.8% | -4.8% | 0.9245 | 0.7270 | 0.4783 |
| | UHSI_6 | 3 | 1 | 0.02296 | 0.00086 | -0.2% | -5.4% | 0.9858 | 0.7055 | 0.4058 |
| | UHSI_12 | 3 | 1 | 0.02440 | 0.00093 | -6.4% | -13.9% | 0.6053 | 0.4394 | 0.2152 |
| h=6 | None | 3 | 0 | 0.03862 | 0.00240 | -- | -- | -- | -- | -- |
| | UHSI_1 | 3 | 1 | 0.03815 | 0.00252 | 1.2% | -5.0% | 0.9467 | 0.8840 | 0.6918 |
| | UHSI_3 | 3 | 1 | 0.03593 | 0.00241 | 7.0% | -0.4% | 0.7361 | 0.9891 | 0.6985 |
| | UHSI_6 | 3 | 1 | 0.03557 | 0.00239 | 7.9% | 0.2% | 0.7327 | 0.9950 | 0.6218 |
| | UHSI_12 | 3 | 1 | 0.04056 | 0.00278 | -5.0% | -15.9% | 0.8492 | 0.6896 | 0.3284 |
| h=12 | None | 3 | 0 | 0.0558 | 0.0049 | -- | -- | -- | -- | -- |
| | UHSI_1 | 3 | 2 | 0.0614 | 0.0056 | -10.0% | -14.4% | 0.7902 | 0.8131 | 0.6418 |
| | UHSI_3 | 3 | 2 | 0.0614 | 0.0054 | -10.1% | -10.0% | 0.8001 | 0.8663 | 0.6474 |
| | UHSI_6 | 3 | 2 | 0.0611 | 0.0053 | -9.6% | -7.6% | 0.8347 | 0.9067 | 0.6354 |
| | UHSI_12 | 3 | 1 | 0.0633 | 0.0051 | -13.5% | -4.7% | 0.8043 | 0.9492 | 0.3717 |
Note: This table reports the optimal number of lags (m, n), the MAFE, and the MSFE for predicting the logarithm of the HPI, adjusted for seasonality and time trend, in Canada using autoregressive models. The models evaluated include a base model with no external predictors and several models that incorporate different UHSIs derived from irrelevant-keyword search data as external predictors. Additionally, the table shows the percentage improvement in forecast accuracy (MAFE and MSFE) of the models with UHSIs relative to the base model. The p-values associated with the null hypotheses test whether the inclusion of each UHSI significantly improves the model's forecasting performance compared to the base model.
The comparison between the forecasts generated from relevant and irrelevant keywords revealed an intrinsic relationship between search behavior and market dynamics, rather than mere coincidence. This suggests that Google Trends data, when carefully selected to reflect a pertinent search topic, provide a meaningful and reliable indicator of real estate market movements. Consequently, the strong predictive performance observed in this study highlights the potential of leveraging online search behavior as a valuable tool in forecasting real estate prices, offering insights that are both practical and theoretically sound.
The previous analysis comparing relevant and irrelevant keywords demonstrated that the relevance of search terms is crucial for achieving accurate real estate price forecasts. The results highlighted a significant difference in predictive power, with relevant keywords leading to more accurate predictions, while irrelevant keywords showed no forecasting capability.
In this section, we extend the analysis by combining both relevant and irrelevant keywords to evaluate whether the House Search Index (HSI) generation algorithm can effectively discern the predictive value of the keywords. Specifically, we aim to determine if the algorithm can identify and prioritize the most predictive keywords from the mixed dataset, thereby enhancing the overall forecasting accuracy. This assessment will provide insights into the robustness and sensitivity of the HSI algorithm in selecting impactful indicators for real estate market predictions.
G3_full = GoogleTrend(relevant_and_irrelevant_df, '2019-12-31', verbose=1, seed=1, pca_lag=True)
G3_full.prepare_gtrends(detrend=True, deseasonal=True, method='reg', regression='ct', log=False, smooth=12)
G3_full.prepare_hpi(detrend=True, deseasonal=True, method='reg', regression='ct', log=True)
G3_full.compute_hsi(layer=1, input_pool=True, auto_layer=False, n_input=10, n_hsi=20, max_lag=3, random='rnd', var_select='CV', selection_h_list=[0,1,3,6,12])
G3_full.compute_hsi(layer=2, input_pool=True, auto_layer=True, n_input=10, n_hsi=20, max_lag=1, random='rnd', var_select='CV', selection_h_list=[0,1,3,6,12])
G3_full.lag_setting(lag_select='IC')
G3_full.forecast(h_list=[1,3,6,12], seasonal=True, hsi_CV_select=True, fit_intercept=True, cv=5, n_iter=20, original_scale=False)
G3_full.results(results_table=True, head=True)

Table 3: Forecast by UHSIs using relevant and irrelevant keywords
| Forecast Horizon | Predictor | m | n | MAFE | MSFE | MAFE Improvement | MSFE Improvement | p-value (H0,1) | p-value (H0,2) | p-value (H0,3) |
|---|---|---|---|---|---|---|---|---|---|---|
| h=1 | None | 3 | 0 | 0.00850 | 0.00010 | -- | -- | -- | -- | -- |
| | UHSI_1 | 3 | 1 | 0.00813 | 0.00009 | 4.4% | 11.0% | 0.1470 | 0.0199 | 0.0804 |
| | UHSI_3 | 3 | 1 | 0.00809 | 0.00009 | 4.8% | 10.6% | 0.0786 | 0.0154 | 0.0498 |
| | UHSI_6 | 3 | 1 | 0.00819 | 0.00009 | 3.6% | 7.5% | 0.1065 | 0.0429 | 0.1002 |
| | UHSI_12 | 3 | 1 | 0.00810 | 0.00009 | 4.6% | 9.6% | 0.1449 | 0.0555 | 0.1933 |
| h=3 | None | 3 | 0 | 0.02292 | 0.00082 | -- | -- | -- | -- | -- |
| | UHSI_1 | 3 | 1 | 0.02105 | 0.00068 | 8.2% | 16.7% | 0.3007 | 0.1879 | 0.2657 |
| | UHSI_3 | 3 | 1 | 0.02083 | 0.00067 | 9.1% | 17.8% | 0.2694 | 0.1746 | 0.2528 |
| | UHSI_6 | 3 | 1 | 0.02133 | 0.00072 | 6.9% | 12.2% | 0.2661 | 0.2080 | 0.2859 |
| | UHSI_12 | 3 | 1 | 0.02101 | 0.00069 | 8.3% | 15.9% | 0.3217 | 0.2103 | 0.3217 |
| h=6 | None | 3 | 0 | 0.03862 | 0.00240 | -- | -- | -- | -- | -- |
| | UHSI_1 | 3 | 1 | 0.03211 | 0.00166 | 16.9% | 30.7% | 0.3296 | 0.3561 | 0.4199 |
| | UHSI_3 | 3 | 1 | 0.03157 | 0.00162 | 18.3% | 32.4% | 0.3101 | 0.3291 | 0.3962 |
| | UHSI_6 | 3 | 1 | 0.03360 | 0.00184 | 13.0% | 23.3% | 0.3693 | 0.2959 | 0.3876 |
| | UHSI_12 | 3 | 1 | 0.03113 | 0.00163 | 19.4% | 32.1% | 0.3279 | 0.3179 | 0.4312 |
| h=12 | None | 3 | 0 | 0.05580 | 0.00488 | -- | -- | -- | -- | -- |
| | UHSI_1 | 3 | 1 | 0.04246 | 0.00246 | 23.9% | 49.6% | 0.5799 | 0.5224 | 0.6024 |
| | UHSI_3 | 3 | 1 | 0.04185 | 0.00243 | 25.0% | 50.1% | 0.5613 | 0.5157 | 0.5929 |
| | UHSI_6 | 3 | 1 | 0.04711 | 0.00317 | 15.6% | 35.0% | 0.6484 | 0.5407 | 0.6749 |
| | UHSI_12 | 3 | 1 | 0.04102 | 0.00218 | 26.5% | 55.4% | 0.6002 | 0.5196 | 0.6740 |
Note: This table reports the optimal number of lags (m, n), the MAFE, and the MSFE for predicting the logarithm of the HPI, adjusted for seasonality and time trend, in Canada using autoregressive models. The models evaluated include a base model with no external predictors and several models that incorporate different UHSIs derived from the combined relevant- and irrelevant-keyword search data as external predictors. Additionally, the table shows the percentage improvement in forecast accuracy (MAFE and MSFE) of the models with UHSIs relative to the base model. The p-values associated with the null hypotheses test whether the inclusion of each UHSI significantly improves the model's forecasting performance compared to the base model.
The analysis involving a combination of relevant and irrelevant keywords yielded results that were nearly as accurate as those obtained using only relevant keywords. This finding suggests that the HSI-generation algorithm is adept at identifying and prioritizing the most predictive keywords within the dataset. As evidenced by the tables, the HSIs still provide significant predictive improvements over extended horizons, even when irrelevant keywords are included in the input data.
Figures 2 and 3 illustrate the kernel density estimates (KDE) of the improvements in MAFE and MSFE achieved by the HSIs in the three keyword sets (G1, G2, and G3).

MAFE Improvement of HSIs in G1, G2, and G3
plot_improvement([G1_rel, G2_irr, G3_full], measure='MAFE')

Figure 2: KDE distribution of MAFE improvements for all forecast horizons
This Kernel Density Estimation (KDE) plot illustrates the distribution
of MAFE improvement percentages relative to the base model at all
forecast horizons for three groups of data: relevant keywords (blue),
irrelevant keywords (red), and a combination of relevant and irrelevant
keywords (orange). The figure shows the MAFE improvement distributions
for the HSIs in the three keyword groups.
MSFE Improvement of HSIs in G1, G2, and G3
plot_improvement([G1_rel, G2_irr, G3_full], measure='MSFE')

Figure 3: KDE distribution of MSFE improvements for all forecast horizons
This Kernel Density Estimation (KDE) plot illustrates the distribution
of MSFE improvement percentages relative to the base model at all
forecast horizons for three groups of data: relevant keywords (blue),
irrelevant keywords (red), and a combination of relevant and irrelevant
keywords (orange). The figure shows the MSFE improvement distributions
for the HSIs in the three keyword groups.










