Develop a profitable wind energy demand forecasting model in AI4IMPACT Deep Learning Datathon Financial Prediction
*Note: This article has also been posted on Towards Data Science and Youtube*
Led a 3-member team to build time series forecasting models for energy traders in a 4-week boot camp using neural networks in Smojo.
Engineered features using a DIFF window combined with momentum, force, and statistical attributes to improve the model's predictions.
Optimized the difference neural networks with input scaling subnet, dropout layer, and L2 regularization.
Achieved MAE/MSE of 0.5/0.3 using test loss while having no lag.
Deployed the model and performed CI/CD using Autocaffe Labs.
Maximized revenue over 1 week of trading with a net profit of 8,157,153 euro cents, placing among the 15 best-performing teams.
A.1. Background and Motivation
A steady supply of energy is vital because modern society depends on it. That is why predictable sources of energy, such as fossil fuels and nuclear power, are still favored. However, the risk of an energy shortfall always exists, which drives us to use finance to predict and avoid production shortages that could cause power outages. One can consider harnessing renewable energy to confront this challenge.
Some renewable energy sources depend on the environment, wind energy among them. As the name implies, the energy is produced by wind that varies in speed and direction. Unfortunately, because wind is the only driver of production, its variability is a clear obstacle to relying on this alternative energy.
In short, three parties take part in this energy market: grid operators, energy producers, and energy traders. Grid operators are responsible for providing society with a steady supply of electrical energy; the government fines them if there are power outages. The suppliers, here the wind energy producers, must manage the risk of an energy shortfall. Typically, energy traders help suppliers forecast energy production for sale to the grid beforehand. In other words, energy traders serve to maximize profits on behalf of their clients.
We will play the role of an energy trader. The goal is to produce a T+18 hour energy forecast every hour. Using our energy forecasting model and the given trading algorithm, we will maximize profits for our clients, the wind energy producers. We build the forecast model on time-series datasets using deep learning (neural networks), specifically the difference network architecture.
A.3. Trading Algorithm
You need to make a T+18h forecast of energy production from your client’s wind farms. This forecast is central to your trading.
Your client is paid 10 euro cents per kWh sold to the grid. You can only sell to the grid what you forecast for that date.
If the actual energy production exceeds the forecast, this excess is absorbed by the grid, but your client is not compensated for this excess.
If the actual energy production is below the forecast, you must buy energy from the spot market (at 20 euro cents/kWh) to supply the grid. You are given a cash reserve at the start of 10,000,000 euro cents to buy energy from the spot market.
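The trading rules above can be sketched in a few lines of Python. This is a minimal sketch under one reading of the rules: the client is paid for the forecast quantity, and any shortfall is bought at the spot price. The function names are hypothetical, and the rules as stated do not say what happens if the cash reserve runs out, so the sketch only tracks the balance.

```python
SELL_PRICE = 10          # euro cents per kWh paid by the grid
SPOT_PRICE = 20          # euro cents per kWh on the spot market
START_CASH = 10_000_000  # euro cents of initial cash reserve

def simulate_trading(forecasts, actuals, cash=START_CASH):
    """Return the cash balance after settling every hour.

    Assumption: we sell exactly the forecast quantity each hour and
    buy any shortfall (forecast - actual) on the spot market.
    Excess production earns nothing.
    """
    for forecast, actual in zip(forecasts, actuals):
        shortfall = max(forecast - actual, 0.0)
        cash += SELL_PRICE * forecast - SPOT_PRICE * shortfall
    return cash

def net_profit(forecasts, actuals):
    """Net profit in euro cents relative to the starting reserve."""
    return simulate_trading(forecasts, actuals) - START_CASH
```

Under this reading, over-forecasting is costly (each missing kWh loses 10 euro cents net), while under-forecasting only forgoes revenue.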
Examine the data provided along with the statistics.
Normalize the data and set the baseline risk (persistence based).
Fit the model so that it performs well on both the training and test sets under the risk function. The test loss should beat the baseline risk.
Improve the model performance (keep minimizing the risk while reducing the lag/maintaining zero lag).
Check the best model for reproducibility.
B. Data Examination
We will work with two different datasets as follows:
B.1.1. Wind Energy Production
Source: Réseau de Transport d’Électricité (RTE), the French energy transmission authority
This dataset, named energy-ile-de-france, contains the consolidated near-realtime wind energy production (in kWh) for the Île-de-France region surrounding Paris, averaged and standardized to a time base of 1 hour. The data is provided from 01 January 2017 to the present.
The data is irregular, but we can still see some trends. For example, energy spikes are most common in winter and during the transitions between seasons. So far, the highest production occurred in winter 2019–2020, at up to 89000 kWh. The basic statistics of the data are presented below.
Mean = 17560.44 kWh
Median = 10500.0 kWh
Max = 89000.0 kWh
Min = 0.0 kWh
Range = 89000.0 kWh
Standard Deviation = 19146.63 kWh
B.1.2. Wind Forecasts
Source: Terra Weather
The data comes from 2 different wind forecast models (A and B) for 8 wind farm locations in the Île-de-France region. Hence, there are 16 forecasts, each with 2 variables: wind speed (m/s) and wind direction as a bearing (degrees from North, i.e. 45 degrees means the wind blows from the northeast). The forecasts are updated every 6 hours and have been interpolated to a time base of 1 hour.
The wind speed graph has a similar trend to the energy one, indicating that this forecast data can be useful to our model as input features. The strongest wind occurs in winter, at up to 12 m/s. The basic statistics of wind data are presented below.
Compared to wind speed forecasts, the wind direction pattern is tough to decipher. But we will see that even such data can still bring benefit to our forecast model.
In the end, our raw data is arranged as follows:
B.2. Normalize the Data and Set the Baseline
To speed up the training process, we normalize our data to have zero mean and unit variance: z = (x - mean) / SD, where the mean and standard deviation (SD) are computed per feature.
Now we have each feature on the same scale. Note that we only normalize energy and wind speed. The wind direction values will have special treatment later.
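As an illustration, zero-mean, unit-variance scaling of one feature could look like the sketch below. The function name is hypothetical, and the use of the population standard deviation (rather than the sample one) is an assumption, since the article does not say which was used.

```python
import statistics

def zscore(values):
    """Scale a feature to zero mean and unit variance.

    Assumption: population standard deviation (pstdev) is used.
    Returns the scaled values plus the mean and SD, so the same
    statistics can be reused to scale new data.
    """
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values], mean, sd
```

Keeping the returned mean and SD makes it possible to convert normalized predictions back to kWh later.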
Next, we will obtain the baseline based on persistence risk. We extract the baseline risk using mean squared error (MSE) and mean absolute error (MAE).
Persistence risk (MSE): 0.4448637
Persistence risk (MAE): 0.6486683
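A persistence baseline simply predicts that the value 18 hours ahead equals the current value. The sketch below shows one way to compute its risk, assuming `series` is the normalized hourly energy series; the function name is hypothetical and the exact evaluation split used for the numbers above is not reproduced here.

```python
LEAD = 18  # forecast lead time in hours

def persistence_risk(series, lead=LEAD):
    """Return (MSE, MAE) of the persistence forecast y_hat(t + lead) = y(t)."""
    predictions = series[:-lead]   # value known at time t
    targets = series[lead:]        # value actually observed at t + lead
    errors = [p - t for p, t in zip(predictions, targets)]
    mse = sum(e * e for e in errors) / len(errors)
    mae = sum(abs(e) for e in errors) / len(errors)
    return mse, mae
```

Any model worth deploying should reach a lower test loss than these two numbers.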
Using the difference network architecture, we fit the model on the training set and evaluate it on the test set under both risk functions (MSE and MAE). The difference network helps us learn well enough to beat this baseline. As a reminder, the objective is to get energy forecasts with a lead time of 18 hours.
The Difference Neural Network Architecture — Designed by the author
The following are a few hyperparameters we can tune:
Windowing input features (Naive, DIFF, momentum and force inputs)
Statistic input features (Mean, SD, MAX, MIN, etc)
Optimizer (Adam, SGD)
Activation functions (Relu, Tanh)
# Hidden layers (2 to 5)
Regularization (Dropout, L2)
NN-Size (8 to 256 neurons with 2/3 reduction for the next layer)
Subnetworks (Input scaling, Autoencoder)
Type of perceptrons (normal, squared perceptron)
Losses (MAE, MSE, Momentum loss, force loss)
Indeed, the list is quite overwhelming :). But bear with us, as you will see what each item contributes to the model we build. Since we ran many experiments, we only show the settings that improved our results.
*Note: Each experiment uses a maximum of 10000 iterations with early stopping.*
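To make the NN-size setting in the hyperparameter list concrete, here is a small sketch of how hidden layer widths could be derived. It assumes that "2/3 reduction" means each layer has two thirds of the previous layer's neurons, rounded down; both the reading and the rounding are assumptions, and the function name is hypothetical.

```python
def layer_sizes(first_size, n_layers):
    """Hidden layer widths, each 2/3 of the previous one (rounded down).

    Assumption: this is one interpretation of the "2/3 reduction" rule.
    """
    sizes = [first_size]
    for _ in range(n_layers - 1):
        sizes.append(max(1, sizes[-1] * 2 // 3))
    return sizes

# Example: a 4-layer network starting at 128 neurons.
print(layer_sizes(128, 4))
```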
C.1. Experiment 1 Input Scaling Subnet + 4 Hidden Layers (with Dropout)
In the first experiment, we aim for a low MAE (and MSE) that beats the baseline. We therefore want our network to be deep and big enough without overfitting, so we use Adam for better learning and add dropout layers as regularization. We use a 4-layer network and run a multi-configuration search as follows:
Input scaling sub-network
# layers: 4
C.1.1. Feature selection
Windowing is a basic operation for time-series data. Thus, for the input features, we use a window consisting of 60 hours of past energy produced (T-60). Then we turn the window into DIFF-momentum-force inputs with a lead time of 18 hours. It will result in 72 features. This adjustment helps the model detect movement and its rate to perform better clustering.
We also add the average of the past 60 hours of forecast wind speed and the wind speed forecast at T+18h from each wind model. This generates 4 more features. Thus, we have 76 input features in total, ready to feed to the input scaling subnet. Since we use a relatively large number of inputs, this subnet reduces unwanted features before supplying them to the main network.
In summary, here is the list of our input features:
DIFF+momentum+force inputs of T-60h past energy produced with a lead time of 18h
the mean of T-60h of past wind speed forecasted (model A)
the wind speed forecast at T+18h (model A)
the same applies to model B
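The sketch below illustrates the kind of quantities behind the DIFF, momentum, and force inputs. The definitions are assumptions, not the exact Smojo windowing: DIFF is taken as the difference between the current value and a lagged value, momentum as the first difference, and force as the second difference. All function names are hypothetical, and the sketch does not try to reproduce the 72-feature layout.

```python
def diff_window(series, t, lags):
    """Assumed DIFF inputs: x[t] - x[t - k] for each lag k."""
    return [series[t] - series[t - k] for k in lags]

def momentum(series, t):
    """Assumed momentum: first difference, i.e. the rate of change."""
    return series[t] - series[t - 1]

def force(series, t):
    """Assumed force: change in momentum, i.e. the second difference."""
    return momentum(series, t) - momentum(series, t - 1)

def difference_target(series, t, lead=18):
    """Target of a difference network: the change over the lead time."""
    return series[t + lead] - series[t]
```

Working with differences instead of raw levels lets the network focus on movement and its rate, which is the motivation given above.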
C.1.2. Best Config Loss
Best test loss / Persistence error
MSE: 0.280589 / 0.4448637
MAE: 0.554845 / 0.6486683
Best NN-size: 128
Best dropout-prob: 0.05
Notice that we have beaten the persistence and achieved zero lag.
There is still a large gap between training and test loss. We can consider adding more regularization and more features.
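The article does not show how the lag graph is computed. One common way to check for lag, sketched here as an assumption, is to correlate the actual series with the prediction shifted by k hours and see where the correlation peaks: a peak at k = 0 means zero lag. The function names are hypothetical.

```python
import math

def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def lag_profile(actual, predicted, max_lag):
    """Correlation between actual[t] and predicted[t + k] for k = 0..max_lag."""
    n = len(actual)
    return [pearson(actual[: n - k], predicted[k:]) for k in range(max_lag + 1)]

def estimated_lag(actual, predicted, max_lag=24):
    """Lag (in steps) at which the correlation profile peaks."""
    profile = lag_profile(actual, predicted, max_lag)
    return max(range(len(profile)), key=profile.__getitem__)
```

A model that merely echoes the last known value would show a peak near the lead time rather than at zero.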
C.2. Experiment 2 Input Scaling Subnet + 4 Hidden Layers (with Dropout + L2 Regularization)
Using the same model as before, we add L2 regularization. We again run a multi-configuration search, carrying over the best hyperparameters found so far.
Input scaling sub-network
Weight decay: 1.0E-4/1.0E-5/1.0E-6
# layers: 4
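For readers unfamiliar with the two regularizers, here is a minimal plain-Python sketch of what they do. It is not the Autocaffe implementation: the L2 penalty is written as weight_decay times the sum of squared weights (some frameworks include an extra factor of 1/2), and dropout is the common "inverted" variant. Function names are hypothetical.

```python
import random

def l2_penalty(weights, weight_decay):
    """L2 term added to the loss: weight_decay * sum(w^2).

    Assumption: no 1/2 factor; conventions differ between frameworks.
    """
    return weight_decay * sum(w * w for w in weights)

def dropout(activations, prob, rng):
    """Inverted dropout for training, with 0 <= prob < 1.

    Each activation is zeroed with probability `prob`; survivors are
    scaled by 1 / (1 - prob) so the expected value is unchanged.
    """
    keep = 1.0 - prob
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

# Example: the three weight decay values tried in this experiment.
weights = [0.5, -1.0, 2.0]
for wd in (1.0e-4, 1.0e-5, 1.0e-6):
    print(wd, l2_penalty(weights, wd))
```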
C.2.1. Feature selection
We add new input features from the wind direction forecast. Although adding direction data as an input may seem odd, a steady wind direction does help. We do not want to normalize direction naively, because a bearing is circular: 359 degrees and 1 degree are nearly the same direction. Instead, we use trigonometric functions to ‘normalize’ it. Together with the previous inputs, we now have 84 input features in total.
In summary, these are the additions to our input features:
(mean) sin function of T-18h of past wind direction forecasted (model A)
(mean) cos function of T-18h of past wind direction forecasted (model A)
wind direction forecast at T+18h in sin function (model A)
wind direction forecast at T+18h in cos function (model A)
The same applies to model B
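The direction features listed above can be sketched as follows. The function names are hypothetical; the sketch only shows the sin/cos encoding of a bearing and the mean of the encoded values over a window.

```python
import math

def encode_direction(bearing_deg):
    """Map a bearing in degrees to (sin, cos), both in [-1, 1]."""
    rad = math.radians(bearing_deg)
    return math.sin(rad), math.cos(rad)

def mean_direction_features(bearings_deg):
    """Mean sin and mean cos over a window of bearings."""
    pairs = [encode_direction(b) for b in bearings_deg]
    n = len(pairs)
    mean_sin = sum(s for s, _ in pairs) / n
    mean_cos = sum(c for _, c in pairs) / n
    return mean_sin, mean_cos
```

With this encoding, bearings of 350 and 10 degrees average to a direction close to North, whereas averaging the raw degrees would give 180 (South).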
Best test loss / Persistence error
MSE: 0.26769 / 0.4448637
MAE: 0.549824 / 0.6486683
Best NN-size: 128
Best dropout-prob: 0.1
Best Weight decay: 1.0E-4
Notice that we have produced a better test loss while maintaining zero lag (the peak value of the lag graph also increases).
We can still improve the performance by adding more input features or layers to the model.
C.3. Experiment 3 — Final Model
Input Scaling Subnet + 4 Hidden Layers (with Dropout + L2 Regularization)
With the best hyperparameters fixed, our network configuration is as follows:
Input scaling sub-network
Weight decay: 1.0E-4
# layers: 4
C.3.1. Feature selection
We add statistical features, taken from the energy and wind speed data, as additional inputs. In the end, we have 88 input features in total.
the mean of T-60h of past energy produced
the standard deviation of T-60h of past energy produced
the standard deviation of T-60h of past wind speed forecasted (model A)
the standard deviation of T-60h of past wind speed forecasted (model B)
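A rolling mean and standard deviation over the last 60 hours could be computed as in this sketch. Whether the window includes the current hour, and whether the population or sample standard deviation was used, are assumptions; the function name is hypothetical.

```python
import statistics

def window_stats(series, t, window=60):
    """Mean and SD of the `window` values ending at index t (inclusive).

    Assumptions: the current hour is part of the window and the
    population standard deviation is used.
    """
    values = series[t - window + 1 : t + 1]
    return statistics.fmean(values), statistics.pstdev(values)
```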
Best test loss / Persistence error
MSE: 0.258521 / 0.4448637
MAE: 0.52758 / 0.6486683
Net profit in euro cents
MSE: 1.392861351E9
MAE: 1.447243201E9
We achieve the best test loss using this last model while having no lag. As a result, we have the highest profit of all models.
We have better scatter plots of actual vs training/test predictions. Although we fit well on the training set, obtaining a better scatter plot of actual vs test prediction is still a challenge.
C.4. Check for Reproducibility
The final model above was trained 40 times, each run taking a maximum of 10000 iterations. Note that we use MAE as the loss function because it gives a higher profit to the clients. The statistics of the test losses are shown below.
Mean = 0.540747
Median = 0.540757
Max = 0.550977
Min = 0.527580
Range = 0.023397
(Mean-Min)/Standard Deviation = 2.690480
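The summary above can be reproduced from a list of per-run test losses with a sketch like the following. The function name is hypothetical, the example losses are made up for illustration, and the choice of population standard deviation is an assumption.

```python
import statistics

def repeat_summary(losses):
    """Summary statistics over repeated training runs."""
    mean = statistics.fmean(losses)
    low, high = min(losses), max(losses)
    sd = statistics.pstdev(losses)  # assumption: population SD
    return {
        "mean": mean,
        "median": statistics.median(losses),
        "max": high,
        "min": low,
        "range": high - low,
        "mean_min_over_sd": (mean - low) / sd,
    }

# Made-up losses from three runs, for illustration only.
print(repeat_summary([0.53, 0.54, 0.55]))
```

The last ratio tells how many standard deviations the best run sits below the average run, which indicates how much of the best result may be down to a lucky initialization.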
C.5. Final Model Prediction
Adding more layers decreases the training error but increases the test loss and lowers the profit, even though we use regularization techniques. Hence, we stick to 4 layers in the final model.
The Autoencoder subnet helps reduce the dimension of our input features. However, when added to a network with no more than 100 features, it increases the test loss of our model.
The squared perceptron is supposed to provide faster and better learning than the ordinary one. However, during the experiment, it does not improve the performance in terms of lowering the error.
The momentum and force losses are supposed to help reduce lag. However, when we add the losses to the network, the lag graph does not change (still zero lag) and it makes the error higher since the network needs to minimize three losses altogether (test, momentum, and force losses).
The difference networks effectively build a forecasting model with time-series data, even with fewer inputs.
When it comes to historical data, the DIFF window combined with momentum, force, and statistical features can help the model predict better.
A bigger and deeper network helps the model memorize well (be careful of overfitting).
Dropout layer (small dropout probability) and L2 regularization help the network handle overfitting problems, hence improving performance.
Although RMSE (or MSE) is also popular as a loss function for time series data, our model produces a higher profit when MAE is used. MSE penalizes large errors quadratically, so it is more sensitive to outliers, while MAE grows linearly with the error. Since the model has no outliers, MAE turns out to work best for our model.
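A tiny numeric example shows the difference in how the two losses react to a single large error. The error values are made up for illustration.

```python
def mae(errors):
    """Mean absolute error."""
    return sum(abs(e) for e in errors) / len(errors)

def mse(errors):
    """Mean squared error."""
    return sum(e * e for e in errors) / len(errors)

small = [0.5, 0.5, 0.5, 0.5]
with_outlier = [0.5, 0.5, 0.5, 5.0]  # one error is 10 times larger

print(mae(small), mae(with_outlier))  # 0.5 1.625
print(mse(small), mse(with_outlier))  # 0.25 6.4375
```

The single large error raises MAE by a factor of 3.25 but MSE by a factor of 25.75, which is why MSE-trained models spend more effort on rare extreme errors.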
This article is part of the project documentation during the Deep Learning Datathon 2020 organized by ai4impact.
Members: Diardano Raihan, Mitchell Edbert, M. Taufiq Ismail Hatta