Section - 7 Predictive Modeling

Now that the data has been cleaned and we have a game plan for assessing the efficacy of the models, we finally have everything we need to start making predictive models.

7.1 Example Simple Model

We can start by making a simple linear regression model:

lm(formula = target_price_24h ~ ., data = cryptodata)
## 
## Call:
## lm(formula = target_price_24h ~ ., data = cryptodata)
## 
## Coefficients:
##      (Intercept)         symbolACH         symbolADA       symbolALPHA  
##   -82346.2387258        -1.8722454        -0.4545752        -1.2961357  
##       symbolARDR        symbolARPA         symbolAVA         symbolBAL  
##       -1.6499876        -1.8049337         0.2492045       -13.9415365  
##        symbolBAT        symbolBIST        symbolBIZZ         symbolBNT  
##       -1.3695813        -0.2931533        -1.5772084         0.7895229  
##        symbolBOR       symbolBOSON         symbolBRD         symbolBSV  
##      124.7917462         0.5015221        -1.7395418       102.6223398  
##        symbolBTC         symbolBTG         symbolCHZ         symbolCKB  
##    32837.5572187        41.6993852        -1.6437263        -1.8503339  
##        symbolCLT         symbolCND        symbolCOMP      symbolCROOLD  
##        8.2174494        18.1705420       247.4347341        -2.8430034  
##       symbolCRPT        symbolCTSI      symbolCVCOIN       symbolDAISY  
##       -3.6201377        -1.9437026        -0.8196805       -17.0570034  
##       symbolDASH         symbolDCR         symbolDGB        symbolDODO  
##      130.5960229        90.7630794        -1.7534905        -0.2579914  
##       symbolDOGE        symbolEGLD         symbolELF         symbolENJ  
##       -1.9134919       119.9440803        -1.3595630        -0.7435469  
##        symbolEOS         symbolETC         symbolETH         symbolETN  
##        1.2856992        27.5472849      2245.9835146        -7.5577418  
##        symbolETP         symbolEVX         symbolFIL          symbolHT  
##      -12.5643755         6.6350979        46.3831045         5.6697450  
##        symbolICP        symbolIDEX         symbolIHF         symbolINJ  
##       34.1154409       -71.5779811        -1.6132507         5.6635217  
##         symbolIQ         symbolIQN         symbolJST        symbolJULD  
##      -16.7544702        -5.4397214        -1.2716536        -1.8519775  
##        symbolJUV         symbolKLV         symbolKMD         symbolKNC  
##       12.5766678        -2.5852583        -1.2170517        -0.6859524  
##        symbolLEO         symbolLOC         symbolLPT         symbolLSK  
##       -1.6782266         2.4233069        31.9208626         2.6779699  
##        symbolLTC        symbolMAID        symbolMANA         symbolMDX  
##      114.7531519         4.5142240        -1.3221785        -6.5352914  
##       symbolMITH         symbolNAV         symbolNCT        symbolNEAR  
##       -1.8488470        -0.8483115        -1.8871638         2.8510915  
##        symbolNEO        symbolNEXO         symbolOAX       symbolOCEAN  
##       29.7963369        -0.6957688        -2.3468462         3.8527490  
##        symbolOGN         symbolONE         symbolONG         symbolORN  
##       -2.5187405        -1.7626006         8.6058188         3.5640266  
##       symbolPERP        symbolPOND        symbolRAMP        symbolREEF  
##        8.6184456        -3.8348818        -5.1787268        -6.5313621  
##        symbolRSR        symbolSBTC       symbolSENSO        symbolSOLO  
##       -1.8579432         5.6517016        -1.4634139        -3.2779268  
##        symbolSRN       symbolSTORJ         symbolSUN         symbolSWM  
##       -0.5005575       -12.7611379         7.1690612         3.6151526  
##        symbolSXP         symbolTLM         symbolTON         symbolTRX  
##       -0.2609520        -1.7197057        -2.1735419        -1.7963714  
##         symbolTV         symbolUNI         symbolVIB         symbolVLX  
##      -10.7782116        14.8062468        -2.5429862        -2.8622637  
##       symbolWAXP        symbolWBTC        symbolWIKI         symbolWRX  
##       -1.6768283     35975.1000243        -0.4101456        -1.0287934  
##        symbolXCH         symbolXEM         symbolXMR         symbolXUC  
##      128.5366875        -1.7416388       175.1904550        18.2910110  
##        symbolXVG         symbolXYM        symbolYUCT         symbolZAP  
##       -7.3594016        -1.7415073       -62.0358301        -6.4773950  
##        symbolZEC         symbolZIL         symbolZKS         symbolZRX  
##       87.7682748         0.1645970         9.6319295        -1.4078626  
##    date_time_utc              date         price_usd   lagged_price_1h  
##        0.0001074        -4.9014855         0.0499063         0.0208466  
##  lagged_price_2h   lagged_price_3h   lagged_price_6h  lagged_price_12h  
##        0.0487624         0.0424067         0.0482410         0.0296749  
## lagged_price_24h   lagged_price_3d      trainingtest     trainingtrain  
##        0.0660169         0.0410486      -108.9247698       -86.8558171  
##            split  
##      -62.7995106

We defined the formula for the model as target_price_24h ~ ., which means we want to predict the target_price_24h field using (~) every other column found in the data (.). In other words, target_price_24h is the dependent variable and all other columns are the independent variables. The target_price_24h column is the only one that refers to the future; every other column holds information that was available at the time the data was collected. The model uses that information to infer statistical relationships that can help us forecast target_price_24h on new data, where its value is still unknown.
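As a small illustration of the . shorthand, the sketch below fits the same kind of formula on a made-up data frame and checks that it matches a formula with the predictors written out. The toy object and its column names are hypothetical and are not part of the tutorial data.

```r
# Hypothetical toy data: one target column and two predictor columns
toy <- data.frame(
  target = c(1.0, 2.1, 2.9, 4.2, 5.1, 5.8),
  x1     = c(1, 2, 3, 4, 5, 6),
  x2     = c(0.5, 0.1, 0.9, 0.3, 0.7, 0.2)
)

# "." stands for every column in `data` other than the target
fit_dot      <- lm(target ~ ., data = toy)
fit_explicit <- lm(target ~ x1 + x2, data = toy)

# Both formulas describe the same model, so the coefficients agree
names(coef(fit_dot))
all.equal(coef(fit_dot), coef(fit_explicit))
```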

In the example above we used the cryptodata object which contained all the non-nested data, and was a big oversimplification of the process we will actually use.

7.1.1 Using Functional Programming

From this point forward, we will work with the nested dataset cryptodata_nested. Review the previous section where it was created if you missed it. Here is a preview of the data again:

cryptodata_nested
## # A tibble: 550 x 5
## # Groups:   symbol, split [550]
##    symbol split train_data          test_data          holdout_data      
##    <chr>  <dbl> <list>              <list>             <list>            
##  1 BTC        1 <tibble [218 x 11]> <tibble [73 x 11]> <tibble [76 x 11]>
##  2 ETH        1 <tibble [218 x 11]> <tibble [73 x 11]> <tibble [76 x 11]>
##  3 EOS        1 <tibble [218 x 11]> <tibble [73 x 11]> <tibble [76 x 11]>
##  4 LTC        1 <tibble [218 x 11]> <tibble [73 x 11]> <tibble [76 x 11]>
##  5 BSV        1 <tibble [218 x 11]> <tibble [73 x 11]> <tibble [76 x 11]>
##  6 ADA        1 <tibble [218 x 11]> <tibble [73 x 11]> <tibble [74 x 11]>
##  7 TRX        1 <tibble [218 x 11]> <tibble [73 x 11]> <tibble [76 x 11]>
##  8 ZEC        1 <tibble [218 x 11]> <tibble [73 x 11]> <tibble [76 x 11]>
##  9 HT         1 <tibble [218 x 11]> <tibble [73 x 11]> <tibble [75 x 11]>
## 10 XMR        1 <tibble [218 x 11]> <tibble [73 x 11]> <tibble [75 x 11]>
## # ... with 540 more rows

Because we are now dealing with a nested dataframe, performing operations on the individual nested datasets is not as straightforward. We could extract individual elements using indexing. For example, we can return the first element of the train_data column by running this code:

cryptodata_nested$train_data[[1]]
## # A tibble: 218 x 11
##    date_time_utc       date       price_usd target_price_24h lagged_price_1h
##    <dttm>              <date>         <dbl>            <dbl>           <dbl>
##  1 2021-08-13 00:00:01 2021-08-13    44416.           47812.          43962.
##  2 2021-08-13 01:00:00 2021-08-13    44488.           47642.          44416.
##  3 2021-08-13 02:00:01 2021-08-13    44860.           47551.          44488.
##  4 2021-08-13 03:00:00 2021-08-13    45269.           47542.          44860.
##  5 2021-08-13 04:00:01 2021-08-13    45164.           47538.          45269.
##  6 2021-08-13 05:00:00 2021-08-13    45257.           47445.          45164.
##  7 2021-08-13 06:00:00 2021-08-13    45230.           47516.          45257.
##  8 2021-08-13 07:00:01 2021-08-13    45825.           47527.          45230.
##  9 2021-08-14 00:00:01 2021-08-14    47812.           47062.          47544.
## 10 2021-08-14 01:00:00 2021-08-14    47642.           47183.          47812.
## # ... with 208 more rows, and 6 more variables: lagged_price_2h <dbl>,
## #   lagged_price_3h <dbl>, lagged_price_6h <dbl>, lagged_price_12h <dbl>,
## #   lagged_price_24h <dbl>, lagged_price_3d <dbl>

Remove STORJ to work around a data problem that arose on March 3rd, 2021:

cryptodata_nested <- filter(cryptodata_nested, symbol != "STORJ")

As we already saw, dataframes are a really flexible data structure. We can create a new column to store the model associated with each row of the data. There are several ways to go about this (this tutorial itself was written to execute the same commands using three fundamentally different methodologies), but here we will take a functional programming approach. This means we focus on the actions we want to perform, expressed as functions, in contrast to a for loop, which emphasizes the objects being iterated over through explicit indexing similar to the example above that returned the first element of the train_data column.
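To make the contrast concrete, here is a minimal sketch that fits one model per dataset, first with a for loop and then functionally. The datasets list and the fit_line function are hypothetical stand-ins, and base R's lapply() is used here because it plays the same role as purrr's map() for this purpose.

```r
# Hypothetical list of small datasets standing in for the nested train_data column
datasets <- list(
  a = data.frame(x = 1:5, y = c(2, 4, 6, 8, 10)),
  b = data.frame(x = 1:5, y = c(1, 3, 5, 7, 9))
)

# A function describing the action we want to take on each dataset
fit_line <- function(df) lm(y ~ x, data = df)

# for loop: we manage the index and the output container ourselves
models_loop <- vector("list", length(datasets))
for (i in seq_along(datasets)) {
  models_loop[[i]] <- fit_line(datasets[[i]])
}

# Functional approach: state the action once and apply it to every element
models_map <- lapply(datasets, fit_line)

# Both approaches produce the same slopes
slopes_loop <- sapply(models_loop, function(m) coef(m)[["x"]])
slopes_map  <- sapply(models_map,  function(m) coef(m)[["x"]])
```

The functional version has no index or pre-allocated container to keep track of, which is what makes it convenient to combine with mutate() on a nested dataframe.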

When using a functional programming approach, we first need to create functions for the operations we want to perform. Let’s wrap the lm() function we used as an example earlier and create a new custom function called linear_model, which takes a dataframe as an input (the train_data we will provide for each row of the nested dataset), and generates a linear regression model:

linear_model <- function(df){
  lm(target_price_24h ~ . -date_time_utc -date, data = df)
}

We can now use the map() function from the purrr package in conjunction with the mutate() function from dplyr to create a new column in the data which contains an individual linear regression model for each row of train_data:

mutate(cryptodata_nested, lm_model = map(train_data, linear_model))
## # A tibble: 550 x 6
## # Groups:   symbol, split [550]
##    symbol split train_data          test_data         holdout_data      lm_model
##    <chr>  <dbl> <list>              <list>            <list>            <list>  
##  1 BTC        1 <tibble [218 x 11]> <tibble [73 x 11~ <tibble [76 x 11~ <lm>    
##  2 ETH        1 <tibble [218 x 11]> <tibble [73 x 11~ <tibble [76 x 11~ <lm>    
##  3 EOS        1 <tibble [218 x 11]> <tibble [73 x 11~ <tibble [76 x 11~ <lm>    
##  4 LTC        1 <tibble [218 x 11]> <tibble [73 x 11~ <tibble [76 x 11~ <lm>    
##  5 BSV        1 <tibble [218 x 11]> <tibble [73 x 11~ <tibble [76 x 11~ <lm>    
##  6 ADA        1 <tibble [218 x 11]> <tibble [73 x 11~ <tibble [74 x 11~ <lm>    
##  7 TRX        1 <tibble [218 x 11]> <tibble [73 x 11~ <tibble [76 x 11~ <lm>    
##  8 ZEC        1 <tibble [218 x 11]> <tibble [73 x 11~ <tibble [76 x 11~ <lm>    
##  9 HT         1 <tibble [218 x 11]> <tibble [73 x 11~ <tibble [75 x 11~ <lm>    
## 10 XMR        1 <tibble [218 x 11]> <tibble [73 x 11~ <tibble [75 x 11~ <lm>    
## # ... with 540 more rows
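The same pattern can be tried end to end on a tiny example. The sketch below builds a hypothetical two-row nested tibble (the symbols, columns, and the toy_model function are made up for illustration), adds a model column with mutate() and map(), and then pulls one coefficient out of each stored model.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Hypothetical stand-in for the nested data: two symbols with a few rows each
toy_nested <- tibble(
  symbol = rep(c("AAA", "BBB"), each = 5),
  price  = c(1, 2, 3, 4, 5, 10, 12, 14, 16, 18),
  target = c(2, 3, 4, 5, 6, 20, 24, 28, 32, 36)
) %>%
  nest(train_data = c(price, target))

# One small model per nested dataset
toy_model <- function(df) lm(target ~ price, data = df)

toy_nested <- mutate(toy_nested, lm_model = map(train_data, toy_model))

# Each row now carries its own fitted model; extract one slope per symbol
slopes <- map_dbl(toy_nested$lm_model, ~ coef(.x)[["price"]])
```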

Awesome! Now we can use the same tools we learned in the high-level version to make a wider variety of predictive models to test.

7.2 Caret

Refer back to the high-level version of the tutorial for an explanation of the caret package, or consult this document: https://topepo.github.io/caret/index.html

7.2.1 Parallel Processing

R is single-threaded by default, meaning it only uses one CPU core at a time when performing operations. The step below is optional and uses the parallel and doParallel packages to allow R to use more than one core when creating the predictive models, which can speed up the process considerably:

library(doParallel)  # also attaches the parallel package
cl <- makePSOCKcluster(detectCores()-1)
registerDoParallel(cl)
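For readers who want to see the cluster lifecycle in isolation, here is a minimal sketch that uses only the parallel package on a toy task. The worker cap of 2 and the squaring task are assumptions made to keep the demo small; they are not part of the tutorial's setup.

```r
library(parallel)

# Leave one core free, ask for at least one worker, and cap at 2 for this demo.
# detectCores() can return NA, which na.rm = TRUE guards against.
n_workers <- min(2, max(1, detectCores() - 1, na.rm = TRUE))

demo_cl <- makePSOCKcluster(n_workers)

# Run a toy task on the workers; stopCluster() runs even if the task fails
squares <- tryCatch(
  parLapply(demo_cl, 1:4, function(i) i^2),
  finally = stopCluster(demo_cl)
)

unlist(squares)
```

Wrapping the work in tryCatch(..., finally = ...) is one way to make sure worker processes are not left running if something goes wrong partway through.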

7.2.2 More Functional Programming

Now we can repeat the process we used earlier to create a column with the linear regression models to create the exact same models, but this time using the caret package.

linear_model_caret <- function(df){
  
  train(target_price_24h ~ . -date_time_utc -date, data = df,
        method = 'lm',
        trControl=trainControl(method="none"))
  
}

We specified the method as lm for linear regression. See the high-level version for a refresher on how to use different methods to make different models: https://cryptocurrencyresearch.org/high-level/#/method-options. The trControl argument tells the caret package to skip additional resampling of the data. By default, caret resamples the data and tunes hyperparameters to select the parameter values that give the best results, but we will leave that discussion out of this tutorial. See the official caret documentation for more details.
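As a quick sanity check of what turning resampling off means, the sketch below fits the same linear regression with caret and with lm() directly. It uses the built-in mtcars data purely as a stand-in for one train_data tibble, and the choice of columns is arbitrary.

```r
library(caret)

# Built-in mtcars data used purely as a stand-in for one train_data tibble
fit_caret <- train(mpg ~ wt + hp, data = mtcars,
                   method = "lm",
                   trControl = trainControl(method = "none"))

fit_base <- lm(mpg ~ wt + hp, data = mtcars)

# With resampling switched off, caret fits the model once on all rows,
# so the coefficients should match a direct lm() call
coef_caret <- unname(coef(fit_caret$finalModel))
coef_base  <- unname(coef(fit_base))
all.equal(coef_caret, coef_base)
```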

The full list of models that we can make using the caret package and the steps described in the high-level version of the tutorial is available in the list of available models in the caret documentation.

We can now use the new function we created linear_model_caret in conjunction with map() and mutate() to create a new column in the cryptodata_nested dataset called lm_model with the trained linear regression model for each split of the data (by cryptocurrency symbol and split):

cryptodata_nested <- mutate(cryptodata_nested, 
                            lm_model = map(train_data, linear_model_caret))

We can see the new column called lm_model with the nested dataframe grouping variables:

select(cryptodata_nested, lm_model)
## # A tibble: 550 x 3
## # Groups:   symbol, split [550]
##    symbol split lm_model
##    <chr>  <dbl> <list>  
##  1 BTC        1 <train> 
##  2 ETH        1 <train> 
##  3 EOS        1 <train> 
##  4 LTC        1 <train> 
##  5 BSV        1 <train> 
##  6 ADA        1 <train> 
##  7 TRX        1 <train> 
##  8 ZEC        1 <train> 
##  9 HT         1 <train> 
## 10 XMR        1 <train> 
## # ... with 540 more rows

And we can view the summarized contents of the first trained model:

cryptodata_nested$lm_model[[1]]
## Linear Regression 
## 
## 218 samples
##  10 predictor
## 
## No pre-processing
## Resampling: None

7.2.3 Generalize the Function

We can adapt the function we built earlier for the linear regression models using caret, and add a parameter that allows us to specify the method we want to use (as in what predictive model):

model_caret <- function(df, method_choice){
  
  train(target_price_24h ~ . -date_time_utc -date, data = df,
        method = method_choice,
        trControl=trainControl(method="none"))

}

7.2.4 XGBoost Models

Now we can repeat what we did earlier for the linear regression models, this time using the new model_caret function together with map2(), which lets us also pass xgbLinear as the method to create XGBoost models:

cryptodata_nested <- mutate(cryptodata_nested, 
                            xgb_model = map2(train_data, "xgbLinear", model_caret))
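The call above relies on map2() recycling a length-one second argument across every element of the first. The sketch below shows that behavior with a hypothetical helper, describe(), that has the same two-argument shape as model_caret but only builds a label instead of training a model.

```r
library(purrr)

# Hypothetical helper with the same shape as model_caret(df, method_choice)
describe <- function(values, label) {
  paste(label, "on", length(values), "rows")
}

inputs <- list(1:3, 1:5)

# map2() recycles the length-one second argument across every element
via_map2 <- map2(inputs, "xgbLinear", describe)

# Equivalent: fix the constant inside an anonymous function passed to map()
via_map <- map(inputs, function(v) describe(v, "xgbLinear"))

identical(via_map2, via_map)
```

Either form works; map2() keeps the call compact, while the anonymous function makes it explicit that the second argument is a constant.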

We won’t dive into the specifics of each individual model as the correct one to use may depend on a lot of factors and that is a discussion outside the scope of this tutorial. We chose to use the XGBoost model as an example because it has recently gained a lot of popularity as a very effective framework for a variety of problems, and is an essential model for any data scientist to have at their disposal.

There are several possible configurations for XGBoost models. You can find the official documentation here: https://xgboost.readthedocs.io/en/latest/parameter.html

7.2.5 Neural Network Models

We can keep adding models. As we saw, caret allows for the usage of over 200 predictive models. Let’s make another set of models, this time setting the method to dnn to create deep neural networks:

cryptodata_nested <- mutate(cryptodata_nested, 
                            nnet_model = map2(train_data, "dnn", model_caret))

Again, we will not dive into the specifics of the individual models, but a quick Google search will return a myriad of information on the subject.

7.2.6 Random Forest Models

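Next, let’s create tree-based models using the method ctree. Note that in caret, ctree fits a single conditional inference tree rather than a random forest ensemble (methods such as rf or cforest fit forests); we keep the rf_model column name because later steps refer to it: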

cryptodata_nested <- mutate(cryptodata_nested, 
                            rf_model = map2(train_data, "ctree", model_caret))

7.2.7 Principal Component Regression

For one last set of models, let’s make Principal Component Regression models using the method pcr:

cryptodata_nested <- mutate(cryptodata_nested, 
                            pcr_model = map2(train_data, "pcr", model_caret))

7.2.8 Caret Options

Caret offers some additional options to help pre-process the data as well. We outlined an example of this in the high-level version of the tutorial when showing how to make a Support Vector Machine model, which requires the data to be centered and scaled to avoid running into problems (which we won’t discuss further here).
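To make "centered and scaled" concrete, here is a small sketch using base R's scale() on hypothetical columns with very different magnitudes. In caret, this kind of transformation is requested through the preProcess argument of train(), for example preProcess = c("center", "scale"); that call is not run here, and the column names below are made up.

```r
# Hypothetical predictors on very different scales
toy <- data.frame(
  price_usd  = c(44000, 45000, 46000, 47000, 48000),
  small_feat = c(0.1, 0.4, 0.2, 0.5, 0.3)
)

# Center each column to mean 0 and scale it to standard deviation 1
scaled <- as.data.frame(scale(toy, center = TRUE, scale = TRUE))

# After the transformation both columns are directly comparable
col_means <- colMeans(scaled)
col_sds   <- sapply(scaled, sd)
```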

7.3 Make Predictions

Awesome! We have trained the predictive models, and we want to start getting a better understanding of how accurate the models are on data they have never seen before. In order to make these comparisons, we will want to make predictions on the test and holdout datasets, and compare those predictions to what actually ended up happening.

In order to make predictions, we can use the predict() function. Here is an example on the first elements of the nested dataframe:

predict(object = cryptodata_nested$lm_model[[1]],
        newdata = cryptodata_nested$test_data[[1]],
        na.action = na.pass)
##        1        2        3        4        5        6        7        8 
## 47579.57 47149.63 47495.55 47405.16 47457.76 47922.32 48001.12 47922.87 
##        9       10       11       12       13       14       15       16 
## 48319.67 48589.23 48754.34 48663.80 48563.80 48529.82 48633.60 48745.70 
##       17       18       19       20       21       22       23       24 
## 48619.46 48525.38 48771.85 48697.62 48336.68 47957.43 47639.30 47813.47 
##       25       26       27       28       29       30       31       32 
## 47536.00 47690.10 47411.35 47708.66 47488.22 47292.91 47159.73 47035.42 
##       33       34       35       36       37       38       39       40 
## 46677.21 46861.83 46941.35 47042.32 47098.80 47354.46 47299.14 47154.74 
##       41       42       43       44       45       46       47       48 
## 47109.55 46721.45 47070.57 46833.02 46190.25 46358.22 46299.17 45725.51 
##       49       50       51       52       53       54       55       56 
## 45802.73 46043.51 46308.76 46260.36 46085.04 46122.70 45863.90 46108.96 
##       57       58       59       60       61       62       63       64 
## 46472.23 46314.57 46159.24 46003.67 46483.91 46175.75 45894.38 45431.92 
##       65       66       67       68       69       70       71       72 
## 45324.38 45792.47 45678.44 45927.84 46095.19 47003.54 47375.64 47195.02 
##       73 
## 47312.85

Now we can create a new custom function called make_predictions that wraps this functionality in a way that we can use with map2() to iterate through all rows of the nested dataframe:

make_predictions <- function(model, test){
  
  predict(object  = model, newdata = test, na.action = na.pass)
  
}
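The na.action = na.pass argument matters when the new data contains missing values. The sketch below uses a plain lm() model on hypothetical data to show the difference between na.pass and na.omit; we assume predict() on a caret model treats the argument analogously.

```r
# Hypothetical training data and a test set with one missing predictor value
train_toy <- data.frame(x = 1:6, y = c(2.1, 3.9, 6.2, 8.1, 9.8, 12.2))
test_toy  <- data.frame(x = c(7, NA, 9))

fit <- lm(y ~ x, data = train_toy)

# na.pass keeps one prediction per row (NA where the input was missing)
pred_pass <- predict(fit, newdata = test_toy, na.action = na.pass)

# na.omit drops the incomplete row, so predictions no longer line up with rows
pred_omit <- predict(fit, newdata = test_toy, na.action = na.omit)

length(pred_pass)  # same as nrow(test_toy)
length(pred_omit)  # one fewer
```

Keeping one prediction per row makes it straightforward to compare predictions against the actual values row by row later on.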

Now we can create the new columns lm_test_predictions and lm_holdout_predictions with the predictions:

cryptodata_nested <- mutate(cryptodata_nested, 
                            lm_test_predictions =  map2(lm_model,
                                                   test_data,
                                                   make_predictions),
                            
                            lm_holdout_predictions =  map2(lm_model,
                                                      holdout_data,
                                                      make_predictions))

The predictions were made using the models that had only seen the training data, and we can start assessing how good the model is on data it has not seen before in the test and holdout sets. Let’s view the results from the previous step:

select(cryptodata_nested, lm_test_predictions, lm_holdout_predictions)
## # A tibble: 550 x 4
## # Groups:   symbol, split [550]
##    symbol split lm_test_predictions lm_holdout_predictions
##    <chr>  <dbl> <list>              <list>                
##  1 BTC        1 <dbl [73]>          <dbl [76]>            
##  2 ETH        1 <dbl [73]>          <dbl [76]>            
##  3 EOS        1 <dbl [73]>          <dbl [76]>            
##  4 LTC        1 <dbl [73]>          <dbl [76]>            
##  5 BSV        1 <dbl [73]>          <dbl [76]>            
##  6 ADA        1 <dbl [73]>          <dbl [74]>            
##  7 TRX        1 <dbl [73]>          <dbl [76]>            
##  8 ZEC        1 <dbl [73]>          <dbl [76]>            
##  9 HT         1 <dbl [73]>          <dbl [75]>            
## 10 XMR        1 <dbl [73]>          <dbl [75]>            
## # ... with 540 more rows

Now we can do the same for the rest of the models:

cryptodata_nested <- mutate(cryptodata_nested, 
                            # XGBoost:
                            xgb_test_predictions =  map2(xgb_model,
                                                         test_data,
                                                         make_predictions),
                            # holdout
                            xgb_holdout_predictions =  map2(xgb_model,
                                                            holdout_data,
                                                            make_predictions),
                            # Neural Network:
                            nnet_test_predictions =  map2(nnet_model,
                                                          test_data,
                                                          make_predictions),
                            # holdout
                            nnet_holdout_predictions =  map2(nnet_model,
                                                             holdout_data,
                                                             make_predictions),
                            # Random Forest:
                            rf_test_predictions =  map2(rf_model,
                                                        test_data,
                                                        make_predictions),
                            # holdout
                            rf_holdout_predictions =  map2(rf_model,
                                                           holdout_data,
                                                           make_predictions),
                            # PCR:
                            pcr_test_predictions =  map2(pcr_model,
                                                         test_data,
                                                         make_predictions),
                            # holdout
                            pcr_holdout_predictions =  map2(pcr_model,
                                                            holdout_data,
                                                            make_predictions))

We are done using the caret package and can stop the parallel processing cluster:

stopCluster(cl)

In this example we used the caret package because it provides a straightforward option to create a variety of models, but there are several great similar alternatives to make a variety of models in both R and Python. Some noteworthy mentions are tidymodels, mlr, and scikit-learn.

7.4 Timeseries

Because this tutorial is already very dense, we will focus only on the models we created above. When creating predictive models on time series data, there are other excellent options that account for when the information was collected in ways similar to, but more intricate than, the lagged variables we created.

For more information on tools for ARIMA and ETS models, consult the high-level version of this tutorial, where they were discussed.

Move on to the next section ➡️ to assess the accuracy of the models as described in the previous section.