# 7 Predictive Modeling

Now that the data has been cleaned and we have come up with a game plan to assess the efficacy of the models, we finally have everything we need to start making predictive models.

## 7.1 Example Simple Model

We can start by making a simple **linear regression** model:
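```
lm(formula = target_price_24h ~ ., data = cryptodata)
```

Printing the fitted model returns the coefficients that were estimated for each term: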

```
##
## Call:
## lm(formula = target_price_24h ~ ., data = cryptodata)
##
## Coefficients:
## (Intercept) symbolARDR symbolASP symbolAVA
## 19365.74947412 -0.05403509 0.14073462 0.03944255
## symbolBAT symbolBCHA symbolBNT symbolBRD
## -0.06916238 7.79985629 0.17611503 -0.01715688
## symbolBSV symbolBTC symbolBTG symbolBTM
## 3.18941432 901.35372128 0.36446114 0.15936120
## symbolCHZ symbolCKB symbolCRO symbolCRV
## -0.19312370 0.21171114 -0.08707633 0.69101999
## symbolCUR symbolDCR symbolDGB symbolDOGE
## -1.15019151 2.38052281 0.17081554 3.12328654
## symbolELF symbolENJ symbolEOS symbolETH
## -0.05891700 -0.10728996 0.03868342 30.17807335
## symbolETP symbolFTM symbolFTT symbolHT
## 0.18321279 0.13470414 1.10493446 0.30240564
## symbolHYDRA symbolINJ symbolJST symbolKMD
## 21.32092001 16.14280883 -0.07017112 0.07580409
## symbolKNC symbolLEVL symbolLTC symbolMANA
## 0.58597435 0.21181266 3.52639702 0.00985079
## symbolNAV symbolNEXO symbolOAX symbolSRN
## -0.01239283 0.01785864 -0.45246020 -0.01031757
## symbolSTORJ symbolSUN symbolSUSHI symbolTON
## -3.47586774 15.05676270 0.95029784 10.86528973
## symbolTRX symbolUNI symbolVIB symbolWAXP
## 0.12806278 0.41982371 1.66196527 -0.01917840
## symbolXEM symbolXMR symbolZEC symbolZRX
## 0.03283417 3.77503800 2.04134098 0.18402858
## date_time_utc date price_usd lagged_price_1h
## -0.00001086 -0.10385260 0.91353543 -0.00619565
## lagged_price_2h lagged_price_3h lagged_price_6h lagged_price_12h
## 0.06808906 0.01259610 0.01447910 -0.15004653
## lagged_price_24h lagged_price_3d trainingtest trainingtrain
## 0.08383184 0.04750303 11.02274111 -5.75342793
## split
## 23.01738599
```

We defined the **formula** for the model as **target_price_24h ~ .**, which means that we want to make predictions for the **target_price_24h** field (the column to the left of the `~`) using every other column found in the data (the `.`). In other words, we specified a model that uses the **target_price_24h** field as the dependent variable, and all other columns (`.`) as the independent variables. Meaning, we are looking to predict the **target_price_24h**, which is the only column that refers to the future, and use all the information available at the time the rest of the data was collected to infer statistical relationships that can help us forecast the future values of the **target_price_24h** field when it is still unknown on new data that we want to make predictions for.

In the example above we used the **cryptodata** object, which contains all the non-nested data; this was a big oversimplification of the process we will actually use.

### 7.1.1 Using Functional Programming

From this point forward, we will deal with the new dataset **cryptodata_nested**; review the previous section where it was created if you missed it. Here is a preview of the data again:

```
## # A tibble: 260 x 5
## # Groups: symbol, split [260]
## symbol split train_data test_data holdout_data
## <chr> <dbl> <list> <list> <list>
## 1 BTC 1 <tibble [302 x 11]> <tibble [94 x 11]> <tibble [94 x 11]>
## 2 ETH 1 <tibble [302 x 11]> <tibble [94 x 11]> <tibble [96 x 11]>
## 3 EOS 1 <tibble [302 x 11]> <tibble [94 x 11]> <tibble [96 x 11]>
## 4 LTC 1 <tibble [301 x 11]> <tibble [94 x 11]> <tibble [94 x 11]>
## 5 ADA 1 <tibble [302 x 11]> <tibble [94 x 11]> <tibble [96 x 11]>
## 6 BSV 1 <tibble [302 x 11]> <tibble [94 x 11]> <tibble [96 x 11]>
## 7 HT 1 <tibble [296 x 11]> <tibble [93 x 11]> <tibble [95 x 11]>
## 8 TRX 1 <tibble [300 x 11]> <tibble [93 x 11]> <tibble [97 x 11]>
## 9 ZEC 1 <tibble [280 x 11]> <tibble [94 x 11]> <tibble [97 x 11]>
## 10 KNC 1 <tibble [199 x 11]> <tibble [90 x 11]> <tibble [93 x 11]>
## # ... with 250 more rows
```

Because we are now dealing with a **nested dataframe**, performing operations on the individual nested datasets is not as straightforward. We could extract individual elements from the data using **indexing**; for example, we can return the first element of the **train_data** column by running this code:
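```
cryptodata_nested$train_data[[1]]
```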

```
## # A tibble: 302 x 11
## date_time_utc date price_usd target_price_24h lagged_price_1h
## <dttm> <date> <dbl> <dbl> <dbl>
## 1 2020-11-27 00:00:01 2020-11-27 17150. 17140. 17162.
## 2 2020-11-27 01:00:00 2020-11-27 17392. 17144. 17150.
## 3 2020-11-27 02:00:00 2020-11-27 17304. 17130. 17392.
## 4 2020-11-27 03:00:00 2020-11-27 17097. 16986. 17304.
## 5 2020-11-27 04:00:00 2020-11-27 17079. 16988. 17097.
## 6 2020-11-27 05:00:01 2020-11-27 17076. 16955. 17079.
## 7 2020-11-27 06:00:00 2020-11-27 17258. 16981. 17076.
## 8 2020-11-27 07:00:00 2020-11-27 17283. 17017. 17258.
## 9 2020-11-27 08:00:00 2020-11-27 17095. 17015. 17283.
## 10 2020-11-27 09:00:00 2020-11-27 16854. 16883. 17095.
## # ... with 292 more rows, and 6 more variables: lagged_price_2h <dbl>,
## # lagged_price_3h <dbl>, lagged_price_6h <dbl>, lagged_price_12h <dbl>,
## # lagged_price_24h <dbl>, lagged_price_3d <dbl>
```

As we already saw, dataframes are really flexible as a data structure. We can create a new column in the data to store the models themselves, associated with each row of the data. There are several ways we could go about doing this (this tutorial itself was written to execute the same commands using three fundamentally different methodologies), but in this tutorial we will take a **functional programming** approach. This means we will focus on the actions we want to perform, in contrast to a **for loop**, which puts more emphasis on the objects and their indexing, using a structure similar to the one in the example above showing the first element of the **train_data** column.

When using a **functional programming** approach, we first need to create functions for the operations we want to perform. Let’s wrap the **lm()** function we used as an example earlier and create a new custom function called **linear_model**, which takes a dataframe as an input (the **train_data** we will provide for each row of the nested dataset), and generates a linear regression model:
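A sketch of that wrapper (excluding the date fields from the formula here is an assumption, mirroring the **caret** formula used later in this section):

```
linear_model <- function(df){
  # exclude the date columns from the predictors
  lm(target_price_24h ~ . - date_time_utc - date, data = df)
}
```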

We can now use the **map()** function from the **purrr** package in conjunction with the **mutate()** function from **dplyr** to create a new column in the data which contains an individual linear regression model for each row of **train_data**:
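```
cryptodata_nested <- mutate(cryptodata_nested,
                            lm_model = map(train_data, linear_model))
```

Each element of the new **lm_model** column is a fitted `lm` object, shown as `<lm>` in the preview below: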

```
## # A tibble: 260 x 6
## # Groups: symbol, split [260]
## symbol split train_data test_data holdout_data lm_model
## <chr> <dbl> <list> <list> <list> <list>
## 1 BTC 1 <tibble [302 x 11]> <tibble [94 x 11~ <tibble [94 x 11~ <lm>
## 2 ETH 1 <tibble [302 x 11]> <tibble [94 x 11~ <tibble [96 x 11~ <lm>
## 3 EOS 1 <tibble [302 x 11]> <tibble [94 x 11~ <tibble [96 x 11~ <lm>
## 4 LTC 1 <tibble [301 x 11]> <tibble [94 x 11~ <tibble [94 x 11~ <lm>
## 5 ADA 1 <tibble [302 x 11]> <tibble [94 x 11~ <tibble [96 x 11~ <lm>
## 6 BSV 1 <tibble [302 x 11]> <tibble [94 x 11~ <tibble [96 x 11~ <lm>
## 7 HT 1 <tibble [296 x 11]> <tibble [93 x 11~ <tibble [95 x 11~ <lm>
## 8 TRX 1 <tibble [300 x 11]> <tibble [93 x 11~ <tibble [97 x 11~ <lm>
## 9 ZEC 1 <tibble [280 x 11]> <tibble [94 x 11~ <tibble [97 x 11~ <lm>
## 10 KNC 1 <tibble [199 x 11]> <tibble [90 x 11~ <tibble [93 x 11~ <lm>
## # ... with 250 more rows
```

Awesome! Now we can use the same tools we learned in the high-level version to make a wider variety of predictive models to test.

## 7.2 Caret

Refer back to the high-level version of the tutorial for an explanation of the caret package, or consult this document: https://topepo.github.io/caret/index.html

### 7.2.1 Parallel Processing

R is a **single threaded** application, meaning it only uses one CPU at a time when performing operations. The step below is optional and uses the **parallel** and **doParallel** packages to allow R to use more than a single CPU when creating the predictive models, which will speed up the process considerably:
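A typical setup looks like this (leaving one core free is a judgment call, not a requirement):

```
library(doParallel)

# create a cluster using all but one of the available CPU cores
cl <- makePSOCKcluster(detectCores() - 1)
registerDoParallel(cl)
```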

### 7.2.2 More Functional Programming

Now we can repeat the process we used earlier to create a column with the linear regression models to create **the exact same models**, but this time using the **caret** package.

```
linear_model_caret <- function(df){
  train(target_price_24h ~ . - date_time_utc - date, data = df,
        method = "lm",
        trControl = trainControl(method = "none"))
}
```

We specified the method as `lm` for linear regression. See the high-level version for a refresher on how to use different methods to make different models: https://cryptocurrencyresearch.org/high-level/#/method-options.

The `trControl` argument tells the **caret** package to avoid additional resampling of the data. By default, caret will resample the data and perform hyperparameter tuning to select the best values for the parameters, but we will avoid that discussion in this tutorial; see the official caret documentation for more details. That documentation also contains the full list of models that we can make using the **caret** package and the steps described in the high-level version of the tutorial.

We can now use the new function we created **linear_model_caret** in conjunction with **map()** and **mutate()** to create a new column in the **cryptodata_nested** dataset called **lm_model** with the trained linear regression model for each split of the data (by cryptocurrency **symbol** and **split**):
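```
cryptodata_nested <- mutate(cryptodata_nested,
                            lm_model = map(train_data, linear_model_caret))
```

This overwrites the **lm_model** column from before, so each element is now a caret `train` object rather than a plain `lm` object.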

We can see the new column called **lm_model** with the nested dataframe grouping variables:
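One way to do this is with `select()` (because the data is grouped by **symbol** and **split**, the grouping columns are retained automatically):

```
select(cryptodata_nested, lm_model)
```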

```
## # A tibble: 260 x 3
## # Groups: symbol, split [260]
## symbol split lm_model
## <chr> <dbl> <list>
## 1 BTC 1 <train>
## 2 ETH 1 <train>
## 3 EOS 1 <train>
## 4 LTC 1 <train>
## 5 ADA 1 <train>
## 6 BSV 1 <train>
## 7 HT 1 <train>
## 8 TRX 1 <train>
## 9 ZEC 1 <train>
## 10 KNC 1 <train>
## # ... with 250 more rows
```

And we can view the summarized contents of the first trained model:
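```
cryptodata_nested$lm_model[[1]]
```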

```
## Linear Regression
##
## 302 samples
## 10 predictor
##
## No pre-processing
## Resampling: None
```

### 7.2.3 Generalize the Function

We can adapt the function we built earlier for the linear regression models using caret, and add a parameter that allows us to specify the **method** we want to use (as in what predictive model):
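```
model_caret <- function(df, method_choice){
  train(target_price_24h ~ . - date_time_utc - date, data = df,
        method = method_choice,
        trControl = trainControl(method = "none"))
}
```

The argument order matters here: in the next step, `map2()` passes **train_data** as the first argument and the method string as the second.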

### 7.2.4 XGBoost Models

Now we can do the same thing we did earlier for the linear regression models, but this time using the new **model_caret** function with **map2()**, which lets us also pass **xgbLinear** as the method in order to create an **XGBoost** model:

```
cryptodata_nested <- mutate(cryptodata_nested,
                            xgb_model = map2(train_data, "xgbLinear", model_caret))
```

We won’t dive into the specifics of each individual model, as the correct one to use may depend on many factors, and that discussion is outside the scope of this tutorial. We chose the **XGBoost** model as an example because it has recently gained a lot of popularity as a very effective framework for a variety of problems, and it is an essential model for any data scientist to have at their disposal.

There are several possible configurations for XGBoost models; you can find the official documentation here: https://xgboost.readthedocs.io/en/latest/parameter.html

### 7.2.5 Neural Network Models

We can keep adding models. As we saw, caret allows for the usage of over 200 predictive models. Let's make another set of models, this time setting the **method** to `dnn` to create **deep neural networks**:
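Following the same `map2()` pattern as the XGBoost step (the **nnet_model** column name matches the prediction code later in this section):

```
cryptodata_nested <- mutate(cryptodata_nested,
                            nnet_model = map2(train_data, "dnn", model_caret))
```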

*Again, we will not dive into the specifics of the individual models, but a quick Google search will return a myriad of information on the subject.*

### 7.2.6 Random Forest Models

Next, let's create Random Forest models using the **method** `ctree`:
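Again using **model_caret**, storing the results in the **rf_model** column referenced by the prediction step below:

```
cryptodata_nested <- mutate(cryptodata_nested,
                            rf_model = map2(train_data, "ctree", model_caret))
```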

### 7.2.7 Principal Component Regression

For one last set of models, let's make Principal Component Regression models using the **method** `pcr`:
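```
cryptodata_nested <- mutate(cryptodata_nested,
                            pcr_model = map2(train_data, "pcr", model_caret))
```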

### 7.2.8 Caret Options

Caret offers some additional options to help pre-process the data as well. We outlined an example of this in the high-level version of the tutorial when showing how to make a **Support Vector Machine** model, which requires the data to be **centered** and **scaled** to avoid running into problems (which we won’t discuss further here).
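As a sketch of what those options look like (illustrative only; `svmRadial` is one of several SVM methods available in caret, and we do not train this model as part of this section, so unlike the models above it uses caret's default resampling):

```
train(target_price_24h ~ . - date_time_utc - date,
      data = cryptodata_nested$train_data[[1]],
      method = "svmRadial",
      # center and scale the predictors before fitting:
      preProcess = c("center", "scale"))
```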

## 7.3 Make Predictions

Awesome! We have trained the predictive models, and we want to start getting a better understanding of how accurate the models are on data they have never seen before. In order to make these comparisons, we will want to make predictions on the test and holdout datasets, and compare those predictions to what actually ended up happening.

In order to make predictions, we can use the **predict()** function; here is an example using the first elements of the nested dataframe:

```
predict(object = cryptodata_nested$lm_model[[1]],
        newdata = cryptodata_nested$test_data[[1]],
        na.action = na.pass)
```

```
## 1 2 3 4 5 6 7 8
## 18521.49 18532.43 18566.93 18531.92 18447.42 18520.93 18589.95 18532.19
## 9 10 11 12 13 14 15 16
## 18505.52 18509.30 18469.29 18462.86 18393.96 18433.85 18417.05 18464.22
## 17 18 19 20 21 22 23 24
## 18526.31 18573.33 18516.22 18446.11 18387.31 18366.11 18283.56 18279.71
## 25 26 27 28 29 30 31 32
## 18211.02 18342.43 18310.68 18301.69 18313.70 18267.78 18307.11 18336.30
## 33 34 35 36 37 38 39 40
## 18351.25 18258.46 18334.24 18272.11 18356.89 18383.54 18287.66 18257.78
## 41 42 43 44 45 46 47 48
## 18298.52 18247.54 18309.58 18328.91 18384.48 18340.78 18605.96 18556.83
## 49 50 51 52 53 54 55 56
## 18644.57 18576.26 18630.67 18586.31 18646.81 18709.02 18684.97 18746.59
## 57 58 59 60 61 62 63 64
## 18699.45 18765.40 18744.26 18686.92 18651.52 18683.46 18707.89 18771.31
## 65 66 67 68 69 70 71 72
## 18855.43 18905.36 18860.83 18829.95 18820.93 18848.02 18810.72 18837.89
## 73 74 75 76 77 78 79 80
## 18832.59 18843.40 18865.02 18881.32 19003.68 19121.73 19075.31 19124.18
## 81 82 83 84 85 86 87 88
## 19144.08 19119.67 19134.90 19171.99 19191.12 19136.40 19210.56 19153.66
## 89 90 91 92 93 94
## 19124.79 19145.68 19129.94 19157.28 19119.43 19194.35
```

Now we can create a new custom function called **make_predictions** that wraps this functionality in a way that we can use with **map2()** to iterate through every row of the nested dataframe:

```
make_predictions <- function(model, test){
  predict(object = model, newdata = test, na.action = na.pass)
}
```

Now we can create the new columns **lm_test_predictions** and **lm_holdout_predictions** with the predictions:

```
cryptodata_nested <- mutate(cryptodata_nested,
                            lm_test_predictions = map2(lm_model,
                                                       test_data,
                                                       make_predictions),
                            lm_holdout_predictions = map2(lm_model,
                                                          holdout_data,
                                                          make_predictions))
```

The predictions were made using the models that had only seen the **training data**, and we can start assessing how good the model is on data it has not seen before in the **test** and **holdout** sets. Let’s view the results from the previous step:

```
## # A tibble: 260 x 4
## # Groups: symbol, split [260]
## symbol split lm_test_predictions lm_holdout_predictions
## <chr> <dbl> <list> <list>
## 1 BTC 1 <dbl [94]> <dbl [94]>
## 2 ETH 1 <dbl [94]> <dbl [96]>
## 3 EOS 1 <dbl [94]> <dbl [96]>
## 4 LTC 1 <dbl [94]> <dbl [94]>
## 5 ADA 1 <dbl [94]> <dbl [96]>
## 6 BSV 1 <dbl [94]> <dbl [96]>
## 7 HT 1 <dbl [93]> <dbl [95]>
## 8 TRX 1 <dbl [93]> <dbl [97]>
## 9 ZEC 1 <dbl [94]> <dbl [97]>
## 10 KNC 1 <dbl [90]> <dbl [93]>
## # ... with 250 more rows
```

Now we can do the same for the rest of the models:

```
cryptodata_nested <- mutate(cryptodata_nested,
                            # XGBoost:
                            xgb_test_predictions = map2(xgb_model,
                                                        test_data,
                                                        make_predictions),
                            # holdout
                            xgb_holdout_predictions = map2(xgb_model,
                                                           holdout_data,
                                                           make_predictions),
                            # Neural Network:
                            nnet_test_predictions = map2(nnet_model,
                                                         test_data,
                                                         make_predictions),
                            # holdout
                            nnet_holdout_predictions = map2(nnet_model,
                                                            holdout_data,
                                                            make_predictions),
                            # Random Forest:
                            rf_test_predictions = map2(rf_model,
                                                       test_data,
                                                       make_predictions),
                            # holdout
                            rf_holdout_predictions = map2(rf_model,
                                                          holdout_data,
                                                          make_predictions),
                            # PCR:
                            pcr_test_predictions = map2(pcr_model,
                                                        test_data,
                                                        make_predictions),
                            # holdout
                            pcr_holdout_predictions = map2(pcr_model,
                                                           holdout_data,
                                                           make_predictions))
```

We are done using the **caret** package and can stop the parallel processing cluster:
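Assuming the cluster object `cl` from the parallel processing step above:

```
stopCluster(cl)
# tell foreach to return to sequential execution
registerDoSEQ()
```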

## 7.4 Timeseries

Because this tutorial is already very dense, we will just focus on the models we created above. When creating predictive models on timeseries data, there are other excellent options that take into account when the information was collected, in ways similar to (but more intricate than) the lagged variables we created.

For more information on using excellent tools for ARIMA and ETS models, consult the high-level version of this tutorial where they were discussed.

Move on to the next section ➡️ to assess the accuracy of the models, following the game plan described earlier.