cryptocurrencyresearch


High-Level Tutorial

Introduction

Welcome to this tutorial on supervised machine learning in the R programming language. This is the high-level version of a more detailed tutorial.

You can navigate between topics with the right and left arrow keys, and press on the down key to learn more about each topic.

You can also press the "o" or "esc" keys on your keyboard for an overview of the slides, which can help with navigation. Press "f" to switch to full-screen.

Press on the down arrow key on your keyboard for an overview of what you will learn by following along with the tutorial.

Overview

  • Whenever an R package is referenced, the text will be colored orange. We will discuss R packages and the rest of the terms below later on in this presentation.

  • Whenever a function is referenced, it will be colored green.

  • Whenever an R object is referenced, it will be colored blue. We will also refer to the parameters of functions in blue.

  • When a term is particularly common in machine learning or data science, we will call it out with purple text, but only the first time it appears.

Overview - Continued

You do not need pre-existing R knowledge to follow along, but you should be familiar with working with data (even if only in Excel). Here is what we will cover:

  • You will gain a better understanding of how to install and load packages in R.

  • You will get an introduction to the tidyverse by learning about data manipulation and visualizations.

  • You will learn to make many different types of supervised machine learning models using a standardized approach using the caret package.

  • Finally, you will learn how to use tidy tools for time series analysis (tsibble and fable).

This Version

  • This version of the tutorial is meant to be fully reproducible, meaning the results will never change: by running the code as outlined you will always get the exact same results shown here.

  • The second main difference is that in this version we only deal with the historical prices for one cryptocurrency, while in the full version we make independent predictive models for each one on fresh data every 12 hours.

  • The problem itself has also been simplified; here we do not deal with all the nuances relating to a cryptocurrency’s price.

What is Machine Learning?

Making forecasts about the future based on data from the past, through a process called Supervised Machine Learning, has become increasingly easy over the past decade; it is now a clearly defined step-by-step process that any professional who regularly works with data can follow along with.

Machine Learning is the process of training a computer to find an answer within clearly defined rules. We provide a machine learning algorithm data about the past and outline the “rules” to work within, and the model looks for statistical structure (depending on the specific model used) that can be used to make forecasts about the future.

Supervised Machine Learning

The Supervised part of supervised machine learning refers to the fact that the data was observed in the past, and provides us information regarding both the inputs (independent variables) and the outcome (dependent variable, also known as the target variable). Click here for a more in-depth explanation of the different branches of machine learning.

What are Cryptocurrencies?

A cryptocurrency is usually defined as a digital currency based on blockchain technology with properties that make it difficult to exploit. The most famous and original cryptocurrency is Bitcoin; you can view the original whitepaper for it here.

Cryptocurrencies can be traded on cryptocurrency exchanges, and unlike the stock market they are available for trading 24/7. In this tutorial we will use data for the second most well known cryptocurrency, which is called Ethereum and goes by the trading ticker symbol of ETH.

R Interface

This section will provide a very basic overview of what it actually means to use the R programming language. We will start by covering what an R session is and how to get started. If you have already used R and RStudio in the past, feel free to skip the section below.

R and RStudio

The R programming language is a free computer application that can be downloaded from the internet. Anyone using R should be aware of RStudio, which is a different computer application that can also be downloaded for free online. RStudio makes programming in R a much better experience and is a tool anyone using R should be familiar with. After installing these two programs, anyone is ready to start programming in R and follow along with this tutorial.

R Sessions

Whenever you open the RStudio application for the first time, you will be brought to a new R session that should look like this:

R Sessions - Continued

Throughout this tutorial, we will show pieces of R code that execute as part of this presentation (which was created in a tool called R Markdown within RStudio that you do not need to worry about):

print("Hello from R!")
## [1] "Hello from R!"

The output above shows the result of the code we ran, where the number [1] indicates this is the first value of the output.

Keep going below ⬇️ to learn more.

R Sessions - Continued

1. When we run code in this presentation as shown in the previous slide ⬆️, it is equivalent to running the code in your console in RStudio, which is where the > symbol is inside the red highlighted area.

2. If you wanted to save your code to use in a new R session once you close the program, you would need to create a new file in the top left corner of RStudio.

R Objects

When we start a new R session, for example when we open RStudio, the environment should be empty (as displayed in the top right pane of the previous screenshot). We can create new objects by naming them whatever we would like, and using an arrow to assign the object some kind of value:
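A minimal illustration (the object name my_number is arbitrary and not used elsewhere in this tutorial):

# create a new object called my_number and use the arrow to assign it the value 5
my_number <- 5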

R Objects - Continued

Whenever we run a command in R, we have the option of saving the result in the R environment. For example, we can create a new object called words:

words <- "this object stores text"

We could have shown the results directly instead of saving them as a variable. Saving results as R objects is helpful for approaching a problem step-by-step. We can show the results that we saved by running a command with the name of the object:

words
## [1] "this object stores text"

R Functions

When we want to perform some kind of operation on an R object, we will use a function. Earlier, when we ran the command print(“Hello from R!”), we asked R to print() the text “Hello from R!”.

A new R session has many functions available to it, but the default functionality pales in comparison to the number of functions that have been created by R users over the years. We can load functions created by other users and researchers by installing a package on our computer and loading the functionality in the R session, which we will review in detail in a later section.

R Functions - continued

Different functions have different parameters, which are the inputs for a function. Some functions require more inputs than others to work. The print() function, for example, only requires us to input one parameter for what we want to print in the output. Many other functions, however, require more inputs and will not work with a single input provided. In many cases, a function will have default values for a parameter, so that if no selection is made by the user the default behavior is used, and if the user makes a selection it overrides the default behavior.
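For example, base R's round() function (not otherwise used in this tutorial) has a digits parameter with a default value of 0:

# with a single input, the default digits = 0 is used and the value is rounded to a whole number
round(3.14159)
# supplying the digits parameter overrides the default behavior
round(3.14159, digits = 2)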

R - Getting Help

But what if you don’t know how to use a function? R has some excellent resources available, and in most cases you can answer usage questions directly inside of R by running the help() function on any other function. When packages are installed, they come with documentation that can be opened this way. The last section of the documentation tends to be the most useful and includes examples of using the function.
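For example, we can open the documentation for the print() function like this:

# open the documentation page for print()
help(print)
# the shorthand ?print does the same thing
?print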

Move on

There is a lot more to cover in terms of “how do I program in R?”, but this is all the knowledge you need to be able to follow along with this tutorial.

Move on to the next section by pressing the right arrow key on your keyboard ➡️

Disclaimer

This tutorial is made available for learning and educational purposes only, and the information that follows does not constitute trading advice in any way, shape, or form. We avoid giving any advice when it comes to trading strategies, as this is a very complex ecosystem that is out of the scope of this tutorial. We make no attempt in this regard, and if trading, rather than data science, is your interest, your time would be better spent following a different tutorial that aims to answer those questions.

Follow Along With the code

As code comes up and is explained, we recommend that you run it yourself on your computer as well. You can run the code by following the installation instructions as outlined in the steps below.

Alternatively, you can click here to run the code in the cloud without having to worry about installing anything.

Setup R Environment

In order to follow along with this tutorial, you will need to install additional R packages. At this point you will need to have R and RStudio installed as previously outlined; if you do not, follow these instructions first. This tutorial is for beginners, and each step is clearly explained.

If you do not want to install and run R on your computer, you can instead click here to run the code in a cloud environment.

Press on the down arrow key ⬇️ on your keyboard for instructions on how to setup your machine to follow along with this R programming example.

What are packages?

Packages are collections of functions and data that other users have made shareable. We can install these packages into our own library of R tools and load them into our R session, which can enable us to write powerful code with minimal effort compared to writing the same code without the additional packages. Many packages are simply time savers for things we could do with the default/base functionality of R, but sometimes if we want to do something like make a static chart interactive when hovering over points on the chart, we are better off using a package someone already came up with rather than re-invent the wheel.

Keep going below ⬇️ to learn how to install R packages.

Installing Packages

We can install packages to extend the functionality of what we can do in R. As a first step, we will need to run the command install.packages():

install.packages("package_name")

We only need to install any given package once on any given computer, kind of like installing an application (like RStudio or Google Chrome) once.

Using Packages With library()

After installation, we can use library(package) to import those packages into the current R session. Each time we open a new session in R, we need to import the packages again using library(). R has some functionality that is just there at startup, which is typically referred to as base R. When we want to extend R past what base R has to offer, we need to import the package into the current session using the function library(); if the package has never been installed on top of your current version of base R, you will first need to install it with install.packages().

A package only needs to be installed once, but needs to be imported into each new R session using library().

Installing a Package Example

We can install the package called pacman (Rinker and Kurkiewicz 2018) using install.packages():

install.packages("pacman")

pacman does not refer to the video game; it stands for package manager. After we import this package, we will be able to use the new functions that come with it. We can use one of those functions to install the remaining packages we will need for the rest of the tutorial. The advantage of using the new function is that the installation happens in a “smarter” way: if you already have a package in your library, it will not be installed again.

Use “pacman”

We can now import the pacman package:

library(pacman)

Now we have access to the function p_load():

p_load("pins","dplyr","ggplot2","ggthemes","gganimate","caret","xgboost","kernlab", "tsibble","fabletools","fable","feasts","urca","plotly")

Running p_load() is equivalent to running install.packages() on each of the packages listed (but only when they are not already installed on your computer), and then running library() for each package in quotes separated by commas to import the functionality from the package into the current R session. Both commands are wrapped inside the single function call to p_load().

All Set

If you followed along with the steps as outlined, the code in the upcoming slides should work on your computer when run in the correct order.

Move on to the next slide (right arrow key ➡️) to get started!

Get the Data

When doing an analysis, you would typically start with some kind of tabular data, like an Excel file. An Excel file can be saved with the csv extension, which can then easily be read by R using the read.csv() function. When running the function we can save the results in an R object and start doing the analysis.
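For example, if the data were saved locally as a csv file, reading it in might look like the sketch below (the file name and object name here are hypothetical):

# read a local csv file and save the result as an R object
my_data <- read.csv("crypto_prices.csv")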

The approach described requires you to download the data to your computer. Rather than having you download the data and place it in the correct folder for your R session to reference, we will take a more unorthodox approach that will let us download the data directly from a specific URL. Do not worry too much about the code in the next two slides outside of the fact that we are retrieving the data from an online resource (instead of the more common local computer storage methodology) and saving the data to an R object.

Pins Package

The first package we will use is the pins package (Luraschi 2020) to retrieve the data that we will be working with.

First, we will need to connect to a public GitHub repository (anyone can post their code to the GitHub website and make a “repository” with code for their project) and register the board that the data is pinned to by using the board_register() function:

board_register(name = "pins_board", url = "https://raw.githubusercontent.com/predictcrypto/pins/master/", board = "datatxt")

By running the board_register() command on the URL where the data is located, we will be able to “ask” for any dataset available on the board (in the next slide ⬇️).

Pull Data

The previous step created a connection to pull the dataset we will use; don’t worry too much about the code on the previous slide, but you can learn more by clicking here. Now that we are connected to the board, we can use the pin_get() function to retrieve the data that we will use for the rest of this tutorial:

cryptodata <- pin_get(name = "ETH_Binance")

The dataset named ETH_Binance is data relating to the cryptocurrency Ethereum (ETH) on the Binance exchange and was downloaded from the website https://cryptodatadownload.com/.

We will be working with hourly data for the date range from 2020-01-01 to 2020-07-31.

Data Source

The data is made available by the website cryptodatadownload.com:

Original Data Source

The data was originally sourced from the link for the ETH/USD pair and the [Hourly] option found below:

Data Preview

Only 100 rows of data are made available in the table above as an example

Documentation

The data we will be using was downloaded from cryptodatadownload.com. This is tidy data, meaning:

  1. Every column is a variable.

  2. Every row is an observation.

  3. Every cell is a single value.

The cryptocurrency’s price is recorded in one hour intervals and includes 212 days worth of data. Each row summarizes pricing information relating to the cryptocurrency over the specified 1 hour period (the very last field shows the date and the time associated with the row).

Data Dictionary

  • DateTime: The date and time that the data was collected in the UTC timezone.
  • Date: The date on which the data was collected in the UTC timezone.
  • Symbol: The cryptocurrency associated with the row of data. “ETH” stands for the “Ethereum” cryptocurrency.
  • Open: The price of the cryptocurrency in US Dollars ($) at the start of the hour.
  • High: The highest price during the course of the hour associated with the row of data in US Dollars.
  • Low: The lowest price during the course of the hour associated with the row of data in US Dollars.

Data Dictionary - Continued

  • Close: The price of the cryptocurrency in US Dollars ($) at the end of the hour.
  • VolumeUSDT: The total trading volume during the one hour period associated with the row of data.
  • CloseLag1Hour: The value of the field Close, offset by 1 hour into the past, meaning the Close price of the cryptocurrency 1 hour earlier in US Dollars ($).
  • CloseLag12Hour: The Close price of the cryptocurrency 12 hours earlier in US Dollars ($).
  • CloseLag24Hour: The Close price of the cryptocurrency 24 hours earlier in US Dollars ($).
  • CloseLag3Day: The Close price of the cryptocurrency 3 days earlier in US Dollars ($).

Data Dictionary - Continued

  • CloseLag7Day: The Close price of the cryptocurrency 7 days earlier in US Dollars ($).
  • CloseLag14Day: The Close price of the cryptocurrency 14 days earlier in US Dollars ($).
  • CloseLag30Day: The Close price of the cryptocurrency 30 days earlier in US Dollars ($).
  • CloseLag90Day: The Close price of the cryptocurrency 90 days earlier in US Dollars ($).
  • CloseLag120Day: The Close price of the cryptocurrency 120 days earlier in US Dollars ($).
  • Target24HourClose: The value of the field Close, offset by 24 hours into the future. This is the column we are interested in predicting values for.

Data Preview

Only 100 rows shown. If you are unclear on what a particular column might be, you can refer to the previous section, which contains a data dictionary describing each column.

What is the goal?

Before we even start cleaning the data, it’s important to establish what we want to do, as the steps we take may very well depend on those intentions. Our goal is to predict the future price of the cryptocurrency called Ethereum. For any given hour, we are given the Open price at the start of the hour, the Close price at the end of the hour, the lowest and highest prices for the given hour, the volume associated with that hour of trading, as well as the Close price from the previous hour, the previous 12 hours, and 7 other lagged variables. We want to be able to draw relationships between these variables in order to predict the value of the field called Target24HourClose, which is the Close price 24 hours into the future relative to the rest of the columns in the data.

Data Prep

Data preparation is an essential step for effective predictive modeling. Not all datasets are created equal, and the quality of the models we are able to make will only ever be as good as the quality of the data itself. Cleaning a dataset can mean many different things depending on the problem at hand, and unfortunately we are unable to cover them all here; this tutorial focuses on the predictive modeling side of things, so the data provided has already been cleaned beforehand, but we still introduce some useful tools in the section below.

Cleaning Data - Overview

What does “cleaning data” look like more specifically? Data tends to have issues, and before we start deriving insights from it, we usually need to fix those issues, which is what we mean by cleaning the data. Some examples:

  • Are the columns of the dataset the correct data types (i.e. numeric, string, etc..)?

  • Are there any missing (null) values? What proportion of all rows is missing for each column? What is the context behind them? Is this problematic in terms of what your end goal is? Are there any duplicate rows?

  • Are there any inconsistencies in the data? Do the outliers make sense or is there a data capture issue of some kind? Are there typos if the data is manually entered?

There are tools that can help answer all of these questions; in the full version of this tutorial, we cover some functionality found in the skimr package. After we have made sure we are working with a clean dataset, we can do some data engineering to try to improve the accuracy of the models we will be making by adding new information that might be relevant to the problem at hand, which is what we will be doing in the next step when we add the new spread column to the data.
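A few base R commands can give quick answers to questions like these; here is a sketch run against our cryptodata object (skimr, covered in the full version, automates much of this):

# check the data type of each column
str(cryptodata)
# count missing (NA) values in each column
colSums(is.na(cryptodata))
# count fully duplicated rows
sum(duplicated(cryptodata))
# inspect the range of a column to spot suspicious outliers
summary(cryptodata$Close)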

Exploratory Data Analysis

It is worth mentioning that Exploratory Data Analysis (EDA) is the important process by which you figure out what about the data needs to be cleaned and adjusted, but we are not covering it in this tutorial, and these steps have already been taken care of. You can find an excellent guide to EDA here: https://r4ds.had.co.nz/exploratory-data-analysis.html

In the full version of this tutorial we are a bit more thoughtful around EDA.

Dplyr Package

There are resources and tools out there that are very effective for cleaning and manipulating data. We recommend starting with the dplyr package (Wickham et al. 2020) if working in R, which is what we used to clean the data. Because this is such a useful tool in R, we included a brief example using the mutate() function to add a new column to the data that calculates the spread for each row, which we will define as the percent difference between the Low and the High prices.

Mutate Function - dplyr

The dplyr package from the tidyverse (Wickham et al. 2019) is very useful for cleaning and manipulating data. Using the mutate() function from dplyr, we can add a new column to the data according to a formula that gets applied to each row. For example, we could add a brand new column (C) to our data by taking the sum of two columns (A+B), as illustrated below:

Illustration by Allison Horst
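As a minimal sketch of that idea (the columns A, B, and C below are hypothetical and not part of our dataset):

# a tiny example data frame with two numeric columns
example_data <- data.frame(A = c(1, 2, 3), B = c(10, 20, 30))
# mutate() adds a new column C, computed row by row as the sum of A and B
mutate(example_data, C = A + B)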

Add Spread

Using mutate() we can add a new column called spread, which tracks the percent difference between the Low price and the High price at each hourly datapoint as an absolute value:

cryptodata <- mutate(cryptodata, spread = abs(((High - Low)/Low)*100))
select(cryptodata, DateTime, Low, High, spread) #Show results (only columns of interest)
## # A tibble: 5,111 x 4
##    DateTime              Low  High spread
##    <dttm>              <dbl> <dbl>  <dbl>
##  1 2020-01-01 01:00:00  129.  131.  1.45 
##  2 2020-01-01 02:00:00  130.  131.  0.483
##  3 2020-01-01 03:00:00  130.  131.  0.731
##  4 2020-01-01 04:00:00  130.  131.  0.453
##  5 2020-01-01 05:00:00  130.  130.  0.277
##  6 2020-01-01 06:00:00  130.  131.  0.376
##  7 2020-01-01 07:00:00  130.  131.  0.438
##  8 2020-01-01 08:00:00  130.  130.  0.416
##  9 2020-01-01 09:00:00  130.  131.  0.377
## 10 2020-01-01 10:00:00  130.  131.  0.577
## # ... with 5,101 more rows

Remove Symbol Column

In the previous slide we checked that we calculated the spread correctly by returning only the relevant columns using the select() function. The column named Symbol in the dataset always has the same value of ETH and adds no value when making predictions, so we can remove the column very easily using the select() function again, this time using it to drop a column from the data:

cryptodata <- select(cryptodata, -Symbol)

Train/Test Split

When we make predictive models, we would not want to make important real-life decisions based on those models without having a good idea of how effective the methodology actually is, and without ensuring no mistakes were made. We will likely never get a 100% accurate representation of how the model will actually perform in the real world without tracking those results over time, but there are ways for us to get a sense of whether something works or not ahead of time. Here, we take the simplest of those approaches by training our models on the earlier 80% of the data, and testing the accuracy of those models on the remaining 20% of new, unseen data to start evaluating how good the models are. This is an important step! You should never trust a model that has not been tested on data outside of what it was trained on, and you should understand those results first.

Later we will introduce a more powerful way of doing this called cross validation which repeats the process over random samples of the data. Because our data is captured over a period of time, we will consider when the data was collected in the way we split, and do time aware cross validation.

Train/Test Split - Continued

Let’s split the data into a train and test dataset. We will build the models on the earlier 80% of the data, and assess how well the models perform on the last 20% of the data. For now this is a simplified version of the process, which will ultimately split the data into more subsets and consider the specific date on which the data was collected in those splits.

# Take first 80% of rows to make train data
cryptodata_train <- head(cryptodata, as.integer(nrow(cryptodata)*.8))
# Use the other 20% for the test data
cryptodata_test <- tail(cryptodata, as.integer(nrow(cryptodata)*.2))

The first piece of code is taking the first 80% of the rows, and the second line for cryptodata_test creates the object taking the last 20% of the rows of cryptodata (which was already sorted ahead of time from the earliest to the latest datapoint).

  • nrow() returns the total number of rows in the data.

  • as.integer() removes decimals from the result of nrow(cryptodata)*.8.

  • head() returns the first rows of the data, with the number of rows given by the other two operations (see the worked example below).

  • tail() does the opposite of head() and gets the last elements.
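Putting those pieces together, with the 5,111 rows in our dataset the split works out as follows:

nrow(cryptodata)                   # 5111 rows in total
as.integer(nrow(cryptodata) * .8)  # 4088 rows go into cryptodata_train
as.integer(nrow(cryptodata) * .2)  # 1022 rows go into cryptodata_test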

Helpful Resources

As previously mentioned, we omitted data cleaning/preparation steps as they might not be relevant to all predictive analytics problems. If you are looking for a good resource for these types of problems, look no further than “STAT 545”: https://stat545.com/ which takes the opposite approach of what we are doing here, and “is about everything that comes up during data analysis except for statistical modelling and inference.”

Also remember that if you do not know what a function does, you can put it inside of the function help() to open the documentation associated with the function. You can learn more about help() here: https://www.r-project.org/help.html

Visualizing Data

The ggplot2 package (Wickham 2016) is the standard when it comes to making plots in R. The gg in ggplot2 stands for the Grammar of Graphics, which is essentially the idea that many different types of charts share the same underlying building blocks, and that they can be put together in different ways to make charts that look very different from each other. In Hadley Wickham’s (the creator of ggplot2) own words, “a pie chart is just a bar chart drawn in polar coordinates”, “They look very different, but in terms of the grammar they have a lot of underlying similarities.”

Illustration by Allison Horst

Basics - ggplot2

So how does ggplot2 actually work? According to the official documentation, "…start with ggplot(), supply a dataset and aesthetic mapping (with aes())…"

price_chart <- ggplot(data = cryptodata, aes(x = DateTime, y = Close))
# Show chart:
price_chart

Here we specified which data to use and which variables we are interested in plotting; find out about specifying a geom to visually represent the points in the slide below ⬇️

ggplot - Add geom

In the previous slide, we got a blank chart, but that was not due to an error. Up to this point, we have specified the data as being cryptodata, and we assigned the x-axis to the variable DateTime and the y-axis to the variable Close, but we have not yet specified how to plot the information. Should the points be shown as dots? Lines? Bars? We can specify the shape by adding a geom, in this case a line using geom_line(). You can find many more geom types here: https://ggplot2.tidyverse.org/reference/

price_chart <- price_chart + geom_line()
# Show chart
price_chart

ggplot add title

The main takeaway is that we don’t need to re-build every aspect of a chart when we want to produce something that looks different. We can keep building a chart one piece at a time, and change those pieces when we have to. Here is another example adding a title:

price_chart <- price_chart + ggtitle('Price ($) Over Time - Ethereum')
# Show chart
price_chart

We could of course do the same to rename the x and y axes, as sketched below. The ggplot2 package is capable of much more than that, however; go to the slide below to learn about how the functionality of the ggplot2 package can be extended further.
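A minimal sketch of relabeling both axes with the labs() function (the label text here is arbitrary):

# relabel the x and y axes without rebuilding the rest of the chart
price_chart + labs(x = "Date and Time (UTC)", y = "Closing Price ($)")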

Extending ggplot2

The ggplot2 package has an official framework for how it can be extended, and the R community has developed several very useful extensions, which allow for things like animations and interactive charts. You can find a pretty comprehensive list of useful extensions here: https://exts.ggplot2.tidyverse.org/

Extension - ggthemes

For example, we could use the ggthemes package (Arnold 2019), which was installed and loaded previously, to change the look of the chart:

price_chart + theme_economist()

Extension - gganimate

We cannot cover all these extensions in detail here, but as one last example here we use the transition_states() function from the gganimate package (Pedersen and Robinson 2020) to create an animation that iterates through each date and adjusts the axis relative to the current data:

animated_chart <- price_chart + 
                      transition_states(Date) + 
                      view_follow() + 
                      geom_point()

View the results in the slide below ⬇️

gganimate results

animate(animated_chart, fps = 1, height = 320, width = 450)

We wrapped animated_chart in the function animate() to slow down and resize the GIF shown. We will stop here, but we encourage you to keep exploring the possibilities of ggplot2 in the full version of the tutorial. Move on to the next slide ➡️ to start making predictive models!

Predictive Models

Now we can start thinking about how we will make predictions into the future. We will take information/data observed in the past and use it to make forecasts about the future. Each row in the data already comes with the target variable, meaning what we want to predict. The column Target24HourClose is the only information in the dataset that refers to a point in the future. It is really important to make sure that all the data we are using to make predictions would actually be available at the time we want to perform a new prediction. If the model was trained on information only available once it was too late to act on it, we might run into target leakage, which may give us the impression the models are much better than they actually are. Click here to read more about this idea; target leakage is similar to having the answer sheet to an exam, which would not help on the next, different exam. We want the model to have a proper understanding of the problem rather than one answer sheet. Beware results that are too good to be true and always double check your work!

Overview

Therefore, we will:

1. Take data from the past, where all columns are associated with a specific point in time, with the exception of the Target24HourClose column, which represents the price of the cryptocurrency 24 hours in the future relative to the rest of the columns.

2. Choose a model, which will determine the approach the computer will take to draw statistical relationships between the columns.

3. Consider ways of reducing the risk of real-world results not matching the accuracy of the original model’s results.

4. Make forecasts about the future.

We will start with a simple linear regression model in the slide below ⬇️

Simple Linear Regression Model

Because we are looking to predict numeric values (rather than a 0/1 outcome), we could make a very straightforward model using the Linear Regression algorithm. We will not dive into the mathematical formula or the specifics of linear regression here; you can take a deeper look at linear regression here if you would like: https://uc-r.github.io/linear_regression

lm(data = cryptodata_train, formula = "Target24HourClose ~ .")
## 
## Call:
## lm(formula = "Target24HourClose ~ .", data = cryptodata_train)
## 
## Coefficients:
##     (Intercept)             Date             Open             High  
## -394.1820495913     0.1960631410    -0.5768230984     0.2023495714  
##             Low            Close       VolumeUSDT    CloseLag1Hour  
##    0.0667490386     0.8877501055    -0.0000001949     0.5406464200  
##  CloseLag12Hour   CloseLag24Hour     CloseLag3Day     CloseLag7Day  
##   -0.2150085726     0.0551978040     0.0336116669    -0.0488194241  
##   CloseLag14Day    CloseLag30Day    CloseLag90Day   CloseLag120Day  
##    0.0019227141    -0.0351710872     0.0190707762     0.0129766585  
##      Volatility         DateTime           spread  
##    0.7902979814    -0.0000020136               NA

“lm” model - Continued

In the previous slide, we were able to make a linear regression model by using the function lm() (R Core Team 2020a) and specifying the data to be used as cryptodata_train, and the formula as predicting the Target24HourClose and using (~) all other variables (represented by the period “.”). Here is the code again for reference:

lm(data = cryptodata_train, formula = "Target24HourClose ~ .")

If we wanted to implement a different model, there may be different structural requirements to the data used by the model, and we may have more parameters to adjust, so in some cases it may not be very straightforward to implement a different model, even though we intend to use the same dataset and the same prediction formula predicting the Target24HourClose column using all other columns. Instead of figuring out the implementation of different models, we can use a framework that has standardized the usage of many models using the caret package.

Reproducibility - Set seed

When making predictive models, some functions we will use next involve some degree of randomness. Computers are not truly random (we will not dive into this topic here), but essentially we can control the randomness by specifying a number on which it depends, and get the same “random” results by doing so. Without this step, the results on your computer might be slightly different from the ones shown here. We can make sure the results are the same (and reproducible) by running the base R function set.seed():

set.seed(123) # arbitrarily set to 123, can be any number

This ensures the “randomness” begins from the same starting point on your computer as it does in the example code being shown. Click here to find out more about this concept. You will see this before training models because there is a degree of “randomness” involved, and in order to get the exact same results you will need to set the same seed.

Caret Package

If we think back to the approach we took with ggplot, we identified which parts of the code would need to change; we specified the data and axis first, and we were then free to add points to the chart as lines or dots or any other “geom”. We will take a similar approach here for predictive modeling. As already pointed out earlier, if we made a different predictive model we would still be predicting the value of the Target24HourClose column, and we would still use the rest of the independent variables as predictors. What usually changes between predictive models, is the implementation and use of them, as they may have different structural requirements and parameters for their specific implementation.

Caret Package - Continued

The caret package (Kuhn 2020) is one of several solutions to resolve the problem described in the previous slide. It allows the user to specify the data, variables to use and everything else that would not change regardless of the model we want to use. As a separate parameter, we can specify which modeling framework to use (i.e. linear regression). Different models will have different parameters, and we will be able to adjust those parameters depending on the model, or leave those blank to use the sensible default values set by the creator of the package.

See this idea in action in the slide below ⬇️

Caret - Simple Linear Regression

set.seed(123)
lm_fit <- train(Target24HourClose ~ .,
                    data = cryptodata_train,
                    method = "lm")
# Show fit
lm_fit
## Linear Regression 
## 
## 4088 samples
##   18 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 4088, 4088, 4088, 4088, 4088, 4088, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE    
##   9.579822  0.9488271  6.47054
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

“Method” Options

You can find the entire list of models supported by caret, as well as the method value to enter to produce the specified model at the page below:

https://topepo.github.io/caret/available-models.html

The website from the previous slide is all you need in order to make any of the 238 (currently) available model choices. The columns circled and labeled as 1 and 2 are explained on the next slide below ⬇️

“Method” Options continued

  1. method Value: Two slides back/up we specified method = “lm” for a linear regression model. You can make any of the listed models by changing the method Value, which would be adaboost for the model highlighted in the screenshot above.
  2. Libraries: This tells us which packages need to be installed in the R session before the model is able to run. Before we could run the model on the first row of the screenshot, we would need to install the required package by running install.packages(“fastAdaboost”) (for the highlighted model) or else we would run into an error when training the model.

XGBoost

The XGBoost predictive modeling framework (Chen et al. 2020) has proven itself to be very powerful in recent years on a variety of different problems, so that’s the next model we will make. XGBoost stands for Extreme Gradient Boosting, which is a mathematical concept we won’t tackle in this tutorial, but you can learn more about the high-level idea and how this model uses it in the context of Decision Tree Ensembles here: https://xgboost.readthedocs.io/en/latest/tutorials/model.html

Move on to the next slide below where we keep all the same code, but change the method to make an XGBoost model.

XGBoost - Caret

We can use the same code we used when making the linear regression model, but specify the method to be xgbLinear instead of the lm value we used before:

set.seed(123)
xgb_fit <- train(Target24HourClose ~ . ,
                    data = cryptodata_train,
                    method = "xgbLinear")
# Show results
xgb_fit
## eXtreme Gradient Boosting 
## 
## 4088 samples
##   18 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 4088, 4088, 4088, 4088, 4088, 4088, ... 
## Resampling results across tuning parameters:
## 
##   lambda  alpha   nrounds  RMSE      Rsquared   MAE     
##   0.0000  0.0000   50      3.957696  0.9911113  2.265014
##   0.0000  0.0000  100      3.887280  0.9914165  2.190684
##   0.0000  0.0000  150      3.866661  0.9915047  2.168938
##   0.0000  0.0001   50      3.962023  0.9910903  2.267355
##   0.0000  0.0001  100      3.891323  0.9913940  2.192726
##   0.0000  0.0001  150      3.869327  0.9914894  2.170714
##   0.0000  0.1000   50      3.942446  0.9911900  2.261720
##   0.0000  0.1000  100      3.868587  0.9915095  2.184636
##   0.0000  0.1000  150      3.844850  0.9916094  2.159715
##   0.0001  0.0000   50      3.941018  0.9912003  2.266039
##   0.0001  0.0000  100      3.865918  0.9915218  2.189273
##   0.0001  0.0000  150      3.842734  0.9916213  2.165600
##   0.0001  0.0001   50      3.941597  0.9911974  2.266765
##   0.0001  0.0001  100      3.866168  0.9915228  2.189727
##   0.0001  0.0001  150      3.843776  0.9916187  2.165978
##   0.0001  0.1000   50      3.937007  0.9912182  2.263762
##   0.0001  0.1000  100      3.860721  0.9915460  2.185463
##   0.0001  0.1000  150      3.838270  0.9916409  2.161029
##   0.1000  0.0000   50      3.902000  0.9914220  2.294267
##   0.1000  0.0000  100      3.811490  0.9918085  2.200294
##   0.1000  0.0000  150      3.785479  0.9919181  2.173111
##   0.1000  0.0001   50      3.905799  0.9914065  2.298146
##   0.1000  0.0001  100      3.814888  0.9917955  2.204769
##   0.1000  0.0001  150      3.786906  0.9919132  2.175525
##   0.1000  0.1000   50      3.897304  0.9914341  2.291492
##   0.1000  0.1000  100      3.804273  0.9918304  2.200034
##   0.1000  0.1000  150      3.777444  0.9919416  2.172513
## 
## Tuning parameter 'eta' was held constant at a value of 0.3
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nrounds = 150, lambda = 0.1, alpha
##  = 0.1 and eta = 0.3.

You may have noticed that both outputs say Resampling: Bootstrapped (25 reps). This is because of a default value on an additional option of train(), go to the next slide below to learn more about additional options/functionality.

Caret Additional Options

Some models, like a Support Vector Machine model, need the data to be centered and scaled for each column; click here to learn why that is the case. We could make this change by adding the parameter preProc = c(“center”, “scale”) to the train() call:

set.seed(123)
train(Target24HourClose ~ . ,
          data = cryptodata_train,
          method = "svmPoly",
          preProc = c("center", "scale"))
## Support Vector Machines with Polynomial Kernel 
## 
## 4088 samples
##   18 predictor
## 
## Pre-processing: centered (18), scaled (18) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 4088, 4088, 4088, 4088, 4088, 4088, ... 
## Resampling results across tuning parameters:
## 
##   degree  scale  C     RMSE       Rsquared   MAE     
##   1       0.001  0.25  10.444329  0.9405996  6.813365
##   1       0.001  0.50  10.142601  0.9429776  6.569494
##   1       0.001  1.00   9.988989  0.9444432  6.464840
##   1       0.010  0.25   9.874144  0.9456493  6.406155
##   1       0.010  0.50   9.813655  0.9463046  6.387656
##   1       0.010  1.00   9.764008  0.9468404  6.372180
##   1       0.100  0.25   9.715306  0.9473626  6.355374
##   1       0.100  0.50   9.681753  0.9477219  6.343336
##   1       0.100  1.00   9.662114  0.9479313  6.337777
##   2       0.001  0.25  10.135227  0.9430523  6.563593
##   2       0.001  0.50   9.975888  0.9445868  6.457157
##   2       0.001  1.00   9.873690  0.9456716  6.405768
##   2       0.010  0.25   9.587601  0.9487712  6.297531
##   2       0.010  0.50   9.448230  0.9502552  6.229058
##   2       0.010  1.00   9.305788  0.9517446  6.155100
##   2       0.100  0.25   8.831106  0.9565532  5.979976
##   2       0.100  0.50   8.770508  0.9571195  5.960379
##   2       0.100  1.00   8.735269  0.9574485  5.942189
##   3       0.001  0.25  10.016425  0.9441904  6.483807
##   3       0.001  0.50   9.892814  0.9454808  6.416449
##   3       0.001  1.00   9.787677  0.9466014  6.369975
##   3       0.010  0.25   9.180576  0.9530110  6.017467
##   3       0.010  0.50   8.980345  0.9550364  5.877479
##   3       0.010  1.00   8.799147  0.9568208  5.750864
##   3       0.100  0.25   8.977224  0.9489107  4.852243
##   3       0.100  0.50  10.506763  0.9293688  4.815376
##   3       0.100  1.00  12.643457  0.8996649  4.824300
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were degree = 2, scale = 0.1 and C = 1.

The output that said Resampling: Bootstrapped (25 reps) was referring to a different option trControl of the train() function.

Cross Validation for timeseries

train_control <- trainControl(method = "timeslice",
                              initialWindow = 100*24,
                              horizon = 30*24,
                              fixedWindow = TRUE,
                              number = 5)

Source: https://topepo.github.io/caret/data-splitting.html#data-splitting-for-time-series

Cross Validation for timeseries

In the previous slide we set the timeseries cross-validation to take an initialWindow of 100 days (given by 24 hourly time points multiplied by 100 days) and a forecast horizon of 30 days, with fixedWindow = TRUE keeping the training window at a fixed size as it rolls forward. The parameter number = 5 specifies the process should take place on 5 separate occasions. Consult the image in the previous slide to understand how these parameters interact. By the end of this process, we end up with 5 separate training and testing datasets. If a model works well across 5 separate periods of time, we can have greater confidence things will go well when we apply the model to the real world as well, and that there are no mistakes in the process. Next, let’s run the XGBoost model with the specified cross-validation.

XGBoost with Time-Aware Cross Validation

set.seed(123)
xgb_fit <- train(Target24HourClose ~ . ,
                    data = cryptodata_train,
                    method = "xgbLinear",
                    trControl = train_control)
# Show fit
xgb_fit
## eXtreme Gradient Boosting 
## 
## 4088 samples
##   18 predictor
## 
## No pre-processing
## Resampling: Rolling Forecasting Origin Resampling (720 held-out with a fixed window) 
## Summary of sample sizes: 2400, 2400, 2400, 2400, 2400, 2400, ... 
## Resampling results across tuning parameters:
## 
##   lambda  alpha   nrounds  RMSE      Rsquared   MAE     
##   0.0000  0.0000   50      23.48164  0.2605845  19.05384
##   0.0000  0.0000  100      23.49777  0.2586981  19.07057
##   0.0000  0.0000  150      23.49961  0.2580315  19.07421
##   0.0000  0.0001   50      23.49308  0.2603506  19.05520
##   0.0000  0.0001  100      23.51915  0.2581942  19.07720
##   0.0000  0.0001  150      23.52521  0.2572911  19.08466
##   0.0000  0.1000   50      23.70605  0.2558033  19.19484
##   0.0000  0.1000  100      23.74174  0.2534766  19.21825
##   0.0000  0.1000  150      23.75052  0.2527491  19.22158
##   0.0001  0.0000   50      23.45349  0.2605976  19.01950
##   0.0001  0.0000  100      23.48332  0.2582353  19.04615
##   0.0001  0.0000  150      23.49469  0.2572132  19.05671
##   0.0001  0.0001   50      23.45368  0.2612314  19.03745
##   0.0001  0.0001  100      23.48690  0.2592752  19.06773
##   0.0001  0.0001  150      23.50452  0.2585624  19.08364
##   0.0001  0.1000   50      23.62649  0.2538029  19.10492
##   0.0001  0.1000  100      23.64552  0.2516789  19.11381
##   0.0001  0.1000  150      23.65968  0.2509828  19.12142
##   0.1000  0.0000   50      23.50952  0.2459705  18.52896
##   0.1000  0.0000  100      23.51144  0.2438362  18.53459
##   0.1000  0.0000  150      23.51451  0.2429365  18.53663
##   0.1000  0.0001   50      23.48793  0.2476829  18.51885
##   0.1000  0.0001  100      23.50035  0.2454878  18.53253
##   0.1000  0.0001  150      23.49778  0.2446405  18.53131
##   0.1000  0.1000   50      23.37062  0.2508630  18.43669
##   0.1000  0.1000  100      23.38575  0.2483836  18.45379
##   0.1000  0.1000  150      23.38696  0.2475664  18.45489
## 
## Tuning parameter 'eta' was held constant at a value of 0.3
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nrounds = 50, lambda = 0.1, alpha
##  = 0.1 and eta = 0.3.

Hyperparameter tuning

We will not discuss this at length here, but because we are able to assess the accuracy of the models we make by creating a test set with data that the model has never seen before, we could keep making slight changes to the parameters used by the models. What is the ideal number of trees for the XGBoost model to have? It might be the default value specified by the caret package, but an alternative approach would be to specify a range we find sensible, and have the computer iterate over that range to find the optimal solution. You can read more about hyperparameter tuning in caret here: https://topepo.github.io/caret/model-training-and-tuning.html
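If we wanted to control the search ourselves, a sketch of supplying our own grid of candidate values to the xgbLinear model might look like this (the specific values below are arbitrary illustrations, not recommendations):

set.seed(123)
# candidate values for the tuning parameters caret reports for xgbLinear
xgb_grid <- expand.grid(nrounds = c(50, 100, 150),
                        lambda  = c(0, 0.1),
                        alpha   = c(0, 0.1),
                        eta     = 0.3)
# tuneGrid tells train() to evaluate exactly these combinations
train(Target24HourClose ~ .,
      data = cryptodata_train,
      method = "xgbLinear",
      trControl = train_control,
      tuneGrid = xgb_grid)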

Hyperparameter tuning - Cont.

By default, caret performs hyperparameter tuning; we can view the results by accessing the element of the fitted XGBoost model called bestTune. To do this, we start with the xgb_fit object followed by a dollar sign and the name bestTune to indicate we want to return the table with the best values found for the different parameters:

xgb_fit$bestTune
##    nrounds lambda alpha eta
## 25      50    0.1   0.1 0.3

Make Predictions on New Data

We can use the predict() function to make predictions, supplying the model to use as xgb_fit and making predictions on the dataset cryptodata_test:

xgb_pred <- predict(xgb_fit, newdata = cryptodata_test)

Evaluate Model Performance

Because we know what actually happened for the target variable in the test data we used in the previous step, we can get a good idea of how well the model performed on a dataset it has never seen before. We do this to avoid overfitting, which is the idea that the model may work really well on the training data we provided, but not on the new data that we want to make predictions on. If the performance on the test set is good, that is a good sign. If the data is split into several subsets and each subset has consistent results for the training and test datasets, that is an even better sign the model may perform as expected. Read more about measuring performance here: https://topepo.github.io/caret/measuring-performance.html

postResample(pred = xgb_pred, obs = cryptodata_test$Close)
##       RMSE   Rsquared        MAE 
## 32.6041938  0.4466543 19.1595556

Error Metrics

In the previous slide, by using the postResample() function we returned three metrics:

1. MAE: The Mean Absolute Error measures the average error between the predictions and what actually happened as an absolute value (MAE can never be less than 0).

2. RMSE: The Root Mean Squared Error takes the square root of the average squared errors and behaves slightly differently than MAE depending on the magnitude of the errors, but will always be larger than or equal to MAE. You can learn more about the differences between the two here: https://medium.com/human-in-a-machine-world/mae-and-rmse-which-metric-is-better-e60ac3bde13d

3. Rsquared: The R Squared metric gives us a general indication of how informative the independent variables are in relation to the dependent variable (Target24HourClose), where a perfect score of 1.0 would suggest the price movements of the dependent variable are completely explained by the independent variables and a score of 0 would suggest the opposite. This metric should be taken with a grain of salt, but taken in conjunction with the RMSE or MAE error metrics it can be informative. Learn more about R Squared here: https://www.investopedia.com/terms/r/r-squared.asp
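As a sanity check, MAE and RMSE can also be computed by hand from the predictions and the observed values passed to postResample() in the previous slide:

# difference between each prediction and the corresponding observed value
errors <- xgb_pred - cryptodata_test$Close
mean(abs(errors))    # MAE: the average of the absolute errors
sqrt(mean(errors^2)) # RMSE: the square root of the average squared error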

In the full-version of the tutorial, we will compare the scores for the RMSE and \(R^2\) metrics across many cryptocurrencies and models over time.

Timeseries Data Prep

In the previous section, we performed the cross-validation relative to the date and time that the data was collected, because we would otherwise have risked overfitting our models. Dealing with data that forms a sequence in time also has an upside, however. In this section, we will learn to extrapolate seasonal patterns from the data to make forecasts about the future, and we will see the greater flexibility this provides relative to the supervised machine learning models we made in the previous section.

Much of the content for this section was created referencing this talk at the 2019 RStudio conference: https://rstudio.com/resources/rstudioconf-2019/melt-the-clock-tidy-time-series-analysis/

The “tsibble” data structure

Up to this point, when we referred to tabular data, we were working with a data structure called a tibble. Hadley Wickham is the creator of packages we have gone over like dplyr, ggplot2 and many others, including the tibble package. The default way for R to read tabular data is as a dataframe, which can be converted to a tibble and vice versa. The main difference between the two is that a tibble will show the results in a much neater fashion when there are many columns and/or rows. Next, we will convert the data from being a tibble to a tsibble (notice the extra s) (Wang, Cook, and Hyndman 2020), which will be recognized as a time series dataset collected over an even time interval, and will have many useful properties for the time series analysis we will perform next.

Make a “tsibble”

We can create a tsibble object using the as_tsibble() function, and using the DateTime field as the index of the tsibble:

cryptodata_ts <- as_tsibble(cryptodata, index = DateTime)

The index represents the date/time field of when the data was collected. If we were dealing with multiple cryptocurrencies, we would also specify the key parameter as being the name of the cryptocurrency, because a tsibble needs to have a unique combination of the index and key. In other words, if the time series data was collected once hourly, we would only expect one row every hour, and we cannot have duplicates; that is why, if we are tracking more than one series at a time, we would also have to specify the group the rows belong to using the key parameter. Here we only have one series, so we don’t need a key.
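For example, with a hypothetical dataset containing several cryptocurrencies (multi_cryptodata is not an object we have created), the call might look like this:

# index gives the time column, key identifies which series each row belongs to
multi_cryptodata_ts <- as_tsibble(multi_cryptodata, index = DateTime, key = Symbol)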

tsibble preview

If we print the new object that we created in the previous slide, cryptodata_ts, the top of the output will now show the text "A tsibble: 5,111 x 19 [1h] <UTC>", which specifies the dimensions (rows x columns), the 1 hour interval between rows, and the UTC timezone:

cryptodata_ts
## # A tsibble: 5,111 x 19 [1h] <UTC>
##    Date        Open  High   Low Close VolumeUSDT CloseLag1Hour CloseLag12Hour
##    <date>     <dbl> <dbl> <dbl> <dbl>      <dbl>         <dbl>          <dbl>
##  1 2020-01-01  129.  131.  129.  131.   1446168.          132.           132.
##  2 2020-01-01  131.  131.  130.  131.    980579.          131.           132.
##  3 2020-01-01  131.  131.  130.  130.    611280.          131.           130.
##  4 2020-01-01  130.  131.  130.  130.    436381.          130.           130.
##  5 2020-01-01  130.  130.  130.  130.    546888.          130.           129.
##  6 2020-01-01  130.  131.  130.  130.    474847.          130.           129.
##  7 2020-01-01  130.  131.  130.  130.    536083.          130.           128.
##  8 2020-01-01  130.  130.  130.  130.    955261.          130.           128.
##  9 2020-01-01  130.  131.  130.  130.    580141.          130.           129.
## 10 2020-01-01  130.  131.  130.  131.    577170.          130.           129.
## # ... with 5,101 more rows, and 11 more variables: CloseLag24Hour <dbl>,
## #   CloseLag3Day <dbl>, CloseLag7Day <dbl>, CloseLag14Day <dbl>,
## #   CloseLag30Day <dbl>, CloseLag90Day <dbl>, CloseLag120Day <dbl>,
## #   Volatility <dbl>, Target24HourClose <dbl>, DateTime <dttm>, spread <dbl>

Dealing with gaps

Almost all timeseries models will require a complete dataset with no missing gaps for the index we specified. We can check whether our data has any gaps using the function has_gaps():

has_gaps(cryptodata_ts)
## # A tibble: 1 x 1
##   .gaps
##   <lgl>
## 1 TRUE

Dealing with gaps - Continued

We can find the specific gaps (if there are any) using the scan_gaps() function:

scan_gaps(cryptodata_ts)
## # A tsibble: 1 x 1 [1h] <UTC>
##   DateTime           
##   <dttm>             
## 1 2020-04-19 10:00:00

And we can turn these gaps into explicit missing values that the time series tools will recognize as null with fill_gaps():

cryptodata_ts <- fill_gaps(cryptodata_ts)

fill gaps recap

When we created the tsibble, we specified an index, which allowed the function to figure out that our data is in 1 hour intervals. It was also able to figure out the range of the data. Based on that information, it is able to detect that we have 1 gap in our timeseries dataset. We were able to add this missing row of data using fill_gaps(), which explicitly defines the row as missing and avoids issues moving forward. Printing the new overwritten version of cryptodata_ts, the dimensions have changed to include 1 additional row (5,112):

cryptodata_ts
## # A tsibble: 5,112 x 19 [1h] <UTC>
##    Date        Open  High   Low Close VolumeUSDT CloseLag1Hour CloseLag12Hour
##    <date>     <dbl> <dbl> <dbl> <dbl>      <dbl>         <dbl>          <dbl>
##  1 2020-01-01  129.  131.  129.  131.   1446168.          132.           132.
##  2 2020-01-01  131.  131.  130.  131.    980579.          131.           132.
##  3 2020-01-01  131.  131.  130.  130.    611280.          131.           130.
##  4 2020-01-01  130.  131.  130.  130.    436381.          130.           130.
##  5 2020-01-01  130.  130.  130.  130.    546888.          130.           129.
##  6 2020-01-01  130.  131.  130.  130.    474847.          130.           129.
##  7 2020-01-01  130.  131.  130.  130.    536083.          130.           128.
##  8 2020-01-01  130.  130.  130.  130.    955261.          130.           128.
##  9 2020-01-01  130.  131.  130.  130.    580141.          130.           129.
## 10 2020-01-01  130.  131.  130.  131.    577170.          130.           129.
## # ... with 5,102 more rows, and 11 more variables: CloseLag24Hour <dbl>,
## #   CloseLag3Day <dbl>, CloseLag7Day <dbl>, CloseLag14Day <dbl>,
## #   CloseLag30Day <dbl>, CloseLag90Day <dbl>, CloseLag120Day <dbl>,
## #   Volatility <dbl>, Target24HourClose <dbl>, DateTime <dttm>, spread <dbl>

Train/Test Split

As a last step, let’s again split the data into a train and test dataset. We will build the models on the earlier 80% of the data, and assess how well the models perform on the last 20% of the data:

# Take first 80% of rows to make train data
ts_train <- head(cryptodata_ts, as.integer(nrow(cryptodata)*.8))
# Use the other 20% for the test data
ts_test <- tail(cryptodata_ts, as.integer(nrow(cryptodata)*.2))

As was already discussed, this is an example to illustrate the idea behind a train/test split, which assesses the performance of the models on data they have not seen before (trained on the "train" data, assessed on the "test" data); in practice, a more sophisticated implementation using cross-validation is much more effective.
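
As a rough sketch of what that could look like with these tools (the fold sizes below are illustrative assumptions, and this is not part of the reproducible results), tsibble's stretch_tsibble() creates growing training windows for time series cross-validation; the modeling functions used here are introduced in the next section:

# Growing windows: start with ~90 days of hourly data, then add one week per fold
ts_cv <- stretch_tsibble(ts_train, .init = 24 * 90, .step = 24 * 7)
# Fit a simple model per fold and assess its 24-hour-ahead forecasts against the full data
cv_forecasts <- forecast(model(ts_cv, naive = NAIVE(Close)), h = 24)
accuracy(cv_forecasts, cryptodata_ts)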

Timeseries Predictive Modeling

Next, we will use the fable package (O’Hara-Wild, Hyndman, and Wang 2020) to create timeseries models. Why is it called fable?

  1. It makes forecasting tables

  2. “A fable is like a forecast: it’s never true, but it tells you something important about reality” - Rob J Hyndman New York Open Statistical Programming Meetup

Building a model

All the models made from this point on are made using the fable package. The simplest type of timeseries model we can make is a Naive model, which simply carries values from the past forward to forecast the future. Here we make one using the value from the previous hour to predict the future closing price:

set.seed(123)
naive_model <- model(ts_train, naive = NAIVE(Target24HourClose))
# Show model
naive_model
## # A mable: 1 x 1
##     naive
##   <model>
## 1 <NAIVE>

The “mable” data structure

In the previous slide you may have noticed the output produced a mable, which stands for model table. Just like we saw earlier with the tibble and tsibble data structures, a mable has properties which will help evaluate the efficacy of the models later on.

Naive Model Rationale

A Naive model like this is usually not very useful for forecasting, but it gives us a good baseline to compare against: if we made a model that performed worse than a Naive model, chances are it is not very informative. We will also add a very similar model called a seasonal naive model, which we will call snaive. In a later step we will discuss the difference between looking at the trend of the previous values vs. the trend from the previous season.

Show Naive Model Predictions

We can use the augment() function to compare, for each row, the actual value, the prediction (.fitted in the output below), and the difference between the two, which corresponds to the error of the prediction (.resid):

augment(naive_model)
## # A tsibble: 4,088 x 5 [1h] <UTC>
## # Key:       .model [1]
##    .model DateTime            Target24HourClose .fitted  .resid
##    <chr>  <dttm>                          <dbl>   <dbl>   <dbl>
##  1 naive  2020-01-01 01:00:00              130.     NA  NA     
##  2 naive  2020-01-01 02:00:00              130.    130. -0.43  
##  3 naive  2020-01-01 03:00:00              129.    130. -0.62  
##  4 naive  2020-01-01 04:00:00              130.    129.  0.45  
##  5 naive  2020-01-01 05:00:00              130.    130.  0.0800
##  6 naive  2020-01-01 06:00:00              129.    130. -0.21  
##  7 naive  2020-01-01 07:00:00              129.    129. -0.160 
##  8 naive  2020-01-01 08:00:00              130.    129.  0.95  
##  9 naive  2020-01-01 09:00:00              130.    130. -0.23  
## 10 naive  2020-01-01 10:00:00              130.    130. -0.260 
## # ... with 4,078 more rows

A note on the target variable

Previously we defined the target variable as the closing price 24 hours in the future relative to the rest of the variables; this time we are using tools that "understand" time, so after making a model we can forecast however far into the future we would like. We will therefore treat all columns as referencing the same point in time, pick the one we are interested in predicting (the Close price), and make a forecast that goes as far out as we need. In other words, we switch to the Close price as our target variable, and we must be careful not to include the variable called Target24HourClose as a predictor, or we would run into target leakage and end up with a meaningless model.
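
As a small optional safeguard (the following line is hypothetical and not part of the original code), we could drop the old target column from the training data so it cannot sneak in as a predictor:

# Hypothetical precaution: remove the 24-hour-ahead target before modeling
ts_train_noleak <- select(ts_train, -Target24HourClose)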

Make multiple models

Let’s create a new object like we did for the naive model earlier, but this time including a naive model, a snaive model and an ARIMA model (which will be explained next) at the same time:

set.seed(123)
ts_model <- model(ts_train, naive = NAIVE(Close),
                            snaive = SNAIVE(Close),
                            arima = ARIMA(Close))
ts_model$arima[[1]]
## $fit
## $par
## # A tibble: 2 x 5
##   term  estimate std.error statistic   p.value
##   <chr>    <dbl>     <dbl>     <dbl>     <dbl>
## 1 ar1     -0.482    0.0153     -31.5 4.61e-195
## 2 ar2     -0.205    0.0153     -13.4 5.35e- 40
## 
## $est
## # A tibble: 4,088 x 3
##    .fitted  .resid .regression_resid
##      <dbl>   <dbl>             <dbl>
##  1    131.  0.131               131.
##  2    131.  0.188               131.
##  3    131. -0.554               130.
##  4    130. -0.270               130.
##  5    130. -0.0332              130.
##  6    130.  0.188               130.
##  7    130. -0.112               130.
##  8    130.  0.0522              130.
##  9    130. -0.173               130.
## 10    130.  0.463               131.
## # ... with 4,078 more rows
## 
## $fit
## # A tibble: 1 x 7
##   sigma2 log_lik    AIC   AICc    BIC ar_roots  ma_roots 
##    <dbl>   <dbl>  <dbl>  <dbl>  <dbl> <list>    <list>   
## 1   9.14 -10319. 20643. 20643. 20662. <cpl [2]> <cpl [0]>
## 
## $spec
## # A tibble: 1 x 8
##       p     d     q     P     D     Q constant period
##   <int> <int> <int> <int> <int> <int> <lgl>     <dbl>
## 1     2     1     0     0     0     0 FALSE        24
## 
## $model
## 
## Call:
## .f(x = ..1, order = ..2, seasonal = ..3, xreg = ..4, include.mean = FALSE, fixed = ..7, 
##     method = ..5)
## 
## Coefficients:
##           ar1      ar2
##       -0.4822  -0.2049
## s.e.   0.0153   0.0153
## 
## sigma^2 estimated as 9.143:  log likelihood = -10318.74,  aic = 20643.47
## 
## attr(,"class")
## [1] "ARIMA"
## 
## $model
## <ARIMA model definition>
## 
## $data
## # A tsibble: 4,088 x 2 [1h] <UTC>
##    DateTime            Close
##  * <dttm>              <dbl>
##  1 2020-01-01 01:00:00  131.
##  2 2020-01-01 02:00:00  131.
##  3 2020-01-01 03:00:00  130.
##  4 2020-01-01 04:00:00  130.
##  5 2020-01-01 05:00:00  130.
##  6 2020-01-01 06:00:00  130.
##  7 2020-01-01 07:00:00  130.
##  8 2020-01-01 08:00:00  130.
##  9 2020-01-01 09:00:00  130.
## 10 2020-01-01 10:00:00  131.
## # ... with 4,078 more rows
## 
## $response
## $response[[1]]
## Close
## 
## 
## $transformation
## $transformation[[1]]
## Transformation: .x
## Backtransformation: .x
## 
## attr(,"class")
## [1] "mdl_ts"

ARIMA

ARIMA stands for AutoRegressive Integrated Moving Average. In the output from before, you can see coefficient terms with the prefix ar (this particular model has no ma terms). Terms that start with ar refer to the Auto Regressive components, while terms that start with ma refer to Moving Averages. The Integrated part of ARIMA is meant to help eliminate trends in the time series data to make it more stationary (click here for an article explaining this idea in more detail).

From https://people.duke.edu/~rnau/411arim.htm:

A nonseasonal ARIMA model is classified as an “ARIMA(p,d,q)” model, where:

  • p is the number of auto-regressive terms,

  • d is the number of nonseasonal differences needed for stationarity, and

  • q is the number of lagged forecast errors in the prediction equation.

ARIMA - Fable Documentation

(Image from the fable ARIMA() documentation: the non-seasonal components.)

ARIMA - Fable Documentation

(Image from the fable ARIMA() documentation: the seasonal components.)

ARIMA Terms

The naive models doesn’t have any terms, but ARIMA does and we can view them using the function tidy():

tidy(ts_model)
## # A tibble: 2 x 6
##   .model term  estimate std.error statistic   p.value
##   <chr>  <chr>    <dbl>     <dbl>     <dbl>     <dbl>
## 1 arima  ar1     -0.482    0.0153     -31.5 4.61e-195
## 2 arima  ar2     -0.205    0.0153     -13.4 5.35e- 40

In the term column, we can see there are two auto-regressive terms (ar1 and ar2), which correspond to the p term from the documentation shown.

ARIMA Terms - Continued

In the previous slide, the output showed two auto-regressive terms. We can see how they map to the full model specification by printing the ts_model object:

ts_model
## # A mable: 1 x 3
##     naive   snaive          arima
##   <model>  <model>        <model>
## 1 <NAIVE> <SNAIVE> <ARIMA(2,1,0)>

The ARIMA model has terms as described earlier for (p, d, q)(P, D, Q), shown above as (2,1,0) with no seasonal component selected. This summarizes the ARIMA model; go to the next slide to get a better understanding of what this tells us.

ARIMA Terms - Continued

  1. The ARIMA() function we used automatically picked a value for p of 2, meaning there are two non-seasonal auto-regressive terms (the ar1 and ar2 terms seen two slides back).

  2. The d term was set to 1, meaning the data was differenced once to try to achieve stationarity (a note on that in two slides).

  3. The q term and the seasonal P, D, and Q terms were all set to 0, meaning no moving-average or seasonal terms were included.

ARIMA Terms - Simplified

ARIMA will add new predictors for the specified terms p, P and q, Q, which could also have been set manually (a sketch of that syntax follows below). If the parameter p were equal to three, for example, the model would calculate and leverage three new columns, each holding the value of the Close column one, two, and three rows/hours in the past. The difference with the P parameter is that the lagged values look back one full seasonal period at a time; with the 24-hour period detected here, that means taking the value from the same hour on the previous day rather than from the previous hour. The q term, on the other hand, adds terms based on the model's past forecast errors (the MA of ARIMA, the "lagged forecast errors" from the definition above) over a specified number of hours in the past (seasonal periods when using Q instead of q).
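
For instance, a minimal sketch of fixing the same orders manually in fable (the orders shown are just an illustration of the syntax, not a recommendation):

# Fix (p,d,q)(P,D,Q) manually instead of letting ARIMA() search for them
manual_model <- model(ts_train, arima_manual = ARIMA(Close ~ pdq(2, 1, 0) + PDQ(0, 0, 0)))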

A note on stationarity

Please note that ARIMA works best on stationary data, meaning a time series whose mean and variance are constant over time (which this data is not). The I in ARIMA refers to the d, D terms, which determine how much differencing to apply (if any) to make the data stationary. If your data is clearly non-stationary, an ETS model may be more appropriate; we will not cover it here, but you can learn more about ETS models here: https://fable.tidyverts.org/reference/ETS.html

You could also build both kinds of models and select the one with the better cross-validation results, or test the stationarity of your data directly; you can find an example of that here: https://rpubs.com/richkt/269797
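
As a quick sketch of one such stationarity check (assuming the feasts package, which is loaded in the session shown at the end of this presentation), we could compute unit-root features on the training data:

# KPSS unit-root statistic and suggested number of differences for the Close price
features(ts_train, Close, list(unitroot_kpss, unitroot_ndiffs))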

Forecast

We can use the forecast() function on the ts_model object we created earlier to generate forecasts over a given timeframe for every model it contains:

ts_forecasts <- forecast(ts_model, h = '30 days')

The new object contains hourly forecasts over a 30 day period for all three models:

## # A fable: 2,160 x 4 [1h] <UTC>
## # Key:     .model [3]
##    .model DateTime                  Close .mean
##    <chr>  <dttm>                   <dist> <dbl>
##  1 naive  2020-06-19 09:00:00  N(229, 11)  229.
##  2 naive  2020-06-19 10:00:00  N(229, 23)  229.
##  3 naive  2020-06-19 11:00:00  N(229, 34)  229.
##  4 naive  2020-06-19 12:00:00  N(229, 45)  229.
##  5 naive  2020-06-19 13:00:00  N(229, 57)  229.
##  6 naive  2020-06-19 14:00:00  N(229, 68)  229.
##  7 naive  2020-06-19 15:00:00  N(229, 80)  229.
##  8 naive  2020-06-19 16:00:00  N(229, 91)  229.
##  9 naive  2020-06-19 17:00:00 N(229, 102)  229.
## 10 naive  2020-06-19 18:00:00 N(229, 114)  229.
## # ... with 2,150 more rows

Forecast - Continued

The new object ts_forecasts gives us a forecast distribution for each hour (here a normal distribution): rather than a single number with error metrics, each prediction is a range of values, and the further into the future we go, the less confident we become. Here, we extract the 70% and 95% confidence intervals of the forecasts:

hilo(ts_forecasts, level = c(70, 95))

See the output in the slide below. Learn more about confidence intervals: https://online.stat.psu.edu/stat200/lesson/4/4.2

Forecast - Continued

## # A tsibble: 2,160 x 4 [1h] <UTC>
## # Key:       .model [3]
##                     `70%`                  `95%` DateTime            .model
##                    <hilo>                 <hilo> <dttm>              <chr> 
##  1 [225.3164, 232.3036]70 [222.2034, 235.4166]95 2020-06-19 09:00:00 naive 
##  2 [223.8693, 233.7507]70 [219.4668, 238.1532]95 2020-06-19 10:00:00 naive 
##  3 [222.7589, 234.8611]70 [217.3670, 240.2530]95 2020-06-19 11:00:00 naive 
##  4 [221.8228, 235.7972]70 [215.5967, 242.0233]95 2020-06-19 12:00:00 naive 
##  5 [220.9981, 236.6219]70 [214.0371, 243.5829]95 2020-06-19 13:00:00 naive 
##  6 [220.2525, 237.3675]70 [212.6271, 244.9929]95 2020-06-19 14:00:00 naive 
##  7 [219.5668, 238.0532]70 [211.3305, 246.2895]95 2020-06-19 15:00:00 naive 
##  8 [218.9286, 238.6914]70 [210.1236, 247.4964]95 2020-06-19 16:00:00 naive 
##  9 [218.3292, 239.2908]70 [208.9901, 248.6299]95 2020-06-19 17:00:00 naive 
## 10 [217.7623, 239.8577]70 [207.9180, 249.7020]95 2020-06-19 18:00:00 naive 
## # ... with 2,150 more rows

You could read the first row of the output as: we are 70% confident the price at 2020-06-19 09:00:00 will be between roughly $225 and $232, and we are 95% confident it will be between roughly $222 and $235, based on the data we have observed in the past. For predictions farther out into the future, the range of values gets wider because it becomes harder to be confident.

Plot Forecasts

We can plot the forecast and the confidence intervals:

autoplot(ts_forecasts)

Plot Forecasts - Continued

In the previous plot, the intervals for the naive model cover up the results for the arima model. Because the confidence intervals are hard to compare visually, we can instead use what we learned earlier in the visualization section of this tutorial to plot just the mean of each forecast distribution:

ggplot(ts_forecasts, aes(x=DateTime, y=.mean, color=.model)) + geom_line()

View Forecasts

Anytime we make a chart using ggplot(), we can wrap the result in the ggplotly() function from the plotly package. We won't show the code for the interactive version here, to leave more room to interact with the chart below (try hovering over it with your mouse):
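
For reference, a minimal sketch of what that wrapping looks like (assuming the plotly package is loaded):

# Wrap the ggplot object from the previous slide in ggplotly() to make it interactive
p <- ggplot(ts_forecasts, aes(x = DateTime, y = .mean, color = .model)) + geom_line()
ggplotly(p)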

Forecast - Recap

1. We can now choose how far into the future to forecast. With the models we made with caret, we defined the target variable as the price 24 hours into the future and built a model to predict only that horizon. If we were interested in any other forecast, we would need to adjust the target variable in the data and create a new model, making sure only that one variable is related to future values.

2. The resulting forecast is a distribution of values; we can take the mid-point of the distribution as the most likely scenario, but we could also explore less likely scenarios by looking at the forecast at more extreme confidence intervals.

Accuracy on Test set

accuracy(ts_forecasts, ts_test)
## Warning: The future dataset is incomplete, incomplete out-of-sample data will be treated as missing. 
## 2 observations are missing between 2020-06-19 09:00:00 and 2020-06-19 10:00:00
## # A tibble: 3 x 9
##   .model .type    ME  RMSE   MAE   MPE  MAPE  MASE  ACF1
##   <chr>  <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 arima  Test   5.13  8.37  6.70  2.12  2.81   NaN 0.967
## 2 naive  Test   4.99  8.29  6.63  2.06  2.79   NaN 0.967
## 3 snaive Test   3.19  7.61  6.27  1.28  2.65   NaN 0.943

Refer to this previous slide for more information on the error metrics. Note that MASE is shown as NaN here; that metric is scaled by in-sample (training) errors, so accuracy() most likely needs to be given the complete dataset (training and test) to compute it.
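
As a hedged sketch (not part of the original code), passing the full tsibble rather than just the test split should let accuracy() locate the training period:

# Assumption: cryptodata_ts contains both the training and test periods
accuracy(ts_forecasts, cryptodata_ts)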

Next steps

You now have all the building-blocks you need to tackle the full-version of the tutorial!


https://cryptocurrencyresearch.org/


Click on the link above to use these tools to run a similar analysis on the latest cryptocurrency data (updated every 12 hours) for many cryptocurrencies at once.

References


References start on the slide below ⬇️ and appear in the same order in which they are referenced throughout this presentation.

R Package - pacman

Rinker, T. W. & Kurkiewicz, D. (2017). pacman: Package Management for R. version 0.5.0. Buffalo, New York. http://github.com/trinker/pacman


http://trinker.github.io/pacman/

R Package - pins


Javier Luraschi (2020). pins: Pin, Discover and Share Resources. R package version 0.4.0. https://CRAN.R-project.org/package=pins


https://pins.rstudio.com/

Illustrations by Allison Horst


https://github.com/allisonhorst/stats-illustrations

R Package - dplyr


Hadley Wickham, Romain François, Lionel Henry and Kirill Müller (2020). dplyr: A Grammar of Data Manipulation. R package version 1.0.2. https://CRAN.R-project.org/package=dplyr


https://dplyr.tidyverse.org

R Package - tidyverse


Wickham et al., (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686, https://doi.org/10.21105/joss.01686


https://tidyverse.org

R Package - ggplot2


H. Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2016.


https://ggplot2.tidyverse.org

R Package - ggthemes


Jeffrey B. Arnold (2019). ggthemes: Extra Themes, Scales and Geoms for ‘ggplot2’. R package version 4.2.0. https://CRAN.R-project.org/package=ggthemes


https://yutannihilation.github.io/allYourFigureAreBelongToUs/ggthemes/

R Package - gganimate


Thomas Lin Pedersen and David Robinson (2020). gganimate: A Grammar of Animated Graphics. R package version 1.0.5. https://CRAN.R-project.org/package=gganimate


https://gganimate.com/index.html

R Package - caret


Max Kuhn (2020). caret: Classification and Regression Training. R package version 6.0-86. https://CRAN.R-project.org/package=caret


http://topepo.github.io/caret/

R Package - xgboost


Tianqi Chen, Tong He, Michael Benesty, Vadim Khotilovich, Yuan Tang, Hyunsu Cho, Kailong Chen, Rory Mitchell, Ignacio Cano, Tianyi Zhou, Mu Li, Junyuan Xie, Min Lin, Yifeng Geng and Yutian Li (2020). xgboost: Extreme Gradient Boosting. R package version 1.0.0.2. https://CRAN.R-project.org/package=xgboost


https://xgboost.readthedocs.io/en/latest/#

R Package - tsibble


Wang, E, D Cook, and RJ Hyndman (2020). A new tidy data structure to support exploration and modeling of temporal data. Journal of Computational and Graphical Statistics. https://doi.org/10.1080/10618600.2019.1695624.


https://tsibble.tidyverts.org/

R Package - fable


Mitchell O’Hara-Wild, Rob Hyndman and Earo Wang (2020). fable: Forecasting Models for Tidy Time Series. R package version 0.2.1. https://CRAN.R-project.org/package=fable


https://fable.tidyverts.org/

Timeseries

R Markdown

This document was produced using R Markdown.


JJ Allaire and Yihui Xie and Jonathan McPherson and Javier Luraschi and Kevin Ushey and Aron Atkins and Hadley Wickham and Joe Cheng and Winston Chang and Richard Iannone (2020). rmarkdown: Dynamic Documents for R. R package version 2.1. URL https://rmarkdown.rstudio.com.

Yihui Xie and J.J. Allaire and Garrett Grolemund (2018). R Markdown: The Definitive Guide. Chapman and Hall/CRC. ISBN 9781138359338. URL https://bookdown.org/yihui/rmarkdown.

Reveal Js

This presentation was created using the R Markdown implementation of the Reveal JS library.


Hakim El Hattab and JJ Allaire (2017). revealjs: R Markdown Format for ‘reveal.js’ Presentations. R package version 0.9. https://CRAN.R-project.org/package=revealjs


https://revealjs.com/

Base R


R Core Team (2020b). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.


https://cran.r-project.org/

Session Information

Here are the details relating to the R session that last ran this tutorial:

sessionInfo()
## R version 3.6.3 (2020-02-29)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 18363)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] parallel  stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] doParallel_1.0.15 iterators_1.0.12  foreach_1.5.0     plotly_4.9.2.1   
##  [5] urca_1.3-0        feasts_0.1.4      fable_0.2.1       fabletools_0.2.0 
##  [9] tsibble_0.9.2     kernlab_0.9-29    xgboost_1.0.0.2   caret_6.0-86     
## [13] lattice_0.20-38   gganimate_1.0.5   ggthemes_4.2.0    ggplot2_3.3.2    
## [17] dplyr_1.0.2       pins_0.4.0        pacman_0.5.1      DT_0.15          
## [21] tictoc_1.0        knitr_1.28        bookdown_0.18    
## 
## loaded via a namespace (and not attached):
##  [1] colorspace_1.4-1     ellipsis_0.3.1       class_7.3-15        
##  [4] revealjs_0.9         rstudioapi_0.11      farver_2.0.3        
##  [7] dials_0.0.9          prodlim_2019.11.13   fansi_0.4.1         
## [10] lubridate_1.7.8      codetools_0.2-16     splines_3.6.3       
## [13] jsonlite_1.7.1       workflows_0.2.1      pROC_1.16.2         
## [16] anytime_0.3.7        compiler_3.6.3       httr_1.4.2          
## [19] backports_1.1.6      assertthat_0.2.1     Matrix_1.2-18       
## [22] lazyeval_0.2.2       cli_2.0.2            tweenr_1.0.1        
## [25] htmltools_0.5.0      prettyunits_1.1.1    tools_3.6.3         
## [28] gtable_0.3.0         glue_1.4.1           reshape2_1.4.3      
## [31] rappdirs_0.3.1       Rcpp_1.0.4.6         DiceDesign_1.8-1    
## [34] vctrs_0.3.2          nlme_3.1-144         progressr_0.6.0     
## [37] transformr_0.1.3     crosstalk_1.1.0.1    parsnip_0.1.3       
## [40] timeDate_3043.102    gower_0.2.1          xfun_0.13           
## [43] stringr_1.4.0        lpSolve_5.6.15       lifecycle_0.2.0     
## [46] MASS_7.3-51.5        scales_1.1.1         ipred_0.9-9         
## [49] hms_0.5.3            yaml_2.2.1           curl_4.3            
## [52] rpart_4.1-15         stringi_1.4.6        highr_0.8           
## [55] e1071_1.7-3          gifski_0.8.6         lhs_1.0.1           
## [58] filelock_1.0.2       lava_1.6.7           rlang_0.4.7         
## [61] pkgconfig_2.0.3      distributional_0.2.0 evaluate_0.14       
## [64] purrr_0.3.4          sf_0.9-6             recipes_0.1.14      
## [67] htmlwidgets_1.5.1    labeling_0.3         tidyselect_1.1.0    
## [70] plyr_1.8.6           magrittr_1.5         R6_2.4.1            
## [73] generics_0.0.2       DBI_1.1.0            pillar_1.4.4        
## [76] withr_2.3.0          units_0.6-7          survival_3.1-8      
## [79] nnet_7.3-12          tibble_3.0.4         crayon_1.3.4        
## [82] KernSmooth_2.23-16   utf8_1.1.4           rmarkdown_2.1       
## [85] emo_0.0.0.9000       progress_1.2.2       grid_3.6.3          
## [88] data.table_1.12.8    ModelMetrics_1.2.2.2 digest_0.6.25       
## [91] classInt_0.4-3       tidyr_1.1.1          stats4_3.6.3        
## [94] GPfit_1.0-8          munsell_0.5.0        viridisLite_0.3.0









Arnold, Jeffrey B. 2019. Ggthemes: Extra Themes, Scales and Geoms for ’Ggplot2’. https://CRAN.R-project.org/package=ggthemes.

Chen, Tianqi, Tong He, Michael Benesty, Vadim Khotilovich, Yuan Tang, Hyunsu Cho, Kailong Chen, et al. 2020. Xgboost: Extreme Gradient Boosting. https://CRAN.R-project.org/package=xgboost.

Kuhn, Max. 2020. Caret: Classification and Regression Training. https://CRAN.R-project.org/package=caret.

Luraschi, Javier. 2020. Pins: Pin, Discover and Share Resources. https://CRAN.R-project.org/package=pins.

O’Hara-Wild, Mitchell, Rob Hyndman, and Earo Wang. 2020. Fable: Forecasting Models for Tidy Time Series. https://CRAN.R-project.org/package=fable.

Pedersen, Thomas Lin, and David Robinson. 2020. Gganimate: A Grammar of Animated Graphics. https://CRAN.R-project.org/package=gganimate.

R Core Team. 2020a. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.

———. 2020b. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.

Rinker, Tyler W., and Dason Kurkiewicz. 2018. pacman: Package Management for R. Buffalo, New York. http://github.com/trinker/pacman.

Wang, Earo, Dianne Cook, and Rob J Hyndman. 2020. “A New Tidy Data Structure to Support Exploration and Modeling of Temporal Data.” Journal of Computational and Graphical Statistics. https://doi.org/10.1080/10618600.2019.1695624.

Wickham, Hadley. 2016. Ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. https://ggplot2.tidyverse.org.

Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the tidyverse.” Journal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.

Wickham, Hadley, Romain François, Lionel Henry, and Kirill Müller. 2020. Dplyr: A Grammar of Data Manipulation. https://CRAN.R-project.org/package=dplyr.