Section - 1 Introduction
Welcome to this programming tutorial on machine learning!
In this tutorial we will use live data from the cryptocurrency markets to provide a hands-on and complete example of time-series machine learning for everyone.
1.1 What will I learn?
The focus of the tutorial is on supervised machine learning, a process for building models that can predict future events, such as cryptocurrency prices. You will learn how to use the
caretpackage to make many different predictive models, and tools to evaluate their performance.
In this tutorial you will primarily learn about tools for the R programming language developed by RStudio and more specifically the
tidyverse. If you are not familiar with these open source products, don’t worry. We’ll introduce these throughout the tutorial as needed.
Before we can make the models themselves, we will need to “clean” the data. You will learn how to perform “group by” operations on your data using the
dplyrpackage and to clean and modify grouped data.
You will learn to visualize data using the
ggplot2package, as well as some powerful tools to extend the functionality of ggplot2.
You will gain a better understanding of the steps involved in any supervised machine learning process, as well as considerations you might need to keep in mind based on your specific data and the way you plan on acting on the predictions you have made. Each problem comes with a unique set of challenges, but there are steps that are necessary in creating any supervised machine learning model, and questions you should ask yourself regardless of the specific problem at hand.
You will learn about the value of “reproducible” research.
You will also learn a little about cryptocurrencies themselves, but this is not a tutorial centered around trading or blockchain technology.
1.2 Before Getting Started
You can toggle the sidebar on the left side of the page by clicking on the menu button in the top left, or by pressing on the s key on your keyboard. You can read this document as if it were a book, scrolling to the bottom of the page and going to the next “chapter”, or navigating between sections using the sidebar.
This document is the tutorial itself, but in order to make the tutorial more accessible to people with less programming experience (or none) we created a high-level version of this tutorial, which simplifies both the problem at hand (what we want to predict) and the specific programming steps, but uses the same tools and methodology providing easier to digest examples on one cryptocurrency using a static dataset that does not get updated over time.
1.2.1 High-Level Version
We recommend for everyone to run the code in the high-level version first to get more comfortable with the tools we will be using. If you are not very familiar with programming in either R or Python, or are not sure what cryptocurrencies are, you should definitely work your way through the high-level version first.
Below is an embedded version of the high-level version, you can click on the presentation below and press the
f button on your keyboard to full-screen it, or use any of the links above to view it in its own window:
1.3 Format Notes
You can hide the sidebar on the left by pressing the s key on your keyboard.
The cryptocurrency data was chosen to produce results that change over time and because these markets don’t have any downtime (unlike the stock market).
Whenever an R package is referenced, the text will be colored orange. We will discuss R packages and the rest of the terms below later in this document, but if you are not familiar with these terms please look through the high-level version first.
Whenever a function is referenced, it will be colored green.
Whenever an R object is referenced, it will be colored blue. We will also refer to the parameters of functions in blue.
When a term is particularly common in machine learning or data science, we will call it out with purple text, but only the first time it appears.
Whenever text is
highlighted this way, that means it is a snippet of R code, which will usually be done to bring attention to specific parts of a longer piece of code.
You can leave feedback on any content of either version of the tutorial, for example if something is not clearly explained, by highlighting any text and clicking on the button that says Annotate. Please be aware that any feedback posted is publicly visible. Thanks to Ben Marwick for the implementation of this tool, and to Matthew Galganik for creating the “Open Review Toolkit” (OTR) we are using.
1.4 Plan of Attack
How will we generate predictive models to predict cryptocurrency prices? At a high level, here are the steps we will be taking:
Setup guide. Installation guide on getting the tools used installed on your machine.
Explore data. What is the data we are working with? How “good” is the “quality”?
Prepare the data. Make adjustments to “clean” the data based on the findings from the previous section to avoid running into problems when making models to make predictions.
Visualize the data. Visualizing the data can be an effective way to understand relationships, trends and outliers before creating predictive models, and is generally useful for many different purposes. Definitely the most fun section!
Make predictive models. Now we are ready to take the data available to us and use it to make predictive models that can be used to make predictions about future price movements using the latest data.
Evaluate the performance of the models. Before we can use the predictive models to make predictions on the latest data, we need to understand how well we expect them to perform, which is also essential in detecting issues.
1.5 Who is this example for?
Are you a Bob, Hosung, or Jessica below? This section provides more information on how to proceed with this tutorial.
Bob (beginner): Bob is someone who understands the idea of predictive analytics at a high level and has heard of cryptocurrencies and is interested in learning more about both, but he has never used R. Bob would want to opt for the more high-level version of this tutorial. Bob might benefit from reading the free book “R for Data Science” as well before attempting this tutorial.
Hosung (intermediate): Hosung is a statistician who learned to use R 10 years ago. He has heard of the tidyverse, but doesn’t regularly use it in his work and he usually sticks to base R as his preference. Hosung should start with the high-level version of this tutorial and later return to this version.
Jessica (expert): Jessica is a business analyst who has experience with both R and the Tidyverse and uses the pipe operator (
%\>%) regularly. Jessica should skim over the high-level version before moving onto the next section for the detailed tutorial.
One of the objectives of this document is to showcase the power of reproducibility. This tutorial does not provide coded examples on making code reproducible, but it’s worth a quick discussion. The term itself is defined in different ways:
We can think of something as reproducible if anyone can run the exact same analysis and end up with our exact same results. This is how the high-level version of the tutorial works.
Depending on our definition, we may consider something reproducible if we can run the same analysis that is shown on a newer subset of the data without running into problems. This is how this version works, where the analysis done on your own computer would be between 1 and 12 hours newer than the one shown on this document.
1.6.1 The cost of non-reproducible research
Reproducibility is especially important for research in many fields, but is a valuable tool for anyone who works with data, even within a large corporation. If you work with Excel files or any kind of data, there are tools to be aware of that can save you a lot of time and money. Even if you do things that are “more involved”, for example using your data to run your analysis separately and then putting together a presentation with screenshots of your results with commentary, this is also something you can automate for the future. If you do any kind of repeated manual process in Excel, chances are you would be better off creating a script that you can simply kick off to generate new results.
Reproducibility is all about making the life of anyone who needs to run the analysis as simple as possible, including and especially for the author herself. When creating a one-time analysis, the tool used should be great for the specific task as well as have the side-effect of being able to run again with the click of a button. In our case we are using a tool called R Markdown (Xie, Allaire, and Grolemund 2018).
Not being transparent about the methodology and data (when possible) used is a quantifiable cost that should be avoided, both in research and in business. A study conducted in 2015 for example approximated that for preclinical research (mostly on pharmaceuticals) alone the economy suffers a cost of $28 Billion a year associated with non-reproducible research in the United States (Freedman, Cockburn, and Simcoe 2015):
Reproducibility as discussed in the paper embedded above relates to many aspects of the specific field of preclinical research, but in their work they identified 4 main categories that drove costs associated with irreproducibility as it pertains to preclinical research in the United States:
The “Data Analysis and Reporting” aspect (circled in red in the screenshot above) of a project is shared across many disciplines. As it related to preclinical research, in their breakdown they attribute roughly $7.19 Billion of the costs of irreproducible preclinical research to data analysis and reporting, which could potentially be avoided through the usage of open source tools that we are utilizing for this tutorial. These costs should pale in comparison to the other three categories, and this example is meant to show that there currently exists a costly lack of awareness around these tools; the tutorial itself is meant as an example to showcase the power of these open source tools and the fact that a lot of complex analysis, including this one, can be written used reproducible tools that improve the quality and cost-effectiveness of projects.
As of 2020, the most popular way of sharing open source data science work is through a website called GitHub which allows users to publicly share their code and do much more. This document gets refreshed using a tool called Github Actions that runs some code and updates the file you are looking at on the public GitHub Repository for the project. The website then updates to show the latest version every time the document is refreshed on the GitHub repository.
220.127.116.11 GitHub Repository
The public GitHub Repository associated with this tutorial is not only a website for us to easily distribute all the code behind this document for others to view and use, but also where it actually runs. By clicking on the option that looks like an eye in the options given at the top of the document, you can view the raw code for the page you are currently viewing on the GitHub repository. Every 12 hours, a process gets kicked off on the page associated with our project on GitHub.com and refreshes these results. Anyone can view the latest runs as they execute over time here: https://github.com/ries9112/cryptocurrencyresearch-org/actions
In the next section we will talk more about the GitHub Repository for this tutorial, for now you can check on the latest run history for this document, which is expected to update every 12 hours every day: https://github.com/ries9112/cryptocurrencyresearch-org/actions
If you are running into problems using the code or notice any problems, please let us know by creating an issue on the GitHub repository: https://github.com/ries9112/cryptocurrencyresearch-org/issues
Go to the next section for instructions on getting setup to follow along with the code run in the tutorial. You can run every step either in the cloud on your web browser, or in your own R session. You can even make a copy of the GitHub Repository on your computer and run this “book” on the latest data (updated hourly), and make changes wherever you would like. All explained in the next section ➡️
It should also be noted that this tutorial has nothing to do with trading itself, and that there is a difference between predicting crypotcurrency prices and creating an effective trading strategy. See this later section for more details on why the results shown in this document mean nothing in terms of the effectiveness of a trading strategy.
In this document we aimed to predict the change in cryptocurrency prices, and it is very important to recognize that this is not the same as being able to successfully make money trading on the cryptocurrency markets.
Freedman, Leonard P., Iain M. Cockburn, and Timothy S. Simcoe. 2015. “The Economics of Reproducibility in Preclinical Research.” PLOS Biology 13 (6): e1002165. https://doi.org/10.1371/journal.pbio.1002165.
Xie, Yihui, J. J. Allaire, and Garrett Grolemund. 2018. R Markdown: The Definitive Guide. Boca Raton, Florida: Chapman; Hall/CRC. https://bookdown.org/yihui/rmarkdown.