These are the (Unofficial) Lecture Notes of the Fast.ai Machine Learning for Coders MOOC.
You can find the Official Thread here
This is Part 1/12 Lecture Notes.
Introduction
- Course taught at USF, available as MOOC.
- Tip: The right place to watch the Lectures is at courses.fast.ai
- Tip: Check for “Cards” on the Video.
Alternatives to Local Setup (with fast.ai support):
- Crestle: ~3 cents an hour (approx.); a Jupyter notebook opens in the browser.
- Paperspace
Local Setup instructions:
Assuming you have your GPU and Anaconda set up (preferably CUDA ≥ 9):
$ git clone https://github.com/fastai/fastai
$ cd fastai
$ conda env update
- Use the Setup Script provided Here
$ curl files.fast.ai/setup/paperspace | bash
Approach to Learning
- Follow Along.
- Watch first and follow along later (loose recommendation): if you follow along live you might miss important information, and watching first lets you experiment afterwards.
Teaching Approach:
- Dive into Code
- Build Models
- Theory comes later, at a point where you’ll already be an effective coder.
- Try with More Datasets.
- The More Coding you do, The Better (Recommended by Alumni)
- Write Blog Posts.
“Hey, I just learned this concept, and I’ll share it.”
Good Technical Blogs:
- Peter Norvig (more here)
- Stephen Merity
- Julia Evans (more here)
- Julia Ferraioli
- Edwin Chen
- Slav Ivanov (fast.ai student)
- Brad Kenstler (fast.ai and USF MSAN student)
Imports
- Auto reload commands:
%load_ext autoreload
%autoreload 2
This allows the NB to be updated when you change the source code (Restart not required)
%matplotlib inline
To plot Figures inline
from fastai.imports import *
Data science is not software engineering: prototyping models needs things to be done interactively. import * makes everything available, so we don’t need to spell out the specifics up front.
Jupyter Tricks
fn_name
- Shows which library fn_name lives in.
?fn_name
- Shows the documentation of the function.
??fn_name
- Shows the source code of the function.
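Outside a notebook, the same information is available from plain Python; a minimal sketch using the standard inspect module (textwrap.dedent is just a stand-in example function):

```python
import inspect
import textwrap

# ?fn is roughly the docstring:
doc = inspect.getdoc(textwrap.dedent)

# ??fn is roughly the source code (works for pure-Python functions):
src = inspect.getsource(textwrap.dedent)

print(doc.splitlines()[0])
print(src.splitlines()[0])
```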
Getting the Data
Kaggle: real-world problems posted by a company/institute. These are really close to real-world problems and allow you to check yourself against other competitors.
TL;DR: Perfect place to check your skillset.
Jeremy: “I’ve learnt more from Kaggle competitions than anything else I’ve done in my Entire Life”
- Go to Competition page.
- Accept Terms and Conditions.
- Download Dataset
OR
- Setup Official Kaggle API
- Use The Terminal to Download the Dataset.
OR
- Use CurlWget Chrome Extension.
- Start Download and Cancel it.
- Click on the Extension.
- Paste the Copied Command into a Terminal.
Note: Prefer techniques that will be useful for downloading data to your cloud compute instance. Crestle and Paperspace will have most of the datasets pre-downloaded.
Good Practice: create a data folder for all of your data.
To Run BASH Commands in Jupyter
!BASH_COMMAND
To Add Python Commands
!BASH {Python}
Blue Book for Bulldozers:
Goal:
The goal of the contest is to predict the sale price of a particular piece of heavy equipment at auction based on its usage, equipment type, and configuration. The data is sourced from auction result postings and includes information on usage and equipment configurations.
Fast Iron is creating a “blue book for bulldozers,” for customers to value what their heavy equipment fleet is worth at auction.
- Look at Data to Get Started.
!head data/bulldozers/Train.csv
Gives the First Few lines.
Structured Data:
(Unofficial definition) Columns of data with varying types.
- Pandas: the best library for working with tabular data.
- Fastai imports import pandas library by default.
- Reading CSV
df_raw = pd.read_csv(f'{PATH}Train.csv', low_memory=False, parse_dates=["saledate"])
- low_memory=False forces Pandas to read the whole file before inferring column types, using more memory but avoiding mixed-type guesses.
Python 3.6 Formatting:
var = 'abc'
f'ABC {var}'
This allows Python to evaluate the code inside the {}.
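A quick sketch (the variable names here are just illustrative, not from the lecture):

```python
name = 'bulldozers'
path = f'data/{name}/'    # variables are interpolated
msg = f'{2 + 2} columns'  # arbitrary expressions work too
print(path, msg)
```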
Display data:
df_raw
Simply writing this would truncate the output
def display_all(df):
    with pd.option_context("display.max_rows", 1000, "display.max_columns", 1000):
        display(df)
This allows The Complete df to be printed.
display_all(df_raw.tail().T)
Since there are a lot of columns, we have taken Transpose.
Evaluation:
Since the Metric is RMSLE, we would consider the logarithmic values here.
Root mean squared log error: between the actual and predicted auction prices.
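A minimal sketch of the metric (my own helper, not Kaggle’s or fastai’s code); note that taking log(SalePrice) up front turns RMSLE into a plain RMSE:

```python
import numpy as np

def rmsle(y_pred, y_true):
    # root mean squared log error between predicted and actual prices
    return np.sqrt(np.mean((np.log(y_pred) - np.log(y_true)) ** 2))

# toy prices, purely illustrative
y_true = np.array([10000.0, 20000.0])
y_pred = np.array([11000.0, 18000.0])
print(rmsle(y_pred, y_true))
```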
Random Forests:
- Introduction: a universal machine learning technique that can predict categorical/continuous variables.
- It can work with Pixel Values/Columns.
- In General, it doesn’t overfit.
- It’s easy to avoid Overfitting.
- Works fine without a separate validation set.
- Requires no statistical assumptions.
TL;DR: It’s a great Start.
Curse of Dimensionality:
A greater number of columns creates an emptier mathematical space in which the data points sit on the edges (a mathematical property).
This supposedly makes the distance between points meaningless.
In General, False.
- Datapoints have distance even when they sit on the boundaries.
- Theoretical research was heavier in the ’90s.
- Building Models on lots of columns works really well.
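The “points sit on the edges” property is easy to check numerically (a toy demo, not from the lecture): the expected maximum coordinate of a uniform point in [0,1]^d is d/(d+1), which approaches 1 as d grows.

```python
import numpy as np

rng = np.random.default_rng(0)
mean_max = {}
for d in (2, 100):
    pts = rng.random((1000, d))           # 1000 uniform points in [0,1]^d
    mean_max[d] = pts.max(axis=1).mean()  # average distance of the farthest coordinate
print(mean_max)
```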
No Free Lunch Theorem:
There is no kind of Model that works well for any kind of Dataset.
In general, we look at data that was created by some cause/structure. There are actually techniques that work well for nearly all of the general datasets we work with. Ensembles of decision trees are the most widely used such technique.
Running the regressor on the raw data fails on the string columns:
ValueError: could not convert string to float: 'Conventional'
SKLearn isn’t the Best library, but it’s good for our purposes.
RandomForest:
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
- Regressor: continuous values.
- Classifier: Classify Values.
Note: Regression!=Linear Regression.
Feature Engineering
The RandomForest Algorithm expects numerical data.
- We Need to Convert Everything to Numbers.
DataSet:
- Continuous variables.
- Categorical:
  - Numbers
  - Strings
  - Dates
df_raw.saledate
Information inside a Date:
- Is it a holiday?
- Is it a weekend?
- The weather.
- Event(s) Information.
??add_datepart
To look at the source code.
This grabs the field fldname.
Note: df.fldname would literally look up a field named fldname. df[fldname] is the safer option in general: it also works when the column name is held in a variable and doesn’t give weird errors if we make a mistake, so don’t use df.fldname out of laziness. Also, df[fldname] returns a Series.
The function goes through all of the Strings, it looks inside the object and finds attribute with that name. This has been made to create any column that might be relevant to our case. (Exact opposite of the Curse of Dimensionality- We are creating more columns)
There is no harm in adding more columns to your data.
Link getattr()
Pandas splits out different methods inside attributes.
All of the datetime-specific methods are accessed via the .dt accessor (pd.Series.dt.___).
Finally we drop the column.
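A simplified sketch of what add_datepart does (not the fastai source; the column names just follow the same pattern):

```python
import pandas as pd

def add_datepart_sketch(df, fldname):
    # expand one datetime column into several numeric date features,
    # then drop the original column
    fld = pd.to_datetime(df[fldname])
    for attr in ('year', 'month', 'day', 'dayofweek', 'dayofyear'):
        df[fldname + '_' + attr.capitalize()] = getattr(fld.dt, attr)
    df.drop(fldname, axis=1, inplace=True)
    return df

df = pd.DataFrame({'saledate': ['2011-11-16'], 'SalePrice': [66000]})
df = add_datepart_sketch(df, 'saledate')
print(df.columns.tolist())
```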
Dealing with Strings
- UsageBand has Low, High, Medium.
- Pandas has a categorical type, but strings aren’t converted to it by default.
train_cats
Creates categorical variables for strings: each string column gets numeric codes, and the mapping between strings and numbers is stored.
Make sure you use the same mapping for training dataset and testing dataset.
- Similar to .dt, .cat gives access to Categorical data.
Since we’ll have a decision tree that splits the columns, it’s better to have a “logical” order.
RF consists of Trees that make splits. The splits could be High Vs Low+Medium then followed by Low Vs Medium.
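A sketch of the idea with plain pandas (train_cats itself is a fastai helper; this just shows the ordered-category mechanics):

```python
import pandas as pd

# toy UsageBand column; give the categories a logical order so a tree
# can split High vs. Low+Medium with a single cut
s = pd.Series(['High', 'Low', 'Medium', 'Low'])
cat = s.astype(pd.CategoricalDtype(['Low', 'Medium', 'High'], ordered=True))
codes = cat.cat.codes  # Low→0, Medium→1, High→2
print(codes.tolist())
```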
Missing Values
display_all(df_raw.isnull().sum().sort_index()/len(df_raw))
- .isnull() Returns T/F if the data has null values.
- .sum() adds up the null values.
- We then sort and divide them by the length to return the missing values.
Saving
os.makedirs('tmp', exist_ok=True)
df_raw.to_feather('tmp/bulldozers-raw')
Feather: saves files in a format similar to how they sit in RAM. In layman’s terms: it’s fast.
Pro-Tip: Use Temporary folder for all actions/needs that pop up while you’re working.
Final Steps
proc_df
A function inside fastai.structured.
- Grabs a copy of the df
- Grab the dependent column.
- Dependent column is dropped.
- Missing Values are fixed.
- Fix missing:
  - Numeric values: if a column has missing values, create a new boolean column named Col_na and replace the missing values with the median.
  - Non-numeric/categorical values: replace with the category code plus 1.
df, y, nas = proc_df(df_raw, 'SalePrice')
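A sketch of the numeric-fix step only (my own simplified version, not proc_df itself):

```python
import pandas as pd

def fix_missing_sketch(df, col):
    # flag missing rows in a new boolean column, then fill with the median
    if df[col].isnull().any():
        df[col + '_na'] = df[col].isnull()
        df[col] = df[col].fillna(df[col].median())
    return df

# toy column with one missing value
df = pd.DataFrame({'MachineHours': [100.0, None, 300.0]})
df = fix_missing_sketch(df, 'MachineHours')
print(df)
```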
Running Regressor
m = RandomForestRegressor(n_jobs=-1)
m.fit(df, y)
m.score(df, y)
- Random forests are parallelizable; n_jobs=-1 creates a separate process for every CPU we have.
- Create a Model
- Return the Score
1 is the Best Score.
0 is the Worst.
def rmse(x,y): return math.sqrt(((x-y)**2).mean())

def print_score(m):
    res = [rmse(m.predict(X_train), y_train), rmse(m.predict(X_valid), y_valid),
           m.score(X_train, y_train), m.score(X_valid, y_valid)]
    if hasattr(m, 'oob_score_'): res.append(m.oob_score_)
    print(res)
Checking Overfitting
- We can create a Validation Dataset.
- Sorted by date, the most recent 12,000 rows will be the validation set.
def split_vals(a,n): return a[:n].copy(), a[n:].copy()
n_valid = 12000  # same as Kaggle's test set size
n_trn = len(df)-n_valid
raw_train, raw_valid = split_vals(df_raw, n_trn)
X_train, X_valid = split_vals(df, n_trn)
y_train, y_valid = split_vals(y, n_trn)
X_train.shape, y_train.shape, X_valid.shape
Final Score
If you’re in the Top Half of the Kaggle LB, it’s a great start.
print_score(m)
[0.09044244804386327, 0.2508166961122146, 0.98290459302099709, 0.88765316048270615]
0.25 would get a LB position in the Top 25%
Appreciation: without any overthinking or intensive feature engineering, and without defining or worrying about any statistical assumptions, we get a decent score.
If you found this article to be useful and would like to stay in touch, you can find me on Twitter here.
Fast AI Machine Learning Lecture 1 Notes was originally published in Hacker Noon on Medium, where people are continuing the conversation by highlighting and responding to this story.