If you are a developer who wants to integrate data manipulation or science into your product, or you are just starting your journey in data science, here are the Python libraries you need to know.
- NumPy
- Pandas
- Matplotlib
- Scikit-Learn
The goal of this series is to provide introductions, highlights, and demonstrations of how to use the must-have libraries so you can pick what to explore more in depth.
pandas
This library is built on top of NumPy, which you may remember from my last article. Pandas takes NumPy's powerful mathematical array-magic one step further. It allows you to store & manipulate data in a relational table structure.
Focus of the Library
This library focuses on two objects: the Series (1D) and the DataFrame (2D). Each allows you to set:
- an index, which lets you find and manipulate certain rows
- column names, which let you find and manipulate certain columns
Having SQL déjà vu yet?
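To make that concrete, here is a minimal sketch (the row and column labels are purely illustrative):
import pandas as pd
# build a tiny DataFrame with labeled rows and columns
df = pd.DataFrame([[1, 2], [3, 4]],
                  index=['row1', 'row2'],
                  columns=['col1', 'col2'])
# look up a single cell by its labels, much like a SQL SELECT with a WHERE clause
df.loc['row1', 'col2'] # returns 2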
Installation
Open a command line and type in
pip install pandas
Windows: in the past I have found installing NumPy & other scientific packages to be a headache, so I encourage all you Windows users to download Anaconda's distribution of Python, which already comes with all the mathematical and scientific libraries installed.
Details
A pandas data structure differs from a NumPy array in a few ways:
- All data in a NumPy array must be of the same data type, while a pandas data structure can hold multiple data types
- A pandas data structure allows you to name rows and columns
- NumPy arrays can reach many dimensions, while pandas data structures limit you to just 1D and 2D.*
*there is a 3D pandas data structure called a Panel, but it is deprecated
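As a quick sketch of the first difference (the column names are purely illustrative, and this assumes pandas is imported as pd, which we do in a moment), one DataFrame can hold floats and strings side by side:
# two columns, two different dtypes in one structure
mixed = pd.DataFrame({'price': [1.5, 2.0], 'symbol': ['BTC', 'ETH']})
>>> mixed.dtypes
price     float64
symbol     object
dtype: object
A single NumPy array could only store this by falling back to a generic object dtype.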
Let's dive in!
import pandas as pd
import numpy as np
Creation
It's very simple!
You can create a Series or DataFrame from a list, tuple, NumPy array, or even a dictionary! Oh and of course from CSVs and databases.
From an array
# Series
future_array1 = [1,2,3,4,5,6]
array1 = np.array(future_array1)
s = pd.Series(array1)
>>> s
0    1
1    2
2    3
3    4
4    5
5    6
dtype: int64
The printout you see above has two columns. The one on the left is the index, and the one on the right is your data. This index looks like the indexes we are used to with lists, tuples, arrays, or any other iterable. We will soon see that in pandas we can change it to whatever we like!
# DataFrame
future_array2 = [2,4,6,8,10,12]
array2 = np.array(future_array2)
df = pd.DataFrame([array1, array2])
>>> df
   0  1  2  3   4   5
0  1  2  3  4   5   6
1  2  4  6  8  10  12
The printout you see above has a ton of numbers. The first column on the left is the index. The top row holds the column names (for now 0 through 5). Again, we will soon see that in pandas we can change them to whatever we like!
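As a preview, here is a minimal sketch of that renaming, applied to the df we just built (the new labels are purely illustrative):
df.index = ['row1', 'row2']
df.columns = ['A', 'B', 'C', 'D', 'E', 'F']
>>> df
      A  B  C  D   E   F
row1  1  2  3  4   5   6
row2  2  4  6  8  10  12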
From a dictionary
The dictionary keys will become the index in a Series
# Series
future_series = {0: 'A', 1: 'B', 2: 'C'}
s = pd.Series(future_series)
>>> s
0    A
1    B
2    C
dtype: object
It works a bit differently in a DataFrame: the keys become the column names.
# DataFrame (avoid naming the variable "dict", which shadows the built-in)
data = {'Normal': ['A', 'B', 'C'], 'Reverse': ['Z', 'Y', 'X']}
df = pd.DataFrame(data)
>>> df
  Normal Reverse
0      A       Z
1      B       Y
2      C       X
Upload data
Pandas has many ways to upload data, but let's focus on the standard CSV format.
uploaded_data = pd.read_csv("filename.csv", index_col=0)
The keyword argument index_col is where you specify which column in your CSV should become the index of the DataFrame. For more details on the read_csv function, see the pandas documentation.
I love that the pandas library only requires 1 line to import data from a CSV. Who else is over copying and pasting the same lines of code from the csv library? ;)
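read_csv accepts dozens of other keyword arguments. Here is a sketch of two I find handy ("filename.csv" remains a placeholder):
# parse the index column as real dates
uploaded_data = pd.read_csv("filename.csv", index_col=0, parse_dates=True)
# peek at only the first 100 rows of a huge file
preview = pd.read_csv("filename.csv", nrows=100)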
Use the Index
Your days of text wrangling are over! No more weird list comprehensions or for loops with comments like '# extract this column during given period' or '# sorry for the mess'.
Here is an example DataFrame:
dates = pd.date_range("20160101", periods=6)
data = np.random.random((6,3))
column_names = ['Column1', 'Column2', 'Column3']
df = pd.DataFrame(data, index=dates, columns=column_names)
>>> df
Column1 Column2 Column3
2016-01-01 0.704351 0.151919 0.505881
2016-01-02 0.242099 0.887256 0.069512
2016-01-03 0.683565 0.305862 0.278066
2016-01-04 0.943801 0.388292 0.221318
2016-01-05 0.353116 0.418686 0.054011
2016-01-06 0.802379 0.720102 0.043310
Indexing a column
>>> df['Column2'] # use the column name's string
2016-01-01 0.151919
2016-01-02 0.887256
2016-01-03 0.305862
2016-01-04 0.388292
2016-01-05 0.418686
2016-01-06 0.720102
Freq: D, Name: Column2, dtype: float64
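You can also grab several columns at once by passing a list of names:
>>> df[['Column1', 'Column3']] # pass a list of column names
Column1 Column3
2016-01-01 0.704351 0.505881
2016-01-02 0.242099 0.069512
2016-01-03 0.683565 0.278066
2016-01-04 0.943801 0.221318
2016-01-05 0.353116 0.054011
2016-01-06 0.802379 0.043310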
Indexing a row
>>> df[0:2] # use the standard indexing technique
Column1 Column2 Column3
2016-01-01 0.704351 0.151919 0.505881
2016-01-02 0.242099 0.887256 0.069512
>>> df['20160101':'20160102'] # use the index's strings
Column1 Column2 Column3
2016-01-01 0.704351 0.151919 0.505881
2016-01-02 0.242099 0.887256 0.069512
Indexing multiple axes by name
>>> df.loc['20160101':'20160102',['Column1','Column3']]
Column1 Column3
2016-01-01 0.704351 0.505881
2016-01-02 0.242099 0.069512
Indexing multiple axes by number
>>> df.iloc[3:5, 0:2]
Column1 Column2
2016-01-04 0.943801 0.388292
2016-01-05 0.353116 0.418686
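Indexing with a boolean condition
One more trick worth knowing, since we will lean on it in the Bitcoin example at the end: filter rows with a boolean condition.
>>> df[df['Column1'] > 0.5] # keep only rows where Column1 exceeds 0.5
Column1 Column2 Column3
2016-01-01 0.704351 0.151919 0.505881
2016-01-03 0.683565 0.305862 0.278066
2016-01-04 0.943801 0.388292 0.221318
2016-01-06 0.802379 0.720102 0.043310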
View Your Data
Quickly check the top and bottom rows:
>>> df.head(2) # first 2 rows
Column1 Column2 Column3
2016-01-01 0.704351 0.151919 0.505881
2016-01-02 0.242099 0.887256 0.069512
>>> df.tail(2) # last 2 rows
Column1 Column2 Column3
2016-01-05 0.353116 0.418686 0.054011
2016-01-06 0.802379 0.720102 0.043310
View summary statistics before you dash off for a meeting:
>>> df.describe()
Column1 Column2 Column3
count 6.000000 6.000000 6.000000
mean 0.621552 0.478686 0.195350
std 0.269550 0.273359 0.180485
min 0.242099 0.151919 0.043310
25% 0.435728 0.326470 0.057887
50% 0.693958 0.403489 0.145415
75% 0.777872 0.644748 0.263879
max 0.943801 0.887256 0.505881
Control Your Data
Pandas brings the flexibility of SQL into Python.
Sort
>>> df.sort_index(axis=0, ascending=False) # sort using the index
Column1 Column2 Column3
2016-01-06 0.802379 0.720102 0.043310
2016-01-05 0.353116 0.418686 0.054011
2016-01-04 0.943801 0.388292 0.221318
2016-01-03 0.683565 0.305862 0.278066
2016-01-02 0.242099 0.887256 0.069512
2016-01-01 0.704351 0.151919 0.505881
>>> df.sort_values(by='Column2') # sort using a column
Column1 Column2 Column3
2016-01-01 0.704351 0.151919 0.505881
2016-01-03 0.683565 0.305862 0.278066
2016-01-04 0.943801 0.388292 0.221318
2016-01-05 0.353116 0.418686 0.054011
2016-01-06 0.802379 0.720102 0.043310
2016-01-02 0.242099 0.887256 0.069512
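Both methods accept an ascending keyword, so a descending sort is one argument away; sort_values also takes a list of columns if you need tie-breaking:
>>> df.sort_values(by='Column2', ascending=False) # largest first
Column1 Column2 Column3
2016-01-02 0.242099 0.887256 0.069512
2016-01-06 0.802379 0.720102 0.043310
2016-01-05 0.353116 0.418686 0.054011
2016-01-04 0.943801 0.388292 0.221318
2016-01-03 0.683565 0.305862 0.278066
2016-01-01 0.704351 0.151919 0.505881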
Join
Here are new example DataFrames:
dates1 = pd.date_range("20160101", periods=6)
data1 = np.random.random((6,2))
column_names1 = ['ColumnA', 'ColumnB']
dates2 = pd.date_range("20160101", periods=7)
data2 = np.random.random((7,2))
column_names2 = ['ColumnC', 'ColumnD']
df1 = pd.DataFrame(data1, index=dates1, columns=column_names1)
df2 = pd.DataFrame(data2, index=dates2, columns=column_names2)
>>> df1.join(df2) # joins on the index
ColumnA ColumnB ColumnC ColumnD
2016-01-01 0.128655 0.181495 0.574188 0.628584
2016-01-02 0.278669 0.810805 0.634820 0.545531
2016-01-03 0.489763 0.397794 0.169862 0.300666
2016-01-04 0.911465 0.903353 0.058488 0.911165
2016-01-05 0.094284 0.890642 0.282264 0.568099
2016-01-06 0.512656 0.735082 0.141056 0.698386
If you want to join on a column other than the index, check out the merge method.
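Here is a minimal sketch of merge (the 'key' column and its values are purely illustrative):
left = pd.DataFrame({'key': ['a', 'b'], 'value_l': [1, 2]})
right = pd.DataFrame({'key': ['a', 'b'], 'value_r': [3, 4]})
>>> pd.merge(left, right, on='key') # join on a shared column instead of the index
  key  value_l  value_r
0   a        1        3
1   b        2        4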
Group by
df3 = df1.join(df2)
# add a column to df3 to group on
df3['ProfitLoss'] = pd.Series(['Profit', 'Loss', 'Profit', 'Profit', 'Profit', 'Loss'], index=dates1)
>>> df3.groupby('ProfitLoss').mean()
ColumnA ColumnB ColumnC ColumnD
ProfitLoss
Loss 0.403947 0.759588 0.272969 0.305868
Profit 0.576668 0.477050 0.359661 0.406070
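You are not limited to one statistic per group; agg lets you request several at once:
>>> df3.groupby('ProfitLoss')['ColumnA'].agg(['mean', 'count'])
mean count
ProfitLoss
Loss 0.403947 2
Profit 0.576668 4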
Accessing Attributes
Notice how I was able to just add a column using key/value notation in the code above? Pandas allows you to add new data with ease. It also allows you to access the core attributes of your data structures.
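For example, adding a derived column is a single assignment (the column name 'Spread' is purely illustrative):
df3['Spread'] = df3['ColumnA'] - df3['ColumnB'] # element-wise, aligned on the index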
Access the Index
>>> df3.index
DatetimeIndex(['2016-01-01', '2016-01-02', '2016-01-03', '2016-01-04', '2016-01-05', '2016-01-06'], dtype='datetime64[ns]', freq='D')
Access the Values
>>> df3.values
array([[0.441513594483238, 0.974419927787583, 0.20896018007846018,
0.45913058454344435, 'Profit'], ...
[0.6980963896232228, 0.7005669323477245, 0.09231336594380268,
0.13264595083739117, 'Loss']], dtype=object)
Access the Columns
>>> df3.columns
Index([u'ColumnA', u'ColumnB', u'ColumnC', u'ColumnD', u'ProfitLoss'], dtype='object')
I'm providing here a link to download my pandas walkthrough as a Jupyter Notebook!
Never used Jupyter notebooks before? Visit their website here.
Overall, if you have a dataset you want to manipulate but don't want to go through the hassle of hauling it all into SQL, I recommend searching for a pandas solution before anything else!
Applications
Let's look at a scenario. Say you want to keep an eye on Bitcoin but don't want to invest too much time in building out infrastructure. You can use pandas to keep it simple.
You'll need a Quandl account and the Python Quandl library.
pip install quandl
Let's code:
import quandl
# set up the Quandl connection
api_key = 'GETYOURAPIKEY'
quandl.ApiConfig.api_key = api_key
quandl_code = "BITSTAMP/USD"
# get the data from the API
bitcoin_data = quandl.get(quandl_code, start_date="2017-01-01", end_date="2018-01-17", returns="numpy")
# set up the data in pandas
df = pd.DataFrame(data=bitcoin_data, columns=['Date', 'High', 'Low', 'Last', 'Bid', 'Ask', 'Volume', 'VWAP'])
# make the 'Date' column the index
df.set_index('Date', inplace=True)
# find a rolling 30-day average, shifted so each day is compared to the average of the prior 30 days
df['RollingMean'] = df['Last'].rolling(window=30).mean().shift(1)
# label when the last price is less than the rolling 30-day average
df['Buy'] = df['Last'] < df['RollingMean']
# create a strategic trading DataFrame
trading_info = df.loc[:, ['Last', 'RollingMean', 'Buy']]
>>> trading_info.tail(10) # let's look at the last 10 days
Last RollingMean Buy
Date
2018-01-08 16173.98 15693.421333 False
2018-01-09 15000.00 15704.147667 True
2018-01-10 14397.30 15716.680333 True
2018-01-11 14900.00 15706.590333 True
2018-01-12 13220.00 15655.209333 True
2018-01-13 13829.29 15539.209333 True
2018-01-14 14189.66 15458.548000 True
2018-01-15 13648.00 15384.760000 True
2018-01-16 13581.66 15258.109000 True
2018-01-17 11378.66 15070.668667 True
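As a quick follow-up sketch, the boolean Buy column makes summary questions one-liners:
buy_days = trading_info['Buy'].sum()   # True counts as 1, so this tallies buy days
buy_ratio = trading_info['Buy'].mean() # fraction of days flagged as a buy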
This is the power of pandas with real-life data! But what if we wanted to view the data shown above in a graph? That's possible, but you'll have to wait for the next article on Matplotlib. Coming soon.
Thanks for reading! My name is Lauren Glass. I am an entrepreneur & engineer living in Tel Aviv.
Connect with me on Instagram @ljglass
Connect with me on LinkedIn
Need a server? Use me as a referral for DigitalOcean.