Clickhouse is a fairly new column-store database. It's developed by the folks over at Yandex (the Google of Russia), built to scale horizontally reasonably well and to run high-speed aggregate queries on hundreds of billions of rows of data.
It uses its own SQL dialect, which matches PL/pgSQL in terms of expressivity and simplicity. It even includes higher-order functions for working with nested objects within rows, such as arrayMap and arrayFilter.
What is Clickhouse used for?
In short, Clickhouse is used to run fast analytics on very large amounts of data. It's rather bad as a transactional database, but it can run aggregate queries on billions of rows in sub-second times.
It fills the ever-increasing niche that technologies like Hadoop, Spark, Druid, BigQuery, Redshift, Athena and MonetDB aim for. It allows a team to easily store and analyze large sets of data with high dimensionality. More importantly, it blows the aforementioned solutions out of the water in terms of speed and query complexity.
Clickhouse is meant solely for analytic workloads, so as long as your queries are dominated by GROUP BY and aggregate functions such as COUNT, SUM and DISTINCT, it will be blazing fast, even if your row count is in the 11 to 12 digit range. But if you want to do a lot of large-scale joins and point queries, a typical RDBMS is still going to be the better choice.
It should be noted that Clickhouse does not yet support queries like DELETE and UPDATE in its stable syntax. Even when it does, they will be slow due to its compression mechanism. Again, this is not a database meant for transactional workloads.
It will also compress your data; how much space this saves depends on the column granularity and on how much query speed you are willing to sacrifice. The way Clickhouse compresses and partitions data is key to its quick queries.
If you want to read more about how column stores work, I would suggest you start with the Wikipedia article, since most of the information there is quite easy to understand and accurate.
What exactly makes Clickhouse amazing?
1. It's extremely easy to set up and use, even if you need data replication and data distribution on a few dozen machines.
Deploying a Clickhouse cluster on 3 machines is almost instant if you are familiar with what you are doing and could take you a few hours to do properly if you are learning everything from scratch.
No need for obscure libraries, no endless configuration files, user permission issues and arcane problems with data distribution and replication.
2. It's not fragile. One horrible thing I've noticed in my rendezvous with various "big data" technologies (Spark is a chief offender here) is that they tend to break often, with no warning and for no apparent reason. Even worse, they give little information as to why the problem happened within the humongous logs they generate.
Clickhouse is designed to recover from failure, and it produces a reasonably sized system log with clear explanations of the problems it encountered.
3. It doesn't reinvent the wheel; it adds a steam engine to the carriage. Perhaps the most amazing thing about Clickhouse is that it doesn't push a lot of limitations on you, nor does it force you to completely rethink the way you store and query your data.
Its data types cover everything from nested structures to matrices, it has a gigantic library of functions, and it offers one of the most versatile query languages I've ever seen.
Unlike MapReduce or other column-store solutions, which force you to rethink your entire schema, you can pretty much duplicate a transactional database schema with Clickhouse and it will do just fine.
4. It's actually got amazing documentation and a great community. Perhaps the nastiest thing with "big data" technologies nowadays is that most of them are designed in order to sell support.
A prime example of this would be Druid, open source and free in theory, but an unbreakable bastion of weird complexity coupled with a complete lack of documentation and community involvement.
Clickhouse's documentation is relatively easy to search through and it's being groomed, expanded and improved constantly.
The community is nothing short of amazing. It's quite tiny, consisting of the developers and users from Yandex plus a few other early adopters, but that hasn't stopped every question and issue I raised on GitHub and Stack Overflow from being answered within hours.
The developers also seem to be very receptive to contributors external to Yandex.
Enough talk, let's install the thing
The first step is to install Clickhouse. For the purpose of this tutorial, it's best to have a machine running Debian or Ubuntu, since those are the operating systems Yandex officially supports.
It's also quite easy to install on Fedora, CentOS, SUSE and RHEL. I've personally installed it with success from the AUR on Arch, and it exists as a package for Gentoo. If all else fails, you can use the Docker image or build it from source.
To install it on Ubuntu, execute the following:
echo "deb http://repo.yandex.ru/clickhouse/deb/stable/ main/" | sudo tee /etc/apt/sources.list.d/clickhouse.list
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv E0C56BD4
sudo apt-get update
sudo apt-get install clickhouse-client clickhouse-server
Now you can start the server as a daemon by running sudo systemctl start clickhouse-server and connect to it via the client by running clickhouse-client.
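Once connected, a trivial query makes for a quick sanity check; version() simply returns the server's version string:
SELECT version()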
If you want to play around with the configurations, you can find them inside /etc/clickhouse-server and /etc/clickhouse-client respectively.
Playing with some data
For this introduction, I chose to use the NYC property taxes dataset. It's a rather small dataset of ~1 million rows, but that means downloading and insertion will be very quick. Query speed will scale roughly linearly with larger datasets as long as they fit on a local RAID, and won't suffer too much if the data is distributed.
In case you've installed an older version, you may want to run:
SET experimental_allow_extended_storage_definition_syntax = 1;
This will allow you to use the syntax used throughout this tutorial. If your version is new enough, trying to run this will just throw an exception; no need to worry, you are all set.
Let's create our database and table:
CREATE DATABASE test_database;
USE test_database;
CREATE TABLE tax_bills_nyc
(
    bbl Int64,
    owner_name String,
    address String,
    tax_class String,
    tax_rate String,
    emv Float64,
    tbea Float64,
    bav Float64,
    tba String,
    property_tax String,
    condonumber String,
    condo String,
    insertion_date DateTime MATERIALIZED now()
)
ENGINE = MergeTree
PARTITION BY tax_class
ORDER BY (owner_name)
Before we insert that data, let's look at the things which might be a bit unclear here.
MATERIALIZED is a way to create a column that is automatically set whenever a row is inserted, based on the expression after the keyword. In this case, the function I chose to use is now(), in order to get the insertion timestamp for each row.
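One detail worth knowing: MATERIALIZED columns are not included in SELECT * results, so once the data is loaded (a few steps down), you query the column by name. A minimal example:
SELECT owner_name, insertion_date FROM tax_bills_nyc LIMIT 5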
ENGINE = MergeTree sets the storage engine for the table. Clickhouse supports many types of table engines, including your bog-standard Memory engine, Buffer engines that are meant to buffer rows when a very high write rate is needed, Join engines that are meant to speed up joins, and many others. The MergeTree family contains the best table engines for most analytic queries, and the plain MergeTree engine is the simplest of those and usually the best choice.
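If you ever want the same structure under a different engine, the AS form copies the column definitions from an existing table. A small sketch (the new table name here is purely illustrative):
CREATE TABLE tax_bills_nyc_memory AS tax_bills_nyc ENGINE = Memory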
While the specifics of the MergeTree engines would take too long to explain here, if you are familiar with log-structured merge-trees, Radix trees, immutable "Vectors", Merkle trees or Patricia trees, then think of MergeTrees as something conceptually close to those.
If you aren't, I'd like to invite you to the data structures sermon when your local church of Functional Programming next meets.
PARTITION BY tax_class will partition the table using the tax_class column. Partitions are the largest storage unit under tables in Clickhouse. You can delete partitions, attach partitions and temporarily detach partitions. Efficient Clickhouse replication and distribution also depend on good partitioning.
From the end user perspective, what you should care about most is that partitions are very fast to operate on using >, < and ==.
As a general rule of thumb, the column you partition by should have fairly low granularity: a good guideline is fewer than 1,000 distinct values per billion rows. When your data is associated with a point in time, a good partition key is the date associated with each row.
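To make the partition operations mentioned above concrete, here is a sketch using one of the tax_class values from this dataset; DETACH takes the partition offline without deleting the data, and ATTACH brings it back:
ALTER TABLE tax_bills_nyc DETACH PARTITION '1c - condo in 1-3 unit building'
ALTER TABLE tax_bills_nyc ATTACH PARTITION '1c - condo in 1-3 unit building'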
ORDER BY is similar to a btree index in an RDBMS, from a user perspective. It speeds up queries using comparison operators. Unlike PARTITION BY, you can use high-granularity data here without losing performance. Again, in the case of a table where you often query based on a time associated with each data point, a good candidate for this value is the timestamp of each row. You can order your table by a combination of columns if you so desire.
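As an illustration, a hypothetical events table (not part of this dataset) following both guidelines might use a low-granularity date as the partition key and a high-granularity compound sorting key:
CREATE TABLE events
(
    event_date Date,
    event_time DateTime,
    user_id UInt64,
    action String
)
ENGINE = MergeTree
PARTITION BY event_date
ORDER BY (user_id, event_time)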
Now you should be familiar with most of the new concepts you need to get along with Clickhouse; the rest should feel quite natural to you.
Let's insert our data:
curl -X GET 'http://taxbills.nyc/tax_bills_june15_bbls.csv' | /usr/bin/clickhouse-client --input_format_allow_errors_num=10 --query="INSERT INTO test_database.tax_bills_nyc FORMAT CSV"
That should take a few seconds or minutes depending on your internet speed. Once it's done, we can start testing some queries.
If you want to work with a bigger dataset but don't have time to define your own table, just run the above command a few more times :)
Let's take it for a ride
Ok, let's run our first query, don't blink: SELECT COUNT(*)/pow(10,3) FROM tax_bills_nyc. That query should have been relatively instant, probably done in sub-millisecond time if you are using server-grade hardware.
For the purpose of this tutorial I'm using an E3-1270v6 with 32GB of ECC DDR4 at 2.133 GHz, but really, any old server or laptop will do.
Let's try something a bit harder:
SELECT SUM(emv) as total_emv, SUM(tbea) as total_tbea, SUM(bav) as total_bav, tax_class,tax_rate FROM tax_bills_nyc GROUP BY tax_class, tax_rate
That should have taken somewhere between 10 and 50 ms depending on your machine, quite the feat for doing a GROUP BY over 2 dimensions while summing 3 different columns.
Ok, what if we add a high-dimensionality column to the GROUP BY, let's say the owner's name, since there are about 85,000 distinct names in the dataset:
SELECT SUM(emv) as total_emv, SUM(tbea) as total_tbea, SUM(bav) as total_bav, tax_class,tax_rate FROM tax_bills_nyc GROUP BY tax_class, tax_rate, owner_name
That was probably slightly slower, but it still took under half a second, despite the high granularity of our new grouping criterion.
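If you want something more digestible than the full grouped result, a small variation on the same query (not part of the original walkthrough) keeps only the ten owners with the largest total market value:
SELECT owner_name, SUM(emv) AS total_emv
FROM tax_bills_nyc
GROUP BY owner_name
ORDER BY total_emv DESC
LIMIT 10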
But if we built our schema correctly, we should be able to filter most queries using our order or partition column. Let's add a WHERE clause based on the tax_class column:
SELECT SUM(emv) as total_emv, SUM(tbea) as total_tbea, SUM(bav) as total_bav, tax_class,tax_rate FROM tax_bills_nyc WHERE tax_class='1c - condo in 1-3 unit building' GROUP BY tax_class, tax_rate, owner_name
Suddenly, our query time drops back into the sub-10 ms range.
Clickhouse not only offers speed, it also has various type and value manipulation functions that let us easily write a query whose output is easily human readable:
SELECT concat(toString(uniq(owner_name) / pow(10, 3)), ' thousand distinct names among the owners of ', toString(uniq(address) / pow(10, 3)), ' thousand properties') AS result FROM tax_bills_nyc
This should give you a result similar to:
85.058 thousand distinct names among the owners of 989.466 thousand properties
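In the same spirit, a small illustrative query (my own, not from the dataset's documentation) that uses round() and avg() to summarize market value per tax class:
SELECT tax_class, round(avg(emv), 2) AS avg_market_value, count() AS properties
FROM tax_bills_nyc
GROUP BY tax_class
ORDER BY avg_market_value DESC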
Ok, let's try and see what happens if we group by every single column in this table. To save you some typing, we'll use DISTINCT(*), which essentially does just that.
SELECT COUNT(DISTINCT(*)) FROM tax_bills_nyc
Surprisingly enough, this once again takes under half a second.
It might actually be quicker than some of the previous GROUP BYs, since we didn't have to transform any of the string-typed columns into a printable version.
In my books, that was quite impressive when I first saw it.
Have I peeked() your interest?
Hopefully you can see why I'm very excited about this database, based on the expressive syntax and amazing query speed alone.
This was mostly a toy dataset, meant to make this tutorial easy to follow. So I encourage you to try and go up to a few billion rows and see for yourself how unbelievable this database is.
As for the dataset you want to do that with, I leave that up to you. Maybe try one of your large internal tables on which you often have trouble running analytical queries. Or try formatting some of the data in your "data lake", which you'd usually query with a tool like Spark or Athena, and insert it into Clickhouse.
If enough people are interested, I might produce a follow-up article or series with some more tips, tricks and real-world use cases for Clickhouse.