Latest news about Bitcoin and all cryptocurrencies. Your daily crypto news habit.
Almost over night one of the main data pipeline systems had become a major bottleneck for the company, and a solution was needed, Fast!
Luckily Re2 was there to the rescue. With very little work, I managed to reduce my system runtime by roughly 15%.
In this post I will share how to find the bottlenecks using Gprof2dot beautiful image rendering, and of course, what Re2 is and how to use it.
This article addresses a pain I had in a Python framework, but because there are Re2 implementations to all major languages you should still find value reading this article.
The strategy to relieve bottlenecks is pretty straightforward
- Step one: Recognize were it hurts the most
- Step two: Reduce the pain
- Step three: Iterate back to step one until you get the desired result
Sounds simple enough, right? Well it is, in theory.Usually, the implementation is where it gets you.
Step one: Recognize where it hurts the most
Like anyone who ever saw a guy talking about A.A. meetings on TV, will say: âThe first step in dealing with your problem is recognizing that it existsâ, in our case that meant measurements and finding the bottlenecks.
The system in question is a pretty straightforward ETL system written in Python, so I decided to start by looking at the transform part first.
I spent a couple of hours extracting the main part of the code into a single independent process, and when it was done I started a Jupyter notebook. Using cProfile built-in python package and gprof2dot, I ran the following code to test it -
This is the image Gprof2dot produced:
A short explanation if this is the first time you encountered the Gprof2dot library: The image that I got was a profile that measured our code, and determined, for each major block, how much time our code visited that block and how much time in total each block contributed to the overall running time.
The dark blue ones are the âcoldâ areas, the interesting parts in our investigation. The path starting from the red box, and âcooling downâ all the way to the leaves are the interesting paths to investigate.
Each box has the following format:
In the next image below, I looked at the leaf at the end of the path.
If you look at the upper box, you can see that itâs a method called doesTextFit in class GenericTextInTextSearch. The system spent 32.05% of itâs runtime in that code, but only 4.94% of it is in the actual code of the method, the other ~95% of the time was spent in code called from doesTextFit. The 2,221,386 is the amount of times this code was called. The more interesting part is that doesTextFit calls another method, where it spent ~26% of the time. That was the method I decided to focus my attention on.
It might happen in the leaf box, that the self timeâââin the parenthesisâââand the total timeâââabove the self timeâââdonât add up.This happens when the execution time is too short, causing the round-off errors to be large. Donât worry about it.
âMethod âSearch ofâ _sre.SRE_Pattern objectâ is from Python built-in libraryâââthe Regex library re. It turned out that the system that had a lot of moving parts, spent ~26% of the time on a single method on a single library call.
So now I set out to see if I could find a replacement for the Python built-in method.
It turns out there is. And I was completely surprised by it.
Step two: Reduce the pain
I came across these two articles:
They basically told me, I was living a lieâââI always thought that Regex is the most efficient way to match a string. It turns out that it might have started like that, but somewhere along the way we sacrificedâââin all the major programing languagesâââperformance for features.
The salvation was going back to the fundamentals: State Machine Based Regex!
Whatâs wrong with the built in Regex engine?
TL;DR Backtracking.
When the built-in Regex tries to match a string, it follows a greedy algorithm pattern. Trying to match the full string first, and when it fails trying to match minus one characters on the full string and the leftover char separately. Then on the full string minus two characters, and on the two leftover characters separately, and so on and so forth. If there is no match, the engine is destined to fail, it simply doesnât know that before it iterates through all the different combinations.
This process is time and space consuming.
So whatâs the alternative?
Introducing Ken Thompson. you can read more about him in his wiki page, but here is the TL;DR for the list of his achievements:
- Unix
- B (programming language)âââancestor to C
- UTF-8
- Go (programming language)
- State Machine Based Regex!
The last bullet is best represented in this comparison graph:
The left one is for Perl, but represents the major programing languages, the right one is the Thompson method. Make sure you notice that the left one is in seconds and the right one in microseconds(!).
How did he do that? state machine.
Think of a regular expression as a state machine built from the expression you want to match.
Each char the engine encounters, propels it to make the next decision, if by the end of its journey it reaches a âfinalâ state. Then the engine found a match.
If not, then there is no match.
The space complexity of a solution like that, is the amount of different states the regular expression has. And the time complexity is the length of the text being checked.
With that in mind, I went out to see if I could find an implementation of the Thompson engine, and I found the good people of Google with their Re2 implementation in C++.
To integrate Google Re2 in Python, I had to use a third party wrapper py-re2Â , but for Java there is Google Re2/J project or Brics, we in Taboola, use Brics in our Java code, you can read comparison of the two projects in this great blog: brics-vs-re2j.
The integration in the code is that simple:
It is important to remember that not all the functionality that exists in re, exists in re2. So what I did, and what you should do if you plan to use this solution, is to analyze all your Regex calls, and make sure that you get the expected result.
Although the graph above promised a performance boost of hundreds of percent, the performance boost I managed to get was just a ~40%-50% boost. I didnât complain, it was not too bad for a few days work.
Step three: Iterate back to step one till you get the desired result
I never got to this part, and hopefully, if you use this engine you wonât eitherâŠ
No Silver Bullet
The specific circumstances of the system i described here, is that the majority of the Regex comparisons finish without a match. So it worked really well to have a new Regex engine that returns âno matchâ faster than the regularâââbacktrackingâââone. You will need to test it in your unique scenario to see how much performance improvement you will get.
Second thing to check is the Regex you use, if it needs Backtracking to complete itâs comparison it wonât work, and you will get Segfault. But, it shouldnât be too hard to rewrite them to a simpler syntax.
Third thing to monitor, is the state explosion problem when building deterministic automata, that can happen in some of the Re2 implementationsâââIn Java implementation for example: Brics library for example but not in Google Re2. The cost is in performance.
The last thing you need to check is the return values, there might be a case that the expected outcome has changed from what you had before. For example to get a null were before you got a False.
The bottom line is that you will need to do the same evaluation process that you would have done before upgrading to a new version of a library. Things might change, so you need to be carefull.
Having said that, even if it is only a few percent improvement, the fact that it is so easy to integrate makes it a âmoney on the floorâ scenario.
Good luck, and feel free to contact me for help and guidance.
How Re2 Shattered My Bottleneck was originally published in Hacker Noon on Medium, where people are continuing the conversation by highlighting and responding to this story.
Disclaimer
The views and opinions expressed in this article are solely those of the authors and do not reflect the views of Bitcoin Insider. Every investment and trading move involves risk - this is especially true for cryptocurrencies given their volatility. We strongly advise our readers to conduct their own research when making a decision.