There is what I consider to be a critical bug right now for data science workflows on Python 3.7. The bug is not in Python itself but in NumPy, where casting errors are sometimes swallowed. This means that you can do things like cast strings to complex numbers, and NumPy might not throw an exception. Pandas relies on NumPy's error handling to determine whether your DataFrame is all numeric, so taking the mean of a DataFrame can sometimes give you your results as complex numbers. Everyone should stay on Python 3.6 until the NumPy fix is released. The relevant GitHub issues are NumPy #11993 and #12062, and Pandas #22506 and #22753. You can circumvent the issue by passing numeric_only=True into your call to .mean, but it is unlikely that you are already doing so. The fix to NumPy has been merged; once it's released, you should upgrade.
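The numeric_only workaround looks like this (a minimal sketch on a hypothetical mixed-type DataFrame, not the data from this article):

```python
import pandas as pd

# Hypothetical mixed-type frame: "user" is non-numeric, "connections" is numeric.
df = pd.DataFrame({
    "user": ["A", "B"],
    "connections": [3.0, 4970.0],
})

# Restrict the aggregation to numeric columns up front, so the result
# never depends on NumPy's object-to-number casting path.
means = df.mean(numeric_only=True)
print(means)
```

With numeric_only=True, the non-numeric "user" column is dropped before the reduction ever runs, so the buggy cast is never attempted.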
NumPy Bug
If we look at the code in NumPy for converting objects into other types:
static void
OBJECT_to_@TOTYPE@(void *input, void *output, npy_intp n,
                   void *NPY_UNUSED(aip), void *aop)
{
    PyObject **ip = input;
    @totype@ *op = output;

    npy_intp i;
    int skip = @skip@;

    for (i = 0; i < n; i++, ip++, op += skip) {
        if (*ip == NULL) {
            @TOTYPE@_setitem(Py_False, op, aop);
        }
        else {
            @TOTYPE@_setitem(*ip, op, aop);
        }
    }
}
This code does not quit when there is a problem in the @TOTYPE@_setitem call. @ahaldane discovered the problem and fixed the code to check the return value:
static void
OBJECT_to_@TOTYPE@(void *input, void *output, npy_intp n,
                   void *NPY_UNUSED(aip), void *aop)
{
    PyObject **ip = input;
    @totype@ *op = output;

    npy_intp i;
    int skip = @skip@;

    for (i = 0; i < n; i++, ip++, op += skip) {
        if (*ip == NULL) {
            if (@TOTYPE@_setitem(Py_False, op, aop) < 0) {
                return;
            }
        }
        else {
            if (@TOTYPE@_setitem(*ip, op, aop) < 0) {
                return;
            }
        }
    }
}
Without quitting the loop, subsequent calls probably invoke some CPython code that was changed in 3.7 to call PyErr_Clear, which discards the pending error. By the way, if that code looks strange to you, it's because NumPy uses its own template engine.
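As a sanity check, on a NumPy where casting errors propagate correctly, casting a non-numeric string in an object array to complex raises instead of returning garbage. A minimal probe (the exact exception type can vary between versions, so it is caught broadly here):

```python
import numpy as np

arr = np.array(["not a number"], dtype=object)

# On a fixed NumPy this cast raises; under the buggy NumPy/Python 3.7
# combination the error can be silently swallowed instead, leaving
# uninitialized values in the output array.
try:
    arr.astype(np.complex128)
    cast_raised = False
except Exception:
    cast_raised = True
print(cast_raised)
```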
Pandas Impact
This can certainly have more impact than what I'm describing here, but the most immediate impact is that aggregating DataFrames with mixed types sometimes produces complex results. To illustrate the unpredictability of this problem, try the following example:
import pandas as pd

df = pd.DataFrame({
    "user": ["A", "A", "A", "A", "A"],
    "connections": [3.0, 4970.0, 4749.0, 4719.0, 4704.0],
})
df['connections2'] = df.connections.astype('int64')

print()
print('usually incorrect')
print()
print(df.mean())
print()
print(df.head())
print()
print('usually correct')
print()
print(df.mean())
I consistently get some output that looks like this:
usually incorrect

user            (1.38443408503753e-310+1.38443408513886e-310j)
connections      (1.3844303826283e-310+1.3844336097506e-310j)
connections2                                         (3829+0j)
dtype: complex128

  user  connections  connections2
0    A          3.0             3
1    A       4970.0          4970
2    A       4749.0          4749
3    A       4719.0          4719
4    A       4704.0          4704

usually correct

connections     3829.0
connections2    3829.0
dtype: float64
To further illustrate the unpredictability: if I take out the call to print(df.head()), then I get complex results almost every single time. If I put the initialization of connections2 into the DataFrame constructor, I almost always get back floating point results; however, sometimes I incorrectly get a mean of "user" as 0.0.
The reason this happens is that the _reduce method relies on exceptions occurring when the reduction function is applied to .values. If that fails, Pandas attempts to extract only the numeric columns and then applies the function again. There are two attempted invocations of the function (here and here in the Pandas source) before we actually get to the pure numeric data.
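That fallback can be sketched roughly like this (a hypothetical simplification of Pandas' internal _reduce, not the actual source; sketch_reduce is an illustrative name):

```python
import numpy as np
import pandas as pd

def sketch_reduce(df, func=np.mean):
    # First attempt: apply the reduction to the raw .values. With mixed
    # dtypes this is an object array, and the reduction should raise.
    try:
        return func(df.values, axis=0)
    except Exception:
        # Fallback: keep only numeric columns and try again. If the
        # first attempt never raises (the NumPy bug), we never get here.
        numeric = df.select_dtypes(include=[np.number])
        return func(numeric.values, axis=0)

result = sketch_reduce(pd.DataFrame({"user": ["A", "B"], "x": [1.0, 3.0]}))
print(result)
```

When the casting error is swallowed, the first attempt "succeeds" on uninitialized memory, and the carefully guarded fallback never runs, which is why the bad values surface all the way up in .mean().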
Conclusions
Stay on Python 3.6 for now. When NumPy releases the fix, upgrade to that version.
Thanks For Reading!
We are Open Source Answers, and we provide high-bandwidth support directly from the open source developers who wrote the tools you are using. Video conference with an open source developer instead of spending hours on Google and Stack Overflow. Contact Us!
Originally published at www.opensourceanswers.com.
You shouldn’t use Python 3.7 for data science right now was originally published in Hacker Noon on Medium.