Python 2/3 compatibility layer for Pickle

pickle-compat

TL;DR

To make your pickle forward- and backward-compatible between Python versions, use this:

pip install pickle-compat

Then monkey-patch your pickle library with this:

import pickle_compat

pickle_compat.patch()

From this point on, you can safely assume that what's pickled with pickle.dumps() in Python 2 can be converted back to the real object in Python 3 with pickle.loads(), and vice versa.

If you want to roll back the patch, use:

pickle_compat.unpatch()

Problem Statement

You were always aware that pickle is unsafe and hard to debug, and that backward-incompatibility issues may bite you when you upgrade Python. You also heard that you should never use pickle in a multi-language environment because it's Python-specific.

You knew it all, but you considered it "good enough" for your case. You worked on a monolith application, and pickle provided a serialization mechanism that works out of the box for anything you can create from your Python code.

Then came the time to migrate to Python 3. Anxious, you postponed it for your big legacy app for as long as you could, but you can't delay it any further. This was when you realized that Python2 and Python3 are not two versions of the same language, but actually two different languages which happen to share some code constructs.

OK, now all of a sudden, you have ended up with a multi-language environment, where you need to read pickle content serialized by Python2 from your code in Python3. If you're making a gradual migration, the opposite is also true.

First frustrations

Things work out of the box only for the simplest cases.

$ python2 -c 'import pickle; print pickle.dumps("Hello world")' | python3 -c 'import pickle, sys; print(repr(pickle.load(sys.stdin.buffer)))'
'Hello world'

All of a sudden, things start to break in the most unexpected places. For example, Python3 fails to unpickle a datetime pickled by Python2 and raises the error every Python developer dreads, a UnicodeDecodeError.

$ python2 -c 'import pickle, datetime; print pickle.dumps(datetime.datetime.utcnow())' | python3 -c 'import pickle, sys; print(repr(pickle.load(sys.stdin.buffer)))'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 1: ordinal not in range(128)

Let's follow the rabbit to learn a bit more about pickle, just enough to make it work across Python2 and Python3. At this point, I'm not sure how to make a smooth transition from where you are to where I want us to be, so I'll start throwing random facts at you in the hope that they build a more or less consistent picture in your head.

Protocol versions

Pickle has several so-called "protocols," or formats in which the data can be written. You can optionally pass the protocol version to pickle.dumps(). The default protocol in Python 2.7 is 0 (also known as the ASCII format), but Python 2.7 can read and write protocols 1 and 2 as well. Protocols 1 and 2 are not ASCII-safe, but they are more compact and faster.

>>> pickle.dumps("hello")
"S'hello'\np0\n."
>>> pickle.dumps("hello", protocol=1)
'U\x05helloq\x00.'
>>> pickle.dumps("hello", protocol=2)
'\x80\x02U\x05helloq\x00.'

In Python 3, Guido introduced a new version of the protocol, intentionally making it backward-incompatible with Python 2.7. See the commit. The comment around the DEFAULT_PROTOCOL constant warns, "We intentionally write a protocol that Python 2.x cannot read; there are too many issues with that."

The main takeaway for us is that if we want backward- and forward-compatible code, we can only use the protocols that both Python2 and Python3 understand: 0 to 2 inclusive.
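To check this rule mechanically, here is a minimal sketch for Python 3. The helper name is_py2_readable is my own invention for illustration, not part of pickle or pickle_compat. It only looks at opcodes and the declared protocol, so it says nothing about whether the classes inside the pickle exist on the Python2 side.

```python
import pickle
import pickletools


def is_py2_readable(data):
    """Return True if the pickle only uses features of protocols 0-2."""
    for opcode, arg, _pos in pickletools.genops(data):
        # Every opcode records the protocol version that introduced it.
        if opcode.proto > 2:
            return False
        # The PROTO header carries the declared protocol as its argument.
        if opcode.name == "PROTO" and arg > 2:
            return False
    return True


record = {"name": "example", "values": [1, 2, 3]}
print(is_py2_readable(pickle.dumps(record, protocol=2)))  # True
print(is_py2_readable(pickle.dumps(record, protocol=3)))  # False
```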

Pickle format and pickletools

The pickletools module calls itself "executable documentation" for the pickle module. I highly recommend opening its source code and reading the extensive introduction, which starts with the words "A pickle is a program for a virtual pickle machine." Another useful feature of pickletools is that it provides a readable disassembly of a pickle.

$ python2
>>> import pickle, pickletools
>>> pickletools.dis(pickle.dumps("hello"))
    0: S    STRING     'hello'
    9: p    PUT        0
   12: .    STOP
highest protocol among opcodes = 0

Here the main takeaway is that data in a pickle is represented as a sequence of "opcode, argument" pairs, where the opcode determines, roughly speaking, the type of the element that follows. The list of opcodes is quite extensive and keeps growing. You can find them here.
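If you prefer to walk the opcode stream from code instead of reading the dis() output, pickletools.genops() yields one (opcode, argument, position) triple per instruction. A small sketch, run under Python 3 this time, where a text string in protocol 0 is written with the UNICODE opcode rather than STRING:

```python
import pickle
import pickletools

# Protocol 0 keeps the stream short and human-readable.
data = pickle.dumps("hello", protocol=0)

for opcode, arg, pos in pickletools.genops(data):
    print(pos, opcode.name, repr(arg))

# Prints:
# 0 UNICODE 'hello'
# 7 PUT 0
# 10 STOP None
```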

Strings and bytes

Let's find out how text and bytes are represented in Python2 and Python3, and what the differences between them are. We'll use protocol version 2 for comparison. It's no surprise that Python2 encodes strings and bytes as BINSTRING (SHORT_BINSTRING for short values) and Unicode objects as BINUNICODE.

$ python2
>>> import pickle, pickletools
>>> pickletools.dis(pickle.dumps("foo", protocol=2))
    0: \x80 PROTO      2
    2: U    SHORT_BINSTRING 'foo'
    7: q    BINPUT     0
    9: .    STOP
highest protocol among opcodes = 2
>>> pickletools.dis(pickle.dumps(b"foo", protocol=2))
    0: \x80 PROTO      2
    2: U    SHORT_BINSTRING 'foo'
    7: q    BINPUT     0
    9: .    STOP
highest protocol among opcodes = 2
>>> pickletools.dis(pickle.dumps(u"foo", protocol=2))
    0: \x80 PROTO      2
    2: X    BINUNICODE u'foo'
   10: q    BINPUT     0
   12: .    STOP
highest protocol among opcodes = 2

In contrast, Python3 doesn't want to deal with "strings," as the name is ambiguous, and prefers BINBYTES and BINUNICODE. Here is how they are encoded in protocol 3, which isn't meant to be compatible with Python2.

$ python3
>>> import pickle, pickletools
>>> pickletools.dis(pickle.dumps(b"foo", protocol=3))
    0: \x80 PROTO      3
    2: C    SHORT_BINBYTES b'foo'
    7: q    BINPUT     0
    9: .    STOP
highest protocol among opcodes = 3
>>> pickletools.dis(pickle.dumps(u"foo", protocol=3))
    0: \x80 PROTO      3
    2: X    BINUNICODE 'foo'
   10: q    BINPUT     0
   12: .    STOP
highest protocol among opcodes = 2

Here come two questions:

  • How does Python3 encode bytes in protocol 2, given that protocol 2 knows nothing about BINBYTES?
  • How does Python3 decode the BINSTRING type, given that it's a Python2 type, and it's ambiguous?

Answering the first question is easy. The pickler introduces a backward-compatible hack.

$ python3
>>> pickletools.dis(pickle.dumps(b'foo', protocol=2))
    0: \x80 PROTO      2
    2: c    GLOBAL     '_codecs encode'
   18: q    BINPUT     0
   20: X    BINUNICODE 'foo'
   28: q    BINPUT     1
   30: X    BINUNICODE 'latin1'
   41: q    BINPUT     2
   43: \x86 TUPLE2
   44: q    BINPUT     3
   46: R    REDUCE
   47: q    BINPUT     4
   49: .    STOP
highest protocol among opcodes = 2

Translated back to Python, the pickler stores the byte sequence as a Unicode object, pushes it onto the stack, and tells the unpickler to execute the following:

import _codecs
_codecs.encode(u"foo", "latin1")

A side note: I did not know this, but apparently you can safely convert any byte sequence to Unicode and back using latin1.

$ python3
>>> import os
>>> s = os.urandom(100000)
>>> s == s.decode('latin1').encode('latin1')
True

It also works for Python2, so we shouldn't care much about the backward compatibility.

Now, how does Python3 decode BINSTRING opcodes? From the first example, we can see that a Python2 string becomes a Unicode object in Python3. In other words, the unpickler tries to convert bytes to Unicode.

$ python2 -c 'import pickle; print pickle.dumps("Hello world")' | python3 -c 'import pickle, sys; print(repr(pickle.load(sys.stdin.buffer)))'
'Hello world'

At this point, you're probably asking yourself which encoding it uses. Fortunately, the answer is right there in the documentation. The Python 3 unpickler accepts an "encoding" parameter that defaults to ASCII.

The encoding and errors tell pickle how to decode 8-bit string instances pickled by Python 2; these default to ‘ASCII’ and ‘strict’, respectively. The encoding can be ‘bytes’ to read these 8-bit string instances as bytes objects. Using encoding='latin1' is required for unpickling NumPy arrays and instances of datetime, date and time pickled by Python 2.

If you wonder what's wrong with datetime, here's what its pickle looks like in Python2.

$ python2

>>> import pickle, pickletools, datetime
>>> pickletools.dis(pickle.dumps(datetime.datetime.utcnow(), protocol=2))
    0: \x80 PROTO      2
    2: c    GLOBAL     'datetime datetime'
   21: q    BINPUT     0
   23: U    SHORT_BINSTRING '\x07\xe4\x05\x1a\x0f\x01\x16\x00\x96\x10'
   35: q    BINPUT     1
   37: \x85 TUPLE1
   38: q    BINPUT     2
   40: R    REDUCE
   41: q    BINPUT     3
   43: .    STOP
highest protocol among opcodes = 2

Here comes yet another surprise for me: datetime constructor can accept a byte sequence to initialize its internal state, and pickle takes advantage of this.

>>> import datetime
>>> datetime.datetime(b'\x07\xe4\x05\x1a\x0f\x01\x16\x00\x96\x10')
datetime.datetime(2020, 5, 26, 15, 1, 22, 38416)

Setting the encoding to "latin1" seems to work.

python2 -c 'import pickle, datetime; print pickle.dumps(datetime.datetime.utcnow())' | python3 -c 'import pickle, sys; print(repr(pickle.load(sys.stdin.buffer, encoding="latin1")))'
datetime.datetime(2020, 5, 26, 15, 19, 6, 275120)

The main takeaway is that strings in Python2 are converted to Unicode objects in Python3, and you can control the encoding.
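If you don't have a Python2 interpreter at hand, you can reproduce the datetime behavior from Python 3 alone. The sketch below hand-assembles the protocol 2 bytes from the disassembly shown earlier (the constant name PY2_DATETIME is mine) and loads them with different encodings. It assumes Python 3.7 or newer, where datetime accepts the latin1-decoded state.

```python
import datetime
import pickle

# Bytes rebuilt from the pickletools.dis() listing of the Python2 pickle.
PY2_DATETIME = (
    b"\x80\x02cdatetime\ndatetime\nq\x00"          # PROTO 2, GLOBAL, BINPUT 0
    b"U\n\x07\xe4\x05\x1a\x0f\x01\x16\x00\x96\x10"  # SHORT_BINSTRING, 10 bytes
    b"q\x01\x85q\x02Rq\x03."                        # TUPLE1, REDUCE, STOP
)

# The default encoding is ASCII, and byte 0xe4 is not ASCII.
try:
    pickle.loads(PY2_DATETIME)
    default_ok = True
except UnicodeDecodeError:
    default_ok = False
print(default_ok)  # False

restored = pickle.loads(PY2_DATETIME, encoding="latin1")
print(repr(restored))  # datetime.datetime(2020, 5, 26, 15, 1, 22, 38416)
```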

Non-latin strings in Python2

Hopefully, by this point, you have converted all your non-ASCII strings to Unicode objects, because if you haven't, you're in trouble.

python2 -c 'import pickle; print pickle.dumps("©")' | python3 -c 'import pickle, sys; print(repr(pickle.load(sys.stdin.buffer, encoding="latin1")))'
'Â©'

To work around this, you need to use UTF-8, which works for this case.

python2 -c 'import pickle; print pickle.dumps("©")' | python3 -c 'import pickle, sys; print(repr(pickle.load(sys.stdin.buffer, encoding="utf8")))'
'©'

Unfortunately, it will not work for datetimes and other binary strings that don't represent a valid UTF-8 sequence.

Well, we were so close to victory, and now we're back to square one. What are we going to do? Fortunately, there's a documented escape hatch, the "bytes" encoding. This encoding looks like precisely what we need. It doesn't try to outsmart you and convert bytes to something that looks like a string. Instead, it returns bytes as bytes objects. Even better than "latin1"!

python2 -c 'import pickle; print pickle.dumps("©")' | python3 -c 'import pickle, sys; print(repr(pickle.load(sys.stdin.buffer, encoding="bytes")))'
b'\xc2\xa9'

Datetime objects also work. Is this a victory? Not so fast.
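To see the three encodings side by side without a Python2 interpreter, here is a sketch that loads a hand-assembled protocol 2 pickle of the Python2 string "©" (the UTF-8 bytes 0xc2 0xa9) under each of them. The constants are my own test fixtures. PY2_BINARY stands in for binary state such as the first two bytes of the datetime above.

```python
import pickle

# PROTO 2, SHORT_BINSTRING of length 2, BINPUT 0, STOP.
PY2_COPYRIGHT = b"\x80\x02U\x02\xc2\xa9q\x00."
PY2_BINARY = b"\x80\x02U\x02\x07\xe4q\x00."

decoded = {
    encoding: pickle.loads(PY2_COPYRIGHT, encoding=encoding)
    for encoding in ("latin1", "utf8", "bytes")
}
for encoding, value in decoded.items():
    print(encoding, repr(value))
# latin1 'Â©'
# utf8 '©'
# bytes b'\xc2\xa9'

# UTF-8 chokes on binary state that is not a valid UTF-8 sequence.
try:
    pickle.loads(PY2_BINARY, encoding="utf8")
    binary_ok = True
except UnicodeDecodeError:
    binary_ok = False
print(binary_ok)  # False
```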

Objects with attributes

Consider the file foo.py, and let's try to serialize foo.foo.

class Foo(object):
    a = 'UNSET'
    b = 'UNSET'
    def __init__(self):
        self.a = 1
        self.b = 2
    def __repr__(self):
        return 'Foo(%s, %s)' % (self.a, self.b)

foo = Foo()

As long as we use the default settings, we're good.

$ python2 -c 'import pickle, foo; print pickle.dumps(foo.foo)' | python3 -c 'import pickle, sys; print(repr(pickle.load(sys.stdin.buffer)))'

Foo(1, 2)

But if we pass "bytes" as an argument, all of a sudden something goes wrong.

python2 -c 'import pickle, foo; print pickle.dumps(foo.foo)' | python3 -c 'import pickle, sys; print(repr(pickle.load(sys.stdin.buffer, encoding="bytes")))'

Foo(UNSET, UNSET)

We lost the values of the attributes a and b. Where did they go? The same pickletools.dis() helps us find the answer:

$ python2
>>> import pickle, pickletools, foo
>>> pickletools.dis(pickle.dumps(foo.foo, protocol=2))
    0: \x80 PROTO      2
    2: c    GLOBAL     'foo Foo'
   11: q    BINPUT     0
   13: )    EMPTY_TUPLE
   14: \x81 NEWOBJ
   15: q    BINPUT     1
   17: }    EMPTY_DICT
   18: q    BINPUT     2
   20: (    MARK
   21: U        SHORT_BINSTRING 'a'
   24: q        BINPUT     3
   26: K        BININT1    1
   28: U        SHORT_BINSTRING 'b'
   31: q        BINPUT     4
   33: K        BININT1    2
   35: u        SETITEMS   (MARK at 20)
   36: b    BUILD
   37: .    STOP
highest protocol among opcodes = 2

The pickle loader doesn't call __init__. Instead, it creates a new empty "dummy" object of the class Foo and populates its state by updating the __dict__. If this were Python, we could write it like this:

import foo

obj = object.__new__(foo.Foo)
obj.__dict__.update({"a": 1, "b": 2})

I think now you understand what went wrong. Because of the bytes encoding, the keys b"a" and b"b" were not converted to Python3 strings. You can put anything into an object's __dict__, but only the keys that are strings are exposed as proper object attributes.

The next command prints the __dict__ of the object and proves that we were right.

python2 -c 'import pickle, foo; print pickle.dumps(foo.foo)' | python3 -c 'import pickle, sys; print(pickle.load(sys.stdin.buffer, encoding="bytes").__dict__)'

{b'a': 1, b'b': 2}
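You can reproduce the effect in plain Python 3 without any pickle involved. This sketch builds two instances the way the unpickler does, once with bytes keys and once with string keys. The variable names broken and healthy are mine.

```python
class Foo(object):
    a = "UNSET"
    b = "UNSET"


# What the unpickler effectively does with encoding="bytes".
broken = object.__new__(Foo)
broken.__dict__.update({b"a": 1, b"b": 2})

# What it does when the keys are decoded to str.
healthy = object.__new__(Foo)
healthy.__dict__.update({"a": 1, "b": 2})

# Attribute lookup only ever asks the instance dict for str keys, so the
# bytes keys are invisible and the class-level defaults show through.
print(broken.a, broken.b)    # UNSET UNSET
print(healthy.a, healthy.b)  # 1 2
```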

OK, we can't use ASCII, latin1, or utf8 as the encoding, and now we've learned that we can't use bytes either. It looks like a dead end. Unless you turn to the last resort, dirty and evil: monkey-patching.

Monkeypatching the unpickler

Before we go straight to this topic, there's one remark about Python3 pickle. It uses the fast version implemented in C if possible, and if it's not, it falls back to the slow pure-python implementation. See the code.

We plan to subclass the standard unpickler with our version that overrides the handler of the BUILD opcode. We can use this unpickler directly or monkey-patch the original pickle module to call it implicitly. The method that we need to override is load_build. If you read the code, you can see that the builder looks for the __setstate__ method of the object, and if none is found, falls back to assigning via __dict__.

Let's follow the path of modifying __dict__ before assignment because it looks less invasive than messing with __setstate__.

I ended up with the code that you can find in pickle_compat.compat and load with pickle_compat.patch(). It works!

python2 -c 'import pickle, foo; print pickle.dumps(foo.foo)' | python3 -c 'import pickle, sys, pickle_compat; pickle_compat.patch(); print(pickle.load(sys.stdin.buffer))'

Foo(1, 2)

It also works with non-ASCII strings and datetime objects.
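To make the idea concrete, here is a minimal sketch of such a subclass. It is not the code from pickle_compat.compat, just an illustration of the approach under my own assumptions: it builds on the private pure-Python class pickle._Unpickler, decodes bytes keys as UTF-8, and leaves objects with __setstate__ alone. The class name KeyDecodingUnpickler, the hand-assembled PY2_FOO bytes, and the stand-in foo module are all made up for the example.

```python
import io
import pickle
import sys
import types


class KeyDecodingUnpickler(pickle._Unpickler):
    """Pure-Python unpickler that turns bytes keys of instance state into str."""

    # Copy the dispatch table so the override doesn't leak into pickle itself.
    dispatch = dict(pickle._Unpickler.dispatch)

    def load_build(self):
        state = self.stack[-1]  # top of the stack: the state to apply
        inst = self.stack[-2]   # right below it: the instance being built
        if isinstance(state, dict) and getattr(inst, "__setstate__", None) is None:
            self.stack[-1] = {
                (key.decode("utf8") if isinstance(key, bytes) else key): value
                for key, value in state.items()
            }
        pickle._Unpickler.load_build(self)

    dispatch[pickle.BUILD[0]] = load_build


class Foo(object):
    a = "UNSET"
    b = "UNSET"

    def __repr__(self):
        return "Foo(%s, %s)" % (self.a, self.b)


# Stand-in for foo.py, so that the GLOBAL 'foo Foo' opcode can be resolved.
foo = types.ModuleType("foo")
foo.Foo = Foo
sys.modules["foo"] = foo

# Bytes rebuilt from the pickletools.dis() listing of the Python2 pickle.
PY2_FOO = (
    b"\x80\x02cfoo\nFoo\nq\x00)\x81q\x01}q\x02("
    b"U\x01aq\x03K\x01U\x01bq\x04K\x02ub."
)

restored = KeyDecodingUnpickler(io.BytesIO(PY2_FOO), encoding="bytes").load()
print(restored)  # Foo(1, 2)
```

The price of this sketch is speed: it bypasses the C unpickler entirely. Values stay bytes, only attribute names are decoded.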

Old-style classes

We are almost there, except for one thing: old-style classes. As you know, in Python3, every class inherits from object, while in Python2, a class that doesn't explicitly inherit from object is an "old-style" class, whose type is "classobj" rather than "type". Old-style classes are considered outdated, but they are still used in different places of the standard library, waiting to ruin your life at the most unexpected moment.

This time we talk about forward-compatibility and want to make sure that anything that is pickled in Python3 can be successfully unpickled in Python2.

Let's take an object that is an old-style class in Python2.

python3 -c 'import pickle, smtplib, sys; sys.stdout.buffer.write(pickle.dumps(smtplib.SMTP(), protocol=2))' | python2 -c 'import pickle, sys; print pickle.load(sys.stdin)'

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "2.7.15/lib/python2.7/pickle.py", line 1384, in load
    return Unpickler(file).load()
  File "2.7.15/lib/python2.7/pickle.py", line 864, in load
    dispatch[key](self)
  File "2.7.15/lib/python2.7/pickle.py", line 1089, in load_newobj
    obj = cls.__new__(cls, *args)
AttributeError: class SMTP has no attribute '__new__'

The approach is similar to the previous one: find out how the unpickler loads new objects and then patch it to check whether the class is old-style. The Python2 implementation lives here.

Note that the protocol version 0 doesn't contain a NEWOBJ opcode and uses a set of workarounds to make it work, so this approach will only work for version 2 of the protocol.
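You can confirm the difference between the protocols from the Python3 side. The sketch below pickles a plain instance with protocols 0 and 2 and looks for NEWOBJ in the opcode stream. I use argparse.Namespace only as a convenient stand-in for "an ordinary class from the standard library"; the helper opcode_names is hypothetical.

```python
import argparse
import pickle
import pickletools


def opcode_names(data):
    """Return the names of all opcodes in a pickle, in order."""
    return [opcode.name for opcode, _arg, _pos in pickletools.genops(data)]


instance = argparse.Namespace(answer=42)

proto2 = opcode_names(pickle.dumps(instance, protocol=2))
proto0 = opcode_names(pickle.dumps(instance, protocol=0))

print("NEWOBJ" in proto2)  # True
print("NEWOBJ" in proto0)  # False
# Protocol 0 rebuilds the instance through a REDUCE call instead.
print("REDUCE" in proto0)  # True
```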

Putting it all together

What we learned

  • The default version of the protocol has to be 2, both for Python 2 and Python 3
  • We must prevent automatic conversion from bytes to strings by passing "bytes" as the encoding to the unpickler in Python 3
  • We must patch Unpickler in Python 3 to set object attributes properly.
  • We must patch Unpickler in Python 2 to correctly unpickle instances of old-style classes.

Also, we learned some of the internals of pickle and learned how to use pickletools. Finally, we wrapped everything with a pickle_compat library that monkey-patches the standard pickle module.
