Dive into python 3 chapter 13
pickle
and json
are both data structures that capture “your
progress so far” needs to be stored on disk when you quit, then loaded from disk when you relaunch.
pickle
can store All the native datatypes and Lists, tuples, dictionaries, and sets containing them. But pickle is python-specific, while json is “explicitly designed to be usable across multiple programming languages”. (BTW, the full name of JSON is JavaScript Object Notation). In contrast, json can not serialize tuple 1 and bytes, that’s because JSON data format is text-based, such difference will be further explored when we try to store and read from pickle and json files.
dump and load of each
dump
is a frequently-used function for storing things, for example, if we get such a dictionary named entry
:
>>>entry
{'title': "A lesson about 'pickle' module",
'article_link': None,
'internal_id': b'\xde\xd5\xb4\xf8',
'tags': ('diveintopython', 'docbook', 'module'),
'published': True,
'published_date': time.struct_time(tm_year=2018, tm_mon=2, tm_mday=2, tm_hour=14, tm_min=11, tm_sec=20, tm_wday=4, tm_yday=33, tm_isdst=-1)}
we can dump the entry dictionary into a pickle file
>>> import pickle
>>> with open('entry.pickle', 'wb') as f:
... pickle.dump(entry, f)
The module serializes the data structure using a data format called “the pickle protocol” and the latest pickle protocol is a binary format, so we use wb
.
>>> with open('entry.pickle', 'rb') as f:
... entry = pickle.load(f)
…and we can easily load it using the same fashion.
The pickle.dump() / pickle.load() cycle results in a new data structure that is equal to the original data structure. However, when you test it with is
, it is still two different object.
we can also serialize to a bytes object in memory:
>>> b = pickle.dumps(entry)
>>> type(b)
<class 'bytes'>
>>> entry3 = pickle.loads(b)
>>> entry3 == entry
True
For pickle has different versions of protocols, you can find out the versions with pickletools.dis()
function in pickle module.
Nearly same stories for json module
>>> import json
>>> with open('basic.json', mode='w', encoding='utf-8') as f:
... json.dump(basic_entry, f)
just to notice that JSON is text-based, so you have to specify a character encoding.
When you ‘cat’ the json file, it looks neat, and you can makes it even better by giving a indent
parameter when you use the dump()
function. Python documentation describes that:
JSON is human-readable, while pickle is not
However, json has some ugly parts …
dealing with datatypes unsupported by json
For JSON cannot encode bytes, when trying to serialize the entry
object mentioned at the beginning, it raises a TypeError
exception. As a alternative, we can keep this bytes object using some custom serializer, i.e. convert the object to a JSON-supported
datatype
def to_json(python_object):
if isinstance(python_object, bytes):
return {'__class__' : 'bytes', '__value__': list(python_object)}
raise TypeError(repr(python_object) + 'is not JSON serializable')
we use list()
function to convert the bytes object into a list of integers
And when dumping, we pass a parameter to default
, and that parameter is the to_json
function you defined. It is required that default should be a function that gets called for objects that can’t otherwise be serialized.
one of the values in the entry
dictionary, the ‘published_date’ is a time.struct_time object returned by a function named strptime() (the function parses a string representing time in a certain format). time.struct_time
is a class object with a named tuple interface, and cannot be properly serialized either as described by author.
But it seems that in Python3.6.4, such object is serialized as an array of integers
{"title":"A lesson about 'pickle' module",
...
"published_date": [2018, 2, 2, 14, 11, 20, 4, 33, -1]}
It’s kind of a mystery to me.
When we want to load it, we want to get back the original data, so we need another function to take a custom-converted JSON object and convert it back to the original Python datatype.
def from_json(json_object):
if '__class__' in json_object:
if json_object['__class__'] == 'bytes':
return bytes(json_object['__value__'])
return json_object
To hook the from_json() function into the deserialization process, pass it as the object_hook parameter to the json.load() function.
>>> import customserializer
>>> with open('entry.json', 'r', encoding='utf-8') as f:
... entry = json.load(f, object_hook=customserializer.from_json)
The return value of object_hook will be used instead of the default conversion, so you will have to return the original one if is not a custom-converted object.
Note also that the deserialized one have a list of tags
, instead of a tuple.1
FULL customserializer file here
Bonus: pickel can serialize class as well, here is the official guidance.