I have a (very large) dataset: something on the order of 250,000 binary vectors, each of size 800.
The dataset resides in a plain-ASCII .txt file, in a 'compressed' (sparse) representation: each line in that file lists the indices of the 1s in a vector, rather than spelling out all 800 zeroes and ones.
For example, suppose that the i'th line in that file looks like this:
12 14 16 33 93 123 456 133
This means that the i'th vector has its 12th, 14th, 16th, ..., 133rd indices holding the value 1, and the rest are zeroes.
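For reference, decoding one such line into a full 800-element binary vector can be sketched roughly like this (the function name decode_line is mine, and I'm assuming the indices are 0-based; adjust if they are 1-based):

```python
def decode_line(line, size=800):
    # Start from an all-zero vector and set each listed index to 1.
    vec = [0] * size
    for token in line.split():
        vec[int(token)] = 1
    return vec

# decode_line('12 14 16 33 93 123 456 133') gives a list of 800 entries
# that is 1 at those eight positions and 0 everywhere else.
```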
The file's size is a little more than 30MB.
Now, since I use this data to feed a neural network, it needs some preprocessing to transform it into what the network expects: a list of 250,000 elements, where every element is a 20x40 matrix (a list of lists) of zeros and ones.
For example, if we rescale the problem to 4x2, this is what the final list looks like:
[[[1,0],[1,1],[0,0],[1,0]], [[0,0],[0,1],[1,0],[1,0]], ..., [[1,1],[0,1],[0,0],[1,1]]]
(only instead of 4x2 I have 20x40 matrices).
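As a sketch, turning one flat binary list into rows of a fixed width can be done with slicing (reshape_vector here is my illustration of the idea, not necessarily the original function):

```python
def reshape_vector(flat, rows=20, cols=40):
    # Split a flat list of rows*cols entries into `rows` lists of `cols` each.
    assert len(flat) == rows * cols
    return [flat[r * cols:(r + 1) * cols] for r in range(rows)]

# For the scaled-down 4x2 example:
# reshape_vector([1,0,1,1,0,0,1,0], rows=4, cols=2)
# -> [[1,0],[1,1],[0,0],[1,0]]
```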
So I wrote two functions:
load_data() - which parses the file and returns a list of 800-element binary lists, and
reshape() - which reshapes those lists into 20x40 matrices.
Needless to say, my poor laptop works really hard while load_data() and reshape() are running. The preprocessing takes about 7-9 minutes to complete, during which I can barely do anything else on my laptop; even minimizing the IDE window is an extremely difficult task.
Since I use this data to adjust a neural net, I find myself very often killing the running process, re-tuning the network, and starting again - and every restart means another call to load_data() followed by reshape().
So, I decided to shortcut this painful process of loading the data --> transforming it to binary vectors --> reshaping it.
Instead, I want to load the data from the file once, transform it to binary vectors, reshape it, and serialize the result to a file, my_input.
Then, whenever I need to feed the network, I can just deserialize the data from my_input and spare myself a lot of time.
This is how I did it:
import cPickle

input_file = open('my_input', 'wb')
print 'loading data from file...'
input_data = load_data()  # load the data from file and re-encode it as binary vectors
print 'reshaping...'
reshaped_input = reshape(input_data)
print 'writing to file...'
cPickle.dump(reshaped_input, input_file, cPickle.HIGHEST_PROTOCOL)
input_file.close()
The problem is this:
The resulting file is huge: 1.7GB in size. It seems the game is not worth the candle (I hope I used that idiom right), since it takes too long to load: I didn't measure exactly how long, because after 9-10 minutes I gave up and killed the process.
Why is the resulting file so much bigger than the original? (I'd expect it to be bigger, but not by that much.)
Is there another way to encode the data (serialization-wise) that will result in a smaller file and will be worth my while?
Or, alternatively, if anyone can suggest a better way to speed things up (besides buying a faster computer) that would also be great.
P.S. I don't care about compatibility issues when it comes to deserializing; the only place this data will ever be deserialized is on my computer.