Incremental Sync in ownCloud
Incremental Sync is probably the feature that most people ask, or even sometimes cry for. Recently there was another wave of discussion about ownCloud is doing incremental sync or not.
I will try again (as in this issue) to explain why we decided to slowing that feature. Slowing means that it will be done later, not never, as it was stated. It is just that we think that other things benefit the whole idea of ownCloud more. That has plain technical reasons. Let’s dive a bit into.
RSync is great
Nobody will object here. In a nutshell, this is how rsync works: There is a file on the client and on the server. The idea is to not transfer the entire file from one side to the other if either side changes, but only the parts that have changed.
The original rsync does that by chopping the file to blocks of a given size and calculating a checksum of each of the blocks. The list of checksums is sent to the server and – here’s the trick – the server looks at its version of the file and for each of the checksum in the list, it seeks if it finds the same block in the file. That will often not be at the same position in the file, but maybe somewhere else. That is done for each block, and finally the server will work out the information of which parts of the file are existing and which are not and have to be sent by the client.
By way of this clever algorithm, we will just have to transmit a very small fraction of the changed file, because most content did not change. And that is what we want! Yeah!
Mission accomplished? No, not really. While there is basically nothing wrong with the idea in general, there is a severe architectural downside. The rsync algorithm depends on a strong server component which, for each file, searches around and calculates checksums. In an environment where we potentially have a lot of clients connecting to one server that would create a huge load on it which we need to avoid. So what if instead of putting the burden on the server’s shoulder, we could make the clients take the responsibility?
And guess what, there has been somebody thinking about that before and he says:
Use ZSync for this!
ZSync basically turns the idea of rsync upside down and shifts the calculation of checksums away from the server and onto the clients. That means that with zsync, the server can keep a static list of checksums for every block specific to a version of a file. The list can for example be computed along the upload of the file to the server. From that point it does not change, as long as the file does not change. That means less computation work for the server, and maybe this job can also put into the client.
So far that sounds cool (even though some questions remain) and sounds like something that can help us.
Unfortunately, the approach does not work very well for compressed files. The reason is that if a file gets compressed, even if only a couple of bytes in the original file change, the compression algorithm usually changes a lot all over the entire file. As a result, the zsync algorithm can only compute a comparably large diff. Given the cost of computation that can turn inefficient quickly.
“But who uses compressed files?” you might argue. The problem is that almost each and every of the files in everyday life are stored compressed. This is for example true for Microsoft Office files and the Open Document files produced by LibreOffice and Apache OpenOffice. They are really renamed ZIP containers, that hold the documents with all its embedded files, etc.
Now of course you will reply that zsync has an improved algorithm for compressed files. Yes, true, that is a great thing. However, it involves that the compressed file gets uncompressed to be worked on by zsync. Afterwards it is compressed again. And that is the problem: As common compressors do not leave a hint behind _how_ the file was compressed, it is not possible to reliably recreate a file that is equivalent to the original one. How will apps react on a file that has changed its compression scheme?
As said above, yes, we will at one point of time implement something along the zsync algorithm. The explanations above should show however, that at the current state of ownCloud, other features will improve ownClouds performance, stability and convenience more. And that is the important thing for us, more than pleasing the loudest barking dogs.
Here is a rough outline of how I would move on on this, open for your suggestions and critique:
The zsync algorithm is designed to improve downloads. We need it for both up- and downloads, and it needs to be thought through if that is also possible. For the server side functionality, there are a couple of open questions which carefully have to be investigated. Preferably an app can be written that provides the handling of the zsync checksum lists. That has to be clarified and discussed, and that will take a while.
But as outlined above, this idea is only clever for a limited amount of file types. So what I would suggest first is that we get an idea of the file types users usually store in their ownCloud, so that we can do a validated estimate on how this feature helps. I will follow up on this first step.
Thanks for reading this long blog post. Thanks Danimo for lectorate.