Here is something that might be a little outdated already, but I hope it still adds some interesting thoughts. Today's rainy Sunday afternoon finally gives me the opportunity to write this little blog post.
Recently an ownCloud fork came up with a shiny little box containing a single hard disk, which can be combined with a Raspberry Pi and their software, promoted as your private cloud.
While I like the idea of building a private cloud for everybody (that idea is why I started to work on ownCloud back in the day), I do not think that this kind of gear is a good solution for a private cloud.
In fact, I believe that throwing this kind of implementation on the table is especially unfortunate: if we come up with too many suboptimal proposals, we use up the users' willingness to try them. The idea should not only target geeks who are willing to try things over and over again. The private cloud needs to target every computer user who wants to store data safely, but does not want to care about it any longer than necessary. And with those users I fear we get very few chances, maybe only one, to introduce them to a private cloud solution before they go back to something that simply works.
Here are some of the reasons why I think solutions like the proposed one are not good enough:
That is nothing new: the hardware of the Raspberry Pi was not designed for this kind of use case. It is simply too weak to drive ownCloud, which is a PHP app plus a database server with real demands on the server's power. Even with PHP 7, which is faster, and the latest revisions of the mini computer, it might look okay in the beginning. But once all the necessary bells and whistles have been added to the installation and real data has come in, it will turn out that the CPU power is simply not enough. Similar weaknesses apply to the networking capabilities, for example.
A user who finds that out after a couple of weeks of working with the system will be annoyed and probably go (back) to solutions that we do not fancy.
One Disk Setup
The solution comes as a single-disk setup: how safe can data be that lives on one single hard disk? A seriously engineered solution should at least recommend a way to store the data more securely and/or to back it up, for example to a NAS at home.
That can be done, but requires manual work and might require more network capabilities and CPU power.
Last, but for me the most important point: having such a box in the private network requires drilling a hole into the firewall to allow port forwarding. I know that is nothing unusual for experienced people, and in theory it is a minor problem.
But for people who are not that interested, it means they have to click a button in their router's interface without understanding what it does, and maybe even enter data by following documentation they simply have to trust. (That is not very different from downloading a script from somewhere and letting it make the changes, which I would not recommend either.)
Mistakes here can have a huge impact on the network behind the router, one that the person who made the change may not even be aware of.
DynDNS is also needed: again not a big problem in theory and for geeks, but in practice it is not easily done.
A good private cloud solution should not have to ask users for this kind of setup.
Where to go from here?
There should be better ways to solve these problems with ownCloud, and I am sure ownCloud is the right tool for the job. I will share some thought experiments we did a while back to foster discussion on how we can use the Raspberry Pi with ownCloud (because it is a very attractive piece of hardware) while avoiding the problems above.
This will be the subject of an upcoming blog post here, so please stay tuned.
This is the third and final part of a little blog series about a new chunking algorithm that we discussed in ownCloud. You might also be interested in reading the first two parts, ownCloud Chunking NG and Announcing an Upload.
This part sketches a couple of ideas on how the new chunking could be useful for a future incremental sync (also called delta sync) feature in ownCloud.
In preparation for delta sync, the server could provide another new WebDAV route: remote.php/dav/blocks.
For each file, remote.php/dav/blocks/file-id exists as long as the server has valid checksums for the blocks of the file, which is identified by its unique file id.
A successful reply to remote.php/dav/blocks/file-id returns a JSON-formatted data block with byte ranges and the respective checksums (plus the checksum type) over the data blocks of the file. The client can use that information to calculate which blocks of data have changed and thus need to be uploaded.
If a file was changed on the server and as a result the checksums are no longer valid, access to remote.php/dav/blocks/file-id returns the 404 "Not Found" status code. The client needs to be able to handle missing checksum information at any time.
The server obtains the checksums of the file blocks along with the upload of the chunks from the client. The server is not obliged to calculate checksums for data blocks that arrived in other ways than through clients, but it can if there is spare capacity.
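To make this a little more tangible, here is a minimal sketch of how a client might consume such a blocklist reply. This is illustration only: the JSON field names, the block size and the SHA1 checksum type are assumptions, not a specified format.

```python
# Hypothetical sketch: parse a blocklist reply from remote.php/dav/blocks/file-id
# and find the blocks that differ locally. Field names, block size and the
# checksum algorithm are made up for illustration.
import hashlib
import json

sample_reply = """
{
  "checksumType": "SHA1",
  "blockSize": 10485760,
  "blocks": [
    {"offset": 0,        "checksum": "3f786850e387550fdab836ed7e6dc881de23001b"},
    {"offset": 10485760, "checksum": "89e6c98d92887913cadf06b2adb97f26cde4849b"}
  ]
}
"""

def local_blocklist(path, block_size):
    """SHA1 checksums over fixed-size blocks of the local file."""
    sums = []
    with open(path, "rb") as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            sums.append(hashlib.sha1(block).hexdigest())
    return sums

server = json.loads(sample_reply)
local = local_blocklist("bigfile.bin", server["blockSize"])

# Every block whose checksum differs, or which only exists locally, has
# changed and must be uploaded; all others can be reused on the server.
server_sums = [b["checksum"] for b in server["blocks"]]
changed = [i for i, s in enumerate(local)
           if i >= len(server_sums) or server_sums[i] != s]
```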
To implement incremental sync, the following high-level processing could be used:
- The client downloads the blocklist of the file: GET remote.php/dav/blocks/file-id
- If GET succeeded: the client computes the local blocklist and determines the changed blocks
- If GET failed: All blocks of the file have to be uploaded.
- The client sends MKCOL /uploads/transfer-id as described in an earlier part of this series.
- For blocks that have changed: PUT data to /uploads/transfer-id/part-no
- For blocks that have NOT changed: COPY /blocks/file-id/block-no /uploads/transfer-id/part-no
- Once all blocks are handled, either uploaded or copied: the client sends MOVE /uploads/transfer-id /path/to/target-file to finalize the upload.
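Sketched in code, the request sequence above could look roughly like this. Again a hedged illustration only: the base URL, credentials, transfer id, file id and block contents are placeholders, while the routes and methods are taken from the steps above.

```python
# Illustrative sketch of the upload flow using python-requests.
# transfer_id, file_id, credentials and block contents are placeholders.
import requests

BASE = "https://cloud.example.com/remote.php/dav"
AUTH = ("user", "secret")

transfer_id = "4711"                               # chosen by the client
file_id = "00004711oc8abc"                         # unique id of the file on the server
changed_blocks = {3: b"new content of block 3"}    # part-no -> data to upload
unchanged_blocks = [0, 1, 2]                       # part numbers to reuse

# 1. Announce the upload: create the transfer collection.
requests.request("MKCOL", f"{BASE}/uploads/{transfer_id}", auth=AUTH)

# 2. PUT only the blocks that actually changed.
for part_no, data in changed_blocks.items():
    requests.put(f"{BASE}/uploads/{transfer_id}/{part_no}", data=data, auth=AUTH)

# 3. COPY the unchanged blocks server-side from the blocks route.
for part_no in unchanged_blocks:
    requests.request(
        "COPY",
        f"{BASE}/blocks/{file_id}/{part_no}",
        headers={"Destination": f"{BASE}/uploads/{transfer_id}/{part_no}"},
        auth=AUTH,
    )

# 4. MOVE the assembled transfer onto the target path to finalize the upload.
requests.request(
    "MOVE",
    f"{BASE}/uploads/{transfer_id}",
    headers={"Destination": f"{BASE}/path/to/target-file"},
    auth=AUTH,
)
```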
This would be an extension to the previously described upload of complete files. The PROPFIND semantics on /uploads/transfer-id remain valid.
Depending on the number of unchanged blocks, this could dramatically cut the amount of data that has to be uploaded. More information has to be collected to find out how big the savings actually are.
Note that this is still an idea to be discussed, not yet an agreed specification for a new chunking algorithm.
Please, as usual, share your feedback with us!
Incremental sync is probably the feature that most people ask for, or sometimes even cry for. Recently there was another wave of discussion about whether ownCloud does incremental sync or not.
I will try again (as in this issue) to explain why we decided to postpone that feature. Postponing means it will be done later, not never, as has been claimed. It is just that we think other things benefit the whole idea of ownCloud more right now. There are plain technical reasons for that. Let's dive in a bit.
RSync is great
Nobody will object here. In a nutshell, this is how rsync works: there is a file on the client and on the server. The idea is not to transfer the entire file from one side to the other when one of them changes, but only the parts that have changed.
The original rsync does that by chopping the file into blocks of a given size and calculating a checksum for each of the blocks. The list of checksums is sent to the server and – here's the trick – the server looks at its version of the file and, for each checksum in the list, checks whether it can find the same block in the file. That will often not be at the same position, but maybe somewhere else. This is done for each block, and finally the server works out which parts of the file it already has and which are missing and have to be sent by the client.
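To illustrate the principle (and only the principle), here is a naive sketch. Real rsync uses a cheap rolling checksum plus a stronger hash so it does not have to hash the file at every offset; this simplified version leaves that optimization out.

```python
# Naive illustration of the rsync idea: one side sends per-block checksums,
# the other side searches its own copy of the file for those blocks at any
# offset. (Real rsync uses a rolling weak checksum to make this search cheap.)
import hashlib

BLOCK = 4096

def block_checksums(data: bytes) -> list[str]:
    """Checksums over fixed-size, non-overlapping blocks of the sender's file."""
    return [hashlib.md5(data[i:i + BLOCK]).hexdigest()
            for i in range(0, len(data), BLOCK)]

def match_blocks(own_data: bytes, remote_sums: list[str]):
    """Find which of the remote blocks already exist somewhere in own_data."""
    wanted = set(remote_sums)
    found = {}
    for offset in range(max(len(own_data) - BLOCK + 1, 0)):
        digest = hashlib.md5(own_data[offset:offset + BLOCK]).hexdigest()
        if digest in wanted and digest not in found:
            found[digest] = offset
    # Blocks not found anywhere have to be transferred over the wire.
    missing = [i for i, s in enumerate(remote_sums) if s not in found]
    return found, missing
```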
By way of this clever algorithm, we will just have to transmit a very small fraction of the changed file, because most content did not change. And that is what we want! Yeah!
Mission accomplished? No, not really. While there is basically nothing wrong with the idea in general, there is a severe architectural downside. The rsync algorithm depends on a strong server component which, for each file, searches around and calculates checksums. In an environment where we potentially have a lot of clients connecting to one server, that would create a huge load which we need to avoid. So what if, instead of putting the burden on the server's shoulders, we could make the clients take the responsibility?
And guess what, there has been somebody thinking about that before and he says:
Use ZSync for this!
ZSync basically turns the idea of rsync upside down and shifts the calculation of checksums away from the server and onto the clients. That means that with zsync, the server can keep a static list of checksums for every block, specific to a version of a file. The list can, for example, be computed along with the upload of the file to the server. From that point on it does not change, as long as the file does not change. That means less computation work for the server, and maybe this job can even be pushed to the client.
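As a minimal sketch of that idea, and assuming a made-up storage layout (this is not ownCloud's actual implementation), the checksum list could be recorded while the uploaded file is written, so it can simply be served later instead of being recomputed per request:

```python
# Minimal sketch of the zsync-style shift of work to upload time: while the
# file is written, a static per-block checksum list for this version of the
# file is recorded. Names, block size and storage layout are assumptions.
import hashlib
import json

BLOCK = 1 << 20  # 1 MiB blocks, an arbitrary choice for the example

def store_upload(stream, target_path, blocklist_path):
    sums = []
    with open(target_path, "wb") as out:
        offset = 0
        while True:
            block = stream.read(BLOCK)
            if not block:
                break
            out.write(block)
            sums.append({"offset": offset,
                         "checksum": hashlib.sha1(block).hexdigest()})
            offset += len(block)
    # The list stays valid until the file changes; clients can fetch it later
    # instead of making the server re-read and re-hash the file per request.
    with open(blocklist_path, "w") as bl:
        json.dump({"checksumType": "SHA1", "blocks": sums}, bl)
```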
So far that sounds cool (even though some questions remain) and sounds like something that can help us.
Unfortunately, the approach does not work very well for compressed files. The reason is that when a file is compressed, even a change of only a couple of bytes in the original data usually changes the compressed output all over the file. As a result, the zsync algorithm can only compute a comparably large diff. Given the cost of the computation, that can become inefficient quickly.
“But who uses compressed files?” you might argue. The problem is that almost every file we work with in everyday life is stored compressed. This is for example true for Microsoft Office files and the Open Document files produced by LibreOffice and Apache OpenOffice. They are really renamed ZIP containers that hold the document with all its embedded files and so on.
Now of course you will reply that zsync has an improved algorithm for compressed files. Yes, true, and that is a great thing. However, it requires the compressed file to be uncompressed so zsync can work on it, and compressed again afterwards. And that is the problem: as common compressors do not leave a hint behind about _how_ the file was compressed, it is not possible to reliably recreate a file that is equivalent to the original one. How will applications react to a file whose compression scheme has changed?
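A small, purely illustrative experiment makes the effect visible: compress two almost identical inputs and compare the compressed streams block by block. The block size and input data are arbitrary choices; typically almost no block of the compressed output survives even a tiny edit to the source.

```python
# Illustrative only: show how a tiny change in the source data ripples through
# the compressed byte stream, which is what defeats block-based delta sync.
import hashlib
import zlib

BLOCK = 1024

def block_sums(data: bytes) -> list[str]:
    return [hashlib.md5(data[i:i + BLOCK]).hexdigest()
            for i in range(0, len(data), BLOCK)]

# Document-like input: repetitive enough that DEFLATE actually compresses it.
original = b"".join(b"line %d of an example document\n" % i for i in range(50000))
modified = b"one tiny edit\n" + original[15:]      # change only a few bytes

comp_a, comp_b = zlib.compress(original), zlib.compress(modified)
sums_a, sums_b = block_sums(comp_a), block_sums(comp_b)

identical = sum(a == b for a, b in zip(sums_a, sums_b))
print(f"{len(comp_a)} compressed bytes in {len(sums_a)} blocks, "
      f"{identical} blocks unchanged after the edit")
```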
As said above, yes, at some point we will implement something along the lines of the zsync algorithm. The explanations above should show, however, that in ownCloud's current state other features will improve its performance, stability and convenience more. And that is what matters to us, more than pleasing the loudest barking dogs.
Here is a rough outline of how I would move on with this, open to your suggestions and critique:
The zsync algorithm is designed to improve downloads. We need it for both up- and downloads, and it has to be thought through whether that is possible. For the server-side functionality there are a couple of open questions that have to be investigated carefully. Preferably the handling of the zsync checksum lists can be provided by an app. All that has to be clarified and discussed, and it will take a while.
But as outlined above, this idea only pays off for a limited set of file types. So what I would suggest first is that we get an idea of the file types users typically store in their ownCloud, so that we can make a validated estimate of how much this feature would help. I will follow up on this first step.
Thanks for reading this long blog post. Thanks to Danimo for proofreading.