This is the third and final part of a little blog series about a new chunking algorithm that we discussed in ownCloud. You might be interested to read the first two parts ownCloud Chunking NG and Announcing an Upload as well.
This part makes a couple of ideas how the new chunking could be useful with a future feature of incremental sync (also called delta sync) in ownCloud.
In preparartion of delta sync the server could provide another new WebDAV route: remote.php/dav/blocks.
For each file, remote.php/dav/blocks/file-id exists as long as the server has valid checksums for blocks of the file which is identified by its unique file id.
A successful reply to remote.php/dav/blocks/file-id returns an JSON formatted data block with byte ranges and the respective checksums (and the checksum type) over the data blocks for the file. The client can use that information to calculate the blocks of data that has changed and thus needs to be uploaded.
If a file was changed on the server and as a result the checksums are not longer valid, access to remote.php/blocks/file-id is returning the 404 "not found" return code. The client needs to be able to handle missing checksum information at any time.
The server gets the checksums of file blocks along the upload of the chunks from the client. There is no obligation of the server to calculate the checksums of data blocks that came in other than through the clients, yet it can if there is capacity.
To implement incremental sync, the following high level processing could be implemented:
- The client downloads the blocklist of the file: GET remote.php/dav/blocks/file-id
- If GET succeeded: Client computes the local blocklist and computes changes
- If GET failed: All blocks of the file have to be uploaded.
- Client sends request MKCOL /uploads/transfer-id as described in an earlier part of the blog.
- For blocks that have changed: PUT data to /uploads/transfer-id/part-no
- For blocks that have NOT changed: COPY /blocks/file-id/block-no /uploads/transfer-id/part-no
- If all blocks are handled by either being uploaded or copied: Client sends MOVE /uploads/transfer-id /path/to/target-file to finalize the upload.
This would be an extension to the previously described upload of complete files. The PROPFIND semantic on /uploads/transfer-id remains valid.
Depending on the amount of not changed blocks, this could be a dramatic cut for the data that have to be uploaded. More information has to be collected to find out how much that is.
Note that this is still in the idea- and to-be-discussed state, and not yet an agreed specification for a new chunking algorithm.
Please, as usual, share your feedback with us!
Something I wanted to share is a device I recently built. It is a complete net player for music files, served by a NAS. It’s now finished and “in production” now, so here is my report.
HardwareAmp+ is a high-quality power amplifier that is mounted on the Raspberry mini computer, shortcutting the sub optimal audio system of the raspy. Only loudspeakers need to be connected, and this little combination is a very capable stereo audio system.
On the Raspberry runs a specialised linux distribution with a web based audio player for audio files and web radio. The name of this project is Volumio and it is one of my most favorite projects currently. What I like with it is that it is only a music player, and not a project that tries to manage all kind of media. Volumios user interface is accessible via web browser, and it is very user friendly, modern, pretty and clean.
It works on all size factors (from mobile phone to desktop) and is very easy to handle. Its a web interface to mpd which runs on the raspi for the actual music indexing and playing.The distribution all is running on is an optimized raspbian, with a variety of kernel drivers for audio output devices. Everything is pre configured. Once the volumio image is written to SD-Card, the Raspberry boots up and the device starts to play nicely basically after network and media source was configured through the web interface.
It is impressive how the Volumio project aims for absolute simplicity in use, but also allows to play around with a whole lot of interesting settings. A lot can be, but very little has to be configured.
Bottomline: Volumio is really a interesting project which you wanna try if you’re interested in these things.
I built the housing out of cherry tree from the region here. I got it from friends growing cherries, and they gave a few very nice shelfs. It was sliced and planed to 10mm thickness. The dark inlay is teak which I got from 70’s furniture that was found on bulky waste years ago.
After having everything cut, the cherry wood was glued together, with some internal help construction from plywood. After the sanding and polishing, the box was done.The Raspberry and Hifiberry are mounted on a little construction attached to the metal back cover, together with the speaker connector. The metal cover is tightend with one screw, and the whole electronics can be taken off the wood box by unscrewing it.
At the bottom of the device there is a switching power supply that provides the energy for the player.
How to Operate?
The player is comletely operated from the web interface. The make all additional knobs and stuff superflous, the user even uses the tablet or notebook to adjust the volume. And since the device is never switched off, it does not even have a power button.
I combined it with Denon SC-M39 speakers. The music files come from a consumer NAS in the home network. The Raspberry is very well powerful enough for the software, and the Hifiberry device is surprisingly powerful and clean. The sound is very nice. Clear and fresh in the mid and heights, but still enough base, which never is annoying blurry or drones.
I am very happy about the result of this little project. Hope you like it :-)
The first part of this little blog series explained the basic operations of chunk file upload as we set it up for discussion. This part goes a bit beyond and talks about an addition to that, called announcing the upload.
With the processing described in the first part of the blog, the upload is done savely and with a clean approach, but it also has some drawbacks.
Most notably the server does not know the target filename of the uploaded file upfront. Also it does not know the final size or mimetype of the target file. That is not a problem in general, but imagine the following situation: A big file should be uploaded, which would exceed the users quota. That would only become an error for the user once all uploads happened, and the final upload directory is going to be moved on the final file name.
To avoid useless file transfers like that or to implement features like a file firewall, it would be good if the server would know these data at start of the upload and stop the upload in case it can not be accepted.
To achieve that, the client creates a file called _meta in /uploads/ before the upload of the chunks starts. The file contains information such as overall size, target file name and other meta information.
The server’s reply to the PUT of the _meta file can be a fail result code and error description to indicate that the upload will not be accepted due to certain server conditions. The client should check the result codes in order to avoid not necessary upload of data volume of which the final MOVE would fail anyway.
This is just a collection of ideas for an improved big file chunking protocol, nothing is decided yet. But now is the time to discuss. We’re looking forward to hearing your input.
The third and last part will describe how this plays into delta sync, which is especially interesting for big files, which are usually chunked.
Recently Thomas and me met in person and thought about an alternative approach to bring our big file chunking to the next level. “Big file chunking” is ownClouds algorithm to upload huge files to ownCloud with clients.
This is the first of three little blog posts in which we want to present the idea and get your feedback. This is for open discussion, nothing is set in stone so far.
What is the downside of the current approach? Well, the current algorithm needs a lot of distributed knowledge between server and client to work: The naming scheme of the part files, semi secret headers, implicit knowledge. In addition to that, due to the character of the algorithm the server code is too much spread over the whole code base which makes maintaining difficult.
This situation could be improved with the following approach.
To handle chunked uploads, there will be a new WebDAV route, called remote.php/uploads.
All uploads of files larger than the chunk size will go through this route.
In a nutshell, an upload of a big file will happen as parts to a directory under that new route. The client creates it through the new route. This initiates a new upload. If the directory could be created successfully, the client starts to upload chunks of the original file into that directory. The sequence of the chunks is set by the names of the chunk files created in the directory. Once all chunks are uploaded, the client submits a MOVE request the renames the chunk upload directory to the target file.
Here is a pseudo code description of the sequence:
1. Client creates an upload directory with a self choosen name (ideally a numeric upload id):
2. Client sends a chunk:
3. Client repeats 2. until all chunks have successfully been uploaded
4. Client finalizes the upload:
MOVE remote.php/uploads/upload-id /path/to/target-file
5. The MOVE sends the ETag that is supposed to be overwritten in the request header to server. Server returns new ETag and FileID as reply headers of the MOVE.
During the upload, client can retrieve the current state of the upload by a PROPFIND request on the upload directory. The result will be a listing of all chunks that are already available on the server with metadata such as mtime, checksum and size.
If the server decides to remove an upload, ie. because it hasn’t been active for a time, it is free to remove the entire upload directory and return status 404 if a client tries to upload to. Also, a client is allowed to remove the entire upload directory to cancel an upload.
An upload is finalized by the MOVE request. Note that it’s a MOVE of a directory on a single file. This operation is not supported in normal file systems, but we think in this case, it has a nice well descriptive meaning. A MOVE is known as an atomic and fast operation, and that way it should be implemented by the server.
Also note that only with the final MOVE the upload operation is associated with the final destination file. We think that this approach already is a great improvement, because there is always a clear state of the upload with no secret knowledge hidden in the process.
In the next blog I will discuss an extension to this that adds more features to the process.
What do you think so far? Your feedback is appreciated, best on the ownCloud devel mailinglist!
Today, we’re happy to release the best ownCloud Desktop Client ever to our community and users! It is ownCloud Client 1.8.0 and it will push syncing with ownCloud to a new level of performance, stability and convenience.This release brings a new integration into the operating system file manager. With 1.8.0, there is a new context menu that opens a dialog to allow the user to create a public link on a synced file. This link can be forwarded to other users who get access to the file via ownCloud.
Also the clients behavior when syncing files that are opened by other applications on Windows has greatly been improved. The problems with file locking some users saw for example with MS office apps were fixed.
Another area of improvements is again performance. With latest ownCloud servers, the client uses even more parallized requests, now for all kind of operations. Depending on the synced data structure, this can make a huge difference.
All the other changes, improvements and bug-fixes are too hard to count. Finally, this release received around 700 git commits compared to the previous release.
All this is only possible with the powerful and awesome community of ownClouders. We received a lot of very good contributions through the GitHub tracker, which helped us to nail down a lot of issues and improved the client tremendously.
But this time we’d like to specifically point out the code contributions of Alfie “Azelphur” Day and Roeland Jago Douma who contributed significant code bits to the sharing dialog on the client and also some server code.
A great thanks goes out to all of you who helped with this release. It was a great experience again and it is big fun working with you!
We hope you enjoy 1.8.0! Get it from https://owncloud.org/install/#desktop
Often questions come up about the meaning of FileIDs and ETags. Both values are metadata that the ownCloud Server stores for each of the files and directories in the server database. These values are fundamentally important for the integrity of data in the overall system.
Here are some thoughts about what they are why these are so important.This is mainly from a clients point of view, but there are other use cases as well.
ETags are strings that describe exactly one specific version of a file (example: 71a89a94b0846d53c17905a940b1581e).
Whenever the file changes, the ownCloud server will make sure that the ETag of the specific file changes as well. It is not important in which way the ETag changes, it also does not have to be strictly unique, it’s just important that it changes reliably if the file changes for whatever reason. However, ETags should not change if the file was not changed, otherwise the client will download that file again.
In addition to that, The ETags of the parent directories of the file have to change as well, up to the root directory. That way client systems can detect changes that happen somewhere in the file tree. This is in contrast to normal computer file systems where only the modification time of the direct parent of a file is changing.
FileIDs are also strings that are created once at the creation time of the file (example: 00003867ocobzus5kn6s).
But contrary to the ETags, the file IDs should never ever change over the files lifetime. Not on an edit of the file, and also not if the file is renamed or moved. One of the important usages of the FileID is to detect renames and moves of a file on the server.
The FileID is used as an unique key to identify a file. FileIDs need to be unique within one ownCloud, and in inter-owncloud connections, they must be compared together with the ownCloud server instance id.
Also, the FileIDs must never be recycled or reused.
Often ETags and FileIDs are confused with checksums such as MD5 or SHA1 sums over the file content.
Neither ETags nor FileIDs are, even if there are similarities: Especially the ETag can be seen as a checksum over the file content. However, file checksums are way more costly to compute than just a value that only needs to change somehow.
What happens if…?
Let’s make a thought experiment and consider what it would mean especially for sync clients if either fileID or ETag gets lost from the servers database.
If ETags are lost, clients loose the ability to decide if files have changed since the last time that was checked by the clients. So what happens is that the client will download the files again, byte-wise compare them to the local file and use the server file if the files differ. A conflict file will be created. Because the ETag was lost, the server will create new ETags on download. This could be improved by the server creating more predictable ETags based on the storage backends capabilities.
If the ETags are changed without reason, for example because a backup was played back on the server, the clients will consider the ones with changed ETags as changed and redownload them. Conflict handling will happen as described if there was a local change as well.
For the user, this means a lot of unnecessary downloads as well as potential conflicts. However, there will not be data loss.
If FileIDs got lost or changed, the problem is that renames or moves on server side can no longer be detected. That would result in a new download of files in the good case. If a fileID however changes to something that was used before, that can result in a rename that overwrites an unrelated file. That is because clients might still have the FileID associated with another file.
Hopefully this little post explains the importance of the additional metadata that we maintain in ownCloud.
Incremental Sync is probably the feature that most people ask, or even sometimes cry for. Recently there was another wave of discussion about ownCloud is doing incremental sync or not.
I will try again (as in this issue) to explain why we decided to slowing that feature. Slowing means that it will be done later, not never, as it was stated. It is just that we think that other things benefit the whole idea of ownCloud more. That has plain technical reasons. Let’s dive a bit into.
RSync is great
Nobody will object here. In a nutshell, this is how rsync works: There is a file on the client and on the server. The idea is to not transfer the entire file from one side to the other if either side changes, but only the parts that have changed.
The original rsync does that by chopping the file to blocks of a given size and calculating a checksum of each of the blocks. The list of checksums is sent to the server and – here’s the trick – the server looks at its version of the file and for each of the checksum in the list, it seeks if it finds the same block in the file. That will often not be at the same position in the file, but maybe somewhere else. That is done for each block, and finally the server will work out the information of which parts of the file are existing and which are not and have to be sent by the client.
By way of this clever algorithm, we will just have to transmit a very small fraction of the changed file, because most content did not change. And that is what we want! Yeah!
Mission accomplished? No, not really. While there is basically nothing wrong with the idea in general, there is a severe architectural downside. The rsync algorithm depends on a strong server component which, for each file, searches around and calculates checksums. In an environment where we potentially have a lot of clients connecting to one server that would create a huge load on it which we need to avoid. So what if instead of putting the burden on the server’s shoulder, we could make the clients take the responsibility?
And guess what, there has been somebody thinking about that before and he says:
Use ZSync for this!
ZSync basically turns the idea of rsync upside down and shifts the calculation of checksums away from the server and onto the clients. That means that with zsync, the server can keep a static list of checksums for every block specific to a version of a file. The list can for example be computed along the upload of the file to the server. From that point it does not change, as long as the file does not change. That means less computation work for the server, and maybe this job can also put into the client.
So far that sounds cool (even though some questions remain) and sounds like something that can help us.
Unfortunately, the approach does not work very well for compressed files. The reason is that if a file gets compressed, even if only a couple of bytes in the original file change, the compression algorithm usually changes a lot all over the entire file. As a result, the zsync algorithm can only compute a comparably large diff. Given the cost of computation that can turn inefficient quickly.
“But who uses compressed files?” you might argue. The problem is that almost each and every of the files in everyday life are stored compressed. This is for example true for Microsoft Office files and the Open Document files produced by LibreOffice and Apache OpenOffice. They are really renamed ZIP containers, that hold the documents with all its embedded files, etc.
Now of course you will reply that zsync has an improved algorithm for compressed files. Yes, true, that is a great thing. However, it involves that the compressed file gets uncompressed to be worked on by zsync. Afterwards it is compressed again. And that is the problem: As common compressors do not leave a hint behind _how_ the file was compressed, it is not possible to reliably recreate a file that is equivalent to the original one. How will apps react on a file that has changed its compression scheme?
As said above, yes, we will at one point of time implement something along the zsync algorithm. The explanations above should show however, that at the current state of ownCloud, other features will improve ownClouds performance, stability and convenience more. And that is the important thing for us, more than pleasing the loudest barking dogs.
Here is a rough outline of how I would move on on this, open for your suggestions and critique:
The zsync algorithm is designed to improve downloads. We need it for both up- and downloads, and it needs to be thought through if that is also possible. For the server side functionality, there are a couple of open questions which carefully have to be investigated. Preferably an app can be written that provides the handling of the zsync checksum lists. That has to be clarified and discussed, and that will take a while.
But as outlined above, this idea is only clever for a limited amount of file types. So what I would suggest first is that we get an idea of the file types users usually store in their ownCloud, so that we can do a validated estimate on how this feature helps. I will follow up on this first step.
Thanks for reading this long blog post. Thanks Danimo for lectorate.