This is the third and final part of a little blog series about a new chunking algorithm that we discussed in ownCloud. You might be interested to read the first two parts ownCloud Chunking NG and Announcing an Upload as well.
This part makes a couple of ideas how the new chunking could be useful with a future feature of incremental sync (also called delta sync) in ownCloud.
In preparartion of delta sync the server could provide another new WebDAV route: remote.php/dav/blocks.
For each file, remote.php/dav/blocks/file-id exists as long as the server has valid checksums for blocks of the file which is identified by its unique file id.
A successful reply to remote.php/dav/blocks/file-id returns an JSON formatted data block with byte ranges and the respective checksums (and the checksum type) over the data blocks for the file. The client can use that information to calculate the blocks of data that has changed and thus needs to be uploaded.
If a file was changed on the server and as a result the checksums are not longer valid, access to remote.php/blocks/file-id is returning the 404 "not found" return code. The client needs to be able to handle missing checksum information at any time.
The server gets the checksums of file blocks along the upload of the chunks from the client. There is no obligation of the server to calculate the checksums of data blocks that came in other than through the clients, yet it can if there is capacity.
To implement incremental sync, the following high level processing could be implemented:
- The client downloads the blocklist of the file: GET remote.php/dav/blocks/file-id
- If GET succeeded: Client computes the local blocklist and computes changes
- If GET failed: All blocks of the file have to be uploaded.
- Client sends request MKCOL /uploads/transfer-id as described in an earlier part of the blog.
- For blocks that have changed: PUT data to /uploads/transfer-id/part-no
- For blocks that have NOT changed: COPY /blocks/file-id/block-no /uploads/transfer-id/part-no
- If all blocks are handled by either being uploaded or copied: Client sends MOVE /uploads/transfer-id /path/to/target-file to finalize the upload.
This would be an extension to the previously described upload of complete files. The PROPFIND semantic on /uploads/transfer-id remains valid.
Depending on the amount of not changed blocks, this could be a dramatic cut for the data that have to be uploaded. More information has to be collected to find out how much that is.
Note that this is still in the idea- and to-be-discussed state, and not yet an agreed specification for a new chunking algorithm.
Please, as usual, share your feedback with us!
The first part of this little blog series explained the basic operations of chunk file upload as we set it up for discussion. This part goes a bit beyond and talks about an addition to that, called announcing the upload.
With the processing described in the first part of the blog, the upload is done savely and with a clean approach, but it also has some drawbacks.
Most notably the server does not know the target filename of the uploaded file upfront. Also it does not know the final size or mimetype of the target file. That is not a problem in general, but imagine the following situation: A big file should be uploaded, which would exceed the users quota. That would only become an error for the user once all uploads happened, and the final upload directory is going to be moved on the final file name.
To avoid useless file transfers like that or to implement features like a file firewall, it would be good if the server would know these data at start of the upload and stop the upload in case it can not be accepted.
To achieve that, the client creates a file called _meta in /uploads/ before the upload of the chunks starts. The file contains information such as overall size, target file name and other meta information.
The server’s reply to the PUT of the _meta file can be a fail result code and error description to indicate that the upload will not be accepted due to certain server conditions. The client should check the result codes in order to avoid not necessary upload of data volume of which the final MOVE would fail anyway.
This is just a collection of ideas for an improved big file chunking protocol, nothing is decided yet. But now is the time to discuss. We’re looking forward to hearing your input.
The third and last part will describe how this plays into delta sync, which is especially interesting for big files, which are usually chunked.
Recently Thomas and me met in person and thought about an alternative approach to bring our big file chunking to the next level. “Big file chunking” is ownClouds algorithm to upload huge files to ownCloud with clients.
This is the first of three little blog posts in which we want to present the idea and get your feedback. This is for open discussion, nothing is set in stone so far.
What is the downside of the current approach? Well, the current algorithm needs a lot of distributed knowledge between server and client to work: The naming scheme of the part files, semi secret headers, implicit knowledge. In addition to that, due to the character of the algorithm the server code is too much spread over the whole code base which makes maintaining difficult.
This situation could be improved with the following approach.
To handle chunked uploads, there will be a new WebDAV route, called remote.php/uploads.
All uploads of files larger than the chunk size will go through this route.
In a nutshell, an upload of a big file will happen as parts to a directory under that new route. The client creates it through the new route. This initiates a new upload. If the directory could be created successfully, the client starts to upload chunks of the original file into that directory. The sequence of the chunks is set by the names of the chunk files created in the directory. Once all chunks are uploaded, the client submits a MOVE request the renames the chunk upload directory to the target file.
Here is a pseudo code description of the sequence:
1. Client creates an upload directory with a self choosen name (ideally a numeric upload id):
2. Client sends a chunk:
3. Client repeats 2. until all chunks have successfully been uploaded
4. Client finalizes the upload:
MOVE remote.php/uploads/upload-id /path/to/target-file
5. The MOVE sends the ETag that is supposed to be overwritten in the request header to server. Server returns new ETag and FileID as reply headers of the MOVE.
During the upload, client can retrieve the current state of the upload by a PROPFIND request on the upload directory. The result will be a listing of all chunks that are already available on the server with metadata such as mtime, checksum and size.
If the server decides to remove an upload, ie. because it hasn’t been active for a time, it is free to remove the entire upload directory and return status 404 if a client tries to upload to. Also, a client is allowed to remove the entire upload directory to cancel an upload.
An upload is finalized by the MOVE request. Note that it’s a MOVE of a directory on a single file. This operation is not supported in normal file systems, but we think in this case, it has a nice well descriptive meaning. A MOVE is known as an atomic and fast operation, and that way it should be implemented by the server.
Also note that only with the final MOVE the upload operation is associated with the final destination file. We think that this approach already is a great improvement, because there is always a clear state of the upload with no secret knowledge hidden in the process.
In the next blog I will discuss an extension to this that adds more features to the process.
What do you think so far? Your feedback is appreciated, best on the ownCloud devel mailinglist!