ownCloud Chunking NG

Recently Thomas and I met in person and thought about an alternative approach to bring our big file chunking to the next level. “Big file chunking” is ownCloud's algorithm for uploading huge files to ownCloud with clients.

This is the first of three little blog posts in which we want to present the idea and get your feedback. This is for open discussion, nothing is set in stone so far.

What is the downside of the current approach? Well, the current algorithm needs a lot of knowledge shared between server and client to work: the naming scheme of the part files, semi-secret headers, implicit knowledge. In addition, due to the nature of the algorithm, the server code is spread too widely over the code base, which makes it difficult to maintain.

This situation could be improved with the following approach.

To handle chunked uploads, there will be a new WebDAV route, called remote.php/uploads.
All uploads of files larger than the chunk size will go through this route.

In a nutshell, a big file is uploaded in parts to a directory under that new route. The client creates the directory through the new route, which initiates a new upload. If the directory was created successfully, the client starts to upload chunks of the original file into that directory. The sequence of the chunks is set by the names of the chunk files created in the directory. Once all chunks are uploaded, the client submits a MOVE request that renames the chunk upload directory to the target file.

Here is a pseudo code description of the sequence:

1. Client creates an upload directory with a self-chosen name (ideally a numeric upload id):

MKCOL remote.php/uploads/upload-id

2. Client sends a chunk:

PUT remote.php/uploads/upload-id/chunk-id

3. Client repeats step 2 until all chunks have been uploaded successfully
4. Client finalizes the upload:

MOVE remote.php/uploads/upload-id /path/to/target-file

5. The MOVE sends the ETag of the file that is supposed to be overwritten in a request header to the server. The server returns the new ETag and FileID as reply headers of the MOVE.
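
The sequence above can be sketched as a small Python function that only plans the requests and does not talk to a server. This is a sketch under assumptions, not the actual client code: the zero-padded numeric chunk names, the host name in the usage example, and the use of `If-Match` for the ETag are illustrative choices (the post does not name the header). `Destination` is the standard WebDAV header that carries the target of a MOVE.

```python
from typing import Dict, List, Optional, Tuple

# One planned request: (HTTP method, URL, headers)
Request = Tuple[str, str, Dict[str, str]]


def plan_chunked_upload(
    base_url: str,
    upload_id: str,
    file_size: int,
    chunk_size: int,
    target: str,
    previous_etag: Optional[str] = None,
) -> List[Request]:
    """Return the MKCOL / PUT... / MOVE sequence for one chunked upload."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")

    upload_dir = f"{base_url.rstrip('/')}/remote.php/uploads/{upload_id}"

    # Step 1: create the upload directory.
    plan: List[Request] = [("MKCOL", upload_dir, {})]

    # Steps 2 and 3: one PUT per chunk. The chunk name encodes the order;
    # zero padding (an assumption) keeps the names sortable as strings.
    chunk_count = max(1, -(-file_size // chunk_size))  # ceiling division
    for index in range(chunk_count):
        start = index * chunk_size
        length = min(chunk_size, file_size - start)
        plan.append(
            ("PUT", f"{upload_dir}/{index:08d}", {"Content-Length": str(length)})
        )

    # Steps 4 and 5: finalize with a MOVE onto the target file.
    move_headers = {"Destination": target}
    if previous_etag is not None:
        # Assumption: the ETag to be overwritten is sent as If-Match.
        move_headers["If-Match"] = previous_etag
    plan.append(("MOVE", upload_dir, move_headers))
    return plan
```

For example, `plan_chunked_upload("https://cloud.example.com", "4711", 25, 10, "/path/to/target-file")` plans one MKCOL, three PUTs (10, 10 and 5 bytes) and one MOVE.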

During the upload, the client can retrieve the current state of the upload with a PROPFIND request on the upload directory. The result will be a listing of all chunks that are already available on the server, with metadata such as mtime, checksum and size.
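
To illustrate how a client might use that listing to resume, here is a minimal Python sketch that reads the chunk names out of a WebDAV multistatus response and works out which chunks are still missing. The function names and the chunk naming are assumptions for illustration; only the `DAV:` namespace and the `response`/`href` elements come from the WebDAV standard. Metadata such as mtime, checksum and size is ignored here.

```python
import xml.etree.ElementTree as ET
from typing import List, Set

DAV_NS = {"d": "DAV:"}


def uploaded_chunks(multistatus_xml: str, upload_dir: str) -> Set[str]:
    """Collect the names of all chunks listed below upload_dir."""
    root = ET.fromstring(multistatus_xml)
    prefix = upload_dir.rstrip("/") + "/"
    names: Set[str] = set()
    for href in root.findall("d:response/d:href", DAV_NS):
        path = (href.text or "").strip()
        if path.startswith(prefix):
            name = path[len(prefix):].strip("/")
            if name:  # skip the entry for the upload directory itself
                names.add(name)
    return names


def missing_chunks(expected: List[str], present: Set[str]) -> List[str]:
    """Return the expected chunk names that the server does not have yet."""
    return [name for name in expected if name not in present]
```

A client that knows the full list of chunk names for a file would then upload only the names returned by `missing_chunks`.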

If the server decides to remove an upload, e.g. because it hasn’t been active for a while, it is free to remove the entire upload directory and return status 404 if a client tries to upload to it. Also, a client is allowed to remove the entire upload directory to cancel an upload.

An upload is finalized by the MOVE request. Note that it’s a MOVE of a directory onto a single file. This operation is not supported in normal file systems, but we think that in this case it has a nice, descriptive meaning. A MOVE is known as an atomic and fast operation, and that is how the server should implement it.

Also note that the upload operation is associated with the final destination file only by the final MOVE. We think that this approach is already a great improvement, because there is always a clear state of the upload with no secret knowledge hidden in the process.

In the next blog post I will discuss an extension to this that adds more features to the process.

What do you think so far? Your feedback is appreciated, ideally on the ownCloud devel mailing list!

  1. June 22, 2015 at 14:41

    Will the next posts clear up how to specify the chunk offsets, chunk size, final file size, expected “previous ETag” for the target file?

    Right now the pseudo-HTTP in this blog post is confusing since it does not specify a (placeholder) ID or whatever in the URL.🙂

    • dragotin
      June 22, 2015 at 15:22

      guruz, I fixed that, wordpress has eaten up the < and >

  2. Olivier
    June 23, 2015 at 13:59

    It would be great if you could list the down sides of each approach.

  3. July 3, 2015 at 09:20

    Having a clearly marked begin and end of chunked upload protocol is an improvement. How would you add checksumming into this process? The server needs to trigger the checksum on the assembled files. So in this proposal the MOVE marks the end of the upload transaction and it needs to be involved in the checksum process.

    • dragotin
      July 10, 2015 at 15:52

      Please stay tuned about the checksumming until the next blog where we will address this. Thanks!

  4. Andreas
    July 3, 2015 at 09:58

    Here are two existing implementations:

    NGINX:
    http://www.grid.net.ru/nginx/resumable_uploads.en.html
    https://github.com/pgaertig/nginx-big-upload

    Features

    PUT/POST uploads,
    Partial chunked and resumable uploads,
    On the fly resumable CRC32 checksum calculation (client-side state),
    On the fly resumable SHA-1
    nginx-upload-module resumable protocol compatibility,

    AmazonS3:
    http://docs.aws.amazon.com/AmazonS3/latest/dev/uploadobjusingmpu.html
    http://docs.aws.amazon.com/AmazonS3/latest/dev/UsingRESTAPImpUpload.htm

    • dragotin
      July 10, 2015 at 16:02

      I wonder if these methods can easily be resumed if the client has lost the information about which parts were already sent. In our proposal, the information about which parts are already uploaded can be queried again from the server by doing a PROPFIND on the upload directory. The server keeps control over how long it keeps partially uploaded files.

      While writing this I realize that, for that to work, the client still needs to remember the upload id. But since the client creates the upload id, it could be computed from the file name and etag, so that it is easy to reproduce.
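
A reproducible upload id of that kind could be derived as in the following minimal Python sketch. The choice of SHA-256, the separator byte and the 64-bit truncation are assumptions for illustration, not part of the proposal; the only property that matters is that the same file name and etag always give the same numeric id.

```python
import hashlib


def derive_upload_id(file_name: str, etag: str) -> int:
    """Derive a numeric upload id that depends only on file name and etag."""
    # A NUL separator avoids collisions such as ("ab", "c") vs. ("a", "bc").
    material = f"{file_name}\0{etag}".encode("utf-8")
    digest = hashlib.sha256(material).digest()
    # Keep the first 8 bytes so the id stays a reasonably short number.
    return int.from_bytes(digest[:8], "big")
```

A client that lost its local state could recompute the id from the file name and etag and then query the matching upload directory again.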

  5. July 8, 2015 at 22:26

    Would it be a possibility to use this as a basis as well?

    http://sabre.io/dav/http-patch/

    Personally I’m a bit curious if creating full-on resources for every chunk is needed. My own inclination would be to use a single resource and use PATCH to do all the processing.

    I also think I would try to keep the incomplete resource completely out of the regular DAV namespace. I think an incomplete resource will not have a lot of meaning to clients, except the client that’s uploading it. If it’s available for other clients to see, it could just create a risk of corruption.

    Lastly, I wonder if you guys considered looking into a simple upload that simply handles disconnects more gracefully. If a file was only partially uploaded, the server could mark that file as such and just resume the upload using PATCH instead of preemptively splitting up the file into chunks. Not sure how PHP takes that though!

    • dragotin
      July 10, 2015 at 16:16

      Well, the PATCH command can still be used if we update existing files and can compute that we only wanna update parts. Again, this has the issue that it’s hard to resume.

      Concerning the namespace, I fully agree, but I think that was the idea, right deepdiver?

  6. July 10, 2015 at 15:29

    A very initial implementation of this concept can be found here: https://github.com/DeepDiver1975/dav

  7. August 4, 2015 at 08:19

    Things to think about:

    – How does one set the mtime of the file or other metadata?
    – How long does the MOVE take? Will the server be able to reply to the MOVE with a success or failure in a short time (shorter than the HTTP timeout)?
    – Does the server have some mechanism to expire old chunks?
    – The MOVE should also have an If-Match header to check that there are no race conditions at the destination.

  8. August 29, 2015 at 21:37

    Instead of If-Match, use the excellent If: header!
