Home > FOSS, ownCloud > ownCloud ETags and FileIDs

ownCloud ETags and FileIDs

Often questions come up about the meaning of FileIDs and ETags. Both values are metadata that the ownCloud Server stores for each of the files and directories in the server database. These values are fundamentally important for the integrity of data in the overall system.
Here are some thoughts about what they are why these are so important.This is mainly from a clients point of view, but there are other use cases as well.

ETags

ETags are strings that describe exactly one specific version of a file (example: 71a89a94b0846d53c17905a940b1581e).

data2Whenever the file changes, the ownCloud server will make sure that the ETag of the specific file changes as well. It is not important in which way the ETag changes, it also does not have to be strictly unique, it’s just important that it changes reliably if the file changes for whatever reason. However, ETags should not change if the file was not changed, otherwise the client will download that file again.

In addition to that, The ETags of the parent directories of the file have to change as well, up to the root directory. That way client systems can detect changes that happen somewhere in the file tree. This is in contrast to normal computer file systems where only the modification time of the direct parent of a file is changing.

File IDs

FileIDs are also strings that are created once at the creation time of the file (example: 00003867ocobzus5kn6s).

data3But contrary to the ETags, the file IDs should never ever change over the files lifetime. Not on an edit of the file, and also not if the file is renamed or moved. One of the important usages of the FileID is to detect renames and moves of a file on the server.

The FileID is used as an unique key to identify a file. FileIDs need to be unique within one ownCloud, and in inter-owncloud connections, they must be compared together with the ownCloud server instance id.

Also, the FileIDs must never be recycled or reused.

Checksums?

Often ETags and FileIDs are confused with checksums such as MD5 or SHA1 sums over the file content.

Neither ETags nor FileIDs are, even if there are similarities: Especially the ETag can be seen as a checksum over the file content. However, file checksums are way more costly to compute than just a value that only needs to change somehow.

What happens if…?

Let’s make a thought experiment and consider what it would mean especially for sync clients if either fileID or ETag gets lost from the servers database.

If ETags are lost, clients loose the ability to decide if files have changed since the last time that was checked by the clients. So what happens is that the client will download the files again, byte-wise compare them to the local file and use the server file if the files differ. A conflict file will be created. Because the ETag was lost, the server will create new ETags on download. This could be improved by the server creating more predictable ETags based on the storage backends capabilities.

If the ETags are changed without reason, for example because a backup was played back on the server, the clients will consider the ones with changed ETags as changed and redownload them. Conflict handling will happen as described if there was a local change as well.

For the user, this means a lot of unnecessary downloads as well as potential conflicts. However, there will not be data loss.

If FileIDs got lost or changed, the problem is that renames or moves on server side can no longer be detected. That would result in a new download of files in the good case. If a fileID however changes to something that was used before, that can result in a rename that overwrites an unrelated file. That is because clients might still have the FileID associated with another file.

Hopefully this little post explains the importance of the additional metadata that we maintain in ownCloud.

  1. kuba
    June 16, 2015 at 12:02

    Hi Klaas,

    Nice post but please keep in mind that FileId may actually change for a file which otherwise could be considered always the same file. Consider this on the server:

    echo “new content” > file1.txt
    mv file1.txt file2.txt

    versus this:

    rm file1.txt
    echo “new content” > file1.txt
    mv file1.txt file2.txt

    Simply because there was an rm operation the file2.txt FileIds are different in two cases. And you might have not even noticed the rm because it would happen before the sync kicked in.

    So this is my point about permanent relationship between files on the server and on the client and the limitations of this approach: https://github.com/cernbox/smashbox/blob/master/protocol/protocol.md#lifecycle-and-semantics-of-fileid-vs-path-vs-etag

    It boils down to this: always refresh the locally cached FileId from the response header of GET to make sure your client db does not contain stale FileIds. And do not base case conflict resolution based on FileId.

  1. April 10, 2015 at 14:00

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: