Updates to Repository Sync Semantics
We’re excited to announce that we’re rolling out a new version of atproto repositories that removes history from the canonical structure of repositories, and replaces it with a logical clock. We’ll start rolling out this update next week (August 28, 2023).
For most developers with projects subscribed to the firehose, such as feed generators, this change shouldn’t affect you. These will only affect you if you’re doing commit-aware repo sync (a good rule of thumb is if you’ve ever passed earliest
or latest
to the com.atproto.sync.getRepo
method) or are explicitly checking the repo version when processing commits.
Removing Repository History
Repositories on the AT Protocol are like Git repositories, but for structured records. Just like Git, each commit to an atproto repository currently includes a pointer to the previous commit. However, this approach has caused a couple of pain points:
- Record deletions are difficult to process. If a user deletes a record, that commit needs to be erased from their repository to match their intent.
- Increased storage cost. Maintaining repo history can cause anywhere from a 5-10x increase in repo size.
We attempted to resolve both of these in the current model through rebases (discrete moments when the history of a repository is deleted/mutated, like in Git). However, this is a tricky and sensitive operation that is expensive to conduct and complex to communicate across the network.
Using a Logical Clock for Repositories
To address the above issues, we’re replacing the prev
pointer in commits with a logical clock. We originally published our intention to do so a few weeks ago. These are the changes we’re making to the way we handle repository history:
- Incrementing the repo version to
3
- Making the
prev
field on repo commits optional - Adding a new required
rev
(revision) field which is a logical clock - Removing or adjusting commit-aware repo sync mechanisms
Note: If you explicitly verify the version of a repo commit or do strict type checking on commit repo commits (which you shouldn’t — the spec allows unspecified fields!), you will need to make that check inclusive of version 3.
To facilitate backwards compatibility with software that is still running repo v2, we will continue setting the prev field on commits in the interim.
Even though we are setting the prev field, this can be considered a “hint” and the history is no longer considered a canonical part of the repository.
Repository Revisions
The new sync semantics for the repository rely on a logical clock included in each signed commit.
This “revision” takes the form of a TID and must be monotonically increasing.
The included revision serves a few functions:
Ordering
The clock provides a simple ordering mechanism for encountered repos or commits. If a consumer encounters the same repo from two different sources, each with a valid signature and structure, the revision gives a simple mechanism to determine which is the most recent repository.
Sync
When syncing a repository, revisions give a series of signposts that allow you to request everything from a given repo since a previously seen version. Because revisions are ordered and monotonically increasing, the provider does not necessarily need the exact revision that the consumer is asking for (as with a commit hash), rather they can provide all repo contents from the latest version of the repo that they remember that is before the requested revision.
The PDS for instance will track the revision at which each repo block or record was introduced into a repository. If a consumer asks for every block or record since a given revision, the PDS has a simple mechanism by which to give that information, without needing a complicated sync algorithm.
Stale Reads
Finally, a logical clock on the repo gives us a mechanism through which we can detect stale reads. (We actually already snuck this in with an optional revision field on v2 repos!)
Repo revisions may be returned in response headers to most requests. A client will know their own repo’s current revision and can compare that with the upstream service’s revision.
We use this today on the PDS to paper over some read-after-write concerns that are inherent in eventually consistent architectures. Some clients may use these headers to alert their users that their PDS is “out of sync” with other services in the network (for instance an AppView).
Available sync methods
If you have questions about these changes, join us on GitHub Discussions here.