6 posts tagged with "guide"

Bluesky's Moderation Architecture

March 15, 2024 · 12 min read

Moderation is a crucial aspect of any social network. However, traditional moderation systems often lack transparency and user control, leaving communities vulnerable to sudden policy changes and potential mismanagement. To build a better social media ecosystem, it is necessary to try new approaches.

Today, we’re releasing an open labeling system on Bluesky. “Labeling” is a key part of moderation; it is a system for marking content that may need to be hidden, blurred, taken down, or annotated in applications. Labeling is how a lot of centralized moderation works under the hood, but nobody has ever opened it up for anyone to contribute. By building an open source labeling system, our goal is to empower developers, organizations, and users to actively participate in shaping the future of moderation.

In this post, we’ll dive into the details on how labeling and moderation works in the AT Protocol.

An open network of services

The AT Protocol is an open network of services that anyone can provide, essentially opening up the backend architecture of a large-scale social network. The core services form a pipeline where data flows from where it’s hosted, through a data firehose, and out to the various application indexes.

Data flows from independent account hosts into a firehose and then to applications.

Data flows from independent account hosts into a firehose and then to applications.

This event-driven architecture is similar to other high-scale systems, where you might traditionally use tools like Kafka for your data firehose. However, our open system allows anyone to run a piece of the backend. This means that there can be many hosts, firehoses, and indexes, all operated by different entities and exchanging data with each other.

Account hosts will sync with many firehoses.

Account hosts will sync with many firehoses.

Why would you want to run one of these services?

You’d run a PDS (Personal Data Server) if you want to self-host your data and keys to get increased control and privacy.
You’d run a Relay if you want a full copy of the network, or to crawl subsets of the network for targeted applications or services.
You’d run an AppView if you want to build custom applications with tailored views and experiences, such as a custom view for microblogging or for photos.

So what if you want to run your own moderation?

Decentralized moderation

On traditional social media platforms, moderation is often tightly coupled with other aspects of the system, such as hosting, algorithms, and the user interface. This tight coupling reduces the resilience of social networks as businesses change ownership or as policies shift due to financial or political pressures, leaving users with little choice but to accept the changes or stop using the service.

Decentralized moderation provides a safeguard against these risks. It relies on three principles:

Separation of roles. Moderation services operate separately from other services – particularly hosting and identity – to limit the potential for overreach.
Distributed operation. Multiple organizations providing moderation services reduces the risk of a single entity failing to serve user interests.
Interoperation. Users can choose between their preferred clients and associated moderation services without losing access to their communities.

In the AT Protocol, the PDS stores and manages user data, but it isn’t designed to handle moderation directly. A PDS could remove or filter content, but we chose not to rely on this for two main reasons. First, users can easily switch between PDS providers thanks to the account-migration feature. This means any takedowns performed by a PDS might only have a short-term effect, as users could move their data to another provider. Second, data hosting services aren't always the best equipped to deal with the challenges of content moderation, and those with local expertise and community building skills who want to participate in moderation may lack the technical capacity to run a server.

This is different from ActivityPub servers (Mastodon), which manage both data hosting and moderation as a bundled service, and do not make it as easy to switch servers as the AT Protocol does. By separating data storage from moderation, we let each service focus on what it does best.

Where moderation is applied

Moderation is done by a dedicated service called the Labeler (or “Labeling service”).

Labelers produce “labels” which are associated with specific pieces of user-generated content, such as individual posts, accounts, lists, or feeds. These labels make an assertion about the content, such as whether it contains sensitive material, is unpleasant, or is misleading.

These labels get synced to the AppViews where they can be attached to responses at the client’s request.

Labels are synced into AppViews where they can be attached to responses.

Labels are synced into AppViews where they can be attached to responses.

The clients read those labels to decide what to hide, blur, or drop. Since the clients choose their labelers and how to interpret the labels, they can decide which moderation systems to support. The chosen labels do not have to be broadcast, except to the AppView and PDS which fulfill the requests. A user subscribing to a labeler is not public, though the PDS and AppView can privately infer which users are subscribed to which services.

In the Bluesky app, we hardcode our in-house moderation to provide a strong foundation that upholds our community guidelines. We will continue to uphold our existing policies in the Bluesky app, even as this new architecture is made available. With the introduction of labelers, users will be able to subscribe to additional moderation services on top of the existing foundation of our in-house moderation.

The Bluesky application hardcodes its labeling and then stacks community labelers on top.

The Bluesky application hardcodes its labeling and then stacks community labelers on top.

For the best user experience, we suggest that clients in the AT Protocol ecosystem follow this pattern: have at least one built-in moderation service, and allow additional user-chosen mod services to be layered in on top.

The Bluesky app is a space that we create and maintain, and we want to provide a positive environment for our users, so our moderation service is built-in. On top of that, the additional services that users can subscribe to creates a lot of options within the app. However, if users disagree with Bluesky’s application-level moderation, they can choose to use another client on the network with its own moderation system. There are additional nuances to infrastructure-level moderation, which we will discuss below, but most content moderation happens at the application level.

How are labels defined?

A limited core set of labels are defined at the protocol level. These labels handle generic cases (“Content Warning”) and common adult content cases (“Pornography,” “Violence”).

A label can cover the content of a post with a warning.

A label can cover the content of a post with a warning.

Labelers may additionally define their own custom labels. These definitions are relatively straightforward; they give the label a localized name and description, and define the effects they can have.

interface LabelDefinition {
  identifier: string
  severity: 'inform' | 'alert'
  blurs: 'content' | 'media' | 'none'
  defaultSetting: 'hide' | 'warn' | 'ignore'
  adultContent: boolean
  locales: Record<string, LabelStrings>
}

interface LabelStrings {
  name: string
  description: string  
}

Using these definitions, it’s possible to create labels which are informational (“Satire”), topical (“Politics”), curational (“Dislike”), or moderational (“Rude”).

Users can then tune how the application handles these labels to get the outcomes they want.

Users configure whether they want to use each label.

Users configure whether they want to use each label.

Learn more about label definitions in the API Docs on Labelers and Moderation.

Running a labeler

We recently open-sourced Ozone, our powerful open-source Labeler service that we use in-house to moderate Bluesky. This is a significant step forward in transparency and community involvement, as we're sharing the same professional-grade tooling that our own moderation team relies on daily.

Ozone is designed for traditional moderation, where a team of moderators receives reports and takes action on them, but you're free to apply other models with custom software. We recommend anyone interested in running a labeler try out Ozone, as it simplifies the process by helping labelers set up their service account, field reports, and publish labels from the same web interface used by Bluesky's own moderation team, ensuring that all the necessary technical requirements are met. Detailed instructions for setting up and operating Ozone can be found in the readme.

Beyond Ozone’s interface, you can also explore alternative ways of labeling content. A few examples we've considered:

Community-driven voting systems (great for curation)
Network analysis (e.g., for detecting botnets)
AI models

If you want to create your own labeler, you simply need to build a web service that implements two endpoints to serve its labels to the wider network:

com.atproto.label.subscribeLabels : a realtime subscription of all labels
com.atproto.label.queryLabels : for looking up labels you've published on user-generated content

Reporting content is an important part of moderation, and reports should be sent privately to whoever is able to act on them. The Labeling service's API includes a specific endpoint designed for this use case. To receive user reports, a labeler can implement:

com.atproto.report.createReport : files a report from a particular user

When a user submits a report, they choose which of their active Labelers to report to. This gives users the ability to decide who should be informed of an issue. Reports serve as an additional signal for labelers, and they can handle them in a manner that best suits their needs, whether through human review or automated resolution. Appeals from users are essentially another type of report that provides feedback to Labelers. It's important to note that a Labeler is not required to accept reports.

In addition to the technical implementation, your labeler should also have a dedicated Bluesky account associated with it. This account serves as your labeler's public presence within the Bluesky app, allowing you to share information about the types of labels you plan to publish and how users should interpret them. By publishing an app.bsky.labeler.service record, you effectively "convert" your account into a Bluesky labeler, enabling users to discover and subscribe to your labeling service.

More details about labels and labelers can be found in the atproto specs.

Infrastructure moderation

Labeling is the basic building block of composable moderation, but there are other aspects involved. In the AT Protocol network, various services, such as the PDS, Relay, and AppView, have ultimate discretion over what content they carry, though it's not the most straightforward avenue for content moderation. Services that are closer to users, such as the client and labelers, are designed to be more actively involved in community and content moderation. These service providers have a better understanding of the specific community norms and social dynamics within their user base. By handling content moderation at this level, clients and labelers can make more informed decisions that align with the expectations and values of their communities.

Infrastructure providers such as Relays play a different role in the network, and are designed to be a common service provider that serves many kinds of applications. Relays perform simple data aggregation, and as the network grows, may eventually come to serve a wide range of social apps, each with their own unique communities and social norms. Consequently, Relays focus on combating network abuse and mitigating infrastructure-level harms, rather than making granular content moderation decisions.

An example of harm handled at the infrastructure layer is content that is illegal to host, such as child sexual abuse material (CSAM). Service providers should actively detect and remove content that cannot be hosted in the jurisdictions in which they operate. Bluesky already actively monitors its infrastructure for illegal content, and we're working on systems to advise other services (like PDS hosts) about issues we find.

Labels drive moderation in the client. The Relay and Appview apply infrastructure moderation.

Labels drive moderation in the client. The Relay and Appview apply infrastructure moderation.

This separation between backend infrastructure and application concerns is similar to how the web itself works. The PDSs function like personal websites or blogs on the web, which are hosted by various hosting providers. Just as individuals can choose their hosting provider and move their website if needed, users on the AT Protocol can select their PDS and migrate their data if they wish to change providers. Multiple companies can then run Relays and AppViews over PDSs, which are similar to content delivery networks and search engines, that serve as the backbone infrastructure to aggregate and index information. To provide a unified experience to the end user, application and labeling systems then provide a robust, opinionated approach to content moderation, the way individual websites and applications set their own community guidelines.

In summary

Bluesky's open labeling system is a significant step towards a more transparent, user-controlled, and resilient way to do moderation. We’ve opened up the way centralized moderation works under the hood for anyone to contribute, and provided a seamless integration into the Bluesky app for independent moderators. In addition, by open sourcing our internal moderation tools, we're allowing anyone to use, run, and contribute to improving them.

This open labeling system is a fundamentally new approach that has never been tried in the realm of social media moderation. In an industry where innovation has been stagnant for far too long, we are experimenting with new solutions to address the complex challenges faced by online communities. Exploring new approaches is essential if we want to make meaningful progress in tackling the problems that plague social platforms today, and we have designed and implemented what we believe to be a powerful and flexible approach.

Our goal has always been to build towards a more transparent and resilient social media ecosystem that can better represent an open society. We encourage developers, users, and organizations to get involved in shaping the future of moderation on Bluesky by running their own labeling services, contributing to the open-source Ozone project, or providing feedback on this system of stackable moderation. Together, we can design a more user-controlled social media ecosystem that empowers individuals and communities to create better online spaces.

Additional reading:

Early Access Federation for Self-Hosters

February 22, 2024 · 5 min read

For a high-level introduction to data federation, as well as a comparison to other federated social protocols, check out the Bluesky blog.

Update May 2024: we have removed the Discord registration requirement, and PDS instances can now connect to the network directly. You are still welcome to join the PDS Admins Discord for community support.

Today, we’re releasing an early-access version of federation intended for self-hosters and developers.

The atproto network is built upon a layer of self-authenticating data. This foundation is critical to guaranteeing the network’s long term integrity. But the protocol’s openness ultimately flows from a diverse set of hosts broadcasting this data across the network.

Up until now, every user on the network used a Bluesky PDS (Personal Data Server) to host their data. We’ve already federated our own data hosting on the backend, both to help operationally scale our service, and to prove out the technical underpinnings of an openly federated network. But today we’re opening up federation for anyone else to begin connecting with the network.

The PDS, in many ways, fulfills a simple role: it hosts your account and gives you the ability to log in, it holds the signing keys for your data, and it keeps your data online and highly available. Unlike a Mastodon instance, it does not need to function as a full-fledged social media service. We wanted to make atproto data hosting—like web hosting—into a fairly simple commoditized service. The PDS’s role has been limited in scope to achieve this goal. By limiting the scope, the role of a PDS in maintaining an open and fluid data network has become all the more powerful.

We’ve packaged the PDS into a friendly distribution with an installer script that handles much of the complexity of setting up a PDS. After you set up your PDS and join the PDS Admins Discord to submit a request for your PDS to be added to the network, your PDS’s data will get routed to other services in the network (like feed generators and the Bluesky Appview) through our Relay, the firehose provider. Check out our Federation Overview for more information on how data flows through the atproto network.

Early Access Limitations

As with many open systems, Relays will never be totally unconstrained in terms of what data they’re willing to crawl and rebroadcast. To prevent network and resource abuse, it will be necessary to place rate limits on the PDS hosts that they consume data from. As trust and reputation is established with PDS hosts, those rate limits will increase. We’ll gain capacity to increase the baseline rate limits we have in place for new PDSs in the network as we build better tools for detecting and mitigating abuse..

For a smooth transition into a federated network, we’re starting with some lower limits. Specifically, each PDS will be able to host 10 accounts and limited to 1500 evts/hr and 10,000 evts/day. After those limits are surpassed, we’ll stop crawling the PDS until the rate limit period resets. This is intended to keep the network and firehose running smoothly for everyone in the ecosystem.

These are early days, and we have some big changes still planned for the PDS distribution (including the introduction of OAuth!). The software will be updating frequently and things may break. We will not be breaking things indiscriminately. However, in this early period, in order to avoid cruft in the protocol and PDS distribution, we are not making promises of backwards compatibility. We will be supporting a migration path for each release, but if you do not keep your PDS distribution up to date, it may break and render the app unusable until you do so.

Because the PDS distribution is not totally settled, we want to have a line of communication with PDS admins in the network, so we’re asking any developer that plans to run a PDS to join the PDS Admins Discord. You’ll need to provide the hostname of your PDS and a contact email in order to get your PDS added to the Relay’s allowlist. This Discord will serve as a channel where we can announce updates about the PDS distribution, relay policy, and federation more generally. It will also serve as a community where PDS admins can experiment, chat, and help each other debug issues.

Account Migration

A major promise of the AT Protocol is the ability to migrate accounts between PDS hosts. This is an important check against potential abuse, and further safeguards the fluid open layer of data hosting that underpins the network.

The PDS distribution that we’re releasing has all of the facilities required to migrate accounts between servers. We’re also opening routes on our PDS that will allow you to migrate your account off our server. However — we do not yet provide the capability to migrate back to the Bluesky PDS, so for the time being, this is a one way street. Be warned: these migrations involve possibly destructive identity operations. While we have some guardrails in place, it may still be possible for you to break your account and we will not be able to help you recover it. So although it’s technically possible, we do not recommend migrating your main account between servers yet. We especially recommend against doing so if you do not have familiarity with how DID PLC operations work.

In the coming months we will be hardening this feature and making it safer and easier to do, including creating an in-app flow for moving between servers.

Getting Started

To get started, join the PDS Administrators Discord, and check out the bluesky-social/pds repo on Github. The README will provide all necessary information on getting your PDS setup and running.

Download and Parse Repository Exports

November 6, 2023 · 9 min read

One of the core principles of the AT Protocol is simple access to public data, including posts, multimedia blobs, and social graph metadata. A user's data is stored in a repository, which can be efficiently exported all together as a CAR file (.car). This post will describe how to export and parse a data repository.

A user's data repository consists of individual records, each of which can be accessed in JSON format via HTTP API endpoints.

The example code in this post is in the Go programming language, and uses the atproto SDK packages from indigo. You can find the full source code in our example cookbook GitHub repository.

This post is written for a developer audience. We plan on adding a feature for users to easily export their own data from within the app in the future.

Privacy Notice

While atproto data is public, you should take care to respect the rights, intents, and expectations of others. The following examples work for downloading any account's public data.

This goes beyond following copyright law, and includes respecting content deletions and block relationships. Images and other media content does not come with any reuse rights, unless explicitly noted by the account holder.

Download a Repository

On Bluesky's Main PDS Instance

You can easily construct a URL to download a repository on Bluesky's main PDS instance. In this case, the PDS host is https://bsky.social, the Lexicon endpoint is com.atproto.sync.getRepo, and the account DID is passed as a query parameter.

As a result, the download URL for the @atproto.com account is:

https://bsky.social/xrpc/com.atproto.sync.getRepo?did=did:plc:ewvi7nxzyoun6zhxrhs64oiz

Note that this endpoint intentionally does not require authentication: content in a user's repository is public (much like a public website), and anybody can download it from the web. Such content includes posts and likes, but does not include content like mutes and list subscriptions.

If you navigate to that URL, you'll download the repository for the @atproto.com account on Bluesky. But if you try to open that file, it won't make sense yet. We'll show you how to parse the data later in this post.

On Another Instance

In the more general case, we start with any "AT Identifier" (handle or DID). We need to find account's PDS instance. This involves first resolving the handle or DID to the account's DID document, then parsing out the #atproto_pds service entry.

The github.com/bluesky-social/indigo/atproto/identity package handles all of this for us already:

import (
    "fmt"
    "context"

    "github.com/bluesky-social/indigo/atproto/identity"
    "github.com/bluesky-social/indigo/atproto/syntax"
    "github.com/bluesky-social/indigo/xrpc"
)

func main() {
    run()
}

func run() error {
    ctx := context.Background()
    atid, err := syntax.ParseAtIdentifier("atproto.com")
    if err != nil {
        return err
    }

    dir := identity.DefaultDirectory()
    ident, err := dir.Lookup(ctx, *atid)
    if err != nil {
        return err
    }

    if ident.PDSEndpoint() == "" {
        return fmt.Errorf("no PDS endpoint for identity")
    }

    fmt.Println(ident.PDSEndpoint())
}

Once we know the PDS endpoint, we can create an atproto API client, call the getRepo endpoint, and write the results out to a local file on disk:

carPath := ident.DID.String() + ".car"

xrpcc := xrpc.Client{
    Host: ident.PDSEndpoint(),
}

repoBytes, err := comatproto.SyncGetRepo(ctx, &xrpcc, ident.DID.String(), "")
if err != nil {
    return err
}

err = os.WriteFile(carPath, repoBytes, 0666)
if err != nil {
    return err
}

The go-repo-export example from the cookbook repository implements this with the download-repo command:

> ./go-repo-export download-repo atproto.com
resolving identity: atproto.com
downloading from https://bsky.social to: did:plc:ewvi7nxzyoun6zhxrhs64oiz.car

Now you know that the @atproto.com account's PDS instance is at bsky.social, and we've downloaded the repository's CAR file.

Parse Records from CAR File as JSON

What's a CAR File?

CAR files are a standard file format from the IPLD ecosystem. They stand for "Content Addressable aRchives." They have a simple binary format, with a series of binary (CBOR) blocks concatenated together, not dissimilar to tar files or Git packfiles. They're well-suited for efficient data processing and archival storage, but they're not the most accessible to developers.

The Repository CAR File

The repository data structure is a key-value store, with the keys being a combination of a collection name (NSID) and "record key", separated by a slash (<collection>/<rkey>). The CAR file contains a reference (CID) pointing to a (signed) commit object, which then points to the top of the key-value tree. The commit object also has an atproto repo version number (at time of writing, currently 3), the account DID, and a revision string.

Let's load the repository tree structure out of the CAR file in to memory, and list all of the record paths (keys).

import (
    "encoding/json"
    "context"
    "fmt"
    "os"
    "path/filepath"

    "github.com/bluesky-social/indigo/repo"
)
func carList(carPath string) error {
    ctx := context.Background()
    fi, err := os.Open(carPath)
    if err != nil {
        return err
    }

    // read repository tree in to memory
    r, err := repo.ReadRepoFromCar(ctx, fi)
    if err != nil {
        return err
    }

    // extract DID from repo commit
    sc := r.SignedCommit()
    did, err := syntax.ParseDID(sc.Did)
    if err != nil {
        return err
    }
    topDir := did.String()

    // iterate over all of the records by key and CID
    err = r.ForEach(ctx, "", func(k string, v cid.Cid) error {
        fmt.Printf("%s\t%s\n", k, v.String())
        return nil
    })
    if err != nil {
        return err
    }
    return nil
}

Note that the ForEach iterator provides a record path string as a key, and a CID as the value, instead of the record data itself. If we want to get the record itself, we need to fetch the "block" (CBOR bytes) from the repository, using the CID reference.

Let's also convert the binary CBOR data in to a more accessible JSON format, and write out records to disk. The following code snippet could go in the ForEach function in the previous example:

// where to write out data on local disk
recPath := topDir + "/" + k
os.MkdirAll(filepath.Dir(recPath), os.ModePerm)
if err != nil {
    return err
}


// fetch the record CBOR and convert to a golang struct
_, rec, err := r.GetRecord(ctx, k)
if err != nil {
    return err
}

// serialize as JSON
recJson, err := json.MarshalIndent(rec, "", "  ")
if err != nil {
    return err
}

if err := os.WriteFile(recPath+".json", recJson, 0666); err != nil {
    return err
}

return nil

In the cookbook repository, the go-repo-export example implements these as list-records and unpack-record:

> ./go-repo-export list-records did:plc:ewvi7nxzyoun6zhxrhs64oiz.car
=== did:plc:ewvi7nxzyoun6zhxrhs64oiz ===
key record_cid
app.bsky.actor.profile/self bafyreifbxwvk2ewuduowdjkkjgspiy5li2dzyycrnlbu27gn3hfgthez3u
app.bsky.feed.like/3jucagnrmn22x    bafyreieohq4ngetnrpse22mynxpinzfnaw6m5xcsjj3s4oiidjlnnfo76a
app.bsky.feed.like/3jucahkymkk2e    bafyreidqrmqvrnz52efgqfavvjdbwob3bc2g3vvgmhmexgx4xputjty754
app.bsky.feed.like/3jucaj3qgmk2h    bafyreig5c2atahtzr2vo4v64aovgqbv6qwivfwf3ex5gn2537wwmtnkm3e
[...]

> ./go-repo-export unpack-records did:plc:ewvi7nxzyoun6zhxrhs64oiz.car
writing output to: did:plc:ewvi7nxzyoun6zhxrhs64oiz
did:plc:ewvi7nxzyoun6zhxrhs64oiz/app.bsky.actor.profile/self.json
did:plc:ewvi7nxzyoun6zhxrhs64oiz/app.bsky.feed.like/3jucagnrmn22x.json
did:plc:ewvi7nxzyoun6zhxrhs64oiz/app.bsky.feed.like/3jucahkymkk2e.json
did:plc:ewvi7nxzyoun6zhxrhs64oiz/app.bsky.feed.like/3jucaj3qgmk2h.json
[...]

If you were downloading and working with CAR in a higher-stakes situation than just running a one-off repository export, you would probably want to confirm the commit signature using the account's signing public key (included in the resolved identity metadata). Signing keys can change over time, meaning the signatures in old repo exports will no longer validate. It may be a good idea to keep a copy of the identity metadata along side the repository for long-term storage.

Downloading Blobs

An account's repository contains all the current (not deleted) records. These records include likes, posts, follows, etc. and may refer to images and other media "blobs" by hash (CID), but the blobs themselves aren't stored directly in the repository. So if you want a full public account data export, you also need to fetch the blobs.

It is possible to parse through all the records in a repository and extract all the blob references (tip: they all have $type: blob). But PDS instances also implement a helpful com.atproto.sync.listBlobs endpoint, which returns all the CIDs (blob hashes) for a specific account (DID).

The com.atproto.sync.getBlob endpoint is used to download the original blob itself.

Neither of these PDS endpoints require authentication, though they may be rate-limited by operators to prevent resource exhaustion or excessive bandwidth costs.

Note that the first part of the blob download function is very similar to the CAR download: resolving identity to find the account's PDS:

func blobDownloadAll(raw string) error {
    ctx := context.Background()
    atid, err := syntax.ParseAtIdentifier(raw)
    if err != nil {
        return err
    }

    // resolve the DID document and PDS for this account
    dir := identity.DefaultDirectory()
    ident, err := dir.Lookup(ctx, *atid)
    if err != nil {
        return err
    }

    // create a new API client to connect to the account's PDS
    xrpcc := xrpc.Client{
        Host: ident.PDSEndpoint(),
    }
    if xrpcc.Host == "" {
        return fmt.Errorf("no PDS endpoint for identity")
    }

    topDir := ident.DID.String() + "/_blob"
    os.MkdirAll(topDir, os.ModePerm)

    // blob-specific part starts here!
    cursor := ""
    for {
        // loop over batches of CIDs
        resp, err := comatproto.SyncListBlobs(ctx, &xrpcc, cursor, ident.DID.String(), 500, "")
        if err != nil {
            return err
        }
        for _, cidStr := range resp.Cids {
            // if the file already exists, skip
            blobPath := topDir + "/" + cidStr
            if _, err := os.Stat(blobPath); err == nil {
                continue
            }

            // download the entire blob in to memory, then write to disk
            blobBytes, err := comatproto.SyncGetBlob(ctx, &xrpcc, cidStr, ident.DID.String())
            if err != nil {
                return err
            }
            if err := os.WriteFile(blobPath, blobBytes, 0666); err != nil {
                return err
            }
        }

        // a cursor in the result means there are more CIDs to enumerate
        if resp.Cursor != nil && *resp.Cursor != "" {
            cursor = *resp.Cursor
        } else {
            break
        }
    }
    return nil
}

In the cookbook repository, the go-repo-export example implements list-blobs and download-blobs commands:

> ./go-repo-export list-blobs atproto.com
bafkreiacrjijybmsgnq3mca6fvhtvtc7jdtjflomoenrh4ph77kghzkiii
bafkreib4xwiqhxbqidwwatoqj7mrx6mr7wlc5s6blicq5wq2qsq37ynx5y
bafkreibdnsisdacjv3fswjic4dp7tju7mywfdlcrpleisefvzf44c3p7wm
bafkreiebtvblnu4jwu66y57kakido7uhiigenznxdlh6r6wiswblv5m4py
[...]

> ./go-repo-export download-blobs atproto.com
writing blobs to: did:plc:ewvi7nxzyoun6zhxrhs64oiz/_blob
did:plc:ewvi7nxzyoun6zhxrhs64oiz/_blob/bafkreiacrjijybmsgnq3mca6fvhtvtc7jdtjflomoenrh4ph77kghzkiii  downloaded
did:plc:ewvi7nxzyoun6zhxrhs64oiz/_blob/bafkreib4xwiqhxbqidwwatoqj7mrx6mr7wlc5s6blicq5wq2qsq37ynx5y  downloaded
did:plc:ewvi7nxzyoun6zhxrhs64oiz/_blob/bafkreibdnsisdacjv3fswjic4dp7tju7mywfdlcrpleisefvzf44c3p7wm  downloaded
[...]

This will download blobs for a repository.

A more rigorous implementation should verify the blob CID (by hashing the downloaded bytes), at a minimum to detect corruption and errors.

Posting via the Bluesky API

August 11, 2023 · 11 min read

This blog post may become outdated as new features are added to atproto and the Bluesky application schemas.

First, you'll need a Bluesky account. We'll create a session with HTTPie (brew install httpie).

http post https://bsky.social/xrpc/com.atproto.server.createSession \
        identifier="$BLUESKY_HANDLE" \
        password="$BLUESKY_APP_PASSWORD"

Now you can create a post by sending a POST request to the createRecord endpoint.

http post https://bsky.social/xrpc/com.atproto.repo.createRecord \
    Authorization:"Bearer $AUTH_TOKEN" \
    repo="$BLUESKY_HANDLE" \
    collection=app.bsky.feed.post \
    record:="{\"text\": \"Hello world! I posted this via the API.\", \"createdAt\": \"`date -u +"%Y-%m-%dT%H:%M:%SZ"`\"}"

Posts can get a lot more complicated with replies, mentions, embedding images, and more. This guide will walk you through how to create these more complex posts in Python, but there are many API clients and SDKs for other programming languages and Bluesky PBC publishes atproto code in TypeScript and Go as well.

Skip the steps below and get the full script here. It was tested with Python 3.11, with the requests and bs4 (BeautifulSoup) packages installed.

Authentication

Posting on Bluesky requires account authentication. Have your Bluesky account handle and App Password handy.

import requests

BLUESKY_HANDLE = "example.bsky.social"
BLUESKY_APP_PASSWORD = "123-456-789"

resp = requests.post(
    "https://bsky.social/xrpc/com.atproto.server.createSession",
    json={"identifier": BLUESKY_HANDLE, "password": BLUESKY_APP_PASSWORD},
)
resp.raise_for_status()
session = resp.json()
print(session["accessJwt"])

The com.atproto.server.createSession API endpoint returns a session object containing two API tokens: an access token (accessJwt) which is used to authenticate requests but expires after a few minutes, and a refresh token (refreshJwt) which lasts longer and is used only to update the session with a new access token. Since we're just publishing a single post, we can get away with a single session and not bother with refreshing.

Post Record Structure

Here is what a basic post record should look like, as a JSON object:

{
  "$type": "app.bsky.feed.post",
  "text": "Hello World!",
  "createdAt": "2023-08-07T05:31:12.156888Z"
}

Bluesky posts are repository records with the Lexicon type app.bsky.feed.post — this just defines the schema for what a post looks like.

Each post requires these fields: text and createdAt (a timestamp).

This script below will create a simple post with just a text field and a timestamp. You'll need the datetime package installed.

import json
from datetime import datetime, timezone

# Fetch the current time
# Using a trailing "Z" is preferred over the "+00:00" format
now = datetime.now(timezone.utc).isoformat().replace("+00:00", "Z")

# Required fields that each post must include
post = {
    "$type": "app.bsky.feed.post",
    "text": "Hello World!",
    "createdAt": now,
}

resp = requests.post(
    "https://bsky.social/xrpc/com.atproto.repo.createRecord",
    headers={"Authorization": "Bearer " + session["accessJwt"]},
    json={
        "repo": session["did"],
        "collection": "app.bsky.feed.post",
        "record": post,
    },
)
print(json.dumps(resp.json(), indent=2))
resp.raise_for_status()

The full repository path (including the auto-generated rkey) will be returned as a response to the createRecord request. It looks like:

{
  "uri": "at://did:plc:u5cwb2mwiv2bfq53cjufe6yn/app.bsky.feed.post/3k4duaz5vfs2b",
  "cid": "bafyreibjifzpqj6o6wcq3hejh7y4z4z2vmiklkvykc57tw3pcbx3kxifpm"
}

Setting the Post's Language

Setting the post's language helps custom feeds or other services filter and parse posts.

This snippet sets the text and langs value of a post to be Thai and English.

# an example with Thai and English (US) languages
post["text"] = "สวัสดีชาวโลก!\nHello World!"
post["langs"] = ["th", "en-US"]

The resulting post record object looks like:

{
  "$type": "app.bsky.feed.post",
  "text": "\u0e2a\u0e27\u0e31\u0e2a\u0e14\u0e35\u0e0a\u0e32\u0e27\u0e42\u0e25\u0e01!\\nHello World!",
  "createdAt": "2023-08-07T05:44:04.395087Z",
  "langs": [ "th", "en-US" ]
}

The langs field indicates the post language, which can be an array of strings in BCP-47 format.

You can include multiple values in the array if there are multiple languages present in the post. The Bluesky Social client auto-detects the languages in each post and sets them as the default langs value, but a user can override the configuration on a per-post basis.

Mentions and Links

Mentions and links are annotations that point into the text of a post. They are actually part of a broader system for rich-text "facets." Facets only support links and mentions for now, but can be extended to support features like bold and italics in the future.

Suppose we have a post:

✨ example mentioning @atproto.com to share the URL 👨‍❤️‍👨 https://en.wikipedia.org/wiki/CBOR.

Our goal is to turn the handle (@atproto.com) into a mention and the URL (https://en.wikipedia.org/wiki/CBOR) into a link. To do that, we grab the starting and ending locations of each "facet".

✨ example mentioning @atproto.com to share the URL 👨‍❤️‍👨 https://en.wikipedia.org/wiki/CBOR.
             start=23^     end=35^            start=74^                          end=108^

We then identify them in the facets array, using the mention and link feature types. (You can view the schema of a facet object here.) The post record will then look like this:

{
  "$type": "app.bsky.feed.post",
  "text": "\u2728 example mentioning @atproto.com to share the URL \ud83d\udc68\u200d\u2764\ufe0f\u200d\ud83d\udc68 https://en.wikipedia.org/wiki/CBOR.",
  "createdAt": "2023-08-08T01:03:41.157302Z",
  "facets": [
    {
      "index": {
        "byteStart": 23,
        "byteEnd": 35
      },
      "features": [
        {
          "$type": "app.bsky.richtext.facet#mention",
          "did": "did:plc:ewvi7nxzyoun6zhxrhs64oiz"
        }
      ]
    },
    {
      "index": {
        "byteStart": 74,
        "byteEnd": 108
      },
      "features": [
        {
          "$type": "app.bsky.richtext.facet#link",
          "uri": "https://en.wikipedia.org/wiki/CBOR"
        }
      ]
    }
  ]
}

You can programmatically set the start and end points of a facet with regexes. Here's a script that parses mentions and links:

import re
from typing import List, Dict

def parse_mentions(text: str) -> List[Dict]:
    spans = []
    # regex based on: https://atproto.com/specs/handle#handle-identifier-syntax
    mention_regex = rb"[$|\W](@([a-zA-Z0-9]([a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?\.)+[a-zA-Z]([a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)"
    text_bytes = text.encode("UTF-8")
    for m in re.finditer(mention_regex, text_bytes):
        spans.append({
            "start": m.start(1),
            "end": m.end(1),
            "handle": m.group(1)[1:].decode("UTF-8")
        })
    return spans

def parse_urls(text: str) -> List[Dict]:
    spans = []
    # partial/naive URL regex based on: https://stackoverflow.com/a/3809435
    # tweaked to disallow some training punctuation
    url_regex = rb"[$|\W](https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*[-a-zA-Z0-9@%_\+~#//=])?)"
    text_bytes = text.encode("UTF-8")
    for m in re.finditer(url_regex, text_bytes):
        spans.append({
            "start": m.start(1),
            "end": m.end(1),
            "url": m.group(1).decode("UTF-8"),
        })
    return spans

Once the facet segments have been parsed out, we can then turn them into app.bsky.richtext.facet objects.

# Parse facets from text and resolve the handles to DIDs
def parse_facets(text: str) -> List[Dict]:
    facets = []
    for m in parse_mentions(text):
        resp = requests.get(
            "https://bsky.social/xrpc/com.atproto.identity.resolveHandle",
            params={"handle": m["handle"]},
        )
        # If the handle can't be resolved, just skip it!
        # It will be rendered as text in the post instead of a link
        if resp.status_code == 400:
            continue
        did = resp.json()["did"]
        facets.append({
            "index": {
                "byteStart": m["start"],
                "byteEnd": m["end"],
            },
            "features": [{"$type": "app.bsky.richtext.facet#mention", "did": did}],
        })
    for u in parse_urls(text):
        facets.append({
            "index": {
                "byteStart": u["start"],
                "byteEnd": u["end"],
            },
            "features": [
                {
                    "$type": "app.bsky.richtext.facet#link",
                    # NOTE: URI ("I") not URL ("L")
                    "uri": u["url"],
                }
            ],
        })
    return facets

The list of facets gets attached to the facets field of the post record:

post["text"] = "✨ example mentioning @atproto.com to share the URL 👨‍❤️‍👨 https://en.wikipedia.org/wiki/CBOR."
post["facets"] = parse_facets(post["text"])

Replies, Quote Posts, and Embeds

Replies and quote posts contain strong references to other records. A strong reference is a combination of:

AT URI: indicates the repository DID, collection, and record key
CID: the hash of the record itself

Posts can have several types of embeds: record embeds, images and exernal embeds (like link/webpage cards, which is the preview that shows up when you post a URL).

Replies

A complete reply post record looks like:

{
  "$type": "app.bsky.feed.post",
  "text": "example of a reply",
  "createdAt": "2023-08-07T05:49:40.501974Z",
  "reply": {
    "root": {
      "uri": "at://did:plc:u5cwb2mwiv2bfq53cjufe6yn/app.bsky.feed.post/3k43tv4rft22g",
      "cid": "bafyreig2fjxi3rptqdgylg7e5hmjl6mcke7rn2b6cugzlqq3i4zu6rq52q"
    },
    "parent": {
      "uri": "at://did:plc:u5cwb2mwiv2bfq53cjufe6yn/app.bsky.feed.post/3k43tv4rft22g",
      "cid": "bafyreig2fjxi3rptqdgylg7e5hmjl6mcke7rn2b6cugzlqq3i4zu6rq52q"
    }
  }
}

Since threads of replies can get pretty long, reply posts need to reference both the immediate "parent" post and the original "root" post of the thread.

Here's a Python script to find the parent and root values:

# Resolve the parent record and copy whatever the root reply reference there is
# If none exists, then the parent record was a top-level post, so that parent reference can be reused as the root value
def get_reply_refs(parent_uri: str) -> Dict:
    uri_parts = parse_uri(parent_uri)

    resp = requests.get(
        "https://bsky.social/xrpc/com.atproto.repo.getRecord",
        params=uri_parts,
    )
    resp.raise_for_status()
    parent = resp.json()

    parent_reply = parent["value"].get("reply")
    if parent_reply is not None:
        root_uri = parent_reply["root"]["uri"]
        root_repo, root_collection, root_rkey = root_uri.split("/")[2:5]
        resp = requests.get(
            "https://bsky.social/xrpc/com.atproto.repo.getRecord",
            params={
                "repo": root_repo,
                "collection": root_collection,
                "rkey": root_rkey,
            },
        )
        resp.raise_for_status()
        root = resp.json()
    else:
        # The parent record is a top-level post, so it is also the root
        root = parent

    return {
        "root": {
            "uri": root["uri"],
            "cid": root["cid"],
        },
        "parent": {
            "uri": parent["uri"],
            "cid": parent["cid"],
        },
    }

The root and parent refs are stored in the reply field of posts:

post["reply"] = get_reply_refs("at://atproto.com/app.bsky.feed.post/3k43tv4rft22g")

Quote Posts

A quote post embeds a reference to another post record. A complete quote post record would look like:

{
  "$type": "app.bsky.feed.post",
  "text": "example of a quote-post",
  "createdAt": "2023-08-07T05:49:39.417839Z",
  "embed": {
    "$type": "app.bsky.embed.record",
    "record": {
      "uri": "at://did:plc:u5cwb2mwiv2bfq53cjufe6yn/app.bsky.feed.post/3k44deefqdk2g",
      "cid": "bafyreiecx6dujwoeqpdzl27w67z4h46hyklk3an4i4cvvmioaqb2qbyo5u"
    }
  }
}

The record embedded here is the post that's getting quoted. The post record type is app.bsky.feed.post, but you can also embed other record types in a post, like lists (app.bsky.graph.list) and feed generators (app.bsky.feed.generator).

Images Embeds

Images are also embedded objects in a post. This example code demonstrates reading an image file from disk and uploading it, capturing a blob in the response:

IMAGE_PATH = "./example.png"
IMAGE_MIMETYPE = "image/png"
IMAGE_ALT_TEXT = "brief alt text description of the image"

with open(IMAGE_PATH, "rb") as f:
    img_bytes = f.read()

# this size limit is specified in the app.bsky.embed.images lexicon
if len(img_bytes) > 1000000:
    raise Exception(
        f"image file size too large. 1000000 bytes maximum, got: {len(img_bytes)}"
    )

# TODO: strip EXIF metadata here, if needed

resp = requests.post(
    "https://bsky.social/xrpc/com.atproto.repo.uploadBlob",
    headers={
        "Content-Type": IMAGE_MIMETYPE,
        "Authorization": "Bearer " + session["accessJwt"],
    },
    data=img_bytes,
)
resp.raise_for_status()
blob = resp.json()["blob"]

The blob object, as JSON, would look something like:

{
    "$type": "blob",
    "ref": {
        "$link": "bafkreibabalobzn6cd366ukcsjycp4yymjymgfxcv6xczmlgpemzkz3cfa"
    },
    "mimeType": "image/png",
    "size": 760898
}

The blob is then included in a app.bsky.embed.images array, along with an alt-text string. The alt field is required for each image. Pass an empty string if there is no alt text available.

post["embed"] = {
    "$type": "app.bsky.embed.images",
    "images": [{
        "alt": IMAGE_ALT_TEXT,
        "image": blob,
    }],
}

A complete post record, containing two images, would look something like:

{
  "$type": "app.bsky.feed.post",
  "text": "example post with multiple images attached",
  "createdAt": "2023-08-07T05:49:35.422015Z",
  "embed": {
    "$type": "app.bsky.embed.images",
    "images": [
      {
        "alt": "brief alt text description of the first image",
        "image": {
          "$type": "blob",
          "ref": {
            "$link": "bafkreibabalobzn6cd366ukcsjycp4yymjymgfxcv6xczmlgpemzkz3cfa"
          },
          "mimeType": "image/webp",
          "size": 760898
        }
      },
      {
        "alt": "brief alt text description of the second image",
        "image": {
          "$type": "blob",
          "ref": {
            "$link": "bafkreif3fouono2i3fmm5moqypwskh3yjtp7snd5hfq5pr453oggygyrte"
          },
          "mimeType": "image/png",
          "size": 13208
        }
      }
    ]
  }
}

Each post contains up to four images, and each image can have its own alt text and is limited to 1,000,000 bytes in size. Image files are referenced by posts, but are not actually included in the post (eg, using bytes with base64 encoding). The image files are first uploaded as "blobs" using com.atproto.repo.uploadBlob, which returns a blob metadata object, which is then embedded in the post record itself.

It's strongly recommended best practice to strip image metadata before uploading. The server (PDS) may be more strict about blocking upload of such metadata by default in the future, but it is currently the responsibility of clients (and apps) to sanitize files before upload today.

Website Card Embeds

A website card embed, often called a "social card," is the rendered preview of a website link. A complete post record with an external embed, including image thumbnail blob, looks like:

{
  "$type": "app.bsky.feed.post",
  "text": "post which embeds an external URL as a card",
  "createdAt": "2023-08-07T05:46:14.423045Z",
  "embed": {
    "$type": "app.bsky.embed.external",
    "external": {
      "uri": "https://bsky.app",
      "title": "Bluesky Social",
      "description": "See what's next.",
      "thumb": {
        "$type": "blob",
        "ref": {
          "$link": "bafkreiash5eihfku2jg4skhyh5kes7j5d5fd6xxloaytdywcvb3r3zrzhu"
        },
        "mimeType": "image/png",
        "size": 23527
      }
    }
  }
}

Here's an example of embedding a website card:

from bs4 import BeautifulSoup

def fetch_embed_url_card(access_token: str, url: str) -> Dict:

    # the required fields for every embed card
    card = {
        "uri": url,
        "title": "",
        "description": "",
    }

    # fetch the HTML
    resp = requests.get(url)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")

    # parse out the "og:title" and "og:description" HTML meta tags
    title_tag = soup.find("meta", property="og:title")
    if title_tag:
        card["title"] = title_tag["content"]
    description_tag = soup.find("meta", property="og:description")
    if description_tag:
        card["description"] = description_tag["content"]

    # if there is an "og:image" HTML meta tag, fetch and upload that image
    image_tag = soup.find("meta", property="og:image")
    if image_tag:
        img_url = image_tag["content"]
        # naively turn a "relative" URL (just a path) into a full URL, if needed
        if "://" not in img_url:
            img_url = url + img_url
        resp = requests.get(img_url)
        resp.raise_for_status()

        blob_resp = requests.post(
            "https://bsky.social/xrpc/com.atproto.repo.uploadBlob",
            headers={
                "Content-Type": IMAGE_MIMETYPE,
                "Authorization": "Bearer " + access_token,
            },
            data=resp.content,
        )
        blob_resp.raise_for_status()
        card["thumb"] = blob_resp.json()["blob"]

    return {
        "$type": "app.bsky.embed.external",
        "external": card,
    }

An external embed is stored under embed like all the others:

post["embed"] = fetch_embed_url_card(session["accessJwt"], "https://bsky.app")

On Bluesky, each client fetches and embeds this card metadata, including blob upload if needed. Embedding the card content in the record ensures that it appears consistently to everyone and reduces waves of automated traffic being sent to the referenced website, but it does require some extra work by the client.

Putting It All Together

A complete script, with command-line argument parsing, is available from this Git repository.

As mentioned at the beginning, we expect most folks will use SDKs or libraries for their programming language of choice to help with most of the details described here. But sometimes it is helpful to see what is actually going on behind the abstractions.

Federation Developer Sandbox Guidelines

June 20, 2023 · 7 min read

Update June 2024: Since earlier this year it has been possible to federate directly in the live atproto network. The sandbox environment is no longer active and has been shut down.

Welcome to the atproto federation developer sandbox! ✨

This is a completely separate network from our production services that allows us to test out the federation architecture and wire protocol.

The federation sandbox environment is an area set up for exploration and testing of the technical components of the AT Protocol distributed social network. It is intended for developers and self-hosters to test out data availability in a federated environment.

To maintain a positive and productive developer experience, we've established this Code of Conduct that outlines our expectations and guidelines. This sandbox environment is initially meant to test the technical components of federation.

Given that this is a testing environment, we will be defederating from any instances that do not abide by these guidelines, or that cause unnecessary trouble, and will not be providing specific justifications for these decisions.

Using the sandbox environment means you agree to adhere to our Guidelines. Please read the following carefully:

Post responsibly. The sandbox environment is intended to test infrastructure, but user content may be created as part of this testing process. Content generation can be automated or manual. Do not post content that requires active moderation or violates the Bluesky Community Guidelines.

Keep the emphasis on testing. We’re striving to maintain a sandbox environment that fosters learning and technical growth. We will defederate with instances that recruit users without making it clear that this is a test environment.

Do limit account creation. We don't want any one server using a majority of the resources in the sandbox. To keep things balanced, to start, we’re only federating with Personal Data Servers (PDS) with up to 1000 accounts. However, we may change this if needed.

Don’t expect persistence or uptime. We will routinely be wiping the data on our infrastructure. This is intended to reset the network state and to test sync protocols. Accounts and content should not be mirrored or migrated between the sandbox and real-world environments.

Don't advertise your service as being "Bluesky." This is a developer sandbox and is meant for technical users. Do not promote your service as being a way for non-technical users to use Bluesky.

Do not mirror sandbox did:plcs to production.

Status and Wipes

🐉 Beware of dragons!

This hasn’t been production tested yet. It seems to work pretty well, but who knows what’s lurking under the surface — that's what this sandbox is for! Have patience with us as we prep for federation.

On that note, please give us feedback either in Issues (actual bugs) or Discussions (higher-level questions/discussions) on the atproto repo.

🗓 Routine wipes

As part of the sandbox, we will be doing routine wipes of all network data.

We expect to perform wipes on a weekly or bi-weekly basis, though we reserve the right to do a wipe at any point.

When we wipe data, we will be wiping it on all services (BGS, App View, PLC). We will also mark any existing DIDs as “invalid” & will refuse to index those accounts in the next epoch of the network to discourage users from attempting to “rollover” their accounts across wipes.

Getting started ✨

Now that you've read the sandbox guidelines, you're ready to self-host a PDS in the developer sandbox. For complete instructions on getting your PDS set up, check out the README.

To access your account, you’ll log in with the client of your choice in the exact same way that you log into production Bluesky, for instance the Bluesky web client. When you do so, please provide the url of your PDS as the service that you wish to log in to.

Auto-updates

We’ve included Watchtower in the PDS distribution. Every day at midnight PST, this will check our GitHub container registry to see if there is a new version of the PDS container & update it on your service.

This will allow us to rapidly iterate on protocol changes, as we’ll be able to push them out to the network on a daily basis.

When we do routine network wipes, we will be pushing out a database migration to participating PDS that wipes content and accounts.

You are within your rights to disable Watchtower auto-updates, but we strongly encourage their use and will not be providing support if you decide not to run the most up-to-date PDS distribution.

Odds & Ends & Warnings & Reminders

🧪 Experiment & have fun!

🤖 Run feed generators. They should work the exact same way as production - be sure to adjust your env to listen to Sandbox BGS!

🌈 Feel free to run your own AppView or BGS - although it’s a bit more involved & we’ll be providing limited support for this.

👤 Your PDS will provide your handle by default. Custom domain handles should work exactly the same in sandbox as they do on production Bluesky. Although you will not be able to re-use your handle from production Bluesky as you can only have one DID set per handle.

🚨 If you follow the self-hosted PDS setup instructions, you’ll have private key material in your env file - be careful about sharing that!

📣 This is a sandbox version of a public broadcast protocol - please do not share sensitive information.

🤝 Help each other out! Respond to issues & discussions, chat in the community-run Matrix or Discord, etc.

Learn more about atproto federation

Check out the high-level view of federation.

Dive deeper with the atproto docs.

Network Services

We are running three services: PLC, BGS, Bluesky "App View"

PLC

Hostname: plc.bsky-sandbox.dev

Code: https://github.com/did-method-plc/did-method-plc

PLC is the default DID provider for the network. DIDs are the root of your identity in the network. Sandbox PLC functions exactly the same as production PLC, but it is run as a separate service with a separate dataset. The DID resolution client in the self-hosted PDS package is set up to talk the correct PLC service.

BGS

Hostname: bgs.bsky-sandbox.dev

Code: https://github.com/bluesky-social/indigo/tree/main/bgs

BGS (Big Graph Service) is the firehose for the entire network. It collates data from PDSs & rebroadcasts them out on one giant websocket.

BGS has to find out about your server somehow, so when we do any sort of write, we ping BGS with com.atproto.sync.requestCrawl to notify it of new data. This is done automatically in the self-hosted PDS package.

If you’re familiar with the Bluesky production firehose, you can subscribe to the BGS firehose in the exact same manner, the interface & data should be identical

Bluesky App View

Hostname: api.bsky-sandbox.dev

Code: https://github.com/bluesky-social/atproto/tree/main/packages/bsky

The Bluesky App View aggregates data from across the network to service the Bluesky microblogging application. It consumes the firehose from the BGS, processing it into serviceable views of the network such as feeds, post threads, and user profiles. It functions as a fairly traditional web service.

When you request a Bluesky-related view from your PDS (getProfile for instance), your PDS will actually proxy the request up to App View.

Feel free to experiment with running your own App View if you like!

The PDS

The PDS (Personal Data Server) is where users host their social data such as posts, profiles, likes, and follows. The goal of the sandbox is to federate many PDS together, so we hope you’ll run your own.

We’re not actually running a Bluesky PDS in sandbox. You might see Bluesky team members' accounts in the sandbox environment, but those are self-hosted too.

The PDS that you’ll be running is much of the same code that is running on the Bluesky production PDS. Notably, all of the in-pds-appview code has been torn out. You can see the actual PDS code that you’re running on the atproto/simplify-pds branch.

Feedback

We're excited for you to join us in the developer sandbox soon! Please give us feedback either in Issues (actual bugs) or Discussions (higher-level questions/discussions) on the atproto repo.

Why are blocks on Bluesky public?

June 8, 2023 · 11 min read

The technical implementation of public blocks and some possibilities for more privacy preserving block implementations — an area of active research and experimentation.

In April, we shipped a block feature to all users. Unlike on other centralized platforms, blocks on Bluesky are public and enumerable data, because all servers across the network need to know that they exist in order to respect the user’s request.

The current system of public blocks is just one aspect of our composable moderation stack, which we are actively building during our beta period. We’re working on more sophisticated individual and community-level interaction controls and moderation tooling, and we also encourage third-party community developers to contribute to this ecosystem.

In this post, we’ll share the technical implementation of public blocks and discuss some possibilities for more privacy preserving block implementations — an area of active research and experimentation. We welcome community suggestions, so if you have a proposal to share with us on how to implement private blocks after you read this post, please contribute to our public discussion here.

What are blocks?

At an abstract level, across many social media platforms, blocks between two accounts usually have the following features:

Symmetric: the behavior is the same regardless of which account initiated a block first
Mutual mute: neither account can read any content (public or private) from the other account, while logged in
Mutual interaction block: direct interactions between the two accounts are not allowed. This includes direct mentions resulting in a notification, replies to posts, direct messages (DMs), and follows (which normally result in notifications).

Blocks add a significant and high-impact degree of friction. There are many cases where this friction alone is sufficient to de-escalate conflict.

However, it is important to note that blocking does not prevent all possible interaction (even on centralized social networks). For example, when content is public, as it is on Bluesky, blogs, or websites, blocked people can still easily access the content by simply logging out or opening an incognito browser tab. Posts can still be screenshotted and shared either on-network or off-network. Harassment can continue to occur even without direct mentions or replies (”subtweeting,” posting screenshots, etc.).

On most existing services, the blockee can detect that they’ve been blocked, though it may not be immediately obvious. For example, if they’re able to navigate to the blocker’s profile, they may see a screen that says they’ve been blocked, or the absence of the profile is indication enough that they have been blocked. Most social apps provide each user with a list of the accounts that they have blocked.

You can read more about blocking behaviors on other platforms:

Twitter: https://help.twitter.com/en/using-twitter/blocking-and-unblocking-accounts
Mastodon: https://docs.joinmastodon.org/user/moderating/#block
Instagram: https://help.instagram.com/447613741984126

How are blocks currently implemented in Bluesky?

Blocks prevent interaction. Blocked accounts will not be able to like, reply, mention, or follow you, and if they navigate directly to your profile, they will see that they have been blocked. Like other public social networks, if they log out of their account or use a different account, they will be able to view your content. (This much is standard across centralized social networks as well.)

Currently, on Bluesky, you can view a list of your blocked accounts, and while the list of people who have blocked you is not surfaced in the app, developers familiar with the API could crawl the network to parse this information. This section will dive into the technical constraints that cause blocks to be public, and in a later section, we’ll discuss possible alternative implementations.

Blocks in Bluesky are implemented as part of the app.bsky.* application protocol, which builds on top of the underlying AT Protocol (atproto). Blocks are a record stored in account repositories. They look and behave very similarly to “follows”: the app.bsky.graph.block and app.bsky.graph.follow record schemas are nearly identical.

The block behavior is then implemented by several pieces of software. Servers and clients will index the block records and prevent actions which would have violated the intended behaviors: posts will not appear in feeds and reply threads; profile fetches will be empty or annotated with block state; creation of reply posts, quote posts, embeds, and mentions are blocked; any notifications involving the other account are additionally suppressed.

One of the core principles of the AT Protocol, which Bluesky is built on, is that account holders have total control over their own data. This means that while protocol-compliant clients and servers prevent blocked accounts from creating replies or other disallowed records in each user’s data repository, it is technically possible to bypass those restrictions if a client refuses to be protocol-compliant. The act of being blocked also does not result in any change to the blockee’s repository, and any old replies or mentions remain in place, untouched. For example, in the user-facing app, if someone replies to your post and then you block them, their replies will now be hidden to you. If you later decide to unblock them, their replies to that post will appear again, because the replies themselves were not deleted.

Despite blocks not removing the content of other user’s repositories, the data is not shown because blocks are primarily enforced by other nodes and services — personal data servers (PDS), App Views, and clients. One side effect that comes out of this architecture is that follow relationships are not changed due to a block, and “soft blocks” (rapid block/unblock) do not work as a mechanism to remove a follower. While a follow relationship might still exist in the graph, the block prevents any actual viewing or delivery of content. As future work, we can also ensure that details such as ”like” counts and “follower” accounts are updated when block status changes.

How will blocks work with open federation?

Bluesky is a public social network built on a protocol to support public conversation, so similar to blogs and websites, you do not need a Bluesky account in order to see content posted to the app. In order to support open federation where many servers, clients, and App Views are collaborating to surface content to users, each account’s data repository — which contains information like follows and blocks — must be public. All of the servers across the network must be able to read the data. Servers must know which accounts you have blocked in order to be able to enforce that relationship.

Once we launch federation there will be many personal data servers (PDS), clients, and App Views. The expectation is that virtually all accounts will be using clients and servers that respect blocking behavior.

It is this need for multiple parties to coordinate that necessitates blocks being public. “Mute” behavior can be implemented entirely in a client app because it only impacts the view of the local account holder. Blocks require coordination and enforcement by other parties, because the views and actions of multiple (possibly antagonistic) parties are involved.

In theory, a bad actor could create their own rogue client or interface which ignores some of the blocking behaviors, since the content is posted to a public network. But showing content or notifications to the person who created the block won’t be possible, as that behavior is controlled by their own PDS and client. It’s technically possible for a rogue client to create replies and mentions, but they would be invisible or at least low-impact to the recipient account for the same reasons. Protocol-compliant software in the ecosystem will keep such content invisible to other accounts on the network. If a significant fraction of accounts elected to use noncompliant rogue infrastructure, we would consider that a failure of the entire ecosystem.

Remember that clever bypasses of the blocking behaviors are already possible on most networks (centralized or not), and it is the added friction that matters.

Are there other ways to implement blocks in federated systems?

Yes, and we are actively exploring other implementations and novel research areas to inform our development on the AT Protocol. We also welcome community suggestions and discussions on this topic.

One example is ActivityPub, which is the protocol that Mastodon is built on. ActivityPub does not require public blocks because content there is not globally public by default — this is also why picking which server you join matters, because it limits the content that you see. Despite this, Mastodon does sometimes show block information to other parties, which is a frequent topic of discussion in the ActivityPub ecosystem.

As we currently understand it, on Mastodon, you only see content when there is an explicit follow relationship between accounts and servers, and follows require mutual consent. (In practice, most follow requests are auto-accepted, so this behavior is not always obvious to end users.) The mutual-mute behavior that blocks require can be implemented on Mastodon by first, disallowing any follows between the two accounts, and second, by adding a regular “mute.” Similar to Bluesky, the interaction-block behavior relies on enforcement by both the server and the client. So on Mastodon too, it’s possible that a bad actor implements a server that ignores blocks and displays blocked replies in threads. Both ActivityPub and AT Protocol can use de-federation as an enforcement mechanism to disconnect from servers that don’t respect blocks.

Technical approaches we’ve considered for private blocks

One proposed mechanism to make blocks less public on Bluesky is the use of bloom filters. The basic idea is to encode block relationships in a statistical data structure, and to distribute that data structure instead of the set of actual blocks. The data structure would make it easy to check if there was a block relationship between two specific accounts, but not make it easy to list all of the blocks. Other servers and clients in the network would then use the data structure to enforce the blocking behaviors. The bloom filters could either be per-account (aka, a bloom filter stored in a record), or per-PDS, or effectively global, with individual PDS instances submitting block relationships to a trusted central service which would publish the bloom filter lists. We considered a scheme like this before implementing blocks, but there are a few issues and concerns:

Bloom filters don’t fully prevent enumerating blocks, and if a bad actor was only interested in specific accounts, they could still easily find the list of blocked accounts. Bloom filters really only add a mask, and it would still be relatively easy to enumerate blocks. While the full matrix of possible block relationships is NxN (where N is the number of accounts in the network, which could ultimately be upwards of hundreds of millions in the future) might be too large to test against, in reality, a bad actor would likely only be targeting prominent accounts or specific communities. In that case, only on the order of billions of possible relationships would need to be tested, which would be trivial on modern hardware.
Bloom filters are computationally expensive. While bloom filters are known for efficiently reducing the storage size for looking up a large number of hashes, they have a large overhead compared to individual hashes. In the context of blocks, every creation or deletion of a block record would potentially require the generation and distribution of a full-sized bloom filter. The storage and bandwidth overhead becomes significant at scale, especially since a significant fraction of social media accounts could have many thousands of blocks.
Latency problems persist in mitigations for bloom filter overhead. The above storage and bandwidth concerns could be mitigated by “batching,” or through a trusted central service. But those solutions have their own problems with latency (time until block is enforced across the network) and trust and reliability (in a central service, which would have the full enumeration of block relationships).

The team is still actively discussing this option, and it’s possible that the extra effort and resources required by bloom filters is worth the imperfect but additional friction that they provide. At the moment, it’s not entirely obvious to us that the tradeoff is worth it. While we’re currently iterating on other moderation and account safety features, we decided to initially release blocks with this simple public system as a first pass.

Some other proposals we’re exploring include:

Label-based block enforcement. Instead of trying to prevent all violations of blocking relationships across the network, scan for violations of them and label them.
Interaction gating. Place authority for post threads and quote posts in the original poster’s PDS, so block information doesn’t need to leave that server.
Zero-knowledge proofs. We’re aware of existing ZK approaches to distributed blocks, such as SNARKBlock, and we’re speaking with trusted advisors about this open area of research and experimentation. Perhaps this research might lead to us deploying a novel system in the future.
Trusted App Views. Accounts could privately register their blocks with their PDS, and then these servers would forward block metadata to a small number of “blessed” App Views.

If you have experience here or have thoughts about how to implement private block relationships in decentralized systems, we’d love to hear from you. Please contribute to our discussion here.

An open network of services​

Decentralized moderation​

Where moderation is applied​

How are labels defined?​

Running a labeler​

Infrastructure moderation​

In summary​

Early Access Limitations​

Account Migration​

Getting Started​

Privacy Notice​

Download a Repository​

On Bluesky's Main PDS Instance​

On Another Instance​

Parse Records from CAR File as JSON​

What's a CAR File?​

The Repository CAR File​

Downloading Blobs​

Authentication​

Post Record Structure​

Setting the Post's Language​

Mentions and Links​

Replies, Quote Posts, and Embeds​

Replies​

Quote Posts​

Images Embeds​

Website Card Embeds​

Putting It All Together​

Status and Wipes​