bbblog

downloading a slice of atproto

using tap and ruby to fetch every Bluesky Starter Pack

May 08, 2026

I'm working on a little experiment for which I want to analyze all of the starter packs people have created on Bluesky. All of the data on atproto is public by design so this shouldn't be too hard, right?

I did some digging and came across this document about how to backfill data from the network. tl;dr: use tap with a client library. Shouldn't be too hard, right? Right??

wait hang on what is tap

tap is a self-hosted service that handles the nitty-gritty1 of synchronizing with the network. One important detail is that it synchronizes existing data directly from the PDS network and then processes events from the firehose2 so you can stay in sync.

The idea is that you identify what parts of the network you want to have locally and configure tap to backfill those records. tap does its magic and emits events that you can consume to integrate the data into your application.

A complete backfill system is tap + a consumer. Consumers are custom code that takes the tap events and does whatever you need with them. In my case that's just shoving them into the database but you can get fancier if you like.

You'll probably want a library to help your consumer interface with tap. @atproto/tap is the reference typescript client but there are plenty of options. I'm working in Ruby and I picked tapfall.
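To make the consumer idea concrete, here's a minimal sketch in plain Ruby. This is illustrative only, not tapfall's actual API; the event field names are my assumption of roughly what a tap record event carries, and the "database" is just a hash.

```ruby
# Hypothetical consumer: take a tap record event and upsert it into a
# store. Event field names are assumptions for illustration.
class Consumer
  def initialize(store = {})
    @store = store
  end

  # event: { "did" => ..., "collection" => ..., "rkey" => ..., "record" => {...} }
  def handle(event)
    key = [event["did"], event["collection"], event["rkey"]].join("/")
    @store[key] = event["record"]  # "shove it into the database"
    key
  end
end

consumer = Consumer.new
consumer.handle(
  "did" => "did:plc:example",
  "collection" => "app.bsky.graph.starterpack",
  "rkey" => "3kabc",
  "record" => { "name" => "my pack" }
)
# => "did:plc:example/app.bsky.graph.starterpack/3kabc"
```

In real life the store would be your database and you'd dispatch on the collection, but the shape is the same: events in, rows out.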

how to tap

Today I'm going to be talking about running tap locally to fetch data for development purposes but it's worth knowing that tap is designed to run as a service alongside your production infrastructure. It requires a database and you will need to write some code to process the records from tap.

The database is used to manage tap's internal state so you can start and stop it without losing your backfill progress. Tap uses sqlite by default but you can also use postgres if you like.

You'll need to have go installed and then you can install tap:

$ go install github.com/bluesky-social/indigo/cmd/tap@latest

Then you can run tap with whatever arguments make sense for your application. For me I'm filtering it to three collections:

$ ~/go/bin/tap run --no-replay \
    --collection-filters app.bsky.graph.starterpack \
    --collection-filters app.bsky.graph.list \
    --collection-filters app.bsky.graph.listitem

Your go binaries might be somewhere else3. Also note that tap will create a sqlite database called tap.db in the current directory. You can use --db-url sqlite:///path/to/db if you want it elsewhere.

By default tap starts in a mode where you must explicitly add repos4 using tap's HTTP API. Your client library probably knows how to do this. For example, in tapfall it's #add_repo(did).
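Under the hood that's just an HTTP request to tap. A sketch in plain Ruby for the curious — note that the host, port, endpoint path, and payload shape here are all my assumptions for illustration, not tap's documented API; in practice let your client library (e.g. tapfall's #add_repo) handle it:

```ruby
require "net/http"
require "json"
require "uri"

# Hypothetical request builder for tap's add-repo endpoint.
# Host, port, path, and body shape are assumptions.
TAP_URL = URI("http://localhost:8080/repos")

def add_repo_request(did)
  req = Net::HTTP::Post.new(TAP_URL)
  req["Content-Type"] = "application/json"
  req.body = JSON.generate({ did: did })
  req
end

req = add_repo_request("did:plc:example")
# Net::HTTP.start(TAP_URL.host, TAP_URL.port) { |http| http.request(req) }
```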

You can continue with this mode for as long as it suits you but see Network Boundary Modes for alternatives. For my case I want every starter pack so I'll add --signal-collection app.bsky.graph.starterpack to my args which tells tap to automatically add any repo that has starter pack records.

For reasons I'll discuss shortly, I'm only going to backfill starterpack and app.bsky.actor.profile5 with tap so my final command line looks like this:

$ ~/go/bin/tap run --no-replay \
    --signal-collection app.bsky.graph.starterpack \
    --collection-filters app.bsky.graph.starterpack \
    --collection-filters app.bsky.actor.profile

the thing about filtering

tap's filtering can only take you so far. If you need all of a specific record type tap has you covered. If you need some of a specific record type then it's up to you to filter the records downstream of tap. This is easy enough to do but it does mean that you will be spending a lot of network resources6 on content you're going to throw away7.

tap protip #1: know your data model

This brings me to my first protip: take the time to understand the data model of the records you're interested in. I did not do that and got to enjoy the following experiences:

  • backfilled an entire lexicon only to discover that it's a wrapper around a different lexicon I didn't know about

  • backfilled the second lexicon only to discover that it's used for several different things and I just downloaded a million records I don't need and it refers to a third lexicon that I also didn't know about

  • also the third lexicon is used for several different things

  • presumably other things that I simply haven't experienced yet!

To give a concrete example, I'm trying to scrape every starter pack. Starter packs are stored as app.bsky.graph.starterpack objects that refer8 to an app.bsky.graph.list to handle list membership.

Unfortunately list is also how you store moderation lists and lists for the Lists tab in the bsky UI, so if you tell tap you want to backfill app.bsky.graph.list you're getting all of those as well. You can filter these out during the consumption phase but that happens after you download the records from the atproto network and process them with tap.
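The good news is that the record itself tells you which kind of list it is via its purpose field. Here's a sketch of the consumption-phase filter; the purpose tokens come from the app.bsky.graph.defs lexicon as I understand it, so verify them against the current lexicon before relying on this:

```ruby
# Keep only lists whose purpose marks them as starter pack reference
# lists. Purpose tokens are from app.bsky.graph.defs (verify!).
STARTER_PACK_PURPOSE = "app.bsky.graph.defs#referencelist"

def starter_pack_list?(record)
  record["$type"] == "app.bsky.graph.list" &&
    record["purpose"] == STARTER_PACK_PURPOSE
end

lists = [
  { "$type" => "app.bsky.graph.list", "purpose" => STARTER_PACK_PURPOSE },
  { "$type" => "app.bsky.graph.list", "purpose" => "app.bsky.graph.defs#modlist" },
  { "$type" => "app.bsky.graph.list", "purpose" => "app.bsky.graph.defs#curatelist" },
]
lists.count { |r| starter_pack_list?(r) }  # => 1
```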

The problem is even worse with list membership. Those records are stored as app.bsky.graph.listitem objects with references to the list and subject9. Based on my experience with list I suspect I'm going to be throwing away far more objects than I keep.
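For reference, a listitem record looks roughly like this (field names are per the app.bsky.graph.listitem lexicon as I understand it; the values are made up):

```ruby
# Sketch of an app.bsky.graph.listitem record. Double-check field
# names against the lexicon itself; values here are invented.
listitem = {
  "$type"     => "app.bsky.graph.listitem",
  "subject"   => "did:plc:example",  # the user this membership is about
  "list"      => "at://did:plc:owner/app.bsky.graph.list/3kabc123",
  "createdAt" => "2026-05-08T00:00:00Z",
}
```

Note that nothing in the record says what kind of list it belongs to — you have to resolve the list reference to find out, which is exactly why the filtering problem is worse here.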

a (hopefully) more efficient approach

I'm going to continue using tap for app.bsky.graph.starterpack records. These are the only records that we need help finding; every other record in our little slice of the network is related to a specific starterpack so we can crawl records starting from there. I'll also use it for bsky profile data since we always want the profile of anyone who owns a starter pack and it's easier to get that delivered than to fetch it ourselves10.

To get the rest of the data we'll use slingshot (for list records) and constellation (for list memberships), facilitated by a background processing system. I'm using sidekiq because it's the first thing I thought of but anything11 should work fine.

Here's the general processing flow:

  1. tap is started using the command line given above to backfill app.bsky.graph.starterpack and app.bsky.actor.profile12 records for any repo that contains starterpack records

  2. the tap consumer listens for those events (as well as identity events) and injects them into the job queue for further processing

  3. a dispatcher job picks up those records and decides what to do with them, enqueueing one or more further jobs to process them or to fetch additional data

Nothing especially groundbreaking but it works pretty well.
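The dispatcher in step 3 can be sketched as a simple routing table from collection to jobs. The job names here are invented for illustration; in the real thing they'd be sidekiq workers:

```ruby
# Map a tap event's collection to the job(s) that should handle it.
# Job names are placeholders, not real classes from my codebase.
ROUTES = {
  "app.bsky.graph.starterpack" => [:process_starter_pack, :fetch_list],
  "app.bsky.actor.profile"     => [:upsert_profile],
}.freeze

def jobs_for(event)
  ROUTES.fetch(event[:collection], [])
end

jobs_for(collection: "app.bsky.graph.starterpack")  # => [:process_starter_pack, :fetch_list]
jobs_for(collection: "something.else")              # => []
```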

πŸ•–πŸ•˜πŸ•š a few hours later πŸ•πŸ•‘πŸ•’


Let me just take a big sip of protein shake and have a quick peek at my network graphs hey how did this bird get in here13

Ben Bleything:

250 GIGABYTES of data downloaded in the last twelve hours lololol I must be doing something very wrong this is turning into a serious project instead of the half day lark it was meant to be

I dug into it and as best I can tell I was done fetching new records within a couple of hours and everything I processed after that was an update to either a starter pack or a profile. Here's the thing: people update their profiles a lot. And every time you get an update from tap it's the full record.

I don't have anything instrumented and I only have port-level network stats so I can't say for sure it was tap but it was definitely tap. That said, I did have a lot of inefficiency in my data processing flow (including a ton of redundant xrpc calls) so there are certainly multiple factors. That 12 hour window also captures the tail end of my initial testing when I was firehosing WAY too many records but I can see from the graph that it's a minor contribution.
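One mitigation for the redundant processing (not the bandwidth — tap has already downloaded the bytes by the time you see the event): hash each incoming record and skip downstream work when nothing actually changed. A sketch, with an in-memory hash standing in for wherever you'd persist the digests:

```ruby
require "digest"
require "json"

# Cheap change detection: remember a digest per record and skip
# writes/jobs when an "update" is byte-identical to what we have.
SEEN = {}

def changed?(did, collection, rkey, record)
  key = [did, collection, rkey]
  digest = Digest::SHA256.hexdigest(JSON.generate(record))
  return false if SEEN[key] == digest
  SEEN[key] = digest
  true
end

changed?("did:plc:x", "app.bsky.actor.profile", "self", { "displayName" => "Ben" })  # => true
changed?("did:plc:x", "app.bsky.actor.profile", "self", { "displayName" => "Ben" })  # => false
```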

tap protip #2: don't run it at home

Unless you've got good bandwidth and truly unlimited transfer I would not recommend running this kind of setup at home. I spent 250gb of my data cap14 on:

  • 75,309 starter packs

  • 75,245 lists

  • 63,310 users

  • 2,559,463 list membership records

Not a bad haul but it's certainly not 250gb on disk and I'm also sure it's not a complete dataset. Good thing my data cap resets in just 23 short days!!!!!

At peak I was averaging 102mbps over a 5 minute period. 12.5mbps avg/5min was the lowest I ever saw. It's just a lot of data, and remember I'm talking about small numbers of records here. If you're trying to backfill more or busier records you might be in for a Fun time. Be careful, do your homework, and you should be fine.

I think a cheap VPS is probably the best way to run tap so I'll experiment with that.

what's this about an incomplete dataset

Yeah so honestly the data is kind of a mess. Here's everything I've found so far that should be true but isn't:

  • there should be a 1:1 mapping of lists to starter packs

  • every starter pack should have a corresponding user

  • every list should have at least one member

I should be able to fix the problems with starter packs with the data I already have but list membership is trickier. I didn't know this when I started but I have learned that constellation does not have a full copy of the network15. That means that I can't rely on it for list membership. I've already found several lists that have members in reality but that constellation does not have indexed.

I'm not entirely sure what to do about this. I think I'll need to scrape each user's listitem records and match them up with their corresponding list. I was trying to avoid that amount of manual scraping but I'm not sure I can get around it. If you have any ideas please let me know.
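If it comes to that, the scrape would mean paging through each user's listitem records with com.atproto.repo.listRecords, a standard atproto XRPC endpoint served by the user's PDS. Here's a sketch that just builds the request URL; resolving the actual PDS host from the user's DID document is out of scope, so the host below is a placeholder:

```ruby
require "uri"

# Build a com.atproto.repo.listRecords URL for one page of a user's
# listitem records. Pass the cursor from the previous response to page.
def list_records_url(pds_host, did, cursor: nil)
  params = { repo: did, collection: "app.bsky.graph.listitem", limit: 100 }
  params[:cursor] = cursor if cursor
  URI::HTTPS.build(
    host:  pds_host,
    path:  "/xrpc/com.atproto.repo.listRecords",
    query: URI.encode_www_form(params)
  )
end

list_records_url("pds.example.com", "did:plc:example").to_s
# => "https://pds.example.com/xrpc/com.atproto.repo.listRecords?repo=did%3Aplc%3Aexample&collection=app.bsky.graph.listitem&limit=100"
```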

okay that's enough for today

You said it, friend. Thanks for reading and find me on bluesky if you want to tell me what I'm doing wrong or otherwise discuss any of this.

Stay tuned for our next episode where I'll have figured all of this out and will also finally reveal why exactly I want all this data. Coming... soon?


1.
see How Backfilling Works
2.
have you noticed there are a lot of birds out today?
3.
I'm on macos with go installed via homebrew and that's where it is for me
4.
in atproto a "repo" is the user's datastore. read more
5.
you hear a crow in the distance
6.
both your own and the atproto network
7.
your crow friend alights upon the power line outside your window
8.
defer?
9.
"subject" is a general term in atproto design that means "the record this record is about". in the case of listitem subject is a reference to a user
10.
the crow tilts its head and peers at you
11.
I once worked on a system that used a series of Maildirs as a queue. don't do that. you can do anything but that.
12.
"caw," says the crow. it doesn't seem like it should be this loud
13.
"CAW, CAW," cries the crow,"I AM THE CROW OF FORESHADOWING. I'VE BEEN HERE THE WHOLE TIME" before instantly vanishing without a trace
14.
my 1tb data cap, of which ~450gb has now been consumed
15.
their website says they have the past 465 days of data (as of 2026-05-08) and "indexing new records in real time, backfill coming soon!"


the personal electronic internet e-log for noted computer user Ben Bleything