The Exit Interview: JP Phillips
JP Phillips is off to greener, or at least calmer, pastures. He joined us 4 years ago to build the next generation of our orchestration system, and has been one of the anchors of our engineering team. His last day is today. We wanted to know what he was thinking, and figured you might too.
Question 1: Why, JP? Just why?
LOL. When I looked at what I wanted to see from here in the next 3-4 years, it didn't really match up with where we're currently heading. Specifically, with our new focus on MPG [Managed Postgres] and [llm] [llm].
Editorial comment: Even I don't know what [llm] is.
The Fly Machines platform is more or less finished, in the sense of being capable of supporting the next iteration of our products. My original desire to join Fly.io was to make Machines a product that would rid us of HashiCorp Nomad, and I feel like that's been accomplished.
Where were you hoping to see us headed?
More directly positioned as a cloud provider, rather than a platform-as-a-service; further along the customer journey from "developers" and "startups" to large established companies.
And, it's not that I disagree with PaaS work or MPG! Rather, it's not something that excites me in a way that I'd feel challenged and could continue to grow technically.
Follow up question: does your family know what you're doing here? Doing to us? Are they OK with it?
Yes, my family was very involved in the decision, before I even talked to other companies.
What's the thing you're happiest about having built here? It cannot be "all of flyd".
We've enabled developers to run workloads from an OCI image and an API call all over the world. On any other cloud provider, the knowledge of how to pull that off comes with a professional certification.
In what file in our nomad-firecracker repository would I find that code?
https://docs.machines.dev/#tag/machines/post/apps/{app_name}/machines
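Editorial sketch: for readers who haven't made that call, here is roughly what "an OCI image and an API call" looks like from the client side. This is a minimal sketch, not code from our repositories. The helper name `build_create_request` is hypothetical, and the base URL and the `region` and `config.image` body fields reflect our reading of the public Machines API docs linked above; check them there before relying on this. The function only builds the request and does not send it.

```python
import json
import urllib.request

# Assumed base URL for the public Machines API; verify against the docs.
API_BASE = "https://api.machines.dev/v1"


def build_create_request(app_name, image, region, token):
    """Build (but do not send) a POST that asks for one new Fly Machine.

    Hypothetical helper for illustration only. The body carries the region
    to place the Machine in and the OCI image to boot.
    """
    body = {"region": region, "config": {"image": image}}
    return urllib.request.Request(
        url=f"{API_BASE}/apps/{app_name}/machines",
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )
```

Sending it would be a single `urllib.request.urlopen(req)` call with a real app name and API token.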

So you mean, literally, the whole Fly Machines API, and flaps, the API gateway for Fly Machines?
Yes, all of it. The flaps API server, the flyd RPCs it calls, the flyd finite state machine system, the interface to running VMs.
Is there something you especially like about that design?
I like that it for the most part doesn't require any central coordination. And I like that the P90 for Fly Machine create calls is sub-5-seconds for pretty much every region except for Johannesburg and Hong Kong.
I think the FSM design is something I'm proud of; if I could take any code with me, it'd be the internal/fsm in the nomad-firecracker repo.
You can read more about the flyd orchestrator JP led over here. But, a quick decoder ring: flyd runs independently without any central coordination on thousands of "worker" servers around the globe. It's structured as an API server for a bunch of finite state machine invocations, where an FSM might be something like "start a Fly Machine" or "create a new Fly Machine" or "cordon off a Fly Machine so we can update it". Each FSM invocation consists of a bunch of steps, each of those steps has callbacks into the flyd code, and each step is logged in a BoltDB database.
Thinking back, there are like two archetypes of insanely talented developers I've worked with. One is the kind that belts out ridiculous amounts of relatively sophisticated code on a whim, at like 3AM. Jerome [who leads our fly-proxy team] is that type. The other comes to projects with what feels like fully-formed, coherent designs that are not super intuitive, and the whole project just falls together around that design. Did you know you were going to do the FSM log thing when you started flyd?
I definitely didn't have any specific design in mind when I started on flyd. I think the FSM stuff is a result of work I did at Compose.io / MongoHQ (where it was called "recipes"/"operations") and the work I did at HashiCorp using Cadence.
Once I understood what the product needed to do and look like, having a way to perform deterministic and durable execution felt like a good design.
Cadence?
Cadence is the child of AWS Step Functions and the predecessor to Temporal (the company).
One of the biggest gains, with how it works in flyd, is knowing we would need to deploy flyd all day, every day. If flyd was in the middle of doing some work, it needed to pick back up right where it left off, post-deploy.
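Editorial sketch: the "log every step, pick up where you left off" idea fits in a few lines. This is a toy under stated assumptions, not flyd's internal/fsm. `StepLog`, `run_fsm`, `make_steps`, and the step names are all hypothetical, and an append-only JSON-lines file stands in for BoltDB. A run that dies partway through is simply re-invoked with the same run ID, and steps already in the log are skipped.

```python
import json
import os


class StepLog:
    """Append-only record of completed steps (a stand-in for BoltDB)."""

    def __init__(self, path):
        self.path = path

    def completed(self, run_id):
        """Return {step name: state saved after that step} for one run."""
        done = {}
        if os.path.exists(self.path):
            with open(self.path) as f:
                for line in f:
                    rec = json.loads(line)
                    if rec["run_id"] == run_id:
                        done[rec["step"]] = rec["state"]
        return done

    def record(self, run_id, step, state):
        """Durably append one finished step before moving to the next."""
        with open(self.path, "a") as f:
            f.write(json.dumps({"run_id": run_id, "step": step, "state": state}) + "\n")
            f.flush()
            os.fsync(f.fileno())


def run_fsm(log, run_id, steps, state):
    """Run steps in order, skipping any the log says already finished."""
    done = log.completed(run_id)
    executed = []
    for name, fn in steps:
        if name in done:
            state = done[name]  # restore the state saved after this step
            continue
        state = fn(state)
        log.record(run_id, name, state)
        executed.append(name)
    return state, executed


class Crash(Exception):
    """Simulates the process dying mid-run, e.g. during a deploy."""


def make_steps(crash_on=None):
    """Hypothetical steps for a 'create a Fly Machine' style invocation."""

    def pull_image(s):
        return {**s, "image_pulled": True}

    def create_rootfs(s):
        if crash_on == "create_rootfs":
            raise Crash("killed before create_rootfs finished")
        return {**s, "rootfs": True}

    def boot(s):
        return {**s, "status": "started"}

    return [("pull_image", pull_image), ("create_rootfs", create_rootfs), ("boot", boot)]
```

Calling `run_fsm` once with `make_steps(crash_on="create_rootfs")` and then again with `make_steps()` against the same log file runs only the unfinished steps the second time. The sketch assumes steps are deterministic and safe to retry, since a step that crashed before being logged runs again from the top.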
OK, next question. What's the most impressive thing you saw someone else build here? To make things simpler and take some pressure off the interview, we can exclude any of my works from consideration.
Probably corrosion2.
Sidebar: corrosion2 is our state distribution system. While flyd runs individual Fly Machines for users, each instance is solely responsible for its own state; there's no global scheduler. But we have platform components, most obviously fly-proxy, our Anycast router, that need to know what's running where. corrosion2 is a Rust service that does SWIM gossip to propagate information from each worker into a CRDT-structured SQLite database. corrosion2 essentially means any component on our fleet can do SQLite queries to get near-real-time information about any Fly Machine around the world.
If for no other reason than that we deployed corrosion, learned from it, and were able to make significant and valuable improvements, and then migrate to the new system in a short period of time.
Having a "just SQLite" interface, for async replicated changes around the world in seconds, it's pretty powerful.
If we invested in Antithesis or TLA+ testing, I think there's potential for other companies to get value out of corrosion2.
Just as a general-purpose, gossip-based SQLite CRDT system?
Yes.
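Editorial sketch: to make the "just SQLite" point concrete, here is a toy replica that merges gossiped rows and answers ordinary SQL queries. It is not corrosion2's schema or merge algorithm. The `machines` table, its columns, and the helper names are hypothetical, and a simple highest-version-wins rule stands in for the real CRDT machinery. The property worth noticing is that replicas converge no matter what order changes arrive in.

```python
import sqlite3

# Hypothetical table; the real system's schema is not shown in this interview.
SCHEMA = """
CREATE TABLE IF NOT EXISTS machines (
    id      TEXT PRIMARY KEY,
    worker  TEXT NOT NULL,
    region  TEXT NOT NULL,
    state   TEXT NOT NULL,
    version INTEGER NOT NULL
)
"""


def open_replica():
    """Each component keeps its own local SQLite copy of fleet state."""
    db = sqlite3.connect(":memory:")
    db.execute(SCHEMA)
    return db


def apply_change(db, change):
    """Merge one gossiped row. The higher version wins, so duplicate or
    reordered deliveries are harmless. Equal versions keep the existing row;
    a real CRDT would need a deterministic tiebreak here.
    Uses SQLite upsert syntax, which needs SQLite 3.24 or newer."""
    db.execute(
        """
        INSERT INTO machines (id, worker, region, state, version)
        VALUES (:id, :worker, :region, :state, :version)
        ON CONFLICT(id) DO UPDATE SET
            worker = excluded.worker,
            region = excluded.region,
            state = excluded.state,
            version = excluded.version
        WHERE excluded.version > machines.version
        """,
        change,
    )
    db.commit()


def started_in(db, region):
    """The payoff: consumers ask plain SQL questions of local state."""
    rows = db.execute(
        "SELECT id FROM machines WHERE region = ? AND state = 'started' ORDER BY id",
        (region,),
    )
    return [row[0] for row in rows]
```

A consumer such as a router would call something like `started_in(db, "jnb")` against its local replica and never talk to a central scheduler.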
OK, you're being too nice. What's your least favorite thing about the platform?
GraphQL. No, Elixir. It's a tie between GraphQL and Elixir.
But probably GraphQL, by a hair.
That's not the answer I expected.
GraphQL slows everyone down, and everything. Elixir only slows me down.
The rest of the platform, you're fine with? No complaints?
I'm happier now that we have pilot.
pilot is our new init. When we launch a Fly Machine, init is our foothold in the machine; this is unlike a normal OCI runtime, where "pid 1" is often the user's entrypoint program. Our original init was so simple people dunked on it and said it might as well have been a bash script; over time, init has sprouted a bunch of new features. pilot consolidates those features, and, more importantly, is itself a complete OCI runtime; pilot can natively run containers inside of Fly Machines.
Before pilot, there really wasn't any contract between flyd and init. And init was just "whatever we wanted init to be". That limited its ability to serve us.
Having pilot be an OCI-compliant runtime with an API for flyd to drive is a big win for the future of the Fly Machines API.
Was I right that we should have used SQLite for flyd, or were you wrong to have used BoltDB?
I still believe Bolt was the right choice. I've never lost a second of sleep worried that someone is about to run a SQL update statement on a host, or across the whole fleet, and mangle all our state data. And limiting the storage interface, by not using SQL, kept flyd's scope managed.
On the engine side of the platform, which is what flyd is, I still believe SQL is too powerful for what flyd does.
If you had this to do over again, would Bolt be precisely what you'd pick, or is there something else you'd want to try? Some cool-ass new KV store?
Nah. But, I'd maybe consider a SQLite database per-Fly-Machine. Then the scope of danger is about as small as it could possibly be.
Whoa, that's an interesting thought. People sleep on the "keep a zillion little SQLites" design.
Yeah, with per-Machine SQLite, once a Fly Machine is destroyed, we can just zip up the database and stash it in object storage. The biggest hold-up I have about it is how we'd manage the schemas.
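Editorial sketch: the per-Machine idea is speculative, so treat this as a thought experiment rather than anything we run. The file layout, the `events` table, and both helper names are hypothetical. Each Machine gets its own database file, the schema version is stamped into the file with SQLite's `PRAGMA user_version` (one possible handle on the schema-management worry), and destroying the Machine compresses the file for upload. Object storage itself is left out.

```python
import gzip
import os
import shutil
import sqlite3

SCHEMA_VERSION = 1  # bumped whenever the per-Machine schema changes


def open_machine_db(root, machine_id):
    """Open (or create) the database that belongs to exactly one Machine."""
    path = os.path.join(root, f"{machine_id}.db")
    db = sqlite3.connect(path)
    (version,) = db.execute("PRAGMA user_version").fetchone()
    if version == 0:
        # Fresh file: create the hypothetical schema and stamp its version.
        db.execute("CREATE TABLE events (seq INTEGER PRIMARY KEY, step TEXT NOT NULL)")
        db.execute(f"PRAGMA user_version = {SCHEMA_VERSION}")
        db.commit()
    return db, path


def archive_machine_db(db, path):
    """On destroy: close the database, gzip the file, remove the original.

    Returns the path of the compressed copy, ready to be uploaded.
    """
    db.close()
    archived = path + ".gz"
    with open(path, "rb") as src, gzip.open(archived, "wb") as dst:
        shutil.copyfileobj(src, dst)
    os.remove(path)
    return archived
```

Because every archived file carries its own `user_version`, a reader years later can tell which schema it is looking at, though migrating thousands of live files in place is still the hard part.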
OpenTelemetry: were you right all along?
One hundred percent.
I basically attribute oTel at Fly.io to you.
Without oTel, it'd be a disaster trying to troubleshoot the system. I'd have ragequit trying.
I remember there being a cost issue, with how much Honeycomb was going to end up charging us to manage all the data. But that seems silly in retrospect.
For sure. It is 100% part of the decision and the conversation. But: we didn't have the best track record running a logs/metrics cluster at this fidelity. It was worth the money to pay someone else to manage tracing data.
Strong agree. I think my only issue is just the extent to which it cruds up code. But I need to get over that.
Yes, it's very explicit. I think the next big part of oTel is going to be auto-instrumentation, for profiling.
You're a veteran Golang programmer. Say 3 nice things about Rust.
Most of our backend is in Go, but fly-proxy, corrosion2, and pilot are in Rust.
- Option.
- Match.
- Serde macros.
Even I can't say shit about Option and match.
Match is so much better than anything in Go.
Elixir, Go, and Rust. An honest take on that programming cocktail.
Three's a crowd, Elixir can stay home.
If you could only lose one, you'd keep Rust.
I've learned its shortcomings and the productivity far outweighs having to deal with the Rust compiler.
You'd be unhappy if we moved the flaps API code from Go to Elixir.
Correct.
I kind of buy the idea of doing orchestration and scheduling code, which is policy-intensive, in a higher-level language.
Maybe. If Ruby had a better concurrency story, I don't think Elixir would have a place for us.
Here I need to note that Ruby is functionally dead here, and Elixir is ascendant.
We have an idiosyncratic management structure. We're bottom-up, but ambiguously so. We don't have roadmaps, except when we do. We have minimal top-down technical direction. Critique.
It's too easy to lose sight of whether your current focus [in what you're building] is valuable to the company.
The first thing I warn every candidate about on our "do-not-work-here" calls.
I think it comes down to execution, and accountability to actually finish projects. I spun a lot trying to figure out what would be the most valuable work for Fly Machines.
You don't have to be so nice about things.
We struggle a lot with consistent communication. We change direction a little too often. It got to a point where I didn't see a point in devoting time and effort into projects, because I'd not be able to show enough value quickly enough.
I see things paying off later than we'd hoped or expected they would. Our secret storage system, Pet Sematary, is a good example of this. Our K8s service, FKS, is another obvious one, since we're shipping MPG on it.
This is your second time working for Kurt, at a company where he's the CEO. Give him a 1-4 star rating. He can take it! At least, I think he can take it.
2022: ★★★★
2023: ★★
2024: ★★✩
2025: ★★★✩
On a four-star scale.
Whoa. I did not expect a histogram. Say more about 2023!
We hired too many people, too quickly, and didn't have the guardrails and structure in place for everybody to be successful.
Also: GPUs!
Yes. That was my next comment.
Do we secretly agree about GPUs?
I think so.
Our side won the argument in the end! But at what cost?
They were a killer distraction.
Final question: how long will you remain in the first-responder on-call rotation after you leave? I assume at least until August. I have a shift this weekend; can you swap with me? I keep getting weekends.
I am going to be asleep all weekend if any of my previous job changes are indicative.
I sleep through on-call too! But nobody can yell at you for it now. I think you have the comparative advantage over me in on-calling.
Yes I will absolutely take all your future on-call shifts, you have convinced me.
All this aside: it has been a privilege watching you work. I hope your next gig is 100x more relaxing than this was. Or maybe I just hope that for myself. Except: I'll never escape this place. Thank you so much for doing this.
Thank you! I'm forever grateful for having the opportunity to be a part of Fly.io.
