Steinar H. Gunderson: Rewriting Git merge history, part 1
I remember that when Git was new and hip (around 2005), one of the supposed
advantages was that “merging is so great!”. Well, to be honest, the
competition at the time (mostly CVS and Subversion) wasn’t fantastic,
so I guess it was a huge improvement, but it’s still… problematic.
And this is even more visible when trying to rewrite history.
The case in question was that I needed to bring
Stockfish’s cluster (MPI) branch up to date
with master, which nobody had done for a year and a half because there
had been a number of sort-of tricky internal refactorings that caused merge
conflicts. I fairly quickly realized that just doing “git merge master”
would create a huge mess of unrelated conflicts that would be impossible
to review and bisect, so I settled on a different strategy: Take one conflict
at a time.
So I basically merged up as far as I could without any conflicts (essentially
by bisecting), noted that as a merge commit, then merged one conflicting
commit, noted that as another merge (with commit notes if the merge was
nontrivial, e.g., if it required new code or a new approach), and then
repeated. Git doesn’t seem to have any kind of native support for
this workflow; I did it manually at first, and only later realized that there
were so many segments (20+) that I should write a script to keep everything
consistent. Notably, this approach means that a merge commit can have
significant new code that was not in either parent. (Git’s data model does
allow this, because a commit is just a list of zero or more parent
commits and then the contents of the entire tree; git show does a diff
on-the-fly, and object deduplication and compression make this work without
ballooning the size. But it is still surprising to those that don’t do a
lot of merges.)
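As a rough illustration of the “merge as far as possible” step (a sketch,
not the script mentioned above), the bisection could look like this in Python.
The function names are made up for this example. It assumes that once merging
up to some commit conflicts, merging up to any later commit conflicts too,
which is not guaranteed in general. The optional Git-backed predicate assumes
Git 2.38 or newer, where git merge-tree --write-tree reports conflicts through
its exit status.

```python
import subprocess
from typing import Callable, Optional, Sequence


def furthest_clean_merge(
    commits: Sequence[str],
    merges_cleanly: Callable[[str], bool],
) -> Optional[str]:
    """Binary-search for the last commit that merges without conflicts.

    `commits` is ordered oldest to newest. Assumes (hypothetically) that
    conflicts are monotonic: once a commit conflicts, all later ones do too.
    """
    lo, hi = 0, len(commits)  # commits[:lo] are clean, commits[hi:] conflict
    while lo < hi:
        mid = (lo + hi) // 2
        if merges_cleanly(commits[mid]):
            lo = mid + 1
        else:
            hi = mid
    return commits[lo - 1] if lo > 0 else None


def merges_cleanly_with_git(commit: str) -> bool:
    """Trial-merge `commit` into HEAD without touching the working tree.

    Assumes Git >= 2.38: exit status 0 means clean, 1 means conflicts.
    """
    result = subprocess.run(
        ["git", "merge-tree", "--write-tree", "HEAD", commit],
        capture_output=True,
    )
    if result.returncode not in (0, 1):
        raise RuntimeError(result.stderr.decode(errors="replace"))
    return result.returncode == 0


# Demo with a fake predicate: everything from c6 onwards conflicts.
commits = [f"c{i}" for i in range(10)]
print(furthest_clean_merge(commits, lambda c: int(c[1:]) < 6))  # c5
```

After merging up to the returned commit, the next commit in the list is the
one to merge (and resolve) on its own, and then the search starts over from
there.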
That’s where the nice parts ended, and the problems began. (Even ignoring
that a conflict-free merge could break the compile, of course.) I realized
that while I had merged everything, it wasn’t actually done; the MPI
support didn’t even compile, for one, and once I had fixed that, I realized
that I wanted to fix typos in commit messages, fix bugs pointed out to me
by reviewers, and so on. In short, I wanted to rewrite history. And that’s
not where Git shines.
Everyone who works with a patch-based review flow (as opposed to having
a throwaway branch per feature with lots of commits like “answer review
comments #13” and then squash-merging it or similar) will know that git’s
basic answer to this is git rebase. rebase essentially sets up a script
of what commits you’ve done, then executes that script (potentially
at a different starting point, so you could get conflicts). Interactive
rebase simply lets you edit that script in various ways, so that you can
e.g. modify a commit message on the way, or take out a commit, or (more
interestingly) make changes to a commit before continuing.
However, when merges are involved, regular interactive rebase just breaks
down completely. It assumes that you don’t really want merges; you just
want a nice linear series of commits. And that’s nice, except that in this
case, I wanted the merges because the entire point was to upmerge.
So then I needed to invoke git rebase --rebase-merges, which turns the
script language into one that’s subtly different
and vastly more complicated (it basically sets up a list of ephemeral
branches as “labels” to specify the trees that are merged into the various
merge commits). And this is fine, until you want to edit that script.
In particular, let’s take a fairly trivial change: Modifying a commit
message. The merge command in the rebase script takes in a commit hash
that’s only used for the commit message and nothing else (the contents
of the tree are ignored), and you can choose to either use a different
hash or modify the message in an editor after-the-fact. And you can try
to do this, but… then you get a merge conflict later in the rebase. What?
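To make that edit concrete, here is a small Python sketch that flips one merge
line in a rebase script from -C (reuse the message as-is) to -c (reuse it, but
open the editor first). The script text below only imitates what
git rebase -i --rebase-merges generates; every hash, label and subject in it
is invented for the example, and reword_merge is a hypothetical helper, not
part of Git.

```python
import re

# Hypothetical rebase script in the --rebase-merges format.
# All hashes, labels and subjects are made up for illustration.
TODO = """\
label onto

reset onto
pick 1111111 Refactor search
label master-part-1

reset onto
pick 2222222 Add MPI support
merge -C 3333333 master-part-1 # Merge master, part 1
"""


def reword_merge(todo: str, commit: str) -> str:
    """Change `merge -C <commit>` to `merge -c <commit>` for one merge."""
    pattern = re.compile(rf"^merge -C ({re.escape(commit)})\b", re.MULTILINE)
    return pattern.sub(r"merge -c \1", todo)


print(reword_merge(TODO, "3333333"))
```

Only the line naming the given commit is touched; all other lines of the
script pass through unchanged.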
It turns out that git has a native machinery for remembering conflict
resolutions. It basically remembers that you tried to merge commit
A and B and ended up committing C (possibly after manual conflict resolution);
so any merge of A and B will cause git to look that up and just use C.
But that’s no longer the merge being attempted; since you modified the commit
message of A (or even just its commit date), its hash changed and it became A’,
and now you’re trying to merge A’ and B, for which git has no conflict
resolution remembered, and you’re back to square one and have to do
the resolution yourself. I had assumed that git remembered how to
merge trees, but evidently it keys on entire commits.
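Why A turns into A’ is easy to show: a commit’s ID is a hash over its tree,
its parents, the author/committer lines and the message, so touching any of
them gives a new ID even when the tree is identical. The Python sketch below
computes commit IDs the way a SHA-1 repository does. The identity string, the
messages and the lookup table at the end are assumptions for illustration; the
table is only a toy stand-in for the behavior described above, not Git’s
actual code.

```python
import hashlib

# The well-known ID of Git's empty tree, used so both commits share a tree.
EMPTY_TREE = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"


def commit_id(
    tree: str,
    parents: list[str],
    message: str,
    ident: str = "A U Thor <author@example.com> 0 +0000",  # made-up identity
) -> str:
    """Compute the SHA-1 object ID of a commit with these fields."""
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]
    lines += [f"author {ident}", f"committer {ident}", "", message]
    body = ("\n".join(lines) + "\n").encode()
    header = f"commit {len(body)}\0".encode()
    return hashlib.sha1(header + body).hexdigest()


a = commit_id(EMPTY_TREE, [], "Fix evaluaton bug")         # A, typo in message
a_prime = commit_id(EMPTY_TREE, [], "Fix evaluation bug")  # A', same tree
b = commit_id(EMPTY_TREE, [], "Refactor search")           # B

# Toy model: a result remembered per pair of commit IDs, not per pair of trees.
remembered = {(a, b): "tree of C"}

print(a == a_prime)                # False
print((a, b) in remembered)        # True
print((a_prime, b) in remembered)  # False
```

A lookup keyed on the trees instead would still have matched here, since A
and A’ have exactly the same contents.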
But wait, I hear you say; the solution for this is
git-rerere!
rerere exists precisely for this purpose; it remembers conflict
resolutions you’ve done before and tries to reapply them. It only
remembers merge conflicts you did when rerere was actually active,
but there’s a contrib script to “learn” from before that time,
which works OK. So I tried to run the learn script and run the
rebase… and it stopped with a merge conflict. You see, git rerere
doesn’t stop the conflicts, it just resolves them and then you
have to continue the rebase yourself from the shell as usual.
So I did that 20+ times (I can tell you, this gets tedious
real quick)… and ended up with a different result. The tree
simply wasn’t the same as before the rebase, even though I had
only changed a commit message.
See, the problem is that rerere remembers conflicts, not
merges. It has to, in order to reach its goal of being able to
reapply conflict resolutions even if other parts of the file
have changed. (Otherwise, it would be only marginally more useful
than git’s existing native support, which we discussed earlier.)
But in this case, two or more conflicts in the rebase looked too similar to
each other, yet needed different resolutions. So it picked the wrong
resolution and ended up with a silent mismerge. And there’s no
way to guide it towards which one should apply when, so rerere
was also out of the question.
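The failure mode can be shown with a deliberately simplified toy model in
Python. It keys a resolution only on the text of the two conflicting sides,
which is an assumption made for this sketch; the real rerere is more
elaborate, and ToyRerere and the code snippets in the example are invented.
The point is just that a cache keyed on what the conflict looks like cannot
tell apart two merges that present the same conflict but need different
answers.

```python
import hashlib
from typing import Optional


class ToyRerere:
    """Toy conflict cache keyed on conflict text only (not real rerere)."""

    def __init__(self) -> None:
        self._db: dict[str, str] = {}

    @staticmethod
    def conflict_id(ours: str, theirs: str) -> str:
        return hashlib.sha1(f"{ours}\0{theirs}".encode()).hexdigest()

    def record(self, ours: str, theirs: str, resolution: str) -> None:
        # The first resolution seen for a given conflict text wins.
        self._db.setdefault(self.conflict_id(ours, theirs), resolution)

    def replay(self, ours: str, theirs: str) -> Optional[str]:
        return self._db.get(self.conflict_id(ours, theirs))


cache = ToyRerere()
ours, theirs = "int nodes = 0;", "uint64_t nodes = 0;"

# Two different merges hit the same-looking conflict, with different fixes.
first_fix = "uint64_t nodes = 0;"
second_fix = "std::atomic<uint64_t> nodes{0};"
cache.record(ours, theirs, first_fix)
cache.record(ours, theirs, second_fix)

# Replaying the second merge silently gets the first merge's resolution.
print(cache.replay(ours, theirs))                # uint64_t nodes = 0;
print(cache.replay(ours, theirs) == second_fix)  # False
```

Nothing in the lookup says which merge is being replayed, so there is nowhere
to hang a hint about which resolution belongs where.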
This post is already long enough as it is; next time, we’ll
discuss the (horrible) workaround I used to actually (mostly)
solve the problem.
