Steinar H. Gunderson: Rewriting Git merge history, part 2
In part 1,
we discovered the problem of rewriting git history in the presence of
nontrivial merges. Today, we’ll discuss the workaround I chose.
As I previously mentioned, and as Julia Evans’ excellent data model document
explains, a git commit is just a snapshot of a tree (suitably deduplicated
by means of content hashes), a commit message and a (possibly empty)
set of parents. So fundamentally, we don’t really need to mess with diffs;
if we can make the changes we want directly to the tree (well, technically,
make a new tree that looks like what we want, and a new commit using that
tree), we’re good.
(Diffs in git are, generally, just git diff looking at two trees and
trying to make sense of them. This has the unfortunate result that there
is no solid way of representing a rename; there are heuristics, but if
you rename a file and change it in the same commit, they may fail
and stuff like git blame or git log may be broken, depending on flags.
Gerrit doesn’t even seem to understand a no-change copy.)
In earlier related cases, I’ve taken this to the extreme by simply
hand-writing a commit using git commit-tree. Create exactly the state
that you want by whatever means, commit it in some dummy commit and then use
that commit’s tree with some suitable commit message and parent(s); voila.
But it doesn’t help us with history; while we can fix up an older commit
in exactly the way we’d like, we also need the later commits to have our
new fixed-up commit as parent.
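The hand-written-commit trick can be sketched in a throwaway repository. Everything specific here (the scratch directory, the file name, the identity and the messages) is made up for illustration; only the use of git commit-tree with a borrowed tree and a chosen parent comes from the description above.

```shell
#!/bin/sh
set -eu

# Hypothetical scratch repository so the sketch is self-contained.
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.name "Example Author"
git config user.email "author@example.invalid"
git config commit.gpgsign false

echo "original" > file.txt
git add file.txt
git commit -q -m "base"
base=$(git rev-parse HEAD)

# Create exactly the state we want, by whatever means, and park it in a dummy commit.
echo "fixed" > file.txt
git commit -q -a -m "dummy"
tree=$(git rev-parse 'HEAD^{tree}')

# Hand-write the real commit: the dummy commit's tree, our own message and parent.
new=$(git commit-tree "$tree" -p "$base" -m "Fix file.txt")

# Point the current branch at the hand-written commit; the dummy is now unreferenced.
git reset -q --hard "$new"
git cat-file -p "$new"   # prints the tree, parent, author, committer and message
```

Note that nothing after the new commit is touched: any later commits would still name the old commit as their parent, which is exactly the limitation described above.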
Thus, enter git filter-branch. git filter-branch comes with a suitable
set of warnings about eating your repository and being deprecated (I never
really figured out its supposed replacement git filter-repo, so I won’t
talk much about it), but it’s useful when all else fails.
In particular, git filter-branch allows you to do arbitrary changes
to the tree of a series of commits, updating the parent commit IDs
as rewrites happen. So if you can express your desired changes in a
way that’s better than “run the editor” (or if you’re happy running
the editor and making the same edit manually 300 times!), you can
just run that command over all commits in the entire branch
(forgive me for breaking lines a bit):
git filter-branch -f --tree-filter \
  '! [ -f src/cluster.cpp ] || sed -i "s/if (mi.rank != 0)/if (mi.rank != 0 \&\& mi.rank == rank())/" src/cluster.cpp' \
  665155410753978998c8080c813da660fc64bbfe^..cluster-master
This is suitably terrible. Remember, if we only did this for one commit,
the change wouldn’t be there in the next one (git diff would show that
it was immediately reverted), so filter-branch needs to do this over and
over again, once for each commit (tree) in the branch. And I wanted multiple
fixups, so I had a bunch of these; some of them were as simple as “copy this
file from /tmp” and some were shell scripts that did things like running
clang-format.
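Since a tree filter runs once per commit, it helps to try it outside git first. The following is a minimal sketch, assuming a scratch directory that stands in for one checked-out commit; the fix_tree helper and the one-line sample file are hypothetical, and it writes through a temporary file rather than relying on sed -i.

```shell
#!/bin/sh
set -eu

# Hypothetical scratch tree standing in for one checked-out commit.
work=$(mktemp -d)
cd "$work"
mkdir src
printf 'if (mi.rank != 0) {\n' > src/cluster.cpp

# Stand-in for the --tree-filter body: succeed quietly when the file is absent
# (commits from before it existed), otherwise apply the substitution.
fix_tree() {
    [ -f src/cluster.cpp ] || return 0
    # \& keeps the ampersands literal; a bare & would insert the matched text.
    sed 's/if (mi.rank != 0)/if (mi.rank != 0 \&\& mi.rank == rank())/' \
        src/cluster.cpp > src/cluster.cpp.tmp
    mv src/cluster.cpp.tmp src/cluster.cpp
}

fix_tree
fix_tree   # a second run changes nothing: the old pattern no longer matches
patched=$(cat src/cluster.cpp)
echo "$patched"

rm src/cluster.cpp
fix_tree   # no file, no error; a failing filter would abort the whole rewrite
```

The two properties worth checking are the ones filter-branch depends on: the filter must exit successfully on trees where the file does not exist, and it must be harmless to apply it to a tree that already has the change.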
You can do similar things for commit messages; at some point, I figured
I should write “cluster” (the official name for the branch) and not
“cluster-master” (my local name) in the merge messages, so I could just do
git filter-branch --msg-filter 'sed s/cluster-master/cluster/g' 665155410753978998c8080c813da660fc64bbfe^..cluster-master
I also did a bunch of them to fix up my email address (GIT_COMMITTER_EMAIL
wasn’t properly set), although I cannot honestly remember whether I used
--env-filter or something else.
Perhaps that was actually with git rebase and `-r --exec 'git commit --amend
--no-edit --author …'` or similar. There are many ways to do ugly things.
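For the --env-filter route, a sketch might look like the following. It is not a reconstruction of what was actually run; the two addresses are placeholders, and the dry run simply evaluates the filter text against a simulated environment instead of invoking git.

```shell
#!/bin/sh
set -eu

# Hypothetical addresses; substitute the real wrong/right pair.
env_filter='
if [ "$GIT_COMMITTER_EMAIL" = "wrong@example.invalid" ]; then
    GIT_COMMITTER_EMAIL="right@example.invalid"
    export GIT_COMMITTER_EMAIL
fi
'

# Dry run outside git: simulate the environment provided for one commit.
GIT_COMMITTER_EMAIL="wrong@example.invalid"
eval "$env_filter"
fixed=$GIT_COMMITTER_EMAIL

# A commit with some other committer address must be left alone.
GIT_COMMITTER_EMAIL="someone-else@example.invalid"
eval "$env_filter"
untouched=$GIT_COMMITTER_EMAIL

echo "$fixed $untouched"

# Real use would look something like this (range left as a placeholder):
#   git filter-branch --env-filter "$env_filter" <first-commit>^..<branch>
```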
Eventually, I had the branch mostly in a state where I thought it would be
ready for review, but after uploading to GitHub, one reviewer commented that
some of my merges against master were of commits that didn’t exist in master.
Huh? That’s… surprising.
It took a fair bit of digging to figure out what had happened:
git filter-branch had rewritten some commits that it didn’t
actually have to; the merge sources from upstream. This is normally
harmless, since git hashes are deterministic, but these commits were signed
by the author! And filter-branch (or perhaps fast-export, upon which it
builds?) generally assumes that it can’t sign stuff with other people’s keys,
so it just strips the signatures, deeming that better than having
invalid ones sitting around. Now, of course, these commit signatures would still
be valid since we didn’t change anything, but evidently, filter-branch
doesn’t have any special code for that.
Removing an object like this (a “gpgsig” attribute, it seems) changes the
commit hash, which is where the phantom commits came from. I couldn’t get
filter-branch to turn it off… but again, parents can be freely changed,
diffs don’t exist anyway. So I wrote a little script that took in
parameters suitable for git commit-tree (mostly the parent list),
rewrote known-bad parents to known-good parents, gave the script to
git filter-branch --commit-filter, and that solved the problem.
(I guess --parent-filter would also have worked; I don’t think I saw
it in the man page at the time.)
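The parent-rewriting script could look roughly like this. It is a sketch of the idea, not the script that was actually used: the commit IDs are placeholders, and map_parent and rewrite_args are hypothetical helper names. It relies on the commit filter being handed arguments in git commit-tree form (a tree ID followed by -p and a parent ID for each parent).

```shell
#!/bin/sh
set -eu

# Hypothetical placeholder IDs: BAD is a needlessly rewritten upstream commit,
# GOOD is the original (signed) upstream commit that should be used instead.
BAD=1111111111111111111111111111111111111111
GOOD=2222222222222222222222222222222222222222
OTHER=3333333333333333333333333333333333333333

# Map one parent ID; unknown IDs pass through unchanged.
map_parent() {
    case "$1" in
        "$BAD") printf '%s\n' "$GOOD" ;;
        *)      printf '%s\n' "$1" ;;
    esac
}

# Echo commit-tree style arguments (<tree> -p <parent> ...) with each parent mapped.
rewrite_args() {
    prev=""
    for arg in "$@"; do
        if [ "$prev" = "-p" ]; then
            map_parent "$arg"
        else
            printf '%s\n' "$arg"
        fi
        prev=$arg
    done
}

# Inside a real --commit-filter script, the last line would be something like:
#   git commit-tree $(rewrite_args "$@")
# with the commit message arriving on stdin.
rewritten=$(echo $(rewrite_args TREE -p "$BAD" -p "$OTHER"))
echo "$rewritten"
```

With more than a handful of bad parents, the case statement would be better replaced by a lookup file, but the shape stays the same: leave the tree and message alone and swap only the parent IDs.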
So, well, I won’t claim this is an exercise in elegance. (Perhaps my next
adventure will be figuring out how this works in jj, which supposedly
has conflicts as more of a first-class concept.) But it got the
job done in a couple of hours after fighting with rebase for a long time,
the PR was reviewed, and now the Stockfish cluster branch is a little bit
more alive.
