Re: Gerrit and git-filter-branch Shawn Pearce Tue May 04 15:00:31 2010

Mark <[EMAIL PROTECTED]> wrote:
> We're using gerrit with a git tree that was converted from svn. Our
> svn repository had some largish binary objects - not massive but they
> take up far more space than the actual code.
> Some of our developers are working remotely without access to reliable
> broadband and so doing a full clone is quite a slow process. On top of
> that it does seem unnecessary as we have those files stored elsewhere
> now. I've read up about how to do a git-filter-branch to remove all
> versions of an object from history etc. I have a few questions about
> how this will work with Gerrit if it can...

Yup.  I've done this sort of rewrite exactly once... for the
JGit project.  We were forced to rewrite our history through
filter-branch, but I wanted to save the Gerrit Code Review records.

> The problem, as I see it, is that all the commit ids going back a very
> long way will have changed and Gerrit will not be able to reconcile
> its database with the Git repository.

That's true.

> Assuming that this does work
> (and I'd like to know how...) then if I do the necessary git filter-
> branch changes to my local repository will it be sufficient to simply
> push the rewritten history to gerrit?

If you push the rewritten history to Gerrit, it'll update the Git
branches to point to them, but all reviews will still be looking
at the old pre-rewritten versions.  So effectively you will have
two copies of every reviewed-and-submitted change:

 * one from before filter-branch, attached to the review;
 * one from after filter-branch, merged into the branch(es)

> Would I also need to force the
> expiry of unreachable references and run gc on the server repository?

To completely discard these, yes, you would need to delete everything
from the refs/changes/ area on the Gerrit repository, and also delete
those change records from the SQL database.  That's not very pretty,
but it can be done.

If you go this route, make sure to check not just $project.git/refs/changes
but also $project.git/packed-refs to purge out the refs/changes/ names.  At
that point you also can probably just delete every change record from the
database for the affected projects.
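As a rough sketch of the ref cleanup described above, the following goes through Git's own plumbing rather than editing the files by hand, so loose refs and packed-refs entries are handled the same way. The function name purge_change_refs is hypothetical, and the sketch assumes the Gerrit server is stopped and that $1 is the path to the project's Git directory. It does not touch the SQL change records.

```shell
#!/bin/sh
# purge_change_refs is a hypothetical helper name, not a Gerrit command.
# Usage: purge_change_refs /path/to/project.git
purge_change_refs() {
  repo=$1
  # for-each-ref lists loose and packed refs alike; update-ref -d removes
  # a ref from wherever it is stored, including packed-refs.
  git --git-dir="$repo" for-each-ref --format='%(refname)' refs/changes |
  while read -r ref; do
    git --git-dir="$repo" update-ref -d "$ref"
  done
  # Expire reflog entries and prune unreachable objects to reclaim space.
  git --git-dir="$repo" reflog expire --expire=now --all
  git --git-dir="$repo" gc --quiet --prune=now
}
```

Running it against a scratch copy of the repository first is a cheap way to confirm that only refs/changes/ names disappear.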

> So then, assuming that it works and is possible, everyone else's local
> repositories will be so far out of sync with the server that they'd
> probably have to clone again (which should suddenly be much much
> faster). Right?

> Assuming that I'm on the right track to this point what would happen
> to existing reviews etc in Gerrit? Would Gerrit be able to 'find' them
> based on the ChangeIds or something like that? If there was an
> existing change in Gerrit awaiting review I'd assume we'd need to
> rebase it onto the new master before trying to merge... What about
> changes that have already been merged with Gerrit - will Gerrit still
> be able to attach the review history to those changes?

A rewrite is *painful*.

The "naive just run filter-branch and push it up to Gerrit" process
will disconnect the review data from the actual commits that are
in the rewritten branch now.  It won't correlate the way you want
it to.  But it might be OK that the commit SHA-1s don't match.
Or it might not be.

If you need them to match, you also need to rewrite the Gerrit
SQL data.

I used the following script to execute filter-branch when I rewrote
JGit's history:


For your needs, the only part that matters is the commit-filter.
It saves a mapping of old commit SHA-1 to new commit SHA-1 so we
can update Gerrit later:



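  # commit.map gets one "old,new" line per commit; the exact format is an
  # assumption. MAP must be absolute: the filter runs in a temporary directory.
  export MAP="$(pwd)/commit.map"
  git filter-branch --commit-filter '
    n=$(git commit-tree "$@")
    echo "$GIT_COMMIT,$n" >>"$MAP"
    echo $n
  ' \
  $(git for-each-ref --format='%(refname)' refs/heads refs/changes)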

I ran this on a bare repository that was a full clone from
the server, so we would have both refs/heads and refs/changes
present. That is:

  git clone --bare ssh://...:29418/project.git project.git
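Before rewriting, a quick check that the clone really contains refs/changes/ is worth doing, since a plain bare clone may bring down only branches and tags. The sketch below counts the change refs and, if there are none, fetches them explicitly from the remote named origin. The function name ensure_change_refs is hypothetical, and it assumes origin still points at the Gerrit server.

```shell
#!/bin/sh
# ensure_change_refs is a hypothetical helper name.
# Usage: ensure_change_refs /path/to/project.git
# Prints the number of refs/changes/* refs present afterwards.
ensure_change_refs() {
  repo=$1
  n=$(git --git-dir="$repo" for-each-ref refs/changes | wc -l | tr -d ' ')
  if [ "$n" -eq 0 ]; then
    # Fetch the change refs explicitly; the leading + allows overwriting.
    git --git-dir="$repo" fetch origin '+refs/changes/*:refs/changes/*'
  fi
  git --git-dir="$repo" for-each-ref refs/changes | wc -l | tr -d ' '
}
```

If the count is still zero afterwards, the server has no change refs for that project, or origin points somewhere else.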

After I was happy with the rewrite, I created a temporary table in
my SQL database to hold the commit.map data:

  CREATE TABLE commit_map (
    old_id VARCHAR(40) NOT NULL,
    new_id VARCHAR(40) NOT NULL);
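How the map file gets loaded depends on the database. One portable option, sketched here, is to turn it into INSERT statements and feed those to the database's command-line client. This assumes commit.map holds one "old,new" pair per line (the file format is not given in this message), and map_to_sql is a hypothetical helper name.

```shell
#!/bin/sh
# map_to_sql is a hypothetical helper; it assumes one "old,new" pair per line.
# Usage: map_to_sql commit.map > commit_map.sql
map_to_sql() {
  awk -F, '
    # Accept only lines holding two 40-character lowercase hex SHA-1s,
    # so nothing unexpected ends up inside the generated SQL.
    length($1) == 40 && length($2) == 40 &&
    $1 !~ /[^0-9a-f]/ && $2 !~ /[^0-9a-f]/ {
      printf "INSERT INTO commit_map (old_id, new_id) VALUES (\047%s\047, \047%s\047);\n", $1, $2
    }
  ' "$1"
}
```

For a large history, the database's bulk loader will likely be faster than row-by-row inserts; the generated file is mainly convenient because it works the same way everywhere.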

I loaded commit.map into that table, and then updated the information
tables in Gerrit's database:

  CREATE INDEX commit_map_idx ON commit_map(old_id);

  UPDATE patch_sets SET revision =
  (SELECT new_id FROM commit_map WHERE old_id = revision)
  WHERE change_id IN (SELECT change_id
    FROM changes
    WHERE dest_project_name = 'project');

  UPDATE patch_set_ancestors SET ancestor_revision =
  (SELECT new_id FROM commit_map WHERE old_id = ancestor_revision)
  WHERE change_id IN (SELECT change_id
    FROM changes
    WHERE dest_project_name = 'project');
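A check worth running before updates like these is that every new SHA-1 in the map really exists as a commit in the rewritten repository. The sketch below does that with git cat-file. It assumes an "old,new" line format for commit.map, and check_map is a hypothetical helper name.

```shell
#!/bin/sh
# check_map is a hypothetical helper; assumes "old,new" lines in the map file.
# Usage: check_map /path/to/rewritten.git commit.map
# Prints the number of map entries whose new id is not a commit in the repo.
check_map() {
  repo=$1
  map=$2
  missing=0
  while IFS=, read -r old new; do
    # cat-file -e exits non-zero when the object is absent or not a commit.
    if ! git --git-dir="$repo" cat-file -e "$new^{commit}" 2>/dev/null; then
      echo "missing: $old -> $new" >&2
      missing=$((missing + 1))
    fi
  done <"$map"
  echo "$missing"
}
```

Any result other than 0 suggests the map and the repository are out of step, and the SQL should not be touched until that is explained.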

And then I updated all of the refs/heads/ and refs/changes/ behind
Gerrit's back by directly replacing the entire Git repository with
the rewritten one.

I did all of the above with the Gerrit server shut down, so nobody
could modify the repository or the SQL database while I was doing it.
This turned out to be quite necessary with MySQL, for example: the
above queries basically need to lock entire tables to execute.

> Or should I forget trying to maintain all this in the same repository
> and just create a new repository in Gerrit and push the rewritten
> history into that? (And then, later, remove the existing repository
> from Gerrit as we won't be using it any more). I don't really want to
> do this but it would be a possible option.

It's not pretty to try and preserve the existing data.  It's not
exactly a case we designed Gerrit to support.  I've only done it
once myself, and it was painful, but I managed to make it work.
