I've been trying subtly (and in a few cases not-so-subtly) for years now to convert my colleagues to the gospel of git and GitHub. The git version control system has its quirks no doubt, but there is—in my opinion—no more powerful system for open collaboration on software and science than git and GitHub. A quote I attribute (hopefully correctly) to Software Carpentry's Greg Wilson (@gvwilson) captures my sentiments exactly.

Learning git is the tax you pay to use GitHub.

Over Christmas break my advisor finally decided it was time to bite the bullet and see what git/GitHub offered that he couldn't get with Subversion. It took him a while to warm up to the idea, but once he decided to try it, he was committed to giving it his best shot. He read up on the git/GitHub workflow and immersed himself in it for a week or more. By the time we reconnected in the new year, he was convinced this was something he wanted to adopt for himself and the rest of the lab.

Since then, we've been gradually piecing together a lab handbook with lab procedures, guidelines, and tips. My advisor wrote up a basic git/GitHub tutorial a few weeks ago, and yesterday I wrote a follow-up on using git branches in your daily workflow. The text of the tutorial is below.


01.1-Howto-github-branches

The git version control system has a feature called branches that make it safe and easy to implement bug fixes or new features, and to isolate those changes from the main development environment until they are vetted. The concept of branches is not unique to git: users of Subversion or CVS will recognize the common organizational strategy of placing the main development environment in a trunk/ directory, and then creating experimental branches by copying that directory to a separate branches/ directory. Indeed, with Subversion, branching is just that—a strategy—and merging branches back into the main trunk can be very tedious and frustrating. With git, however, branches are a built-in feature: they are trivial to create, and can almost always be merged back into the main line of development (master) with little or no hassle.

For a quick intro to git branches, see this tutorial from Atlassian. The remainder of this HowTo presents a hypothetical vignette demonstrating how branches could and should be used as part of your daily workflow, and how they help mitigate merge conflicts when working collaboratively.

Setting

Alice and Bob are two scientists collaborating on a research project they hope to publish soon. They have implemented a prototype R script to analyze data they have collected, and the R code is stored in a repository on GitHub called BrendelGroup/proj. Alice and Bob have their own forks at alice/proj and bob/proj, respectively. Having prototyped the analysis procedure and tested it on small sample data sets, they are ready to move forward by polishing up the prototype, running the analysis on complete data sets, reporting their results, and discussing their interpretation.

Monday morning, Alice and Bob spent some time working on their manuscript, documenting the motivation for their work, describing their methodology, and putting in placeholders for results they will soon generate. Monday afternoon, after working on the manuscript, they discussed what needed to be done next at a technical level. - First, they need to fix a small bug in the R prototype. They opened a new thread on the BrendelGroup/proj issue tracker on GitHub to describe the bug, what causes the script to fail, and their plan to fix the bug. The issue tracker designated this thread #14. - Second, they need to add one more step to their analysis procedure. They described this step at (a high level) in the Methods section of their paper earlier that morning, but they took some time now to start a new thread on the GitHub issue tracker and describe their precise plan for implementating this additional step in R. The issue tracker designated this thread #15.

First branch

On Tuesday morning, Alice comes in to the office to work on the project. She decides to get to work right away fixing the bug they discussed the previous day. She logs in to her laptop, fires up a terminal, changes to the directory containing her local clone of the repository, and checks her status...

cd /home/alice/proj/
git status

...to which her terminal responds as follows.

On branch master
nothing to commit, working directory clean

She's on the main/default development branch master, and the working directory is clean, meaning no changes have been made to the code since the last commit.

Since everything looks good, Alice creates a new branch called bugfix.

git branch bugfix
git checkout bugfix

She then opens up the R script and takes about 30 minutes to fix the code and test her solution. Once she's confident it is correct, she commits a new snapshot to her local repository. She references the corresponding GitHub thread in her commit message, using syntax that will automatically close the thread in the issue tracker once the commit is merged into the master branch of BrendelGroup/proj.

git add analysis.R
git commit -m 'Fixed estimation bug. Closes #14.'

Pull request

At this point, the git repository on Alice's laptop has the bugfix, but the changes haven't been pushed anywhere else: her fork on GitHub (alice/proj), the original "upstream" repository on GitHub (BrendelGroup/proj), or Bob's fork (bob/proj). Referring back to the creative triangle described in the GitHub howto, Alice proceeds by pushing her latest commit (stored on a branch called bugfix) to her fork on GitHub.

git push origin bugfix

She now goes to GitHub and opens a pull request to merge alice/proj:bugfix into BrendelGroup/proj:master.

Second branch

With the bug fix resolved, Alice now turns to the final step in their analysis procedure. The bug has nothing to do with this final step, so she starts by returning to the master branch.

git checkout master

Her directory now looks just like it did before she started working this morning. This provides a clean slate for Alice to begin working on the analysis procedure. She proceeds by creating a new branch in her local repo.

git branch laststep
git checkout laststep

She then takes an hour or so to prepare a few small data files she will use to test the R script after the new feature is implemented. Next, once she is confident the tests are correct, she opens up the R script again and implements the new procedure. With test data already in place, it is very easy for Alice to confirm when the procedure is working correctly.

Finally, after all her tests pass, Alice commits a new snapshot in her local repository. Again, she references the corresponding thread on the GitHub issue tracker.

git add analysis.R testdatafiles/
git commit -m 'Implemented final analysis step. Closes #15.'

Another pull request

Just as before, Alice's new feature is present only on her laptop. She needs to push out to her fork and open a pull request to get these changes integrated into the main repository.

First she pushes to her fork...

git push origin laststep

...and then goes to GitHub to open a pull request to merge alice/proj:laststep into BrendelGroup/proj:master.

At this point, she has two open pull requests waiting for Bob to review and accept. Having worked for a few hours at this point, Alice takes a break to have lunch and attend a seminar.

Bob's review

When Bob arrives at his office late on Tuesday morning, he's suspicious that the solution he discussed with Alice the day before is incorrect. After digging through the data for a bit, he confirms his suspicion that they have made some incorrect assumptions about the input data, and that the bugfix they discussed previously isn't valid. He logs on to GitHub to comment on thread #14 and explain his findings. Bob notes that Alice has already submitted a pull request to fix the bug, and adds a comment to the thread explaining why their proposed solution won't work.

Bob also notes that Alice has submitted a pull request with her implementation of the final step of the procedure. He reviews the changes to the code on her laststep branch, understands the changes she made, and notes that she has provided test cases for validating the procedure. Since everything looks good, Bob accepts the pull request. GitHub automatically closes issue #15 at this point, since Alice's commit message included the text Closes #15. Bob opens up his terminal, goes to his local clone of the repository, and pulls the latest changes from BrendelGroup/proj.

cd /home/bob/proj
git pull upstream master

He then pushes those changes out to his fork to keep everything in sync.

git push origin master

Bob's bug fix

After his morning data slogging session, Bob has a much better grasp on how to fix the bug in their R code. He has a few minutes before he has to teach a class, so he takes a stab at implementing the bug fix. Bob starts by creating a new branch in his local repo.

git branch newbugfix
git checkout newbugfix

He then edits the R code and tests that it works on a variety of inputs. Once he's confident the bug fix handles the input data correctly, he makes a new commit.

git add analysis.R
git commit -m 'Alternative approach to fixing the estimation bug. See my comments in #14. Closes #14.'

Bob then pushes this new branch to his fork on GitHub...

git push origin newbugfix

...and opens a pull request on GitHub to merge bob/proj:newbugfix into BrendelGroup/proj:master. He then rushes out the door to go to his class.

Alice's review

When Alice returns from her lunch and seminar, she goes to GitHub to check on the status of her pull requests. She's happy to see that Bob has accepted her second pull request, but sees that the other is still open. She reads Bob's comments on #14 and after a few minutes of thinking things over herself she understands and agrees with his assessment. She closes her unmerged pull request and deletes the bugfix branch from her fork on GitHub.

git push origin :bugfix

Alice then takes a look at Bob's proposed solution, reviews the code, and finding it solid she merges his pull request. GitHub automatically closes issue #14 at this point, since Bob's commit message included the text Closes #14.

Now Alice needs to syncronize her laptop with the new changes. First, she switches back to the main master branch.

git checkout master

At this point, her directory looks just like it did before she started work this morning. The changes she made before lunch are stored in dedicated branches, and she has not touched the main master branch on her laptop. To update her local master branch, she pulls from BrendelGroup/proj...

git pull upstream master

...and then pushes those updates out to her fork.

git push origin master

Cleanup

At this point the master branch is up-to-date both on Alice's fork and her laptop. She has also deleted the bugfix branch on her fork, but the bugfix branch is still present on her laptop, and the laststep branch is still present both on her fork and her laptop. She wants to delete all of these branches now, since laststep has been merged and bugfix is a dead end.

Alice can list the branches in her local clone, highlighting the active branch, with the command...

git branch

...and she can list the branches in her GitHub fork (along with some other information) using the following command.

git remote show origin

Alice then uses the following commands to delete the remaining extraneous branches.

git push origin :laststep # Delete `laststep` on `alice/proj`
git branch -d laststep    # Delete `laststep` branch on laptop
git branch -D bugfix      # Force deletion of the unmerged `bugfix` branch on laptop

Epilogue

This story could have turned out very differently had Alice not used branches in her workflow. She would have committed her incorrect implementation of the bug fix to her master branch, followed by her correct implementation of the final step of their analysis procedure to the same master branch. Had she then submitted a pull request to merge alice/proj:master into BrendelGroup/proj:master, Bob would not have been able to merge it because of the erroneous bug fix. There would be no easy way for her to keep the latest commit while discarding an earlier commit, or to integrate Bob's new solution without merge conflicts. She likely would have had to rewind two commits in her history to get rid of her bug fix and re-implement the final step to their analysis procedure, and then use the --force option to overwrite the commit history she had previously pushed to her fork.

By using branches, Alice cleanly isolated each set of changes from the main development branch, and was able to discard one set of changes without affecting the other set of changes. Using branches with git and GitHub is an excellent mechanism to keep your work organized into small manageable chunks that can be developed, evaluated, and merged independently.

Comments