My tutorial on git branches
I've been trying subtly (and in a few cases not-so-subtly) for years now to convert my colleagues to the gospel of git and GitHub. The git version control system has its quirks no doubt, but there is—in my opinion—no more powerful system for open collaboration on software and science than git and GitHub. A quote I attribute (hopefully correctly) to Software Carpentry's Greg Wilson (@gvwilson) captures my sentiments exactly.
Learning git is the tax you pay to use GitHub.
Over Christmas break my advisor finally decided it was time to bite the bullet and see what git/GitHub offered that he couldn't get with Subversion. It took him a while to warm up to the idea, but once he decided to try it, he was committed to giving it his best shot. He read up on the git/GitHub workflow and immersed himself in it for a week or more. By the time we reconnected in the new year, he was convinced this was something he wanted to adopt for himself and the rest of the lab.
Since then, we've been gradually piecing together a lab handbook with lab procedures, guidelines, and tips. My advisor wrote up a basic git/GitHub tutorial a few weeks ago, and yesterday I wrote a follow-up on using git branches in your daily workflow. The text of the tutorial is below.
The git version control system has a feature called branches that makes it safe and easy to implement bug fixes or new features, and to isolate those changes from the main development environment until they are vetted.
The concept of branches is not unique to git: users of Subversion or CVS will recognize the common organizational strategy of placing the main development environment in a
trunk/ directory, and then creating experimental branches by copying that directory to a separate
Indeed, with Subversion, branching is just that—a strategy—and merging branches back into the main trunk can be very tedious and frustrating.
With git, however, branches are a built-in feature: they are trivial to create, and can almost always be merged back into the main line of development (master) with little or no hassle.
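To make "trivial to create and merge" concrete, here is a minimal sketch that runs entirely in a throwaway directory. The `sandbox` directory, the `experiment` branch name, the file contents, and the user identity are all placeholders invented for this illustration; none of them come from the lab's actual repository.

```shell
set -e
# Work in a throwaway directory so nothing real is touched.
sandbox=$(mktemp -d)
cd "$sandbox"
git init -q .
git symbolic-ref HEAD refs/heads/master   # name the initial branch master regardless of local defaults
git config user.name "Example User"       # placeholder identity, needed to commit
git config user.email "user@example.com"

echo 'x <- 1' > analysis.R
git add analysis.R
git commit -q -m 'Initial prototype'

# Create a branch, switch to it, and commit an isolated change.
git branch experiment
git checkout -q experiment
echo 'y <- 2' >> analysis.R
git commit -q -am 'Try an experimental change'

# master is untouched until the branch is merged.
git checkout -q master
lines_before=$(wc -l < analysis.R | tr -d ' ')
git merge -q experiment
lines_after=$(wc -l < analysis.R | tr -d ' ')
echo "master had $lines_before line(s) before the merge and $lines_after after"
```

Switching back to `master` restores the file to its pre-branch state, and the merge is a simple fast-forward because nothing else changed on `master` in the meantime.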
For a quick intro to git branches, see this tutorial from Atlassian. The remainder of this HowTo presents a hypothetical vignette demonstrating how branches could and should be used as part of your daily workflow, and how they help mitigate merge conflicts when working collaboratively.
Alice and Bob are two scientists collaborating on a research project they hope to publish soon.
They have implemented a prototype R script to analyze data they have collected, and the R code is stored in a repository on GitHub called
Alice and Bob have their own forks at
Having prototyped the analysis procedure and tested it on small sample data sets, they are ready to move forward by polishing up the prototype, running the analysis on complete data sets, reporting their results, and discussing their interpretation.
Monday morning, Alice and Bob spent some time working on their manuscript, documenting the motivation for their work, describing their methodology, and putting in placeholders for results they will soon generate.
Monday afternoon, after working on the manuscript, they discussed what needed to be done next at a technical level.
- First, they need to fix a small bug in the R prototype.
They opened a new thread on the
BrendelGroup/proj issue tracker on GitHub to describe the bug, what causes the script to fail, and their plan to fix the bug.
The issue tracker designated this thread
- Second, they need to add one more step to their analysis procedure.
They described this step at a high level in the Methods section of their paper earlier that morning, but they took some time now to start a new thread on the GitHub issue tracker and describe their precise plan for implementing this additional step in R.
The issue tracker designated this thread
On Tuesday morning, Alice comes in to the office to work on the project. She decides to get to work right away fixing the bug they discussed the previous day. She logs in to her laptop, fires up a terminal, changes to the directory containing her local clone of the repository, and checks her status...
cd /home/alice/proj/
git status
...to which her terminal responds as follows.
On branch master
nothing to commit, working directory clean
She's on the main/default development branch master, and the working directory is clean, meaning no changes have been made to the code since the last commit.
Since everything looks good, Alice creates a new branch called
git branch bugfix
git checkout bugfix
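As an aside, git also offers a one-step shorthand for "create a branch and switch to it": `git checkout -b`. The sketch below tries it in a throwaway repository; the `sandbox` directory, file contents, and email address are placeholders for illustration only.

```shell
set -e
# Throwaway repository standing in for Alice's local clone.
sandbox=$(mktemp -d)
cd "$sandbox"
git init -q .
git symbolic-ref HEAD refs/heads/master
git config user.name "Alice"
git config user.email "alice@example.com"   # placeholder address
echo '# prototype' > analysis.R
git add analysis.R
git commit -q -m 'Initial prototype'

# One-step equivalent of `git branch bugfix` followed by `git checkout bugfix`.
git checkout -q -b bugfix

current=$(git rev-parse --abbrev-ref HEAD)
echo "Now on branch: $current"
```

Either spelling leaves Alice on the new branch with `master` untouched.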
She then opens up the R script and takes about 30 minutes to fix the code and test her solution.
Once she's confident it is correct, she commits a new snapshot to her local repository.
She references the corresponding GitHub thread in her commit message, using syntax that will automatically close the thread in the issue tracker once the commit is merged into the
master branch of
git add analysis.R
git commit -m 'Fixed estimation bug. Closes #14.'
At this point, the git repository on Alice's laptop has the bugfix, but the changes haven't been pushed anywhere else: her fork on GitHub (
alice/proj), the original "upstream" repository on GitHub (
BrendelGroup/proj), or Bob's fork (
Referring back to the creative triangle described in the GitHub howto, Alice proceeds by pushing her latest commit (stored on a branch called
bugfix) to her fork on GitHub.
git push origin bugfix
She now goes to GitHub and opens a pull request to merge
With the bug fix submitted, Alice now turns to the final step in their analysis procedure. The bug has nothing to do with this final step, so she starts by returning to the master branch.
git checkout master
Her directory now looks just like it did before she started working this morning. This provides a clean slate for Alice to begin working on the analysis procedure. She proceeds by creating a new branch in her local repo.
git branch laststep
git checkout laststep
She then takes an hour or so to prepare a few small data files she will use to test the R script after the new feature is implemented. Next, once she is confident the tests are correct, she opens up the R script again and implements the new procedure. With test data already in place, it is very easy for Alice to confirm when the procedure is working correctly.
Finally, after all her tests pass, Alice commits a new snapshot in her local repository. Again, she references the corresponding thread on the GitHub issue tracker.
git add analysis.R testdatafiles/
git commit -m 'Implemented final analysis step. Closes #15.'
Another pull request
Just as before, Alice's new feature is present only on her laptop. She needs to push out to her fork and open a pull request to get these changes integrated into the main repository.
First she pushes to her fork...
git push origin laststep
...and then goes to GitHub to open a pull request to merge
At this point, she has two open pull requests waiting for Bob to review and accept. Having worked for a few hours, Alice takes a break to have lunch and attend a seminar.
When Bob arrives at his office late on Tuesday morning, he suspects that the solution he discussed with Alice the day before is incorrect.
After digging through the data for a bit, he confirms his suspicion that they have made some incorrect assumptions about the input data, and that the bugfix they discussed previously isn't valid.
He logs on to GitHub to comment on thread
#14 and explain his findings.
Bob notes that Alice has already submitted a pull request to fix the bug, and adds a comment to the thread explaining why their proposed solution won't work.
Bob also notes that Alice has submitted a pull request with her implementation of the final step of the procedure.
He reviews the changes to the code on her
laststep branch, understands the changes she made, and notes that she has provided test cases for validating the procedure.
Since everything looks good, Bob accepts the pull request.
GitHub automatically closes issue
#15 at this point, since Alice's commit message included the text
Bob opens up his terminal, goes to his local clone of the repository, and pulls the latest changes from
cd /home/bob/proj
git pull upstream master
He then pushes those changes out to his fork to keep everything in sync.
git push origin master
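The pull-then-push sync can be rehearsed offline. In the sketch below, two local bare repositories stand in for the shared GitHub repository and Bob's fork; the `sandbox` paths, the `seed` scratch repository, and the identities are all hypothetical. It also adds `--ff-only` to the pull, an assumption on my part, so that the pull refuses to do anything other than a clean fast-forward.

```shell
set -e
sandbox=$(mktemp -d)

# Two bare repositories stand in for the GitHub remotes.
git init -q --bare "$sandbox/upstream.git"   # plays the shared repository
git init -q --bare "$sandbox/fork.git"       # plays Bob's fork
git --git-dir="$sandbox/upstream.git" symbolic-ref HEAD refs/heads/master
git --git-dir="$sandbox/fork.git" symbolic-ref HEAD refs/heads/master

# A scratch repository seeds both remotes with one commit.
git init -q "$sandbox/seed"
cd "$sandbox/seed"
git symbolic-ref HEAD refs/heads/master
git config user.name "Alice"
git config user.email "alice@example.com"    # placeholder address
echo '# prototype' > analysis.R
git add analysis.R
git commit -q -m 'Initial prototype'
git push -q "$sandbox/upstream.git" master
git push -q "$sandbox/fork.git" master

# Bob clones his fork (origin) and registers the shared repository (upstream).
git clone -q "$sandbox/fork.git" "$sandbox/bob"
cd "$sandbox/bob"
git remote add upstream "$sandbox/upstream.git"

# Simulate a merged pull request: a new commit lands on upstream master.
cd "$sandbox/seed"
echo '# final step' >> analysis.R
git commit -q -am 'Implemented final analysis step'
git push -q "$sandbox/upstream.git" master

# Bob's two-step sync: pull from upstream, then push to his fork.
cd "$sandbox/bob"
git pull -q --ff-only upstream master
git push -q origin master

local_head=$(git rev-parse master)
fork_head=$(git ls-remote origin refs/heads/master | cut -f1)
upstream_head=$(git ls-remote upstream refs/heads/master | cut -f1)
echo "local=$local_head fork=$fork_head upstream=$upstream_head"
```

After the two commands, all three copies of `master` point at the same commit.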
Bob's bug fix
After his morning data slogging session, Bob has a much better grasp on how to fix the bug in their R code. He has a few minutes before he has to teach a class, so he takes a stab at implementing the bug fix. Bob starts by creating a new branch in his local repo.
git branch newbugfix
git checkout newbugfix
He then edits the R code and tests that it works on a variety of inputs. Once he's confident the bug fix handles the input data correctly, he makes a new commit.
git add analysis.R
git commit -m 'Alternative approach to fixing the estimation bug. See my comments in #14. Closes #14.'
Bob then pushes this new branch to his fork on GitHub...
git push origin newbugfix
...and opens a pull request on GitHub to merge
He then rushes out the door to go to his class.
When Alice returns from her lunch and seminar, she goes to GitHub to check on the status of her pull requests.
She's happy to see that Bob has accepted her second pull request, but sees that the other is still open.
She reads Bob's comments on
#14 and after a few minutes of thinking things over herself she understands and agrees with his assessment.
She closes her unmerged pull request and deletes the
bugfix branch from her fork on GitHub.
git push origin :bugfix
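The colon syntax is terse, so it may help to know that `git push origin --delete bugfix` is a more readable spelling of the same operation. The sketch below checks this against a local bare repository that stands in for Alice's fork; the `sandbox` paths, file contents, and email address are placeholders for illustration.

```shell
set -e
sandbox=$(mktemp -d)
git init -q --bare "$sandbox/fork.git"   # stands in for Alice's fork on GitHub
git init -q "$sandbox/laptop"
cd "$sandbox/laptop"
git symbolic-ref HEAD refs/heads/master
git config user.name "Alice"
git config user.email "alice@example.com"   # placeholder address
git remote add origin "$sandbox/fork.git"

echo '# prototype' > analysis.R
git add analysis.R
git commit -q -m 'Initial prototype'
git push -q origin master

# Push a bugfix branch to the fork, as in the story.
git checkout -q -b bugfix
echo '# attempted fix' >> analysis.R
git commit -q -am 'Attempted fix'
git push -q origin bugfix

# Long-hand spelling of `git push origin :bugfix`.
git push -q origin --delete bugfix

remote_branches=$(git ls-remote --heads origin | awk '{print $2}')
echo "Branches left on the fork: $remote_branches"
```

Note that this only removes the branch from the remote; the local `bugfix` branch is still present on the laptop.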
Alice then takes a look at Bob's proposed solution, reviews the code, and, finding it solid, merges his pull request.
GitHub automatically closes issue
#14 at this point, since Bob's commit message included the text
Now Alice needs to synchronize her laptop with the new changes.
First, she switches back to the main
git checkout master
At this point, her directory looks just like it did before she started work this morning.
The changes she made before lunch are stored in dedicated branches, and she has not touched the main
master branch on her laptop.
To update her local
master branch, she pulls from
git pull upstream master
...and then pushes those updates out to her fork.
git push origin master
At this point the
master branch is up-to-date both on Alice's fork and her laptop.
She has also deleted the
bugfix branch on her fork, but the
bugfix branch is still present on her laptop, and the
laststep branch is still present both on her fork and her laptop.
She wants to delete all of these branches now, since
laststep has been merged and
bugfix is a dead end.
Alice can list the branches in her local clone, highlighting the active branch, with the command...
...and she can list the branches in her GitHub fork (along with some other information) using the following command.
git remote show origin
Alice then uses the following commands to delete the remaining extraneous branches.
git push origin :laststep   # Delete `laststep` on `alice/proj`
git branch -d laststep      # Delete `laststep` branch on laptop
git branch -D bugfix        # Force deletion of the unmerged `bugfix` branch on laptop
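The difference between `-d` and `-D` is worth seeing once. The sketch below rebuilds a simplified version of Alice's situation in a throwaway repository (a merged `laststep` and an unmerged `bugfix`), lists the local branches with plain `git branch`, and shows that the lowercase form refuses to delete unmerged work. The `sandbox` directory, file contents, and email address are hypothetical, and the merge is done locally rather than through a pull request.

```shell
set -e
sandbox=$(mktemp -d)
cd "$sandbox"
git init -q .
git symbolic-ref HEAD refs/heads/master
git config user.name "Alice"
git config user.email "alice@example.com"   # placeholder address
echo '# prototype' > analysis.R
git add analysis.R
git commit -q -m 'Initial prototype'

# Two topic branches, each starting from master.
git checkout -q -b laststep
echo '# final step' >> analysis.R
git commit -q -am 'Implemented final analysis step'
git checkout -q master
git checkout -q -b bugfix
echo '# attempted fix' >> analysis.R
git commit -q -am 'Attempted fix'
git checkout -q master

# Only laststep gets merged into master.
git merge -q laststep

git branch                 # lists local branches; the active one is marked with *

git branch -q -d laststep  # succeeds: laststep is fully merged
if git branch -d bugfix 2>/dev/null; then
    safe_delete=allowed
else
    safe_delete=refused    # -d protects unmerged commits
fi
git branch -q -D bugfix    # -D discards the branch regardless

remaining=$(git for-each-ref --format='%(refname:short)' refs/heads)
echo "safe delete of bugfix was $safe_delete; remaining branches: $remaining"
```

The refusal is a safety net: `-d` only deletes branches whose commits are reachable elsewhere, so reaching for `-D` should be a deliberate decision to throw work away.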
This story could have turned out very differently had Alice not used branches in her workflow.
She would have committed her incorrect implementation of the bug fix to her
master branch, followed by her correct implementation of the final step of their analysis procedure to the same
Had she then submitted a pull request to merge
BrendelGroup/proj:master, Bob would not have been able to merge it because of the erroneous bug fix.
There would be no easy way for her to keep the latest commit while discarding an earlier commit, or to integrate Bob's new solution without merge conflicts.
She likely would have had to rewind two commits in her history to get rid of her bug fix and re-implement the final step to their analysis procedure, and then use the
--force option to overwrite the commit history she had previously pushed to her fork.
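For comparison, here is a rough sketch of what that recovery might look like, rehearsed in a throwaway repository with a local bare repository standing in for Alice's fork. The `sandbox` paths, commit messages, and file contents are invented for illustration, and this is only one way the cleanup could go.

```shell
set -e
sandbox=$(mktemp -d)
git init -q --bare "$sandbox/fork.git"   # stands in for Alice's fork
git init -q "$sandbox/laptop"
cd "$sandbox/laptop"
git symbolic-ref HEAD refs/heads/master
git config user.name "Alice"
git config user.email "alice@example.com"   # placeholder address
git remote add origin "$sandbox/fork.git"

# Everything is committed straight to master: prototype, flawed fix, final step.
echo '# prototype' > analysis.R
git add analysis.R
git commit -q -m 'Initial prototype'
echo '# flawed bug fix' >> analysis.R
git commit -q -am 'Flawed bug fix'
echo '# final step' >> analysis.R
git commit -q -am 'Final analysis step'
git push -q origin master

# Rewind two commits, losing the good work along with the bad...
git reset -q --hard HEAD~2
# ...and redo the final step by hand.
echo '# final step' >> analysis.R
git commit -q -am 'Final analysis step (redone)'

# An ordinary push is rejected because the local history no longer
# extends what was previously pushed.
if git push -q origin master 2>/dev/null; then
    plain_push=accepted
else
    plain_push=rejected
fi

# Overwriting the published history requires --force.
git push -q --force origin master
commits=$(git rev-list --count master)
echo "plain push was $plain_push; master now has $commits commits"
```

Compare this with the branch-based workflow, where discarding the flawed fix took a single branch deletion and never touched `master`.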
By using branches, Alice cleanly isolated each set of changes from the main development branch, and was able to discard one set without affecting the other. Using branches with git and GitHub is an excellent mechanism to keep your work organized into small, manageable chunks that can be developed, evaluated, and merged independently.