Effective Merging Using Git
Git seems to be taking over the world as far as version control systems are concerned. Ten years ago, it was Subversion that had captured our interest (the red line below). Now Git (the blue line) is firmly in control.
Still, there is a lot to learn with Git. Many of us came to Git after spending a lot of time with Subversion or some other competitor, and the way we use Git now reflects our experience with and understanding of those other tools. We learned just enough to survive and swiftly went back to our day jobs. But Git is very, very different from any other version control system you may have used. If you understand Git, it's your best friend. If you don't, you can use it in a way that creates unnecessary risk for you and your team. Surprisingly, production issues could result from using Git in a way it wasn't meant to be used.
As we'll see, Git merges carry inherent risk beyond developers making a mess of merge conflict resolution. That risk increases dramatically the longer we wait to integrate our code. Large merges are much riskier than small ones, and yet many development teams do large merges as a way of life. Your branching model may be hurting quality in production, and you just don't know it yet. Let's first take a look at how Git merges work, then we can determine how best to use them. In the examples below, we'll use SourceTree to help us visualize what's happening.
The Man Behind the Curtain
You can think of Git merges like shuffling a deck of cards. Git takes two separate card decks and weaves them together into a single deck, tossing extras when duplicates are found.
Consider the example below:
This is a simple Groovy class file that does nothing but make println() calls. Now two developers, Jim and Fred, have each modified this file at the same time, which will require a merge to resolve. Each has created a branch, which is just a name with which a series of commits can be associated. The branch (i.e. the name) is directly associated with the most recent commit, so when a new commit is made the branch name moves to the new one. Each commit points back to its predecessor(s)--commits never point forward.
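To make the branch-as-a-name idea concrete, here is a minimal Python sketch of that model. It is a toy, not Git's actual object store: the `Commit` and `Repo` classes, the commit ids, and the branch names are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    """A commit knows only its parents; it has no link to its children."""
    sha: str
    parents: tuple = ()


class Repo:
    """Toy repository: a branch is nothing more than a name mapped to a commit."""

    def __init__(self):
        self.branches = {}  # branch name -> Commit at the tip

    def commit(self, branch, sha):
        """Create a commit on top of the branch tip and move the name to it."""
        tip = self.branches.get(branch)
        new = Commit(sha, (tip,) if tip else ())
        self.branches[branch] = new
        return new

    def create_branch(self, name, start):
        """A new branch is just a second name for an existing commit."""
        self.branches[name] = self.branches[start]

    def history(self, branch):
        """Walk backwards from the tip, following the first parent each time."""
        shas = []
        node = self.branches[branch]
        while node is not None:
            shas.append(node.sha)
            node = node.parents[0] if node.parents else None
        return shas


repo = Repo()
repo.commit("master", "a1")
repo.create_branch("jim", "master")
repo.create_branch("fred", "master")
repo.commit("jim", "j1")    # only the name "jim" moves
repo.commit("fred", "f1")   # only the name "fred" moves

print(repo.history("jim"))     # ['j1', 'a1']
print(repo.history("fred"))    # ['f1', 'a1']
print(repo.history("master"))  # ['a1']
```

Because each commit only points backwards, the shared commit `a1` has no record that two branches have grown from it. The two histories meet again only when someone merges them.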
Here's Jim's commit:
Notice that Jim edited line 21 to add a comment and Git sees it as one contiguous change which includes a removal and an addition.
And here's Fred's commit:
What's important is that there are several distinct changes here, some that overlap and some that are only one line apart. To get a preview of the merge, we can highlight both commits in SourceTree, click the source file, and view the resulting diff.
With both commits selected, we see the diff below:
Git identifies discrete changes, each being a contiguous block of lines that have changed. Git will attempt to weave together the changed blocks with the unchanged blocks. If two changed blocks do not have at least one unchanged line in between them, Git considers this a merge conflict.
In this example, we can already see where the merge conflicts will be:
- line 5
- line 21
Because the red and green changes are "touching" in the diff, Git will not know which one to put on top. So, it flags these changes as merge conflicts and asks the developer to figure it out. Since the other changes are separated by common lines of code, there is a clear order for Git to follow. From Git's perspective, it's all about how to weave things together and make sure it's done in the correct order. If Git's not sure, it asks you to resolve it yourself.
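The "touching" rule can be sketched in a few lines of Python. This is a simplified model of the behavior described above, not Git's real merge algorithm. Each changed block is an inclusive range of line numbers in the common ancestor, and the ranges used here are hypothetical rather than read off the screenshots.

```python
def touches(a, b):
    """True when two changed blocks overlap or sit directly next to each other,
    i.e. there is no unchanged line separating them."""
    return a[0] <= b[1] + 1 and b[0] <= a[1] + 1


def find_conflicts(ours, theirs):
    """Return every pair of blocks (one from each side) that would conflict."""
    return [(a, b) for a in ours for b in theirs if touches(a, b)]


# Hypothetical changed blocks as (first_line, last_line) in the common ancestor.
jim = [(5, 5), (21, 21)]
fred = [(5, 6), (12, 12), (20, 20)]

conflicts = find_conflicts(jim, fred)
print(conflicts)  # [((5, 5), (5, 6)), ((21, 21), (20, 20))]
```

Fred's block at line 12 has unchanged lines on both sides, so it weaves in without any help. The other two pairs overlap or sit side by side, which is exactly where a developer would be asked to step in.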
Here's the result from the actual merge:
As you can see, the merge conflicts showed up where they were expected. Lines 29 and 30 are considered one contiguous change and so are included in the merge conflict together. Because the other changes do not overlap, Git is able to determine the correct order and does the rest for you.
The Shark in the Water
Git is definitely being helpful by resolving conflicts for us. Developers love this feature! Many of us lean hard on Git's auto-merging capability as if Git is a futuristic android with mad coding skills who takes care of our light work so we don't have to sweat it. And yet, Git knows nothing about context. It does no semantic parsing and has no way of determining if the changes from two merging source files actually belong together or are mutually exclusive. Consider the example below:
This is a very simple Groovy source file with a multiply(int, int) method and a main method that calls multiply(). This is actually the code that's at the head of the master branch below.
Fred has added a new call to the multiply() method:
Unbeknownst to Fred, Jim decided to retire the multiply method, removing it and the method calls that depend on it.
The end result of the merge (shown below) includes a call to the multiply() method, but now that method no longer exists.
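The same failure can be reproduced with a small, runnable sketch. Two things here are assumptions: the example is transposed from Groovy to Python so the merged result can actually be executed, and `merge` is a deliberately naive line-based stand-in for Git's merge that applies the "at least one unchanged line in between" rule and nothing more.

```python
BASE = [
    "def multiply(a, b):",                       # 0
    "    return a * b",                          # 1
    "",                                          # 2
    "def main():",                               # 3
    "    results = []",                          # 4
    "    results.append(multiply(2, 3))",        # 5
    "    results.append('unrelated work')",      # 6
    "    return results",                        # 7
]

# Each side's edits: base line index -> replacement lines ([] means "deleted").
jim = {0: [], 1: [], 5: []}  # retires multiply() and its one caller
fred = {7: ["    results.append(multiply(4, 5))", "    return results"]}  # new call


def merge(base, ours, theirs):
    """Naive textual merge: conflict only if edits overlap or are adjacent."""
    for i in ours:
        for j in theirs:
            if abs(i - j) <= 1:
                raise ValueError(f"merge conflict near base lines {i} and {j}")
    merged = []
    for index, line in enumerate(base):
        if index in ours:
            merged.extend(ours[index])
        elif index in theirs:
            merged.extend(theirs[index])
        else:
            merged.append(line)
    return merged


merged = merge(BASE, jim, fred)  # no conflict is raised: the text weaves cleanly

namespace = {}
exec("\n".join(merged), namespace)  # the merged file even loads without error
try:
    namespace["main"]()
    outcome = "ran fine"
except NameError as err:
    outcome = f"broken: {err}"

print(outcome)  # broken: name 'multiply' is not defined
```

The merge reports no conflict because the two sets of edits are separated by an unchanged line. Nothing in a purely textual merge checks that the method Fred is calling still exists.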
While most of the time Git does a fine job resolving conflicts for you, the opportunity is there for mistakes to occur. If you couple that with instances of developers mishandling merge conflicts, Git merges definitely have risk associated with them. So how do we mitigate that risk?
The Takeaway
Many development teams have adopted feature branches as an integral part of their branching model. Feature branches are a first-class citizen of the popular "GitFlow", and many organizations have created custom branching models that lean heavily on feature branches. Feature branches allow individual developers to work in isolation until their feature is complete, so they aren't affected by someone else's changes. Some organizations go so far as to use long-running branches to isolate entire teams when multiple teams are working in the same code base. Others use them to isolate multiple releases that are being built concurrently. Eventually, all of these things built in isolation must be merged together to get them to production, and that's where the risk appears. When two branches are merged and each carries a large amount of change, it's nearly impossible to know for sure that, between Git and our manual conflict resolution, every situation was handled correctly. Again, Git knows nothing about context, purpose, or semantics. It only thinks about order.
Developers are taught that long-running feature branches are fine as long as you periodically merge new changes from the target branch (the destination for the upcoming merge) into your feature branch. This is supposed to keep you in line with everyone else. Unfortunately, this is a mirage. If 5 developers on a team have all been working on separate feature branches for some amount of time, it doesn't matter how often they merge the target branch into their respective branches. They're still not integrated with each other. A significant amount of change has occurred that is not visible to anyone because people aren't ready to merge yet. What if it's not 5 developers but 25 or 75, all working in the same code base? These merges are typically performed near the end of the sprint, and it's very time-consuming to verify everything was handled correctly. Delayed integration always creates unnecessary risk, and often puts it squarely where you least want it--when you're wrapping up a sprint or a release.
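One rough way to see how quickly this grows is to count the pairs of feature branches that have never been combined with each other. This is only a back-of-the-envelope model: it assumes one unmerged branch per developer and treats every pair as equally likely to interact. The function name is made up for illustration.

```python
from math import comb


def unintegrated_pairs(developers):
    """Number of distinct branch pairs that have never been merged together,
    assuming one unmerged feature branch per developer."""
    return comb(developers, 2)


for team_size in (5, 25, 75):
    print(team_size, unintegrated_pairs(team_size))
# 5 10
# 25 300
# 75 2775
```

Merging the target branch into your own branch does nothing to shrink these numbers, because the other branches' changes are not in the target branch yet.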
Now let's consider trunk-based development, which asks developers to push small, well-tested commits daily, if not several times a day, into a common trunk branch typically named "master". Maintenance branches are created as releases go out, but all new development goes directly into the master branch.
Large features are broken down into small bite-sized chunks, and developers use feature toggles to hide their changes until it's time to go live. This is real continuous integration, which has several important implications for us:
- There are no feature branches, so each developer is building today's code on top of everyone's code from yesterday.
- No merging of branches--just commit and push changes to trunk and maintenance branches.
- Bugs destined for a maintenance branch are always fixed in the trunk first and then cherry-picked into the maintenance branch (to avoid regression issues).
- The opportunity for merge conflicts is drastically reduced. Less code has changed so we have little opportunity for conflict.
- Frequent commits into the trunk force developers to consider quality throughout the construction cycle rather than saving that until the end (small change, test, push; small change, test, push).
- Conflicts are caught early during construction, rather than late at merge time.
- Continuous integration creates a real window into the current state of the code, the release, etc. Nothing is hiding in the shadows.
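The feature toggles mentioned above can be as simple as a guarded code path. The sketch below is one minimal way to do it, assuming toggles are read from environment variables. The `FEATURE_` prefix, the `bulk_discount` flag, and the checkout function are all hypothetical.

```python
import os


def is_enabled(flag, env=None):
    """Look up a toggle such as FEATURE_BULK_DISCOUNT; anything unset is off."""
    env = os.environ if env is None else env
    value = env.get(f"FEATURE_{flag.upper()}", "off")
    return value.lower() in {"1", "on", "true"}


def checkout_total(prices_in_cents, env=None):
    """Existing behavior, plus an unfinished feature hidden behind a toggle."""
    total = sum(prices_in_cents)
    if is_enabled("bulk_discount", env):
        # New code is already in trunk, but stays dark until the toggle is on.
        if len(prices_in_cents) >= 3:
            total -= total // 10  # 10% off orders of three or more items
    return total


print(checkout_total([1000, 2000, 3000], env={}))                               # 6000
print(checkout_total([1000, 2000, 3000], env={"FEATURE_BULK_DISCOUNT": "on"}))  # 5400
```

The unfinished discount ships with every push to trunk and is integrated with everyone else's work daily, yet users never see it until the toggle is flipped.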
Delayed integration can also force you to stabilize the same code multiple times. For instance, some teams test and stabilize features in the feature branch so they can be verified in isolation. Once the merge does occur, it's very possible that the feature has been destabilized, and now you have to go through that process all over again. Or you might assume it's stable because it was working in the feature branch and simply let it go out the door that way. This could all be avoided if we just integrated early.
Git has many fantastic features. Its merging capability is head and shoulders above what is provided by Git's competitors. Those of us who've used it have all seen Git merge changes successfully without our help, which can lull us into a false sense of security. The problem isn't Git at all. It's how we use it. The wisest thing we can do is put real effort into understanding what the tool is and what it's not. Once we've done that, we can use it the way it was intended and stop hurting ourselves with it.