An Incremental Approach to Content Management Using Git
One of the many challenges in building or refreshing a website is the selection of a Content Management System (CMS). Despite our best efforts the CMS can often be a source of difficulty in a project, but there are alternatives. Read about the approach we took on www.thoughtworks.com to develop content management functionality in an incremental fashion.
When it comes to content management, organisations will often select a CMS product early in the life cycle of a website development. This is frequently the source of pain later on in the project; frameworks usually enforce their view of the world upon their developers, and trying to choose correctly at the point where you understand the least about the project is nigh-on impossible.
Apart from wanting to avoid a big and difficult decision at the outset, there were a number of reasons why we decided not to use a CMS on www.thoughtworks.com from the get-go:
- We needed to start quickly without having to learn and configure a framework.
- We wanted to use all of the practices which we ourselves espouse (TDD, continuous integration, continuous delivery). Many content management systems have poor support for these practices.
- We wanted a service-based architecture, so that we could evolve the system, experiment with different technologies and replace parts of the whole without a complete rewrite.
- We wanted flexibility in terms of the languages and tools we used.
- We are, after all, software developers!
Without a framework to install and configure we were able to start building the website quickly. The first pages were working within the first week or two, and within eight weeks we had a functioning website. The website you see today represents more than a year of additional iterative development, with the team delivering value every week.
#1 Content as static files
In order to get up and running quickly we initially built a largely static site, using templates to encapsulate and share presentation elements, and keeping the page data in separate content files stored as JSON [https://en.wikipedia.org/wiki/Json] and Slim [http://slim-lang.com/] (a HAML [http://haml.info/]-like mark-up and templating system for Ruby). With a CD pipeline which allowed us to push changes live easily, we were able to maintain and publish content without absorbing much developer time, but there were still requirements we needed to meet:
- Content needed to be released outside of the application release cycle.
- Contributors wanted to be able to change content without needing intervention from the development team (and the development team did not want to be involved in every content change).
Some form of content management was required.
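To make the starting point concrete, here is a minimal sketch of the static approach described above. The file names, fields and template contents are hypothetical, not our actual layout:

```ruby
require 'json'
require 'slim'

# A hypothetical page: data in a JSON content file, presentation in a Slim template.
# content/event.json   => {"title": "XConf", "summary": "A day of talks"}
# templates/event.slim => h1 = title
#                         p  = summary
data = JSON.parse(File.read('content/event.json'))
html = Slim::Template.new('templates/event.slim')
                     .render(Object.new, title: data['title'], summary: data['summary'])
puts html
```

Because the data and the presentation are separate files, either can change independently; what we lacked was a way for non-developers to change the data files.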
#2 Not all content is equal
One thing we observed about our content was that different kinds of content change at very different rates – for example, event and news information is constantly in motion, and each event has a limited shelf-life. Information on some other pages (for example software-testing [https://www.thoughtworks.com/software-testing]) is much longer-lived and slower moving.
This allowed us to prioritise work on making content editable, with slower moving items remaining as static text. It was clear that we didn't have to replicate an entire CMS, or even make all of the content on the website editable – we just needed to make sure that the fast moving items could be changed.
#3 Introducing the content service
Our first step was to move a single fast moving item (news) to a new content service, initially still as static files. Moving just a single kind of data allowed us to stand up a working service relatively quickly, and then we added other kinds of content gradually. Even this simple change added value; we could now change news items by releasing just the content service, rather than the entire web application.
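A first cut of such a service can be very small – something like the sketch below, which assumes Sinatra (the article doesn't name a framework) and an illustrative route and folder layout:

```ruby
require 'sinatra'

# Hypothetical sketch of the first content service: news items are still
# plain JSON files on disk, but served from a separately deployable service.
get '/news/:id' do |id|
  path = File.join('content', 'news', "#{id}.json")
  halt 404, 'not found' unless File.exist?(path)
  content_type :json
  File.read(path)
end
```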
#4 Moving to GitHub [github.com]
The next stage was to allow the content service to accept updates to the information it was managing. At this point we needed to consider the kind of data store we would use going forward, and how we would ensure that the data was consistent between different load-balanced instances of the content service. Typically this would be the point at which we might move to a database (in fact the data store is one of the decisions usually made right at the start – and indeed we had GitHub in mind from the early stages of the project).
The decision to use git and GitHub was based on a number of factors:
- git provides a simple migration from a file-based data store; the files can just be checked-in to a repository.
- git is widely used and understood.
- git provides version control and workflow, which can be used to support content management.
- GitHub provides a safe cloud-based store for data.
- Simple editing and publishing can be achieved just by committing.
In order to move the content data files a number of pieces of functionality were required:
- A new content repository was created and all content was committed to the repository.
- A local repository was created on each instance of the content service as part of the deployment process.
- Files were served from the local repository folders – no change from serving them from the file system, as before.
- The local repositories in each instance of the content service had to be updated when changes were made to the master copy (in GitHub). This is straightforward, as GitHub provides a facility called Webhooks [https://help.github.com/articles/post-receive-hooks], where an HTTP POST is automatically sent to a list of URLs whenever the repository changes (see the sketch after this list).
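A minimal webhook receiver might look something like this sketch (the endpoint path and the local repository location are assumptions):

```ruby
require 'sinatra'

LOCAL_REPO = '/var/content-repo' # assumed location of this instance's clone

# Hypothetical sketch: GitHub POSTs to this URL whenever the content
# repository changes, and the instance pulls the latest content.
post '/webhooks/content-updated' do
  halt 500, 'pull failed' unless system('git', '-C', LOCAL_REPO, 'pull', '--ff-only')
  status 200
end
```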
We made the decision to use a local repository for two reasons:
- Serving content directly from GitHub proved too slow to be practical
- A local repository gives a robust solution – even if there were problems with GitHub, our application could continue to serve content to the web.
At this point we had completely disconnected the content publishing cycle from the application (and service) releases. Content changes (for content served through the content service) could be made simply by pushing to the content GitHub repository.
#5 Allowing content updates through the content service
So far we were only supporting reading GitHub content from the content service. For a fully functional service we needed to be able to support the creation and updating of content. Allowing updates to the local repository in a load-balanced environment and then pushing to git would introduce the possibility of clashes between different content service instances. Instead we elected to push all changes directly to the GitHub API, and then let these changes propagate back to all the content service instances through the webhook mechanism.
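In outline, the write path might look like the sketch below: a save request goes straight to GitHub, and each instance's local copy catches up when the webhook fires. The endpoint shape is illustrative, and create_file is the helper shown in the next section:

```ruby
require 'sinatra'

# Hypothetical sketch: writes never touch the local repository. They go
# straight to GitHub via the API, then propagate back to every instance
# through the webhook mechanism described above.
put '/content/*' do |path|
  create_file(request.body.read, path, "Update #{path}") # see the next section
  status 202 # accepted – local copies catch up when the webhook fires
end
```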
#6 The GitHub API [https://developer.github.com/v3/]
The GitHub API provides access to most of the features of git via a RESTful [https://en.wikipedia.org/wiki/Representational_State_Transfer] interface. You can manipulate the data held in GitHub by referencing the underlying building blocks of git (Blobs, Commits and Trees) or through a high-level interface by referring to files. We started by using the higher-level interface, thus:
```ruby
require 'base64'

def create_file(content, path, message)
  base64_content = Base64.strict_encode64(content)
  @octokit_client.create_contents("my-git-account/my-repo", path, message, base64_content)
end

create_file("Contents to store", "my/path/to/content.txt", "Commit msg")
```
This example illustrates creating a file in GitHub using the Octokit Ruby gem [https://github.com/octokit/octokit.rb] to take care of the communication directly with GitHub. The file contents must be Base64 encoded, and then it's a single method call telling GitHub where to store the file. The API creates a new commit with this file added to the repository.
We found that the simple interface proved unreliable in practice – the GitHub API had a nasty habit of not always updating the head ref, leaving the repository in an inconsistent state. The API is still under development, and we have found that many of the earlier wrinkles have been ironed out, so this may not be a problem now.
We moved to the lower-level interface, which we found more reliable but which, as you can see from the following equivalent code snippet, involves more work:
```ruby
def create_content(content, path, message)
  content_reference = @octokit_client.create_blob(content)
  commit_reference = create_commit_reference(message, content_reference, path)
  @octokit_client.update_head_ref_to(commit_reference)
end

def create_commit_reference(commit_message, content_reference, path)
  head_reference = @octokit_client.get_head_reference
  base_tree_reference = @octokit_client.get_tree(head_reference)
  tree_reference = @octokit_client.create_tree(base_tree_reference, content_reference, path)
  @octokit_client.create_commit(head_reference, tree_reference, commit_message)
end

create_content("Contents to store", "my/path/to/content.txt", "Commit msg")
```
At the lower level, you need to:
1. Create your new content blob (file contents)
2. Get the current head reference and its associated tree
3. Create a new tree based on the current tree, but with the new blob added with the correct path
4. Create a new commit referencing the new tree
5. Move the head reference to point to the new commit
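For comparison, here is a sketch of those five steps written directly against the Octokit Git Data methods; the repository name, branch and token handling are assumptions:

```ruby
require 'octokit'

client = Octokit::Client.new(access_token: ENV['GITHUB_TOKEN'])
repo   = 'my-git-account/my-repo' # hypothetical repository

# 1. Create a blob holding the new file contents
blob_sha = client.create_blob(repo, 'Contents to store')

# 2. Get the current head commit and its associated tree
head_sha      = client.ref(repo, 'heads/master').object.sha
base_tree_sha = client.commit(repo, head_sha).commit.tree.sha

# 3. Create a new tree based on the current one, with the blob at the right path
tree = client.create_tree(
  repo,
  [{ path: 'my/path/to/content.txt', mode: '100644', type: 'blob', sha: blob_sha }],
  base_tree: base_tree_sha
)

# 4. Create a new commit referencing the new tree, parented on the old head
commit = client.create_commit(repo, 'Commit msg', tree.sha, head_sha)

# 5. Move the head reference to point at the new commit
client.update_ref(repo, 'heads/master', commit.sha)
```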
This is actually what git does when you make a commit, but thankfully most of the time you don't need to think about the lower-level operations. If you are interested in reading more, Git from the Bottom Up [http://ftp.newartisans.com/pub/git.from.bottom.up.pdf] is a comprehensive walk-through of how git actually works under the covers.
With content updates being passed through to the GitHub API, we now had a fully functioning content service.
Read Part 2, where I go over the details of how we implemented the high-level content management operations (save and publish) using git.
Disclaimer: The statements and opinions expressed in this article are those of the author(s) and do not necessarily reflect the positions of Thoughtworks.