NoSQL, No Problem: An Intro to NoSQL Databases
Unraveling NoSQL, explaining what it is and working out whether it would interest you, is difficult. This article aims to give a high-level introduction to NoSQL and provides a comparison of the latest technologies in this space.
- Introduction
- What is NoSQL?
- Why is it happening now?
- Are people really using it?
- Why would you want to use a NoSQL database?
- Query Language maturity
- New technology, new challenges
- Types of NoSQL datastores
- Conclusion
Introduction
The term covers a wide range of technologies, data architectures and priorities; it represents as much a movement or a school of thought as it does any particular technology. Even the name is confusing: for some it means literally any data storage that does not use SQL, but thus far the industry seems to have settled on "Not Only SQL". As time goes on the scope of the term is likely to grow until it becomes meaningless by itself, and sub-divisions will be needed to clarify what is meant.
The NoSQL movement is a piece of guerrilla marketing that brings together a broad group of technologists and technologies under one banner. The ideas that underpin the myriad solutions that exist under the term "NoSQL" have previously been available only to those whose unique needs meant they had to develop and build them. In the areas where such solutions are a necessity they have already proven themselves; now their use has become an option for others at a much smaller investment cost. For any organisation that has the choice between NoSQL and traditional relational data, there is the difficult question of which should be used. It is still too early to provide a decisive and definitive answer, but it is clear that many organisations would benefit from a data model that better matches the kind of storage and retrieval that they perform in practice rather than in theory. It also seems likely that most solutions will consist of a hybrid mix of storage technologies, much as mixes of n-tier and client-server structures tend to be more common than absolute commitments to one strategy.
Technical leaders have an important role in understanding the available options and adopting the software, products and services most applicable to their own domain. Having a logical and localised strategy for adopting the best of NoSQL is going to differentiate success from failure.
Just as NoSQL presents new challenges, it also offers significant rewards to those who can successfully incorporate it into their solution portfolio. The key benefits are going to emerge around improved data comprehension, flexible scaling solutions and productivity. The rich variety of new business models needs data storage that supports them, and the decades of coercing data into relational forms lie behind us.
What is NoSQL?
NoSQL is a large and expanding field. For the purposes of this paper, the common features of NoSQL data stores are:
- Easy to use in conventional load-balanced clusters
- Persistent data (not just caches)
- Scale to available memory
- Have no fixed schemas and allow schema migration without downtime
- Have individual query systems rather than using a standard query language
- Are ACID within a node of the cluster and eventually consistent across the cluster
Not every product in this paper has every one of these properties but the majority of the stores we are going to talk about support most of them.
Why is it happening now?
There are three key drivers behind the increased interest in NoSQL. The first is the appearance of a new form of traffic profile driven by what might be referred to as Web 2.0 or the Social Web, as well as the maturing of Internet retail.
"Web scale", as it is commonly referred to, is a capacity planning, scale and provisioning issue that has become pressing for many web businesses over the last five years. As the world becomes more connected it is possible for sites to experience massive variations of traffic. Some of these are related to predictable events: the World Cup or Christmas; others are unpredictable and global, for example September 11th posed massive challenges for news sites. Sites like Facebook have made it easy for sites to experience massive upswings of popularity as items "go viral" and are distributed by global world of mouth.
User-generated content causes particular headaches. The challenge of scaling "read-heavy" websites is well understood, with the use of static content and Content Distribution Networks (CDNs); user-generated content means that sites become more "read-write" balanced. Sites like Twitter experience massive surges in write traffic in very narrow time frames (a goal scored or denied, an election declaration or a TV finale); their infrastructure needs to adapt rapidly and not be stuck in the wrong mode at the wrong time. The normal approach to scaling has been to add webservers, which works until traffic through the database (historically a single instance) becomes the bottleneck. The answer then has been to buy progressively more powerful hardware until the database can serve all the traffic. Web scale invalidates this model: you face the dilemma of having to purchase hardware to meet your peak demand (Christmas, the World Cup) that operates very far below capacity day to day. For some businesses it is simply impossible to purchase the hardware and licenses to meet their peak demand through a single server. These businesses have been seeking a scalable data solution that mirrors their web architecture.
The second driver is the fact that data changes over time. As the business model evolves, concepts and data models often struggle to keep pace with the changes. The result is often a data structure that is filled with archaic language and patched and adapted data. Anyone who has had to explain that the value in a column has a different meaning depending on whether it is less than or greater than 100, or that "bakeries" are actually "warehouses" due to historical accident, knows that the weight of history in the data model can be a serious drag on maintaining a system or incorporating new business ideas.
The final factor is that NoSQL technology is now starting to become a commodity. Once, an Amazon or a Google had no choice but to create a bespoke solution to their problems of scale. The cost of writing such a solution prevented enterprises that did not have these issues at the heart of their business model from exploiting the new technology. Recently, a series of donations of code to bodies such as the Apache Foundation and other open source groups, which provide community-driven support and development, has led to the possibility of using extremely sophisticated code at little upkeep cost. Such code puts NoSQL firmly within the reach of smaller companies. Instead of being an esoteric subject, NoSQL data stores can now be downloaded and made part of an enterprise architecture in weeks.
Are people really using it?
A common question that is asked about NoSQL is whether people are really using it or whether it is just hype. The answer is that if you have ever used Amazon, Yahoo or Google then you have had your data served via a NoSQL solution. If you have used eBay or Twitter you have indirectly used datastores that bear little resemblance to traditional databases (for example, eBay does not use transactions and Twitter uses a custom graph database to track who follows whom). Usually the question really means: are people like me using it? The answer is that if you are facing issues dealing with certain types of data then there is potential competitive advantage to be gained by looking at a NoSQL solution. The area is new enough that most businesses would feel uncomfortable running critical work anywhere other than in mature relational data stores, even if those relational stores cause a lot of issues in their own right.
Why would you want to use a NoSQL database?
One of the fundamental drivers is that you have challenges in your business that are difficult to solve using traditional relational database technology. If you have an excellent relational model running on a mature database that provides all the features you need, then there is probably little need to change your data storage mechanism. Here are some use cases where a conventional database is sub-optimal:
- Your relational database will not scale to your traffic at an acceptable cost
- Your data is supplied in small updates spread over time, so the number of tables required to maintain a normal form has grown disproportionately to the data being held. Informally, if you can no longer print your ERD on a piece of A3 paper, you may have hit this problem (or you are storing too much in a single database).
- Your business model generates a lot of temporary data that does not really belong in the main data store. Common examples include shopping carts, retained searches, site personalisation and incomplete user questionnaires.
- Your relational database has already been denormalised for reasons of performance or for convenience in manipulating the data in your application.
- Your dataset consists of large quantities of text or images and the column definition is simply a Large Object (CLOB or BLOB).
- You need to run queries against your data that do not involve simple hierarchical relations; common examples are recommendations or business intelligence questions that involve an absence of data. For the latter, consider "all women in Paris who do have a dog and whose ex sisters-in-law have not yet purchased a paperback this year" as a contrived example; "all people in a social network who have not purchased a book this year who are once removed from people who have" is a real one if you want to target advertising on a site that says "Fred bought X".
- You have local data transactions that do not have to be very durable. For example, "liking" items on websites: creating transactions for this kind of interaction is overkill, because if the action fails the user is likely to just repeat it until it works. AJAX-heavy websites tend to have a lot of these use-cases.
Query Language maturity
One of the babies that risks being thrown out with the bathwater is SQL itself. NoSQL has chosen SQL as its bête noire, even though in reality it is just a standard that is often mixed up with compromised implementations. SQL has many advantages that the NoSQL products will have to address over time. Firstly, it is mature, refined and generally meets the expectations of its users. It has a coherent, full-featured syntax, which means that people who produce complex SQL queries are likely to balk at being asked to replicate operators like SUM, ORDER BY and GROUP BY in a map-reduce job that they have to write themselves in JavaScript.
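To illustrate the gap, here is a sketch of the hand-written map and reduce phases that replace a one-line SQL GROUP BY with SUM. It is in Python rather than JavaScript, and the data and field names are invented for illustration:

```python
from collections import defaultdict

# SQL equivalent: SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region
orders = [
    {"region": "EMEA", "amount": 120},
    {"region": "APAC", "amount": 75},
    {"region": "EMEA", "amount": 30},
]

def map_phase(rows):
    # Emit one (key, value) pair per row
    for row in rows:
        yield row["region"], row["amount"]

def reduce_phase(pairs):
    # Sum the values for each key (the GROUP BY + SUM step)
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return totals

# Sorting by key stands in for ORDER BY
result = sorted(reduce_phase(map_phase(orders)).items())
print(result)  # [('APAC', 75), ('EMEA', 150)]
```

Every clause the SQL engine would handle declaratively becomes code the developer must write, test and maintain.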
Even the vendors themselves recognise the problem: if they are unable to agree a common set of data manipulation operations, then it is likely that one implementation will become popular, and users will either migrate to the product that solves their problem or all vendors will have to implement the market leader's command-set to be competitive.
There are some standards already available, such as SPARQL, a standard for querying RDF or tuple data. This could be adapted to both document and graph databases, but currently there is nothing that provides a genuinely modular query syntax that could be compared to SQL.
It is an irony that NoSQL products more complex than the Key-Value stores are likely to have to implement something very similar to SQL if they want to achieve the same broad usage as Relational data products do today. In some ways this fact lies behind the "Not only SQL" slogan, truly doing away with SQL would just be too painful.
New technology, new challenges
When trying to incorporate NoSQL into existing large-scale systems, it is obviously easiest if the solution already has loose coupling between components. In this situation it is easier to identify areas that would benefit from a NoSQL solution and then implement a piecemeal adoption. Where data storage is monolithic and systems may actually depend on certain properties of relational data, for example data types or transactional consistency, the problem is much harder. In some ways de-coupling data provision needs to be the first task, rather than migrating data storage.
From a solution point of view there needs to be a clear analysis of what data is genuinely relational and what is stored in relational stores only due to a lack of alternatives at the time. It is also important to review past decisions to see whether they were made under constraints that no longer apply. A particular example is the use of a graph database instead of very complex relational tables. It is entirely possible to create sets of many-to-many relations in relational data and then query the intersections of these relationships, but expressing just the relationships may result in a much simpler solution.
There are some obvious areas where NoSQL can be applied immediately. Website content can generally be expressed in terms of document and key-value datastores. Particular examples of suitable situations are forms and wizard-style metaphors: any web form can find ready expression in a document form. Lookup data is another example; lots of reference data consists of maps, lists and sets, for example referrers, countries, reasons for cancellation, counties, provinces and states. Looking for these patterns in data should allow identification of opportunities.
Looking more strategically, systems that need to evolve and change their data frequently offer a chance to use a schema-less data store. If being able to migrate data structures without taking the data store offline would be advantageous you have a strong indicator that looking for a NoSQL solution would be valuable.
Types of NoSQL datastores
The following sections describe the different types of NoSQL datastores.
Key Value stores
- Examples: Tokyo Cabinet/Tyrant, Redis, Voldemort, Oracle BDB
- Typical applications: Content caching
- Strengths: Fast lookups
- Weaknesses: Stored data has no schema
Example application: You are writing forum software where you have a home profile page that gives the user's statistics (messages posted, etc) and the last ten messages by them. The page reads from a key that is based on the user's id and retrieves a string of JSON that represents all the relevant information. A background process recalculates the information every 15 minutes and writes to the store independently.
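As a minimal sketch of this pattern, assuming an in-memory dict standing in for a key-value store such as Redis (the key scheme and field names are invented):

```python
import json

# An in-memory dict standing in for a key-value store such as Redis.
kv_store = {}

def recalculate_profile(user_id, stats, last_messages):
    # Background job: serialise the whole page's data as one JSON string per user
    key = f"profile:{user_id}"
    kv_store[key] = json.dumps({"stats": stats, "last_messages": last_messages[-10:]})

def read_profile(user_id):
    # Page render: a single key lookup, no joins, no schema
    return json.loads(kv_store[f"profile:{user_id}"])

recalculate_profile(42, {"messages_posted": 120}, ["hello", "world"])
print(read_profile(42)["stats"]["messages_posted"])  # 120
```

The page render is a single O(1) lookup; all the expensive aggregation happens in the background writer.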
Document databases
- Examples: CouchDB, MongoDB
- Typical applications: Web applications
- Strengths: Tolerant of incomplete data
- Weaknesses: Query performance, no standard query syntax
Example application: You are creating software that creates profiles of refugee children with the aim of reuniting them with their families. The details you need to record for each child vary tremendously with circumstances of the event and they are built up piecemeal, for example a young child may know their first name and you can take a picture of them but they may not know their parent's first names. Later a local may claim to recognise the child and provide you with additional information that you definitely want to record but until you can verify the information you have to treat it sceptically.
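A sketch of how such piecemeal, schema-less documents might be built up, using plain Python dicts in place of a document store like CouchDB or MongoDB (all field names are invented for illustration):

```python
import uuid

# A list of dicts standing in for a document collection.
children = []

def register_child(**known_fields):
    # Store only whatever is known; no schema forces empty columns
    doc = {"_id": str(uuid.uuid4()), **known_fields}
    children.append(doc)
    return doc["_id"]

def add_unverified_claim(child_id, claim):
    # Append piecemeal information later without any schema migration
    for doc in children:
        if doc["_id"] == child_id:
            doc.setdefault("unverified_claims", []).append(claim)

child_id = register_child(first_name="Amina", photo="amina.jpg")
add_unverified_claim(child_id, {"claimed_surname": "Okonkwo", "source": "local resident"})
```

Each document carries only the fields that apply to it, and new kinds of information can be attached as they arrive.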
Graph databases
- Examples: Neo4j, InfoGrid, Infinite Graph
- Typical applications: Social networking, recommendations
- Strengths: Graph algorithms, e.g. shortest path, connectedness, n-degree relationships
- Weaknesses: Has to traverse the entire graph to achieve a definitive answer. Not easy to cluster.
Example application: Any application that requires social networking is best suited to a graph database. The same principles can be extended to any application where you need to understand what people are doing, buying or enjoying so that you can recommend further things for them to do, buy or like. Any time you need to answer a question along the lines of "What restaurants do the sisters of people who are over 40, enjoy skiing and have visited Kenya dislike?", a graph database will usually help.
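A toy illustration of the kind of traversal a graph database optimises, using an in-memory adjacency list (a real product such as Neo4j stores typed, directed relationships and indexes them, but the traversal idea is the same; the names are invented):

```python
from collections import deque

# "follows" relationships as an adjacency list
follows = {
    "alice": ["bob", "carol"],
    "bob": ["dave"],
    "carol": ["dave"],
    "dave": [],
}

def degrees_of_separation(start, target):
    # Breadth-first search: the first time we reach the target
    # is the shortest path length
    queue, seen = deque([(start, 0)]), {start}
    while queue:
        node, depth = queue.popleft()
        if node == target:
            return depth
        for neighbour in follows.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, depth + 1))
    return None  # not connected

print(degrees_of_separation("alice", "dave"))  # 2
```

In a relational store this query requires recursive self-joins; in a graph store it is the native operation.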
XML databases
- Examples: eXist, Oracle, MarkLogic
- Typical applications: Publishing
- Strengths: Mature search technologies, schema validation
- Weaknesses: No real binary solution, easier to re-write documents than update them
Example application: A publishing company that uses bespoke XML formats to produce web, print and eBook versions of their articles. Editors need to quickly search either the text or semantic sections of the markup (e.g. articles whose summary contains diabetes, where the author's institution is Liverpool University and Stephen was a revising editor at some point in the document history). They store the XML of finished articles in the XML database and wrap it in a readable-URL web service for the document production systems. Workflow metadata (which stage a manuscript is in) is held in a separate RDBMS. When system-wide changes are required, XQuery updates are used to bulk-update all the documents to match the new format.
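A rough sketch of a semantic search like the one above, using Python's standard-library ElementTree in place of a real XML database's XQuery engine (the article markup here is invented for illustration):

```python
import xml.etree.ElementTree as ET

# Invented article markup for illustration
article_xml = """
<article>
  <summary>New findings on diabetes management.</summary>
  <author institution="Liverpool University">J. Smith</author>
  <history><editor role="revising">Stephen</editor></history>
</article>
"""

root = ET.fromstring(article_xml)

# Query against semantic sections of the markup, not just raw text
matches = (
    "diabetes" in root.findtext("summary", "")
    and root.find("author").get("institution") == "Liverpool University"
    and any(e.text == "Stephen" for e in root.iter("editor"))
)
print(matches)  # True
```

An XML database performs the same kind of structure-aware query, but over millions of documents with indexes and full XQuery/XPath support.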
Distributed Peer Stores
- Examples: Cassandra, HBase, Riak
- Typical applications: Distributed file systems
- Strengths: Fast lookups, good distributed storage of data
- Weaknesses: Very low-level API
Example application:
You have a news site where any piece of content (articles, comments, author profiles) can be voted on, with an optional comment supplied on the vote. You create one store per user and one store per piece of content, using a UUID as the key (generating one for each piece of content and user). The user's store holds every vote they have ever made, while the content "bucket" contains a copy of every vote that has been made on that piece of content. Overnight you run a batch job to identify content that users have voted on, and generate a list of content for each user that has high votes but which they have not voted on. You then push this list of recommended articles into the user's "bucket".
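A simplified in-memory sketch of the per-user and per-content buckets described above (the ids, voting threshold and structure are all illustrative; a real store such as Cassandra would distribute these buckets across nodes):

```python
from collections import defaultdict

user_votes = defaultdict(dict)     # user_id -> {content_id: vote}
content_votes = defaultdict(dict)  # content_id -> {user_id: vote}
recommendations = defaultdict(list)

def record_vote(user_id, content_id, vote):
    # Write the vote into both buckets: denormalised so that
    # reads in either direction need no join
    user_votes[user_id][content_id] = vote
    content_votes[content_id][user_id] = vote

def nightly_batch():
    # Recommend highly voted content to users who have not voted on it
    for content_id, votes in content_votes.items():
        if sum(votes.values()) >= 2:  # "high votes" threshold (arbitrary)
            for user_id in user_votes:
                if content_id not in user_votes[user_id]:
                    recommendations[user_id].append(content_id)

record_vote("u1", "article-9", 1)
record_vote("u2", "article-9", 1)
record_vote("u3", "article-4", 1)
nightly_batch()
print(recommendations["u3"])  # ['article-9']
```

The data is deliberately duplicated into both buckets at write time, trading storage for fast, join-free reads.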
Object stores
- Examples: Oracle Coherence, db4o, ObjectStore, GemStone, Polar
- Typical applications: Finance systems
- Strengths: Matches OO development paradigm, low-latency ACID, mature technology
- Weaknesses: Limited querying or batch-update options
Example application: A global trading company has a monoculture of development and wants to have trades done on desks in Japan and New York pass through a risk-checking process in London. An object representing the trade is pushed into the object store, and the risk checker listens for the appearance or modification of trade objects. When the object is replicated into the local European space, the risk checker reads the trade and assesses the risk. It then rewrites the object to indicate that the trade is approved and generates an actual trade fulfillment request. The trader's client listens for changes to objects that contain the trader's id and updates the local detail of the trade, indicating to the trader that the trade has been approved. The trading system consumes the trade fulfillment and, when the trade elapses or is fulfilled, feeds the information back to the risk assessor.
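A minimal single-process sketch of the listen-and-rewrite pattern above (a real object store such as Coherence provides distributed listeners and cross-site replication; the trade fields here are invented):

```python
# A tiny observer-style object store: putting an object notifies listeners.
class ObjectStore:
    def __init__(self):
        self.objects = {}
        self.listeners = []

    def put(self, key, obj):
        self.objects[key] = obj
        for listener in self.listeners:
            listener(key, obj)

store = ObjectStore()

def risk_checker(key, trade):
    # React to new trade objects and rewrite them with an approval flag.
    # The rewrite triggers listeners again, but the status guard stops recursion.
    if trade.get("status") == "new":
        store.put(key, {**trade, "status": "approved"})

store.listeners.append(risk_checker)
store.put("trade-1", {"desk": "Tokyo", "notional": 5_000_000, "status": "new"})
print(store.objects["trade-1"]["status"])  # approved
```

The components never call each other directly; they coordinate solely by reacting to objects appearing or changing in the shared store.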
Conclusion
Data remains tabular and the spreadsheet is still a business's favourite data modelling tool. SQL is not going away any time soon. However, until now we have been creative in working with and around the constraints of a typical relational datastore. NoSQL offers the chance to think differently about data, and that is a tremendously exciting prospect.
Disclaimer: The statements and opinions expressed in this article are those of the author(s) and do not necessarily reflect the positions of Thoughtworks.