Last fall, I heard about Riak and thought it sounded awesome, and it boasts a lot of neat features that I’ll go into detail about below. I was further impressed when I met the Basho team at nosqleast in Atlanta. They really seem to know what they’re doing.
In December, John Nunemaker wrote a post entitled Why I think MongoDB is to Databases what Rails was to Frameworks. I commented on twitter:
If as @jnunemaker says, #mongodb is like Rails for databases, then #riak is like Merb.
His reply was quite humorous (no, John, Riak won’t merge into MongoDB), but it got me thinking. So here are my…
7 Reasons why Riak is the Merb of databases (or better)
…and why it should be the database for your next Rails app.
Scales horizontally without pain
Riak runs just about in the same way on 1 node as it does 100. This is because, in short, it is an implementation of Amazon’s Dynamo. Need more storage capacity, throughput and availability? Add more nodes.
Now, a lot of people throw around the word “scalability” without really knowing what it means. It’s really a ratio of the desired properties of your system to the cost inherent in achieving those goals. High scalability means you have a low cost in proportion to your ability to grow. It has nothing to do with performance (although good performance can be a side-effect of good scalability). By this measure, actually, Rails scales well (shared nothing) — but that doesn’t mean it’s always performant (slow Ruby implementations, large memory consumption, etc). Despite the trolls, it’s pretty well known that Twitter’s scaling problems were not really caused by Rails.
So what makes Riak more capable of handling growth than MongoDB, CouchDB or MySQL? It’s built with master-less replication in mind from the beginning. When you grow a MongoDB or MySQL database, you typically add a slave server which receives and replays the transaction logs of the master server. Ideally, this slave can take over if the master goes down. There are also master-master setups and other ways of configuring replication, but in general it’s a thing you add on, not built-in.
Instead, Riak has no notion of a master database node. Every node participates equally in the cluster and spreads the load in a predictable way using consistent hashing (another feature of Dynamo). So yes, increasing capacity and throughput and all those other things you expect from your database is as easy as starting up new nodes and having them join the ring. Data will be replicated in the background. You can even run Riak nodes on less-powerful machines (think EC2) and get decent results. Large, monolithic DB servers are the last vestige of the mainframe era, so democratize your database layer!
HTTP-Compliant REST interface
Rails nerds like me love REST, but most don’t really understand what it means, or rather think that it is a one-to-one mapping to CRUD. Rails’ concept of REST is at best incomplete and potentially misleading. (See also Scott Raymond’s RailsConf 2007 talk)
The Riak developers spent a lot of time creating an awesome toolkit in Erlang for building RFC-compliant HTTP resources, and Riak has several of them. This means that when interacting with Riak, you use HTTP in a natural way and get predictable responses — even the hard stuff like content-negotiation, entity tags and conditional methods.
The side effect of composing Riak of well-behaved HTTP resources is that it plays very nicely with other HTTP-related infrastructure, including proxies, load balancers, caches, and clients of all stripes.
Robust failure recovery
One of Riak’s most compelling features is that it handles node failure robustly. If a node goes down in your cluster, its replicas will take over for it until it comes back, a feature known as hinted handoff. If it doesn’t come back — as happens all too often on EC2 — you can add a new node and the cluster will rebalance. The failure recovery story for MySQL, even in a replicated scenario, is much more difficult, time-consuming, and costly.
Riak has fine-grained robustness as well. It was built primarily in Erlang and with the OTP Design Principles in mind. One aspect of the OTP design is that the processes in the system are organized in a supervision tree, so that when one crashes, it will be restarted by its supervisor process (which is in turn supervised). Send some bad data to the web interface and get a 500 error? That internal error doesn’t crash the whole system, bringing your database node down.
Justin Sheehy, Basho’s CTO, tells a story of a customer whose cluster had two nodes go down late one night. The Basho team looked for less than 10 minutes at the cluster, verified that it was still going, and then decided to wait until the morning to fix the lost nodes. As Joe Armstrong, Father of Erlang, says (paraphrased), “In order to have fault tolerance, you need more than one computer.” Riak embraces that philosophy.
Links make graph-like structures possible
Most Dynamo implementations are simple key-value stores. While that’s still pretty useful — Dynomite is really awesome at storing large binary objects, for example — sometimes you need to find things without a priori knowledge of the key.
The first way that Riak deals with this is with link-walking. Every datum stored in Riak can have one-way relationships to other data via the Link HTTP header. In the canonical example, you know the key of a band that you have stored in the “artists” bucket (Riak buckets are like database tables or S3 buckets). If that artist is linked to its albums, which are in turn linked to the tracks on the albums, you can find all of the tracks produced in a single request. As I’ll describe in the next section, this is much less painful than a JOIN in SQL because each item is operated on independently, rather than a table at a time. Here’s what that query would look like:
“/raw” is the top of the URL namespace, “artists” is the bucket, “TheBeatles” is the source object key. What follows are match specifications for which links to follow, in the form of bucket,tag,keep triples, where underscores match anything. The third parameter, “keep” says to return results from that step, meaning that you can retrieve results from any step you want, in any combination. I don’t know about you, but to me that feels more natural than this:
SELECT tracks.* FROM tracks
INNER JOIN albums ON tracks.album_id = albums.id
INNER JOIN artists ON albums.artist_id = artists.id
WHERE artists.name = "The Beatles"
The caveat of links is that they are inherently unidirectional, but this can be overcome with little difficulty in your application. Without referential integrity constraints in your SQL database (which ActiveRecord has made painful in the past), you have no solid guarantee that your DELETE or UPDATE won’t cause a row to become orphaned, anyway. We’re kind of spoiled because ActiveRecord handles the linkage of associations automatically.
The place where the link-walking feature really shines is in self-referential and deep transitive relationships (think
has_many :through writ large). Since you don’t have to create a virtual table via a JOIN and alias different versions of the same table, you can easily do things like social network graphs (friends-of-friends-of-friends), and data structures like trees and lists.
Powerful Map-Reduce and soon, Lucene search
The second way to find data stored in Riak is using Map-Reduce. Unlike links, where you have to start with a known key, you can run map-reduce jobs over any number of keys or an entire bucket.
One thing that might not be immediately obvious when using an SQL database is that the declarative nature of the query language hides the imperative nature of performing the query. SQL databases have sophisticated query planners that decompose your query into several steps (see EXPLAIN command on MySQL) and attempt to optimize the order of those steps based on known table properties, like indices, keys, and row counts. Riak’s map-reduce, while still largely declarative, lets you decide which order to run steps in. You trade a little abstraction for a lot of power.
Riak’s map-reduce is also quite different from CouchDB’s views:
- You’re not trying to build an index which you’ll query later, you’re performing the query.
- You can have any number and combination of map, reduce and link phases (link is actually a special case of map).
- Since it’s not an index, your query need not be contiguous across the keyspace.
- There’s no concept of re-reduce because reduce phases are only run once.
Although not complete or released, Basho also has an awesome Lucene-compatible search system in the pipeline (already in use at Collecta, or so I hear). I saw it in action this past week while in Boston and was impressed. It’s fairly comparable in performance to Solr for large datasets, but scales across your Riak cluster.
Beyond schema-less: Content-type agnostic
One thing that may seem unintuitive at first is that Riak doesn’t care what type of content you put in it. CouchDB and MongoDB use JSON and BSON (Binary JSON), respectively, to store objects. Instead of forcing a format, Riak lets you store pretty much anything. There’s no concept of “attachments” or “GridFS” (unless you add it yourself), so just store the file directly in a bucket/key with a PUT or POST request. As long as you specify the “Content-Type” header, Riak will remember it and give you back the right thing when you access it the next time. No need to do base 64 encoding to store a picture, PDF, Word document, podcast, or whatever. This could enable you to replace a distributed filesystem like GFS, or even S3, with a Riak cluster. Bonus: you get replication and fail-over for no extra cost.
There are some minor kinks with large files (>100MB), but I’ve been assured by Basho that they are addressing the issue. In the meantime, you could chunk the file manually and follow links to get the pieces and put them back together.
Tunable levels of consistency, durability, and performance
The knee-jerk argument against Dynamo-like data stores is that they don’t have ACID properties and that “eventual consistency” is too eventual. In practice, especially in large deployments and high throughput scenarios, ACID breaks down because it requires actions to be “all or nothing”, effectively creating bottlenecks while clients wait for transaction handles. If you’ve done any distributed or concurrent computing, you know that contention for shared resources is a primary cause of failure, including problems like deadlock, starvation and race conditions. For the sake of reducing single points of failure (bottlenecks), Riak implements eventual consistency — where storage operations are accepted immediately, and then propagated across the cluster in an asynchronous fashion. This makes it nearly always available for writes and reads.
In order to deal with inconsistency, Riak tags each datum with a vector clock that internally reveals the datum’s lineage (who modified what version). When there are conflicts — that is, two parallel versions of the same datum — Riak returns you both versions so that your application can decide how to resolve it. In an SQL database with this kind of conflict, your transaction might fail and rollback, forcing you to resolve it and retry anyway.
Beyond just eventual consistency, Riak lets you tune the amount of consistency, durability, and availability you want, even somewhat at request-time. If you need more read availability and replication, you can increase the N value for a bucket (number of replicas, default 3). If you want to be sure the data you’re reading is consistent across the cluster, increase the R value at request time (R, read quorum, is always <= N). If you want high assurance that your data is stored, increase the W and DW values (W = write quorum, DW = durable-write quorum, both always <= N). If you want better performance, e.g. to use Riak as a persistent cache, keep the R, W, and DW values low. You also have the choice of which backend to use when storing your data on the replicas (anything from in-memory to on-disk, or combinations), including the recently released Innostore. Although in many cases the defaults are good enough, you have the flexibility to choose a model that fits your application.
(Note: currently, the N value must be set before you insert any data into the bucket).
Why Riak for Rails?
Ok, so those are the reasons why Riak is awesome in my mind. Why should you use it in your Rails app (or other Ruby application)? Here are some possibilities:
- Scale up (or down) cheaply: Riak can grow horizontally with your app, either in lock-step (have database nodes on your app servers!) or independently. A colleague described it as a “Christmas tree” — you could have a load-balanced cluster of Ruby processes and behind that a load-balanced cluster of Riak nodes.
- Ease of deployment: No longer do you need to break out or specialize nodes to be database-only, unless you want to. The homogeneity will make your sysadmins happy, and HTTP-friendliness means they already know how to grow the infrastructure.
- Flexibility: You can use Riak as a document database, a cache store, a session store, a log server, or a distributed filesystem, with custom settings for each scenario. Soon you’ll also be able to use it as a Solr search replacement.
- Powerful modeling: Use links to create relationships in a natural fashion, reducing the number of JOIN-like operations, and then discover data using arbitrary-depth link-walking, or create custom queries with map-reduce.
- Background processing and analytics, a.k.a. Big Data: With high write availability and the powerful map-reduce framework, you could defer complicated calculations and later run them in parallel across the Riak cluster. Think data warehouses or Hadoop, but with 1/10 the LoC and no Java.
Where Riak has room to grow
I love Riak’s capabilities, but let’s face it — no system is perfect. Here’s where it could be improved:
- Library support: Although it’s REST so you can theoretically use any HTTP client, the included client libraries are pretty basic and use the soon-to-be-deprecated “jiak” interface. I expect that this will improve real soon — we’ve already seen an open-source Java client released.
- Efficient ad-hoc lookup: If you only know the bucket of an object but not its key, it’s potentially expensive to find it, since Riak has no concept of indices other than the key (for obvious reasons – it doesn’t know the structure of your content). In the meantime you have two options – eagerly create and maintain them yourself by adding objects with meaningful keys that link back to your original object; or create a map-reduce job to achieve the same result.
- Reduce-phase bottleneck: Map and link phases happen with great data-locality, on the nodes where the data is stored, but reduce phases happen in a single node. This is logically simple, but introduces a potential bottleneck and point of failure. If reduce phases are also referentially transparent and idempotent, they could be federated as well, and only incur the single-node hit at the very end.
- Large files: Due to some YAGNI-related issues in Webmachine and Riak’s object storage, you can’t load big files into Riak. Break your “l33t xv1dz” into chunks or store them in the filesystem for now.
What to do next
I have lots more to say about Riak (hard to believe, right?), so watch for more blog posts in the future. If you’re ready to dive in, check out the Riak dev site, sign up for the mailing list, or join in the discussion in the IRC channel on Freenode.