More on CAP

This blog post and the comments have a good discussion on the CAP Theorem that captures the main gist of my issues with many discussions of CAP, although I may still have some additional comments, particularly about the practical implications for real distributed systems and why the “CA/CP” dichotomy seems a bit misguided.

Twitter Conversation about CAP

I am capturing a brief Twitter conversation about the CAP Theorem here. I’ll try to follow up with a longer post about the misconceptions that seem to float around about the CAP Theorem.

@chad_walters:
This diagram is interesting but somehow strikes me as a bit misguided: http://bit.ly/dbZFxU #NoSQL
3:25 PM Mar 14th via web

@chad_walters:
I agree that the systems listed under “AP” lack strong consistency. But the division between the “CA” and “CP” systems seems wrong #NoSQL
3:27 PM Mar 14th via web

@chad_walters:
Those systems have individually varying points where availability breaks down under network partitioning (and other kinds of faults). #NoSQL
3:31 PM Mar 14th via web

@chad_walters:
And my impression is that the NoSQL (“CP”) systems are generally more fault tolerant than the RDBMS (“CA”) systems. #NoSQL
3:32 PM Mar 14th via web

@chad_walters:
But the diagram creates the mistaken impression that RDBMS are “more available” than NoSQL, which is simply not the case AFIACT #NoSQL
3:34 PM Mar 14th via web

@clehene:
@chad_walters Just a few letters mixed up a bit. BigTable and HBase were CA, not CP last time I checked.
3:29 PM Mar 14th via TweetDeck in reply to chad_walters

@chad_walters:
@clehene My point is that I don’t think “CA” and “CP” are sensible classifications, at least as I read the formal CAP proof. #NoSQL
3:38 PM Mar 14th via web in reply to clehene

@LusciousPear:
@chad_walters @clehene Every time I say BigTable is CP or CA, I’m told I’m wrong and the other is the truth. #headdesk
3:40 PM Mar 14th via Echofon

@clehene:
@LusciousPear @chad_walters I guess people like taxonomies. It’s just hard to classify things when labels are rather philosophical. #nosql
3:41 PM Mar 14th via TweetDeck in reply to LusciousPear

@chad_walters:
@LusciousPear Right. I don’t think it is the right distinction. Unlike “fast/cheap/good”, CAP does not mean “pick two”. #NoSQL
3:43 PM Mar 14th via web in reply to LusciousPear

@chad_walters:
@clehene The CAP theorem has a formal proof: http://lpd.epfl.ch/sgilbert/pubs/BrewersConjecture-SigAct.pdf /cc @LusciousPear #NoSQL3:48 PM Mar 14th via web in reply to clehene

@LusciousPear:
@chad_walters @clehene Right. CAP is more like knobs to turn, not blocks to build from.
3:48 PM Mar 14th via Echofon in reply to chad_walters

@chad_walters:
@LusciousPear Well I see strong consistency as the main choice point. Beyond that you have varying degrees of fault tolerance. /cc @clehene
3:52 PM Mar 14th via web in reply to LusciousPear

@chad_walters:
@LusciousPear CAP says that weakly consistent systems can maintain A (under the formal definition…) in the face of P /cc @clehene 3:54 PM Mar 14th via web in reply to LusciousPear

@clehene:
@chad_walters @LusciousPear I will soon come out with some reflections about that, just need to polish things a bit 🙂
3:55 PM Mar 14th via TweetDeck

@chad_walters:
@LusciousPear But is the formal definition of A really the most interesting aspect of fault tolerance? I’m not convinced. /cc @clehene
3:56 PM Mar 14th via web in reply to LusciousPear

@chad_walters:
@LusciousPear I’m much more interested in practical fault tolerance than in theoretical fault tolerance /cc @clehene
4:03 PM Mar 14th via web

@LusciousPear:
@chad_walters I’m 100% with you. I think certain contexts of A are overrated, and I’d rather have something practical than “elegant”
4:11 PM Mar 14th via Seesmic in reply to chad_walters

New Role, New Location

I recently started a new role as Dev Manager for the Bing Index Serve team at Microsoft. It’s a very exciting opportunity — Bing is gaining tremendous traction and the Index Serve dev team is packed with smart, talented, and dedicated engineers. In addition, the role is a great fit with my skills and experience and also positions me to have a big impact.

I am commuting back and forth between San Francisco and Bellevue WA for a couple months until my family is ready to relocate permanently. I will certainly miss the Bay Area — which has been my home for the past 22 years — and the many friends, colleagues, collaborators and co-conspirators that have made San Francisco such a fun and rewarding place to play and work. However, the Seattle area is not so far off and I am already starting to experience some of the great things that the Pacific Northwest has to offer.

I am hoping to do some regular blogging here… We’ll see how that pans out.

Caching in Search Systems

Greg Linden recently posted two articles of direct interest to me. One is about HBase, an open source project that my team is contributing to — I’ll probably talk more about HBase some time soon.

The other article is prompted by a paper from Yahoo! Research about caching and search engine performance (PDF). I wholeheartedly agree with Greg’s final paragraph on the subject:

Going back to the paper, I think it is a worthwhile read, not only for the research work reported, but also for the thoughts it provokes. Caching may seem simple, but getting it right is not. There are more questions to ponder here than at first it might appear.

I read this paper a few weeks ago when it was first made available. Having spent a good deal of my time at Yahoo! working on query result caching and postings caching, the topic is one with which I am intimately familiar.

One of the primary findings of the paper is that a statically-generated postings cache using a cache value measure built from historical query logs can provide a very high cache hit rate and outperforms various other static and dynamic caching schemes that they examined.

I have worked with a static postings caching scheme very similar to the one described and it can indeed achieve a very high hit rate. As the researchers correctly point out, while the percentage of change in queries over time is high, the frequencies of the keywords within the queries actually change very little over time. Given that this is the case, there is a certain amount of intuitive sense to the notion that static caching could be more efficient than dynamic caching — dynamism generally comes with a certain amount of overhead in terms of space, which cuts into the space which can be spent on the actual postings themselves. Also, dynamic caches generally incur some additional CPU overhead since they require a certain amount of data movement.

I do think that the paper sets up a bit of a strawman on the dynamic caching side. Since the static caching scheme exploits the relatively slow rate of change of query term frequency, one could construct a dynamic caching scheme that also does the same. A slowly-evolving cache that tracks term frequencies over a long period of time can actually outperform a static caching scheme in practice.

Furthermore, I would offer some advice to budding search engine builders out there: please look very closely at your query domain, your operational constraints, and other factors before you commit too heavily to a static caching scheme. There are lots of gotchas that can arise if you aren’t careful.

First, there are also some methodological assumptions made in the paper that could trip you up if your query domain is different from the one they examine. The paper starts with a relatively homogeneous query log (Yahoo UK) and builds from there. If you plan on serving queries sent by world-wide users in multiple languages, you may find that a static cache performs worse than you expect. For example:

  • If you gather your term frequencies across the whole day, you will have lowered performance at peak load time because your frequencies are polluted by the term frequencies from other parts of the day that may have very different characteristics.
  • If you just collect term frequencies from your peak hours, you run the risk of having your cache hit rate at the offpeak hours drop to the point where your disk activity may actually be higher during the offpeak hours than at peak.
  • If your load mix from different markets shifts over time, you will experience shifts in your term frequencies and consequently shifts in your cache hit rates.

A dynamic cache can adjust to the changes in your load over time, including intra-day shifts due to geographic distribution and gradual changes in your mix of load from different regions.

Another issue to consider is that the query log analysis described in the paper assumes a simple picture of search engine term retrieval, namely that the terms in the query log translate directly into the terms that are retrieved for those queries. If you look closely at the current state of the art for major search engines, you’ll see that, in fact, the mapping from input terms to retrieved terms is not so direct and further that these mappings are constantly being refined and adjusted over time. Stemming, spell correction, tokenization, hard phrases, and named entities are processed on top of the raw input queries to construct a more complex set of retrieved terms. And the implementations of all these subsystems are continually being extended, upgraded, tweaked, and otherwise modified. These factors hugely complicate the task of offline term frequency collection and may result in significant degradation in the cache hit rates.

A dynamic cache also has certain procedural benefits. Since the static cache has to be built from offline data collected over some period of time, you have to actually engineer a process to manage and collect the data. Also, since the underlying frequencies change infrequently, the process of collecting them tends to happen infrequently. Activities that only need to happen infrequently can be problematic — because you tend not to engineer sufficiently good processes to get that activity done painlessly. This creates a feedback loop: the infrequent process is painful, so you tend to do it even less often than you should. This can lead to greater degradation than you expected up front when you bought into the static caching methodology.

All in all, a well-designed dynamic cache shouldn’t be substantially worse than a static cache, even at the theoretical level, and in practice will often perform better and with substantially fewer headaches.

This post has run on long enough so I’ll defer discussion of some of the other interesting topics touched on by this paper and by Greg’s post to another time.

Introduction

I am starting up this site to blog about search, distributed computing, software engineering, and other topics of interest to me. Hopefully, they will also be of some interest to other folks who end up here.