Going back to the paper, I think it is a worthwhile read, not only for the research work reported, but also for the thoughts it provokes. Caching may seem simple, but getting it right is not. There are more questions to ponder here than at first it might appear.
I read this paper a few weeks ago when it was first made available. Having spent a good deal of my time at Yahoo! working on query result caching and postings caching, the topic is one with which I am intimately familiar.
One of the primary findings of the paper is that a statically-generated postings cache using a cache value measure built from historical query logs can provide a very high cache hit rate and outperforms various other static and dynamic caching schemes that they examined.
I have worked with a static postings caching scheme very similar to the one described and it can indeed achieve a very high hit rate. As the researchers correctly point out, while the percentage of change in queries over time is high, the frequencies of the keywords within the queries actually change very little over time. Given that this is the case, there is a certain amount of intuitive sense to the notion that static caching could be more efficient than dynamic caching — dynamism generally comes with a certain amount of overhead in terms of space, which cuts into the space which can be spent on the actual postings themselves. Also, dynamic caches generally incur some additional CPU overhead since they require a certain amount of data movement.
I do think that the paper sets up a bit of a strawman on the dynamic caching side. Since the static caching scheme exploits the relatively slow rate of change of query term frequency, one could construct a dynamic caching scheme that also does the same. A slowly-evolving cache that tracks term frequencies over a long period of time can actually outperform a static caching scheme in practice.
Furthermore, I would offer some advice to budding search engine builders out there: please look very closely at your query domain, your operational constraints, and other factors before you commit too heavily to a static caching scheme. There are lots of gotchas that can arise if you aren’t careful.
First, there are also some methodological assumptions made in the paper that could trip you up if your query domain is different from the one they examine. The paper starts with a relatively homogeneous query log (Yahoo UK) and builds from there. If you plan on serving queries sent by world-wide users in multiple languages, you may find that a static cache performs worse than you expect. For example:
- If you gather your term frequencies across the whole day, you will have lowered performance at peak load time because your frequencies are polluted by the term frequencies from other parts of the day that may have very different characteristics.
- If you just collect term frequencies from your peak hours, you run the risk of having your cache hit rate at the offpeak hours drop to the point where your disk activity may actually be higher during the offpeak hours than at peak.
- If your load mix from different markets shifts over time, you will experience shifts in your term frequencies and consequently shifts in your cache hit rates.
A dynamic cache can adjust to the changes in your load over time, including intra-day shifts due to geographic distribution and gradual changes in your mix of load from different regions.
Another issue to consider is that the query log analysis described in the paper assumes a simple picture of search engine term retrieval, namely that the terms in the query log translate directly into the terms that are retrieved for those queries. If you look closely at the current state of the art for major search engines, you’ll see that, in fact, the mapping from input terms to retrieved terms is not so direct and further that these mappings are constantly being refined and adjusted over time. Stemming, spell correction, tokenization, hard phrases, and named entities are processed on top of the raw input queries to construct a more complex set of retrieved terms. And the implementations of all these subsystems are continually being extended, upgraded, tweaked, and otherwise modified. These factors hugely complicate the task of offline term frequency collection and may result in significant degradation in the cache hit rates.
A dynamic cache also has certain procedural benefits. Since the static cache has to be built from offline data collected over some period of time, you have to actually engineer a process to manage and collect the data. Also, since the underlying frequencies change infrequently, the process of collecting them tends to happen infrequently. Activities that only need to happen infrequently can be problematic — because you tend not to engineer sufficiently good processes to get that activity done painlessly. This creates a feedback loop: the infrequent process is painful, so you tend to do it even less often than you should. This can lead to greater degradation than you expected up front when you bought into the static caching methodology.
All in all, a well-designed dynamic cache shouldn’t be substantially worse than a static cache, even at the theoretical level, and in practice will often perform better and with substantially fewer headaches.
This post has run on long enough so I’ll defer discussion of some of the other interesting topics touched on by this paper and by Greg’s post to another time.