At the eleventh hour, I joined Vanderbilt’s task force reviewing search engines to use for the university. That was about a month ago, and I’m most grateful to have been included. It’s true I’m an iPhone fan-girl, an RSS evangelist, a Drupal enthusiast, etc., etc., but in the end what matters to me most are search engines. Once a librarian, always a librarian.
Vanderbilt’s search engine contract is up for renewal, and rather than rubber-stamping the current solution, the university opted to review four contenders: IBM, Google Search Appliance (GSA), Microsoft Search Engine and Ultraseek. IBM pulled itself out of the running a couple of weeks ago, and Microsoft was fraught with technical problems. I tried to look at it 10 days in a row and could never see it, not even once.
Thus the decision came down to Ultraseek, our current search engine, vs. Google Search Appliance. Aside from financial considerations, we based our analyses on three categories.
1. Technical. This encompassed things such as how difficult it was for the IT team to set up and what kind of support they will get if it crashes.
2. Administrative. From the perspective of those who administer the search engine (in particular our university webmaster), we looked at how intuitive the interface was, how much control it gave over the end results, and how it accommodated separate instantiations and templates for large divisions within Vanderbilt that need their own search engine.
3. End User. Most important of all, we assessed the effectiveness of the results. How likely would users be to find what they are looking for?
I don’t have administrative rights to Ultraseek, so I was only able to assess it as an end user. Because of this, I won’t review it in detail, except to say that I have used it as a Vanderbilt webmaster for seven or so years and have been surprisingly happy with it, particularly with how it can be adapted for subsites. The search results, when tuned properly by the administrators, have been decent. I would give it a B overall from an end user perspective. When I search it in combination with classic Google using the “site:vanderbilt.edu” string, I can almost always find what I’m looking for. To help end users of the site I administer, I set up an advanced search page that easily allows them to do the same.
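For the curious, here is a minimal sketch of the idea behind a site-restricted search link. It is only an illustration, not the code behind my advanced search page: the function name `build_site_search_url` is hypothetical, and it assumes the public Google search URL with its `q` parameter.

```python
from urllib.parse import urlencode


def build_site_search_url(terms, site="vanderbilt.edu"):
    """Return a classic Google search URL restricted to one site.

    The "site:" operator limits results to the given domain.
    (Hypothetical helper for illustration only.)
    """
    query = f"{terms} site:{site}"
    # urlencode escapes spaces as "+" and ":" as "%3A".
    return "https://www.google.com/search?" + urlencode({"q": query})


if __name__ == "__main__":
    print(build_site_search_url("sarcoma"))
```

A search box on a page can do the same thing by quietly appending the “site:” string to whatever the user types.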
Google Search Appliance: The Good and the Not-so-good – or – Even Google Isn’t Perfect
Google Search Appliance (GSA), on the other hand, I was allowed to administer, and thus the majority of this review is an analysis of GSA. It’s a very strong contender to replace Ultraseek.
Jumping to the conclusion and then working backwards, I gave GSA the lowest marks of anyone on the task force. There’s some irony here. I’m a huge Google fan. Google classic has been my search-engine-of-choice since it launched ten years ago. I started using it even before my other librarian friends did. Not only that, I’d been sprinkled by Google Bus fairy dust just this Thursday and was wearing their t-shirt while the task force was in deliberations. But weighing the pros and cons, in the end I judged it to be about equal with Ultraseek. My fellow task force members were notably more enthusiastic than I.
In many ways, it was difficult to compare the two. In particular, Google is more likely to continue improving in the near future, but this is hard to quantify. They’ve made many improvements in the last couple of years, and promise several more soon, giving it a very slight edge. All told I gave it the equivalent of a B+. Here’s a breakdown.
Relevance Rankings. To my surprise, the default order of GSA’s search engine results seems random or worse. I tested using Vanderbilt-based search terms where I’m very familiar with the results on a variety of search engines, internal and external. As best I could tell, the GSA results were ranked primarily by their domain names, in the order these domains were crawled. Thus even searches for a cancer term typically listed all vanderbilt.edu results before any vicc.org results. (NB: vicc.org is Vanderbilt’s cancer center, with the most authoritative cancer information at Vanderbilt for patients.)
De-duplication of Pages. Ultraseek natively handles duplicate pages better than GSA. GSA, for example, pulls two versions of the exact same page: the original plus a “larger text” query string. Thus it will show both (1) www.vicc.org/dd/dz/results.php?id=34 and (2) www.vicc.org/dd/dz/results.php?q=textlarger&id=34, where Ultraseek automatically shows only the first. In theory a webmaster can control this with the robots.txt file, but I followed GSA’s instructions for doing this more than four days ago, and either the crawler still hasn’t reindexed or the instructions were misleading, because I still see many of these duplicates.
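To make the duplicate problem concrete, here is a rough sketch of one way to collapse such URLs. This is not how Ultraseek or GSA actually de-duplicates (I don’t know their internals). It simply assumes that `q=textlarger` changes only the presentation of the page, and the names `PRESENTATION_PARAMS`, `canonical_url` and `dedupe` are made up for the example.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: (name, value) query pairs that change only how a page is
# displayed, not what it contains.
PRESENTATION_PARAMS = {("q", "textlarger")}


def canonical_url(url):
    """Strip presentation-only query parameters from a URL."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if (name, value) not in PRESENTATION_PARAMS
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )


def dedupe(urls):
    """Return canonical URLs in first-seen order, dropping duplicates."""
    seen = set()
    unique = []
    for url in urls:
        key = canonical_url(url)
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


if __name__ == "__main__":
    pages = [
        "http://www.vicc.org/dd/dz/results.php?id=34",
        "http://www.vicc.org/dd/dz/results.php?q=textlarger&id=34",
    ]
    print(dedupe(pages))
```

With the two example URLs above, both collapse to the plain `id=34` version, which is the behavior I would want from a search engine by default.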
Speed. Assessing GSA was particularly challenging since our test was running on older, remote machines. This meant it was much slower. Not only were the results slower than they will be if we purchase GSA, but so was the reindexing. How fast will it be if we buy it? To get a sense, I went to other comparable institutions that have purchased GSA. Yale.edu is a good example. I searched cancer terms there, and the results came up quickly. However, this method only helped for search results. It’s impossible to tell how quickly their sites reindex. And this can be an important issue for a search engine administrator, since sometimes you have to get rid of particular results quickly. Having to wait a day, or heaven help us, four days, is simply not acceptable.
Meta-data handling. Another big unknown is GSA’s upcoming improvements to relevance rankings. We were told the next version will allow the administrator to tune results based on meta data. If true, this will help the relevance ranking problems a great deal.
Number of pages indexed. In just a few weeks, the GSA crawler found 20 million files on Vanderbilt servers. Ultraseek only crawls 38,000 files. Presumably some of this reflects Ultraseek’s de-duplication, but the numbers are so different that they clearly demonstrate GSA has the potential to find much more. This will be particularly important if we deploy the search engine to our intranets.
Authentication, HIPAA and FERPA. To quote Stanford, another GSA user: “As FERPA and HIPAA regulations begin to have an effect on the availability of web content (requiring some pages to be access-restricted, for example), the campus search appliance can be authenticated to crawl and index where outside search engines cannot.” I’m not sure precisely what this means, but it sounds both significant and promising. From talking to our GSA rep, I believe it signifies that GSA will work well on things like our Medical Center intranets, assuming we have our websites’ authentication set up properly.
The Administrative Interface is intuitive and easy to use. Here is a screen-shot of the home page.
The Documentation and Help Screens for administrators are thorough.
Synonym-handling is much more sophisticated than Ultraseek’s. Out of the box, GSA’s search results include “Narrow your search” terms that seem almost magical to me. Search “sarcoma” and it will suggest terms such as “kaposi sarcoma.” Even more wondrous, you can add taxonomies such as the ICD-10 to your search engine.
Results can be customized in many ways. Results can be grouped in various collections (e.g. a Medical Center collection as well as a University collection), plus webmasters of individual departments can add search boxes to their sites that are restricted to just their URL. For look-and-feel, the results are typically XML using style sheets. Again, you can have different style sheets for different units.
Google cachet. Google isn’t just familiar to your average web user. It’s the most trusted brand on the ’net, and that trust was earned by their search engine. When Vanderbilt users aren’t happy with the current search engine, they will often ask, why aren’t you using Google instead? If we get GSA, we will be, at least in their eyes. The vast majority won’t know or care that the formula we’re using is necessarily different.
That’s it for my analysis of Google Search Appliance. It’s been a blast getting to peek under the hood of a search engine — especially the progeny of the most popular and sophisticated search engine in the world. I certainly hope I can do more of this in the near future. Oh — and did I mention it’s actually — get this — cute? In person, it’s a cheese-like yellow box. I think I want one for me too, but starting at $30,360.95 methinks I can’t afford it. Here’s hoping Vanderbilt can.