Google Search Appliance: Under the Hood Pros and Cons

At the eleventh hour, I joined Vanderbilt’s task force reviewing search engines to use for the university. That was about a month ago, and I’m most grateful to have been included. It’s true I’m an iPhone fan-girl, an RSS evangelist, a Drupal enthusiast, etc., etc., but in the end what matters to me most are search engines. Once a librarian, always a librarian.

Vanderbilt’s search engine contract is up for renewal, and rather than rubber-stamping the current solution, the university opted to review four contenders: IBM, Google Search Appliance (GSA), Microsoft Search Engine and Ultraseek. IBM pulled itself out of the running a couple of weeks ago, and Microsoft was fraught with technical problems. I tried to look at it 10 days in a row and never managed to see it, not even once.

Thus the decision came down to Ultraseek, our current search engine, vs. Google Search Appliance. Aside from financial considerations, we based our analyses on three categories.

1. Technical. This encompassed things such as how difficult it was for the IT team to set up and what kind of support they will get if it crashes.

2. Administrative. From the perspective of those who administer the search engine (in particular our university webmaster), we looked at how intuitive the interface was, how much control it gave over the end results, and how it accommodated separate instantiations and templates for large divisions within Vanderbilt that need their own search engine.

3. End User. Most important of all, we assessed the effectiveness of the results. How likely would users be to find what they are looking for?

I don’t have administrative rights to Ultraseek, so I was only able to evaluate it as an end user. Because of this, I won’t review it in detail here, except to say that I have used it as a Vanderbilt webmaster for seven or so years, and have been surprisingly happy with it, particularly how it can be adapted for subsites. The search results, when tuned properly by the administrators, have been decent. I would give it a B overall from an end user perspective. When I search it in combination with classic Google using the “site:vanderbilt.edu” string, I can almost always find what I’m looking for. To help end users of the site I administer, I set up an advanced search page that easily allows them to do the same.
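For the curious, the “site:” trick above is easy to build into a link or a form. Here is a minimal sketch in Python of how an advanced search page might assemble such a URL. The function name and the default site are my own illustration, not the code behind the actual Vanderbilt page, and it assumes the familiar google.com/search?q= address.

```python
from urllib.parse import urlencode

def site_restricted_search_url(terms: str, site: str = "vanderbilt.edu") -> str:
    """Build a classic Google search URL limited to one site.

    Hypothetical helper: it simply appends the site: operator to the
    user's terms and URL-encodes the combined query.
    """
    query = f"{terms} site:{site}"
    return "https://www.google.com/search?" + urlencode({"q": query})

# Example: a search for "sarcoma" restricted to vanderbilt.edu pages.
print(site_restricted_search_url("sarcoma"))
```

The same helper works for a subsite by passing a different site value, for example site="vicc.org".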

Google Search Appliance: The Good and the Not-so-good – or – Even Google Isn’t Perfect

Google Search Appliance (GSA), on the other hand, I was allowed to administer, and thus the majority of this review is an analysis of GSA. It’s a very strong contender to replace Ultraseek.

Jumping to the conclusion and then working backwards, I gave GSA the lowest marks of anyone on the task force. There’s some irony here. I’m a huge Google fan. Google classic has been my search-engine-of-choice since it launched ten years ago. I started using it even before my other librarian friends did. Not only that, I’d been sprinkled by Google Bus fairy dust just this Thursday and was wearing their t-shirt while the task force was in deliberations. But weighing the pros and cons, in the end I judged it to be about equal with Ultraseek. My fellow task force members were notably more enthusiastic than I.

In many ways, it was difficult to compare the two. In particular, Google is more likely to continue improving in the near future, but this is hard to quantify. They’ve made many improvements in the last couple of years, and promise several more soon, giving it a very slight edge. All told I gave it the equivalent of a B+. Here’s a breakdown.

CON

Relevance Rankings. To my surprise, the default order of GSA’s search engine results seems random or worse. I tested using Vanderbilt-based search terms where I’m very familiar with the results on a variety of search engines, internal and external. As best I could tell the GSA results were ranked primarily by their domain names — using the order these domains were crawled. Thus searches for even a cancer term typically listed all vanderbilt.edu results before any vicc.org results. (NB: vicc.org is Vanderbilt’s cancer center, with the most authoritative cancer information at Vanderbilt for patients.)

De-duplication of Pages. Ultraseek natively handles duplicate pages better than GSA. GSA, for example, pulls two versions of the exact same page: the original plus a version with a “larger text” query string. Thus it will show both (1) www.vicc.org/dd/dz/results.php?id=34 and (2) www.vicc.org/dd/dz/results.php?q=textlarger&id=34, where Ultraseek automatically shows only the first. In theory a webmaster can control this with the robots.txt file, but I followed GSA’s instructions for doing this over four days ago, and either the crawler still hasn’t reindexed or the instructions were misleading, because I still see many of these duplicates.
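To make the duplicate problem concrete, here is a minimal sketch in Python of one way to collapse such URLs: strip the display-only parameter and keep the first copy of each resulting address. This is only an illustration under my own assumptions. I don’t know how Ultraseek or GSA actually de-duplicates, and the function names and the DISPLAY_ONLY set are hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: q=textlarger only changes the text size, not the content.
DISPLAY_ONLY = {("q", "textlarger")}

def canonical_url(url: str) -> str:
    """Return the URL with display-only query parameters removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if (key, value) not in DISPLAY_ONLY
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))

def dedupe(urls):
    """Keep one canonical URL per page, preserving first-seen order."""
    seen = set()
    unique = []
    for url in urls:
        key = canonical_url(url)
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique

results = [
    "http://www.vicc.org/dd/dz/results.php?id=34",
    "http://www.vicc.org/dd/dz/results.php?q=textlarger&id=34",
]
print(dedupe(results))
```

Both example URLs reduce to the first one, which is the behavior I would like to see in the result list.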

UNKNOWN

Speed. Assessing GSA was particularly challenging since our test was running on older, remote machines. This meant it was much slower. Not only were the results slower than they will be if we purchase GSA, but so was the reindexing. How fast will it be if we buy it? To get a sense, I went to other comparable institutions that have purchased GSA. Yale.edu is a good example. I searched cancer terms there, and they pull up quickly. However, this method only helped for search results. It’s impossible to tell how quickly their sites reindex. And this can be an important issue for a search engine administrator, since sometimes you have to get rid of particular results quickly. Having to wait a day, or heaven help us, four days, is simply not acceptable.

Metadata handling. Another big unknown is GSA’s upcoming improvements to relevance rankings. We were told the next version will allow the administrator to tune results based on metadata. If true, this will help the relevance ranking problems a great deal.

PRO

Number of pages indexed. In just a few weeks, the GSA crawler found 20 million files on Vanderbilt servers. Ultraseek crawls only 38,000 files. Presumably some of this reflects Ultraseek’s de-duplication, but the numbers are so far apart that GSA clearly has the potential to find much more. This will be particularly important if we deploy the search engine to our intranets.

Authentication, HIPAA and FERPA. To quote Stanford, another GSA user: “As FERPA and HIPAA regulations begin to have an effect on the availability of web content (requiring some pages to be access-restricted, for example), the campus search appliance can be authenticated to crawl and index where outside search engines cannot.” I’m not sure precisely what this means, but it sounds both significant and promising. From talking to our GSA rep, I believe it signifies that GSA will work well on things like our Medical Center intranets, assuming we have our websites’ authentication set up properly.

The Administrative Interface is intuitive and easy to use. Here is a screen-shot of the home page.

[Screenshot: Google Search Appliance home page]

The Documentation and Help Screens for administrators are thorough.

Synonym-handling is much more sophisticated than Ultraseek’s. Out of the box, GSA’s search results include “Narrow your search” terms that seem almost magical to me. Search “sarcoma” and it will suggest terms such as “kaposi sarcoma.” Even more wondrous, you can add taxonomies such as the ICD-10 to your search engine.

Results can be customized in many ways. Results can be grouped in various collections (e.g. a Medical Center collection as well as a University collection), plus webmasters of individual departments can add search boxes to their sites that are restricted to just their URL. For look-and-feel, the results are typically XML using style sheets. Again, you can have different style sheets for different units.

Google cachet. Google isn’t just familiar to your average web user. It’s the most trusted brand on the ‘net, and that trust was earned by their search engine. When Vanderbilt users aren’t happy with the current search engine, they will often ask, why aren’t you using Google instead? If we get GSA, we will be, at least in their eyes. The vast majority won’t know or care that the formula we’re using is necessarily different.

That’s it for my analysis of Google Search Appliance. It’s been a blast getting to peek under the hood of a search engine — especially the progeny of the most popular and sophisticated search engine in the world. I certainly hope I can do more of this in the near future. Oh — and did I mention it’s actually — get this — cute? In person, it’s a cheese-like yellow box. I think I want one for me too, but starting at $30,360.95 methinks I can’t afford it. Here’s hoping Vanderbilt can.

6 Responses

  1. Very interesting overview. We just concluded a similar comparison of Ultraseek, Inktomi and Google GSA for several Fortune-500 corporations. The results were similar to yours; however, we had an unfair advantage going for Google, where GSA was running coupled with the QUSA security and relevance enterprise appliance (www.queplix.com/products/universal-enterprise-search/). So the number of results produced by GSA was lower than with the get-it-all Inktomi, for example, because of default security, but the results were a LOT more relevant based on the searcher’s role within the company and access to various data stores in addition to files.

  2. Stephen — Most interesting! Thanks so much. I hadn’t heard of QUSA before, and will be sure to pass this along to our group.

  3. We have used Ultraseek since March 2000. For years I was pretty certain it was the best choice. Now there are a lot more options in the market such as FAST, so hard to say. I have not done any extensive testing but I view Ultraseek and Google as pretty much equivalent. Looking thru your list above you should download a free 30 day trial of Ultraseek and explore the admin interface. It looks like everything you mention is addressable in the admin interface of Ultraseek.

    For instance, authentication in GSA sounds the same as what has long been in Ultraseek. You provide a user ID and password within the Ultraseek admin interface and the search engine can spider your secure content.

    Relevance ranking tuning via meta tags has been around in Ultraseek for as long as I can remember.

    Incidentally, I don’t work for Ultraseek.

    You can go here and download it and experiment with the admin features for 30 days yourself while indexing 25000 documents.

    http://www.ultraseek.com

  4. Thanks, John. Great feedback. I suspect you are right about Ultraseek and Google Search Appliance being about even, except in one area, and that’s public perception. Google branding… wow….

  5. The public perception is interesting. I had a couple of folks talk to me about search, and both used Google. I said, show me what you do. One did a search as an example using Google.com and said it saved time. I looked at the results and noted that his request for PDF files associated with a particular name left less than 50% of the results page with PDF files, and the additional string wasn’t even at the same organization. Then I did the same search in Ultraseek. All the PDF output focused on the additional subject. But his perception of Google was that he was saving time, while in reality it was not even close.

    It looks like if you want to improve search you would look at IDOL or FAST. Maybe Endeca or Dieselpoint.

    Going from Ultraseek to Google Appliance may be just spending $25k? a year to get what you already have.

    My guess is if you want to improve results make a big push on educating your web community on good titles, etc. Would have a dramatic impact.

    You should download Ultraseek and look thru the admin interface.