In search of the deep Web

The next generation of Web search engines will do more than give you a longer list of search results. They will disrupt the information economy.

Published March 9, 2004 8:42PM (EST)

When Yahoo announced its Content Acquisition Program on March 2, press coverage zeroed in on its controversial paid inclusion program, whereby customers can pony up in exchange for enhanced search coverage and a vaunted "trusted feed" status. But lost amid the inevitable search-wars storyline was another, more intriguing development: the unlocking of the deep Web.

Those of us who place our faith in the Googlebot may be surprised to learn that the big search engines crawl less than 1 percent of the known Web. Beneath the surface layer of company sites, blogs and porn lies another, hidden Web. The "deep Web" is the great lode of databases, flight schedules, library catalogs, classified ads, patent filings, genetic research data and another 90-odd terabytes of data that never find their way onto a typical search results page.

Today, the deep Web remains invisible except when we engage in a focused transaction: searching a catalog, booking a flight, looking for a job. That's about to change. In addition to Yahoo, outfits like Google and IBM, along with a raft of startups, are developing new approaches for trawling the deep Web. And while their solutions differ, they are all pursuing the same goal: to expand the reach of search engines into our cultural, economic and civic lives.

As new search spiders penetrate the thickets of corporate databases, government documents and scholarly research databanks, they will not only help users retrieve better search results but also siphon transactions away from the organizations that traditionally mediate access to that data. As organizations commingle more of their data with the deep Web search engines, they are entering into a complex bargain, one they may not fully understand.

Case in point: In 1999, the CIA issued a revised edition of "The Chemical and Biological Warfare Threat." It's a public document, but you won't find it on Google. To find a copy, you need to know your way around the U.S. Government Printing Office catalog database.

The world's largest publisher, the U.S. federal government generates millions of documents every year: laws, economic forecasts, crop reports, press releases and milk pricing regulations. The government does maintain an ostensible government-wide search portal at FirstGov -- but it performs no better than Google at locating the CIA report. Other government branches maintain thousands of other publicly accessible search engines, from the Library of Congress catalog to the U.S. Federal Fish Finder.

"The U.S. Government Printing Office has the mandate of making the documents of the democracy available to everyone for free," says Tim Bray, CTO of Antarctica Systems. "But the poor guys have no control over the upstream data flow that lands in their laps." The result: a sprawling pastiche of databases, unevenly tagged, independently owned and operated, with none of it searchable in a single authoritative place.

If deep Web search engines can penetrate the sprawling mass of government output, they will give the electorate a powerful lens into the public record. And in a world where we can Google our Match.com dates, why shouldn't we expect that kind of visibility into our government?

When former Treasury Secretary Paul O'Neill gave reporter Ron Suskind 19,000 unclassified government files as background for the recently published "Price of Loyalty," Suskind decided to conduct "an experiment in transparency," scanning in some of the documents and posting them to his Web site. If it weren't for the work of Suskind (or at least his intern), Yahoo Search would never find Alan Greenspan's scathing 2002 comments about corporate-governance reform.

The CIA and Dick Cheney notwithstanding, there is no secret government conspiracy to hide public documents from view; it's largely a matter of bureaucratic inertia. Federal information technology organizations may not solve that problem anytime soon. The deep Web search engines may just solve it for them.

For almost as long as there has been a Web, there have been Web search engines. So one might reasonably ask why the deep Web has remained out of view for so long.

Traditionally, Web search engines have grown their databases through simple brute force. All the major search engines survey the Web by dispatching legions of simple programs known as spiders, crawlers, robots or harvesters to trace their way through the endless chains of hyperlinks that tie Web pages together.

That method works well for the static HTML pages and predictable URLs that make up the upper strata of the Web. But the deep Web resides mostly in databases, shielded by a lattice of registration gateways, session cookies and dynamically generated links. Unless an organization consciously chooses to share its data, by opening up an API or Web services feed -- the way Amazon books show up in a Google search -- then the data will likely remain unseen to most users.
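To see why, consider a minimal sketch -- in Python, with an illustrative seed URL and limits rather than any real engine's code -- of the brute-force crawl described above. It discovers only what plain hyperlinks expose; the forms and session cookies that guard the deep Web never come into play.

```python
# A minimal sketch of the brute-force crawl described above: start from a seed
# page, follow <a href> links breadth-first, and stop at anything that is not a
# plain hyperlink. Forms, cookies and dynamically generated URLs -- the doors
# into the deep Web -- are simply never opened. The seed URL and page limit are
# illustrative, not any real engine's code.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects the href targets of <a> tags; forms are ignored entirely."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def crawl(seed, max_pages=50):
    seen, queue, pages = {seed}, deque([seed]), {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except OSError:
            continue  # unreachable page; a real crawler would log and retry
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return pages


if __name__ == "__main__":
    index = crawl("https://example.com")  # illustrative seed
    print(f"fetched {len(index)} statically linked pages")
```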

New search engines now under development are exploring methods for penetrating the database barriers. BrightPlanet has developed a formula for brokering queries across multiple deep Web data sources at once, aggregating the results and letting users compare changes to those results over time -- a process known as "differencing."

That capability has attracted considerable interest from certain government agencies that shall remain nameless. "Some of our clients are spooky," says BrightPlanet COO Duncan Wittes. Other BrightPlanet customers include state governments, competitive intelligence researchers, and political campaigns whose "oppo" teams may want not only to search for what a candidate has said but also for what he or she may have "unsaid" over time.
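BrightPlanet won't share its formula, but the underlying idea of differencing can be illustrated with a toy example: snapshot the aggregated results of the same query at two points in time, then report what appeared, what disappeared and what was quietly revised. The sample records below are invented.

```python
# An illustration of "differencing": compare two snapshots of the same query's
# aggregated results and report what was added, removed or changed. This is a
# toy sketch of the idea, not BrightPlanet's actual formula; the data is invented.
def difference(old_results, new_results):
    """Compare two result snapshots keyed by URL; return added, removed, changed."""
    added = {u: t for u, t in new_results.items() if u not in old_results}
    removed = {u: t for u, t in old_results.items() if u not in new_results}
    changed = {
        u: (old_results[u], t)
        for u, t in new_results.items()
        if u in old_results and old_results[u] != t
    }
    return added, removed, changed


january = {
    "example.gov/speeches/14": "Remarks on trade policy",
    "example.gov/speeches/15": "Statement on the budget",
}
march = {
    "example.gov/speeches/15": "Statement on the budget (revised)",
    "example.gov/speeches/16": "Remarks on energy policy",
}

added, removed, changed = difference(january, march)
print("new:", list(added))        # what appeared since the last crawl
print("unsaid:", list(removed))   # what was taken down or retracted
print("edited:", list(changed))   # what was quietly revised
```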

Soon-to-launch Dipsie is pursuing an alternative approach to unlocking the dynamic Web, by deploying a kind of souped-up spider that penetrates barriers like forms, drop-down lists, dynamically generated URLs and session cookies. Dipsie's spider works by emulating a "well-formed user" that, from the Web site's point of view, behaves just like a real flesh-and-mouse user, enabling the spider to cache the kind of data typically visible only to a human user.
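Dipsie hasn't published its spider's internals, but the "well-formed user" idea can be sketched roughly: rather than stopping at a search form, the spider fills it in and carries the site's session cookie from request to request, so the dynamically generated pages behind the form can be fetched and cached like any others. The form fields and URL here are hypothetical.

```python
# A rough sketch of the "well-formed user" idea: submit a search form and keep
# the site's session cookie, the way a browser would, so the result pages behind
# the form become fetchable. The form field names and URL are hypothetical.
from http.cookiejar import CookieJar
from urllib.parse import urlencode
from urllib.request import Request, build_opener, HTTPCookieProcessor

# carry session cookies across requests, as a real browser would
opener = build_opener(HTTPCookieProcessor(CookieJar()))


def query_catalog(base_url, term):
    """Submit a catalog search form and return the HTML of the results page."""
    form_data = urlencode({"q": term, "submit": "Search"}).encode("ascii")
    request = Request(base_url, data=form_data)  # POST, as a browser would
    with opener.open(request, timeout=10) as response:
        return response.read().decode("utf-8", "replace")


# A deep Web spider would iterate over many plausible terms (or enumerate a
# form's drop-down options) and cache each results page for indexing, e.g.:
# html = query_catalog("https://catalog.example.gov/search", "biological warfare")
```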

Other search developers, including IBM, Google and Intelliseek, are exploring their own approaches to mining the deep Web. But in the wake of this week's announcement, Yahoo is now the elephant in the living room.

Yahoo won't discuss the specifics of how its search algorithms work. But the company does acknowledge that its Content Acquisition Program will give paying customers a more direct pipeline into its search database. Yahoo Search vice president Tim Cadogan says, "Ultimately we want to search the whole Web for free," but he nonetheless sees the CAP program as a way of enabling "direct, structured relationships with content providers" to "deliver a higher-quality search experience for users."

It takes a fine ear for P.R. nuance to distinguish "higher-quality search experience" from "better results." Yahoo has issued copious disclaimers assuring non-paying customers that they will receive the same algorithmic treatment as paying ones. But the company acknowledges that paying customers will likely benefit from a "quality review" designed to help companies improve their chances of showing up in search results.

"Cadogan claims that people who send money can't count on getting better results," says Bray. "Do you believe that? I don't."

Every year, the University of California at Davis pays the publisher John Wiley about $14,000 for a subscription to the Journal of Comparative Neurology, which publishes breaking research in its field. That may sound like a steep price tag for what is essentially a magazine subscription, but it's a tiny dollop of the $20 million the U.C. libraries spend every year on scholarly journals.

Scientific, technical and medical publishing constitutes an $11 billion industry. And like the rest of the publishing business, scholarly publishers have undergone massive consolidation in the past two decades. Once the province of small university presses and boutique academic imprints, scholarly journals now emanate from giant publishing conglomerates such as Elsevier, Thomson and Blackwell.

"The well-established subscription model that evolved around print journals is a cash cow," says Peter Lyman, professor at the UC-Berkeley School of Information Management and Systems. "One that the publishers are terrified of damaging accidentally, through online publishing."

But unlike trade-book publishers, who count on Amazon and Barnes & Noble to move physical units of the latest Harry Potter tome, scholarly publishers rely increasingly on electronic journal subscriptions and paid search services to fuel their revenues. Their customers -- mostly academic institutions and research organizations -- insist on providing Web access to journal content. To meet that demand while protecting their valuable data stores, the large publishers have responded by rolling out private permission-based search gateways to the contents of their journals, usually under highly restrictive license terms and tightly managed IP access.

But those pricey journal databases now compete for the attention -- and the search queries -- of students and faculty with ready access to Google, Yahoo and the rest. And while the public search engines may not find every article in the journal literature, a growing portion of published research also finds its way out onto the Web.

For example, when gene researchers identify a new DNA sequence, they usually submit the sequence to the National Institutes of Health's GenBank -- a public deep Web resource -- before submitting it to journals for publication.

Legislation pending in Congress would ensure that all research funded by federal taxpayers be made available free of charge to the public, over the Internet. Meanwhile, new cooperative academic initiatives like the Public Library of Science and the National Science Digital Library are trying to expand access to scholarly research, opening up more indirect competition for the proprietary publishing systems.

And as more scholarship finds its way onto the Web, page-ranking algorithms are also providing an alternative quality rating system to the traditional scholarly peer review that journals have always employed.
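The page-ranking idea itself is simple enough to sketch: a document scores highly when other highly scored documents link to it, which is why a link (or citation) graph can stand in as a rough proxy for peer esteem. The toy citation graph below is invented, and the code is a simplified power iteration, not any engine's production algorithm.

```python
# A simplified, PageRank-style power iteration over a tiny invented citation
# graph: a paper ranks highly when other highly ranked papers point to it.
def pagerank(links, damping=0.85, iterations=50):
    """links maps each node to the list of nodes it points to."""
    nodes = list(links)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for source, targets in links.items():
            if not targets:
                continue  # dangling node: its outflow is ignored in this toy version
            share = damping * rank[source] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


citations = {
    "paper_a": ["paper_b", "paper_c"],
    "paper_b": ["paper_c"],
    "paper_c": ["paper_a"],
    "paper_d": ["paper_c"],
}
for paper, score in sorted(pagerank(citations).items(), key=lambda kv: -kv[1]):
    print(f"{paper}: {score:.3f}")
```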

While page ranking won't replace the scholarly review process anytime soon, the expansion of public Web search engines will put downward pressure on the premium that publishers can command. "I don't think [page ranking] is more reliable," says Lyman, "but I do think it's perceived as legitimate. The cost of creating formally quality-controlled information may drive people to consider lower-cost alternatives."

Lyman adds, "When the public begins to use and accept non-qualified information -- relying on Google or other things to perform that function, like Technorati -- there are beginning to be quality mechanisms out there that are user-centric or generated by users."

How will scholarly publishers react to the encroaching competition from deep Web search engines? "The publishing industry is not famous for being progressive, forward thinking or fast moving," Bray says. "But if they ignore [deep Web search], they could find themselves in a situation like the record companies, where someone finds a way to subvert them."

- - - - - - - - - - - -

The deep Web contains some 500 times more data than the surface Web; but to regard the deep Web as simply a bigger and better version of the current Web is to overlook the essential feature of databases, which is structure. Most of the deep Web is structured or semi-structured data, as opposed to the sea of flotsam HTML that bobs across the surface Web.

"Once you get into the deep Web, all of these data sources often have much more metadata available," says Bray. "This could be a huge opportunity for companies looking at new ways of presenting search results."

Deriving search results from structured data sets will open up new possibilities for search engines. In all likelihood, search engines will gradually abandon the flat listings-style result pattern you see on a typical 12-page Google result. (And who ever gets to the 12th page, anyway?) Not only could deep Web search engines present more useful and manipulable views into structured data but, given some basic lingua franca of structural vocabularies, they could also aggregate those results in endlessly permutable combinations.

"It's ridiculous to think that the one-dimensional result list is going to be the universal paradigm for all imaginable searches forever," Bray says. "If you type 'bicycle' into Google, you get a list of results having to do with bicycles. But that result is, in a very important way, a lie. It ignores the fact that some of these things are about bicycle racing, some are about bicycle manufacturing. It ignores things that Google might not even know about."

As deep Web search engines unearth the structures of large data sets and make those structures visible across organizations, they will create a powerful incentive for organizations to invest in more consistent, predictable structures (a trend already manifest in the growth of Web services and in Yahoo's search quality guidelines). In exchange for the benefits of increased exposure, these organizations will cede another level of autonomy.

While government and academic institutions may generate the greatest volume of deep Web content, corporations undoubtedly generate the most monetary value in Web data: customer databases, product catalogs, technical knowledge bases and myriad other data sources with quantifiable business value.

Over the last decade, companies have invested heavily in Web infrastructure, including countless local search engines. While many companies already outsource their public Web site search functions to companies like Google, many also have developed specialized search engines for their own deep Web data, like technical support databases.

Those investments make plenty of sense when that data won't readily show up in a public Web search. But as deep Web search engines penetrate these gateways, will companies continue to see the value of investing in their own public interfaces?

In the near term, deep Web search engines will likely dampen company expenditures on local search initiatives. But in the longer term, the changes may prove more far reaching. "The quality and ubiquity of Web search engines hides the fact that most organizations have really crappy search mechanisms," Bray says. "I think that's creating a tension within organizations."

As public search engines continue to supplant organizations' own information-retrieval systems -- be they search databases, call centers or sales engineers -- systems that once faced inward will assume increasingly outward-facing roles. "When the ability to develop different messages for different audiences is curtailed by universal availability," says Gartner analyst Whit Andrews, "the nature of the message, its format and associated issues become paramount."

No one expects IT departments to go out of business, but the external pressures of deep Web search will almost certainly force long-term changes in the role, structure and autonomy of local IT organizations as they gradually lose direct control over customer transactions.

- - - - - - - - - - - -

Every search query is a unit of desire. Search companies, like all businesses, exist by transforming desire into hard currency. As deep Web search engines insinuate themselves into deeper and deeper levels of organizations, they will not only offload search traffic but also trigger a series of massive disruptions in the information economy.

If you buy the Cluetrain maxim that "hyperlinks subvert hierarchy," then surely deep Web search engines will amplify that subversion. As search engines extend their reach deeper into and across organizations, the boundaries between those organizations will feel more fluid -- both to consumers and to the organizations themselves. The first thing most of us notice may be better search results.

Somewhere inside that complex apparatus of desire and fulfillment, a transformation is taking place, one whose effects we can barely foresee.

Editor's note: This story has been corrected since its original publication.


By Alex Wright
