In 2008, Google spidered its trillionth web page. That sounds impressive, but as LISNews, the Librarian And Information Science News site, recently pointed out, that figure represents only a tiny fraction of the information on the web. How so, you ask? Well, think of all those ecommerce databases, library catalogs, transport system fares and timetables… There are billions of pages that are only ever revealed to individual users when they access them with particular information requests. These pages are effectively invisible to search engine spiders and as such are known as the invisible web.
Search engines have been created that are capable of trawling catalogs by simulating user searches, but these only scratch the surface. Google itself recognizes the problem and has repeatedly announced efforts to reveal the invisible. But, as yet, there is no search engine that can answer a question such as, “What’s the best and most inexpensive way to get me from a hotel near Mornington Crescent tube station in London to Los Angeles International Airport with a stopover in New York City?”
Of course, such questions could have myriad answers and maybe there never will be a way to uncover enough of the invisible web without the intervention of expert human intermediaries, such as travel agents well versed in the London Underground and American Airlines flight paths and timetables.
However, you may have heard the notion of web 3.0 being bandied about during the last year or more. Web 1.0 was, of course, the static, flat web of hyperlinks and no interaction. Web 2.0 (ignoring the glossy mirrored logos and missing vowels [flickr etc]) is what we currently have. It’s the interactive web of comments on blogs, social bookmarking sites like del.icio.us, social networking sites such as LinkedIn and Facebook, microblogging (Plurk, Twitter, and the late Pownce), and all kinds of tools that converted the static flatland of HTML into the polished, dynamic web we all know and love(?) today.
Web 3.0 takes all this a step further, adding machine-readable meaning to the packets of information. It is thus known to the technically minded as the semantic web. Once it is manifest, the semantic web will take us to within a gnat’s whisker of that utopia in which you have the exact change for a trip from Mornington Crescent to LAX via JFK.
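The core idea behind that machine-readable meaning is simple enough to sketch: the semantic web expresses facts as subject–predicate–object triples that software can query. Here is a minimal, toy illustration in Python — the names and facts are invented for the example, and real semantic web systems use standardized formats rather than plain tuples:

```python
# Facts expressed as machine-readable (subject, predicate, object) triples.
# All names and values here are illustrative, not real data.
triples = [
    ("MorningtonCrescent", "isA", "TubeStation"),
    ("MorningtonCrescent", "locatedIn", "London"),
    ("LAX", "isA", "Airport"),
    ("LAX", "locatedIn", "LosAngeles"),
]

def query(subject=None, predicate=None, obj=None):
    """Return every triple matching the pattern (None acts as a wildcard)."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]

# Everything we know about Mornington Crescent:
print(query(subject="MorningtonCrescent"))
```

A program that can run queries like this over many sites’ data is, in miniature, what a semantic search engine would do when planning that London-to-LAX trip.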
Before we get there, though, there is the not-so-simple matter of embedding meaning within information sources. This concept brings us full circle to the early days of web design, when every tool stressed the importance of meta tags. Meta tags were meant to provide the fledgling search engines of the 1990s with the means to extract significance and context – meaning, in other words – from web pages.
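As a rough illustration of what those early spiders were doing, here is a minimal sketch in Python using the standard library’s HTML parser — the sample page and its contents are invented for the example:

```python
from html.parser import HTMLParser

# A toy spider's-eye view of a page: pull out the <meta> tags
# (keywords, description, author) that 1990s search engines relied on.
class MetaExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attrs = dict(attrs)
            if "name" in attrs and "content" in attrs:
                self.meta[attrs["name"]] = attrs["content"]

# An invented sample page, for illustration only.
page = """<html><head>
<meta name="keywords" content="semantic web, metadata, search">
<meta name="description" content="Notes on the invisible web.">
<meta name="author" content="A. Blogger">
</head><body>The page text itself.</body></html>"""

parser = MetaExtractor()
parser.feed(page)
print(parser.meta)
```

Everything a spider of that era “knew” about a page’s meaning could amount to little more than the dictionary this produces – which is exactly why the tags were so tempting to abuse.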
Almost as soon as the first spiders read those meta tags, which might include keywords, a description, the name of the page author, and more, the so-called black hats of the search engine optimization (SEO) world began to game the system. They would stuff their sites’ meta tags with keywords that may or may not have been related to the actual content of the site. The aim was to fool the search engines into ranking the site highly for particular keywords and so gain more traffic through this spammy technique than the site was naturally due.
Once the search engines recognized what was happening, they downgraded the weight given to meta tags in the algorithms they use to generate the search engine results pages (SERPs). As such, meta tags have fallen out of favor. They still carry weight in a few of the simpler and less well-known search engines, and they are often used to display key text in the SERPs. This means it is not only black hats who have abandoned meta tags to some degree: generalist webmasters often ignore their latent potency and simply do not include them in the pages they publish.
This could be a major blow to the emergence of the semantic web, the advent of web 3.0. Websites need their metadata; they need to be able to explain themselves to machines in an understandable way. Badawia Albassuny at the Department of Library and Information Science, King Abdulaziz University, Jeddah, Saudi Arabia, certainly recognizes this. She has recently surveyed the automatic metadata generation applications available on the web, with a view to raising awareness of the possibilities.
If you use WordPress or another blogging tool or content management system (CMS), you may have plugins installed that automatically add meta tags. If you use the Zemanta system and have customized your settings, you may also have noticed that it has a built-in system for adding semantics to links you include in your posts. I discussed Zemanta in a little more detail in a recent post entitled Free Blog Content.