NOV 16th 2007

Image credit: Node Gardens

To begin with, there is a very simple idea: Websites should themselves indicate their changes to the search engines. As I've already touched upon in the previous part of this series, search engines currently take the reverse approach, which consists of crawling the Web constantly, looking for the slightest modification. Don't you think it's silly? Think about the number of Web pages to visit, and imagine the cost of keeping the interval between visits as short as possible. Consequently, it seems difficult to consider the development of new search engines today. Nevertheless, the advent of the Semantic Web should lead to their multiplication, in a vertical way, as search engines become more and more specialized in specific fields.

Crawling seems to be the "boring" part for the search engines, and if they want to distinguish themselves from each other, it won't be through crawling. The innovation should be in the indexing, ranking, etc. But can we imagine that some day search engines like Google or Yahoo might agree to pool their crawling? Surprisingly, I think so.

But before we go further, we must know what we're talking about. All in all, crawling consists of making a sort of "backup" of the World Wide Web. Somehow, generalist search engines need to own a full copy of the Web. Therefore they need to scan and scan again to get the freshest copy. Worse, vertical search engines have the same problem: even though their favorite domains are limited to a few topics, they still have to examine the whole Web, since the information they are interested in could be anywhere. At least that is the present approach.

How can we improve things? First and foremost, I'm more and more convinced we need to reverse the process. Search engines should not question Websites; Websites should inform search engines of any changes. Today a Web developer has to set up a robots.txt file (or Sitemaps) to ensure the best possible indexing. Tomorrow he will add a mechanism able to inform search engines about any changes that have occurred in his database. It shouldn't be a problem for modern Websites based on the MVC (Model-View-Controller) paradigm: they'd just have to add a "plugin" at the model level. To sum things up, this plugin will be in charge of alerting search engines by reporting any "Create", "Update" or "Delete" action.
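As a rough illustration of the "plugin at the model level" idea, here is a minimal Python sketch. Everything in it is hypothetical: the names `ChangeEvent`, `ChangeNotifier` and `ArticleModel` are invented for the example, and the in-memory dictionary merely stands in for a real database. It only shows how a model layer could report "create", "update" and "delete" actions to whoever is listening.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class ChangeEvent:
    """One change report (hypothetical format, not a real protocol)."""
    action: str  # "create", "update" or "delete"
    url: str


class ChangeNotifier:
    """Forwards every change event to the registered listeners."""

    def __init__(self) -> None:
        self._listeners: List[Callable[[ChangeEvent], None]] = []

    def subscribe(self, listener: Callable[[ChangeEvent], None]) -> None:
        self._listeners.append(listener)

    def notify(self, event: ChangeEvent) -> None:
        for listener in self._listeners:
            listener(event)


class ArticleModel:
    """Toy model layer: a dict stands in for the site's database."""

    def __init__(self, base_url: str, notifier: ChangeNotifier) -> None:
        self._base_url = base_url.rstrip("/")
        self._notifier = notifier
        self._rows: Dict[str, str] = {}

    def save(self, slug: str, body: str) -> None:
        # Decide between create and update before touching the store.
        action = "update" if slug in self._rows else "create"
        self._rows[slug] = body
        self._emit(action, slug)

    def delete(self, slug: str) -> None:
        del self._rows[slug]
        self._emit("delete", slug)

    def _emit(self, action: str, slug: str) -> None:
        self._notifier.notify(ChangeEvent(action, f"{self._base_url}/{slug}"))


if __name__ == "__main__":
    notifier = ChangeNotifier()
    notifier.subscribe(print)  # a real listener would contact a search engine
    articles = ArticleModel("http://example.org", notifier)
    articles.save("datahubs", "first draft")
    articles.save("datahubs", "second draft")
    articles.delete("datahubs")
```

In a real site the listener would send the event over the network instead of printing it; the point is only that the hook lives next to the data, so no change can be missed.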

In the end, this reversal should enable the development of real-time search engines. Imagine that as you enter your keywords, results appear as the Web changes! By the way, if Websites have to contact search engines, which ones are they going to pick? Are they going to restrict themselves to certain ones? Of course not. Websites should be able to spread their modifications to as many search engines as possible, from the most important to the most specialized, without even having to know them beforehand.

How could this be? I'm thinking about some kind of relays diffusing the "modifications feed" as widely as possible. Let's call them "datahubs" if you want. Datahubs will be linked to each other in a completely decentralized way, and if a Website sends information to a certain datahub, every other datahub will receive the exact same information through a cascade process. Put another way, if I want to create my own datahub, I will only need to connect to one other datahub somewhere to receive all the changes happening everywhere on the Web. For its part, my datahub will be able to spread the incoming data to other datahubs. Of course, my datahub will need enough bandwidth to let all that information pass through.
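The cascade between datahubs can be sketched as a simple flooding scheme. This is only one possible reading of the idea, under assumptions of my own: each message carries a unique identifier, each hub remembers the identifiers it has already seen so that loops in the network do not cause endless re-sending, and the `Datahub` class and its methods are hypothetical names running in a single process rather than over a network.

```python
from typing import List, Set


class Datahub:
    """In-process stand-in for a relay node (hypothetical design)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.peers: List["Datahub"] = []
        self.seen: Set[str] = set()    # message ids already handled
        self.received: List[str] = []  # payloads delivered to this hub

    def connect(self, other: "Datahub") -> None:
        # Links are symmetric: each hub relays to the other.
        self.peers.append(other)
        other.peers.append(self)

    def publish(self, msg_id: str, payload: str) -> None:
        if msg_id in self.seen:
            return  # already relayed, stop the cascade here
        self.seen.add(msg_id)
        self.received.append(payload)
        for peer in self.peers:
            peer.publish(msg_id, payload)


if __name__ == "__main__":
    a, b, c, mine = Datahub("a"), Datahub("b"), Datahub("c"), Datahub("mine")
    a.connect(b)
    b.connect(c)
    c.connect(a)     # a loop between the three existing hubs
    mine.connect(c)  # my new hub only needs one connection
    a.publish("msg-1", "update http://example.org/datahubs")
    for hub in (a, b, c, mine):
        print(hub.name, hub.received)
```

A hub joining through a single connection still sees every message, which is the property described above. A real deployment would also have to deal with authentication, spam and hubs that go offline, none of which this sketch addresses.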

It would be interesting to evaluate the total bandwidth necessary to transfer all the Web modifications in real time, but I think we can estimate that it shouldn't be too high. In fact it should be pretty low, and if I had to guess and give an approximate number, I'd say that 1Gbps would be enough if we stick to textual data! Today we can find hosting companies able to provide this bandwidth for less than $100 a month. Try to figure out the total cost of the constant crawling done by Google, Yahoo and MSN (to name only the main ones) compared to the few dollars necessary to accomplish the same thing with the datahub idea.
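To put the 1Gbps guess in perspective, here is a small back-of-the-envelope calculation in Python. The average size of one change notification (10,000 bytes of text) is purely an assumption chosen for illustration, not a measured figure, and protocol overhead is ignored. The script simply converts a link speed into the number of changes it could carry.

```python
def changes_per_second(link_bits_per_second: float, avg_change_bytes: float) -> float:
    """How many change notifications fit through the link each second."""
    bytes_per_second = link_bits_per_second / 8
    return bytes_per_second / avg_change_bytes


LINK_BPS = 1_000_000_000   # the 1Gbps guess from the text
AVG_CHANGE_BYTES = 10_000  # ASSUMPTION: average textual change, illustration only
SECONDS_PER_DAY = 86_400

per_second = changes_per_second(LINK_BPS, AVG_CHANGE_BYTES)
per_day = per_second * SECONDS_PER_DAY

if __name__ == "__main__":
    print(f"{per_second:,.0f} changes per second")
    print(f"{per_day:,.0f} changes per day")
```

Under that assumption the link carries 12,500 changes per second, or about 1.08 billion per day. Whether this is enough depends entirely on how many pages really change each day and how large the changes are, so anyone with better numbers can plug them in.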

Nevertheless, crawling isn't everything. If we wish to create a true search engine, we are going to need to accumulate a very large mass of information in order to carry out basic operations such as parsing, indexing, ranking, etc. Consequently, if we want to create a new generalist search engine, we'll still need to think about a huge infrastructure. On the other hand, thanks to the datahub concept, it would actually be very cheap to make vertical search engines specialized in specific fields. Indeed, one of the main features would be that datahubs can declare the datatypes they wish to receive and spread.
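The datatype declaration could look something like the following Python sketch. It is a guess at one possible design, with hypothetical names throughout (`TypedDatahub`, the `accepts` parameter, the `"recipe"` and `"event"` labels): each hub declares the set of datatypes it wants, and silently drops everything else instead of storing or relaying it.

```python
from typing import List, Optional, Set, Tuple


class TypedDatahub:
    """Relay that only keeps and forwards the datatypes it declared."""

    def __init__(self, name: str, accepts: Optional[Set[str]] = None) -> None:
        self.name = name
        self.accepts = accepts  # None means "accept every datatype"
        self.peers: List["TypedDatahub"] = []
        self.seen: Set[str] = set()
        self.received: List[Tuple[str, str]] = []

    def connect(self, other: "TypedDatahub") -> None:
        self.peers.append(other)
        other.peers.append(self)

    def publish(self, msg_id: str, datatype: str, payload: str) -> None:
        if msg_id in self.seen:
            return  # already handled
        self.seen.add(msg_id)
        if self.accepts is not None and datatype not in self.accepts:
            return  # not a declared datatype: neither stored nor relayed
        self.received.append((datatype, payload))
        for peer in self.peers:
            peer.publish(msg_id, datatype, payload)


if __name__ == "__main__":
    general = TypedDatahub("general")                     # takes everything
    cooking = TypedDatahub("cooking", accepts={"recipe"})  # vertical hub
    general.connect(cooking)
    general.publish("m1", "recipe", "new tart recipe")
    general.publish("m2", "event", "concert announced")
    print(cooking.received)
```

One consequence of this design is that a filtering hub does not pass on what it rejects, so hubs reachable only through it will never see those datatypes. That is fine for a leaf serving a vertical search engine, but the backbone would presumably need hubs that accept everything.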

In the end, it appears to me that the necessity for datahubs is obvious, and the potential is so big that I can hardly imagine all the possible applications. But this idea is fairly new to me. I have only just started, and I barely know the actual state of research on the matter. Are there any people working on it already? Your feedback is welcome.
