FEB 27th 2007

I like to consider myself fair and balanced when speaking about most topics. To educate the uneducated and to balance things out a bit I have compiled a list of 5 problems we will likely run into when we reach the Semantic Web. Each problem is a side-effect of advances in technology, rushes to fill new niches, or the previous two plus the desire to make a quick dollar.

1. Reduced anonymity on the Web

Unless you're already taking active measures to keep yourself non-indexed, you may find that in the Semantic Web information about your identity, interests, and habits is trivial to discover. When you sign up for an account on sites like MySpace, Digg, Slashdot, etc., you are feeding them information about yourself, both during registration and afterward through your activities and contributions.

As the amount of available personal information increases, we could begin seeing Websites that rely on querying the "Web as a database" for information about their visitors for mission-critical functionality. If (once?) this change takes place, having personal information on the Web may become the comfortable norm. One day we may see a shift in the importance of anonymity. Openness and transparency may become the "in thing."

2. Increased invasion of privacy

This problem stems from the issue of reduced anonymity on the Semantic Web. A Web that exposes vast amounts of information about everyone has its drawbacks. One downside to having so much information easily accessible to anyone is that there will always be someone ready to abuse that information to make a quick dollar.

We may find ourselves in a new era of unwanted personalization. Contextual ads that examine a Website's content for hints of interests may be replaced with ads that target specific visitors based on their personal preferences, behaviors, lifestyle, friends, income, etc. In a similar way we will likely notice that e-commerce Websites will become better at figuring out just what it is we are going to want next.

Invasion of privacy brought about by the abuse of personal information — which would be more accessible than ever — will prove itself quite annoying. But we already have privacy issues now, don't we? If you've ever gotten a spam email then you know the answer is "yes," and there is so much more room for the problem to get worse before things will get better.

3. Intelligent content scraping

The content scrapers of today are really quite simple compared to what we will have to deal with in the Semantic Web. Essentially the scraper will access a Website or feed and extract and store the desired content. In most situations the content scraper must be customized or otherwise manually configured for the Website or feed (less so with feeds as they follow a standard format).

Content scrapers of the Semantic Web and beyond will be equipped with the ability to read the content within Web documents and feeds. Through natural language processing a semantic content scraper can read a blog entry (or several entries by different authors covering the same topic) and return a brand-new blog entry. The scraper would do so by extracting the facts and statements from the entries and regurgitating the information in another order or in entirely different wording.

The technology needed to do what I described above does not yet fully exist. However, the bottom-up approach to semantic content scraping would be to scrape metadata written in RDF / OWL. A "bottom-up" scraper would not be able to extract information from the content itself the way a top-down content scraper (using an NLP agent) could, but I expect to begin seeing bottom-up scrapers soon, if they haven't already appeared.
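As a rough illustration of the bottom-up approach, here is a minimal Python sketch that pulls the title, author, and subject out of an RDF/XML description using only the standard library. The sample document, the example.org URL, and the function name scrape_metadata are made up for illustration. The Dublin Core namespace is real, but a real scraper would have to cope with far messier metadata than this.

```python
import xml.etree.ElementTree as ET

# Hypothetical RDF/XML metadata a blog might publish alongside an entry.
SAMPLE_RDF = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://example.org/blog/example-entry">
    <dc:title>Example entry</dc:title>
    <dc:creator>Example Author</dc:creator>
    <dc:subject>Semantic Web</dc:subject>
  </rdf:Description>
</rdf:RDF>
"""

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"


def scrape_metadata(rdf_xml):
    """Return one dict per rdf:Description, keyed by Dublin Core element name."""
    root = ET.fromstring(rdf_xml)
    records = []
    for desc in root.iter("{%s}Description" % RDF_NS):
        # The rdf:about attribute names the resource being described.
        record = {"about": desc.get("{%s}about" % RDF_NS)}
        for child in desc:
            # Only keep properties from the Dublin Core namespace.
            if child.tag.startswith("{%s}" % DC_NS):
                name = child.tag.split("}", 1)[1]
                record[name] = (child.text or "").strip()
        records.append(record)
    return records


for record in scrape_metadata(SAMPLE_RDF):
    print(record)
```

Notice that the scraper never looks at the entry's prose. It only collects what the publisher has already stated in machine-readable form, which is exactly why this kind of scraping needs no per-site configuration.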

4. Value paradigm shifts

In "The value of current datasets in the Semantic Web" I suggested that the ability to easily mine new and non-obvious types of data from the Semantic Web will turn information into more of a commodity than any past advancement in technology. Mix that with how simple it will become to access any kind of information and we may find that information is no longer the bottleneck in our development.

Where do we draw the line between commoditized information and information that would be better served as non-commoditized? Does such a distinction matter? Will simply publishing content make it subject to commoditization? Most Websites earn money through visitors clicking on advertisements and generally attract those visitors with their content. If in the Semantic Web any document published is essentially merged into the big picture, will content publishers continue to try to earn money in this way?

Issues of commoditization are already springing up as we continue to explore the usage of feeds to deliver content to readers. Currently there are really only two solutions to the issue: embedding advertisements in feeds, and publishing partial content to encourage readers to click through to your Website and continue reading. It's possible that if we do not develop ways to generate revenue from commoditized content, we will never see the Semantic Web come to fruition, because it would receive little commercial backing.

5. Vocabulary incompatibilities

The vocabularies we use to classify information are the backbone of the new information frontier. I say this because with these vocabularies we will classify and apply meaning to otherwise meaningless data (meaningless to a machine, that is). One problem we're going to run into arises when two different people use two different vocabularies that happen to use the same terms with different meanings.

The problem with multiple vocabularies that contain the same terms but apply different meanings to them is that merging the information destroys the meaning the author intended. That said, it is bad to assume binary compatibility between the meanings expressed in vocabularies. There will be a great need for an open, unified vocabulary in the Semantic Web.
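To make the collision concrete, here is a small Python sketch under some loud assumptions: the two vocabularies, their example.org namespaces, and the sample facts are all hypothetical. It shows how a merge keyed on the bare term loses one author's meaning, while a merge keyed on the namespace-qualified term at least keeps the two statements apart.

```python
# Two hypothetical vocabularies that both define a term called "title".
BOOKS_NS = "http://example.org/vocab/books#"    # title = name of a work
PEOPLE_NS = "http://example.org/vocab/people#"  # title = honorific, e.g. "Dr."

# Statements as (subject, term, value) triples, one list per vocabulary.
book_facts = [("resource1", "title", "Moby-Dick")]
person_facts = [("resource1", "title", "Dr.")]


def merge_by_bare_term(*sources):
    """Naive merge: keys use the bare term, so later values overwrite earlier ones."""
    merged = {}
    for facts in sources:
        for subject, term, value in facts:
            merged[(subject, term)] = value
    return merged


def merge_by_qualified_term(*sources):
    """Merge that keys on namespace + term, so same-named terms stay distinct."""
    merged = {}
    for namespace, facts in sources:
        for subject, term, value in facts:
            merged[(subject, namespace + term)] = value
    return merged


naive = merge_by_bare_term(book_facts, person_facts)
qualified = merge_by_qualified_term(
    (BOOKS_NS, book_facts), (PEOPLE_NS, person_facts)
)

print(naive)      # one entry survives; the book title has been overwritten
print(qualified)  # both statements survive under different qualified terms
```

Qualifying the terms only prevents the accidental overwrite. It does not tell a machine how the two meanings relate to each other, which is the gap a shared vocabulary would have to fill.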

Wrapping it all up

Most of these issues exist today to a lesser extent, and I doubt any of them will prevent us from reaching the Semantic Web. After all, each issue comes from the development of new and innovative technologies altering the landscape of the Internet. We'll get through it.

