SEP 13th 2007

This entry is a response to "I will never support the Semantic Web" by Brian of d'bug.

I'm getting tired of reading about how the Semantic Web is some kind of pipe dream that will never be realized. The Semantic Web is completely and entirely within our technological reach. People may have been given the impression that we cannot create the Semantic Web because of its complexity, the number of years it has been in development, or even the unanswered questions that still exist for certain problems we will face. These are valid reasons to doubt our progress, but progress is certainly what we are making.

I've read Brian's blog before and find most of what he writes interesting. However, I'm singling out his post because I feel it may be particularly damaging to the credibility of the Semantic Web among users who don't know enough about it to form an educated opinion. I am also calling him out because he said that he is not a pessimist, but rather a realist who has "done the research" in order to come to his conclusions.

"Supposedly, the assemblage and delivery of this information can be accomplished when responsible bodies not unlike the W3C, can decide on a semantic language."

The Semantic Web relies on more than the semantic language (RDF) created by the W3C. RDF is undoubtedly the core of the Semantic Web, but merely deciding on a language does not bring us to the Semantic Web, nor does throwing all of our information into RDF statements and hoping it will all just come together and "work."
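To make "RDF statements" concrete, here is a minimal sketch in plain Python rather than a real RDF library. Each statement is a (subject, predicate, object) triple; the `ex:` identifiers and the `match` helper are made up for illustration and are not part of any W3C standard.

```python
# A tiny in-memory stand-in for an RDF graph: each statement is a
# (subject, predicate, object) triple. All identifiers are hypothetical.
triples = {
    ("ex:JamesSimmons", "ex:authorOf", "ex:SemanticFocus"),
    ("ex:SemanticFocus", "ex:topic", "ex:SemanticWeb"),
    ("ex:Brian", "ex:authorOf", "ex:dbug"),
}


def match(graph, s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return sorted(
        t for t in graph
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


# Who is stated to be the author of something?
authored = match(triples, p="ex:authorOf")
print(authored)
```

The query only works because both authorship statements happen to use the same predicate. Agreeing on that shared vocabulary is the part that merely "throwing information into RDF" does not solve.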

"The Semantic Web will be built to compliment, and eventually replace some current languages and technologies."

Complement, yes. However, the new languages are not meant to replace existing languages. The Semantic Web is in many ways a layer atop the current Web, including in its reliance on the standards we are already using.

"Developers creating Web sites will be responsible for implementing the proper semantic tags, user agents will interpret these tags, infer that certain relationships exist, and provide a set of results."

Yes and no. This is a dangerous statement to throw out there without providing the arguments against that. Does your average blog owner (or any Website owner for that matter) know the inner workings of their content management system? Most do not. In fact, most people do not know basic HTML. It's true that the Webmaster will be responsible for "semantifying" his or her Website, but we shouldn't assume that the content management system would not take care of this. As I've previously suggested, content management systems will help usher in the Semantic Web by doing the legwork for the average site owner.

People are always going to be people. It doesn't matter if we have a Semantic Web that relies on some degree of knowledge engineering (for advanced users) or we have the current Web which relies on the HTML/CSS/JS stack. You don't have to have a degree in Computer Science to have a node in the Web, and that cannot change when transitioning between the current Web and the Semantic Web. Part of the reason the Web was successful is its low barrier to entry, so we must preserve that.

"It sounds rather nice. Search engines will no longer be necessary, and brilliant personal user agents could even replace browsers altogether."

By no means will search engines no longer be necessary. Search engines will likely always have a place on the Web. Information retrieval is not going to be abolished by the Semantic Web. Don't forget, while the Semantic Web is a Web of interconnected data, not all data is intended to be marked up in triple form. We will still have Websites, and they will still have normal content built by today's standards. This is an often overlooked fact about the Semantic Web.

Why would intelligent user agents replace Web browsers? It almost sounds as though his vision of the Semantic Web is a Web without Webpages.

His first argument is that the Semantic Web will be unable to identify relationships from natural language.

"In other words, there is more than one way to describe something. Aside from strict rules that govern some data formats, like a phone number or mailing address, a description can vary significantly."

Natural language processing and the Semantic Web are two totally different fields of study, and while I have always believed that NLP will help usher in the Semantic Web in many ways (by providing a top-down approach) it has nothing to do with the Semantic Web itself.

He goes on to say that he is considered to be of average size in his area (Houston, Texas) but in some other part of the world (the Philippines was his example) he would be considered a large individual. Interpretation based upon experience and environment will naturally be an issue, as it is an issue today. The difference is that with standards like RDF and OWL you can map one ontological meaning to another, and you can map the Houston "average" to the Philippine "large." User agents of the Semantic Web will be able to apply such mappings for people of various locales to give them the correct perspective.
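As a rough illustration of the Houston-to-Philippines mapping described above, here is a small Python sketch. The mapping table and the `translate_label` function are hypothetical; a real system would state such correspondences in OWL or RDF rather than in a Python dictionary.

```python
# Hypothetical agreed-upon mappings between two locales' size vocabularies.
# Key: (source locale, label, target locale) -> label in the target locale.
SIZE_MAPPINGS = {
    ("houston", "average", "philippines"): "large",
    ("philippines", "large", "houston"): "average",
}


def translate_label(label, source, target):
    """Translate a subjective label from one locale's vocabulary to another's."""
    if source == target:
        return label
    # Fall back to the original label when no mapping has been agreed on.
    return SIZE_MAPPINGS.get((source, label, target), label)


print(translate_label("average", "houston", "philippines"))
```

A user agent acting for a reader in the Philippines would apply the mapping before presenting the description, so the reader sees the label that is correct from their own perspective.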

"The Semantic Web will not bring us any closer to this reality because machines are unable to interpret data that is subjective unless complex algorithms are built around it. These are algorithms that already exist today, and are built by humans to search the Web as it exists."

Machines will interpret the data as it is written. If something is labeled as large, then it is large within the context in which it's being read. Part of the purpose of the new standards is to establish correct context for information and avoid unnecessary ambiguity. If you establish an ontological agreement with another node on the Semantic Web then you have taken into account the differences and can map meanings accordingly. This sounds like a long and arduous process, but much of the mapping will be handled by "the crowds" working together much in the way Web 2.0 has brought more power to data by allowing the community to interact with it.

His second and third arguments are that the Semantic Web will be just as susceptible to information manipulation as the current Web, and will fail miserably at identifying trusted sources. This is, in my opinion, probably the only valid argument in his entire post. Much like today, there will be spam and made-for-AdSense sites, and new problems will arise such as information poisoning (passing bad or malicious semantic data off as good semantic data).

"Even if the Semantic Web could manage the inferred relationships between two sources of information, who is to say that the content presented is accurate? It may be formatted properly, and communicate details and specifications based upon recognized standards, but what "weight" will it be given? Google has a PageRank, and Alexa an Alexa Rating, just to name two. What will the Semantic Web use to disseminate and aggregate data when almost identical relationships are available for consideration?"

PageRank and Alexa Rating are not systems for establishing the trustworthiness of information or its accuracy. New systems must be devised in order to accommodate this. My theory of how the trust layer will work for the Semantic Web is that we will have multiple ranking systems, or multiple layers within one ranking system to accommodate the need to find trusted sources of information. Here are just a few things I think we will need to take into consideration with the trust ranking system:

  • Is this source legitimately bringing information to the table and not just spamming?
  • What neighborhood does this source of information live in?
  • Do a lot of people cite this source as useful?
  • Is the source linked to or does it link to bad neighborhoods?
  • Does this source have authority in its domain?
  • Does the source contain duplicated content from other sources?

If we have a global knowledge database established (likely proprietary, but not necessarily) we can do fact checking and validation as well to ensure that the semantic data is not misleading based on what is already known about certain topics. The ranking and trust systems we see today are not part of the Web itself, but a mechanism created for the Web to keep it sane. In many ways this is how the Semantic Web will work as well.
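The considerations listed above could be combined into a single score in many ways. The following Python sketch shows one possibility only: the signal names, the 0-to-1 scales, and the weights are all assumptions made for illustration, not a description of any existing ranking system.

```python
from dataclasses import dataclass


@dataclass
class SourceSignals:
    """Hypothetical per-source signals, each normalized to the range 0..1."""
    original_content: float   # share of content not duplicated from elsewhere
    citations: float          # how widely other sources cite this one as useful
    domain_authority: float   # authority within the source's own domain
    bad_neighbors: float      # share of links to or from bad neighborhoods
    is_spam: bool             # True if the source contributes nothing but spam


def trust_score(s: SourceSignals) -> float:
    """Combine the signals into one score between 0 and 1 (weights are assumed)."""
    if s.is_spam:
        return 0.0
    positive = (
        0.3 * s.original_content
        + 0.3 * s.citations
        + 0.4 * s.domain_authority
    )
    # Links into bad neighborhoods scale the score down proportionally.
    return positive * (1.0 - s.bad_neighbors)


reputable = SourceSignals(0.9, 0.8, 0.7, 0.1, False)
shady = SourceSignals(0.2, 0.1, 0.1, 0.8, False)
spammer = SourceSignals(0.9, 0.9, 0.9, 0.0, True)

print(trust_score(reputable) > trust_score(shady) > trust_score(spammer))
```

Like PageRank today, a score of this kind would live outside the Semantic Web's core standards, as a mechanism built on top of them.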

He goes on to give an example of a shopper who wants to buy a pair of jeans and gives us this information about what the shopper wants:

  • Desired product: Jeans
  • Price range: $25-$50
  • Retailer's physical location: No more than 15 miles from the shopper
  • In-store pickup: Required

These requirements are easily met if the information about each retailer in the shopper's area (and the products they carry) is stored with RDF. This is usually called parametric search. From a data standpoint this kind of narrowing-down is simple to accomplish and can be done without semantic technologies; his issues, however, are with the trustworthiness of the source.
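Here is a minimal sketch of that parametric search in plain Python, using the shopper's four requirements as filter parameters. The retailers, prices, and distances are invented for illustration, and the data is held in ordinary dictionaries rather than RDF.

```python
# Made-up offers; in practice these would come from retailers' published data.
OFFERS = [
    {"retailer": "Denim Depot", "product": "jeans", "price": 39.99, "miles": 4.2, "in_store_pickup": True},
    {"retailer": "Blue Thread", "product": "jeans", "price": 64.00, "miles": 2.0, "in_store_pickup": True},
    {"retailer": "Far Outfitters", "product": "jeans", "price": 29.50, "miles": 31.0, "in_store_pickup": True},
    {"retailer": "Web Only Wear", "product": "jeans", "price": 27.00, "miles": 9.5, "in_store_pickup": False},
    {"retailer": "Main Street Shoes", "product": "boots", "price": 45.00, "miles": 1.1, "in_store_pickup": True},
]


def parametric_search(offers, product, min_price, max_price, max_miles, pickup_required):
    """Keep only the offers that satisfy every parameter."""
    return [
        o for o in offers
        if o["product"] == product
        and min_price <= o["price"] <= max_price
        and o["miles"] <= max_miles
        and (o["in_store_pickup"] or not pickup_required)
    ]


matches = parametric_search(OFFERS, "jeans", 25, 50, 15, pickup_required=True)
print([o["retailer"] for o in matches])
```

Nothing in this filter says whether an offer is honest. It narrows the data exactly as stated, which is why the question of trust has to be handled separately.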

"Theoretically, a Semantic Web accompanied by a user agent will be able to assist her with this seemingly precise purchase. That is, until she discovers not all merchants are actually selling the item she wants. They are advertising products that competitors have in stock, but do not inform her that she must special order the item."

This example, like most, has no context. What is the starting point of this search? Is it starting from a node such as Yahoo Shopping where all merchants are verified? If we wish to freely query the Semantic Web, we can expect to find information that cannot be trusted. The Semantic Web does not inherently try to reason about the trustworthiness of information; it is the job of the trust layer (in whatever manifestation) to allow some information to pass through while blocking bad information. As technology advances we will develop more sophisticated systems of establishing the trust rank we require.
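One simple way to picture a trust layer is as a gate between raw query results and the user. The Python sketch below assumes each result names its source and that some separate system has already assigned trust scores; the function name, the threshold, and the example domains are all hypothetical.

```python
def trust_layer(results, trust_scores, threshold=0.5):
    """Pass through only the results whose source meets the trust threshold."""
    # Sources with no known score are treated as untrusted (score 0.0).
    return [
        r for r in results
        if trust_scores.get(r["source"], 0.0) >= threshold
    ]


results = [
    {"source": "verified-shop.example", "item": "jeans"},
    {"source": "bait-and-switch.example", "item": "jeans"},
    {"source": "unknown.example", "item": "jeans"},
]
scores = {"verified-shop.example": 0.9, "bait-and-switch.example": 0.1}

trusted = trust_layer(results, scores)
print([r["source"] for r in trusted])
```

Starting a search from a node where all merchants are verified amounts to running the query behind a gate like this one.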

"Presumably, in the Semantic Web, she will be able to use semantics to tag Web sites as untrusted. When her friend (with whom she has a semantic relationship), decides to purchase an identical pair of jeans, he will be assisted with this additional information. Unfortunately, in her frustration, she has mistakenly tagged several of her friend's favorite retailers as untrusted."

Information about the trustworthiness of a source cannot be based solely on people's opinions and labels. We have learned from the past that people alone cannot be a trusted source of information about trusted sources. We learned this by watching search engines evolve their algorithms to rely less and less on a Webmaster's input about his own site and more and more on its connections and relationships to other sites of authority.

Google cares very little about your meta tags because the person benefiting from them (you) can put anything you want in them to inflate your own popularity. This makes meta tags an untrustworthy source for ranking pages by search engines, and the same logic will be applied on the Semantic Web to Joe User's input about my favorite retailer.

Finally, there is one more thing I want to bring up:

"The Semantic Web only serves to diminish the factors that make us human, and it will characterize our uniqueness through a series of predefined tags. The current Web offers us the ability to express ourselves in an unbound context, and the interpretations of those expressions can never be duplicated by a computer."

The "unbound context" he speaks of is called a lack of structure. That is not a pro, it's a con. We can characterize our uniqueness any way we wish, because we can create any vocabulary we want in order to express anything we can think of. The Semantic Web is not about using one vocabulary or one ontology, but about the cooperation of nodes to establish ontological agreements. If you feel like you can't express yourself in the Semantic Web, you are given the tools to change that.

About the author

James Simmons

It's my goal to help bring about the Semantic Web. I also like to explore related topics like natural language processing, information retrieval, and web evolution. I'm the primary author of Semantic Focus and I'm currently working on several Semantic Web projects.
