JAN 9th 2007

The journey from now to the Semantic Web is a long one. What we have on our hands with the current version of the Web are billions of documents totaling terabytes of data. This data is usually found within HTML pages composed mainly of non-validating markup and very little, if any, metadata.

While there are billions of documents on the Web that contain no metadata whatsoever, there is one shining star of hope: Natural Language Processing. NLP can be used to sift through the "garbage" data and extract coherent statements about the information held within.

Requirements for practical use

To be of practical use, natural language processing for the Semantic Web would require more than the ability to extract the parts of speech. The processor would need to determine the context in which words are being used, which helps determine the meaning.
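To make "context determines meaning" a little more concrete, here is a minimal sketch of picking a word sense by counting how many cue words for each sense appear in the surrounding sentence. It is a crude overlap heuristic, loosely in the spirit of the Lesk algorithm, and nowhere near a practical disambiguator. The SENSES table, its cue words, and the disambiguate function are all hypothetical and exist only for this illustration.

```python
# Hypothetical mini sense inventory: each sense of a word is paired with a
# hand-picked set of cue words. A real system would draw on a lexical resource.
SENSES = {
    "bank": {
        "financial institution": {"money", "loan", "account", "deposit", "interest"},
        "river edge": {"river", "water", "shore", "fishing", "mud"},
    }
}


def disambiguate(word, sentence):
    """Return the sense of `word` whose cue words overlap most with `sentence`."""
    # Lowercase each token and strip surrounding punctuation.
    context = {token.strip(".,;:!?\"'").lower() for token in sentence.split()}
    senses = SENSES[word]
    # On a tie, max() keeps the first sense listed for the word.
    return max(senses, key=lambda sense: len(senses[sense] & context))


print(disambiguate("bank", "She opened an account at the bank to deposit money."))
print(disambiguate("bank", "They went fishing on the bank of the river."))
```

The first sentence shares "account", "deposit" and "money" with the financial sense, while the second shares "fishing" and "river" with the river sense, so the same word resolves to two different meanings.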

In the context of Web documents, it may also be necessary for next-generation NLP agents to determine context based on where the information is located and how that information is represented in markup.

Semantic markup and Natural Language Processing

The meaning behind data can be quietly conveyed to NLP agents simply by using appropriate semantic markup. A definition list used to represent a list of terms may hint to the agent that there is a direct key/value relationship between the items expressed in the list.
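As a rough illustration of the definition list hint, the sketch below reads dt/dd elements as key/value pairs using Python's standard html.parser module. The DefinitionListParser class, the extract_pairs helper, and the sample markup are assumptions made for this example, and the pairing rule (each dd belongs to the most recent dt) is only one possible reading of such a list.

```python
from html.parser import HTMLParser


class DefinitionListParser(HTMLParser):
    """Collects (term, definition) pairs from <dt>/<dd> elements."""

    def __init__(self):
        super().__init__()
        self.pairs = []           # extracted (term, definition) tuples
        self._current_tag = None  # "dt" or "dd" while inside one, else None
        self._term = None         # text of the most recent <dt>
        self._buffer = []         # text fragments of the open element

    def _flush(self):
        """Close the open <dt>/<dd>, if any, and record its text."""
        if self._current_tag is None:
            return
        text = " ".join("".join(self._buffer).split())
        if self._current_tag == "dt":
            self._term = text
        elif self._term is not None:
            self.pairs.append((self._term, text))
        self._current_tag = None
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in ("dt", "dd"):
            # End tags are often omitted in real markup, so a new
            # <dt>/<dd> implicitly closes the previous one.
            self._flush()
            self._current_tag = tag

    def handle_data(self, data):
        if self._current_tag is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag in ("dt", "dd", "dl"):
            self._flush()


def extract_pairs(html):
    parser = DefinitionListParser()
    parser.feed(html)
    parser.close()
    return parser.pairs


SAMPLE = """
<dl>
  <dt>RDF</dt><dd>Resource Description Framework</dd>
  <dt>NLP</dt><dd>Natural Language Processing</dd>
</dl>
"""

print(extract_pairs(SAMPLE))
```

The agent never has to understand the words here; the markup alone tells it which piece of text is the key and which is the value.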

Taking it further

There are vast repositories of information in textual form on the Web that would make prime candidates for NLP tests. Wikipedia is a good example of a Website with a lot of raw information that can be transformed into data we can use. An NLP agent could conceivably crawl Wikipedia, parse the wikitext to extract the information, and then populate a database with triples from what it "read."
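The sketch below covers only the last step of that pipeline: turning extracted facts into subject/predicate/object triples and storing them. It does no language processing and no crawling. As a stand-in, it reads infobox-style "| key = value" lines from a hard-coded, made-up wikitext snippet and writes the results to an in-memory SQLite database. The page title, the snippet, the function names, and the table layout are all assumptions for illustration.

```python
import re
import sqlite3

# Hypothetical page, hard-coded so the sketch runs without network access.
SAMPLE_TITLE = "Example City"
SAMPLE_WIKITEXT = """
{{Infobox settlement
| name       = Example City
| country    = [[Exampleland]]
| population = 12345
}}
'''Example City''' is a city in [[Exampleland]].
"""

# Matches [[Target]] and [[Target|label]], capturing the visible text.
LINK = re.compile(r"\[\[(?:[^|\]]*\|)?([^\]]+)\]\]")


def wikitext_to_triples(title, wikitext):
    """Turn '| key = value' lines into (subject, predicate, object) triples."""
    triples = []
    for line in wikitext.splitlines():
        line = line.strip()
        if not line.startswith("|") or "=" not in line:
            continue  # skip prose and template delimiters
        key, _, value = line[1:].partition("=")
        key = key.strip()
        value = LINK.sub(r"\1", value).strip()  # drop wiki link brackets
        if key and value:
            triples.append((title, key, value))
    return triples


def store_triples(conn, triples):
    """Insert triples into a simple three-column table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS triples "
        "(subject TEXT, predicate TEXT, object TEXT)"
    )
    conn.executemany("INSERT INTO triples VALUES (?, ?, ?)", triples)
    conn.commit()


conn = sqlite3.connect(":memory:")
store_triples(conn, wikitext_to_triples(SAMPLE_TITLE, SAMPLE_WIKITEXT))
for row in conn.execute("SELECT subject, predicate, object FROM triples"):
    print(row)
```

The prose sentence at the end of the snippet is skipped entirely, and that sentence is exactly the kind of text where an NLP agent would have to do the real work.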

In conclusion

We have a long way to go before we can transform today's Web into the Semantic Web. However, advancements in Natural Language Processing will bring significant progress toward meeting that goal.

James Simmons


