There's a lot of talk about new search engines and the promising technologies behind them. One technology that has recently been applied to Web search is natural language processing (NLP). NLP allows search engines such as Hakia and Powerset to return results based on a query's meaning, rather than relying on keyword distribution to identify relevant Web documents.
For just about every area of research, there exist documents online describing background information or techniques for accomplishing tasks in that domain. These documents are often referred to as white papers, provided their content is technical or research-oriented. The information held within white papers is essentially accessible only by humans, because machines are not able to read and comprehend text the way humans can. If machines were able to read white papers and extract information the way humans can, we would be able to store each fact and piece of knowledge from the documents. This method of indexing would facilitate much more detailed searches, allowing users to search by topic, theory, conclusion, methods, citations, references, etc.
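To make the idea concrete, here is a minimal sketch of the kind of field-based fact index the paragraph imagines. The class name, field names, and document ids are all hypothetical, chosen purely for illustration; a real system would sit on top of an actual fact-extraction pipeline.

```python
from collections import defaultdict

class FactIndex:
    """A toy index mapping field -> keyword -> document ids."""

    def __init__(self):
        self._index = defaultdict(lambda: defaultdict(set))

    def add_fact(self, doc_id, field, text):
        # Index every word of an extracted fact under its field
        # (e.g. "topic", "method", "conclusion").
        for word in text.lower().split():
            self._index[field][word].add(doc_id)

    def search(self, field, keyword):
        # Return ids of documents whose `field` mentions `keyword`.
        return sorted(self._index[field].get(keyword.lower(), set()))

index = FactIndex()
index.add_fact("paper-1", "method", "latent semantic analysis")
index.add_fact("paper-2", "method", "keyword frequency counting")
index.add_fact("paper-1", "conclusion", "semantic search outperforms keywords")

# Searching by field finds only papers that use the term in that role.
print(index.search("method", "semantic"))
```

The point of scoping keywords to fields is exactly what the paragraph describes: "semantic" as a *method* matches a different set of documents than "semantic" appearing anywhere in the text.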
I like to consider myself fair and balanced when speaking about most topics. To educate the uneducated and to balance things out a bit, I have compiled a list of 5 problems we will likely run into when we reach the Semantic Web. Each problem is a side-effect of advances in technology, rushes to fill new niches, or a combination of the two along with the desire to make a quick dollar.
Two ways the Semantic Web may come to fruition are the top-down and bottom-up approaches. In the bottom-up approach, we would start at the level of individual documents, embedding RDF into Web pages to supply user agents with metadata. We are already seeing this type of action being taken by bloggers and other content creators. In the top-down approach, we would instead use natural language processors to read existing Web documents and extract semantic metadata from them.
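As a rough illustration of the bottom-up approach, the snippet below expresses a blog post's metadata as RDF-style (subject, predicate, object) triples and serializes them in an N-Triples-like form that could be published alongside the page. The post URI and author name are made up; the property URIs are drawn from the Dublin Core vocabulary, but treat the whole example as a sketch rather than a complete RDF serializer.

```python
def to_ntriples(subject, properties):
    """Serialize a dict of literal-valued properties as N-Triples-style lines."""
    lines = []
    for predicate, value in sorted(properties.items()):
        # Each triple: <subject URI> <predicate URI> "literal value" .
        lines.append(f'<{subject}> <{predicate}> "{value}" .')
    return "\n".join(lines)

# Hypothetical metadata for a single blog post.
post = {
    "http://purl.org/dc/terms/creator": "Jane Blogger",
    "http://purl.org/dc/terms/title": "On the Semantic Web",
}
print(to_ntriples("http://example.com/post/1", post))
```

A user agent that understands triples like these no longer has to guess the author or title from the page's HTML; the metadata is supplied explicitly, which is the whole appeal of the bottom-up route.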
The journey from here to the Semantic Web is a long one. What we have on our hands with the current version of the Web are billions of documents totaling terabytes of data. This data is usually found within HTML pages composed mainly of non-validating markup and very little, if any, metadata.
While there are billions of documents on the Web that contain no metadata whatsoever, there is one shining star of hope: natural language processing. NLP can be used to sift through the "garbage" data and extract coherent statements about the information held within.
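A deliberately naive sketch of that extraction step is shown below: it pulls (subject, verb, object) statements out of simple declarative sentences. Real NLP systems use full parsers and far richer grammars; this pattern-based version handles only "X <verb> Y." sentences, and the verb list is an assumption made for the example.

```python
import re

# A tiny, hand-picked verb list; a real system would use a parser instead.
VERBS = ("is", "are", "contains", "describes")

def extract_statements(text):
    """Return (subject, verb, object) tuples for simple declarative sentences."""
    statements = []
    for sentence in re.split(r"[.!?]", text):
        words = sentence.strip().split()
        for i, word in enumerate(words):
            # Split the sentence at the first recognized verb.
            if word.lower() in VERBS and 0 < i < len(words) - 1:
                statements.append(
                    (" ".join(words[:i]), word.lower(), " ".join(words[i + 1:]))
                )
                break
    return statements

text = "The Web contains billions of documents. HTML is a markup language."
print(extract_statements(text))
```

Even this crude heuristic turns free text into triples that could feed the kind of fact store discussed earlier, which is why NLP is the bridge between today's metadata-poor Web and a semantic one.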