For just about every area of research there exist documents online describing background information or techniques for accomplishing a task in that domain. These documents are often referred to as white papers, provided their content has a technical or research orientation. The information held within white papers is essentially accessible only to humans, because machines are not able to read and comprehend text in the same way humans can. If machines were able to read white papers and extract information the way humans do, we would be able to store each fact and piece of knowledge from the documents. This method of indexing would facilitate much more detailed searches, allowing users to search by topic, theory, conclusion, methods, citations, references, etc.
Unfortunately, extracting the kind of information described above requires natural language processing (NLP), which is still in its infancy. This isn't to say that the beginnings of the technology don't already exist. As I outlined in "Future value paradigms of the Semantic Web," progress is being made toward bringing NLP to the mainstream, and I predict we will see more companies coming out with this kind of technology in the near future. In deployment today we're seeing part-of-speech taggers with higher-level analysis like noun phrase identification and named entity extraction. Both technologies are difficult to create, but more can be done.
A higher-level function that I believe will revolutionize the world is "property extraction." The process would scan a document, extract each statement about the subject, and store it in triple form (subject, predicate, object). Take the following sentence from the Wikipedia American Civil War page:
"The American Civil War (1861-1865) was a major war between the United States (the "Union") and eleven Southern slave states that declared their secession and formed the Confederate States of America, led by President Jefferson Davis."
From that short sentence we are able to determine the following things:
- The American Civil War was a war
- The American Civil War is considered major
- The American Civil War began in 1861
- The American Civil War ended in 1865
- The American Civil War involved the United States and eleven Southern slave states
- The United States was also known as the Union
- Eleven Southern slave states declared their secession
- Eleven Southern slave states formed the Confederate States of America
- President Jefferson Davis led the Confederate States of America
From the preceding example it is easy to see why natural language processing is an important technology. NLP would enable us to explore the information contained within documents of any kind, not simply white papers.
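To make "triple form" a little more concrete, here is a minimal sketch of how the statements from the Civil War sentence might be stored and searched once they have been extracted. It does not perform any natural language processing; the triples are written out by hand, and the predicate names (`is_a`, `began_in`, and so on), the `Triple` type, and the `query` helper are all hypothetical choices made for illustration.

```python
from typing import NamedTuple, Optional


class Triple(NamedTuple):
    """One extracted statement: subject, predicate, object."""
    subject: str
    predicate: str
    obj: str


WAR = "American Civil War"
STATES = "eleven Southern slave states"
CSA = "Confederate States of America"

# Hand-written triples; a real system would produce these automatically.
# The predicate vocabulary here is an assumption, not a standard.
TRIPLES = [
    Triple(WAR, "is_a", "war"),
    Triple(WAR, "considered", "major"),
    Triple(WAR, "began_in", "1861"),
    Triple(WAR, "ended_in", "1865"),
    Triple(WAR, "involved", "United States"),
    Triple(WAR, "involved", STATES),
    Triple("United States", "also_known_as", "Union"),
    Triple(STATES, "declared", "secession"),
    Triple(STATES, "formed", CSA),
    Triple(CSA, "led_by", "President Jefferson Davis"),
]


def query(triples, subject: Optional[str] = None,
          predicate: Optional[str] = None,
          obj: Optional[str] = None):
    """Return the triples that match every field that was supplied."""
    return [
        t for t in triples
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.obj == obj)
    ]


if __name__ == "__main__":
    # Who was involved in the war?
    for t in query(TRIPLES, subject=WAR, predicate="involved"):
        print(t.obj)
```

Once facts are stored this way, a search such as "everything that began in 1861" becomes a simple lookup on the predicate and object rather than a keyword match against the full text.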
There is a site that could be taking advantage of the number of documents in its index to experiment with this kind of information extraction, and I don't mean the obvious choice of Google. Scribd is a company that recently got some attention for receiving $300,000 in investment funding. It can be described as a YouTube for documents: a free online library where anyone can upload documents in a variety of formats, including .pdf, .doc, .ppt, .xls, .txt, and more. Scribd plays host to a growing community of users and a repository of documents, on which it performs simple statistics. Most importantly to its users, it lets you view your documents using the Flash .pdf viewer on the site. Scribd, in my opinion, would be an excellent candidate to test such a technology on.
We have a need for a technology that not enough people are trying to develop. One problem is that natural language processing is a large and challenging topic. As the technology advances and we are able to lower the bar for more developers to enter this area of research we will see more enthusiasm erupt over it, and more progress made.