<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Gorges Blog &#187; natural language processing</title>
	<atom:link href="http://blog.GORGES.us/tag/natural-language-processing/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.GORGES.us</link>
	<description>Web Sites that Grow Your Business - our blog</description>
	<lastBuildDate>Mon, 19 Jul 2010 20:48:19 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0</generator>
		<item>
		<title>Web Tools for Natural Language Processing</title>
		<link>http://blog.GORGES.us/2009/09/web-tools-for-natural-language-processing/</link>
		<comments>http://blog.GORGES.us/2009/09/web-tools-for-natural-language-processing/#comments</comments>
		<pubDate>Tue, 01 Sep 2009 13:00:24 +0000</pubDate>
		<dc:creator>Rasmus Schultz</dc:creator>
				<category><![CDATA[Technology]]></category>
		<category><![CDATA[Web Development]]></category>
		<category><![CDATA[natural language processing]]></category>
		<category><![CDATA[Web 3.0]]></category>

		<guid isPermaLink="false">http://blog.GORGES.us/?p=157</guid>
		<description><![CDATA[Natural language processing is an important part of the semantic web.  Here is a short survey of some tools that are available to make your web application smarter.]]></description>
			<content:encoded><![CDATA[<p>We have been researching Web 3.0, the moniker assigned to the next generation of web applications: ones that really understand what you are trying to do.</p>
<p>Part of creating &#8220;smart&#8221; web applications is understanding the semantics of what people type in, which calls for natural language processing. Natural language processing software examines unstructured documents and generates structured metadata that computers can handle.</p>
<p>Our application needed to understand phrases that people enter into a web browser.  We found three different approaches to handling this unstructured text:</p>
<p><span style="text-decoration: underline;"><strong>SaaS APIs</strong></span></p>
<p>These are hosted applications. All offer limited services at no charge; their commercial services are generally pretty expensive. The major players appear to be:</p>
<p><a href="http://www.zemanta.com/"><strong>Zemanta</strong></a>: offers an API with automatic tagging, among many other features.</p>
<p><a href="http://www.opencalais.com/"><strong>OpenCalais</strong></a>: while it is by no means &#8220;open&#8221;, this API is powered by Reuters, which means that its &#8220;corpus&#8221; (the body of text the system was built from) was composed using one of the world&#8217;s largest and most accurate volumes of text.</p>
<p><a href="http://www.alchemyapi.com/"><strong>Alchemy API</strong></a>: offers automated categorization, tagging, keywords, etc.</p>
<p><span style="text-decoration: underline;"><strong>NLP Toolkits</strong></span></p>
<p>These are open-source toolkits (libraries that you can install on your own server) for analysis of unstructured text. Learning how to apply one of these might take considerable effort: someone would have to learn at least the basics of NLP to use this software, or you might choose to hire a consultant with the skills to develop this part of the application.</p>
<p><a href="http://www.nltk.org/"><strong>NLTK.org</strong></a>: a library written in Python, started in 2001, that has been slowly creeping towards release 1.0 for the past year or so. While relatively young, it may be based on newer research than some of the more mature NLP libraries. It comes with many corpora, grammar collections and trained models ready to use.</p>
<p><a href="http://gate.ac.uk/"><strong>GATE</strong></a>: General Architecture for Text Engineering. Stable and proven toolkit for Java &#8211; this project started in 1995. Countless subprojects leverage this toolkit for various purposes.</p>
<p><a href="http://garraf.epsevg.upc.es/freeling/"><strong>FreeLing</strong></a>: Widely used toolkit in C++, with APIs for Java, Perl and Python. Online demos of this library show graphically how a short sentence can be broken down into a kind of tree structure (nested subject/object, verb/adverb, etc.).</p>
<p>These are just a few examples &#8211; there are so many toolkits, and applications using these toolkits, that it would be impossible to make a choice based on a superficial analysis. To make a qualified choice, we would need to study at least the basics, or we would need the help of someone who knows enough about it to make a recommendation based on our needs.</p>
<p><span style="text-decoration: underline;"><strong>Roll-your-own</strong></span></p>
<p>This means using, for example, MySQL, the <a href="http://en.wikipedia.org/wiki/Stemming">Porter stemmer</a>, a stop-word list and various other techniques to roll a basic search engine. You could perhaps throw in a Bayesian text similarity measurement to help rank the results and create stronger/weaker links between tables of keywords and posts.</p>
<p>It&#8217;s not NLP, and it&#8217;s not &#8220;web 3.0&#8221; or &#8220;the semantic web&#8221; that everyone is buzzing about these days, because it does not understand semantics and will not yield the same kind of results. NLP systems &#8220;understand&#8221; unstructured text, where words like &#8220;not&#8221; and &#8220;really&#8221; can reverse or amplify the meaning of a subject, whereas anything you roll on your own would most likely treat these words as &#8220;stop words&#8221; and ignore them.</p>
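<p>To make that limitation concrete, here is a minimal sketch of the roll-your-own keyword step in Python. Everything in it is an assumption for illustration: the tiny stop-word list, the hypothetical <code>crude_stem</code> and <code>keywords</code> helpers, and the suffix-stripping rule, which is only a crude stand-in for the Porter stemmer and not the real algorithm.</p>

```python
import re

# Hypothetical, deliberately tiny stop-word list; real lists are much longer.
STOP_WORDS = {"a", "an", "the", "is", "this", "not", "really"}

# Crude suffix stripping: a stem of three or more letters followed by a
# common suffix. This is a stand-in for the Porter stemmer, NOT the real thing.
SUFFIX_PATTERN = re.compile(r"([a-z]{3,}?)(ing|ed|es|s)")


def crude_stem(word):
    """Strip one common English suffix, or return the word unchanged."""
    match = SUFFIX_PATTERN.fullmatch(word)
    return match.group(1) if match else word


def keywords(text):
    """Lower-case the text, drop stop words, and stem what is left."""
    words = re.findall(r"[a-z]+", text.lower())
    return {crude_stem(word) for word in words if word not in STOP_WORDS}


positive = keywords("This movie is really good")
negative = keywords("This movie is not good")
print(positive == negative)  # prints True: the negation has been thrown away
```

<p>Because &#8220;not&#8221; and &#8220;really&#8221; are on the stop-word list, both sample sentences reduce to the same keyword set, so a search engine built this way cannot tell a positive phrase from its negation.</p>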
]]></content:encoded>
			<wfw:commentRss>http://blog.GORGES.us/2009/09/web-tools-for-natural-language-processing/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>
