Posts Tagged ‘Web Development’

Web Tools for Natural Language Processing

Tuesday, September 1st, 2009

We have been researching Web 3.0, the moniker assigned to the next generation of web applications that really understand what you are trying to do.

Part of creating “smart” web applications is understanding the semantics of what people type in, which implies using natural language processing.  Natural language processing software examines unstructured documents, and generates structured metadata that computers can handle.

Our application needed to understand phrases that people enter into a web browser.  We found three different approaches to handling this unstructured text:

SaaS APIs

These are hosted applications. All offer limited services at no charge, but their commercial services are generally pretty expensive. The major players appear to be:

Zemanta: offers an API with automatic tagging, among many other features.

OpenCalais: while it is by no means “open”, this API is powered by Reuters – which means that their “corpus” (body of words understood by the system) was composed using one of the world’s largest and most accurate volumes of text.

Alchemy API: offers automated categorization, tagging, keywords, etc.

NLP Toolkits

These are open-source toolkits (libraries that you can install on your own server) for analysis of unstructured text. Learning how to apply one of these might take considerable effort: someone would have to learn at least the basics of NLP to apply this software, or you might choose to hire a consultant with the skills to develop this part of the application.

NLTK.org: a library written in Python, started in 2005, that has been slowly creeping towards release 1.0 for the past year or so. While relatively young, it may be based on newer research than some of the more mature NLP libraries. Many corpora, grammar collections and trained models are ready to use.

GATE: General Architecture for Text Engineering. Stable and proven toolkit for Java – this project started in 1995. Countless subprojects leverage this toolkit for various purposes.

FreeLing: Widely used toolkit in C++, with APIs for Java, Perl and Python. Online demos of this library show graphically how a short sentence can be broken down into a kind of tree structure (nested subject/object, verb/adverb, etc.).

These are just a few examples – there are so many toolkits, and applications using these toolkits, that it would be impossible to make a choice based on a superficial analysis. To make a qualified choice, we would need to study at least the basics, or we would need the help of someone who knows enough about it to make a recommendation based on our needs.

Roll-your-own

Using e.g. MySQL, the Porter stemmer, a stop-word list and various other techniques to roll a basic search engine. Perhaps throw in a Bayesian text similarity measurement, to help rank the results and create stronger/weaker links between tables of keywords and posts.
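As a rough illustration of the roll-your-own idea, here is a minimal sketch in Python. It makes several assumptions that are not part of our actual setup: an in-memory dictionary stands in for the MySQL keyword tables, `crude_stem` is a hypothetical placeholder that only strips a few suffixes (it is not the real Porter algorithm), the stop-word list is a tiny made-up sample, and ranking is a simple count of matching keywords rather than a Bayesian measure.

```python
import re
from collections import defaultdict

# Tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "for", "not", "really"}


def crude_stem(word):
    """Hypothetical stand-in for the Porter stemmer: strips a few suffixes."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def keywords(text):
    """Lowercase, tokenize, drop stop words, and stem what is left."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]


def build_index(posts):
    """Map each keyword to the set of post ids containing it."""
    index = defaultdict(set)
    for post_id, text in posts.items():
        for kw in keywords(text):
            index[kw].add(post_id)
    return index


def search(index, query):
    """Rank posts by how many query keywords they match (ties by post id)."""
    scores = defaultdict(int)
    for kw in keywords(query):
        for post_id in index.get(kw, ()):
            scores[post_id] += 1
    return sorted(scores, key=lambda p: (-scores[p], p))


posts = {
    1: "Parsing web pages",
    2: "Web tools for tagging posts",
    3: "Database design",
}
index = build_index(posts)
results = search(index, "web tagging")
```

In a database-backed version, the `index` dictionary would presumably become a keyword table joined to a posts table, and the match count would be replaced by whatever similarity score is chosen.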

This is not NLP, and it is not the “Web 3.0” or “semantic web” that everyone is buzzing about these days, because it does not understand semantics and will not yield the same kind of results. NLP systems “understand” unstructured text, in which words like “not” and “really” can reverse or amplify the meaning of a phrase. Anything you roll on your own would most likely treat these words as “stop words” and ignore them.
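The loss of meaning is easy to demonstrate. The sketch below assumes a small, made-up stop-word list that happens to include “not” and “really”, as many common lists do; the function name `strip_stop_words` is ours, for illustration only.

```python
import re

# Illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"a", "is", "this", "the", "not", "really"}


def strip_stop_words(text):
    """Return the lowercase tokens of text with stop words removed."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


negated = strip_stop_words("This is not a good idea")
amplified = strip_stop_words("This is really a good idea")

# Both sentences reduce to the same keywords, so a keyword-based
# search engine cannot tell the negated sentence from the amplified one.
same_keywords = negated == amplified
```

After filtering, both sentences are reduced to the keywords “good” and “idea”, which is exactly the distinction a semantic system is expected to preserve.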

Rasmus Schultz has worked for web development companies, advertising agencies and a music software company during his extensive development career. His main strengths are software development and database design. Rasmus has more than a decade of experience with many development platforms, languages and standards.

To Agile or Not To Agile

Thursday, April 9th, 2009

Gorges has published our best practices methodology, but it does not mention agile development.  Why not?

First, what is agile development? Agile development is a set of methodologies for creating software. These methodologies include breaking projects into smaller tasks with minimal long-term planning. Each task, or iteration, is treated as a mini-project with planning, design, coding, and testing cycles. Collaboration is encouraged among team members, and face-to-face communication is preferred to written documentation. The customer is expected to be available to make decisions on design, features, and prioritizing tasks.

The proclaimed benefits are faster development and higher customer satisfaction.

So why don’t we promote agile development on our web site?

We have found that it takes a certain type of client to make agile development work. The client must be actively involved, willing to make quick decisions, and not averse to compromise if we learn that certain features may be costly to implement. The smaller cycles are called “sprints”, and since their deadlines are fixed, decisions are continually made to prioritize the features and tasks to meet these deadlines.

Likewise, this sort of development can be a challenge for developers who are used to fixed specifications and to planning the entire project before starting to program.

Gorges has had some marvelous agile-development successes, but we have also learned to not push this style onto our customers.  Another reason we do not promote this methodology is that it is difficult to estimate the project size since the development is dependent upon the client’s feature selections.

Matt Clark worked in academia, corporate research labs and several technology startup companies prior to Gorges Web Sites. His expertise is software architecture, database development, and system administration. Matt brings Gorges Web Sites over 25 years of experience developing fast and robust software on a multitude of platforms and languages.