Archive for September, 2009

What Hosting Do I Need?

Tuesday, September 15th, 2009

Choosing a hosting service is important, and there are many choices to make.  Here are some tips to help you make your selection.

The first step is to determine your business requirements.  The criteria should be reliability (or uptime), performance, support, and cost.  Try to estimate the cost of downtime, because that value should factor into your hosting decision.  If a day of downtime costs you thousands of dollars, then reliability is very important.
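One rough way to put a number on downtime is to convert a host's advertised uptime percentage into hours per year and multiply by what an hour offline costs you.  The sketch below assumes a flat hourly cost; the function names and the $200 per hour figure are made up for illustration and are not taken from any hosting provider.

```python
HOURS_PER_YEAR = 24 * 365  # ignores leap years; fine for a rough estimate


def expected_downtime_hours(uptime_percent: float) -> float:
    """Hours per year a site may be down at the given uptime percentage."""
    return HOURS_PER_YEAR * (1 - uptime_percent / 100)


def expected_downtime_cost(uptime_percent: float, cost_per_hour: float) -> float:
    """Yearly cost of downtime, assuming every hour offline costs the same."""
    return expected_downtime_hours(uptime_percent) * cost_per_hour


if __name__ == "__main__":
    # Hypothetical figure: one hour offline costs $200 in lost business.
    for uptime in (99.0, 99.9, 99.99):
        hours = expected_downtime_hours(uptime)
        cost = expected_downtime_cost(uptime, cost_per_hour=200.0)
        print(f"{uptime}% uptime: about {hours:.1f} hours down per year, "
              f"roughly ${cost:,.0f}")
```

Comparing that yearly figure with the price difference between two hosting plans gives a first, rough sense of whether the more reliable plan pays for itself.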

The cheapest hosting is to purchase an account on a shared server.  Your domain is one of perhaps hundreds or even thousands that vie for the server CPU, memory, and bandwidth.  If your site is slow, it may be difficult or even impossible to diagnose why since the fault may be with another domain on the same server.

The next level up is a virtual private server (VPS).  In reality you are still sharing the server with other customers, but each customer runs a relatively independent operating system, so a problem on one affects the others less.  The term “cloud computing” is largely another name for using virtual private servers, although cloud computing control panels often make it easy and fast to add and remove VPS units as your domain’s needs change.

If you want the whole server to yourself, then you can host on a dedicated server.  This is all about control: there are no other customers to contend with, because you are the only one using the server.  Note that you may need an experienced system administrator to help if you are setting up your own dedicated server.

If your domain outgrows a dedicated server, then you have graduated to a cluster solution.  You will have new challenges regarding sharing session state and your database across multiple servers.  It should also be mentioned that cloud computing providers support clustering with their VPS machines, which is typically cheaper than a custom-built clustered solution.
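The session challenge can be illustrated with a toy simulation.  In this sketch, two hypothetical web servers sit behind a round-robin load balancer; the class and function names are invented for the example, and a plain Python dict stands in for what would more likely be a database table or a network cache in a real cluster.

```python
import itertools


class WebServer:
    """Toy web server that keeps sessions in whatever store it is given."""

    def __init__(self, name, session_store):
        self.name = name
        self.sessions = session_store

    def login(self, session_id, user):
        self.sessions[session_id] = {"user": user}

    def whoami(self, session_id):
        session = self.sessions.get(session_id)
        return session["user"] if session else None


def make_cluster(shared):
    """Build two servers that either share one store or each keep their own."""
    shared_store = {}
    return [WebServer(f"web{i}", shared_store if shared else {}) for i in (1, 2)]


def simulate(shared):
    """Log in on one server, then send the follow-up request to the other."""
    servers = itertools.cycle(make_cluster(shared))  # round-robin balancer
    next(servers).login("abc123", "alice")   # first request lands on web1
    return next(servers).whoami("abc123")    # second request lands on web2


if __name__ == "__main__":
    print("separate stores:", simulate(shared=False))
    print("shared store:   ", simulate(shared=True))
```

With separate stores the second server has never seen the session, so the visitor appears logged out; with a shared store both servers see the same login.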

At Gorges, we offer shared-server and dedicated-server hosting solutions to our software development clients.  We have two co-location facilities that we use in Ithaca, New York, and our servers are monitored constantly.  Since we do our own hosting, we can add software packages or customize the server configuration as-needed for our clients.

Matt Clark worked in academia, corporate research labs and several technology startup companies prior to GORGES. His expertise is software architecture, database development, and system administration. Matt brings GORGES over 25 years of experience developing fast and robust software on a multitude of platforms and languages.

Web Tools for Natural Language Processing

Tuesday, September 1st, 2009

We have been researching Web 3.0, which is the moniker assigned to the next generation of web applications that really understand what you are trying to do.

Part of creating “smart” web applications is understanding the semantics of what people type in, which implies using natural language processing.  Natural language processing software examines unstructured documents, and generates structured metadata that computers can handle.

Our application needed to understand phrases that people enter into a web browser.  We found three different approaches to handling this unstructured text:

SaaS APIs

These are hosted applications. All offer limited services at no charge, but commercial services are generally pretty expensive. The major players appear to be:

Zemanta: offers an API with automatic tagging, among many other features.

OpenCalais: while it is by no means “open”, this API is powered by Reuters – which means that their “corpus” (body of words understood by the system) was composed using one of the world’s largest and most accurate volumes of text.

Alchemy API: offers automated categorization, tagging, keywords, etc.

NLP Toolkits

These are open-source toolkits (APIs that you can install on your own server) for analysis of unstructured text. Learning how to apply one of these might take considerable effort: someone would have to learn at least the basics of NLP to apply this software, or you might choose to hire a consultant with the skills to develop this part of the application.

NLTK.org: a library written in Python, started in 2001, that has been slowly creeping towards release 1.0 for the past year or so. While relatively young, it may be based on newer research than some of the more mature NLP libraries. It comes with many corpora, grammar collections and trained models ready to use.

GATE: General Architecture for Text Engineering. Stable and proven toolkit for Java – this project started in 1995. Countless subprojects leverage this toolkit for various purposes.

FreeLing: Widely used toolkit in C++, with APIs for Java, Perl and Python. Online demos of this library demonstrate graphically how a short sentence can be broken down into a kind of tree structure (nested subject/object, verb/adverb, etc.).

These are just a few examples – there are so many toolkits, and applications using these toolkits, that it would be impossible to make a choice based on a superficial analysis. To make a qualified choice, we would need to study at least the basics, or we would need the help of someone who knows enough about it to make a recommendation based on our needs.

Roll-your-own

This means using e.g. MySQL, the Porter stemmer, a stop-word list and various other techniques to roll a basic search engine. You could perhaps throw in a Bayesian text similarity measurement to help rank the results and create stronger/weaker links between tables of keywords and posts.

It’s not NLP, and it’s not “web 3.0” or “the semantic web” that everyone is buzzing about these days, because it does not understand semantics and will not yield the same kind of results. NLP systems “understand” unstructured text, where words like “not” and “really” can reverse or amplify the meaning of a subject, whereas anything you roll on your own would most likely just treat these words as “stop words” (ignoring them).
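A minimal sketch of the roll-your-own approach might look like the following. It makes several simplifying assumptions: an in-memory dict stands in for the MySQL tables, `crude_stem` is a naive suffix stripper and not the real Porter stemmer, the stop-word list and sample posts are made up, and results are ranked by a simple count of matching terms rather than any Bayesian measure.

```python
import re
from collections import defaultdict

# Tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"a", "an", "and", "the", "is", "it", "of", "on", "to", "in",
              "not", "really", "this", "was"}
SUFFIXES = ("ing", "ed", "es", "s")

# Hypothetical sample posts, keyed by post id.
POSTS = {
    1: "Hosting on a shared server",
    2: "This hosting is not really reliable",
    3: "Reliable dedicated servers",
}


def crude_stem(word):
    """Strip one common suffix. A stand-in for a real stemmer, not Porter."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(text):
    """Lowercase, split into words, drop stop words, stem what is left."""
    words = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(w) for w in words if w not in STOP_WORDS]


def build_index(posts):
    """Map each stemmed term to the set of post ids that contain it."""
    index = defaultdict(set)
    for post_id, text in posts.items():
        for term in tokenize(text):
            index[term].add(post_id)
    return index


def search(index, query):
    """Return post ids ordered by number of matching terms, then by id."""
    scores = defaultdict(int)
    for term in tokenize(query):
        for post_id in index.get(term, ()):
            scores[post_id] += 1
    return sorted(scores, key=lambda pid: (-scores[pid], pid))


if __name__ == "__main__":
    print(search(build_index(POSTS), "reliable hosting"))
    print(tokenize("This hosting is not really reliable"))
```

Because “not” and “really” are discarded as stop words, post 2 is reduced to the same terms as “This hosting is reliable” and ranks first for the query “reliable hosting”, even though it says the opposite. That is the limitation described above.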

Rasmus Schultz has worked for web development companies, advertising agencies and a music software company during his extensive development career. His main strengths are software development and database design. Rasmus has more than a decade of experience with many development platforms, languages and standards.