Neural Networks and Language Identification
In 9th-century Baghdad, the philosopher al-Kindi, also known as the "Philosopher of the Arabs", revolutionized cryptography when he published "On Deciphering Cryptographic Messages". He recognized that when a text is analyzed, certain letters are more likely to show up than others. The statistical frequency of letters in a language acts as a fingerprint for that language, and allows cryptologists to decipher messages encoded with simple substitution ciphers.
Because this "fingerprint" is visible even in small text samples, it seems likely that a reasonably accurate neural network could be constructed and trained without too much difficulty (if you're unsure what a neural network is, I have a page here that provides a brief explanation).
Since this method of language recognition is based on character frequency, we first need a way to parse text and produce suitable input values for the network. A simple program I wrote can quickly perform this analysis, and using a variety of open-source texts we can gather sample input data for the network. Because there is less online content written in, say, Afrikaans than in English, we should expect the network to perform better on well-known languages than on uncommon ones.
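The frequency-analysis program itself isn't shown in this article; a minimal sketch of the idea, using a hypothetical 26-letter alphabet in place of the project's actual 53-letter one, might look like this:

```python
from collections import Counter

def letter_frequencies(text, alphabet):
    """Return normalised per-letter counts -- the statistical 'fingerprint'
    of a text sample, suitable as input values for a network."""
    counts = Counter(c for c in text.lower() if c in alphabet)
    total = sum(counts.values()) or 1  # avoid division by zero on empty text
    return [counts[ch] / total for ch in alphabet]

# Illustrative 26-letter alphabet; the project's real alphabet has 53 letters.
ALPHABET = "abcdefghijklmnopqrstuvwxyz"
vec = letter_frequencies("The quick brown fox jumps over the lazy dog.", ALPHABET)
```

Each input unit of the network then receives one entry of this normalised vector.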
Since each language tends to use a distinct alphabet, we also need to agree on the number of distinct characters the network will accept. For the purposes of this project I have limited myself to languages with Latin-based alphabets. There are many out there, but to keep things simple I settled on 18 languages: Afrikaans, Catalan, Czech, Danish, Dutch, English, Finnish, French, German, Hungarian, Icelandic, Latin, Norwegian, Polish, Portuguese, Spanish, Swedish, and Tagalog. Across those languages I've compiled a list of 53 distinct letters (irrespective of case) that occur. I have omitted numbers and punctuation because their occurrence has more to do with the genre of a text than its language.
The basic design of the network I implemented therefore includes 53 input units, one for each character, and 18 output units, one for each language. Note that if you try to identify a language that is not in the above list, the network will either not respond or respond incorrectly. A network could theoretically parse the alphabets of all languages, but the computation and size of the network would necessarily be larger if you tried to include languages like Chinese.
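As a sketch of that layout, assuming a single-layer network with sigmoid outputs (the article doesn't specify hidden layers, and the trained weights aren't published, so random placeholders stand in here):

```python
import numpy as np

N_LETTERS, N_LANGUAGES = 53, 18  # one input per letter, one output per language

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Placeholder random weights; a trained network would have learned these.
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(N_LETTERS, N_LANGUAGES))
b = np.zeros(N_LANGUAGES)

def identify(fingerprint):
    """Map a 53-value letter-frequency vector to 18 per-language scores."""
    return sigmoid(fingerprint @ W + b)

scores = identify(np.full(N_LETTERS, 1.0 / N_LETTERS))
```

The predicted language is simply the output unit with the highest score.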
Initial tests were promising. Using a sigmoid-based training rule I was able to get the network to converge to a solution within 1000 training epochs. I took the weights from the trained network and programmed them into the character-frequency program I mentioned earlier. With a bit of tinkering I was able to get the program to produce the correct output for text in various languages that I copied from Wikipedia. While the network was far more successful than I initially imagined it would be, this success came with several caveats. The network is extremely accurate for large input texts (say, over 1000 characters), but as the input size shrinks the accuracy quickly drops off. Given that the underlying principle relies on statistical frequency analysis, this is not surprising: with small input sizes there are simply not enough characters to produce an accurate "fingerprint" of the language. Luckily, there are alternative methods of encoding the characters.
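The training procedure itself isn't shown in the article; a toy sketch of gradient descent on sigmoid units, using fabricated two-language frequency data, illustrates the kind of convergence described:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Fabricated letter-frequency fingerprints for a toy two-language problem:
# rows are text samples, columns are (made-up) letter frequencies.
X = np.array([[0.7, 0.2, 0.1],   # samples of "language A"
              [0.6, 0.3, 0.1],
              [0.1, 0.2, 0.7],   # samples of "language B"
              [0.2, 0.2, 0.6]])
T = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])  # one-hot targets

rng = np.random.default_rng(1)
W = rng.normal(scale=0.1, size=(3, 2))
lr = 1.0

for epoch in range(1000):
    Y = sigmoid(X @ W)
    # Delta rule for sigmoid units: gradient of squared error w.r.t. W.
    W -= lr * X.T @ ((Y - T) * Y * (1.0 - Y))

pred = sigmoid(X @ W).argmax(axis=1)  # language indices for each sample
```

On this separable toy data the network settles on the correct labels well within the 1000 epochs mentioned above.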
One such approach commonly used in the literature on this sort of neural network training is the n-gram approach. Rather than counting the frequency of single characters, we instead keep track of consecutive strings of multiple characters. It is easy to see why this might be more successful than the previous approach: trigrams such as "the" or "and" are far more likely to occur in English than in any other language, which reduces our need for a large input sample. With this approach we can expect to accurately identify a language with as little as a sentence or two of input text. Of course, this newfound power comes at a cost. The amount of computation required for this network rapidly increases if we decide to include all possible trigrams: rather than 53 separate inputs, we now have to deal with almost 150,000 (53^3 = 148,877). Even a high-end computer might have trouble dealing with networks that large.
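As a sketch (the function here is my own illustration, not the author's code), counting trigrams in a text sample might look like:

```python
from collections import Counter

def ngram_counts(text, n=3):
    """Count consecutive character n-grams (letters and spaces, lowercased)
    in a text sample."""
    cleaned = "".join(c for c in text.lower() if c.isalpha() or c == " ")
    return Counter(cleaned[i:i + n] for i in range(len(cleaned) - n + 1))

tri = ngram_counts("the cat and the hat", n=3)
```

A common English trigram like "the" immediately dominates such a count, which is why far less text is needed than with single-character frequencies.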
A Bi-gram Approach
The accuracy of my initial approach for input sizes of 100 characters was around 40%. I thought I could do better.
With my second attempt I decided to use bi-grams instead of individual characters. To keep the number of inputs small I decided to use only the regular alphabet this time. This probably affected the identification accuracy to some extent, but I decided the trade-off in computing time was worth it.
Using 676 (26x26) input units this time, but otherwise keeping the same network specifications as last time, I was able to effectively double the network's accuracy at 100 characters. The amount of computation this improved network needs to determine a language is necessarily larger, but I feel that the improved accuracy of identification was worth it.
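A sketch of the 676-value bigram encoding (again illustrative, not the author's actual program):

```python
from collections import Counter
from itertools import product
import string

LETTERS = string.ascii_lowercase
BIGRAMS = ["".join(p) for p in product(LETTERS, repeat=2)]  # 676 = 26 x 26

def bigram_vector(text):
    """Normalised counts of the 676 two-letter sequences in a sample.
    Note: this simple version joins bigrams across word boundaries; a real
    implementation might reset at spaces instead."""
    cleaned = "".join(c for c in text.lower() if c in LETTERS)
    counts = Counter(cleaned[i:i + 2] for i in range(len(cleaned) - 1))
    total = sum(counts.values()) or 1
    return [counts[bg] / total for bg in BIGRAMS]

vec = bigram_vector("Hello world")
```

The resulting 676-value vector feeds the input layer exactly as the 53-value frequency vector did before.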
While certain language pairs (Spanish and Portuguese, for example) are still relatively hard to tell apart with small sample sizes, the network proves to be very accurate with large input sizes. You can try it for yourself here.