This book gives a good one-volume summary of statistical machine translation (SMT), the technique that powers Google Translate and similar applications. Philipp Koehn is one of the best-known people in the field, and is very active on both the theoretical and the practical side. The open-source Moses engine, which he and his group at Edinburgh University have developed over the last few years, has now become more or less the de facto standard toolkit for SMT. So: an authoritative, well-informed account of a new field.
The basic idea of SMT is shockingly simple, and, when the first papers started coming out in the early 90s, people in the language-processing community were indeed shocked. Suppose you're translating from French into English. All you do is take a large amount of bilingual text - the first experiments were done with the proceedings of the Canadian Parliament - line it up, and extract tables which list apparent correspondences between French phrases and English phrases and their relative frequencies. You then analyze the English text and produce a second set of tables which give the relative frequencies of English phrases on their own.
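To make the two kinds of table concrete, here is a minimal Python sketch. It assumes the hard part (lining up the text and extracting phrase pairs) has already been done; the toy data and the function names `translation_table` and `english_frequencies` are my own inventions for illustration, not anything taken from the book or from Moses.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: phrase pairs already extracted from aligned text.
ALIGNED_PAIRS = [
    ("à", "in"), ("à", "in"), ("à", "on"), ("à", "for"),
    ("la maison", "the house"), ("la maison", "the house"),
    ("la maison", "home"),
]


def translation_table(pairs):
    """Relative frequency of each English phrase given a French phrase."""
    pair_counts = Counter(pairs)
    french_counts = Counter(french for french, _ in pairs)
    table = defaultdict(dict)
    for (french, english), count in pair_counts.items():
        table[french][english] = count / french_counts[french]
    return dict(table)


def english_frequencies(sentences, max_len=2):
    """Relative frequency of each English phrase (up to max_len words) on its own."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        for n in range(1, max_len + 1):
            for i in range(len(words) - n + 1):
                counts[" ".join(words[i:i + n])] += 1
    total = sum(counts.values())
    return {phrase: count / total for phrase, count in counts.items()}


if __name__ == "__main__":
    print(translation_table(ALIGNED_PAIRS)["à"])
    print(english_frequencies(["the house", "the home"]))
```

Real systems estimate these tables from millions of sentence pairs and smooth the counts; the sketch only shows the counting idea.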
To translate, you take a French sentence, find bits of it that match French/English table entries, write down the associated frequencies both for the translation rules and for the resulting English phrases, and pick the combination that gives you the best score. There are two main reasons why it's not completely straightforward. First, there are millions of possible combinations. Most words can be translated in several ways; for instance, à can be "on", "in" or "for", or, to choose a more interesting example that Not recently drew to my attention, branlette can be either "sugar shaker" or "hand job". The possibilities, needless to say, multiply out. Second, and at least as seriously, the English words will often be in a different order from the French words, so you need to take account of that in some way; here, the basic solution is for the translation algorithm to impose a penalty for changing the order, with big changes costing more than small ones.
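Here is a toy Python sketch of that scoring scheme, under loud assumptions: the tables, the probabilities, the fallback value `UNSEEN` and the weight `DISTORTION_WEIGHT` are all made up, the reordering penalty is counted in phrases rather than words, and the search is brute force. It is only meant to show how the three scores combine, not how Moses or any real decoder is implemented.

```python
import math
from itertools import permutations, product

# Hypothetical toy tables; every number here is invented for illustration.
PHRASE_TABLE = {
    "la": {"the": 1.0},
    "maison": {"house": 0.8, "home": 0.2},
    "bleue": {"blue": 1.0},
}
BIGRAM = {  # score of an English word given the previous word
    ("<s>", "the"): 0.9,
    ("the", "house"): 0.4,
    ("the", "blue"): 0.4,
    ("blue", "house"): 0.8,
    ("blue", "home"): 0.1,
    ("house", "blue"): 0.05,
}
UNSEEN = 0.01            # assumed fallback for word pairs not in BIGRAM
DISTORTION_WEIGHT = 0.5  # assumed cost per unit of reordering


def segmentations(words):
    """Yield every split of the words into consecutive phrases found in the table."""
    if not words:
        yield []
        return
    for end in range(1, len(words) + 1):
        phrase = " ".join(words[:end])
        if phrase in PHRASE_TABLE:
            for rest in segmentations(words[end:]):
                yield [phrase] + rest


def distortion(order):
    """Total jump size of a phrase order; zero when nothing is reordered."""
    total, previous = 0, -1
    for index in order:
        total += abs(index - previous - 1)
        previous = index
    return total


def language_score(english_words):
    """Log score of the English word sequence under the toy bigram table."""
    total, previous = 0.0, "<s>"
    for word in english_words:
        total += math.log(BIGRAM.get((previous, word), UNSEEN))
        previous = word
    return total


def translate(french):
    """Try every segmentation, phrase order and translation choice; keep the best."""
    best_score, best_english = -math.inf, None
    for segments in segmentations(french.split()):
        for order in permutations(range(len(segments))):
            options = [PHRASE_TABLE[segments[i]].items() for i in order]
            for choice in product(*options):
                english = " ".join(phrase for phrase, _ in choice)
                score = (sum(math.log(p) for _, p in choice)
                         + language_score(english.split())
                         - DISTORTION_WEIGHT * distortion(order))
                if score > best_score:
                    best_score, best_english = score, english
    return best_english


if __name__ == "__main__":
    print(translate("la maison bleue"))
```

With these invented numbers, the English-only table likes "blue house" enough to outweigh the penalty for swapping the last two phrases. Enumerating everything like this is only feasible for a three-word sentence; with millions of combinations, a real decoder has to prune the search instead.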
But surely there must be more to translation than just looking things up in huge tables and picking the highest-scoring combo? Indeed there is: the fact of the matter, however, is that, with our present level of understanding, this is the method that works best. At the end of the book, there is a chapter briefly describing smarter methods that pay some attention to grammar; but they're not that much smarter, they're much more challenging to implement, and the gains are modest.
I am irresistibly reminded of the discussions of Ptolemaic astronomy in Laplace's wonderful Exposition du système du monde. When you don't really understand planetary motion, you use the best model you can come up with and try to make it fit the data as well as you can. It is hard to believe that the ancient Greek astronomers really thought that the planets moved on invisible crystal spheres attached to other invisible crystal spheres, but you can make it work quite well as a predictive theory if you're prepared to do the necessary number-crunching. As Laplace says, this turned out to be a far more fruitful research direction than imaginative armchair theorizing. People developed the system of equants, deferents and epicycles as far as it would go, and, by carefully studying what went wrong, they eventually found something that was genuinely better. In Machine Translation, we haven't yet reached the Newtonian stage. But if you want to know the details of how those crystal spheres work, Koehn's book is the one to buy.
Here's a cute experiment I just heard about from one of Philipp Koehn's colleagues. Go to Google Translate and try translating the two sentences "I saw few people" and "I saw a few people" into various languages. In some cases, the results will, as you'd expect, be different; in others, they'll be the same.
I suppose there might be some languages where they actually should be the same, but it's definitely getting it wrong in Swedish and I'm almost sure it's wrong in Russian too. It's definitely right in French, and I think in Norwegian. Basically, statistical machine translation contains a strong element of randomness.
If you speak a non-English language fluently, feel free to tell the rest of us what happens in your language!