Manny Rayner's book reviews

I love reviewing books - have been doing it at Goodreads, but considering moving here.

Natural Language Processing with Python - Edward Loper, Steven Bird, Ewan Klein

[Editor's preface to the second edition: notgettingenough read the first edition of this review and complained that it was all Geek to her. I have amended it accordingly]
POLONIUS: What do you read, my lord?
HAMLET: Words, words, words.

Hamlet was evidently interested in textual analysis, and if the Python Natural Language Toolkit (NLTK) had been available in Elsinore I'm sure he'd have bought this book too. I'd heard good things about it, and it doesn't disappoint: the authors have done a terrific job of combining a lot of freeware tools and resources into a neat package.

They say they want it to be accessible even to people who have no software development experience; this may be just a little optimistic, but try it for yourself and see what you think. They've certainly made every effort to get you hooked from the beginning. Ten minutes after downloading the software, I was able to produce a randomized version of Monty Python and the Holy Grail with a single command:
Python 2.6.6
>>> import nltk
>>> nltk.download()
>>> from nltk.book import *
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908
>>> text6.generate()
Building ngram index...
SCENE 1 : Well , I see . Running away , And his nostrils
raped and his bottom burned off , And his pen -- SIR ROBIN
: We are just not used to handsome knights . Nay . Nay .
Come on . Anybody armed must go too . OFFICER # 1 : No .
Not only by surprise . Not only by surprise . Not the
Knights Who Say ' Ni '. KNIGHTS OF NI : Ni ! ARTHUR :
You know much that is . Yeah , a swallow ' s got a point .

So what else can it do? Geeks may want to skip to the example below, but here's a brief summary. The toolkit contains three kinds of materials. First, there's a well-selected set of texts, packaged up so that they can easily be used. Some of them are listed above; there are a couple of dozen more that you can quickly locate.

Second, there's a bunch of tools which you can use to analyze the texts. For example, there's an interface to WordNet, which is a kind of digitized super-thesaurus containing tens of thousands of words and concepts, all neatly arranged into a complex hierarchy with the most general concepts at the top and the most specific ones at the bottom. There's a tool called a "part-of-speech tagger", which takes a piece of text and guesses the part of speech - noun, verb, adjective, etc - for each word in the context in which it appears. There are "parsers", which can analyze sentences in terms of grammatical function - finding subjects, objects, main verbs, and so on. And there are plenty of other things, in particular easy ways to incorporate machine learning methods, which you can train yourself by giving them examples.
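
The real interface lives in nltk.corpus.wordnet, but the core idea of the hierarchy - walking from a word up through ever more general concepts - is simple enough to sketch in plain Python. The toy taxonomy below is invented purely for illustration; WordNet itself has tens of thousands of concepts and is not structured as a flat dictionary like this:

```python
# Toy taxonomy: each concept maps to its parent (hypernym).
# Invented for illustration only -- not NLTK's WordNet API.
hypernym_of = {
    'poodle': 'dog',
    'dog': 'canine',
    'canine': 'mammal',
    'mammal': 'animal',
    'oak': 'tree',
    'tree': 'plant',
}

def hypernym_path(concept):
    """Walk from a concept up to the top of the toy hierarchy."""
    path = [concept]
    while path[-1] in hypernym_of:
        path.append(hypernym_of[path[-1]])
    return path

print(hypernym_path('poodle'))  # ['poodle', 'dog', 'canine', 'mammal', 'animal']
```

Checking whether a word can denote an animal then amounts to asking whether 'animal' appears somewhere on one of its upward paths, which is exactly the trick the script later in this review plays with the real WordNet.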

Third, there's Python itself, which is the glue that sticks all these things together. I'd somehow never used Python before, but it's a concise and elegant language that's easy to learn if you already have some software skills. If you know Perl, Ruby, Unix shell-scripting, or anything like that, you'll be up and flying in no time. You can write scripts which are just a few lines long, but which do a whole lot of stuff: read a file from the web, chop it up into individual words and sentences, find all the sentences that have some particular property you're searching for, and then display everything as a neat table or graph.
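
As a taste of that conciseness, here's a stdlib-only sketch along those lines. The text is a made-up stand-in for a file you'd fetch from the web (urllib would do the fetching; a literal string keeps the sketch self-contained):

```python
import re
from collections import Counter

# Stand-in for a document fetched from the web.
text = ("The cat sat on the mat. The dog barked. "
        "A swallow carried a coconut. The cat slept.")

# Chop it up into sentences, then into lowercase words.
sentences = [s.strip() for s in re.split(r'\.\s*', text) if s.strip()]
words = re.findall(r"[a-z']+", text.lower())

# Find every sentence with a particular property, then tabulate frequencies.
cat_sentences = [s for s in sentences if 'cat' in s.lower().split()]
freq = Counter(words)

print(cat_sentences)
print(freq.most_common(3))
```

A dozen lines, and you have sentence splitting, tokenization, filtering, and a frequency table; NLTK gives you sturdier versions of each of these pieces ready-made.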

The rest of the review will probably only be interesting to geeks, but if that's you, please read on...

I finished the book yesterday, and I've just spent a few hours messing around writing little scripts to see what it can do. Here's the most entertaining one. I thought it would be interesting to be able to locate all the words in a text that refer to animals. NLTK includes a handy interface to WordNet, so the first job was to write a function which checks whether a word could refer to a concept lower in the hierarchy than the one for 'animal'. It's never quite as easy as you first think; after a little experimentation, I realized that I had to block words which referred to animals only by virtue of referring to human beings. The final definition looks like this:
from nltk.corpus import wordnet as wn

animal_synset = wn.synset('animal.n.01')
human_synset = wn.synset('homo.n.02')

def is_animal_word(word):
    # Collect every hypernym on every sense's path to the top,
    # skipping paths that go through 'homo' (words that count as
    # animals only by virtue of referring to human beings).
    hypernyms = [hyp
                 for synset in wn.synsets(word)
                 for path in synset.hypernym_paths()
                 for hyp in path
                 if human_synset not in path]
    return animal_synset in hypernyms
I then wrote a script which called my function to return all the animal words in the first n words of a piece of text:
def print_animal_words_v1(text, n):
    words = set([w.lower() for w in text[:n]])
    animal_words = sorted(set([w for w in words
                               if is_animal_word(w)]))
    print "Animal words in first %d words" % n
    print animal_words
They've packaged up a bunch of textual resources for easy access, so I could immediately test it on the first 50,000 words of Emma:
>>> from nltk.corpus import gutenberg
>>> emma = gutenberg.words('austen-emma.txt')
>>> print_animal_words_v1(emma, 50000)
Animal words in first 50000 words
['baby', 'bear', 'bears', 'blue', 'chat', 'chicken',
'cow', 'cows', 'creature', 'creatures', 'does',
'entire', 'female', 'fish', 'fly', 'games', 'goose',
'head', 'horse', 'horses', 'imagines', 'kite',
'kitty', 'martin', 'martins', 'monarch', 'mounts',
'oysters', 'pen', 'pet', 'pollards', 'shark',
'sharks', 'stock', 'tumbler', 'young']
A quick look at this reveals some suspicious candidates: for example, 'does' is most likely never used as the plural of 'doe', so shouldn't be counted as an animal word.
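
The trouble with 'does' is that it's a homograph: the same spelling covers an unrelated noun and verb, and a lookup keyed on spelling alone can't tell them apart. A toy sketch (with an invented mini-lexicon, not WordNet's API) shows why filtering on part-of-speech tags fixes this:

```python
# Invented mini-lexicon for illustration: each (word, tag) pair maps
# to the concept that reading denotes. Not NLTK's actual API.
lexicon = {
    ('does', 'N'): 'doe',    # plural of 'doe' -- an animal sense
    ('does', 'V'): 'do',     # third person of 'do' -- no animal sense
    ('horse', 'N'): 'horse',
}
animal_concepts = set(['doe', 'horse'])

def animal_words_naive(words):
    # Version 1 behaviour: any reading counts, so 'does' slips through.
    return sorted(w for w in words
                  if any(lexicon.get((w, t)) in animal_concepts
                         for t in ('N', 'V')))

def animal_words_tagged(tagged_words):
    # Version 2 behaviour: only noun readings count.
    return sorted(w for (w, t) in tagged_words
                  if t == 'N' and lexicon.get((w, t)) in animal_concepts)

print(animal_words_naive(['does', 'horse']))                 # ['does', 'horse']
print(animal_words_tagged([('does', 'V'), ('horse', 'N')]))  # ['horse']
```

Once a tagger has decided that 'does' is a verb in context, the noun reading - and its animal sense - never enters the picture.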

My second version of the script called another resource, a "tagger", which quickly goes through the text and tries to guess what part of speech each word is in the context in which it appears. I only look at the words whose tags start with an 'N', indicating that they have been guessed as nouns:
def print_animal_words_v2(text, n):
    print "Tagging first %d words" % n
    tagged_words = nltk.pos_tag(text[:n])
    print "Tagging done"
    words = set([w.lower() for (w, tag) in tagged_words
                 if tag.startswith('N')])
    animal_words = sorted(set([w for w in words
                               if is_animal_word(w)]))
    print "Animal words in first %d words" % n
    print animal_words
Now I get a shorter list, which in particular omits the suspicious 'does':
>>> print_animal_words_v2(emma, 50000)
Tagging first 50000 words
Tagging done
Animal words in first 50000 words
['baby', 'bears', 'blue', 'chicken', 'cow',
'creature', 'creatures', 'female', 'games', 'goose',
'head', 'horse', 'horses', 'kitty', 'martin',
'martins', 'monarch', 'oysters', 'pet', 'pollards',
'shark', 'sharks', 'stock', 'tumbler', 'young']
Well, that should be enough to give you the flavor of the thing. If you don't want to buy the book, it's available free online here. Have fun!