Topic maps

I've recently been exploring ways of working out what subject matter Russian newspapers discuss. In the end I settled on extracting TF-IDF (Term Frequency - Inverse Document Frequency) keywords from a large (read: near exhaustive) database of the Russian press. I collected the top ten keywords for each text. Taking each keyword as a node, and each pair of keywords that occurs in the same document as an edge, quite nice topic maps can be created.
The size of a node is proportional to how often that keyword features, and the edges are similarly weighted by how often the pair occurs together. This is material I'll be using for my thesis, so I'm not keen to spread it all over the internet at this stage, but here is a snippet of Novaia Gazeta's sports pages (click the image for a hi-res version).

I find it surprising that there is no big tennis cluster, and I am also quite surprised that the NHL features as strongly as it does. Of course the Olympics are very present, and closely linked with doping. Lokomotiv Moscow comes out as the most written-about football team, or perhaps it is merely the most unusual word and is consequently captured by the algorithm. There are some obvious mistakes (e.g. голый, "naked"), but on the whole it looks pretty convincing. I put this down to Mikhail Korobov's brilliant Pymorphy2 morphological analyzer. No more word-stemmers for me, so it's farewell Snowball!
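
For anyone who wants to try something similar, here is a minimal sketch of the keyword-graph idea described at the top of the post. It is not the code behind the map above, just an illustration under a few assumptions: the documents are already tokenised and lemmatised, TF-IDF is the plain textbook variant (relative term frequency times the log of inverse document frequency), and the function names `top_keywords` and `build_topic_map` and the toy documents are made up for the example.

```python
import math
from collections import Counter
from itertools import combinations


def top_keywords(docs, k=10):
    """Return the k highest-scoring TF-IDF terms for each tokenised document."""
    n_docs = len(docs)
    # Document frequency: the number of documents each term appears in.
    df = Counter(term for doc in docs for term in set(doc))
    keywords = []
    for doc in docs:
        tf = Counter(doc)
        scores = {
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
        # Highest score first; ties are broken alphabetically so the
        # result is deterministic.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        keywords.append(ranked[:k])
    return keywords


def build_topic_map(keyword_lists):
    """Count keyword occurrences (node weights) and co-occurrences (edge weights)."""
    node_weights = Counter()
    edge_weights = Counter()
    for kws in keyword_lists:
        unique = sorted(set(kws))  # sorted so each pair has one canonical order
        node_weights.update(unique)
        edge_weights.update(combinations(unique, 2))
    return node_weights, edge_weights


# Toy, already-lemmatised documents (invented for illustration).
docs = [
    ["olympics", "doping", "olympics", "team"],
    ["doping", "olympics", "ban", "team"],
    ["hockey", "nhl", "team", "nhl"],
]

keywords = top_keywords(docs, k=2)  # the post uses the top ten per text
nodes, edges = build_topic_map(keywords)
```

Here `nodes` maps each keyword to the number of documents it was a keyword for, and `edges` maps each keyword pair to the number of documents the two shared. Those counts are what would drive node size and edge thickness in whatever graph tool does the drawing. Note that "team", which appears in every toy document, gets a TF-IDF score of zero and never becomes a node.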

