A little bit of fun with text processing

This blog post has absolutely nothing to do with CMSs and the Internet, but I am going to have a bit of fun using the skills I have – it is a public holiday here in the UK after all :).

I am a bit of a philosopher at heart, and strongly believe in this thing philosophers have called "Occam's razor". It simply means that the simplest consistent explanation is usually closest to the truth. It's a big part of the scientific method, as scientists obviously want to find the simplest set of physical laws possible.

As a result I am not a big fan of conspiracy theories – I think the simplest explanation is usually the right one, and that the resources needed to create and maintain a non-trivial conspiracy are far from a simple situation. That's not to say all conspiracy theories are wrong, of course some are not, but I think typically the truth is simple. It's also not to say the truth is known – there may be a simple truth to things, and the public think some other simple thing is true when it is not.
Still following?  :lol:

Anyway, for some time now I have wanted to have some fun with text processing. Text processing is the field that search engines like Google use in order to make sense of web pages. It's a branch of 'Information studies', which is kind of like a combination of 'Library studies', 'Linguistics' and 'Computer science'.

What I decided to do was make a simple script that would go over something called a 'corpus' and make up some random simple conspiracy theories. Don't worry, this is not going to get me arrested for grave robbing – a digital corpus is simply a load of digital articles that you can utilise. Reuters provided a nice one. It is a huge collection of news articles they collected during 1987.
What I did was very simple; it literally took me 30 minutes. I wrote a list of famous/infamous political names (you can decide which is which):
  • Clinton
  • Obama
  • Hitler
  • Stalin
  • Washington
  • Gandhi
  • Thatcher
  • Osama
  • Churchill
  • Blair
and then I searched the articles for places where one of the names is spelled out by the first letters of a run of consecutive words, ignoring upper and lower case.
Very simple.

Here's my code:

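PHP code

<?php
header('Content-type: text/plain');

// Assumptions: the corpus directory name and the file handling are filled in
// for illustration; adjust them to wherever the Reuters files live.
$to_find = array('Clinton', 'Obama', 'Hitler', 'Stalin', 'Washington', 'Gandhi', 'Thatcher', 'Osama', 'Churchill', 'Blair');
$dir = 'corpus'; // hypothetical path

$dh = opendir($dir);
while (($f = readdir($dh)) !== false) {
    if (is_file("$dir/$f")) {
        echo "Trying $f\n";
        $words = preg_split('/\s+/', file_get_contents("$dir/$f"), -1, PREG_SPLIT_NO_EMPTY);
        foreach ($to_find as $x)
            for ($i = 0; $i + strlen($x) <= count($words); $i++) {
                $ok = true; // stays true only if every initial matches
                for ($j = 0; $j < strlen($x); $j++)
                    if (strtolower($words[$i + $j][0]) != strtolower($x[$j])) $ok = false;
                if ($ok) {
                    echo "Found $x around:";
                    for ($j = 0; $j < strlen($x); $j++)
                        echo ' ' . $words[$j + $i];
                    echo "\n";
                }
            }
    }
}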
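
If you want to try the matching step without downloading a corpus, here is a minimal sketch of it on a single string. The function name find_acrostics and the sample sentence are made up for illustration, and it assumes the text splits cleanly into words on whitespace.

```php
<?php
// Hypothetical helper (not part of the script above): returns every run of
// consecutive words whose first letters spell $name, ignoring case.
function find_acrostics(string $text, string $name): array
{
    // Split on whitespace and drop empty pieces so every word has a first letter.
    $words = preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);
    $len = strlen($name);
    $hits = [];

    // Slide a window of $len words across the text.
    for ($i = 0; $i + $len <= count($words); $i++) {
        $initials = '';
        for ($j = 0; $j < $len; $j++) {
            $initials .= $words[$i + $j][0];
        }
        // Case-insensitive comparison of the collected initials with the name.
        if (strcasecmp($initials, $name) === 0) {
            $hits[] = implode(' ', array_slice($words, $i, $len));
        }
    }
    return $hits;
}

print_r(find_acrostics('He said our bilateral and multilateral assistance continues', 'Obama'));
```

On that sample sentence the only hit is "our bilateral and multilateral assistance". Returning the hits as an array rather than echoing them makes it easy to count or filter them afterwards.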

And the results (and my silly comments):
Found Osama around: owned subsidiary a merger agreement [Conspiracy: Osama Bin Laden has infiltrated the Western world via Gulf-backed mergers, and/or funded by subsidiaries of oil companies]
Found Osama around: of such a market as
Found Osama around: officials said a major aim [oh my, that is ripe for conspiracy theory writing, but I'm treading carefully here…]
Found Stalin around: shut temporarily are located in North [Conspiracy: Stating that Stalin's Northern empire, the USSR, would shut temporarily in 4 years' time, but re-open soon after]
Found Blair around: BC LME ALUMINIUM IST RIN
Found Obama around: Omaha basis and midwest area [I'm not an expert on US politics, but I'm sure someone could make something of that]
Found Osama around: on South and Mid Atlantic [ditto]
Found Obama around: our bilateral and multilateral assistance [this sounds very Obama, right?]
Found Osama around: out such a meeting altogether [No peace talks then?]

As you can see, Osama comes up a lot. I have to tread carefully here: I am absolutely not going to make light-hearted humour relating to the 9/11 tragedy. But it is easy to see how someone could claim some kind of interpretation of these results, which are pretty darned arbitrary, as some kind of omen. That's overanalysis and something psychologists call 'apophenia', the tendency to see meaningful patterns in unrelated things; it's part of what makes us able to draw conclusions in a complex world, but also the cause of all kinds of craziness.

