Monday, 30 January 2012

Dual-Booting McAfee Endpoint Encryption - Linux Kernel Upgrade

I used the package manager to upgrade the kernel on my laptop to 3.0.0-15-generic yesterday evening, and when I tried to boot into Windows this morning, I couldn't: there was no Windows entry in the GRUB boot menu.

After a bit of cursing, when I finally calmed down, I realized what had happened. When a new kernel is installed, a new grub configuration file (grub.cfg) is generated. This is based, I think, on the output of os-prober and, crucially, because my Windows installation is encrypted, it is not detected and is therefore left out of the new configuration file.

In order to rectify this, I had to manually edit the grub configuration file (/boot/grub/grub.cfg) and add, following the last menuentry entry, this entry:
menuentry "Windows" {
        set root='(hd0,1)'
        chainloader +1
}
If your Windows installation is in another partition, then you'll need to modify the second line, e.g. set root='(hd1,1)' for the first partition of your second hard drive.

Do be careful when modifying /boot/grub/grub.cfg, as you could end up with a non-booting system. Bear in mind, too, that if you run update-grub2 this file will be overwritten with whatever grub detects.

Tuesday, 24 January 2012

It's a self signed world - Part 2. The joy of certificates - Part 7

Back in August I wrote a blog post describing how to use makecert to create a self signed CA. I also said that I would repeat the process using OpenSSL; well, your prayers have been answered. Since OpenSSL ships with most Linux distros and also works on Windows, it is the ideal tool for the job.

The key point of a self signed certificate is that, well, it is self signed, which means that it is only really good for development or testing, as it won't be trusted by external users, particularly if they access a website using a modern web browser. I guess you could use it for internal services too.

At any rate, I'm running this from CentOS 6.2 using OpenSSL 1.0.0-fips 29 Mar 2010. In order to get the OpenSSL version just type:
openssl version
These are the steps needed to create a self signed certificate using OpenSSL:
  1. Create server certificate private key: 
    openssl genrsa -des3 -out phpmyadmin.key 1024
  2. Create a Certificate Signing Request. This is what you would normally pass to a CA (e.g. Verisign) for them to generate a signed certificate from. They normally check that you are who you say you are and, after money has changed hands, they issue a certificate containing your public key, signed by their CA:
    openssl req -new -key phpmyadmin.key -out phpmyadmin.csr
  3. Remove the passphrase from the key. If you want to be prompted for the passphrase every time Apache starts, skip to step 4:
    cp phpmyadmin.key phpmyadmin.key.pass
    openssl rsa -in phpmyadmin.key.pass -out phpmyadmin.key
  4. Create public server certificate:
    openssl x509 -req -days 1000 -in phpmyadmin.csr -signkey phpmyadmin.key -out phpmyadmin.crt
That is it, you now have a private/public key pair that can be used for Apache; see this post for details on how to configure the certificates. Do note that you don't actually have a CA, so these two directives need to be commented out:
#   Server Certificate Chain:
SSLCertificateChainFile /etc/httpd/conf.d/certs/win2k8ca.cer

#   Certificate Authority (CA):
SSLCACertificateFile /etc/httpd/conf.d/certs/win2k8ca.cer
If you want to use this private/public key pair with IIS, then you need to convert it into a PKCS#12 format certificate, which you can do with the following command:
openssl pkcs12 -export -in phpmyadmin.crt -inkey phpmyadmin.key -out phpmyadmin.pfx
Please remember to import the certificate to the trusted root certification authorities of the server as well as your personal store to prevent any problems.

Installing secure phpMyAdmin on CentOS 6.2

Following on from Sunday's post on how to set up phpMyAdmin on CentOS 6.2, I thought it would be a good idea to set up phpMyAdmin as a secure website (HTTPS), rather than in clear-text (HTTP). This will ensure that all traffic between the web browser and phpMyAdmin is encrypted.

In a previous post I set up a Certification Authority so I will be using this CA to generate the necessary certificates, but don't worry if you don't have one, you can use makecert or OpenSSL to generate a self signed certificate.

All that is needed is a server certificate and a CA certificate. If you've followed my previous post on phpMyAdmin, you can go directly to step 7. Thus armed with a PKCS#12 server certificate (phpMyAdmin.pfx) and a CA certificate (win2k8ca.cer), we can start:
  1. Set SELinux to allow Apache to bind to a non-default port:
    setsebool -P allow_ypbind 1
  2. Download EPEL Release to enable usage of EPEL Repository: 
    wget http://download.fedora.redhat.com/pub/epel/6/i386/epel-release-6-5.noarch.rpm
  3. Install EPEL Release package:
    yum install epel-release-6-5.noarch.rpm -y
  4. Install phpMyAdmin:
    yum install phpmyadmin -y
  5. Create new directory to host the phpMyAdmin website: 
    mkdir /var/www/phpMyAdmin
  6. Copy phpMyAdmin installation to the directory created in the previous step: 
    cp -r /usr/share/phpMyAdmin/. /var/www/phpMyAdmin
  7. Extract public and private key from server certificate:
    openssl pkcs12 -in phpMyAdmin.pfx -out phpMyAdmin.key -nodes -nocerts
    openssl pkcs12 -in phpMyAdmin.pfx -out phpMyAdmin.crt -nodes -nokeys
  8. Restrict permissions on key file:
    chmod 400 phpMyAdmin.key
  9. Create certificate and key directories and move certificates and keys to them:
    mkdir /etc/httpd/conf.d/certs
    mkdir /etc/httpd/conf.d/keys
    mv phpMyAdmin.crt /etc/httpd/conf.d/certs
    mv phpMyAdmin.key /etc/httpd/conf.d/keys
    cp win2k8ca.cer /etc/httpd/conf.d/certs
  10. Set SELinux to permissive mode, so that it doesn't stop Apache from working properly while everything is being set up:
    setenforce 0
  11. Edit Apache's SSL configuration file (/etc/httpd/conf.d/ssl.conf). I have changed the port to 7777 and prevented LOW ciphers from being accepted. The rest is simply providing the location of the certificates. Only listing relevant parts of ssl.conf:
    Listen 7777

    <VirtualHost _default_:7777>

    #   SSL Cipher Suite:
    SSLCipherSuite ALL:!ADH:!EXPORT:!SSLv2:!LOW:RC4+RSA:+HIGH:+MEDIUM

    #   Server Certificate:

    SSLCertificateFile /etc/httpd/conf.d/certs/phpMyAdmin.crt

    #   Server Private Key:

    SSLCertificateKeyFile /etc/httpd/conf.d/keys/phpMyAdmin.key

    #   Server Certificate Chain:
    SSLCertificateChainFile /etc/httpd/conf.d/certs/win2k8ca.cer

    #   Certificate Authority (CA):
    SSLCACertificateFile /etc/httpd/conf.d/certs/win2k8ca.cer

    </VirtualHost>
    1. You can check that the apache configuration file is correct by using:
      apachectl -t 
  12. Restart Apache, using either of these commands:
    apachectl -k restart
    service httpd restart
  13. Open firewall for port 7777 and save IPTables configuration:
    iptables -I INPUT -p tcp --dport 7777 -j ACCEPT; service iptables save
  14. You can now navigate to https://phpmyadmin.dev.com:7777/setup (if you are using Chrome, you will see this screen first; other browsers will show similar screens). Note that you'll need an entry in your hosts file that points phpmyadmin.dev.com to the IP address of the server:
  15. Click Proceed anyway. You are seeing this because your CA is not trusted by Chrome.
    Although it would seem that the connection is not encrypted, the icon is misleading: it just means that the certificate is not trusted. See below for confirmation:
  16. Because I'm lazy, I'm going to reuse the screenshots and text from my previous phpMyAdmin post, so... Click New Server. I only changed the name and compression, and accepted the defaults for everything else:
  17. Go To Authentication Tab. See this link for an overview of the authentication types:
  18. Click Save, which will bring you to the screen below:
  19. Download the configuration file (config.inc.php) and copy it to /var/www/phpMyAdmin.
  20. You can now start using phpMyAdmin on https://phpmyadmin.dev.com:7777:
  21. All that remains is to re-enable SELinux and deal with the policy violations:
    cat /var/log/audit/audit.log | grep denied > ssl
    audit2allow -M apachessl -i ssl
    semodule -i apachessl.pp
    setenforce 1
Note that steps 2 & 3 simply add the EPEL repository to your yum repositories and install the repository key.

In theory, the setup script should be able to generate the configuration file for you, but I've not been able to get it to work. Instructions can be found here if you are interested. 

I haven't thoroughly tested this setup so it is possible, as always, that there could be SELinux issues. All I can suggest is that, if you have some inexplicable issue, have a look at the SELinux log (/var/log/audit/audit.log).

    Sunday, 22 January 2012

    Installing phpMyAdmin in CentOS 6.2 (netinstall)

    I really have no issue with using a terminal; in fact I quite love the geekiness associated with it. But for some reason I never feel comfortable using a terminal to manage MySQL, which is why I love phpMyAdmin.

    I am installing phpMyAdmin on a machine that hosts Joomla (see this post for more details). In practical terms this means that a second website will be needed to host phpMyAdmin. Whether you host this site on a different port or under a different host header is up to you; the process is fairly similar. Do bear in mind that using a different port has implications for your firewall configuration. In this post I will be using a different port.

    It is worth bearing in mind that this configuration is not secure and as such should only be used on internal networks. Although running the website on a non-standard port will provide obscurity, it does not provide security. Have a look at this post for a secure phpMyAdmin installation guide.

    Unfortunately phpMyAdmin is not, at the time of writing, included with RHEL based systems. Luckily, it is packaged by the Extra Packages for Enterprise Linux (EPEL) special interest group. This means that the EPEL repository can be used to install phpMyAdmin, thus obviating the need to install it from source.

    Here are the steps needed to install phpMyAdmin in a CentOS 6.2 server:
    1. Set SELinux to allow Apache to bind to a non-default port:
      setsebool -P allow_ypbind 1
    2. Download EPEL Release to enable usage of EPEL Repository: 
      wget http://download.fedoraproject.org/pub/epel/6/i386/epel-release-6-5.noarch.rpm
    3. Install EPEL Release package:
      yum install epel-release-6-5.noarch.rpm -y
    4. Install phpMyAdmin:
      yum install phpmyadmin -y
    5. Create new directory to host the phpMyAdmin website: 
      mkdir /var/www/phpMyAdmin
    6. Copy phpMyAdmin installation to the directory created in the previous step: 
      cp -r /usr/share/phpMyAdmin/. /var/www/phpMyAdmin
    7. Add a new virtual host to Apache, by editing the Apache configuration file /etc/httpd/conf/httpd.conf, see this post for more details. Relevant parts of httpd.conf:
      Listen 80
      Listen 8888

      NameVirtualHost *:80
      NameVirtualHost *:8888

      <VirtualHost *:80>
          ServerAdmin manyrootsofallevil@myhost.com
          DocumentRoot /var/www/html
          ServerName  Joomla
          ErrorLog logs/Joomla_error
          CustomLog logs/Joomla-access_log common
      </VirtualHost>

      <VirtualHost *:8888>
          ServerAdmin manyrootsofallevil@myhost.com
          DocumentRoot /var/www/phpMyAdmin
          ServerName  Joomla
          ErrorLog logs/phpMyAdmin_error
          CustomLog logs/phpMyAdmin-access_log common
      </VirtualHost>
      1. You can check that the apache configuration file is correct by using:
        apachectl -t 
    8. Restart Apache, using either of these commands:
      apachectl -k restart
      service httpd restart
    9. Open firewall for port 8888 and save IPTables configuration:
      iptables -I INPUT -p tcp --dport 8888 -j ACCEPT; service iptables save
    10. From a browser navigate to http://localhost:8888/setup :
    11. Click New Server. I only changed the name and compression, accepted defaults for everything else:
    12. Go To Authentication Tab. See this link for an overview of the authentication types:
    13. Click Save, which will bring you to the screen below:
    14. Download the configuration file (config.inc.php) and copy it to /var/www/phpMyAdmin.
    15. You can now start using phpMyAdmin on http://192.168.1.65:8888

    Note that steps 2 & 3 simply add the EPEL repository to your yum repositories and install the repository key.

    In theory, the setup script should be able to generate the configuration file for you, but I've not been able to get it to work. Instructions can be found here if you are interested.

    Thursday, 19 January 2012

    Regional Settings in Windows 2003

    We've got a SOAP webservice that parses a string into a datetime object and today, after over three months, one of our test servers started throwing errors back.

    After I got over the shock of learning that they were promoting some code into production this weekend without letting us know, hey after all, we are only the guys who look after the app, I started digging.

    The first thing I learnt was that all the failures were due to the same input date, 22/01/2012, which should have worked. Except that it didn't, because the server was expecting MM/DD/YYYY format rather than DD/MM/YYYY format, in other words US date settings.
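
To make the failure concrete, here is a minimal sketch of how the same string fares under the two conventions. It is written in Python, not the .NET code of the actual service (which isn't shown in this post), and the function names parse_uk and parse_us are made up for the illustration.

```python
from datetime import date, datetime


def parse_uk(text: str) -> date:
    """Parse a DD/MM/YYYY string (UK convention)."""
    return datetime.strptime(text, "%d/%m/%Y").date()


def parse_us(text: str) -> date:
    """Parse a MM/DD/YYYY string (US convention)."""
    return datetime.strptime(text, "%m/%d/%Y").date()


if __name__ == "__main__":
    sample = "22/01/2012"
    print("UK format:", parse_uk(sample))
    try:
        parse_us(sample)
    except ValueError as err:
        # There is no month 22, so the US format rejects the string.
        print("US format failed:", err)
```

Note that a date such as 05/01/2012 parses under both formats, just to different days, so this kind of misconfiguration does not always show up as an error.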

    I remembered that this issue had taken place at some point in the past and they had taken the server down. I also remember setting the regional settings to UK for the service account and bringing the server back up or at least I thought I did.

    I logged on to the server with the service account and sure enough it was set to UK settings, eh??

    So I decided to ask the Oracle, which threw this link back at me: 

    On the Advanced Settings tab:
    Click to select the Apply all settings to the current user account and to the default user profile check box to apply changes to the default user profile.
    So I did and lo and behold it started working again.

    I must say that I don't really understand why it wasn't working before, but it would seem that the default profile was having an effect. Perhaps it is used by the .NET framework to determine regional settings; I'm not too sure and I can't investigate from home.

    Still quite funny to receive calls from the PM, the programme manager and then the account executive telling me that the issue needed fixing A.S.A.P. as it would result in embarrassment if the latest build wasn't promoted to production this weekend. Never mind that there was no real issue, somebody had determined that a failure had occurred and it needed fixing. This place....

    Monday, 16 January 2012

    Change Trace Logging Directory in MS Dynamics CRM 2011


    Microsoft has finally decided that we are all big boys now and that we can make some decisions, including where the trace logging files should go. The TraceDirectory key in the registry still gets summarily ignored, but it is possible to change the directory using PowerShell. It's a fairly simple process:
    1. Start PowerShell.
    2. Load the CRM PS snap-in:  Add-PSSnapin Microsoft.Crm.PowerShell
    3. Store Trace Settings into variable to allow easy editing: $trace = Get-CRMSetting TraceSettings
    4. Change Directory to your preferred value: $trace.Directory="D:\Trace"
    5. Save Settings: Set-CRMSetting $trace
    6. You can check that the new setting has been set with: Get-CRMSetting TraceSettings
    CallStack     : True
    Categories    : *:Error
    Directory     : D:\Trace
    Enabled       : True
    FileSize      : 10
    ExtensionData : System.Runtime.Serialization.ExtensionDataObject
    Note that the directory on step 4 needs to exist.

    If you haven’t enabled trace logging, you can do it before step 4:
    $trace.Enabled = "True"
    Thank you Microsoft.

    Sunday, 15 January 2012

    Windows Azure from Windows XP

    A few days ago a colleague pointed out that thanks to our subscription to MSDN we get free access to the Microsoft cloud system, Windows Azure. So I thought I would give it a try, boy was I in for a surprise.

    After signing up and logging in, I made an attempt to get the SDK to see what all the fuss was about, and this is where all the fun began. Microsoft's answer to Yum or Aptitude, the Microsoft Web Platform Installer, was installed and then all the pre-requisites were installed in order, which takes an awfully long time. When you think it has finished, you realize it has failed to install everything because, and I'm not kidding, IIS 7.x is needed. Say what? I'm running Windows XP, why the b£$%^& h!"£ would it not check this first? I appreciate that each installer will run its own checks, but why doesn't MS Web PI check this? Why isn't it clear that development on Windows XP for Windows Azure is no longer supported, if it ever was? I sometimes despair with Microsoft.

    You need to go to the Microsoft download page, e.g. the one for SDK 1.4 itself, to check the pre-requisites, as there is nothing on the Windows Azure page itself. If you click install, this launches MS Web PI and then you run through the whole frustrating process.

    Saturday, 14 January 2012

    The Linux usability problem

    It seems that every time I want to do something quickly on my Kubuntu laptop it almost always turns into a bit of a nightmare. I'm quite comfortable using a console, vi does not intimidate me anymore and I consider myself to be reasonably tech savvy, but Linux at home really does try my patience.

    At work, I really don't mind if I have to spend two hours to do something that takes 3 minutes in Windows (e.g. joining a RHEL 6 box to a Windows 2003 AD); I consider it the price one pays for freedom and most of all I really do relish the challenge. If something is hard when it's finally achieved it feels like so much more of an achievement, because it is, even if in reality it isn't, you just didn't know how to do it. 
    However, when I'm at home I want things to work, I already spend 8 to 10 hours a day dealing with frustration, I don't want to do the same at home and Windows 7 just works. Just two examples that have led to loads of frustration recently:
    1. I wanted to have a look at a web page that had a Java applet, which kinda works out of the box in Windows, yet I could not get it to work in Firefox, and when I got it working in rekonq, it did not work properly. I know there are guides to install the Java plugin for firebadger but I couldn't get it to work. I also know I'm an idiot and an ignoramus for not getting it to work, still. I had similar issues with Flash before I upgraded to 11.10.
    2. I acquired a Logitech webcam to do video conferencing with friends and family. I've also used it to record a few videos clowning about for said friends and family from my Windows 7 desktop. I wanted to do some clowning about in the sitting room using my Kubuntu laptop but I could not get any of the various options (VLC, Mplayer, Cheese, UVC) to work properly. No sound, would not record, would crash, etc. Compared to plug and play, download Windows Movie Maker and off I went, it's no contest.
    I guess if I didn't work in application development/support I might enjoy the challenge at home, I know I used to, but now it's just a lot of hassle for relatively little reward, I think I'm sticking with Windows 7 for home use for the time being.

    To some extent the problem is due to lack of manufacturer support, but seeing as Linux currently commands a 1% share of the desktop market, manufacturers figure it's not worth the hassle, and so fewer people make the jump, which in turn gives manufacturers no reason to invest in Linux: in other words, a classic catch-22 situation.

    Monday, 9 January 2012

    Document Distance

    It was a quiet day on Friday and I had a monster 90+ design document to review. Dry does not even begin to describe the document, so I decided to leave it for Monday and do something a little bit more intellectually stimulating. I went through my list of to-visit bookmarks and I stumbled upon the free online courses from MIT, so after going through what was available, I settled on a CS course on algorithms, Introduction to Algorithms.

    The first problem is the Document Distance problem, which in essence uses a cosine similarity algorithm to establish how similar two documents are, and which could be used for plagiarism detection. Can this simple algorithm help us determine whether Sir Francis Bacon wrote William Shakespeare's plays?
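
As a rough illustration of the idea (in Python rather than the C# used for the app further down, and not the course's reference solution), each document can be treated as a vector of word counts, with the distance being the angle between the two vectors. The function names are made up for this sketch, and words are simplified to runs of ASCII letters and digits.

```python
import math
import re
from collections import Counter


def word_counts(text: str) -> Counter:
    """Count lower-cased runs of ASCII letters and digits."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def inner_product(a: Counter, b: Counter) -> int:
    """Sum of count products over the words the two documents share."""
    return sum(count * b[word] for word, count in a.items() if word in b)


def document_distance(text1: str, text2: str) -> float:
    """Angle in radians between the two word-count vectors."""
    a, b = word_counts(text1), word_counts(text2)
    denominator = math.sqrt(inner_product(a, a) * inner_product(b, b))
    if denominator == 0:
        raise ValueError("both documents must contain at least one word")
    cosine = inner_product(a, b) / denominator
    # Rounding can push the ratio a hair above 1.0, which acos rejects.
    return math.acos(min(1.0, cosine))
```

A distance of 0 means the two documents have the same relative word frequencies, while pi/2 means they have no words in common.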

    The data set used can be found here or you could visit the Project Gutenberg's page and download a few ebooks to test the algorithm with.

    I've created a small console app that can compare two files, all the files within a directory, or the files within two directories. I first implemented the definition of word used by the course:
    A word is a consecutive sequence of alphanumeric characters, such as "Hamlet" or "2007". We'll treat all upper-case letters as if they are lower-case, so that "Hamlet" and "hamlet" are the same word. Words end at a non-alphanumeric character, so "can't" contains two words: "can" and "t".
    However, I was not too satisfied with this definition, so I defined a word as:
    A word is a sequence of alphanumeric characters that may contain an apostrophe, such as "Hamlet" or "can't" surrounded by a space or a punctuation mark. We'll treat all upper-case letters as if they are lower-case, so that "Hamlet" and "hamlet" are the same word.
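
The difference between the two definitions can be sketched with a pair of regular expressions. This is Python rather than the C# of the app itself, the patterns are only an approximation of the definitions above (ASCII only, and a trailing apostrophe as in "computers'" is simply dropped), and the function names are made up for the illustration.

```python
import re
from typing import List

# ASCII-only patterns; an assumption made to keep the sketch short.
COURSE_WORD = re.compile(r"[a-z0-9]+")
APOSTROPHE_WORD = re.compile(r"[a-z0-9]+(?:'[a-z0-9]+)*")


def course_words(text: str) -> List[str]:
    """Course definition: any run of alphanumeric characters."""
    return COURSE_WORD.findall(text.lower())


def apostrophe_words(text: str) -> List[str]:
    """Alternative definition: apostrophes inside a word are kept."""
    return APOSTROPHE_WORD.findall(text.lower())


if __name__ == "__main__":
    sample = "Hamlet can't!!"
    print(course_words(sample))
    print(apostrophe_words(sample))
```

With the first function "can't" is split into "can" and "t"; with the second it survives as a single word.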
    Source Code:

       1 using System;
       2 using System.Collections.Generic;
       3 using System.Linq;
       4 using System.Text;
       5 using System.IO;
       6 
       7 namespace DocumentDistance
       8 {
       9     class Program
      10     {
      11         static void Main(string[] args)
      12         {
      13              switch (args.Length)
      14              {
      15                  case 3:
      16                      if (args[0].ToLower().Equals("-l") || args[0].ToLower().Equals("-c"))
      17                      {
      18                          Process(args);
      19                      }
      20                      else
      21                      {
      22                          Usage();
      23                      }
      24                      break;
      25                  case 4:
      26                      if (args[1].ToLower().Equals("-dd") && (args[0].ToLower().Equals("-l") || args[0].ToLower().Equals("-c")))
      27                      {
      28                          Process(args);
      29                      }
      30                      else
      31                      {
      32                          Usage();
      33                      }
      34                      break;
      35                  default: Usage();
      36                      break;
      37              }
      38        
      39         }
      40 
      41         /// <summary>
      42         /// This is a wrapper method to invoke all processing methods.
      43         /// </summary>
      44         /// <param name="args"></param>
      45         private static void Process(string[] args)
      46         {
      47             try
      48             {
      49                 //Each List stores the words of each book.
      50                 List<string>[] words;
      51 
      52                 DirectoryInfo di;
      53                 FileInfo[] files, filesdir1, filesdir2;
      54 
      55                 if (args[1].ToLower().Equals("-d"))
      56                 {
      57                     //Only Grab txt. files. Limiting but problably ok.
      58                     di = new DirectoryInfo(args[2]);
      59                     files = di.GetFiles("*.txt");
      60                 }
      61                 else if (args[1].ToLower().Equals("-dd"))
      62                 {
      63                     di = new DirectoryInfo(args[2]);
      64                     filesdir1 = di.GetFiles("*.txt");
      65                     di = new DirectoryInfo(args[3]);
      66                     filesdir2 = di.GetFiles("*.txt");
      67 
      68                     files = new FileInfo[filesdir1.Length + filesdir2.Length - 1];
      69 
      70                     for (int i = 0; i < filesdir1.Length; i++)
      71                     {
      72                         files[i] = filesdir1[i];
      73                     }
      74 
      75                     for (int i = 0, j = filesdir1.Length - 1; i < filesdir2.Length; j++, i++)
      76                     {
      77                         files[j] = filesdir2[i];
      78                     }
      79                 }
      80                 else
      81                 {
      82                     files = new FileInfo[2];
      83                     files[0] = new FileInfo(args[1]);
      84                     files[1] = new FileInfo(args[2]);
      85                 }
      86 
      87 
      88                 if (args[0].ToLower().Equals("-l"))
      89                 {
      90                     //Read all the words for each file in the directory 
      91                     words = ReadFilesLine(files);
      92                 }
      93                 else
      94                 {
      95                     //Read all the words for each file in the directory 
      96                     words = ReadFilesChar(files);
      97                 }
      98                 //Each Dictionary stores the unique words in each book.
      99                 Dictionary<string, int>[] dicts = new Dictionary<string, int>[words.Length];
     100 
     101                 //Count unique words for each book
     102                 for (int i = 0; i < words.Length; i++)
     103                 {
     104                     dicts[i] = CountWords(words[i], files[i].Name);
     105                 }
     106 
     107                 //Compare the books
     108                 for (int i = 0; i < words.Length - 1; i++)
     109                 {
     110                     for (int j = i + 1; j < words.Length; j++)
     111                     {
     112                         CalculateDistance(files[i].Name, files[j].Name, dicts[i], dicts[j]);
     113                     }
     114                 }
     115 
     116                 Console.ReadLine();
     117             }
     118             catch (Exception ex)
     119             {
     120                 Console.WriteLine("Exception: {0}", ex);
     121                 Console.ReadLine();
     122             }
     123         }
     124 
     125         /// <summary>
     126         /// Display invocation information
     127         /// </summary>
     128         private static void Usage()
     129         {
     130             Console.WriteLine("Usage is DD.exe <ProcessSwitch> txtfile1 txtfile2 or DD.exe <ProcessSwitch> -d Directory Path");
     131             Console.WriteLine("or DD.exe <ProcessSwitch> -dd DirectoryPath1 DirectoryPath2");
     132             Console.WriteLine("where <ProcessSwitch> is either -c or -l");
     133             Console.WriteLine("-c : words are defined as any alphanumeric string so that 'can't' is two words (can and t) but cant is one");
     134             Console.Write("-l : words are defined as any alphanumeric string with a space at either end, apart from words ");
     135             Console.Write("ending in an apostrophe e.g. computers' to indicate possesion");
     136             Console.ReadLine();
     137         }
     138 
     139         /// <summary>
     140         /// This is wrapper method that invokes the InnerProduct method to calculate the "distance" between two documents
     141         /// </summary>
     142         /// <param name="file1">First Book File Name</param>
     143         /// <param name="file2">Second Book File Name</param>
     144         /// <param name="dict1">Dictionary containing list of unique words for the first book</param>
     145         /// <param name="dict2">Dictionary containing list of unique words for the second book</param>
     146         private static void CalculateDistance(string file1, string file2, Dictionary<string, int> dict1, Dictionary<string, int> dict2)
     147         {
     148             long numerator = InnerProduct(dict1, dict2);
     149             double denominator = Math.Sqrt(InnerProduct(dict1, dict1) * InnerProduct(dict2, dict2));
     150 
     151             //This is the Document Distance
     152             double theta = Math.Acos(numerator / denominator);
     153 
     154             //Calculate Document Distance as a percentage of similarity.
     155             double percentage = ((theta / Math.PI * 2) - 1) * -100;
     156 
     157             Console.WriteLine("The distance between {0} and {1} is:{2:F6} or {3:F3}% similarity.", file1, file2, theta, percentage);
     158         }
     159 
     160         /// <summary>
     161         /// Read the passed in file and generate a list of words contained in the file. In this method a word is any alphanumeric char array
     162         /// e.g. can't is actually two words can and t
     163         /// </summary>
     164         /// <param name="filename">Path to file</param>
     165         /// <returns>list of words</returns>
     166         private static List<string> ReadFile(string filename)
     167         {
     168             try
     169             {
     170                 StreamReader file = new StreamReader(filename);
     171 
     172                 string word = string.Empty;
     173 
     174                 List<string> words = new List<string>();
     175 
     176                 char n;
     177 
     178                 do
     179                 {
     180                     n = (char)file.Read();
     181 
     182                     if (char.IsLetterOrDigit(n))
     183                     {
     184                         word += n;
     185                     }
     186                     else
     187                     {
     188                         if (!string.IsNullOrEmpty(word))
     189                         {
     190                             words.Add(word.ToLower());
     191                         }
     192 
     193                         word = null;
     194                     }
     195 
     196                 } while (n != (char)65535);
     197 
     198                 return words;
     199             }
     200             catch (Exception ex)
     201             {
     202 
     203                 throw new Exception("Exception While Reading file", ex.InnerException);
     204             }
     205         }
     206 
        /// <summary>
        /// Read the passed in files and generate a list of words contained in each file.
        /// A word is the alphanumeric part of any string surrounded by spaces.
        /// e.g. hello!!! will be stored as hello
        /// e.g. can't!! will be stored as can't
        /// See CheckAlphaNumeric Method for more details
        /// </summary>
        /// <param name="files">List of files to be read</param>
        /// <returns>Array containing a list of words for each file</returns>
        private static List<string>[] ReadFilesLine(FileInfo[] files)
        {
            List<string>[] words;
            string word;
            string line;
            string[] tempwords;

            try
            {
                words = new List<string>[files.Length];

                for (int i = 0; i < files.Length; i++)
                {
                    //Note that we need to instantiate each List<string> object in the array before we can use it.
                    words[i] = new List<string>();

                    //The using block closes the file even if an exception is thrown.
                    using (StreamReader file = new StreamReader(files[i].FullName))
                    {
                        //ReadLine returns null at the end of the file, so an empty file is simply skipped.
                        while ((line = file.ReadLine()) != null)
                        {
                            //Get the words by splitting on ' '.
                            tempwords = line.Split(new char[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);

                            foreach (string s in tempwords)
                            {
                                word = CheckAlphaNumeric(s);

                                if (!word.Equals(string.Empty))
                                {
                                    words[i].Add(word);
                                }
                            }
                        }
                    }
                }

                return words;
            }
            catch (Exception ex)
            {
                //Pass the caught exception itself so that the original cause is not lost.
                throw new Exception("Exception While Reading file", ex);
            }
        }

        /// <summary>
        /// Wrapper method for multiple files using ReadFile method.
        /// </summary>
        /// <param name="files">List of files to be read</param>
        /// <returns>Array containing a list of words for each file</returns>
        private static List<string>[] ReadFilesChar(FileInfo[] files)
        {
            List<string>[] words = new List<string>[files.Length];

            try
            {
                for (int i = 0; i < files.Length; i++)
                {
                    //ReadFile returns a new list, so there is no need to instantiate one here first.
                    words[i] = ReadFile(files[i].FullName);
                }

                return words;
            }
            catch (Exception ex)
            {
                //Pass the caught exception itself so that the original cause is not lost.
                throw new Exception("Exception While Reading file", ex);
            }
        }

        /// <summary>
        /// This method ensures that punctuation marks and other non alphanumeric characters are not stored as
        /// part of the words. The input is a string obtained by using string.Split(' '), which means that
        /// all question marks, exclamation marks, etc. will be included. Clearly, these are not desirable,
        /// so they are removed.
        /// </summary>
        /// <param name="s">word as split by string.Split(' ') method</param>
        /// <returns>A word free of surrounding punctuation marks</returns>
        private static string CheckAlphaNumeric(string s)
        {
            //-1 means "not found", so that a match at index 0 is not confused with no match at all.
            int start = -1, end = -1;

            //Find the first alphanumeric character, searching from the start of the string.
            for (int j = 0; j < s.Length; j++)
            {
                if (char.IsLetterOrDigit(s[j]))
                {
                    start = j;
                    break;
                }
            }

            //No alphanumeric character exists in the input string,
            //thus we ignore it by returning an empty string.
            if (start == -1)
            {
                return string.Empty;
            }

            //Find the last character to keep, searching from the end of the string.
            for (int j = s.Length - 1; j >= start; j--)
            {
                //A trailing apostrophe is kept. This allows us to differentiate between hackers' and hackers.
                if (char.IsLetterOrDigit(s[j]) || s[j] == '\'')
                {
                    end = j;
                    break;
                }
            }

            return s.Substring(start, (end - start) + 1).ToLower();
        }

        /// <summary>
        /// Calculates the inner product of the word frequencies.
        ///  In other words, Σ_w D_1(w)D_2(w),
        ///  where D_1 is the word frequency for the first document
        ///  and D_2 for the second
        /// </summary>
        /// <param name="dict1">List of unique words and frequencies for first document</param>
        /// <param name="dict2">List of unique words and frequencies for second document</param>
        /// <returns>Sum of the products of the frequencies of the words that appear in both texts</returns>
        private static long InnerProduct(Dictionary<string, int> dict1, Dictionary<string, int> dict2)
        {
            long sum = 0;
            int frequency2;

            //If the word is in both texts add the product of its frequencies, otherwise it contributes zero.
            foreach (var pair in dict1)
            {
                if (dict2.TryGetValue(pair.Key, out frequency2))
                {
                    //Cast to long before multiplying so that large frequencies cannot overflow an int.
                    sum += (long)pair.Value * frequency2;
                }
            }

            return sum;
        }

        /// <summary>
        /// This method generates a list of words and their frequencies. This is
        /// stored in a Dictionary&lt;string,int&gt; object.
        /// </summary>
        /// <param name="words">List of Words</param>
        /// <param name="filename">File Name</param>
        /// <returns>List of words and their frequencies</returns>
        private static Dictionary<string, int> CountWords(List<string> words, string filename)
        {
            Dictionary<string, int> dict = new Dictionary<string, int>();

            foreach (string s in words)
            {
                string key = s.ToLower();
                int count;

                //TryGetValue leaves count at 0 when the word has not been seen yet.
                dict.TryGetValue(key, out count);
                dict[key] = count + 1;
            }

            //GetFileNameWithoutExtension also copes with file names that have no extension.
            Console.WriteLine("{0}: Total Words {1}, Distinct Words {2}", Path.GetFileNameWithoutExtension(filename), words.Count, dict.Count);

            return dict;
        }

    }
}
    


    Let's see whether there are significant differences between the two word definitions:
    dd -c t1.verne.txt t2.bobsey.txt
    t1.verne: Total Words 8943, Distinct Words 2150
    t2.bobsey: Total Words 49785, Distinct Words 3354
    The distance between t1.verne.txt and t2.bobsey.txt is:0.582949 or 62.888% similarity.
    dd -l t1.verne.txt t2.bobsey.txt
    t1.verne: Total Words 8629, Distinct Words 2189
    t2.bobsey: Total Words 47554, Distinct Words 3655
    The distance between t1.verne.txt and t2.bobsey.txt is:0.538879 or 65.694% similarity.
    There is a bit of a change, but it is not massive. Another comparison:
    dd -c t2.bobsey.txt t3.lewis.txt
    t2.bobsey: Total Words 49785, Distinct Words 3354
    t3.lewis: Total Words 182355, Distinct Words 8530
    The distance between t2.bobsey.txt and t3.lewis.txt is:0.574160 or 63.448% similarity.

    dd -l t2.bobsey.txt t3.lewis.txt
    t2.bobsey: Total Words 47554, Distinct Words 3655
    t3.lewis: Total Words 180645, Distinct Words 9022
    The distance between t2.bobsey.txt and t3.lewis.txt is:0.521558 or 66.797% similarity.
    Again, there is the same sort of variation between the word definitions, so from now on I'll stick to my definition of a word. The other crucial point is that the similarity between the texts is quite high, ~65%. How about if we test two novels from the same author, say Jane Austen?
    Emma: Total Words 160028, Distinct Words 10450
    Pride and Prejudice: Total Words 124224, Distinct Words 7181
    The distance between Emma.txt and Pride and Prejudice.txt is:0.183119 or 88.342% similarity.
    In fact, the lowest similarity between a selected few of Jane Austen's novels is ~85%, and between a few of Arthur Conan Doyle's novels it is ~87%. So it would be reasonable to suppose an ~85% similarity between novels by the same author, or would it?

    It turns out that it's not quite as simple as that. See below a comparison between several of Dickens' novels:
    A Tale of Two Cities: Total Words 138321, Distinct Words 11202
    David Coperfield: Total Words 356158, Distinct Words 20083
    Great Expectations: Total Words 186598, Distinct Words 12711
    Hard Times: Total Words 105454, Distinct Words 10543
    Oliver Twist: Total Words 160480, Distinct Words 13350
    The Life And Adventures Of Nicholas Nickleby: Total Words 324043, Distinct Words 20720
    The Pickwick Papers: Total Words 300922, Distinct Words 21384
    The distance between A Tale of Two Cities.txt and David Coperfield.txt is:0.378243 or 75.920% similarity.
    The distance between A Tale of Two Cities.txt and Great Expectations.txt is:0.314171 or 79.999% similarity.
    The distance between A Tale of Two Cities.txt and Hard Times.txt is:0.256404 or 83.677% similarity.
    The distance between A Tale of Two Cities.txt and Oliver Twist.txt is:0.169488 or 89.210% similarity.
    The distance between A Tale of Two Cities.txt and The Life And Adventures Of Nicholas Nickleby.txt is:0.186920 or 88.100% similarity.
    The distance between A Tale of Two Cities.txt and The Pickwick Papers.txt is:0.257100 or 83.633% similarity.
    The distance between David Coperfield.txt and Great Expectations.txt is:0.153898 or 90.203% similarity.
    The distance between David Coperfield.txt and Hard Times.txt is:0.268132 or 82.930% similarity.
    The distance between David Coperfield.txt and Oliver Twist.txt is:0.436028 or 72.242% similarity.
    The distance between David Coperfield.txt and The Life And Adventures Of Nicholas Nickleby.txt is:0.342514 or 78.195% similarity.
    The distance between David Coperfield.txt and The Pickwick Papers.txt is:0.466544 or 70.299% similarity.
    The distance between Great Expectations.txt and Hard Times.txt is:0.252974 or 83.895% similarity.
    The distance between Great Expectations.txt and Oliver Twist.txt is:0.375174 or 76.116% similarity.
    The distance between Great Expectations.txt and The Life And Adventures Of Nicholas Nickleby.txt is:0.296311 or 81.136% similarity.
    The distance between Great Expectations.txt and The Pickwick Papers.txt is:0.425808 or 72.892% similarity.
    The distance between Hard Times.txt and Oliver Twist.txt is:0.306853 or 80.465% similarity.
    The distance between Hard Times.txt and The Life And Adventures Of Nicholas Nickleby.txt is:0.218641 or 86.081% similarity.
    The distance between Hard Times.txt and The Pickwick Papers.txt is:0.364912 or 76.769% similarity.
    The distance between Oliver Twist.txt and The Life And Adventures Of Nicholas Nickleby.txt is:0.203269 or 87.059% similarity.
    The distance between Oliver Twist.txt and The Pickwick Papers.txt is:0.201226 or 87.190% similarity.
    The distance between The Life And Adventures Of Nicholas Nickleby.txt and The Pickwick Papers.txt is:0.272799 or 82.633% similarity.
    There are three comparisons that yield less than 73% and another three that yield ~76%, so perhaps we need to recalibrate how similar two documents need to be before we can claim that they have been written by the same author. Three more comparisons: 
    Emma: Total Words 160028, Distinct Words 10450
    David Coperfield: Total Words 356158, Distinct Words 20083
    The distance between Emma.txt and David Coperfield.txt is:0.350631 or 77.678% similarity.
    Emma: Total Words 160028, Distinct Words 10450
    Hard Times: Total Words 105454, Distinct Words 10543
    The distance between Emma.txt and Hard Times.txt is:0.295020 or 81.218% similarity.
    Great Expectations: Total Words 186598, Distinct Words 12711
    A Study in Scarlet: Total Words 46660, Distinct Words 6418
    The distance between Great Expectations.txt and A Study in Scarlet.txt is:0.297589 or 81.055% similarity.
    Oh dear! They've all been written by the same literary master, but by whom?

    These comparisons clearly show the limitations of this very simple type of word frequency analysis. You weren't really expecting to find out whether Sir Francis Bacon wrote Shakespeare's plays with a 400-line application, were you?
    dd -l t8.shakespeare.txt t9.bacon.txt
    t8.shakespeare: Total Words 898261, Distinct Words 30241
    t9.bacon: Total Words 55612, Distinct Words 7943
    The distance between t8.shakespeare.txt and t9.bacon.txt is:0.530085 or 66.254% similarity.
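
As a footnote on how to read the figures above: the similarity percentages are consistent with treating the distance as an angle in radians and scaling it against π/2, e.g. 1 - 0.183119 / (π/2) gives 88.342%. Below is a minimal sketch of that relationship. It assumes the distance is the arccosine of the normalised inner product of the two word-frequency tables, which is what the printed numbers suggest. The names DistanceSketch, Angle and SimilarityPercent are made up for illustration and are not taken from the listing.

```csharp
using System;
using System.Collections.Generic;

// Illustrative sketch only: names and structure are assumptions, not the original program.
public static class DistanceSketch
{
    // Sum over the shared words of the product of their frequencies.
    public static long InnerProduct(IDictionary<string, int> a, IDictionary<string, int> b)
    {
        long sum = 0;
        int other;

        foreach (KeyValuePair<string, int> pair in a)
        {
            if (b.TryGetValue(pair.Key, out other))
            {
                // Cast first so the multiplication is done in 64 bits.
                sum += (long)pair.Value * other;
            }
        }

        return sum;
    }

    // Assumed definition: the angle (in radians) between the two frequency vectors.
    // Both tables must be non-empty, otherwise the denominator is zero.
    public static double Angle(IDictionary<string, int> a, IDictionary<string, int> b)
    {
        double numerator = InnerProduct(a, b);
        double denominator = Math.Sqrt((double)InnerProduct(a, a) * InnerProduct(b, b));

        // Clamp to 1 so that rounding error cannot push the ratio outside Acos's domain.
        return Math.Acos(Math.Min(1.0, numerator / denominator));
    }

    // Maps an angle of 0 to 100% and an angle of pi/2 (no words in common) to 0%.
    public static double SimilarityPercent(double angle)
    {
        return (1.0 - angle / (Math.PI / 2.0)) * 100.0;
    }
}
```

For example, DistanceSketch.SimilarityPercent(0.183119) comes out at roughly 88.342, which matches the Emma and Pride and Prejudice line above.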