It was a quiet day on Friday and I had a monster 90+ design document to review. Dry does not even begin to describe the document, so I decided to leave it for Monday and do something a little bit more intellectually stimulating. I went through my list of to visit bookmarks and I stumbled upon the Free Online courses from MIT, so after going through what was available, I settled on a CS course on algorithms, Introduction to Algorithms.
The first problem is the Document Distance problem, which in essence uses a Cosine similarity algorithm to establish how similar two documents are, which could be use for plagiarism detection. Can this simple algorithm help us determine whether Sir Francis Bacon wrote William Shakespeare's plays?
The data set used can be found here or you could visit the Project Gutenberg's page and download a few ebooks to test the algorithm with.
I've created a small console app that can compare two files, all within a directory and files within two directories. I first implemented the definition of word used by the course:
A word is a consecutive sequence of alphanumeric characters, such as "Hamlet" or "2007". We'll treat all upper-case letters as if they are lower-case, so that "Hamlet" and "hamlet" are the same word. Words end at a non-alphanumeric character, so "can't" contains two words: "can" and "t".
However, I was not too satisfied with this definition, so I defined a word as:
Source Code:A word is a sequence of alphanumeric characters that may contain an apostrophe, such as "Hamlet" or "can't" surrounded by a space or a punctuation mark. We'll treat all upper-case letters as if they are lower-case, so that "Hamlet" and "hamlet" are the same word.
1 using System; 2 using System.Collections.Generic; 3 using System.Linq; 4 using System.Text; 5 using System.IO; 6 7 namespace DocumentDistance 8 { 9 class Program 10 { 11 static void Main(string[] args) 12 { 13 switch (args.Length) 14 { 15 case 3: 16 if (args[0].ToLower().Equals("-l") || args[0].ToLower().Equals("-c")) 17 { 18 Process(args); 19 } 20 else 21 { 22 Usage(); 23 } 24 break; 25 case 4: 26 if (args[1].ToLower().Equals("-dd") && (args[0].ToLower().Equals("-l") || args[0].ToLower().Equals("-c"))) 27 { 28 Process(args); 29 } 30 else 31 { 32 Usage(); 33 } 34 break; 35 default: Usage(); 36 break; 37 } 38 39 } 40 41 /// <summary> 42 /// This is a wrapper method to invoke all processing methods. 43 /// </summary> 44 /// <param name="args"></param> 45 private static void Process(string[] args) 46 { 47 try 48 { 49 //Each List stores the words of each book. 50 List<string>[] words; 51 52 DirectoryInfo di; 53 FileInfo[] files, filesdir1, filesdir2; 54 55 if (args[1].ToLower().Equals("-d")) 56 { 57 //Only Grab txt. files. Limiting but problably ok. 58 di = new DirectoryInfo(args[2]); 59 files = di.GetFiles("*.txt"); 60 } 61 else if (args[1].ToLower().Equals("-dd")) 62 { 63 di = new DirectoryInfo(args[2]); 64 filesdir1 = di.GetFiles("*.txt"); 65 di = new DirectoryInfo(args[3]); 66 filesdir2 = di.GetFiles("*.txt"); 67 68 files = new FileInfo[filesdir1.Length + filesdir2.Length - 1]; 69 70 for (int i = 0; i < filesdir1.Length; i++) 71 { 72 files[i] = filesdir1[i]; 73 } 74 75 for (int i = 0, j = filesdir1.Length - 1; i < filesdir2.Length; j++, i++) 76 { 77 files[j] = filesdir2[i]; 78 } 79 } 80 else 81 { 82 files = new FileInfo[2]; 83 files[0] = new FileInfo(args[1]); 84 files[1] = new FileInfo(args[2]); 85 } 86 87 88 if (args[0].ToLower().Equals("-l")) 89 { 90 //Read all the words for each file in the directory 91 words = ReadFilesLine(files); 92 } 93 else 94 { 95 //Read all the words for each file in the directory 96 words = ReadFilesChar(files); 97 } 98 //Each Dictionary stores the unique words in each book. 99 Dictionary<string, int>[] dicts = new Dictionary<string, int>[words.Length]; 100 101 //Count unique words for each book 102 for (int i = 0; i < words.Length; i++) 103 { 104 dicts[i] = CountWords(words[i], files[i].Name); 105 } 106 107 //Compare the books 108 for (int i = 0; i < words.Length - 1; i++) 109 { 110 for (int j = i + 1; j < words.Length; j++) 111 { 112 CalculateDistance(files[i].Name, files[j].Name, dicts[i], dicts[j]); 113 } 114 } 115 116 Console.ReadLine(); 117 } 118 catch (Exception ex) 119 { 120 Console.WriteLine("Exception: {0}", ex); 121 Console.ReadLine(); 122 } 123 } 124 125 /// <summary> 126 /// Display invocation information 127 /// </summary> 128 private static void Usage() 129 { 130 Console.WriteLine("Usage is DD.exe <ProcessSwitch> txtfile1 txtfile2 or DD.exe <ProcessSwitch> -d Directory Path"); 131 Console.WriteLine("or DD.exe <ProcessSwitch> -dd DirectoryPath1 DirectoryPath2"); 132 Console.WriteLine("where <ProcessSwitch> is either -c or -l"); 133 Console.WriteLine("-c : words are defined as any alphanumeric string so that 'can't' is two words (can and t) but cant is one"); 134 Console.Write("-l : words are defined as any alphanumeric string with a space at either end, apart from words "); 135 Console.Write("ending in an apostrophe e.g. computers' to indicate possesion"); 136 Console.ReadLine(); 137 } 138 139 /// <summary> 140 /// This is wrapper method that invokes the InnerProduct method to calculate the "distance" between two documents 141 /// </summary> 142 /// <param name="file1">First Book File Name</param> 143 /// <param name="file2">Second Book File Name</param> 144 /// <param name="dict1">Dictionary containing list of unique words for the first book</param> 145 /// <param name="dict2">Dictionary containing list of unique words for the second book</param> 146 private static void CalculateDistance(string file1, string file2, Dictionary<string, int> dict1, Dictionary<string, int> dict2) 147 { 148 long numerator = InnerProduct(dict1, dict2); 149 double denominator = Math.Sqrt(InnerProduct(dict1, dict1) * InnerProduct(dict2, dict2)); 150 151 //This is the Document Distance 152 double theta = Math.Acos(numerator / denominator); 153 154 //Calculate Document Distance as a percentage of similarity. 155 double percentage = ((theta / Math.PI * 2) - 1) * -100; 156 157 Console.WriteLine("The distance between {0} and {1} is:{2:F6} or {3:F3}% similarity.", file1, file2, theta, percentage); 158 } 159 160 /// <summary> 161 /// Read the passed in file and generate a list of words contained in the file. In this method a word is any alphanumeric char array 162 /// e.g. can't is actually two words can and t 163 /// </summary> 164 /// <param name="filename">Path to file</param> 165 /// <returns>list of words</returns> 166 private static List<string> ReadFile(string filename) 167 { 168 try 169 { 170 StreamReader file = new StreamReader(filename); 171 172 string word = string.Empty; 173 174 List<string> words = new List<string>(); 175 176 char n; 177 178 do 179 { 180 n = (char)file.Read(); 181 182 if (char.IsLetterOrDigit(n)) 183 { 184 word += n; 185 } 186 else 187 { 188 if (!string.IsNullOrEmpty(word)) 189 { 190 words.Add(word.ToLower()); 191 } 192 193 word = null; 194 } 195 196 } while (n != (char)65535); 197 198 return words; 199 } 200 catch (Exception ex) 201 { 202 203 throw new Exception("Exception While Reading file", ex.InnerException); 204 } 205 } 206 207 /// <summary> 208 /// Read the passed in files and generate a list of words contained in each file. 209 /// A word is the alphanumeric part of any string surrounded by spaces. 210 /// e.g. hello!!! will be stored as hello 211 /// e.g. can't!! will be stored as can't. 212 /// See CheckAlphaNumeric Method for more details 213 /// </summary> 214 /// <param name="files">List of files to be read</param> 215 /// <returns>Array containing a list of words for each file</returns> 216 private static List<string>[] ReadFilesLine(FileInfo[] files) 217 { 218 List<string>[] words; 219 FileInfo fi; 220 string word = string.Empty; 221 string[] tempwords; 222 223 try 224 { 225 words = new List<string>[files.Length]; 226 227 for (int i = 0; i < files.Length; i++) 228 { 229 //Note that we need to instantiate each List<string> object in the array before we can use it. 230 words[i] = new List<string>(); 231 232 fi = files[i]; 233 234 StreamReader file = new StreamReader(fi.FullName); 235 236 string line = file.ReadLine(); 237 238 do 239 { 240 word = string.Empty; 241 242 //Get the words by splitting on ' '. 243 tempwords = line.Split(new char[] { ' ' }, StringSplitOptions.RemoveEmptyEntries); 244 245 foreach (string s in tempwords) 246 { 247 word = CheckAlphaNumeric(s); 248 249 if (!word.Equals(string.Empty)) 250 { 251 words[i].Add(word); 252 } 253 } 254 255 line = file.ReadLine(); 256 257 } while (line != null); 258 } 259 260 return words; 261 } 262 catch (Exception ex) 263 { 264 throw new Exception("Exception While Reading file", ex.InnerException); 265 } 266 } 267 268 /// <summary> 269 /// Wrapper method for multiple files using ReadFile method. 270 /// </summary> 271 /// <param name="files">List of files to be read</param> 272 /// <returns>Array containing a list of words for each file</returns> 273 private static List<string>[] ReadFilesChar(FileInfo[] files) 274 { 275 List<string>[] words = new List<string>[files.Length]; 276 277 try 278 { 279 for (int i = 0; i < files.Length; i++) 280 { 281 words[i] = new List<string>(); 282 words[i] = ReadFile(files[i].FullName); 283 } 284 285 return words; 286 } 287 catch (Exception ex) 288 { 289 throw new Exception("Exception While Reading file", ex.InnerException); 290 } 291 } 292 293 /// <summary> 294 /// This method ensures that punctuation marks and other non alphanumeric characters are not stored as 295 /// part of the words. The input is a string obtained by using string.Split(' '), which means that 296 /// all question marks, exclamation marks, etc.. will be included. Clearly, these are not desirable, 297 /// so they are removed. 298 /// </summary> 299 /// <param name="s">word as split by string.Split(' ') method</param> 300 /// <returns>A word free of surrounding punctuation marks</returns> 301 private static string CheckAlphaNumeric(string s) 302 { 303 int start = 0, end = 0; 304 305 //Find the first alphanumeric character. Do two searches: 306 //One starting from the start of the string, the other starting from the end of the string. 307 for (int j = 0; j < s.Length; j++) 308 { 309 if (char.IsLetterOrDigit(s[j])) 310 { 311 start = j; 312 break; 313 } 314 } 315 316 for (int j = s.Length - 1; j > 0; j--) 317 { 318 //Char 39 is an apostrophe. This is to allow to differentiate between the apostrophe as hackers' and hackers 319 if (char.IsLetterOrDigit(s[j]) || s[j] == (char)39) 320 { 321 end = j; 322 break; 323 } 324 } 325 326 if (end != s.Length - 1 || start != 0) 327 { 328 //This means that no alphanumeric character exist in the input string 329 //thus we ignore it by returning an empty string. 330 if (start == end) 331 { 332 return string.Empty; 333 } 334 else 335 { 336 return s.Substring(start, (end - start) + 1).ToLower(); 337 } 338 } 339 else 340 { 341 return s.ToLower(); 342 } 343 } 344 345 /// <summary> 346 /// Calculates the inner product of the word frequencies. 347 /// In other words, ∑_w D_1 (w)∙D_2 (w), 348 /// where D_1 is the word freqency for the first document 349 /// and D_2 for the second 350 /// </summary> 351 /// <param name="dict1">List of unique words and frequencies for first document</param> 352 /// <param name="dict2">List of unique words and frequencies for second document</param> 353 /// <returns>Sum of all word frequencies that appear on both texts</returns> 354 private static long InnerProduct(Dictionary<string, int> dict1, Dictionary<string, int> dict2) 355 { 356 long sum = 0; 357 //if the word is in both text sum the frequency otherwise it's zero 358 foreach (var pair in dict1) 359 { 360 if (dict2.ContainsKey(pair.Key)) 361 { 362 sum += dict1[pair.Key] * dict2[pair.Key]; 363 } 364 } 365 366 return sum; 367 } 368 369 /// <summary> 370 /// This method generates a list of words and their frequencies. This is 371 /// stored in a Dictionary<string,int> object. 372 /// </summary> 373 /// <param name="words">List of Words</param> 374 /// <param name="filename">File Name</param> 375 /// <returns>List of words and their frequencies</returns> 376 private static Dictionary<string, int> CountWords(List<string> words, string filename) 377 { 378 Dictionary<string, int> dict = new Dictionary<string, int>(); 379 380 foreach (string s in words) 381 { 382 if (!dict.ContainsKey(s.ToLower())) 383 { 384 dict.Add(s.ToLower(), 1); 385 } 386 else 387 { 388 dict[s.ToLower()] = dict[s.ToLower()] + 1; 389 } 390 } 391 392 Console.WriteLine("{0}: Total Words {1}, Distinct Words {2}", filename.Remove(filename.LastIndexOf('.')), words.Count, dict.Count); 393 394 return dict; 395 } 396 397 } 398 }
Let's see whether there are significant differences between both word definitions:
dd -c t1.verne.txt t2.bobsey.txt
t1.verne: Total Words 8943, Distinct Words 2150
t2.bobsey: Total Words 49785, Distinct Words 3354
The distance between t1.verne.txt and t2.bobsey.txt is:0.582949 or 62.888% similarity.
dd -l t1.verne.txt t2.bobsey.txtThere is a bit of a change but not massive. An another comparison:
t1.verne: Total Words 8629, Distinct Words 2189
t2.bobsey: Total Words 47554, Distinct Words 3655
The distance between t1.verne.txt and t2.bobsey.txt is:0.538879 or 65.694% similarity.
dd -c t2.bobsey.txt t3.lewis.txt
t2.bobsey: Total Words 49785, Distinct Words 3354
t3.lewis: Total Words 182355, Distinct Words 8530
The distance between t2.bobsey.txt and t3.lewis.txt is:0.574160 or 63.448% similarity.
dd -l t2.bobsey.txt t3.lewis.txt
t2.bobsey: Total Words 47554, Distinct Words 3655
t3.lewis: Total Words 180645, Distinct Words 9022
The distance between t2.bobsey.txt and t3.lewis.txt is:0.521558 or 66.797% similarity.
Again the same sort of variation between the word definitions, from now on I'll stick to my definition of a word. The other crucial point is that the similarity is quite high between the texts, ~65%. How about if we test two novels from the same author, say Jane Austen?
Emma: Total Words 160028, Distinct Words 10450
Pride and Prejudice: Total Words 124224, Distinct Words 7181
The distance between Emma.txt and Pride and Prejudice.txt is:0.183119 or 88.342% similarity.
In fact the lowest similarity between a selected few of Jane Austen's novels is ~ 85% and between a few of Arthur Conan Doyle's novels is ~ 87%. So it would be reasonable to suppose an ~85% similarity between novels by the same author or would it?
It turns out that it's not quite as simple as that, see below a comparison between several of Dickens' novel:
A Tale of Two Cities: Total Words 138321, Distinct Words 11202
David Coperfield: Total Words 356158, Distinct Words 20083
Great Expectations: Total Words 186598, Distinct Words 12711
Hard Times: Total Words 105454, Distinct Words 10543
Oliver Twist: Total Words 160480, Distinct Words 13350
The Life And Adventures Of Nicholas Nickleby: Total Words 324043, Distinct Words 20720
The Pickwick Papers: Total Words 300922, Distinct Words 21384
The distance between A Tale of Two Cities.txt and David Coperfield.txt is:0.378243 or 75.920% similarity.
The distance between A Tale of Two Cities.txt and Great Expectations.txt is:0.314171 or 79.999% similarity.
The distance between A Tale of Two Cities.txt and Hard Times.txt is:0.256404 or 83.677% similarity.
The distance between A Tale of Two Cities.txt and Oliver Twist.txt is:0.169488 or 89.210% similarity.
The distance between A Tale of Two Cities.txt and The Life And Adventures Of Nicholas Nickleby.txt is:0.186920 or 88.100% similarity.
The distance between A Tale of Two Cities.txt and The Pickwick Papers.txt is:0.257100 or 83.633% similarity.
The distance between David Coperfield.txt and Great Expectations.txt is:0.153898 or 90.203% similarity.
The distance between David Coperfield.txt and Hard Times.txt is:0.268132 or 82.930% similarity.
The distance between David Coperfield.txt and Oliver Twist.txt is:0.436028 or 72.242% similarity.
The distance between David Coperfield.txt and The Life And Adventures Of Nicholas Nickleby.txt is:0.342514 or 78.195% similarity.
The distance between David Coperfield.txt and The Pickwick Papers.txt is:0.466544 or 70.299% similarity.
The distance between Great Expectations.txt and Hard Times.txt is:0.252974 or 83.895% similarity.
The distance between Great Expectations.txt and Oliver Twist.txt is:0.375174 or 76.116% similarity.
The distance between Great Expectations.txt and The Life And Adventures Of Nicholas Nickleby.txt is:0.296311 or 81.136% similarity.
The distance between Great Expectations.txt and The Pickwick Papers.txt is:0.425808 or 72.892% similarity.
The distance between Hard Times.txt and Oliver Twist.txt is:0.306853 or 80.465% similarity.
The distance between Hard Times.txt and The Life And Adventures Of Nicholas Nickleby.txt is:0.218641 or 86.081% similarity.
The distance between Hard Times.txt and The Pickwick Papers.txt is:0.364912 or 76.769% similarity.
The distance between Oliver Twist.txt and The Life And Adventures Of Nicholas Nickleby.txt is:0.203269 or 87.059% similarity.
The distance between Oliver Twist.txt and The Pickwick Papers.txt is:0.201226 or 87.190% similarity.
The distance between The Life And Adventures Of Nicholas Nickleby.txt and The Pickwick Papers.txt is:0.272799 or 82.633% similarity.
There are three comparisons that yield less than 73% and another three that yield ~76%, so perhaps we need to recalibrate how similar two documents need to be before we can claim that they have been written by the same author. Three more comparisons:
Emma: Total Words 160028, Distinct Words 10450
David Coperfield: Total Words 356158, Distinct Words 20083
The distance between Emma.txt and David Coperfield.txt is:0.350631 or 77.678% similarity.
Emma: Total Words 160028, Distinct Words 10450
Hard Times: Total Words 105454, Distinct Words 10543
The distance between Emma.txt and Hard Times.txt is:0.295020 or 81.218% similarity.
Great Expectations: Total Words 186598, Distinct Words 12711
A Study in Scarlet: Total Words 46660, Distinct Words 6418
The distance between Great Expectations.txt and A Study in Scarlet.txt is:0.297589 or 81.055% similarity.
Oh Dear!, they've all been written by the same literary master, but by whom?
These comparisons clearly show the limitations of this type of, very simple, word frequency analysis. You weren't really expecting to know whether Sir Francis Bacon wrote Shakespeare's plays with a 400 line application were you?
dd -l t8.shakespeare.txt t9.bacon.txt
t8.shakespeare: Total Words 898261, Distinct Words 30241
t9.bacon: Total Words 55612, Distinct Words 7943
The distance between t8.shakespeare.txt and t9.bacon.txt is:0.530085 or 66.254% similarity.
No comments:
Post a Comment