Monday 18 March 2013

How to convert a webpage to a .mobi or .epub file

The standard advice whenever anybody answers this question seems to be: use Calibre, the e-book management software. If you have a Kindle you can also use Amazon's conversion service, which you access by sending the webpage as an attachment to your Kindle's email address (you do need to whitelist the address you're sending from).

The problem is that neither of them works 100% of the time. There are online alternatives, such as Instapaper, which is really handy as it lets you combine multiple webpages into a single mobi file, among other things, but I have never seen an image in any of the Instapaper mobi files that I have downloaded.

I wrote a little console app that gets rid of the nasty stuff that makes Calibre go pop: script, iframe and noscript elements. It might work with Amazon too, but I've only tried it with Calibre as it's quicker to test.

Please bear in mind that I've only done limited testing. By the time I've written the app, run the tests and so on, I probably could have read all the "problematic" articles online, but there you go.

Also note that you will need the HtmlAgilityPack.

using HtmlAgilityPack;
using System;
using System.IO;
using System.Linq;

namespace HTMLCleaner
{
    class Program
    {
        static void Main(string[] args)
        {
            if (args.Length >= 1 && args.Length <= 2)
            {
                try
                {
                    string sourceFile = args[0];
                    // With no destination given, the source file is overwritten.
                    string destFile = args.Length > 1 ? args[1] : args[0];

                    HtmlDocument doc = new HtmlDocument();

                    doc.Load(sourceFile);

                    // Copy the matches to a list first so that nodes can be
                    // removed without modifying the tree while enumerating it.
                    doc.DocumentNode.Descendants()
                        .Where(x => x.Name == "script" || x.Name == "iframe" || x.Name == "noscript")
                        .ToList()
                        .ForEach(x => x.Remove());

                    using (StreamWriter sw = new StreamWriter(destFile))
                    {
                        doc.Save(sw);
                    }

                    Console.WriteLine("Successfully cleaned HTML");
                }
                catch (Exception ex)
                {
                    Console.WriteLine("Error: {0} - Type {1}.", ex.Message, ex.GetType());
                }
            }
            else
            {
                Console.WriteLine("Please invoke like this:");
                Console.WriteLine("HTMLCleaner.exe sourcefile destinationfile");
                Console.WriteLine("Destination file can be omitted, in which case the source file will also be the destination file");
            }
        }
    }
}
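
Once the HTML is clean, the conversion itself could also be driven from code rather than from the Calibre GUI. The following is only a sketch, not something I have tested alongside the app above: it assumes Calibre's ebook-convert command-line tool is installed and on the PATH, and the CalibreConverter class and its Convert method are hypothetical names made up for illustration.

```csharp
using System;
using System.Diagnostics;

namespace HTMLCleaner
{
    // Hypothetical helper, not part of the original app.
    static class CalibreConverter
    {
        // Runs: ebook-convert "<cleanedHtml>" "<outputBook>"
        // Assumes Calibre is installed and ebook-convert is on the PATH.
        // ebook-convert picks the output format from the file extension
        // of outputBook (for example .mobi or .epub).
        public static int Convert(string cleanedHtml, string outputBook)
        {
            var startInfo = new ProcessStartInfo
            {
                FileName = "ebook-convert",
                Arguments = string.Format("\"{0}\" \"{1}\"", cleanedHtml, outputBook),
                UseShellExecute = false
            };

            // Process.Start throws if ebook-convert cannot be found.
            using (Process process = Process.Start(startInfo))
            {
                process.WaitForExit();
                return process.ExitCode;
            }
        }
    }
}
```

Calling CalibreConverter.Convert("cleaned.html", "article.mobi") after the cleaning step would then return the exit code of ebook-convert, with zero normally meaning the conversion succeeded.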
