Hebbut.Net Public Download Area

Public Offerings by Ger Hobbelt - available for download

Prebuilt Win32/64 binaries are provided, together with up-to-date MSVC2005 source projects in case you would like to build this baby from source on your own machine.

Have fun!

Ger Hobbelt

NOTE:

You may also need the Microsoft Visual Studio 2005 (SP1) C/C++ runtime libraries if you get error messages about manifest files, etc. (This is NOT the Microsoft .NET redistributable!)

Get the installer for the runtime libraries from Microsoft itself.

OR: try this local copy of the same (as was distributed with the MSVC2005 setup which was used to create the executable[s] above).

You also need to fetch the hebbut.net HTMLtidy library + MSVC project files, available here.

HTMLtidyWrapper for .NET (MSVC2005)

A .NET (2.0) wrapper library which allows any .NET code to use the powerful features offered by HTMLtidy (original available at http://tidy.sourceforge.net/).

This wrapper has been developed from scratch by Ger Hobbelt as part of a bigger system which processes large amounts of foreign HTML data using XSLT machinery: for this to work for arbitrary web pages, the often crummy input HTML needs to be reformatted and fixed to comply with strict XHTML standards - a job for which HTMLtidy is perfectly suited.

One of the features of HTMLtidyWrapper is access to all HTMLtidy options: you can set them individually or load/save them from string or file.

NOTE:

May 2008: The source archive now includes two simple sample applications, one in C# and one in VB.NET, which demonstrate some of the features of HTMLtidy and the wrapper library.

WARNING:

This .NET wrapper library requires the use of the patched HTMLtidy library available here.

Yes, I know there are other HTMLtidy wrappers out there and I tested them (in 2006), but none of them suited my needs. This HTMLtidyWrapper produces an XmlDocument for each parsed HTML input, which is the minimum I need for my own systems. Of course, it is now two years later and the situation will have changed, but this code has served me well over those two years, so I have decided to publish it under the GPL. It will continue to be used in my own systems.

To get an initial taste of the .NET interface, here's an extract of the HtmlTidy namespace (the HTMLtidy enumerations are also exported to .NET, but those are not shown here):

namespace HtmlTidy
{
    public class TidyParser : XmlTextReader, IDisposable
    {
        public TidyParser();
        public TidyParser(string src);

        public override void Close();
        public override sealed void Dispose();
        public XmlDocument DoParseHtml(string s);    // convert HTML input to XML node tree
        public string DoParseHtml2String(string s);  // rewrite HTML input
        public string ErrorMessage();                // return error/warning/diag description
        public void ResetErrorMessage();
        public uint TidyAccessWarningCount();
        public int TidyCleanAndRepair();
        public int TidyParseFile(string filename);
        public int TidyParseStdin();
        public int TidyParseString(string content);
        public string TidyReleaseDate();
        public int TidyRunDiagnostics();
        public int TidySaveFile(string filename);
        public int TidySaveStdout();
        public int TidySaveString(out string buffer);
        public int TidySetCharEncoding(string encnam);
        public int TidySetInCharEncoding(string encnam);
        public int TidySetOutCharEncoding(string encnam);
        public int TidyStatus();
        public uint TidyWarningCount();
        public bool TidyDetectedGenericXml();
        public int TidyDetectedHtmlVersion();
        public bool TidyDetectedXhtml();
        public uint TidyErrorCount();
        public void TidyErrorSummary();
        public void TidyGeneralInfo();

        // configuration access methods
        public int TidyLoadConfigEncFromFile(string configfile, string charenc);
        public int TidyLoadConfigFromFile(string configfile);
        public int TidyLoadConfigFromString(string config);
        public int TidyLoadConfigFromStringEnc(string config, string charenc);
        public bool TidyOptAdjustConfig();        // useful to correct/validate configuration after setting
                                                  // individual config elements.
        public uint TidyConfigErrorCount();
        public void* TidyGetNextOption(void** pos);
        public void* TidyGetOption(HtmlTidyOptionId optId);
        public void* TidyGetOptionByName(string optnam);
        public void* TidyGetOptionList();
        public bool TidyOptCopyConfig(TidyParser To);
        public bool TidyOptDiffThanDefault();
        public bool TidyOptDiffThanSnapshot();
        public HtmlTidyTriState TidyOptGetAutoBool(HtmlTidyOptionId optId);
        public bool TidyOptGetBool(HtmlTidyOptionId optId);
        public HtmlTidyConfigCategory TidyOptGetCategory(void* opt);
        public string TidyOptGetCurrPick(HtmlTidyOptionId optId);
        public void* TidyOptGetDeclTagList();
        public string TidyOptGetDefault(void* opt);
        public bool TidyOptGetDefaultBool(void* opt);
        public uint TidyOptGetDefaultInt(void* opt);
        public string TidyOptGetDoc(void* opt);
        public void* TidyOptGetDocLinksList(void* opt);
        public string TidyOptGetEncName(HtmlTidyOptionId optId);
        public HtmlTidyOptionId TidyOptGetId(void* opt);
        public HtmlTidyOptionId TidyOptGetIdForName(string optnam);
        public uint TidyOptGetInt(HtmlTidyOptionId optId);
        public string TidyOptGetName(void* opt);
        public string TidyOptGetNextDeclTag(HtmlTidyOptionId optId, void** iter);
        public void* TidyOptGetNextDocLinks(void** pos);
        public string TidyOptGetNextPick(void* opt, void** pos);
        public void* TidyOptGetPickList(void* opt);
        public HtmlTidyOptionType TidyOptGetType(void* opt);
        public string TidyOptGetValue(HtmlTidyOptionId optId);
        public bool TidyOptIsReadOnly(void* opt);
        public bool TidyOptParseValue(string optnam, string val);
        public bool TidyOptResetAllToDefault();
        public bool TidyOptResetToDefault(HtmlTidyOptionId opt);
        public bool TidyOptResetToSnapshot();
        public int TidyOptSave(out string dst);
        public int TidyOptSaveFile(string cfgfil);
        public bool TidyOptSetAutoBool(HtmlTidyOptionId optId, HtmlTidyTriState val);
        public bool TidyOptSetBool(HtmlTidyOptionId optId, bool val);
        public bool TidyOptSetInt(HtmlTidyOptionId optId, uint val);
        public bool TidyOptSetValue(HtmlTidyOptionId optId, string val);
        public bool TidyOptSnapshot();
    }
}    

and here is a C# code snippet showing a few examples of its use:

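    // Needs: using System; using System.Xml; using System.Xml.XPath; using HtmlTidy;
    // 'sb' is assumed to be a StringBuilder holding the HTML collected earlier.
    //
    // now run the collected HTML through HTMLtidy to clean it and reformat as XHTML
    // so we can access it using an XmlReader, for instance.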
    TidyParser tidier = new TidyParser();
    bool cfg_ok = tidier.TidyOptSetBool(HtmlTidy.HtmlTidyOptionId.ShowWarnings, true);
    cfg_ok &= tidier.TidyOptSetBool(HtmlTidy.HtmlTidyOptionId.XhtmlOut, true);
    cfg_ok &= tidier.TidyOptSetBool(HtmlTidy.HtmlTidyOptionId.XmlDecl, true);
    cfg_ok &= tidier.TidyOptSetAutoBool(HtmlTidy.HtmlTidyOptionId.IndentContent, HtmlTidyTriState.AutoState);
    cfg_ok &= tidier.TidyOptAdjustConfig();

    if (tidier.TidyConfigErrorCount() > 0)
    {
        tidier.TidyErrorSummary();
        Console.WriteLine("HTMLTIDY error summary: {0}", tidier.ErrorMessage());
        tidier.ResetErrorMessage();
    }

    // process input HTML and convert it to XHTML
#if false
    String res = tidier.DoParseHtml2String(sb.ToString());
    Console.WriteLine("HTMLTIDY output\n{0}", res);
#endif

    XmlDocument xhtml_doc = tidier.DoParseHtml(sb.ToString());
    if (tidier.TidyErrorCount() > 0 || tidier.TidyWarningCount() > 0)
    {
        tidier.TidyErrorSummary();
        Console.WriteLine("HTMLTIDY HTML parsing error+warning summary: {0}", tidier.ErrorMessage());
        tidier.ResetErrorMessage();
    }

    // write reformatted XML/XHTML to XML file for further processing by other XML-based tools:
    XmlWriterSettings settings = new XmlWriterSettings();
    settings.OmitXmlDeclaration = false;
    settings.Indent = true;
    settings.IndentChars = "  ";
    XmlWriter writer = XmlWriter.Create(Console.Out, settings);
    xhtml_doc.WriteTo(writer);
    writer.Close();

    // try a search operation on the XmlDocument.
    XPathNavigator navigator = xhtml_doc.CreateNavigator();
    XPathExpression query = navigator.Compile("count(//html//br)");
    double total = (double)navigator.Evaluate(query);
    Console.WriteLine("Number of <br> elements = {0}\n", total);
    total = (double)navigator.Evaluate("count(//p)");
    Console.WriteLine("Number of <p> elements = {0}\n", total); 
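
One thing to watch with XPath counts like the ones above: in XPath 1.0 an unprefixed name such as `p` only matches elements that are in no namespace. Whether the XmlDocument returned by DoParseHtml() carries the XHTML namespace is not stated here, so treat the following as a sketch under that assumption. It uses only standard System.Xml classes; the class name `NamespaceCountSketch`, the helper `CountElements` and the prefix `h` are made up for illustration, and the hand-written document merely stands in for the wrapper's output.

```csharp
using System;
using System.Xml;
using System.Xml.XPath;

public static class NamespaceCountSketch
{
    // Counts elements with the given local name in the XHTML namespace.
    // The name is pasted into the XPath expression as-is, so pass only
    // plain element names (illustration only, no escaping is done).
    public static double CountElements(XmlDocument doc, string localName)
    {
        XPathNavigator navigator = doc.CreateNavigator();

        // Bind the (arbitrary) prefix "h" to the XHTML namespace URI.
        XmlNamespaceManager ns = new XmlNamespaceManager(navigator.NameTable);
        ns.AddNamespace("h", "http://www.w3.org/1999/xhtml");

        XPathExpression query = navigator.Compile("count(//h:" + localName + ")");
        query.SetContext(ns);   // lets the expression resolve the "h" prefix
        return (double)navigator.Evaluate(query);
    }

    public static void Main()
    {
        // Hypothetical stand-in for the XmlDocument produced by the wrapper.
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(
            "<html xmlns='http://www.w3.org/1999/xhtml'><body>" +
            "<p>a<br/>b</p><p>c</p>" +
            "</body></html>");

        Console.WriteLine("Number of <p> elements = {0}", CountElements(doc, "p"));
        Console.WriteLine("Number of <br> elements = {0}", CountElements(doc, "br"));
    }
}
```

If the document turns out to have no namespace at all, the plain unprefixed queries are the ones that match and this extra step is not needed.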

NOTE:

The XmlReader compatibility of TidyParser is still under development; once complete, it should simplify code which combines HTMLtidy with the .NET XSLT transforms.
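
Until then, an XmlDocument can be fed to the standard .NET 2.0 XSLT engine directly. The sketch below is not part of the wrapper: it uses only System.Xml and System.Xml.Xsl, the class name `XsltSketch`, the helper `ExtractParagraphs` and the tiny stylesheet are invented for illustration, and the hand-written document stands in for what DoParseHtml() would return (assumed here to be in the XHTML namespace).

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Xsl;

public static class XsltSketch
{
    // Illustrative stylesheet: emits the text of every XHTML <p>, one per line.
    private const string Stylesheet =
        "<xsl:stylesheet version='1.0'" +
        " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'" +
        " xmlns:h='http://www.w3.org/1999/xhtml'>" +
        "<xsl:output method='text'/>" +
        "<xsl:template match='/'>" +
        "<xsl:for-each select='//h:p'>" +
        "<xsl:value-of select='.'/><xsl:text>&#10;</xsl:text>" +
        "</xsl:for-each>" +
        "</xsl:template>" +
        "</xsl:stylesheet>";

    // Runs the stylesheet over an in-memory XmlDocument and returns the result text.
    public static string ExtractParagraphs(XmlDocument xhtmlDoc)
    {
        XslCompiledTransform xslt = new XslCompiledTransform();
        using (XmlReader reader = XmlReader.Create(new StringReader(Stylesheet)))
        {
            xslt.Load(reader);
        }

        StringWriter output = new StringWriter();
        xslt.Transform(xhtmlDoc, null, output);   // no stylesheet arguments needed
        return output.ToString();
    }

    public static void Main()
    {
        // Hypothetical stand-in for the XmlDocument produced by the wrapper.
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(
            "<html xmlns='http://www.w3.org/1999/xhtml'><body>" +
            "<p>one</p><p>two</p>" +
            "</body></html>");

        Console.Write(ExtractParagraphs(doc));
    }
}
```

In a real pipeline the stylesheet would normally be loaded from a file rather than from a string constant; the string is used here only to keep the sketch in one piece.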

Changes/Fixes:

Downloads

Downloadable archives / files

Files / Archives, Date/Time, Quality, Notes:

HTMLtidyWrapper for .NET, 2009-02-12, Production, for .NET 2.0 or later
    Source (Open Source, Microsoft Visual Studio, GNU, UNIX, Linux): HtmlTidyWrapper-1.2.i_a.full-src.7z
    Binaries (Microsoft Windows, 64-bit Microsoft Windows on Intel Itanium, 64-bit Microsoft Windows on AMD64): HtmlTidyWrapper-1.2.i_a.bin-win32.7z

HTMLtidyWrapper for .NET, 2008-05-14, Production, for .NET 2.0 or later
    Source (Open Source, Microsoft Visual Studio, GNU, UNIX, Linux): HtmlTidyWrapper-1.1.i_a.full-src.7z
    Binaries (Microsoft Windows, 64-bit Microsoft Windows on Intel Itanium, 64-bit Microsoft Windows on AMD64): HtmlTidyWrapper-1.1.i_a.bin-win32.7z

For related HTMLtidy builds and sources, look here.