DIUnicode
Overview
DIUnicode's Pascal implementation features more than 70 encodings, such as UTF-7, UTF-8, UTF-16, the ISO-8859 family, various Windows and Macintosh codepages, KOI8 character sets, Chinese GB18030, and more. Adding a new character encoding is as simple as writing a single conversion procedure. When linked against DIConverters, DIUnicode supports 144 character sets and encodings.
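To give a feel for the kind of work a single conversion procedure does, here is a stand-alone sketch that decodes UTF-8 bytes into Unicode code points. It is not DIUnicode's API: the name DecodeUtf8, its parameters, and the sample data are assumptions made for illustration, and the procedure signature DIUnicode actually expects is not shown in this text.

```pascal
program DecodeSketch;

{$IFDEF FPC}{$mode objfpc}{$H+}{$ENDIF}
{$IFNDEF FPC}{$APPTYPE CONSOLE}{$ENDIF}

{ Hypothetical stand-alone helper, NOT part of DIUnicode.
  Decodes one UTF-8 sequence starting at Buf[Pos] (0-based) and advances Pos.
  Returns False at end of input or on truncated / malformed input.
  Simplification: overlong forms and surrogate values are not rejected. }
function DecodeUtf8(const Buf: array of Byte; var Pos: Integer;
  out CodePoint: Cardinal): Boolean;
var
  B: Byte;
  Extra, I: Integer;
begin
  Result := False;
  CodePoint := 0;
  if Pos > High(Buf) then
    Exit; { nothing left to decode }
  B := Buf[Pos];
  { The lead byte tells us how many continuation bytes follow. }
  if B < $80 then
    begin CodePoint := B; Extra := 0; end
  else if (B and $E0) = $C0 then
    begin CodePoint := B and $1F; Extra := 1; end
  else if (B and $F0) = $E0 then
    begin CodePoint := B and $0F; Extra := 2; end
  else if (B and $F8) = $F0 then
    begin CodePoint := B and $07; Extra := 3; end
  else
    Exit; { stray continuation byte or invalid lead byte }
  if Pos + Extra > High(Buf) then
    Exit; { truncated sequence }
  for I := 1 to Extra do
  begin
    B := Buf[Pos + I];
    if (B and $C0) <> $80 then
      Exit; { not a continuation byte }
    CodePoint := (CodePoint shl 6) or (B and $3F);
  end;
  Inc(Pos, Extra + 1);
  Result := True;
end;

const
  { Sample input: 'A', U+00E9, U+20AC encoded as UTF-8. }
  Data: array[0..5] of Byte = ($41, $C3, $A9, $E2, $82, $AC);
var
  P: Integer;
  CP: Cardinal;
begin
  P := 0;
  while DecodeUtf8(Data, P, CP) do
    WriteLn(CP); { prints 65, 233, 8364 on separate lines }
end.
```

A procedure of this shape keeps all encoding-specific logic in one place, which is why a reader built around such procedures can keep its parsing code unchanged across encodings.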
Key Benefits
DIUnicode is for you if your application needs to handle text with multiple character encodings with high performance and little development time.
Both the Unicode Reader and the Unicode Writer work with strings, buffers, and streams. You can, for example, read directly from or write directly to database BLOB streams, avoiding any temporary storage of your data.
An efficient buffering system guarantees excellent performance, even when processing huge files.
Simple Usage Examples
DIUnicode makes reading and writing Unicode as simple as ASCII text, regardless of the character set or encoding you are processing. The code snippets below show some of the techniques commonly used with TDIUnicodeReader, the reader class of DIUnicode. Remember that you can use the same parsing routines unchanged with any of the available encodings.
Read entire lines from a Unicode text file:
    { Setup and initialize. }
    Reader := TDIUnicodeReader.Create(nil);
    { Let's say we want to read UTF-8. This could well be any other character encoding. }
    Reader.ReadMethods := Read_Utf_8;
    Reader.SourceStream := TFileStream.Create('MyFile.txt', fmOpenRead);
    { Now the actual reading: }
    while Reader.ReadLine do
    begin
      TheLine := Reader.DataAsStrW;
      { Your code to process the line goes here. }
    end;
Read individual characters only:
    while Reader.ReadChar do
    begin
      TheChar := Reader.Char;
      case TheChar of
        'A'..'Z': ; // Process Alphas
        '0'..'9': ; // Process Digits
      end;
    end;
Use overloaded methods to read up to a particular character or a set of characters:
    { Read all characters up to the Dollar sign. }
    Reader.ReadCharsTill('$');
    { Read all characters up to either '(' or ')'. }
    Reader.ReadCharsTill('(', ')');
    { Skip rest of line and advance to next one. }
    Reader.SkipLine;
Advanced parsing:
- An RFC-compliant CSV parser is part of DIUnicode. Source code is available as a feature demonstration.
- The popular DIHtmlParser is built on top of DIUnicode. It implements a full-featured HTML, XHTML, and XML parser with Unicode support and a flexible plugin architecture.
Peek Ahead / Look Ahead reading
Unlike other text readers, the lookahead features of TDIUnicodeReader are limited only by available memory, not by a fixed number of characters. The code below reads up to five Unicode characters into the internal buffer. TDIUnicodeReader could well look ahead much further, but this should not be abused: keep the number reasonably small.
    var
      UR: TDIUnicodeReader;
      c: WideChar;
    begin
      { ... TDIUnicodeReader creation and initialization should go here ... }
      UR.PeekAhead(5); // Read up to 5 characters into the internal buffer.
      if UR.PeekedCount >= 1 then // Test if the 1st peeked character could be read ...
        c := UR.PeekedChars[0]; // and examine it.
      if UR.PeekedCount >= 5 then // Same as above ...
        c := UR.PeekedChars[4]; // but with the 5th peeked character now.
      if UR.ReadChar then // Continue reading with the next char ...
        c := UR.Char; // which is then available as in the ReadChar loop above.
    end;
Performance
DIUnicode is extremely fast, even when processing very large files. Both the reader and the writer classes benefit from their internal buffers, which allow them to read and write files one small chunk of data at a time. DIUnicode never requires you to fit the entire file into memory. This way it achieves conversion rates of well over 20 MB per second.
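The general idea of chunked processing can be illustrated with a minimal sketch that uses only the standard Classes unit. It does not show DIUnicode's internals: the helper CountLineFeeds, the 4 KB buffer size, and the sample data are assumptions chosen for illustration. The point is that memory use stays constant no matter how large the stream is.

```pascal
program ChunkedRead;

{$IFDEF FPC}{$mode objfpc}{$H+}{$ENDIF}
{$IFNDEF FPC}{$APPTYPE CONSOLE}{$ENDIF}

uses
  Classes;

{ Hypothetical helper, NOT part of DIUnicode.
  Counts line feed bytes by reading the stream in fixed-size chunks,
  so only one small buffer is held in memory at any time. }
function CountLineFeeds(Stream: TStream): Int64;
var
  Buf: array[0..4095] of Byte;
  N, I: Integer;
begin
  Result := 0;
  repeat
    { Read returns the number of bytes actually read; 0 means end of stream. }
    N := Stream.Read(Buf, SizeOf(Buf));
    for I := 0 to N - 1 do
      if Buf[I] = 10 then
        Inc(Result);
  until N = 0;
end;

var
  MS: TMemoryStream;
  Sample: AnsiString;
begin
  Sample := 'one'#10'two'#10'three'#10;
  MS := TMemoryStream.Create;
  try
    MS.WriteBuffer(Sample[1], Length(Sample));
    MS.Position := 0;
    WriteLn(CountLineFeeds(MS)); { prints 3 }
  finally
    MS.Free;
  end;
end.
```

The same function works unchanged with a TFileStream or any other TStream descendant, since it only relies on TStream.Read.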