Author Topic: How to stream.Read UTF16 files?  (Read 3567 times)

theo

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 1842
Re: How to stream.Read UTF16 files?
« Reply #15 on: June 26, 2012, 03:39:34 pm »
Thank you, applied.
http://www.theo.ch/lazarus/utf8tools.zip

About the license: Please read utf8proc_LICENSE

paskal

  • Full Member
  • ***
  • Posts: 194
Re: How to stream.Read UTF16 files?
« Reply #16 on: June 27, 2012, 09:35:08 am »
Here is another issue: I see no way to read only part of a file. For example, I need the first 100 chars for a preview. Is the only option to read the entire file and display the first 100 chars, or is there another way?
Lazarus 1,1; build 40379; FPC2,6,1

theo

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 1842
Re: How to stream.Read UTF16 files?
« Reply #17 on: June 27, 2012, 11:25:22 am »
Not directly, but TCharEncStream is basically a TMemoryStream, so you can do whatever you can do with TMemoryStream.
Example:
Code: [Select]
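var
  fs: TFileStream;
  buf: array[0..199] of Byte;  // 200 bytes = 100 UTF-16 code units
  n: Integer;
begin
  if OpenDialog1.Execute then
  begin
    fs := TFileStream.Create(OpenDialog1.FileName, fmOpenRead);
    try
      // Read, unlike ReadBuffer, does not raise if the file is shorter than buf
      n := fs.Read(buf, SizeOf(buf));
    finally
      fs.Free;
    end;
    fCES.WriteBuffer(buf, n);  // write only the bytes actually read
    Memo1.Text := fCES.UTF8Text;
....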
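
The same idea can be wrapped in a reusable helper. This is only a sketch: `CopyFilePrefix` and `PrefixSizeDemo` are made-up names, not part of utf8tools, and the destination is an ordinary TMemoryStream. Since TCharEncStream is basically a TMemoryStream, passing fCES as `Dest` should presumably work the same way.

```pascal
program PreviewRead;
{$mode objfpc}{$H+}
uses
  Classes, SysUtils;

// Hypothetical helper (not part of utf8tools): copies at most MaxBytes from
// the start of FileName into Dest and returns the number of bytes copied.
function CopyFilePrefix(const FileName: string; Dest: TStream;
  MaxBytes: Int64): Int64;
var
  fs: TFileStream;
begin
  fs := TFileStream.Create(FileName, fmOpenRead or fmShareDenyWrite);
  try
    Result := fs.Size;
    if Result > MaxBytes then Result := MaxBytes;
    if Result < 0 then Result := 0;
    // CopyFrom with Count = 0 would copy the whole file, so guard against it
    if Result > 0 then
      Dest.CopyFrom(fs, Result);
  finally
    fs.Free;
  end;
end;

// Demo only: writes a 10-byte temp file, copies a prefix of it into a
// TMemoryStream and returns the resulting stream size.
function PrefixSizeDemo(MaxBytes: Int64): Int64;
var
  ms: TMemoryStream;
  fs: TFileStream;
  tmp: string;
  data: AnsiString;
begin
  tmp := GetTempFileName;
  data := '0123456789';
  fs := TFileStream.Create(tmp, fmCreate);
  try
    fs.WriteBuffer(data[1], Length(data));
  finally
    fs.Free;
  end;
  ms := TMemoryStream.Create;
  try
    CopyFilePrefix(tmp, ms, MaxBytes);
    Result := ms.Size;
  finally
    ms.Free;
    DeleteFile(tmp);
  end;
end;

begin
  WriteLn('Bytes in preview stream: ', PrefixSizeDemo(4));
end.
```

Keep in mind that the byte count is not the char count: 100 chars are about 200 bytes in UTF-16 (plus 2 for a BOM), but anything from 100 to 400 bytes in UTF-8.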

paskal

  • Full Member
  • ***
  • Posts: 194
Re: How to stream.Read UTF16 files?
« Reply #18 on: July 05, 2012, 12:02:27 pm »
There is something in charencstreams.pas that IMHO is not right.
Here is a procedure:
Code: [Select]
procedure TUniStream.CheckFileType;
var ASt: string[5];
  Str: AnsiString;
  Posi, rd: integer;
begin
  Ast := #0#0#0#0#0;
  if GetSystemEncoding = EncodingUTF8 then fUniStreamType := ufUTF8 else fUniStreamType := ufANSI;
  fHasBOM := False;
  Position := 0;
  rd := Read(ASt[1], 4);
  begin
    if (rd > 2) and (Copy(Ast, 1, 3) = UTF8BOM) then begin fUniStreamType := ufUtf8; fHasBOM := True; end else
      if (rd > 3) and (Copy(Ast, 1, 4) = UTF32LEBOM) then begin fUniStreamType := ufUtf32le; fHasBOM := True; end else
        if (rd > 3) and (Copy(Ast, 1, 4) = UTF32BEBOM) then begin fUniStreamType := ufUtf32be; fHasBOM := True; end else
          if (rd > 1) and (Copy(Ast, 1, 2) = UTF16LEBOM) then begin fUniStreamType := ufUtf16le; fHasBOM := True; end else
            if (rd > 1) and (Copy(Ast, 1, 2) = UTF16BEBOM) then begin fUniStreamType := ufUtf16be; fHasBOM := True; end;
    Position := 0;
    fHaveType := True;
  end;
  if not fHasBom then
  begin
    SetLength(Str, Min(2048, Size));
    if Length(Str) = 0 then exit;
    Read(Str[1], Length(Str));
    Posi := Pos(#0#0, Str);
    if Posi > 0 then
    begin
      if odd(Posi div 2) then fUniStreamType := ufUtf32le else fUniStreamType := ufUtf32be;
    end else
    begin
      Posi := Pos(#0, Str);
      if Posi > 0 then if odd(Posi) then fUniStreamType := ufUtf16be else fUniStreamType := ufUtf16le;
    end;
  end;
end; 

As written, when a UTF-8 file without a BOM is opened, the procedure keeps the result from GetSystemEncoding, which on Windows is most probably ANSI.
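
One possible way around this would be to test the BOM-less sample for well-formed UTF-8 before falling back to the system encoding. Below is a rough sketch of such a test. `LooksLikeUTF8` is a hypothetical name and not something that exists in charencstreams.pas. It is also simplified: it only checks lead and continuation byte patterns and does not reject overlong forms or surrogates.

```pascal
program Utf8Check;
{$mode objfpc}{$H+}

// Hypothetical helper, not part of charencstreams.pas.
// Returns True if S has no malformed UTF-8 sequence and contains at least
// one multi-byte sequence. Pure ASCII gives False, because it cannot be
// told apart from ANSI. A sequence cut off by the end of S is tolerated,
// since S may be just the first part of a file.
function LooksLikeUTF8(const S: AnsiString): Boolean;
var
  i, n, k: Integer;
  b: Byte;
begin
  Result := False;
  i := 1;
  while i <= Length(S) do
  begin
    b := Ord(S[i]);
    // n = number of continuation bytes expected after the lead byte
    if b < $80 then n := 0
    else if (b and $E0) = $C0 then n := 1
    else if (b and $F0) = $E0 then n := 2
    else if (b and $F8) = $F0 then n := 3
    else Exit(False);  // stray continuation byte or invalid lead byte
    if i + n > Length(S) then Break;  // truncated at the end of the sample
    for k := 1 to n do
      if (Ord(S[i + k]) and $C0) <> $80 then Exit(False);
    if n > 0 then Result := True;
    Inc(i, n + 1);
  end;
end;

begin
  WriteLn(LooksLikeUTF8('abc'));                // ASCII only
  WriteLn(LooksLikeUTF8('caf'#$C3#$A9));        // e-acute encoded as UTF-8
  WriteLn(LooksLikeUTF8('caf'#$E9' au lait'));  // e-acute as a single ANSI byte
end.
```

In CheckFileType such a test could run on the same 2048-byte sample that is already read for the #0 search, but where exactly it belongs is up to the author.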

Also, I wonder about this function:

Code: [Select]
function TCharEncStream.GetUTF8Text: UTF8String;
begin
  Result := inherited GetUTF8Text;
  if (UniStreamType = ufANSI) or ((UniStreamType = ufUtf8) and (not HasBom)) then
  begin
    if not ForceType then ANSIEnc := LConvencoding.GuessEncoding(Result);
    if ANSIEnc <> EncodingUTF8 then
    begin
      UniStreamType := ufANSI;
      Result := ConvertEncoding(Result, ANSIEnc, EncodingUTF8);
    end;
  end;
end;   
It contains this line:
    if not ForceType then ANSIEnc := LConvencoding.GuessEncoding(Result);
If it were changed to
    if not ForceType then ANSIEnc := LConvencoding.GuessEncoding(LeftStr(Result,100));
would it use fewer resources?
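
One thing to watch with LeftStr(Result,100): if the text is UTF-8, byte 100 may fall in the middle of a multi-byte char, and I am not sure how GuessEncoding treats a truncated sequence at the end. A sketch of a safer cut follows. `SamplePrefix` is a hypothetical helper, not an existing LConvEncoding function. It backs off at most 3 bytes, so text in a single-byte encoding is never shortened by more than that.

```pascal
program SampleCut;
{$mode objfpc}{$H+}

// Hypothetical helper: returns at most MaxBytes bytes from the start of S,
// moving the cut back by up to 3 bytes so that it does not land inside a
// UTF-8 multi-byte sequence.
function SamplePrefix(const S: AnsiString; MaxBytes: Integer): AnsiString;
var
  L, Steps: Integer;
begin
  L := Length(S);
  if MaxBytes < L then
  begin
    L := MaxBytes;
    if L < 0 then L := 0;
    Steps := 0;
    // If the first byte after the cut is a continuation byte (10xxxxxx),
    // the cut splits a sequence, so move it back one byte and test again.
    while (L > 0) and (Steps < 3) and ((Ord(S[L + 1]) and $C0) = $80) do
    begin
      Dec(L);
      Inc(Steps);
    end;
  end;
  Result := Copy(S, 1, L);
end;

begin
  // 'ab' + e-acute in UTF-8 (2 bytes) + 'cd'
  WriteLn(Length(SamplePrefix('ab'#$C3#$A9'cd', 3)));
  WriteLn(Length(SamplePrefix('ab'#$C3#$A9'cd', 4)));
end.
```

The call would then read GuessEncoding(SamplePrefix(Result, 100)). Whether a sample this short is enough for a reliable guess is a separate question.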
« Last Edit: July 06, 2012, 09:10:03 am by paskal »
