Unicode Text Input

Post by nageswaragunupudi » Tue Jul 05, 2016 10:50 am

FWH provides two controls, TGet and TEdit, for input of single-line text. While these controls can be used to input any other type of variable, the present discussion is limited to text input only.

In Unicode applications, i.e., applications built with FW_SetUnicode( .t. ), these controls accept Unicode character input; otherwise the input is restricted to Ansi characters only.

TGet:

TGet restricts the input to the length of the character variable. When we create a Get for a variable of length 10 bytes, the input is restricted to 10 bytes only. While this means 10 English characters, the number of Unicode characters that fit can be anywhere between 3 and 10, because a native-language character can take up to 3 bytes in utf8. This is suitable in many cases, such as when we use DBF to store the data. If we plan to use DBF for native language applications, we need to provide a character field length of 3 times the number of characters we expect to store in the field.
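The arithmetic behind the "3 to 10 characters" range can be checked with a small sketch. This is plain Python used only as an illustration, not FWH code; the helper name chars_that_fit and the sample strings are assumptions made for this example.

```python
# Illustration only: plain Python, not FWH code.
# chars_that_fit is a hypothetical helper written for this example.

def chars_that_fit(text: str, field_bytes: int) -> int:
    """Count how many leading characters of text fit in field_bytes bytes of UTF-8."""
    used = 0
    count = 0
    for ch in text:
        size = len(ch.encode("utf-8"))  # bytes needed by this character
        if used + size > field_bytes:
            break
        used += size
        count += 1
    return count

FIELD_BYTES = 10  # a 10-byte character variable or DBF field

samples = {
    "English": "A" * 10,  # 1 byte per character in UTF-8
    "Greek": "α" * 10,    # 2 bytes per character in UTF-8
    "Telugu": "అ" * 10,   # 3 bytes per character in UTF-8
}

for name, text in samples.items():
    total_bytes = len(text.encode("utf-8"))
    print(name, "chars:", len(text), "bytes:", total_bytes,
          "fit in field:", chars_that_fit(text, FIELD_BYTES))
```

Under these assumptions, a field meant to hold 10 Telugu characters needs 30 bytes, which is the "3 times" rule mentioned above.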

By default, TGet truncates the input to the fixed length in bytes. Truncating a utf8 string at an arbitrary byte position can split a multi-byte character, which makes the string invalid and results in garbage. FWH manages to retain the meaningful part of the input.
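The post does not show how FWH retains the meaningful part, so the following is only a sketch of one common approach, written in plain Python rather than FWH code: cut at the byte limit, then drop any incomplete trailing character. The function name truncate_utf8 is an assumption for this example.

```python
# Illustration only: one common way to truncate UTF-8 safely.
# This is not necessarily how FWH does it internally.

def truncate_utf8(text: str, max_bytes: int) -> str:
    """Return the longest prefix of text whose UTF-8 encoding fits in max_bytes."""
    raw = text.encode("utf-8")[:max_bytes]
    # A naive byte cut may end in the middle of a multi-byte character.
    # errors="ignore" discards that incomplete trailing sequence.
    return raw.decode("utf-8", errors="ignore")

text = "అ" * 4                       # 4 characters, 12 bytes in UTF-8
naive = text.encode("utf-8")[:10]    # cut in the middle of the 4th character

try:
    naive.decode("utf-8")
except UnicodeDecodeError:
    print("naive byte cut is not valid UTF-8")

print(len(truncate_utf8(text, 10)))  # number of whole characters kept
```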

RDBMSs such as MySql and MsSql are different. When we create a column of width n ( VARCHAR(n) in MySql with utf8 character set or NVARCHAR(n) in other RDBMSs ), the column actually accommodates n characters, not just n bytes.

Specific to MySql, a VarChar(10) column with utf8 charset, used over a Unicode connection, can accept and store a native language text of 10 characters, though it may require up to 30 bytes of space in utf8 and 20 bytes in utf16.

3rd party libs based on libmysql report a width of 30 in such cases, while ADO reports 10. In the first case the width indicates the maximum number of bytes and in the second case it indicates the number of characters.

When we retrieve the value of the field, we get a string padded to 30 bytes with the libs, and the actual string with ADO, whose length in bytes can be less or more than the field size.

If we use the variable (padded to 30 bytes) with a Get, we can enter Unicode, Ansi or numeric text up to 30 bytes. The result can be more than 10 characters ( 30 Ansi chars or numbers, or anywhere from 10 to 30 utf8 chars ).
If we use this value for Update/Insert, it is likely that MySql rejects it and the operation fails. (Note: If we try to store a string larger than the column size, MySql (other RDBMSs too) does not truncate but rejects it with an error)
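The mismatch between a 30-byte buffer and a 10-character column can be sketched as follows. This is plain Python for illustration; the constants and the helper names fits_buffer and fits_column are assumptions for this example and are not part of FWH, libmysql or ADO.

```python
# Illustration only: why a 30-byte edit buffer can yield values
# that a 10-character column will not accept.

COLUMN_CHARS = 10                # e.g. VARCHAR(10) with a utf8 charset
BUFFER_BYTES = COLUMN_CHARS * 3  # width reported in bytes (30)

def fits_buffer(text: str) -> bool:
    """True if text fits the byte-limited edit buffer."""
    return len(text.encode("utf-8")) <= BUFFER_BYTES

def fits_column(text: str) -> bool:
    """True if text fits the character-limited column."""
    return len(text) <= COLUMN_CHARS

candidates = ["అ" * 10, "α" * 15, "A" * 30]
for text in candidates:
    print(len(text), "chars,", len(text.encode("utf-8")), "bytes:",
          "buffer ok" if fits_buffer(text) else "buffer full",
          "/",
          "column ok" if fits_column(text) else "column would reject")
```

All three candidates fit in the 30-byte buffer, but only the first one is within the 10-character column width.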

So what is the best behavior we can expect for a Unicode application?

FWH's implementation of MySql reports a width of 10 only, which unambiguously means 10 characters.
When we retrieve the value, we get a string whose UTF8 length is exactly 10 characters. Its length in bytes may be more than 10, but that should not matter.

TGet has been improved to handle such cases. When we establish a Unicode connection to a MySql server, TGet understands how to limit the length of the input. Instead of limiting the number of bytes, the input is limited to the number of characters. The result of the Get is also 10 characters (of whatever language) in length, and the number of bytes should not matter. The result is perfectly suitable for Update/Insert into a MySql database. Not only TGet but also XBrowse and TDataRow, when used with the Browse, behave appropriately.

This has absolutely no effect on ANSI input because in that case the number of characters is exactly the same as the number of bytes.

In any case, we can use the clause LIMITBYCHARS .t. (or .f.) to toggle this behavior on or off.

@ r,c GET ...LIMITBYCHARS .f. // Normal Default. Restricts the input by Bytes. Useful for storing data in DBF
@ r,c GET ...LIMITBYCHARS .t. // Useful for storing data in RDBMSs (This is default when using FWH MySql)
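The difference between the two settings can be modelled with a short sketch. This is plain Python, not FWH code, and limit_input is a hypothetical function that only imitates the two limiting rules described here; it says nothing about how TGet is implemented.

```python
# Illustration only: a model of limiting input by bytes vs. by characters.

def limit_input(text: str, width: int, limit_by_chars: bool) -> str:
    """Limit text to width characters, or to width bytes of UTF-8."""
    if limit_by_chars:
        # Character limit: suited to RDBMS columns declared in characters.
        return text[:width]
    # Byte limit: suited to fixed-width DBF fields. Any incomplete
    # trailing character left by the byte cut is dropped.
    return text.encode("utf-8")[:width].decode("utf-8", errors="ignore")

typed = "అ" * 12
print(len(limit_input(typed, 10, limit_by_chars=True)))   # characters kept
print(len(limit_input(typed, 10, limit_by_chars=False)))  # characters kept

# For pure ASCII text both rules give the same result.
ascii_text = "ABCDEFGHIJKLMNOP"
print(limit_input(ascii_text, 10, True) == limit_input(ascii_text, 10, False))
```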

Gets in Unicode mode have some limitations. We cannot use any picture clause other than "@!". The reason is that applying a picture to a Unicode string in xHarbour results in garbage. Though this is possible with Harbour, input into a Get with picture clauses is still a problem.

Even in a Unicode application we need pure ANSI Gets to input English text and numbers, and we also need to use picture clauses.

Adding the clause ANSI to a Get statement forces the Get to behave like a normal Ansi Get with all facilities of pictures etc., even in a Unicode application. For Ansi applications this clause has no effect.

TEdit:

TEdit() offers a clean and simple interface for text input. By default, TEdit accepts unlimited characters. With the clause LIMITTEXT, input is restricted to the number of characters (not bytes) of the variable. The clause LIMITTEXT BY <n> CHARS limits the text to <n> chars.

Known issues:
Inline text editing of XBrowse in a Unicode application is inconvenient. Arrow keys, Home and End keys exit the inline Get. As of now, the only way is to delete the text and enter the new value.
We are attending to this.
Regards

G. N. Rao.
Hyderabad, India