Guides/Unicode
J has essentially complete support for unicode in code development and applications. The only minor limitation is that identifiers used in programming must be in 7-bit ascii, but this does not affect the use of unicode in applications. For example:
   a=. '沒有問題'     NB. assign unicode text to a
   沒有=. 1 2 3       NB. identifiers must be 7-bit ascii
|spelling error
Literal text is assumed to be in utf8 format. J also has a 2-byte unicode datatype, and the verb u: converts back and forth. Both representations can be useful when programming, so take care to ensure the right datatype is being used.
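The round trip through u: can be sketched directly. This is a minimal illustration, assuming the left arguments 7 (utf8 to unicode, if necessary), 8 (to utf8) and 3 (to integer code points) behave as described in the vocabulary entry for u:. The names s and w are only for illustration.

```j
s =. 'é'            NB. literal (utf8): one character stored as 2 bytes
w =. 7 u: s         NB. convert utf8 to the 2-byte unicode datatype

#s                  NB. 2  (bytes)
#w                  NB. 1  (characters)
3 u: w              NB. 233  (the code point of é)
s -: 8 u: w         NB. 1  (converting back to utf8 recovers s)
```

The two counts differ because # counts items of the array: bytes for literal text, characters for the unicode datatype.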
utf8 is used in:
window driver interface
file names in the 1!:x family
plot package interface
pcre regular expressions
*c arguments of dll calls
2-byte unicode is used in:
manipulation of character arrays
*w arguments of dll calls
Standard utilities include:
utf8       convert to utf8
ucp        convert to unicode datatype (cp = code point), if necessary
uucp       convert char or utf8 to wchar
ucpcount   code point (glyph or character) count
datatype   noun data type
The name a defined above is in literal text, and therefore assumed to be utf8. More examples:
   a
沒有問題
   datatype a       NB. a is type literal
literal
   #a               NB. the count of a is the count of its utf8 representation
12
   a. i. a          NB. bytes in the utf8 representation
230 178 146 230 156 137 229 149 143 233 161 140
   b=. ucp a        NB. b is a converted to 2-byte unicode
   b                NB. b displays the same as a
沒有問題
   #b               NB. the count of b is the number of characters
4
   datatype b       NB. b is type unicode
unicode
   a -: utf8 b      NB. utf8 converts b back to a
1
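The remaining utilities, and the reason character manipulation is best done on the 2-byte form, can be shown with a short sketch. It assumes the standard library definitions of ucp, uucp, utf8, ucpcount and datatype are loaded, as in a normal J session. The expected results in the comments follow from the descriptions listed above.

```j
a =. '沒有問題'            NB. literal text (utf8)
b =. ucp a                 NB. the same text in 2-byte unicode

ucpcount a                 NB. 4: characters, whereas #a counts 12 bytes
datatype ucp 'abc'         NB. literal: ucp converts only if necessary
datatype uucp 'abc'        NB. unicode: uucp always yields wchar

2 {. b                     NB. 沒有: the first two characters
# 2 {. a                   NB. 2: only two bytes, not a complete character
utf8 2 {. ucp a            NB. 沒有 again, as utf8 (6 bytes)
```

Taking items from a splits a character in the middle of its utf8 encoding, so convert with ucp first, manipulate, then convert back with utf8 if a utf8 result is needed.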
Scripts
Script cp2utf8 converts plain text files encoded in a code page to utf8.
Script ufread reads unicode text files in various formats.
Renaming Unicode Files
We will define a verb that wraps the Win32 API function MoveFileW:
NB.*MoveFile v move file, e.g. from MoveFile to
MoveFile=: 'kernel32 MoveFileW > i *w *w' cd ;&uucp
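The tacit definition above is a fork with a noun as its left tine: both arguments are converted with uucp, linked into a two-item boxed list, and passed to cd with the declaration string. An explicit equivalent may be easier to read. This is a sketch only: the name MoveFileX is hypothetical, and calling it requires Windows, since it uses kernel32.

```j
NB. Explicit equivalent of the tacit MoveFile (hypothetical name MoveFileX).
NB. x is the existing file name, y is the new file name.
MoveFileX=: 4 : 0
  'kernel32 MoveFileW > i *w *w' cd (uucp x) ; uucp y
)
```

The *w parameters expect 2-byte unicode, which is why both names go through uucp rather than being passed as utf8.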
For testing, we first create a file whose name is in one unicode range:
   load'files dir'
   'test' fwrite 'Test - 沒有問題'    NB. create a file
4
   fread 'Test - 沒有問題'
test
   0 0{:: 1!:0]'Test -*'              NB. dir find
Test - 沒有問題
We then rename it to a name in another unicode range and read it back by its new name:
   'Test - 沒有問題' MoveFile 'Test - Без проблем'    NB. rename
1
   fread 'Test - Без проблем'
test
   0 0{:: 1!:0]'Test -*'                              NB. dir find
Test - Без проблем
Links
Guides/UnicodeGettingStarted - notes on using unicode
Vocabulary entry for u: - definition of verb u:
Unicode Test Drive - Oleg's notes on unicode
UTF-8 and Unicode Standards - good background reading