本文是小编为大家收集整理的关于UTF16十六进制到文本的处理/解决方法,可以参考本文帮助大家快速定位并解决问题,中文翻译不准确的可切换到English标签页查看源文。
问题描述
我有UTF-16十六进制表示,例如" 0633064406270645",是阿拉伯语的"".
我想将其转换为同等文本.在PostgreSQL中有没有直接的方法?
我可以将UTF代码点转换为以下;不幸的是,似乎不支持UTF16.关于如何在PostgreSQL中执行的任何想法,我会写一个功能?
SELECT convert_from (decode (E'D8B3D984D8A7D985', 'hex'),'UTF8'); "سلام" SELECT convert_from (decode (E'0633064406270645', 'hex'),'UTF16'); ERROR: invalid source encoding name "UTF16" ********** Error **********
推荐答案
是的,Postgres不支持UTF-16.
然而,它确实支持基本多语言平面.换句话说,如果您有任何跨越多个16位代码单元的UTF-16字符,则需要将它们转换为相应的代码点.
其他推荐答案
PostgreSQL本身不支持UTF-16.我建议您在将数据提供给DB之前将数据转换为UTF-8.如果为时已晚(您的数据库中已经存在错误的数据),则可以使用这些维护功能转换UTF-16的数据(逻辑从 wikipedia ):
-- convert from bytea, containing UTF-16-BE data CREATE OR REPLACE FUNCTION convert_from_utf16be(utf16_data bytea, invalid_replacement text DEFAULT '?') RETURNS text LANGUAGE sql IMMUTABLE STRICT AS $function$ WITH source(unit) AS ( SELECT (get_byte(utf16_data, i) << 8) | get_byte(utf16_data, i + 1) FROM generate_series(0, octet_length(utf16_data) - 2, 2) i ), codes(lag, unit, lead) AS ( SELECT lag(unit, 1) OVER (), unit, lead(unit, 1) OVER () FROM source ) SELECT string_agg(CASE WHEN unit >= 56320 AND unit <= 57343 THEN CASE WHEN lag >= 55296 AND lag <= 56319 THEN '' -- already processed ELSE invalid_replacement END WHEN unit >= 55296 AND unit <= 56319 THEN CASE WHEN lead >= 56320 AND lead <= 57343 THEN chr((unit << 10) + lead - 56613888) ELSE invalid_replacement END ELSE chr(unit) END, '') FROM codes $function$; -- convert from bytea, containing UTF-16-LE data CREATE OR REPLACE FUNCTION convert_from_utf16le(utf16_data bytea, invalid_replacement text DEFAULT '?') RETURNS text LANGUAGE sql IMMUTABLE STRICT AS $function$ WITH source(unit) AS ( SELECT get_byte(utf16_data, i) | (get_byte(utf16_data, i + 1) << 8) FROM generate_series(0, octet_length(utf16_data) - 2, 2) i ), codes(lag, unit, lead) AS ( SELECT lag(unit, 1) OVER (), unit, lead(unit, 1) OVER () FROM source ) SELECT string_agg(CASE WHEN unit >= 56320 AND unit <= 57343 THEN CASE WHEN lag >= 55296 AND lag <= 56319 THEN '' -- already processed ELSE invalid_replacement END WHEN unit >= 55296 AND unit <= 56319 THEN CASE WHEN lead >= 56320 AND lead <= 57343 THEN chr((unit << 10) + lead - 56613888) ELSE invalid_replacement END ELSE chr(unit) END, '') FROM codes $function$; -- convert from bytea, containing UTF-16 data (with or without BOM) CREATE OR REPLACE FUNCTION convert_from_utf16(utf16_data bytea, invalid_replacement text DEFAULT '?') RETURNS text LANGUAGE sql IMMUTABLE STRICT AS $function$ SELECT CASE COALESCE(octet_length(utf16_data), 0) WHEN 0 THEN '' WHEN 1 THEN invalid_replacement ELSE CASE substring(utf16_data FOR 2) WHEN E'\\xFFFE' THEN convert_from_utf16le(substring(utf16_data FROM 3), invalid_replacement) ELSE convert_from_utf16be(substring(utf16_data FROM 3), invalid_replacement) END END $function$;
使用这些功能,您可以从各种UTF-16转换:
SELECT convert_from_utf16be(decode('0633064406270645D852DF62', 'hex')), convert_from_utf16le(decode('330644062706450652D862DF', 'hex')), convert_from_utf16(decode('FEFF0633064406270645D852DF62', 'hex')), convert_from_utf16(decode('FFFE330644062706450652D862DF', 'hex')); -- convert_from_utf16be | convert_from_utf16le | convert_from_utf16 | convert_from_utf16 ------------------------+----------------------+--------------------+------------------- -- سلام𤭢 | سلام𤭢 | سلام𤭢 | سلام𤭢
其他推荐答案
convert_from(或一般的PostgreSQL)不支持UTF-16,但您可以诉诸于一种可选的语言.
plperlu中的示例(需要数据库超级用户特权来创建该函数,CREATE LANGUAGE plperlu如果还没有创建):
CREATE FUNCTION decode_utf16(text) RETURNS text AS $$ require Encode; return Encode::decode("UTF-16BE", pack("H*", $_[0])); $$ immutable language plperlu; => select decode_utf16('0633064406270645'); decode_utf16 -------------- سلام
问题描述
I have UTF-16 hex representation such as “0633064406270645” which is "سلام" in Arabic language.
I would like to convert it to its text equivalent. Is there a straight way to do that in PostgreSQL?
I can convert the UTF code point like below; unfortunately it seems UTF16 is not supported. Any ideas on how to do it in PostgreSQL, worst case I will write a function?
SELECT convert_from (decode (E'D8B3D984D8A7D985', 'hex'),'UTF8'); "سلام" SELECT convert_from (decode (E'0633064406270645', 'hex'),'UTF16'); ERROR: invalid source encoding name "UTF16" ********** Error **********
推荐答案
That's right, Postgres doesn't support UTF-16.
However, it does support Unicode escape sequences:
SELECT U&'\0633\0644\0627\0645'
But keep in mind that Unicode code points and UTF-16 code units are only equivalent in the Basic Multilingual Plane. In other words, if you have any UTF-16 characters which span multiple 16-bit code units, you'll need to translate them to the corresponding code point yourself.
其他推荐答案
PostgreSQL does not support UTF-16 natively. I suggest you to convert your data to UTF-8 before supplying it to the DB. If it's too late (wrong data already exists in your DB), you can use these maintenance functions to convert data from UTF-16 (logic copied from wikipedia):
-- convert from bytea, containing UTF-16-BE data CREATE OR REPLACE FUNCTION convert_from_utf16be(utf16_data bytea, invalid_replacement text DEFAULT '?') RETURNS text LANGUAGE sql IMMUTABLE STRICT AS $function$ WITH source(unit) AS ( SELECT (get_byte(utf16_data, i) << 8) | get_byte(utf16_data, i + 1) FROM generate_series(0, octet_length(utf16_data) - 2, 2) i ), codes(lag, unit, lead) AS ( SELECT lag(unit, 1) OVER (), unit, lead(unit, 1) OVER () FROM source ) SELECT string_agg(CASE WHEN unit >= 56320 AND unit <= 57343 THEN CASE WHEN lag >= 55296 AND lag <= 56319 THEN '' -- already processed ELSE invalid_replacement END WHEN unit >= 55296 AND unit <= 56319 THEN CASE WHEN lead >= 56320 AND lead <= 57343 THEN chr((unit << 10) + lead - 56613888) ELSE invalid_replacement END ELSE chr(unit) END, '') FROM codes $function$; -- convert from bytea, containing UTF-16-LE data CREATE OR REPLACE FUNCTION convert_from_utf16le(utf16_data bytea, invalid_replacement text DEFAULT '?') RETURNS text LANGUAGE sql IMMUTABLE STRICT AS $function$ WITH source(unit) AS ( SELECT get_byte(utf16_data, i) | (get_byte(utf16_data, i + 1) << 8) FROM generate_series(0, octet_length(utf16_data) - 2, 2) i ), codes(lag, unit, lead) AS ( SELECT lag(unit, 1) OVER (), unit, lead(unit, 1) OVER () FROM source ) SELECT string_agg(CASE WHEN unit >= 56320 AND unit <= 57343 THEN CASE WHEN lag >= 55296 AND lag <= 56319 THEN '' -- already processed ELSE invalid_replacement END WHEN unit >= 55296 AND unit <= 56319 THEN CASE WHEN lead >= 56320 AND lead <= 57343 THEN chr((unit << 10) + lead - 56613888) ELSE invalid_replacement END ELSE chr(unit) END, '') FROM codes $function$; -- convert from bytea, containing UTF-16 data (with or without BOM) CREATE OR REPLACE FUNCTION convert_from_utf16(utf16_data bytea, invalid_replacement text DEFAULT '?') RETURNS text LANGUAGE sql IMMUTABLE STRICT AS $function$ SELECT CASE COALESCE(octet_length(utf16_data), 0) WHEN 0 THEN '' WHEN 1 THEN invalid_replacement ELSE CASE substring(utf16_data FOR 2) WHEN E'\\xFFFE' THEN convert_from_utf16le(substring(utf16_data FROM 3), invalid_replacement) ELSE convert_from_utf16be(substring(utf16_data FROM 3), invalid_replacement) END END $function$;
With these functions, you can convert from all kind of UTF-16:
SELECT convert_from_utf16be(decode('0633064406270645D852DF62', 'hex')), convert_from_utf16le(decode('330644062706450652D862DF', 'hex')), convert_from_utf16(decode('FEFF0633064406270645D852DF62', 'hex')), convert_from_utf16(decode('FFFE330644062706450652D862DF', 'hex')); -- convert_from_utf16be | convert_from_utf16le | convert_from_utf16 | convert_from_utf16 ------------------------+----------------------+--------------------+------------------- -- سلام𤭢 | سلام𤭢 | سلام𤭢 | سلام𤭢
其他推荐答案
convert_from (or PostgreSQL in general) doesn't support UTF-16, but you may resort to one of the optional languages that do.
Example in plperlu (requires database superuser privileges to create the function, and CREATE LANGUAGE plperlu if not created already):
CREATE FUNCTION decode_utf16(text) RETURNS text AS $$ require Encode; return Encode::decode("UTF-16BE", pack("H*", $_[0])); $$ immutable language plperlu; => select decode_utf16('0633064406270645'); decode_utf16 -------------- سلام