Oracle to file in UTF-8, with Perl
Par Pierrick, mercredi 24 janvier 2007 à 13:45 / categorie: Perl / tags: / #95 / rss
I have an Oracle database with UTF-8 data inside. In a Perl script, I want to extract, transform and load these data in a file.
To have data in UTF-8 in the Oracle database, I had to create the database with AL32UTF8 as NLS_CHARACTERSET:
SQL> select VALUE from V$NLS_PARAMETERS where PARAMETER = 'NLS_CHARACTERSET'; VALUE ------------- AL32UTF8
Now every varchar2 will contain UTF-8 encoded characters. Warning: by default the length you define on a varchar2 column is a byte length information, not a character length. Indeed with UTF-8 encoding a character may need 1, 2 3 or even 4 bytes. So you also have to tell Oracle that NLS_LENGTH_SEMANTICS is CHAR and not BYTE, it is a matter of consistancy, and I really don't understand the purpose of this parameter... (it has caused me a lot of problems in the past).
SQL> select value from V$NLS_PARAMETERS where PARAMETER = 'NLS_LENGTH_SEMANTICS'; VALUE ------------- CHAR
When an Oracle client communicates with an Oracle server, there are 2 encoding informations to consider: the encoding of data send by the client, or the wanted encoding of data for data fetching and the database encoding. The database encoding is defined as described earlier, at database creation. The client side encoding is defined in the environment variable NLS_LANG.
In a Perl script, we can locally set an environment variable with the %ENV hash. $ENV{NLS_LANG} for example. The DBI connector to Oracle, DBD::Oracle tells Perl which encoding is associated to data (unfortunately, this is not the case with DBD::mysql).
Still in a Perl script, you have to tell the output file handle which encoding to use for writing data.
#!/usr/bin/perl use strict; use warnings; use Getopt::Long; use DBI; my %opt = (); GetOptions( \%opt, qw/outfile=s dbname=s dbuser=s dbpass=s/ ); foreach my $option (qw/dbname dbuser dbpass/) { if (not defined($opt{$option})) { die('Error: --'.$option.' must be set'."\n"); } } $opt{outfile} ||= '/tmp/out'; # We tell the Oracle connector which encoding to use. The last part of the # environment variable NLS_LANG is relevant concerning data # encoding. AL32UTF8 is what everybody assume to be true UTF-8 $ENV{NLS_LANG} = 'AMERICAN_AMERICA.AL32UTF8'; # We open an output file handle, telling explicitely that we want to output # UTF-8 data. open(my $outfile, '>:utf8', $opt{outfile}) or die 'cannot open file "'.$opt{outfile}.'"'; my $dbh = DBI->connect( 'DBI:Oracle:'.$opt{dbname}, $opt{dbuser}, $opt{dbpass}, ) or die "can't connect to database"; my $query = 'select string from topic245'; my $sth = $dbh->prepare($query); $sth->execute() or die "can't execute input query"; while (my ($string) = $sth->fetchrow_array()) { # The data is slightly transformed (uppercase) to show that Perl # "understands" the characters, it's not only bytes. print {$outfile} uc $string, "\n"; } $sth->finish(); $dbh->disconnect(); close($outfile);
The Perl script in use:
[pierrick@plegall] ~ $ sqlplus root@TALEND [...] Connecté à : Oracle9i Release 9.2.0.1.0 - Production JServer Release 9.2.0.1.0 - Production SQL> select * from TOPIC245; STRING --------- salut Noël ➊➋➒ éàù€ SQL> quit Déconnecté de Oracle9i Release 9.2.0.1.0 - Production JServer Release 9.2.0.1.0 - Production [pierrick@plegall] ~ $ oracle_to_file-utf8.pl --dbname=TALEND --dbuser=root --dbpass=**** --out=/tmp/out.txt [pierrick@plegall] ~ $ cat /tmp/out.txt SALUT NOËL ➊➋➒ ÉÀÙ€ [pierrick@plegall] ~ $ file /tmp/out.txt /tmp/out.txt: UTF-8 Unicode text
Commentaires
Aucun commentaire pour le moment.
Ajouter un commentaire
Les commentaires pour ce billet sont fermés.