October 26, 2011

one small case about Invisible Character

One day I got a ticket, a sqlload tkt: import data into one table from an CSV file The CSV file is:
/x368/backup/app01/home/ctort/cdi_city_cort_final.csv

And the target table's structure is:
SYS @ ctort > desc ctortALAPP01.t_bpc_cleaning
 Name      Type
-----------------------------
 COUNTRY         VARCHAR2(200)
 ACTUALCOUNTRY   VARCHAR2(200)
 STATE           VARCHAR2(200)
 ACTUALSTATE     VARCHAR2(200)
 CITY            VARCHAR2(200)
 ACTUALCITY      VARCHAR2(200)


below are some rows in the cdi_city_cort_final.csv file:
US,UNITED STATES,MI,MICHIGAN,ADRIAN,ADRIAN
US,UNITED STATES,NC,NC,ADVANCE,ADVNCE
US,UNITED STATES,MN,MN,AFTON,AFTON
US,UNITED STATES,MA,MA,AGAWAM,AGAWAM


The columns and fileds are one-to-one correspondence, it seems an simple work, and the sqlload is done without any issue.
User confirmed after a quick check , all rows were there.


But the next day, user came to me again, they encounter some problem.
I check the table as what they told me, but all data seems OK, no issue:
SYS @ ctort > select CITY,ACTUALCITY  from ctortALAPP01.t_bpc_cleaning where  country ='US' and state = 'KS' and rownum<2;
CITY            ACTUALCITY
--------------------------
MCCRACKEN       MCCRACKEN


User blablabla said much, but  i still don't understand what the issue is. So the user share the screen to me.
Now i understand.
This table:ctortALAPP01.t_bpc_cleaning
When user connect through sql developer,and randam select one row from the table:
SYS @ ctort > select CITY,ACTUALCITY  from ctortALAPP01.t_bpc_cleaning where  country ='US' and state = 'KS' and rownum<2;
CITY            ACTUALCITY
--------------------------
MCCRACKEN       MCCRACKEN


then user try to copy and paste the rows content to another wordpad file,
when he copy the column CITY's content MCCRACKEN,what he paste still shows excatly MCCRACKEN,It is OK.
But when he copy the column ACTUALCITY's MCCRACKEN,after paste into txt file, it shows "MCCRACKEN     ".
Yes, you don't misread ,  it shows "" symbol and mytiply space.
WHY?
The user is confused with this so he came to me again.
Now let's check into it.
It is easy to guess it is because some invisible character exist in the fileds ACTUALCITY.
Let's find out what it is:
SYS @ ctort > select CITY,ACTUALCITY  from ctortALAPP01.t_bpc_cleaning where  country ='US' and state = 'KS' and rownum<2;
CITY            ACTUALCITY
--------------------------
MCCRACKEN       MCCRACKEN


SYS @ ctort > select dump(CITY,16), dump(ACTUALCITY,16)  from ctortALAPP01.t_bpc_cleaning where  country ='US' and state = 'KS' and rownum<5;
DUMP(CITY,16)
-----------------------
DUMP(ACTUALCITY,16)
-----------------------
Typ=1 Len=9:  4d,43,43,52,41,43,4b,45,4e
Typ=1 Len=10: 4d,43,43,52,41,43,4b,45,4e,
d


We can see clear now, there is one more "d" at the end of ACTUALCITY column.
This character can't show in the sqlplus, but once user copy and paste, we can see it.
So, where is this d come from?
Go to check my sqlload command again, it is nothing wrong.
Then the problem must come from the CSV file which offered by the user. Let's go to check it:
cihcisddb735[oracle]_ctort> cat /x368/backup/app01/home/ctort/cdi_city_cort_final.csv|more
COUNTRY,ACTUALCOUNTRY,STATE,ACTUALSTATE,CITY ,ACTUALCITY
DE,GERMANY,,,AACHEN,AACHEN
DK,DENMARK,,,AALBORG,AALBORG
.................


Obviously we can't find any clue from cat command.
Let's check the file into HEX:
vi cdi_city_cort_final.csv
and then input:
:%!xxd

HEX code is below:
0000000: 434f 554e 5452 592c 4143 5455 414c 434f  COUNTRY,ACTUALCO
0000010: 554e 5452 592c 5354 4154 452c 4143 5455  UNTRY,STATE,ACTU
0000020: 414c 5354 4154 452c 4349 5459 202c 4143  ALSTATE,CITY ,AC
0000030: 5455 414c 4349 5459 0d0a 4445 2c47 4552  TUALCITY..DE,GER
0000040: 4d41 4e59 2c2c 2c41 4143 4845 4e2c 4141  MANY,,,AACHEN,AA
0000050: 4348 454e 0d0a 444b 2c44 454e 4d41 524b  CHEN..DK,DENMARK
................


Go through it, very soon, we concentrate on the 0d0a which exist at end of each row.
Why it is 0d0a
In linux the wrap symbol should only be 0a.
Only when in windows the wrap row symbol is 0d0a.
And the extra 0d caused above problem to us.


Now we already can guess that this CSV file is come from windows not linux.
Go to check with user:
Kevin : 2:41:48 AM is the cdi_city_cort_final.csv file, generate from windows?
Mural : 2:42:04 AM yes


OK,till now the whole thing comes to light.
Now that we found out the reason, there are mutiply methods to fix this.
We can copy the CSV file to windows enviroment and sqlload again from windows client.
But what if the CSV file is big or we don't want to reload again?

Don't worry, we can directly remove the last Hex code "0d" from the filed with oracle provided function.

You may ask:
We can remove last word "p" from string "testp" by:
rtrim("p" from "testp")
But How to trim an invisible character?  We even can't see it! We even can't type it out though keybord!


Don't worry, because we already know the ASCII code for the invisible character is 0d,  we can remove it in below way:

rtrim(Input_String,UTL_RAW.CAST_TO_VARCHAR2('0d'))

UTL_RAW.CAST_TO_VARCHAR2 is used to convert HEX code back to character. It is the opposite of dump function.




More......