« 3. Find x. | Main | PDO_OCI does not support CLOBs »

When PHP and Oracle assume the worst about each other

If you ever want a developer to doubt him/herself, present them with questions about character sets. It's not that hard (at least according to Joel), but it's sometimes unnecessarily complicated. Even Oracle, which usually gets storing of data right on the money has had issues with character sets. If you want to use UTF8 exactly as defined by Unicode, in Oracle, you'd set your character set to AL32UTF8. This is because AL32UTF8 stores characters beyond U+FFFF as four bytes (as it should), but Oracle's "UTF8" stores these characters as a sequence of two UTF-16 surrogate characters encoded using UTF-8 (or six bytes per character). (More on UTF8 vs. AL32UTF8 in Oracle.)

Needless to say, even when you *know* you set up your database correctly for supporting UTF8, the path to debug issues may be frustrating and full of red herrings.

When setting up Oracle to support UTF8, you want the NLS parameters for character sets to be set as follows::

select * from V$NLS_PARAMETERS where parameter like '%CHARACTERSET%';

NLS_CHARACTERSET
AL32UTF8

NLS_NCHAR_CHARACTERSET
AL16UTF16

(AL16UTF16 is used for nvarchar data types, but more on those in another post.)

PHP can easily make sure to display UTF8 characters correctly with the mbstring (multi-byte string) extension. For database support, it is pretty standard these days to use PDO OCI extension to connect to Oracle from PHP.

On a recent project, that was our configuration but unfortunately, we ran into a problem where any nice Chinese characters we put in our input fields would not display correctly.

For a string '試驗' stored in a field my_field in table my_table, we would get the following results:

sqlplus> select my_field from my_table;

Would result in:
??????

The first response was a moment of doubt. Did we set up the database wrong? Why can't we see the characters?

Now the interesting part was that the following would work (sort of):

sqlplus> select asciistr(my_field) from my_table;

Would result in:
\8A66\9A57

This alleviated the doubt to some extent as we could obviously store the correct information about the characters in the database, but it still didn't explain why we could only see data on the site as a bunch of question marks. The PHP extension was set up correctly, the database could accept the data in any character set. What's wrong?

As mentioned above, character sets tend to be confusing and there is a lot of information about bugs related to their incorrect usage. Trying to find some relevant information online about our problem was taking a while. Even Ask Tom was surprisingly not helpful as Tom claims in one of his posts that he does not have that much experience with storing characters from non-Western character sets.

The first clue was the obvious NLS_LANG environment variable. It's actually so obvious, we assumed it was set correctly in our environment, so we skipped immediately to a bug report report we found: Problem using UTF-8 database with pdo_oci. This seemed to be exactly the problem we were experiencing. Basically the problem was that neither PHP nor Oracle would trust each other to properly handle UTF-8 characters and they would pass them to each other as '??????' to be on the safe side. Argh! The solution was simple: DSN string was modified to contain 'charset=AL32UTF8'. This basically tells PHP and Oracle to play nice and trust each other.

We set the charset in the DSN string, but unfortunately our characters were still question marks. What the heck? we said. Everything should be working by now. After few hours of thinking up horrific lost data scenarios, we decided to go back to basics and look at the NLS_LANG variable. It turns out it wasn't set. For us, I set it as:

NLS_LANG='AMERICAN_AMERICA.AL32UTF8'

Note: when you set this, you may encounter the following error:
ORA-12705: Cannot access NLS data files or invalid environment specified

There is a simple fix to this. Simply make sure that you can read the $ORACLE_HOME/nls/data directory and all files within it.

This finally fixed the problem. Phew! Overall, the entire solution were these 3 easy steps:

1. Modify the DSN string was modified to contain 'charset=AL32UTF8'

2. Apache now has an environment variable NLS_LANG set as follows: NLS_LANG='AMERICAN_AMERICA.AL32UTF8'

3. Make sure that $ORACLE_HOME/nls/data and all files within are world-readable.

The database and PHP are now successfully telling each other to store and display characters in correct character sets and the developers are breathing easy. Props to Ben and JP for help with debugging and Brian for fixing the permissions!

Comments (1)

I logged a bug (with patch) against PDO_OCI to improve error its handling: http://pecl.php.net/bugs/bug.php?id=11345

Post a comment

(If you haven't left a comment here before, you may need to be approved by the site owner before your comment will appear. Until then, it won't appear on the entry. Thanks for waiting.)

About

Maggie Nelson is a software developer in New York City. She likes open source (especially PHP) but is willing to break her "free is good" philosophy for Oracle.


Subscribe to my feed

Copyright 2008 Maggie Nelson

New York PHP Community Member

Powered by
Movable Type 3.34