Multibyte string handling in PHP

Upload: danielrhodes

Post on 24-May-2015

Category:

Technology


DESCRIPTION

Multibyte string handling in PHP with the mbstring extension

TRANSCRIPT

  • 1. Multibyte string handling in PHP with the mbstring extension. By Daniel Rhodes of Warp Asylum (www.warpasylum.co.uk). As seen on Zend.com!

2. What is mbstring for?

  • Multibyte string handling

3. Supports many character encodings, including Unicode
4. Supports some different national languages *
5. Character encoding conversion
6. Some Japanese-specific functions / settings
7. Mbstring is NOT...

  • A magic way to get the internals of the PHP interpreter itself to suddenly operate natively with Unicode (you'll have to wait and follow the development of PHP itself for that!)

8. How to get mbstring

  • Regular (but not built-in) extension for PHP

9. On most PHP servers it's already there, so...
10. ...just switch it on!
11. Present and switched on out of the box in Zend Server (CE and upwards)
12. If not present, then download it; you shouldn't need to compile anything
13. Some key directives for mbstring

  • mbstring.internal_encoding

14. mbstring.language
15. See http://php.net/manual/en/mbstring.configuration.php
16. Easy peasy in Zend Server
17. Enough, now let's rock and roll!
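The two directives named above also have runtime counterparts, which can be handy when php.ini is out of reach. A minimal sketch, assuming the application works in UTF-8 throughout (that choice is mine, not the talk's). Note also that newer PHP versions deprecate the mbstring.internal_encoding ini setting in favour of default_charset, so the runtime call is the safer illustration:

```php
<?php
// Assumption: the application works in UTF-8 throughout.
mb_internal_encoding('UTF-8'); // runtime counterpart of mbstring.internal_encoding
mb_language('uni');            // runtime counterpart of mbstring.language ('uni' selects Unicode / UTF-8)

// Called without arguments, both functions report the current setting.
echo mb_internal_encoding(), "\n";
echo mb_language(), "\n";
```

Setting the internal encoding once means the mb_* functions no longer need an explicit encoding argument on every call.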

  • Mbstring gives us multibyte-safe versions of the core string handling functions

18. For example, we all know strlen()
19. So let's have a look at mb_strlen()
20. mb_strlen()
21. More mb_strlen()
22. Even more mb_strlen()
23. Still rocking and rolling...
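The code from the mb_strlen() slides did not survive in this transcript, so here is a small stand-in rather than the original example. It assumes the script file is saved as UTF-8, and the sample word is my own:

```php
<?php
// Assumption: this source file is saved as UTF-8.
$word = '日本語'; // three characters, nine bytes in UTF-8

echo strlen($word), "\n";             // counts bytes: 9
echo mb_strlen($word, 'UTF-8'), "\n"; // counts characters: 3
```

Passing the encoding explicitly keeps the result independent of whatever mbstring.internal_encoding happens to be on the server.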

  • Mbstring gives us multibyte-safe versions of the core string handling functions
  • For example, we all know strpos()

24. So let's have a look at mb_strpos()
25. mb_strpos()
26. More mb_strpos()
27. Wrapping up and moving on

  • Mbstring gives us multibyte-safe versions of the core string handling functions
  • There are LOTS of these multibyte-safe versions of core string handling functions, so please have a look

28. BE CAREFUL: you can make calls to strlen() (etc.) automatically call mb_strlen(); this is the mbstring.func_overload directive (deprecated as of PHP 7.2 and removed in PHP 8.0)
29. Mbstring-specific functions

  • Let's look at character encodings first
  • mb_detect_encoding()

30. mb_convert_encoding()
31. LOTS of supported encodings
32. ( http://php.net/manual/en/mbstring.supported-encodings.php )
33. The mbstring.detect_order directive comes into play here
34. mb_detect_encoding()
35. mb_detect_order()
36. More mb_detect_order()
37. Mbstring-specific functions
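To make the detection slides concrete, here is a hedged sketch of mb_detect_order() and mb_detect_encoding() working together. The candidate list and the sample text are my own assumptions. Detection is always a guess from the bytes, so strict mode is used to rule out candidates the bytes cannot belong to:

```php
<?php
// Assumption: this source file is saved as UTF-8.
$sjis = mb_convert_encoding('日本語', 'SJIS', 'UTF-8'); // build a Shift_JIS sample

// Candidate encodings, tried in this order (same role as mbstring.detect_order).
mb_detect_order(['ASCII', 'UTF-8', 'SJIS', 'EUC-JP']);

// Strict mode (third argument) rejects candidates in which the bytes are invalid.
$guess = mb_detect_encoding($sjis, mb_detect_order(), true);
echo $guess, "\n";
```

Short strings are often valid in several encodings at once, which is why the order of the candidate list matters.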

  • Still looking at character encodings ...
  • mb_detect_encoding()

38. mb_convert_encoding()
39. LOTS of supported encodings
40. ( http://php.net/manual/en/mbstring.supported-encodings.php )
41. The mbstring.detect_order directive comes into play here
42. mb_convert_encoding()
43. More mb_convert_encoding()
44. Regular expressions on multibyte strings
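The mb_convert_encoding() slides were also code. A minimal stand-in follows, assuming a UTF-8 script file and converting to Shift_JIS and back. The sample text is mine:

```php
<?php
// Assumption: this source file is saved as UTF-8.
$utf8 = '日本語';

// Argument order is (string, to-encoding, from-encoding).
$sjis = mb_convert_encoding($utf8, 'SJIS', 'UTF-8');
$back = mb_convert_encoding($sjis, 'UTF-8', 'SJIS');

echo strlen($utf8), "\n";      // 9 bytes in UTF-8
echo strlen($sjis), "\n";      // 6 bytes in Shift_JIS
var_dump($back === $utf8);     // the round trip is lossless here
```

Naming the source encoding explicitly, as above, avoids relying on detection at all when the origin of the data is known.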

  • mb_regex_encoding(), but note that the set of supported encodings for regex purposes is actually a SUBSET of the encodings supported by mbstring itself!

45. mb_ereg()
46. mb_ereg_match()
47. mb_ereg_replace()
48. and many more!
49. Note: PHP's regular preg_*() functions can also do UTF-8 with the /u pattern modifier!
50. mb_ereg()
51. More mb_ereg()
52. Summary of mbstring functions

  • Directive setting functions

53. Multibyte versions of regular string functions
54. Regex functions
55. Encoding detection / conversion
56. Japanese-specific functions / settings
57. Other misc stuff
58. Putting it all together

  • Mbstring gets PHP working with multibyte

59. BUT...
60. Don't forget your:
61. PHP script files (best to have the file's encoding the same as mbstring.internal_encoding)
62. Database
63. Output (i.e. probably HTML)
64. Input (i.e. form submissions etc.)
65. Multibyting your database

  • Oracle: I'm no expert, but look at NCHAR as opposed to CHAR ('N' for 'national language')

66. PostgreSQL: I'm no expert, but IIRC Postgres automagically understands and converts input / output character encodings
67. MySQL: you can choose a collation for the server, each schema, each table, each column!
68. MySQL: collation means charset + sort order (for example, CS means case-sensitive sort order)
69. More multibyting your database

  • MySQL: easiest to put everything on 'utf8_unicode_ci' or 'utf8_general_ci' (but note that these two collations differ when sorting and doing LIKE etc.! See http://forums.mysql.com/read.php?103,187048,188748#msg-188748)

70. You'll need to do an SQL query of:
71. SET NAMES utf8 and / or SET CHARACTER SET utf8
72. After connecting and before reading / writing
73. (otherwise characters will become garbled)
74. Multibyting your output HTML
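In PHP, the SET NAMES step above might look like the sketch below. The helper name and the use of PDO are my assumptions, since the talk does not say which database API it used. Newer MySQL versions also offer utf8mb4, which may be the better choice where available:

```php
<?php
// Hypothetical helper: $pdo is assumed to be an open PDO connection to MySQL.
function use_utf8_connection(PDO $pdo): void
{
    // Tell the server which encoding the client sends and expects back.
    $pdo->exec('SET NAMES utf8');
}

// Usage (connection details are placeholders, not from the talk):
// $pdo = new PDO('mysql:host=localhost;dbname=example', 'user', 'secret');
// use_utf8_connection($pdo);
```

The call belongs straight after connecting, before any query reads or writes text.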

  • For example, for UTF8, we need to output this kind of HTTP header:

75. Content-Type: text/html; charset=UTF-8
76. i.e. header("Content-Type: text/html; charset=UTF-8");
77. Possible, but less desirable, to output it as a meta tag in the HTML:
78. <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
79. (or simply <meta charset="UTF-8"> for HTML5)
80. Don't forget lang=xy or xml:lang=xy where needed
81. Multibyting your input
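Put together, the output side could look something like this. It is a sketch only: the page content and the lang value are placeholders of mine, and it assumes nothing has been sent to the browser before header() runs:

```php
<?php
// Assumption: nothing has been sent to the browser yet; header() must come first.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=UTF-8');
}

// The meta tag repeats the charset as a fallback, and lang marks the language.
$page = '<!DOCTYPE html>'
      . '<html lang="ja"><head><meta charset="UTF-8"><title>Example</title></head>'
      . '<body><p>日本語</p></body></html>';

echo $page;
```

When the header and the meta tag disagree, browsers generally trust the HTTP header, which is one reason to prefer it.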

  • Theoretically possible, but unusual, to have a <form> with a different encoding to its host page

82. Out of the box, form data on an SJIS host page comes in as SJIS, form data on an EUC-JP host page comes in as EUC-JP, and so on
83. Or have I just been very lucky?
84. Look at the mbstring.http_input directive if struggling
85. That's all folks!
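If form data does arrive in the host page's encoding, one way to handle it is to validate and convert it by hand. The helper below is hypothetical (its name and the choice of UTF-8 as the internal encoding are my assumptions), and the "submitted" value is simulated rather than read from $_POST:

```php
<?php
// Hypothetical helper: convert a submitted value to UTF-8, assuming the page
// (and therefore the form data) uses the encoding named in $pageEncoding.
function to_internal(string $value, string $pageEncoding): ?string
{
    // Reject byte sequences that are not valid in the expected encoding.
    if (!mb_check_encoding($value, $pageEncoding)) {
        return null;
    }
    return mb_convert_encoding($value, 'UTF-8', $pageEncoding);
}

// Simulate a field submitted from an SJIS page (in real code: $_POST['name']).
$submitted = mb_convert_encoding('日本語', 'SJIS', 'UTF-8');
echo to_internal($submitted, 'SJIS'), "\n"; // 日本語
```

Checking before converting means malformed input is refused rather than silently turned into question marks.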

  • I'll leave you with some things to think about:
  • Iconv (a built-in extension) might be better if all you need is to detect / change encodings

86. Previous examples of preg_match() failing will probably work with the /u pattern modifier (to enable UTF-8)
87. No mb version of trim() or preg_match_all()
88. Mbstring in action: http://twitter.com/japxlate http://mapanese.info
89. Questions welcome at [email protected]
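On the missing mb version of trim(): a workaround can be sketched with preg_replace() and the /u modifier. The function name below is hypothetical and deliberately not mb_trim, because newer PHP releases (8.4, as far as I know) ship a real mb_trim():

```php
<?php
// Hypothetical helper (not part of mbstring): trim Unicode whitespace from both ends.
function mb_trim_sketch(string $s): string
{
    // \p{Z} covers Unicode separators such as the ideographic space U+3000;
    // \s covers the ASCII whitespace that trim() would remove.
    // preg_replace() returns null on invalid UTF-8, so fall back to the input.
    return preg_replace('/^[\s\p{Z}]+|[\s\p{Z}]+$/u', '', $s) ?? $s;
}

$padded = "\u{3000} 日本語 \u{3000}";
echo '[', trim($padded), ']', "\n";           // ideographic spaces remain
echo '[', mb_trim_sketch($padded), ']', "\n"; // [日本語]
```

preg_match_all() needs no such workaround for UTF-8 data, since it accepts the /u modifier directly.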