the whole shebang
This commit is contained in:
125
vendor/patchwork/utf8/README.md
vendored
Normal file
@@ -0,0 +1,125 @@
Patchwork UTF-8
===============

Patchwork UTF-8 provides both:

- a portability layer for Unicode handling in PHP, and
- a class that mirrors the nearly complete set of native string functions,
  enhanced with UTF-8 [grapheme cluster](http://unicode.org/reports/tr29/)
  awareness.

It can also serve as a documentation source referencing the practical problems
that arise when handling UTF-8 in PHP: Unicode concepts, related algorithms,
bugs in PHP core, workarounds, etc.

Portability
-----------

Unicode handling in PHP is best performed using a combination of `mbstring`,
`iconv`, `intl` and `pcre` with the `u` flag enabled. But when an application is
expected to run on many servers, you should be aware that these four extensions
are not always enabled.

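To see which of these extensions a given server actually provides, a small check along the following lines can help. This is only a sketch: `unicodeExtensionStatus()` is a hypothetical helper name and is not part of Patchwork UTF-8; it relies solely on PHP's built-in `extension_loaded()`.

```php
<?php
// Hypothetical helper (not part of Patchwork UTF-8): report which of the
// four extensions named above are loaded in the current PHP runtime.
function unicodeExtensionStatus(): array
{
    $status = [];
    foreach (['mbstring', 'iconv', 'intl', 'pcre'] as $extension) {
        // extension_loaded() is a PHP core function.
        $status[$extension] = extension_loaded($extension);
    }
    return $status;
}

foreach (unicodeExtensionStatus() as $extension => $loaded) {
    echo $extension, ': ', $loaded ? 'loaded' : 'missing', PHP_EOL;
}
```

Running this on each target server shows where the portability layer would have to step in.
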
Patchwork UTF-8 provides pure PHP implementations for three of those four
extensions. Here is the set of portability fallbacks that are currently
implemented:

- *utf8_encode, utf8_decode*,
- `mbstring`: *mb_convert_encoding, mb_decode_mimeheader, mb_encode_mimeheader,
  mb_convert_case, mb_internal_encoding, mb_list_encodings, mb_strlen,
  mb_strpos, mb_strrpos, mb_strtolower, mb_strtoupper, mb_substitute_character,
  mb_substr, mb_stripos, mb_stristr, mb_strrchr, mb_strrichr, mb_strripos,
  mb_strstr*,
- `iconv`: *iconv, iconv_mime_decode, iconv_mime_decode_headers,
  iconv_get_encoding, iconv_set_encoding, iconv_mime_encode, ob_iconv_handler,
  iconv_strlen, iconv_strpos, iconv_strrpos, iconv_substr*,
- `intl`: *Normalizer, grapheme_extract, grapheme_stripos, grapheme_stristr,
  grapheme_strlen, grapheme_strpos, grapheme_strripos, grapheme_strrpos,
  grapheme_strstr, grapheme_substr*.

`pcre` compiled with Unicode support is required.

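Whether `pcre` was built with Unicode support can be probed at runtime. The sketch below assumes that an unsupported `u` modifier makes `preg_match()` fail; `pcreHasUnicodeSupport()` is a hypothetical name, not a function of this library.

```php
<?php
// Hypothetical helper (not part of Patchwork UTF-8): returns true when PCRE
// accepts the `u` modifier, i.e. was compiled with Unicode (UTF-8) support.
function pcreHasUnicodeSupport(): bool
{
    // "\xC3\xA9" is "é" in UTF-8. With /u, `.` matches it as one code point.
    // Without UTF-8 support, compiling the pattern fails: preg_match() emits
    // a warning (silenced here) and returns false instead of 1.
    return @preg_match('/^.$/u', "\xC3\xA9") === 1;
}

var_dump(pcreHasUnicodeSupport());
```
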
Patchwork\Utf8
--------------

[Grapheme clusters](http://unicode.org/reports/tr29/) should always be
considered when working with generic Unicode strings. The `Patchwork\Utf8`
class implements the nearly complete set of native string functions that need
UTF-8 grapheme cluster awareness. Function names, arguments and behavior
carefully replicate the native PHP string functions, so usage is straightforward.

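To illustrate why grapheme clusters matter, the following sketch counts the same string three ways using only core PHP and `pcre`. It does not call `Patchwork\Utf8`; the variable names are illustrative.

```php
<?php
// "é" written as "e" followed by U+0301 COMBINING ACUTE ACCENT.
$text = "e\xCC\x81";

// Bytes: the native strlen() counts bytes.
$bytes = strlen($text);

// Code points: with the `u` modifier, `.` matches one code point
// (the `s` modifier lets it match newlines too).
$codePoints = preg_match_all('/./su', $text);

// Grapheme clusters: \X matches one extended grapheme cluster.
$graphemes = preg_match_all('/\X/u', $text);

printf("%d bytes, %d code points, %d grapheme cluster(s)\n", $bytes, $codePoints, $graphemes);
// prints "3 bytes, 2 code points, 1 grapheme cluster(s)"
```

A user sees a single character here, which is the count a grapheme-aware `strlen` is meant to return.
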
Some more functions are also provided to help with handling UTF-8 strings:

- *isUtf8()*: checks if a string contains well-formed UTF-8 data,
- *toAscii()*: generic UTF-8 to ASCII transliteration,
- *strtocasefold()*: Unicode transformation for caseless matching,
- *strtonatfold()*: generic case-sensitive transformation for collation matching.

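As a rough idea of what a well-formedness check involves, here is a minimal sketch built on the `u` modifier of `pcre`. `isWellFormedUtf8()` is a hypothetical stand-in; it is not presented as the implementation of `isUtf8()`.

```php
<?php
// Hypothetical stand-in for a well-formedness check; not necessarily how
// Patchwork\Utf8::isUtf8() is implemented.
function isWellFormedUtf8(string $bytes): bool
{
    // With the `u` modifier, PCRE validates the subject as UTF-8 before
    // matching. The empty pattern matches any valid subject (returns 1);
    // an invalid subject makes preg_match() return false.
    return preg_match('//u', $bytes) === 1;
}

var_dump(isWellFormedUtf8("caf\xC3\xA9")); // bool(true): "é" encoded as UTF-8
var_dump(isWellFormedUtf8("caf\xE9"));     // bool(false): lone ISO-8859-1 byte
```
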
Mirrored string functions are:
*strlen, substr, strpos, stripos, strrpos, strripos, strstr, stristr, strrchr,
strrichr, strtolower, strtoupper, wordwrap, chr, count_chars, ltrim, ord, rtrim,
trim, str_ireplace, str_pad, str_shuffle, str_split, str_word_count, strcmp,
strnatcmp, strcasecmp, strnatcasecmp, strncasecmp, strncmp, strcspn, strpbrk,
strrev, strspn, strtr, substr_compare, substr_count, substr_replace, ucfirst,
lcfirst, ucwords, number_format, utf8_encode, utf8_decode*.

Missing are *printf*-family functions.

Usage
-----

The recommended way to install Patchwork UTF-8 is [through
composer](http://getcomposer.org). Just create a `composer.json` file and run
the `php composer.phar install` command to install it:

    {
        "require": {
            "patchwork/utf8": "1.1.*"
        }
    }

Then, early in your bootstrap sequence, you have to configure your environment:

```php
\Patchwork\Utf8\Bootup::initAll(); // Enables the portability layer and configures PHP for UTF-8
\Patchwork\Utf8\Bootup::filterRequestUri(); // Redirects to a UTF-8 encoded URL if it's not already the case
\Patchwork\Utf8\Bootup::filterRequestInputs(); // Sanitizes HTTP inputs to UTF-8 NFC
```

Run `phpunit` in the `tests/` directory to see the code in action.

Make sure that you are confident about using UTF-8 by reading
[Character Sets / Character Encoding Issues](http://www.phpwact.org/php/i18n/charsets)
and [Handling UTF-8 with PHP](http://www.phpwact.org/php/i18n/utf-8),
or [PHP et UTF-8](http://julp.lescigales.org/articles/3-php-et-utf-8.html) for French readers.

You should also get familiar with the concepts of
[Unicode Normalization](http://en.wikipedia.org/wiki/Unicode_equivalence) and
[Grapheme Clusters](http://unicode.org/reports/tr29/).

Do not blindly replace every use of PHP's string functions. Most of the time you
will not need to, and you would introduce a significant performance overhead
into your application.

Screen your input on the *outer perimeter* so that only well-formed UTF-8 passes
through. When dealing with badly formed UTF-8, you should not try to fix it.
Instead, treat it as ISO-8859-1 and use `utf8_encode()` to get a UTF-8
string. Also, don't forget to choose one Unicode normalization form and stick to
it. NFC is the most widely used today.

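The screening rule above can be sketched as a single perimeter filter. This is an illustration under stated assumptions, not the library's `filterRequestInputs()`: `screenInput()` is a hypothetical name, and because `utf8_encode()` is deprecated as of PHP 8.2, the sketch performs the equivalent ISO-8859-1 to UTF-8 conversion by hand. It does not apply NFC normalization.

```php
<?php
// Hypothetical perimeter filter (names are illustrative, not library API).
function screenInput(string $raw): string
{
    // Well-formed UTF-8 passes through untouched; `//u` makes PCRE validate it.
    if (preg_match('//u', $raw) === 1) {
        return $raw;
    }

    // Otherwise treat every byte as an ISO-8859-1 character. Bytes below 0x80
    // are already valid UTF-8; each byte from 0x80 to 0xFF maps to the code
    // point of the same value, which UTF-8 encodes as two bytes.
    return preg_replace_callback(
        '/[\x80-\xFF]/',
        static function (array $match): string {
            $byte = ord($match[0]);
            return chr(0xC0 | ($byte >> 6)) . chr(0x80 | ($byte & 0x3F));
        },
        $raw
    );
}

var_dump(screenInput("caf\xE9") === "caf\xC3\xA9"); // bool(true)
```

Note that the fallback re-encodes the whole string, including any byte sequences that happened to be valid UTF-8, which matches the advice not to attempt a partial repair.
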
This library is incompatible with `mbstring.func_overload` and will not work if
that php.ini setting is enabled.

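A bootstrap script could guard against this setting before using the library. The following is a sketch only: `funcOverloadEnabled()` is a hypothetical helper, and it assumes a missing directive (as on PHP 8, where `mbstring.func_overload` was removed) should count as disabled.

```php
<?php
// Hypothetical guard (not part of Patchwork UTF-8).
function funcOverloadEnabled(): bool
{
    // ini_get() returns false when the directive does not exist (for example
    // on PHP 8, where it was removed); the cast turns that into 0.
    return (int) ini_get('mbstring.func_overload') !== 0;
}

if (funcOverloadEnabled()) {
    throw new RuntimeException('Disable mbstring.func_overload in php.ini first.');
}
```
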
Licensing
---------

Patchwork\Utf8 is free software; you can redistribute it and/or modify it under
the terms of (at your option):
- [Apache License v2.0](http://apache.org/licenses/LICENSE-2.0.txt), or
- [GNU General Public License v2.0](http://gnu.org/licenses/gpl-2.0.txt).

Unicode handling requires tedious work to implement and maintain in the
long run. As such, contributions such as unit tests, bug reports, comments or
patches licensed under both licenses are very welcome.

I hope many projects will adopt this code and together help solve the Unicode
subject for PHP.