string,icu,capitalization,title-case
@Jongware should get the credit for explaining this so well. Your question might be: does ICU have a list of non-capitalized words? The short answer for ICU is no. CLDR (from which ICU gets its data) used to have "stop words" for search purposes, but they were not...
c++,boost,nlp,icu,boost-locale
I haven't found a way to do this with boost::locale::boundary, but it is possible to do it with ICU directly by creating a customized RuleBasedBreakIterator, rather than using one provided by createWordInstance. Locale locale("fr_FR"); UErrorCode statusError = U_ZERO_ERROR; UParseError parseError = { 0 }; // get rules from a default...
From the INSTALL file: The stringi package depends on the ICU4C >= 50 library. So libicu42 is far too old. If you check the install.R file, you'll find the following lines: mirrors <- c("http://static.rexamine.com/packages/", "http://www.mini.pw.edu.pl/~gagolews/stringi/", "http://www.ibspan.waw.pl/~gagolews/stringi/") A couple of lines later you'll find something like this: if (!grepl("^https?://", href)) { # try to...
unicode,elasticsearch,lucene,tokenize,icu
The issue was that I was using the Elasticsearch Sense plugin to query this, and it was not encoding the data properly. It worked fine when I wrote a test using the Python client library.
Self Answer: If you know your current index in code units, then you can use ICU::ubrk_current() to return the code unit index most recently returned by ICU::ubrk_next(). See: http://icu-project.org/apiref/icu4c/ubrk_8h.html#a4f8b67527c5c9d9205a3446506ffeefc I was mostly confused by the ambiguity in the descriptions of the UBreakIterator methods. However, after contacting ICU support, "Character Index"...
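A minimal sketch of the same relationship, using the JDK's java.text.BreakIterator rather than ICU's C API (an assumption made so the example runs without ICU installed; the class and method names below are hypothetical). It checks that current() reports the boundary most recently returned by first() or next(), and that boundaries are expressed in UTF-16 code units, not in user-perceived characters.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper class; not part of ICU or the original answer.
final class BoundaryDemo {

    // Collects every character boundary of the text as a UTF-16 code unit index.
    static List<Integer> characterBoundaries(String text) {
        BreakIterator it = BreakIterator.getCharacterInstance(Locale.ROOT);
        it.setText(text);
        List<Integer> boundaries = new ArrayList<>();
        for (int pos = it.first(); pos != BreakIterator.DONE; pos = it.next()) {
            // current() is expected to match the boundary just returned.
            if (it.current() != pos) {
                throw new IllegalStateException("current() differs from last boundary");
            }
            boundaries.add(pos);
        }
        return boundaries;
    }

    public static void main(String[] args) {
        // 'e' followed by U+0301 (combining acute accent), then 'x':
        // one user-perceived character spanning two code units, then one more.
        System.out.println(characterBoundaries("e\u0301x"));
    }
}
```

The boundary after the accented letter is a code unit index, so it lands at 2 even though only one user-perceived character precedes it.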
unicode,icu,unicode-normalization
ICU is correct. To understand why, have a look at the Canonical Composition Algorithm which is defined in chapter 3 of the Unicode Standard: D117 Canonical Composition Algorithm: Starting from the second character in the coded character sequence (of a Canonical Decomposition or Compatibility Decomposition) and proceeding sequentially to the...
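To see canonical composition in action without ICU, here is a small sketch using the JDK's java.text.Normalizer, which is assumed here to follow the same Unicode normalization rules (the class and helper names are hypothetical). The second input shows why the result can look surprising: the marks are first put into canonical order, the base composes with the dot below, and the dot above is left uncomposed because no precomposed character exists for that pair.

```java
import java.text.Normalizer;
import java.util.stream.Collectors;

// Hypothetical demo class; uses the JDK normalizer, not ICU.
final class NfcDemo {

    // Normalizes a string to NFC (canonical decomposition, then canonical composition).
    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    // Renders each code point as U+XXXX so the result is unambiguous.
    static String hex(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // e + combining acute composes to a single precomposed character.
        System.out.println(hex(nfc("e\u0301")));      // U+00E9
        // D + dot above + dot below: reordered, then only the dot below composes.
        System.out.println(hex(nfc("D\u0307\u0323"))); // U+1E0C U+0307
    }
}
```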
Probably the best TP would be this. You can try various option combinations with the ICU Collation Demo. (give "alternate=shifted" a try)...
It looks like my best bet is grabbing the headers from the Apple website. This repository also includes the makefile for libicucore.dylib, which uses --with-data-packaging=archive to put the ICU data tables in a standalone file, /usr/share/icu/icudt51l.dat.
The simplest way would be to check the script of the first character: static Character.UnicodeScript getScript(String s) { if (s.isEmpty()) { return null; } return Character.UnicodeScript.of(s.codePointAt(0)); } A better way would be to find the most frequently occurring script: static Character.UnicodeScript getScript(String s) { int[] counts = new int[Character.UnicodeScript.values().length]; Character.UnicodeScript...
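Since the second snippet above is cut off, here is one possible self-contained sketch of the "most frequently occurring script" idea. It is not the original author's code. Two choices are assumptions of this sketch: the class name ScriptGuess, and skipping COMMON, INHERITED and UNKNOWN code points so that spaces, digits and punctuation do not outvote the letters.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical helper class illustrating a majority vote over scripts.
final class ScriptGuess {

    // Returns the script with the most code points in s, or null if none qualifies.
    static Character.UnicodeScript dominantScript(String s) {
        Map<Character.UnicodeScript, Integer> counts =
                new EnumMap<>(Character.UnicodeScript.class);
        // Iterate by code point so supplementary characters are counted once.
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            i += Character.charCount(cp);
            Character.UnicodeScript script = Character.UnicodeScript.of(cp);
            // Assumption: shared or unassigned code points carry no script signal.
            if (script == Character.UnicodeScript.COMMON
                    || script == Character.UnicodeScript.INHERITED
                    || script == Character.UnicodeScript.UNKNOWN) {
                continue;
            }
            counts.merge(script, 1, Integer::sum);
        }
        Character.UnicodeScript best = null;
        int bestCount = 0;
        for (Map.Entry<Character.UnicodeScript, Integer> e : counts.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Six Cyrillic letters against five Latin letters.
        System.out.println(dominantScript("\u041f\u0440\u0438\u0432\u0435\u0442, world"));
    }
}
```

On a tie, this sketch returns whichever script comes first in the enum's declaration order, which is arbitrary; a caller that cares about ties would need its own rule.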
I've tested your code with version 53.1 and got correct Turkish output. I then retested with version 3.4.4 and got English output, as you described in your question. Most likely, you are pulling in an older version through a transitive Maven dependency....
node.js,ubuntu,shared-libraries,docker,icu
As @mscdex has pointed out, libicu was looking for the libicu52 package. Somehow the repository got updated allowing me to pull the new libicu which depends on libicu52 that isn't available in the repository of 12.04, but in 14.04. Since there is no official trusted build of 14.04 in the...
What you call a Unicode number is typically called a code point. If you want to work with C++ and Unicode strings, ICU offers an icu::UnicodeString class. You can find the documentation here. To create a UnicodeString holding a single character, you can use the constructor that takes a code point...
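The difference between a code point and the UTF-16 code units that store it can be illustrated with the JDK, whose String type is also UTF-16 based. This is a sketch in Java rather than the ICU C++ API discussed above, and the class and method names are hypothetical.

```java
// Hypothetical demo class; shows code points versus UTF-16 code units.
final class CodePointDemo {

    // Builds a string holding exactly one character from its code point.
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        String bmp = fromCodePoint(0x00E9);     // inside the Basic Multilingual Plane
        String astral = fromCodePoint(0x1F600); // outside it, needs a surrogate pair
        // length() counts code units; codePointCount() counts code points.
        System.out.println(bmp.length() + " unit(s), "
                + bmp.codePointCount(0, bmp.length()) + " code point(s)");
        System.out.println(astral.length() + " unit(s), "
                + astral.codePointCount(0, astral.length()) + " code point(s)");
    }
}
```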
You may need a library like commerceguys/intl, which has a getName() method.
maven,plugins,elasticsearch,icu
Okay... After a couple of hours I found the issue. The configuration was not right and ES was not picking up the analyzer correctly. This did it: { "index": { "analysis": { "analyzer": { "ducet_sort": { "tokenizer": "keyword", "filter": [ "icu_collation" ] } } } } } The settings bit...
To upgrade the existing ICU in your XAMPP installation, you'll need to: copy php_intl.dll to your_xamp_folder/php; copy all the icu*.dll files to your_xamp_folder/apache/bin; check that extension=php_intl.dll is enabled in your_xamp_folder/php/php.ini; restart Apache. Let me know if it works (I'm currently on nginx). Edit: you'll find php_intl.dll here all icu*.dll files are...