Proper title case in ICU [Does ICU have a list of non-capitalized words?]

string,icu,capitalization,title-case

@Jongware should get the credit for explaining this so well. Your question might be rephrased as: does ICU have a list of non-capitalized words? The short answer for ICU is no. CLDR (from which ICU gets its data) used to have "stop words" for search purposes, but they were not...

Is it possible to get boost locale boundary analysis to split on apostrophes?

c++,boost,nlp,icu,boost-locale

I haven't found a way to do this with boost::locale::boundary, but it is possible to do it with ICU directly by creating a customized RuleBasedBreakIterator, rather than using one provided by createWordInstance. Locale locale("fr_FR"); UErrorCode statusError = U_ZERO_ERROR; UParseError parseError = { 0 }; // get rules from a default...

How to install stringi library from archive and install the local icu52l.zip

r,ubuntu,icu,stringi

From the INSTALL file: The stringi package depends on the ICU4C >= 50 library. So libicu42 is far too old. If you check the install.R file you'll find the following lines: mirrors <- c("http://static.rexamine.com/packages/", "http://www.mini.pw.edu.pl/~gagolews/stringi/", "http://www.ibspan.waw.pl/~gagolews/stringi/") A couple of lines later you'll find something like this: if (!grepl("^https?://", href)) { # try to...

Elastic Search : Configuring icu_tokenizer for czech characters

unicode,elasticsearch,lucene,tokenize,icu

The issue was that I was using the Elasticsearch Sense plugin to run the query, and it was not encoding the data properly. It worked fine when I wrote a test using the Python client library.

BreakIterator ICU - Get byte length of grapheme cluster

unicode,iterator,icu

Self Answer: If you know your current index in code units, then you can use ICU::ubrk_current() to return the code unit index most recently returned by ICU::ubrk_next(). See: http://icu-project.org/apiref/icu4c/ubrk_8h.html#a4f8b67527c5c9d9205a3446506ffeefc I was mostly confused by the ambiguity in the descriptions of the UBreakIterator methods. However, after contacting ICU support, "Character Index"...
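
The idea of taking the difference between two consecutive boundary indices and converting that span to bytes can be sketched as follows. This is only an illustration under stated assumptions: it uses the JDK's java.text.BreakIterator rather than ICU's UBreakIterator, the class and method names (GraphemeBytes, utf8LengthsPerCluster) are hypothetical, and the JDK's cluster rules may differ from ICU's for complex sequences such as emoji.

```java
import java.nio.charset.StandardCharsets;
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of ICU: uses the JDK BreakIterator as a stand-in.
final class GraphemeBytes {
    /** Returns the UTF-8 byte length of each character-boundary segment of text. */
    static List<Integer> utf8LengthsPerCluster(String text) {
        BreakIterator it = BreakIterator.getCharacterInstance(Locale.ROOT);
        it.setText(text);
        List<Integer> lengths = new ArrayList<>();
        // Boundaries are UTF-16 code unit indices, like the indices ICU reports.
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            // Re-encode the segment between two boundaries to measure it in bytes.
            lengths.add(text.substring(start, end).getBytes(StandardCharsets.UTF_8).length);
        }
        return lengths;
    }
}
```

For example, the string "e\u0301a" (an "e" followed by a combining acute accent, then "a") has boundaries at code unit indices 0, 2 and 3; the first segment encodes to three UTF-8 bytes and the second to one.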

Is this Unicode NFC conversion correct?

unicode,icu,unicode-normalization

ICU is correct. To understand why, have a look at the Canonical Composition Algorithm which is defined in chapter 3 of the Unicode Standard: D117 Canonical Composition Algorithm: Starting from the second character in the coded character sequence (of a Canonical Decomposition or Compatibility Decomposition) and proceeding sequentially to the...
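
A quick way to experiment with canonical composition is to normalize a few short sequences and print the resulting code points. The sketch below is an assumption-laden illustration: it uses the JDK's java.text.Normalizer instead of ICU, the class and method names (NfcDemo, nfc, codePoints) are hypothetical, and the sample strings are not taken from the original question.

```java
import java.text.Normalizer;
import java.util.stream.Collectors;

// Hypothetical demo class: JDK Normalizer used in place of ICU's normalizer.
final class NfcDemo {
    /** Normalizes s to NFC. */
    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    /** Formats s as space-separated code points, e.g. "U+0061 U+0300". */
    static String codePoints(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // "e" + combining acute composes into a single precomposed character.
        System.out.println(codePoints(nfc("e\u0301")));
        // The combining grave (ccc 230) is blocked from the starter "a" by the
        // preceding combining overline (also ccc 230), so nothing composes.
        System.out.println(codePoints(nfc("a\u0305\u0300")));
    }
}
```

The second case shows the blocking rule of the composition algorithm: a combining mark cannot combine with the starter when a mark of equal or higher combining class sits between them.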

What is the theory behind unicode collation sorting

icu,uca

Probably the best TP would be this. You can try various option combinations with the ICU Collation Demo. (give "alternate=shifted" a try)...

Generate or find C headers for ICU core on OSX

c,osx,osx-mavericks,icu

It looks like my best bet is grabbing the headers from the Apple website. This repository also includes the makefile for libicucore.dylib, which uses --with-data-packaging=archive to put the ICU data tables in a standalone file, /usr/share/icu/icudt51l.dat.

How to detect the script system/alphabet from UTF-8 input?

java,unicode,utf-8,icu

The simplest way would be to check the script of the first character: static Character.UnicodeScript getScript(String s) { if (s.isEmpty()) { return null; } return Character.UnicodeScript.of(s.codePointAt(0)); } A better way would be to find the most frequently occurring script: static Character.UnicodeScript getScript(String s) { int[] counts = new int[Character.UnicodeScript.values().length]; Character.UnicodeScript...
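
Since the excerpt above is cut off, here is a self-contained sketch of the "most frequently occurring script" idea. It is not the answer's original code: the class name ScriptDetector and the method dominantScript are hypothetical, and skipping the COMMON, INHERITED and UNKNOWN scripts (so that spaces, digits and punctuation do not win the count) is an assumption added for this illustration.

```java
import java.lang.Character.UnicodeScript;
import java.util.EnumMap;
import java.util.Map;

// Hypothetical helper illustrating a frequency count over scripts.
final class ScriptDetector {
    /** Returns the most frequent script in s, or null if no scripted characters occur. */
    static UnicodeScript dominantScript(String s) {
        Map<UnicodeScript, Integer> counts = new EnumMap<>(UnicodeScript.class);
        // Iterate by code point so supplementary characters are counted once.
        s.codePoints().forEach(cp -> {
            UnicodeScript script = UnicodeScript.of(cp);
            // Assumption: ignore scripts shared by punctuation, digits and marks.
            if (script != UnicodeScript.COMMON
                    && script != UnicodeScript.INHERITED
                    && script != UnicodeScript.UNKNOWN) {
                counts.merge(script, 1, Integer::sum);
            }
        });
        UnicodeScript best = null;
        int bestCount = 0;
        for (Map.Entry<UnicodeScript, Integer> e : counts.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }
}
```

With this sketch, a mixed string containing six Cyrillic letters and five Latin letters is classified as CYRILLIC, and a string made only of digits and punctuation yields null.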

formatting numbers(spellout) with icu4j

java,icu

I've tested your code with version 53.1 and got correct Turkish output. I then retested with version 3.4.4 and got English output as you described in your question. Most likely, you are pulling in an older version through a transitive Maven dependency....

libicui18n.so.52: cannot open shared object file

node.js,ubuntu,shared-libraries,docker,icu

As @mscdex has pointed out, libicu was looking for the libicu52 package. Somehow the repository got updated allowing me to pull the new libicu which depends on libicu52 that isn't available in the repository of 12.04, but in 14.04. Since there is no official trusted build of 14.04 in the...

How to convert a Unicode code point to characters in C++ using ICU?

c++,unicode,icu

What you call a Unicode number is typically called a code point. If you want to work with C++ and Unicode strings, ICU offers an icu::UnicodeString class. You can find the documentation here. To create a UnicodeString holding a single character, you can use the constructor that takes a code point...
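
For comparison, the same code-point-to-string conversion can be sketched on the JVM. This is an analogue, not the ICU4C approach the answer describes: it uses the JDK's Character.toChars, and the class and method names (CodePoints, fromCodePoint) are hypothetical.

```java
// Hypothetical helper: JDK analogue of building a string from one code point.
final class CodePoints {
    /**
     * Returns a string holding the single given code point.
     * Code points above U+FFFF become a surrogate pair (two UTF-16 code units).
     * Throws IllegalArgumentException if codePoint is not a valid code point.
     */
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }
}
```

A code point in the Basic Multilingual Plane such as 0x41 yields a one-unit string, while a supplementary code point such as 0x1F600 yields a string of length 2 whose codePointAt(0) is the original value.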

Get localized currency display name from CLDR in PHP

php,unicode,icu,cldr

You may need a library like commerceguys/intl, which has a getName() method.

Elasticsearch ICU plugin - Analyzer not found

maven,plugins,elasticsearch,icu

Okay... After a couple of hours I found the issue. The configuration was not right and ES was not picking up the analyzer correctly. This did it: { "index": { "analysis": { "analyzer": { "ducet_sort": { "tokenizer": "keyword", "filter": [ "icu_collation" ] } } } } } The settings bit...

Update ICU extension within xampp?

php,xampp,icu,intl

To upgrade the existing ICU in your XAMPP installation you'll need to: copy php_intl.dll to your_xamp_folder/php; copy all the icu*.dll files to your_xamp_folder/apache/bin; check that extension=php_intl.dll is enabled in your_xamp_folder/php/php.ini; restart Apache. Let me know if it works (I'm currently on nginx). Edit: you'll find php_intl.dll here; all icu*.dll files are...