How the system locale can break your Java application's character encoding

Today my colleague ran into a weird character encoding problem in a Java application. The application runs without any problem on a Mac, but shows weird characters when it runs on Ubuntu.

After some investigation, we found that the locale on the Ubuntu machine was set incorrectly. But how can the system locale have any influence on how Java encodes or decodes text? It turns out that some Java APIs (e.g., the InputStreamReader(InputStream) constructor and String.getBytes()) have overloads that take a character encoding. When the encoding is not specified explicitly, they fall back to the JVM’s default charset, which comes from the file.encoding property. That property gets its default value from the JVM locale, and the JVM locale is read from the system locale when the JVM starts up.
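To make the difference concrete, here is a minimal sketch that contrasts the implicit calls with their explicit counterparts. The class name EncodingDemo and the helper decode are made up for illustration. Note that on JDK 18 and later the default charset is UTF-8 regardless of the locale (JEP 400), so the implicit calls only vary on older JVMs.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not taken from the application discussed above.
final class EncodingDemo {

    // Decodes bytes through an InputStreamReader with an explicitly named charset,
    // so the result does not depend on the locale the JVM was started under.
    static String decode(byte[] bytes, Charset charset) {
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(bytes), charset)) {
            StringBuilder text = new StringBuilder();
            char[] buffer = new char[256];
            int read;
            while ((read = reader.read(buffer)) != -1) {
                text.append(buffer, 0, read);
            }
            return text.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // A German word with two non-ASCII letters, written with Unicode escapes
        // so that the encoding of this source file does not matter.
        String greeting = "Gr\u00fc\u00dfe";

        // Implicit: the byte count depends on the JVM default charset.
        System.out.println("default charset: " + Charset.defaultCharset());
        System.out.println("bytes (default): " + greeting.getBytes().length);

        // Explicit: always UTF-8, whatever the locale says.
        byte[] utf8 = greeting.getBytes(StandardCharsets.UTF_8);
        System.out.println("bytes (UTF-8):   " + utf8.length);

        // Decoding with the matching charset restores the text; decoding with
        // a different one yields the kind of garbage we saw on Ubuntu.
        System.out.println(decode(utf8, StandardCharsets.UTF_8));
        System.out.println(decode(utf8, StandardCharsets.ISO_8859_1));
    }
}
```

On an older JVM you can reproduce the locale dependence by starting this with different LANG values, or by passing -Dfile.encoding=... on the command line.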

The lesson we learned here is that explicit is better than implicit. Default parameters can be convenient, but they can also set your hair on fire.

Some design considerations for Weblicht Batch Processing

Recently I have been working on building a batch processing system for Weblicht.
Weblicht is an orchestration tool for running a pipeline of linguistic tools.

The problem with the current Weblicht is that it can only process relatively small files.
All of the linguistic tools are wrapped as RESTful web services.
Weblicht calls the services sequentially, following a pipeline that the user constructs in the GUI.

There are some limitations in this architecture.

  1. A call to one of the web services in the pipeline may fail. In that case, the results of the services that have already been called are discarded as well, even though they might be of interest to the user.
  2. All of the services run sequentially. Downstream services waste a lot of time waiting for upstream services to process the whole file. A better solution might be to process a large file part by part: as soon as a service has processed one part successfully, it returns the result to Weblicht, and the next service can fetch that data from Weblicht and start processing. In the meantime, the previous service works on the remaining parts. This can increase the throughput of the system.
  3. There are other limitations as well: the browser session can time out before processing has finished, and users need to keep their browser open. (This is solved by WaaS, Weblicht as a web service: users can send a chain and files to WaaS. However, this requires some minimal programming skills, which our target users may not have.)
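The part-by-part idea from the second point can be sketched roughly as follows. This is only an illustration under simplifying assumptions, not how Weblicht works: each service is modelled as a plain in-process function on strings instead of a remote web service, and the names ChunkPipeline, stage and run are invented for this example. Every stage runs in its own thread and passes finished parts on through a queue, so a downstream stage can start as soon as the first part is ready.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.UnaryOperator;

// Hypothetical sketch: services are modelled as local functions, not web services.
final class ChunkPipeline {

    // End-of-stream marker. It is compared by identity, so a real chunk
    // with the same text is never mistaken for it.
    private static final String END = new String("END");

    // Starts a worker thread that applies one service to every chunk from `in`
    // and forwards each result to `out` as soon as it is ready.
    static Thread stage(UnaryOperator<String> service,
                        BlockingQueue<String> in, BlockingQueue<String> out) {
        Thread worker = new Thread(() -> {
            try {
                try {
                    for (String chunk = in.take(); chunk != END; chunk = in.take()) {
                        out.put(service.apply(chunk));
                    }
                } finally {
                    // Always signal end-of-stream, even if the service throws,
                    // so downstream stages and the caller do not wait forever.
                    out.put(END);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        worker.start();
        return worker;
    }

    // Feeds the chunks through all services in order and collects the results.
    static List<String> run(List<String> chunks, List<UnaryOperator<String>> services) {
        BlockingQueue<String> first = new LinkedBlockingQueue<>();
        BlockingQueue<String> last = first;
        for (UnaryOperator<String> service : services) {
            BlockingQueue<String> next = new LinkedBlockingQueue<>();
            stage(service, last, next);
            last = next;
        }
        try {
            for (String chunk : chunks) {
                first.put(chunk);
            }
            first.put(END);
            List<String> results = new ArrayList<>();
            for (String result = last.take(); result != END; result = last.take()) {
                results.add(result);
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while processing chunks", e);
        }
    }

    public static void main(String[] args) {
        List<UnaryOperator<String>> services = new ArrayList<>();
        services.add(part -> "[" + part + "]"); // stand-in for a first service
        services.add(part -> part + "!");       // stand-in for a second service
        List<String> parts = List.of("part-1", "part-2", "part-3");
        System.out.println(run(parts, services));
    }
}
```

Because each stage is a single thread reading from a FIFO queue, the parts come out in the order they went in. If a service fails, the parts that were already finished are still returned, which also touches on the first point.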

To address these issues, I have been working on a new browser-based application.


Event driven task processing

Data storage and access

Files in webba need to be stored properly. This includes files uploaded directly by users as well as intermediate results.
A user should only have access to the files they uploaded and to the intermediate result files generated by their own tasks.
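One way to express this rule in code is to record an owner for every stored file and check it on each access. The sketch below is purely illustrative and keeps everything in memory. The class FileAccess, its methods and the record fields are assumptions made for this example, not part of webba.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical in-memory registry; a real system would use persistent storage.
final class FileAccess {

    // Metadata kept per file: who owns it and whether it is an intermediate result.
    static final class StoredFile {
        final String id;
        final String ownerId;
        final boolean intermediate;

        StoredFile(String id, String ownerId, boolean intermediate) {
            this.id = id;
            this.ownerId = ownerId;
            this.intermediate = intermediate;
        }
    }

    private final Map<String, StoredFile> files = new ConcurrentHashMap<>();

    // Registers an uploaded file, or an intermediate result produced by a task
    // of the given user. Intermediate results inherit the task owner.
    void register(String fileId, String ownerId, boolean intermediate) {
        files.put(fileId, new StoredFile(fileId, ownerId, intermediate));
    }

    // Returns the file only if it exists and belongs to the requesting user.
    // Missing files and files of other users look the same to the caller.
    Optional<StoredFile> open(String fileId, String userId) {
        return Optional.ofNullable(files.get(fileId))
                .filter(file -> file.ownerId.equals(userId));
    }

    public static void main(String[] args) {
        FileAccess store = new FileAccess();
        store.register("upload-1", "alice", false);
        store.register("result-1", "alice", true);
        System.out.println("alice: " + store.open("result-1", "alice").isPresent());
        System.out.println("bob:   " + store.open("result-1", "bob").isPresent());
    }
}
```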

CRUD application and the front-end

Mac-like Umlaut Composition on Arch Linux

From time to time I need to type German with an English keyboard layout.
On a Mac, I can simply type option + u to get the umlaut, then type a, u, or o to get ä, ü, or ö respectively.

I would like to have similar keystrokes on Arch Linux.

It turns out to be rather simple. Suppose you have the US keyboard layout as the default. Then run

setxkbmap us mac

You will get almost the same behavior as on the Mac, except that the option key is now AltGr (the right alt key).

If you want the left alt key to function as the option key, just add

include "level3(lalt_switch)"

next to the existing line

include "level3(ralt_switch)"

in the mac section of the file /usr/share/X11/xkb/symbols/us