Getting into the code of a North Korean piece of software in order to extract its data: using a disassembler and a debugger to understand how the software handles and processes that data, before replicating the code.
1. The Data
A few months ago, a North Korean colleague was kind enough to share with me a very interesting piece of software. The Mirae (future) electronic dictionary (미래 전자사전) is a C++ program released in 2007. It ships with a selection of citations and literary works by both Kim Il-Song and Kim Jong-il, the full Immortal History (불멸의 력사) and Immortal Guidance (불멸의 향도) collections of novels about the two leaders’ lives and achievements, works of foreign literature in translation, classical Korean texts, several volumes of literary theory, a style guide and three different dictionaries: the great dictionary of the Korean language (조선말대사전), the great dictionary of literature (문학대사전) and the great dictionary of common sense (상식대사전). It can be seen as a precursor to some of the software that now equips Samjiyon tablets and more modern electronic devices. The Mirae dictionary, however, has a stronger focus on language and literature: there are over 1500 literary works overall, many of which have long been out of print in North Korea and are hard or impossible to come by. As such, it is an invaluable resource for scholars of the country’s literature.
The interface’s design is nice if a bit retro, and it’s even got some background music. Unfortunately, it is very impractical for data extraction. Copy/pasting is disabled, so citing any of the works available in the dictionary means copying the passage out by hand. It would also be kind of nice to be able to run some statistics, just to see, for example, the exact number of foreign works translated and the origin of their authors.
The interface itself is quite light at 632 KB, and all of the data seems to be stored externally in the files of a “Data” folder.
Some of these files are quite big, at over 500 MB. The *.big files in the Images folder must contain the various illustrations used either by the UI or by the different articles, while the *.str files, which represent the bulk of the data, must correspond to the literary texts. Judging from the interface, they are most likely formatted in DOC or an HWP-like format and queried on the fly when the user clicks on a new document. However, how exactly this data is stored in the files is unclear. The files’ extensions provide no indication of the file format or of the software that might have been used, so we will have to figure it out by ourselves. A quick look at how the files are structured suggests that they are not encrypted, as some of them still contain plaintext messages, and probably not compressed either, judging from the long runs of repeating bytes that a compression algorithm would have squeezed out.
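The quick look described above can be made a little more systematic. The following is only a minimal sketch in Python, not the method actually used here: it measures the byte entropy of a sample of a file and the longest run of one repeated byte. The function names and the one-megabyte sample size are assumptions made for illustration. Values close to 8 bits per byte would point towards compression or encryption, while low values and long runs point towards raw data.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Entropy in bits per byte; values near 8.0 suggest compressed or encrypted data."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def longest_run(data: bytes) -> int:
    """Length of the longest run of a single repeated byte value."""
    best = 0
    current = 0
    previous = None
    for byte in data:
        current = current + 1 if byte == previous else 1
        previous = byte
        best = max(best, current)
    return best


def inspect_file(path: str, sample_size: int = 1 << 20) -> dict:
    """Hypothetical helper: read the first `sample_size` bytes and summarise them."""
    with open(path, "rb") as handle:
        sample = handle.read(sample_size)
    return {"entropy": shannon_entropy(sample), "longest_run": longest_run(sample)}
```

Only the beginning of the file is sampled, so a header followed by a differently encoded body could give a misleading picture; sampling several offsets would be more robust.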
2. Analyzing the code
To try to figure out how exactly the files are interpreted and rendered by the software, we’ll have to dig a little bit deeper and look at the different operations its code performs. By following them, we might be able to emulate them to extract data and save it in an open format.
To do so, we can start by disassembling the program, which will give us a static look at its code in assembly. By checking the strings embedded in the executable, we can easily find references to the various files stored in the Data folder.
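The string check mentioned above can be approximated outside the disassembler. This is a rough sketch, assuming the paths are stored as plain ASCII in the executable (they could just as well be stored in UTF-16 or a Korean code page, which this would miss). The function name and the default search term are hypothetical.

```python
import re
from typing import List, Tuple


def find_path_strings(blob: bytes, needle: bytes = b"Data",
                      min_len: int = 4) -> List[Tuple[int, str]]:
    """Return (file offset, text) for printable ASCII runs that contain `needle`."""
    # A run of at least `min_len` printable ASCII characters, like the `strings` tool.
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [
        (match.start(), match.group().decode("ascii"))
        for match in pattern.finditer(blob)
        if needle in match.group()
    ]
```

Reading the executable with `open(path, "rb").read()` and passing the result to `find_path_strings` would list candidate paths along with their file offsets, which can then be looked up in the disassembler.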
The disassembler will then allow us to find the various parts of the code where these strings are used. In other words, it will give us the locations where the code uses the strings containing the paths to these files in order to open them and most likely extract or at least map their content. To find out what the program actually does with these files, we should therefore start “following” the code from the points where these strings are passed to a function call, most likely one of the few Windows APIs intended for reading files. Following the code inside the disassembler without any idea of what the different memory addresses contain will not be very informative, though. We’ll need a debugger to follow dynamically exactly what the code does. Using OllyDbg, we’ll set breakpoints at the various locations where the software makes use of the files’ strings.
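As a rough illustration of what the disassembler does when it lists the places where a string is used, the sketch below scans raw code bytes for the string's address. It rests on several assumptions: a 32-bit x86 executable, strings referenced by their absolute virtual address as a little-endian 4-byte immediate, and an address (`string_va`) that has already been read off the disassembler. The function name is hypothetical, and a real cross-reference analysis is far more thorough than this.

```python
import struct
from typing import List, Tuple

PUSH_IMM32 = 0x68  # x86 opcode for `push imm32`, a common way to pass a string pointer


def find_address_references(code: bytes, string_va: int) -> List[Tuple[int, bool]]:
    """Return (offset, preceded_by_push) for each occurrence of `string_va` in `code`."""
    needle = struct.pack("<I", string_va)  # 32-bit little-endian encoding of the address
    hits = []
    position = code.find(needle)
    while position != -1:
        is_push = position > 0 and code[position - 1] == PUSH_IMM32
        hits.append((position, is_push))
        position = code.find(needle, position + 1)
    return hits
```

Each hit is only a candidate: the same four bytes can occur by coincidence in unrelated data, which is one more reason to confirm the locations with breakpoints in the debugger.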
To be continued…