Document Character Converter
DocCharConvert: Introduction
Document Character Converter is a tool to convert between different encoding formats and different scripts.
It currently supports the following types of converters:
It currently allows conversion of the following file types:
- Plain text files
- OpenDocument files (the native OpenOffice format, see www.OpenOffice.org)
- TeX Files - limited support
DocCharConvert: Installation and Configuration
Prerequisites
You must have Java 1.6 or later installed. You can get it from the Sun website - you only need the JRE.
For conversion of OpenOffice documents, it is recommended to have OpenOffice installed from OpenOffice.org.
Starting DocCharConvert
If the installation went smoothly, then you should be able to just double click on the DocCharConvert Desktop icon or start it from the Programs Menu. The DocCharConvert Main Form should appear.
If this does not work, you can start it from a command line. If you are already in the DocCharConvert installation directory, then you may be able to just type:
eclipse
If you need to specify a different Java virtual machine you can use something like:
eclipse -jvm "C:\Program Files\Java\jre1.6.0_06\bin\java.exe"
You will need to adjust the paths as appropriate to your system.
Configuration
Choose Preferences from the Window menu.
You can change the Converter directory by browsing to the directory containing the converters and selecting one of the dccx files.
The built in converters will be in directories under plugins in the installation directory.
DocCharConvert: Getting Started
Converting some files
Open the Document Conversion Wizard from the File Menu or by clicking on the wizard icon.
- You can either convert existing files or type / paste the text directly.
- If you choose to convert existing files, then you need to select the File Mode, usually "Plain Text" or "OpenOffice".
- Click Next
- Select the converter that you want to use and click Next. (The font choices are only used with OpenOffice files)
- If you chose to convert existing files, then you need to choose which files to convert:
  - Browse to the file that you want to convert
  - Browse to the file where you want to save the conversion result
  - Add more files if necessary and click Next
- Set the encoding of the Input and Output files. If you are not sure what to use, Windows-1252 is a common encoding used by older fonts. More modern Unicode files probably use UTF-8 or UTF-16.
- Click Finish
If you want to convert the same set of files lots of times, you may want to save the list of files in a text file.
You can do this by clicking the Save List button.
You can then reload it with Load List.
The list is saved in a simple text file format with one file pair per line:
"input file.txt" "output file.txt"
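If you would rather generate such a list with a script than build it in the wizard, the following Python sketch shows one way to do it. It assumes the format is exactly what the example line shows: two double-quoted paths separated by whitespace. The helper names format_file_list and parse_file_list are hypothetical and are not part of DocCharConvert.

```python
import re

# Assumed line format, taken from the example line in this manual:
# two double-quoted paths separated by whitespace.
PAIR_PATTERN = re.compile(r'^"([^"]*)"\s+"([^"]*)"$')


def format_file_list(pairs):
    """Return list-file text with one quoted input/output pair per line."""
    return "".join(f'"{src}" "{dst}"\n' for src, dst in pairs)


def parse_file_list(text):
    """Parse list-file text back into (input, output) tuples."""
    pairs = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        match = PAIR_PATTERN.match(line)
        if match is None:
            raise ValueError(f"not a quoted file pair: {line!r}")
        pairs.append(match.groups())
    return pairs


if __name__ == "__main__":
    listing = format_file_list([("input file.txt", "output file.txt")])
    print(listing, end="")
```

Quoting both paths is what allows file names to contain spaces. If your file names contain double quotes, this sketch will reject them.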
File Encoding
The file encoding refers to the standard that is used to map the raw bytes in a file into specific characters in a font.
For correct results, you need to know what format the files are in that you want to convert.
OpenOffice supports many document formats, so that may be the best choice for any non-text files.
You will need to open the files in OpenOffice and save them in OpenDocument format before conversion.
If you are using text files, then you need to know the encoding of the text in the files.
The default on many versions of Windows is windows-1252; however, UTF-8 or another Unicode format will probably be necessary for non-Latin languages.
If you are converting from text in an old legacy font to Unicode, then you will probably want Windows-1252 for Input and UTF-8 for Output.
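To see why the declared encoding matters, the short Python sketch below decodes the same bytes under two different encodings. It only illustrates the byte-to-character step described above; it does not use DocCharConvert, and the sample bytes are made up for the example.

```python
# The same bytes mean different things under different encodings.
legacy_bytes = b"caf\xe9"  # the word "café" as stored in Windows-1252

# Declaring the correct input encoding recovers the intended text.
text = legacy_bytes.decode("windows-1252")

# Declaring UTF-8 for the same bytes fails: in UTF-8 the byte 0xE9 must be
# followed by continuation bytes, and none are present here.
try:
    legacy_bytes.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Writing the text out again as UTF-8 produces different bytes on disk.
utf8_bytes = text.encode("utf-8")

if __name__ == "__main__":
    print(text, utf8_ok, utf8_bytes)
```

With a legacy font the decoded characters are still not the final text. They are only the code points that the converter then maps to the correct Unicode characters.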
DocCharConvert: Command Line
Command Line Usage
If you are running lots of conversions on a regular basis you may want to use
a command line version of the tool.
Make sure that the correct version of Java is in your path. Change directory
to the DocCharConvert directory which has plugins as a subdirectory. You can then run the
command:
java -cp plugins/org.thanlwinsoft.doccharconvert_1.0.0.jar org.thanlwinsoft.doccharconvert.CommandLine
On Windows you can run the DocCharConvert.bat script from a Windows Command Console.
cd C:\Program Files\ThanLwinSoft.org\DocCharConvert
DocCharConvert.bat
You can see the command line options with the --help option:
Using config dir:C:\Program Files\ThanLwinSoft.org\DocCharConvert
Arguments: [-i iEnc] [-o oEnc] [-r] converter.dccx mode
[-f list]|[inputFile outputFile]
[--converters ConvertersPath]
Modes:
0 Plain Text
1 OpenOffice
2 TeX
3 OpenDocument
Optional Arguments:
--help display this help
-r use the converter in reverse mode
-i iEnc = input encoding e.g. -i iso-8859-1 (default UTF-8)
-o oEnc = output encoding e.g. -o iso-8859-1 (default UTF-8)
-f fileList = file containing list input output files
--converters path = change the default Converters dir to path
Please choose from one of the following converters:
Academy.dccx
AcademyExt.dccx
AcademyPipe.dccx
IwinMedium.dccx
Winnwa.dccx
Wwin_burmese.dccx
MyanmarUni4ToUni5.dccx
WinnwaUTN11.dccx
The list of converters may vary according to what is installed on your system.
For example, to convert a text file in WinInnwa to Myanmar Unicode,
you would type something like:
DocCharConvert.bat -i windows-1252 Winnwa.dccx 0 wininnwa.txt myUni.txt
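If you run conversions from a script, it can help to assemble the arguments programmatically. The Python sketch below builds an argument list in the order given by the help text above. The function name build_command and the MODES table are hypothetical helpers written for this example. The sketch only builds the list and does not run anything.

```python
# Mode numbers as listed in the help output above.
MODES = {"Plain Text": 0, "OpenOffice": 1, "TeX": 2, "OpenDocument": 3}


def build_command(converter, mode, input_file, output_file,
                  input_encoding=None, output_encoding=None,
                  reverse=False, launcher="DocCharConvert.bat"):
    """Build the argument list for one input/output file pair."""
    args = [launcher]
    if input_encoding:
        args += ["-i", input_encoding]
    if output_encoding:
        args += ["-o", output_encoding]
    if reverse:
        args.append("-r")
    # The converter and the numeric mode come before the file names.
    args += [converter, str(MODES[mode]), input_file, output_file]
    return args


if __name__ == "__main__":
    cmd = build_command("Winnwa.dccx", "Plain Text",
                        "wininnwa.txt", "myUni.txt",
                        input_encoding="windows-1252")
    print(" ".join(cmd))
```

The resulting list could be passed to something like Python's subprocess.run from the DocCharConvert installation directory. That step is left out here because it depends on how the tool is installed on your system.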
DocCharConvert: Troubleshooting
Garbled Data
Some of the data may be converted correctly while other data is not. This probably
means that the encoding is specified incorrectly for either the Input or
Output files. Check the original source of the data and the documentation for
the specific converter that you are using. Internally, the text is all converted
to Unicode before it is processed. If the encoding that you are using has codes
that cannot be translated to Unicode then these may fail to be converted
correctly. Old files created with legacy pre-Unicode fonts should probably be
converted as Windows-1252 with the Output set to UTF-8.
Some legacy fonts use code points that are undefined in Windows-1252.
In this case you may want to try the RawBytes encoding.
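You can check whether a file is affected before converting it. Windows-1252 leaves five byte values undefined (0x81, 0x8D, 0x8F, 0x90 and 0x9D), and the Python sketch below looks for them. The helper names are hypothetical. The last lines use Latin-1 only as an analogy for an encoding that maps every byte to a character. How RawBytes itself works internally is not described here, so treat that comparison as an assumption.

```python
# Byte values that Windows-1252 leaves undefined.
UNDEFINED_IN_CP1252 = bytes([0x81, 0x8D, 0x8F, 0x90, 0x9D])


def decodes_as_cp1252(data):
    """Return True if every byte in data is defined in Windows-1252."""
    try:
        data.decode("cp1252")
        return True
    except UnicodeDecodeError:
        return False


def find_undefined_bytes(data):
    """Return the sorted, distinct byte values that Windows-1252 lacks."""
    return sorted({b for b in data if b in UNDEFINED_IN_CP1252})


if __name__ == "__main__":
    sample = b"a\x90b\x81\x81"  # made-up legacy data
    print(decodes_as_cp1252(sample))
    print([hex(b) for b in find_undefined_bytes(sample)])
    # Latin-1 maps each byte to the code point with the same value,
    # so no byte is lost when decoding.
    print([hex(ord(c)) for c in sample.decode("latin-1")])
```

If find_undefined_bytes returns anything for your input file, Windows-1252 cannot represent all of its bytes and a raw byte encoding is worth trying.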
Other Problems and Bugs
For other problems and bugs please send an email to
develNO JUNK@thanlwinsoft.org.
Please try to be as explicit as possible in describing your problem.
If it is a case of incorrect conversion, then it is very hard to diagnose
the problem unless I can reproduce it.
If possible, please send some example files (though not too big please!) that
illustrate the problem.