Myanmar Searching

Using Nutch with Myanmar

Nutch uses Lucene to index and search documents on the Web. Normally, it tokenizes text based on whitespace. However, it is easy to specify language-specific tokenizers. An implementation of a Myanmar tokenizer and its associated files is available below.

Instructions

Download Nutch and Lucene from Apache, then download the Myanmar-specific files that the script below expects: the nutch-analysis-my, lucene-analysis-my and myNgram zip archives.

These instructions assume that you have a suitable Java JDK (1.4 or higher) and ant installed. Create a directory to contain the files, e.g. /opt/nutch, and place the downloaded files in it, then run this script:

#!/bin/bash
## This script assumes that it is run as a user who has write permission in /opt/nutch
## Update these variables to the versions that you are using
export BASE_DIR=/opt/nutch
export LUCENE=lucene-1.9.1
export ORIG_LUCENE_VER=1.9-rc1
export NEW_LUCENE_VER=1.9.2
export NUTCH=nutch-0.8.1
export MY_NUTCH_DATE=20070302
export CATALINA_HOME=/opt/apache-tomcat-6.0.10

missing=0
if ! (test -f $BASE_DIR/$LUCENE-src.tar.gz) then missing=1; fi
if ! (test -f $BASE_DIR/$NUTCH.tar.gz) then missing=1; fi
if ! (test -f $BASE_DIR/nutch-analysis-my-$MY_NUTCH_DATE.zip) then missing=1; fi
if ! (test -f $BASE_DIR/lucene-analysis-my-$MY_NUTCH_DATE.zip) then missing=1; fi
if ! (test -f $BASE_DIR/myNgram-$MY_NUTCH_DATE.zip) then missing=1; fi
## stop if any of the required downloads is missing
if test $missing -ne 0
then
echo WARNING: one or more of these files was not found:
echo $BASE_DIR/$LUCENE-src.tar.gz $BASE_DIR/$NUTCH.tar.gz $BASE_DIR/nutch-analysis-my-$MY_NUTCH_DATE.zip
echo $BASE_DIR/lucene-analysis-my-$MY_NUTCH_DATE.zip $BASE_DIR/myNgram-$MY_NUTCH_DATE.zip
exit 1
fi

cd $BASE_DIR
tar -zxvf $LUCENE-src.tar.gz
cd $LUCENE
ant
cd $BASE_DIR/$LUCENE/contrib/analyzers/src/java/
unzip $BASE_DIR/lucene-analysis-my-$MY_NUTCH_DATE.zip
cd $BASE_DIR/$LUCENE/contrib/analyzers
ant
cd $BASE_DIR/$LUCENE/contrib/miscellaneous
ant
cd $BASE_DIR

tar -zxvf $NUTCH.tar.gz
cd $NUTCH
ant
## use your locally built lucene-analyzers
cd $BASE_DIR/$NUTCH/src/plugin/lib-lucene-analyzers
cp plugin.xml plugin.xml.orig
## update the version to the one that you are using
sed "s/$ORIG_LUCENE_VER/$NEW_LUCENE_VER/g" plugin.xml.orig > plugin.xml
cd lib
ln -s ../../../../../$LUCENE/build/contrib/analyzers/lucene-analyzers-$NEW_LUCENE_VER-dev.jar
## build the myanmar nutch analysis wrapper
cd $BASE_DIR/$NUTCH/src/plugin
unzip $BASE_DIR/nutch-analysis-my-$MY_NUTCH_DATE.zip
cd analysis-my
ant
cd ../
## You may want to create your own ngrams
## e.g. if you have a large UTF-8 encoded Myanmar text file called myanmar.txt in your base directory
# cd $BASE_DIR/$NUTCH/src/plugin/languageidentifier/src/java/org/apache/nutch/analysis/lang/
# java -cp $BASE_DIR/$NUTCH/build/plugins/language-identifier/language-identifier.jar:$BASE_DIR/$NUTCH/lib/commons-logging-1.0.4.jar  org.apache.nutch.analysis.lang.NGramProfile -create my $BASE_DIR/myanmar.txt  UTF-8
## otherwise you can just use this one:
unzip $BASE_DIR/myNgram-$MY_NUTCH_DATE.zip
cd languageidentifier
ant
cd $BASE_DIR/$NUTCH
mkdir oldlib
mv lib/lucene-core*.jar oldlib
mv lib/lucene-misc*.jar oldlib
cd $BASE_DIR/$NUTCH/lib
ln -s ../../$LUCENE/build/lucene-core-$NEW_LUCENE_VER-dev.jar
ln -s ../../$LUCENE/build/contrib/misc/lucene-misc-$NEW_LUCENE_VER-dev.jar
cd $BASE_DIR/$NUTCH
## get rid of the original versions of the plugins that we have rebuilt
## so they don't get picked up on the class path accidentally
rm -rf plugins/lib-lucene-analyzers
rm -rf plugins/language-identifier

echo Edit nutch-site.xml as appropriate
echo Make sure that the plugin.includes property value includes 
echo language-identifier\|analysis-my
gedit conf/nutch-site.xml
echo Edit src/web/include/style.html to include a suitable Myanmar font
echo e.g. add a style: "
* {
  font-family: Padauk, Myanmar3, Arial, Helvetica, sans-serif;
  line-height: 1.5em; 
}"
gedit src/web/include/style.html
## build the war file with the configuration data
ant war
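## webapps is a directory, so test with -d; the war file is under $BASE_DIR/$NUTCH/build
if test -d $CATALINA_HOME/webapps
then 
    cp $BASE_DIR/$NUTCH/build/$NUTCH.war $CATALINA_HOME/webapps
else
echo now copy $BASE_DIR/$NUTCH/build/$NUTCH.war into your Tomcat catalina webapps directory
fi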
echo Follow the instructions at http://lucene.apache.org/nutch/tutorial8.html
echo to build an index e.g. in /opt/nutch/crawl and then test searching.

When this script runs, you will be asked to edit a couple of files. You will need to update your nutch-0.8.1/conf/nutch-site.xml to specify the language plugins and your crawl directory as in the example below. The searcher.dir property must match the directory that you pass in the -dir option to bin/nutch crawl, as described in the Nutch Tutorial.

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|language-identifier|analysis-(my)</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins.
  </description>
</property>
<property>
    <name>searcher.dir</name>
    <value>/opt/nutch/crawl</value>
</property>
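
Since searcher.dir has to agree with the crawl directory, a quick check before crawling can catch a mismatch. The following is only a sketch: the function names get_searcher_dir and check_crawl_dir are hypothetical helpers, and the sed expression assumes that the <name> and <value> elements sit on adjacent lines as in the example above.

```shell
#!/bin/bash
# Sketch only: helper names are hypothetical, not part of Nutch.

# Print the searcher.dir value from a nutch-site.xml file.
# Assumes <value> is on the line directly after <name>searcher.dir</name>.
get_searcher_dir() {
    sed -n '/<name>searcher\.dir<\/name>/{n;s/.*<value>\(.*\)<\/value>.*/\1/p;}' "$1"
}

# Compare the configured searcher.dir ($1 = nutch-site.xml)
# with the directory you intend to pass to -dir ($2).
check_crawl_dir() {
    local configured
    configured=$(get_searcher_dir "$1")
    if [ "$configured" = "$2" ]; then
        echo "OK: searcher.dir matches $2"
    else
        echo "MISMATCH: searcher.dir is '$configured' but crawl dir is '$2'"
        return 1
    fi
}

# Possible usage from $BASE_DIR/$NUTCH (the crawl arguments other than -dir
# are illustrative; see the Nutch Tutorial for the options you need):
# check_crawl_dir conf/nutch-site.xml /opt/nutch/crawl &&
#     bin/nutch crawl urls -dir /opt/nutch/crawl -depth 3
```
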

You will probably want to edit the styles in nutch-0.8.1/src/web/include/style.html e.g.

/* Default to a Myanmar font compliant with the latest Unicode proposal */
* {
  font-family: PadaukOT, Myanmar2, Padauk, Arial, Helvetica, sans-serif;
  line-height: 1.5em;
  font-size: 12px;
}
/* Underline doesn't look great with Myanmar, so use a bottom border instead */
a {
  text-decoration: none;
  border-bottom-style: solid;
  border-bottom-width: 1px;
}

The Apache Tomcat $CATALINA_HOME/conf/server.xml file needs to be modified to support UTF-8 queries. Make sure that the HTTP connector has the URIEncoding attribute set, e.g.

<Connector port="8080" protocol="HTTP/1.1" 
               maxThreads="150" connectionTimeout="20000" 
               redirectPort="8443" 
               URIEncoding="UTF-8" />
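
A rough way to confirm the setting is to search server.xml for the attribute. This is a sketch under a simplifying assumption: the hypothetical helper has_utf8_uri_encoding only checks that the attribute text appears somewhere in the file, not that it is on the HTTP connector you are actually using.

```shell
#!/bin/bash
# Sketch only: has_utf8_uri_encoding is a hypothetical helper.
# It succeeds if URIEncoding="UTF-8" appears anywhere in the given file;
# it does not parse the XML or check which Connector carries the attribute.
has_utf8_uri_encoding() {
    grep -q 'URIEncoding="UTF-8"' "$1"
}

# Possible usage:
# if has_utf8_uri_encoding "$CATALINA_HOME/conf/server.xml"; then
#     echo "URIEncoding is set"
# else
#     echo "Add URIEncoding=\"UTF-8\" to the HTTP connector in server.xml"
# fi
```
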

You are now ready to follow the instructions in the Nutch Tutorial to create an index and then test searching using the Tomcat interface.
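
When testing from the command line rather than a browser, a Myanmar query has to be sent as percent-encoded UTF-8 bytes, which is what the URIEncoding setting tells Tomcat to expect. The sketch below uses a hypothetical helper, percent_encode, that encodes every byte (including plain ASCII, which is more than strictly necessary but still valid). The search URL in the usage comment is an assumption based on the war file name and the port in the connector example; adjust it to your deployment.

```shell
#!/bin/bash
# Sketch only: percent_encode is a hypothetical helper.
# It writes each byte of its argument as %XX, so U+1000 (UTF-8 bytes
# E1 80 80) becomes %E1%80%80.
percent_encode() {
    printf '%s' "$1" | od -An -v -tx1 | tr -d ' \n' | tr 'a-f' 'A-F' | sed 's/../%&/g'
}

# Possible usage (URL path and parameter name are assumptions):
# query=$(percent_encode "your Myanmar search term")
# curl "http://localhost:8080/nutch-0.8.1/search.jsp?query=$query"
```
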
