Nutch uses Lucene to index and search documents on the Web. Normally, it tokenizes text based on whitespace. However, it is easy to specify language-specific tokenizers. An implementation of a Myanmar tokenizer and its associated files is available below.
Download Nutch and Lucene from Apache and then download the following Myanmar-specific files.
These instructions assume that you have a suitable Java JDK (1.4 or higher) and ant installed. Create a directory to contain the files, e.g. /opt/nutch, place the downloaded files in it, then run this script:
#!/bin/bash
## This script assumes that it is run as a user who has write permission in /opt/nutch
## Update these variables to the versions that you are using
export BASE_DIR=/opt/nutch
export LUCENE=lucene-1.9.1
export ORIG_LUCENE_VER=1.9-rc1
export NEW_LUCENE_VER=1.9.2
export NUTCH=nutch-0.8.1
export MY_NUTCH_DATE=20070302
export CATALINA_HOME=/opt/apache-tomcat-6.0.10
missing=0
if ! (test -f $BASE_DIR/$LUCENE-src.tar.gz) then missing=1; fi
if ! (test -f $BASE_DIR/$NUTCH.tar.gz) then missing=1; fi
if ! (test -f $BASE_DIR/nutch-analysis-my-$MY_NUTCH_DATE.zip) then missing=1; fi
if ! (test -f $BASE_DIR/lucene-analysis-my-$MY_NUTCH_DATE.zip) then missing=1; fi
if ! (test -f $BASE_DIR/myNgram-$MY_NUTCH_DATE.zip) then missing=1; fi
## stop if any of the downloads is missing
if test $missing -ne 0
then
echo WARNING: one or more of $BASE_DIR/$LUCENE-src.tar.gz $BASE_DIR/$NUTCH.tar.gz $BASE_DIR/nutch-analysis-my-$MY_NUTCH_DATE.zip
echo $BASE_DIR/lucene-analysis-my-$MY_NUTCH_DATE.zip $BASE_DIR/myNgram-$MY_NUTCH_DATE.zip not found.
exit 1
fi
cd $BASE_DIR
tar -zxvf $LUCENE-src.tar.gz
cd $LUCENE
ant
cd $BASE_DIR/$LUCENE/contrib/analyzers/src/java/
unzip $BASE_DIR/lucene-analysis-my-$MY_NUTCH_DATE.zip
cd $BASE_DIR/$LUCENE/contrib/analyzers
ant
cd $BASE_DIR/$LUCENE/contrib/miscellaneous
ant
cd $BASE_DIR
tar -zxvf $NUTCH.tar.gz
cd $NUTCH
ant
## use your locally built lucene-analyzers
cd $BASE_DIR/$NUTCH/src/plugin/lib-lucene-analyzers
cp plugin.xml plugin.xml.orig
## update the version to the one that you are using
sed "s/$ORIG_LUCENE_VER/$NEW_LUCENE_VER/g" plugin.xml.orig > plugin.xml
cd lib
ln -s ../../../../../$LUCENE/build/contrib/analyzers/lucene-analyzers-$NEW_LUCENE_VER-dev.jar
## build the myanmar nutch analysis wrapper
cd $BASE_DIR/$NUTCH/src/plugin
unzip $BASE_DIR/nutch-analysis-my-$MY_NUTCH_DATE.zip
cd analysis-my
ant
cd ../
## You may want to create your own ngrams
## e.g. if you have a large UTF-8 encoded Myanmar text file called myanmar.txt in your base directory
# cd $BASE_DIR/$NUTCH/src/plugin/languageidentifier/src/java/org/apache/nutch/analysis/lang/
# java -cp $BASE_DIR/$NUTCH/build/plugins/language-identifier/language-identifier.jar:$BASE_DIR/$NUTCH/lib/commons-logging-1.0.4.jar org.apache.nutch.analysis.lang.NGramProfile -create my $BASE_DIR/myanmar.txt UTF-8
## otherwise you can just use this one:
unzip $BASE_DIR/myNgram-$MY_NUTCH_DATE.zip
cd languageidentifier
ant
cd $BASE_DIR/$NUTCH
mkdir oldlib
mv lib/lucene-core*.jar oldlib
mv lib/lucene-misc*.jar oldlib
cd $BASE_DIR/$NUTCH/lib
ln -s ../../$LUCENE/build/lucene-core-$NEW_LUCENE_VER-dev.jar
ln -s ../../$LUCENE/build/contrib/misc/lucene-misc-$NEW_LUCENE_VER-dev.jar
cd $BASE_DIR/$NUTCH
## get rid of the original versions of the plugins that we have rebuilt
## so they don't get picked up on the class path accidentally
rm -rf plugins/lib-lucene-analyzers
rm -rf plugins/language-identifier
echo Edit nutch-site.xml as appropriate
echo Make sure that the plugin.includes property value includes
echo language-identifier\|analysis-my
gedit conf/nutch-site.xml
echo Edit src/web/include/style.html to include a suitable Myanmar font
echo e.g. add a style: "
* {
font-family: Padauk, Myanmar3, Arial, Helvetica, sans-serif;
line-height: 1.5em;
}"
gedit src/web/include/style.html
## build the war file with the configuration data
ant war
## webapps is a directory, so test with -d rather than -f
if test -d $CATALINA_HOME/webapps
then
cp $BASE_DIR/$NUTCH/build/$NUTCH.war $CATALINA_HOME/webapps
else
echo now copy $BASE_DIR/$NUTCH/build/$NUTCH.war into your Tomcat catalina webapps directory
fi
echo Follow the instructions at http://lucene.apache.org/nutch/tutorial8.html
echo to build an index e.g. in /opt/nutch/crawl and then test searching.
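The installation script above calls java, ant, tar and unzip without checking that they are installed. The following is a minimal sketch of a check you could run first. The function name check_tools is hypothetical and is not part of the downloads.

```shell
#!/bin/bash
# Hypothetical helper: report every named command that is not on the PATH.
# Prints "missing: NAME" for each one and returns non-zero if any is missing.
check_tools() {
    local missing=0
    local tool
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            missing=1
        fi
    done
    return $missing
}
```

For example, check_tools java ant tar unzip prints nothing and succeeds when all four commands are available.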
When the installation script runs, you will be asked to edit a couple of files. You will need to update your nutch-0.8.1/conf/nutch-site.xml to specify the language plugins and your crawl directory, as in the example below. The searcher.dir property must match the directory that you pass with the -dir option to bin/nutch crawl, as described in the Nutch Tutorial.
<property>
<name>plugin.includes</name>
<value>protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|language-identifier|analysis-(my)</value>
<description>Regular expression naming plugin directory names to
include. Any plugin not matching this expression is excluded.
In any case you need at least include the nutch-extensionpoints plugin. By
default Nutch includes crawling just HTML and plain text via HTTP,
and basic indexing and search plugins.
</description>
</property>
<property>
<name>searcher.dir</name>
<value>/opt/nutch/crawl</value>
</property>
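After editing, a quick text check can confirm that both plugins are listed. This is only a sketch: the function name check_plugin_includes is hypothetical, it assumes the value element is on the line directly after the name element (as in the example above), and it recognises only the spellings analysis-my and analysis-(my).

```shell
#!/bin/bash
# Hypothetical helper: succeed only if plugin.includes in the given
# nutch-site.xml mentions language-identifier and the Myanmar analysis plugin.
check_plugin_includes() {
    local site_xml="$1"
    local value
    # Assumption: <value> is on the line right after <name>plugin.includes</name>.
    value=$(grep -A1 '<name>plugin.includes</name>' "$site_xml" | grep '<value>') || value=""
    if ! echo "$value" | grep -q 'language-identifier'; then
        echo "plugin.includes does not list language-identifier"
        return 1
    fi
    # Accept both analysis-my and analysis-(my).
    if ! echo "$value" | grep -Eq 'analysis-(my|\(my\))'; then
        echo "plugin.includes does not list analysis-my"
        return 1
    fi
    return 0
}
```

Run it as check_plugin_includes conf/nutch-site.xml from the Nutch directory. A value that groups several languages, such as analysis-(de|my), is not recognised by this simple pattern.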
You will probably want to edit the styles in nutch-0.8.1/src/web/include/style.html, e.g.
/* Default to a Myanmar font compliant with the latest Unicode proposal */
* {
font-family: PadaukOT, Myanmar2, Padauk, Arial, Helvetica, sans-serif;
line-height: 1.5em;
font-size: 12px;
}
/* Underline doesn't look great with Myanmar, so use a bottom border instead */
a {
text-decoration: none;
border-bottom-style: solid;
border-bottom-width: 1px;
}
The Apache Tomcat $CATALINA_HOME/conf/server.xml file needs to be modified to support UTF-8 queries. Make sure that the HTTP connector has the URIEncoding attribute set, e.g.
<Connector port="8080" protocol="HTTP/1.1"
maxThreads="150" connectionTimeout="20000"
redirectPort="8443"
URIEncoding="UTF-8" />
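To confirm that the attribute is present after editing, a plain text search is usually enough. The sketch below uses a hypothetical function name, has_utf8_uri_encoding, and cannot tell which Connector element carries the attribute, so inspect the file by hand if you have several connectors.

```shell
#!/bin/bash
# Hypothetical helper: succeed if any line of the given server.xml
# contains URIEncoding="UTF-8". This is a text match, not an XML parse.
has_utf8_uri_encoding() {
    grep -q 'URIEncoding="UTF-8"' "$1"
}
```

For example, has_utf8_uri_encoding $CATALINA_HOME/conf/server.xml || echo URIEncoding is not set.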
You are now ready to follow the instructions in the Nutch Tutorial to create an index and then test searching using the Tomcat interface.
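Because searcher.dir must match the directory given to -dir, one way to avoid a mismatch is to read the value back out of nutch-site.xml. This is a sketch only: the function name searcher_dir is hypothetical, and it assumes the value element is on the line directly after the name element, as in the example configuration.

```shell
#!/bin/bash
# Hypothetical helper: print the searcher.dir value from a nutch-site.xml.
# Assumption: <value> is on the line right after <name>searcher.dir</name>.
searcher_dir() {
    grep -A1 '<name>searcher.dir</name>' "$1" |
        sed -n 's:.*<value>\(.*\)</value>.*:\1:p'
}
```

You could then start a crawl from the Nutch directory with something like bin/nutch crawl urls -dir "$(searcher_dir conf/nutch-site.xml)" -depth 3. The urls directory and the depth value here are placeholders in the style of the Nutch Tutorial's example, so adjust them to your own setup.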