Nutch uses Lucene to index and search documents on the Web. Normally, it tokenizes text based on whitespace, which does not suit Myanmar script, where spaces separate phrases rather than individual words. However, it is easy to specify language-specific tokenizers. An implementation of a Myanmar tokenizer and its associated files is available below.
Download Nutch and Lucene from Apache and then download the following Myanmar-specific files.
These instructions assume that you have a suitable Java JDK (1.4 or higher) and ant installed. Create a directory to contain the files, e.g. /opt/nutch, place the downloaded files in it, then run this script:
#!/bin/bash
## This script assumes that it is run as a user who has write permission in /opt/nutch
## Update these variables to the versions that you are using
export BASE_DIR=/opt/nutch
export LUCENE=lucene-1.9.1
export ORIG_LUCENE_VER=1.9-rc1
export NEW_LUCENE_VER=1.9.2
export NUTCH=nutch-0.8.1
export MY_NUTCH_DATE=20070302
export CATALINA_HOME=/opt/apache-tomcat-6.0.10

## check that all the downloaded files are present
missing=0
if ! test -f $BASE_DIR/$LUCENE-src.tar.gz; then missing=1; fi
if ! test -f $BASE_DIR/$NUTCH.tar.gz; then missing=1; fi
if ! test -f $BASE_DIR/nutch-analysis-my-$MY_NUTCH_DATE.zip; then missing=1; fi
if ! test -f $BASE_DIR/lucene-analysis-my-$MY_NUTCH_DATE.zip; then missing=1; fi
if ! test -f $BASE_DIR/myNgram-$MY_NUTCH_DATE.zip; then missing=1; fi
## compare numerically: a bare "test $missing" is true even when missing=0
if test $missing -eq 1
then
    echo WARNING: $BASE_DIR/$LUCENE-src.tar.gz $BASE_DIR/$NUTCH.tar.gz $BASE_DIR/nutch-analysis-my-$MY_NUTCH_DATE.zip
    echo $BASE_DIR/lucene-analysis-my-$MY_NUTCH_DATE.zip $BASE_DIR/myNgram-$MY_NUTCH_DATE.zip not found.
    exit 1
fi

## build lucene and the contrib packages, including the Myanmar analyzer
cd $BASE_DIR
tar -zxvf $LUCENE-src.tar.gz
cd $LUCENE
ant
cd $BASE_DIR/$LUCENE/contrib/analyzers/src/java/
unzip $BASE_DIR/lucene-analysis-my-$MY_NUTCH_DATE.zip
cd $BASE_DIR/$LUCENE/contrib/analyzers
ant
cd $BASE_DIR/$LUCENE/contrib/miscellaneous
ant

## build nutch
cd $BASE_DIR
tar -zxvf $NUTCH.tar.gz
cd $NUTCH
ant

## use your locally built lucene-analyzers
cd $BASE_DIR/$NUTCH/src/plugin/lib-lucene-analyzers
cp plugin.xml plugin.xml.orig
## update the version to the one that you are using
sed s/$ORIG_LUCENE_VER/$NEW_LUCENE_VER/g plugin.xml.orig > plugin.xml
cd lib
ln -s ../../../../../$LUCENE/build/contrib/analyzers/lucene-analyzers-$NEW_LUCENE_VER-dev.jar

## build the myanmar nutch analysis wrapper
cd $BASE_DIR/$NUTCH/src/plugin
unzip $BASE_DIR/nutch-analysis-my-$MY_NUTCH_DATE.zip
cd analysis-my
ant
cd ../

## You may want to create your own ngrams
## e.g. if you have a large UTF-8 encoded Myanmar text file called myanmar.txt in your base directory
# cd $BASE_DIR/$NUTCH/src/plugin/languageidentifier/src/java/org/apache/nutch/analysis/lang/
# java -cp $BASE_DIR/$NUTCH/build/plugins/language-identifier/language-identifier.jar:$BASE_DIR/$NUTCH/lib/commons-logging-1.0.4.jar org.apache.nutch.analysis.lang.NGramProfile -create my $BASE_DIR/myanmar.txt UTF-8
## otherwise you can just use this one:
unzip $BASE_DIR/myNgram-$MY_NUTCH_DATE.zip
cd languageidentifier
ant

## replace the bundled lucene jars with the locally built ones
cd $BASE_DIR/$NUTCH
mkdir oldlib
mv lib/lucene-core*.jar oldlib
mv lib/lucene-misc*.jar oldlib
cd $BASE_DIR/$NUTCH/lib
ln -s ../../$LUCENE/build/lucene-core-$NEW_LUCENE_VER-dev.jar
ln -s ../../$LUCENE/build/contrib/misc/lucene-misc-$NEW_LUCENE_VER-dev.jar
cd $BASE_DIR/$NUTCH

## get rid of the original versions of the plugins that we have rebuilt
## so they don't get picked up on the class path accidentally
rm -rf plugins/lib-lucene-analyzers
rm -rf plugins/language-identifier

echo Edit nutch-site.xml as appropriate
echo Make sure that the plugin.includes property value includes
echo language-identifier\|analysis-my
gedit conf/nutch-site.xml
echo Edit src/web/include/style.html to include a suitable Myanmar font
echo e.g. add a style: " * { font-family: Padauk, Myanmar3, Arial, Helvetica, sans-serif; line-height: 1.5em; }"
gedit src/web/include/style.html

## build the war file with the configuration data
ant war
## webapps is a directory, so test with -d rather than -f
if test -d $CATALINA_HOME/webapps
then
    cp $BASE_DIR/$NUTCH/build/$NUTCH.war $CATALINA_HOME/webapps
else
    echo now copy $BASE_DIR/$NUTCH/build/$NUTCH.war into your Tomcat catalina webapps directory
fi
echo Follow the instructions at http://lucene.apache.org/nutch/tutorial8.html
echo to build an index e.g. in /opt/nutch/crawl and then test searching.
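The build script above relies on several command-line tools being installed. As a rough sketch, a pre-flight check along the following lines could be run first. The helper name missing_tools and the exact tool list are assumptions made for illustration; they are not part of the original script.

```shell
#!/bin/sh
# Hypothetical helper: print the names of any commands given as arguments
# that cannot be found on the PATH (prints nothing if all are present).
missing_tools() {
    not_found=""
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            not_found="$not_found $tool"
        fi
    done
    # Strip the leading space before printing the list.
    echo "${not_found# }"
}

# Assumed tool list, taken from the commands that the build script calls.
result=$(missing_tools java ant tar unzip sed)
if [ -n "$result" ]; then
    echo "Missing tools: $result"
else
    echo "All required tools found"
fi
```

This only confirms that the commands exist; it does not check the JDK version.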
When the build script above runs, you will be asked to edit a couple of files. You will need to update your nutch-0.8.1/conf/nutch-site.xml to specify the language plugins and your crawl directory, as in the example below. The searcher.dir property must match the directory that you pass in the -dir option to bin/nutch crawl, as described in the Nutch Tutorial.
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|language-identifier|analysis-(my)</value>
  <description>Regular expression naming plugin directory names to
  include. Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins.
  </description>
</property>
<property>
  <name>searcher.dir</name>
  <value>/opt/nutch/crawl</value>
</property>
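Because plugin.includes is a regular expression over plugin directory names, a typo in it can silently exclude a plugin. The sketch below shows one way to check the value after editing. The helper name plugin_included is hypothetical, and it uses grep -E, whose syntax is not identical to Java regular expressions in general, although the two agree for a simple alternation like the one above.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the plugin directory name $1 fully matches
# the plugin.includes regular expression found in the config file $2.
plugin_included() {
    # Join the file into one line, then pull out the <value> that
    # directly follows <name>plugin.includes</name>.
    value=$(tr '\n' ' ' < "$2" |
        sed -n 's:.*<name>plugin\.includes</name> *<value>\([^<]*\)</value>.*:\1:p')
    # -x requires the whole name to match, -q suppresses output.
    [ -n "$value" ] && printf '%s\n' "$1" | grep -Eqx -- "$value"
}

# Assumed location of the config file, relative to the Nutch directory.
conf=conf/nutch-site.xml
if [ -f "$conf" ]; then
    for plugin in language-identifier analysis-my; do
        if plugin_included "$plugin" "$conf"; then
            echo "$plugin: included"
        else
            echo "$plugin: NOT included"
        fi
    done
fi
```

Run it from the nutch-0.8.1 directory. Both plugins should be reported as included before you build the war file.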
You will probably want to edit the styles in nutch-0.8.1/src/web/include/style.html e.g.
/* Default to a Myanmar font compliant to the latest Unicode proposal */
* {
  font-family: PadaukOT, Myanmar2, Padauk, Arial, Helvetica, sans-serif;
  line-height: 1.5em;
  font-size: 12px;
}
/* Underline doesn't look great with Myanmar, so use a bottom border instead */
a {
  text-decoration: none;
  border-bottom-style: solid;
  border-bottom-width: 1px;
}
The Apache Tomcat $CATALINA_HOME/conf/server.xml file needs to be modified to support UTF-8 queries. Make sure that the HTTP connector has the URIEncoding attribute set to UTF-8, e.g.
<Connector port="8080" protocol="HTTP/1.1" maxThreads="150" connectionTimeout="20000" redirectPort="8443" URIEncoding="UTF-8" />
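With URIEncoding set, Tomcat decodes the percent-encoded bytes of a query string as UTF-8. To illustrate what a Myanmar query looks like on the wire, here is a minimal sketch that percent-encodes every byte of a string. The helper name percent_encode is hypothetical, and the search URL (host, port, context path and the query parameter of search.jsp) is an assumption that depends on how the war file was deployed.

```shell
#!/bin/sh
# Hypothetical helper: percent-encode every byte of the argument.
# Encoding ASCII characters too is verbose but still valid in a URL.
percent_encode() {
    printf '%s' "$1" |
        od -An -v -tx1 |          # dump the bytes as hex pairs
        tr -d ' \n' |             # join the pairs into one string
        sed 's/../%&/g' |         # put a % in front of each pair
        tr 'abcdef' 'ABCDEF'      # use upper-case hex digits
}

# Example query: the word "Myanmar" in Myanmar script (UTF-8 source assumed).
query='မြန်မာ'
# Assumed URL layout for a war file deployed as nutch-0.8.1.
echo "http://localhost:8080/nutch-0.8.1/search.jsp?query=$(percent_encode "$query")"
```

Each Myanmar character occupies three bytes in UTF-8, so it appears as three %XX groups in the printed URL. If a search for such a URL returns garbled text, the URIEncoding attribute is the first thing to check.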
You are now ready to follow the instructions in the Nutch Tutorial to create an index and then test searching using the Tomcat interface.