Configuring Lucene Nutch 0.9
From Mavaball
This page describes the notes I took (and trouble I ran into) while trying to install the Nutch Web Search software, based on the Apache Lucene Java project.
These instructions mirror the Nutch 0.8 Tutorial.
System Requirements
I've confirmed this tutorial on a MacBook running Mac OS X 10.5.6 with Xcode installed. The steps should be close to what you would do on most Linux (or Unix) distributions, but since OS X derives from BSD there may be a few quirks.
Installation
Download
Get the distributor's keys and signature:
wget http://www.apache.org/dist/lucene/nutch/KEYS
wget http://www.apache.org/dist/lucene/nutch/nutch-0.9.tar.gz.asc
Download nutch-0.9.tar.gz from one of the mirrors.
Verify
Check the cryptographic integrity of the distribution (using GNU Privacy Guard). The keys file downloaded above is saved as KEYS, so that is the file to import:
gpg --import KEYS
gpg --verify nutch-0.9.tar.gz.asc
(Look for the good signature output from gpg --verify. You probably haven't added Chris Mattmann as a trusted signer, so ignore the part about it not being a trusted signature -- the root of trust has to start somewhere...)
Unpack
Use a GNU-compliant version of tar to unpack this. Supposedly, the standard tar distributed with Solaris and Mac OS X is not GNU-compliant, but the tar --help output on my Mac OS X installation claims it is GNU tar (version 1.15.1). Maybe Apple has changed this recently...
tar -xvzf nutch-0.9.tar.gz
cd nutch-0.9
(Note: I got an error at the end of my tar command about a zero-size block, but can't tell whether it was a problem)
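One way to get more confidence after a warning like this is to check that the archive can be read from start to finish. The sketch below is my own addition, not part of the Nutch instructions; the function name archive_ok is made up for illustration. It does not explain the zero-size block message, it only reports whether gzip and tar can process the whole file.

```shell
# Succeeds only if the file passes gzip's integrity test and tar can
# list every entry in it. (archive_ok is a hypothetical helper name.)
archive_ok() {
    gzip -t "$1" && tar -tzf "$1" > /dev/null
}

archive="nutch-0.9.tar.gz"
# Only run the check if the download is in the current directory.
if [ -f "$archive" ]; then
    if archive_ok "$archive"; then
        echo "$archive: readable end to end"
    else
        echo "$archive: damaged or incomplete" >&2
    fi
fi
```

Combined with a good gpg signature, a clean result here suggests the download itself is intact.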
Configure Environment
Nutch needs you to set the environment variable NUTCH_JAVA_HOME. Basically, Nutch expects to find the java executable at
${NUTCH_JAVA_HOME}/bin/java
Here's how I figured this out:
which java
(In my case, this command found java at /usr/bin/java. Remove the '/bin/java' suffix and set NUTCH_JAVA_HOME to the part before it, which in this case is /usr.)
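Based on where you have java installed, do this for Bash (the variable has to be exported so that the bin/nutch script, which runs as a child process, can see it):
export NUTCH_JAVA_HOME=/usr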
You can add this to your .bashrc or .bash_profile if you expect to use Nutch frequently.
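The "strip /bin/java from the path" step can also be done by the shell. This is a minimal sketch under my own assumptions: the function name java_home_from_path is made up, and symlinks are not resolved, so on systems where the java found on the PATH is only a link or wrapper the result may not be the real Java home.

```shell
# Derive a candidate NUTCH_JAVA_HOME by stripping the trailing /bin/java
# from the path of the java executable.
# (java_home_from_path is a hypothetical helper name.)
java_home_from_path() {
    # ${1%/bin/java} removes the suffix "/bin/java" if it is present.
    printf '%s\n' "${1%/bin/java}"
}

# Only set the variable if java is actually on the PATH.
if java_path="$(command -v java)"; then
    NUTCH_JAVA_HOME="$(java_home_from_path "$java_path")"
    export NUTCH_JAVA_HOME
    echo "NUTCH_JAVA_HOME=${NUTCH_JAVA_HOME}"
fi
```

Whatever value comes out, ${NUTCH_JAVA_HOME}/bin/java should be a runnable java executable.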
Configure Websites
Nutch needs to know what websites to search. You'll want to start with a small set of websites at first. This example shows how to make a crawler that only looks at the nutch website.
Create the URLs directory and files
Nutch needs a set of files that contain URLs that you want to crawl. In this example, we'll add only a root url (i.e., where the crawling starts) for the nutch website.
mkdir urls
echo "http://lucene.apache.org/nutch/" > urls/nutch
URL Filter Configuration
(text from 0.8 tutorial:) Edit the file conf/crawl-urlfilter.txt and replace MY.DOMAIN.NAME with the name of the domain you wish to crawl. For example, if you wished to limit the crawl to the apache.org domain, the line should read:
+^http://([a-z0-9]*\.)*apache.org/
This will include any url in the domain apache.org.
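To get a feel for which URLs the expression accepts, it can be tried out with grep before running a crawl. This is only a sketch: it assumes grep -E treats this simple pattern the same way Nutch's own regular expression engine does, the function name url_allowed and the sample URLs are mine, and the leading "+" is left off because it is the filter file's include marker rather than part of the expression.

```shell
# The regular expression from the filter line, without the leading "+".
pattern='^http://([a-z0-9]*\.)*apache.org/'

# Succeeds if the URL given as $1 matches the pattern.
# (url_allowed is a hypothetical helper name.)
url_allowed() {
    printf '%s\n' "$1" | grep -Eq "$pattern"
}

for url in \
    "http://lucene.apache.org/nutch/" \
    "http://apache.org/" \
    "http://example.com/apache.org/"
do
    if url_allowed "$url"; then
        echo "include: $url"
    else
        echo "no match: $url"
    fi
done
```

Note that the dot in apache.org is not escaped, so it matches any single character; writing apache\.org would make the match stricter.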
Nutch Site Configuration
Edit the file conf/nutch-site.xml and insert the following properties into it:
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>your-agent-name</value>
  </property>
  <property>
    <name>http.agent.description</name>
    <value>your-description</value>
  </property>
  <property>
    <name>http.agent.url</name>
    <value>your-URL</value>
  </property>
  <property>
    <name>http.agent.email</name>
    <value>your-email-address</value>
  </property>
</configuration>
Fill in valid values for:
- your-agent-name: Text to name your crawler
- your-description: Text to describe your crawler
- your-URL: A web address (like http://example.com) for your crawler
- your-email-address: An obfuscated e-mail address for your crawler's contact person (e.g., your dot name at example dot com)
If you don't make these changes, then the crawler will fail with this somewhat cryptic error:
fetch of http://example.com/ failed with: java.lang.RuntimeException: Agent name not configured!
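Since the error only shows up once fetching starts, a quick check of the config file beforehand can save a wasted run. The following is a rough sketch of my own, not a Nutch feature: agent_name_configured is a made-up function name, and it just looks for a non-empty <value> directly after the http.agent.name <name> element rather than parsing the XML properly.

```shell
# Succeeds if the given file has a non-empty <value> right after
# <name>http.agent.name</name>. Newlines are removed first so the
# match works whether or not the XML is spread over several lines.
# (agent_name_configured is a hypothetical helper name.)
agent_name_configured() {
    tr -d '\n' < "$1" |
        grep -Eq '<name>http\.agent\.name</name>[[:space:]]*<value>[^<]+</value>'
}

if [ -f conf/nutch-site.xml ] && agent_name_configured conf/nutch-site.xml; then
    echo "http.agent.name is set"
else
    echo "http.agent.name is missing or empty in conf/nutch-site.xml" >&2
fi
```

Run it from the nutch-0.9 directory so that the relative path conf/nutch-site.xml resolves.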
Run Nutch Crawler
This section describes some basics on how to get the nutch web crawler to do a simple crawl.
Start a crawl
Type:
bin/nutch crawl urls -dir crawl -depth 3 -topN 50
Notes:
- crawl is the command to tell nutch to crawl the websites configured previously
- urls is the directory we made previously with all the root URLs
- -dir crawl is the directory to place the results into. This directory must not already exist. If crawl already exists (from a previous run), rename it to something else and run this command again.
- -depth 3: only follow links up to three levels away from the root
- -topN 50: fetch at most 50 pages at each depth level
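Because the results directory must not exist yet, repeated crawls mean renaming the old one each time. Here is a small sketch that automates the rename; the function name rotate_crawl_dir and the timestamped backup name are my own choices, not anything Nutch prescribes.

```shell
# Move an existing crawl directory out of the way so that
# "bin/nutch crawl ... -dir crawl" can create it fresh.
# (rotate_crawl_dir is a hypothetical helper name.)
rotate_crawl_dir() {
    dir="$1"
    if [ -e "$dir" ]; then
        # Example backup name: crawl.20090301-142500
        backup="${dir}.$(date +%Y%m%d-%H%M%S)"
        mv "$dir" "$backup"
        echo "moved $dir to $backup"
    fi
}

rotate_crawl_dir crawl

# Start the crawl only when run from the nutch-0.9 directory.
if [ -x bin/nutch ]; then
    bin/nutch crawl urls -dir crawl -depth 3 -topN 50
fi
```

The old results stay on disk under the backup name, so they can still be searched or deleted by hand later.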
Verify Searching Works
Run:
bin/nutch org.apache.nutch.searcher.NutchBean apache
where apache is a search term that will get at least one hit from the previous crawl (kept in the directory crawl).
