Book review: Web Crawling and Data Mining with Apache Nutch

I have received a copy of the book “Web Crawling and Data Mining with Apache Nutch”, by Dr. Zakir Laliwala and Abdulbasit Shaikh. When I looked at its title, some memories came back to me. Some years ago I was working with Nutch and I was impressed by its power: you could create your own “Google”-style search engine with little effort. However, I couldn’t go deeper because of the lack of resources at that time. Now this book can save you a lot of time when learning web crawling and data mining.


The authors explain how to install Apache Nutch as well as Apache Solr. Instead of just pointing to the project websites, they give a list of steps with all the commands to run and the files you have to modify in order to get a working installation. I think this is the best feature of the book. The authors really wanted to keep readers from getting stuck halfway and having to search the web for help. So don’t worry if you are just starting out in this field.

However, there is something I didn’t like. The book mentions some tools that require installing an earlier version of Nutch. I know it isn’t the authors’ fault, but it’s a bit confusing.

Some of the interesting subjects covered in the book are how to create your own Nutch plugin so you can adapt it to your requirements, how to deal with what people call ‘big data’, and how to create a front-end page using JavaScript.

Are you already an expert in web crawling? The book also covers how to use Nutch with Apache Hadoop to run applications in a cluster environment.

In my opinion, if you are interested in this field, this book is worth reading. You can save a lot of time and focus on your data instead of on installation problems.

Note: I received a copy of the book in order to write this review, but I haven’t received any money or gifts. This is just my personal opinion.

Nutch and Lucene in Eclipse or Netbeans

This entry will help you program against the Nutch API in Netbeans (I think it will work in Eclipse too).

First of all, you should download and install Nutch. There are a lot of tutorials for that. Before going on to the next step, you should have something like this:

Searching with Nutch

Now you want to create your own class in Netbeans. Create a new project in Netbeans and copy this:

package ull;

import java.io.IOException;
import java.util.logging.Level;
import java.util.logging.Logger;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.nutch.searcher.*;
import org.apache.nutch.util.NutchConfiguration;

public class Buscador {

    public static void main(String[] args) {
        Configuration conf = NutchConfiguration.create();
        // Folder where the Nutch plugins are
        conf.set("plugin.folders", "/home/ivan/Descargas/nutch-0.9/build/plugins");
        NutchBean bean;
        // Folder created by the crawl
        Path searchdir = new Path("/home/ivan/Documentos/proyecto/nutch1/crawl");
        try {
            bean = new NutchBean(conf, searchdir);
            Query query = Query.parse("enTodos", conf);
            // Ask for the 10 best hits
            Hits hits = bean.search(query, 10);
            System.out.println("Total hits: " + hits.getTotal());
            // Show at most 10 results, or fewer if there are not that many
            int length = (int) Math.min(hits.getTotal(), 10);
            Hit[] show = hits.getHits(0, length);
            HitDetails[] details = bean.getDetails(show);
            Summary[] summaries = bean.getSummary(details, query);

            for (int i = 0; i < length; i++) {
                System.out.println(" " + i + " " + details[i] + "\n" + summaries[i]);
            }
        } catch (IOException ex) {
            Logger.getLogger(Buscador.class.getName()).log(Level.SEVERE, null, ex);
        }
    }
}

Now you have to add Nutch.jar and, after that, all the jars under the lib folder. To do that, right-click on Library and choose Add external jar/folder.

The line conf.set("plugin.folders","/home/ivan/Descargas/nutch-0.9/build/plugins"); sets the folder where the plugins are. I know you are supposed to modify nutch-site.xml instead, but it didn't work for me. With that line in place you will avoid these errors:

java.lang.RuntimeException: org.apache.nutch.searcher.QueryFilter not found.

java.lang.IllegalArgumentException: plugin.folders is not defined
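Since both errors above come down to Nutch not finding its plugins, it can help to check the folder before you pass it to conf.set. This is only a minimal sketch using the Java standard library, not something that ships with Nutch: the class name PluginFolderCheck and the method findPlugins are hypothetical names of mine, and it assumes that each plugin sits in its own subfolder with a plugin.xml file inside, as in the build/plugins folder.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.stream.Stream;

// Hypothetical helper (not part of Nutch): checks a candidate plugin.folders value.
public class PluginFolderCheck {

    /**
     * Returns the sorted names of the subfolders of pluginsDir that contain
     * a plugin.xml file. Throws IllegalArgumentException if pluginsDir is
     * not an existing directory.
     */
    public static List<String> findPlugins(Path pluginsDir) throws IOException {
        if (!Files.isDirectory(pluginsDir)) {
            throw new IllegalArgumentException("Not a directory: " + pluginsDir);
        }
        List<String> names = new ArrayList<>();
        // Files.list must be closed, so it goes in try-with-resources
        try (Stream<Path> children = Files.list(pluginsDir)) {
            children
                .filter(Files::isDirectory)
                .filter(dir -> Files.isRegularFile(dir.resolve("plugin.xml")))
                .forEach(dir -> names.add(dir.getFileName().toString()));
        }
        Collections.sort(names);
        return names;
    }

    public static void main(String[] args) throws IOException {
        // Assumption: the default is the path used in this post; pass your own as first argument
        Path pluginsDir = Paths.get(args.length > 0
                ? args[0]
                : "/home/ivan/Descargas/nutch-0.9/build/plugins");
        try {
            List<String> plugins = findPlugins(pluginsDir);
            System.out.println(plugins.size() + " plugin(s) found in " + pluginsDir);
        } catch (IllegalArgumentException ex) {
            System.out.println("plugin.folders would be wrong: " + ex.getMessage());
        }
    }
}
```

If it reports zero plugins or says the path is not a directory, the path is probably wrong (for example, it points to the Nutch folder itself instead of its plugins folder).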

That's all!

If you want to debug the whole Nutch project, you can open it in Netbeans by installing the free-form plugin.