Book review: Web Crawling and Data Mining with Apache Nutch

I have received a copy of the book “Web Crawling and Data Mining with Apache Nutch” by Dr. Zakir Laliwala and Abdulbasit Shaikh. When I saw its title, some memories crossed my mind. Some years ago I was working with Nutch and I was impressed by its power. You could create your own “Google”-style search engine almost effortlessly. However, I couldn’t go deep due to the lack of resources at that time. Now, this book could save you a lot of time in learning web crawling and data mining.


The authors explain how to install Apache Nutch as well as Apache Solr. But instead of just pointing to their websites, they provide a list of steps collecting all the commands to run and the files you have to modify in order to have a proper installation. I think this walkthrough is the best feature of the book. The authors really wanted to keep people from getting stuck midway, looking for help on the web… So don’t worry if you are just starting to study this field.

However, there is something I didn’t like. The book mentions some tools that require the installation of a previous version of Nutch. I know it isn’t the authors’ fault, but it’s a bit confusing.

Some of the interesting subjects covered in the book are the creation of a custom Nutch plugin so you can adjust it to your requirements, how to deal with what people call ‘big data’, and how to create a front-end page using JavaScript.

Are you an expert in web crawling? This book also covers how to use Nutch with Apache Hadoop for running applications in a cluster environment.

In my opinion, if you are interested in this field, I would recommend this book. You could save a lot of time and focus on your data instead of installation problems.

Note: I received a copy of the book to write this review, but I haven’t received any money or gifts. So this is just my personal opinion.