Nutch2: Parse All Content and Get All Outlinks


Some links on our documentation site are generated dynamically, notably the left-side menu. As a result, neither Nutch2 nor Google can crawl every page on the site. So we decided to add one invisible link to a page that lists all pages on the site.

But Nutch2 does not extract all outlinks from that invisible page listing all pages.
In org.apache.nutch.parse.ParseUtil.process(String, WebPage), Nutch reads the parameter db.max.outlinks.per.page to cap the number of outlinks it keeps from a page; the default is 100:
int maxOutlinksPerPage = conf.getInt("db.max.outlinks.per.page", 100);
maxOutlinks = (maxOutlinksPerPage < 0) ? Integer.MAX_VALUE : maxOutlinksPerPage;
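To make the effect of the setting concrete, here is a minimal, self-contained sketch of how such a cap behaves. Only the ternary expression comes from the Nutch lines quoted above; the class name OutlinkLimitSketch, the helper capOutlinks, and the example.com URLs are hypothetical and exist only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class OutlinkLimitSketch {

    // Same rule as the quoted Nutch lines: a negative setting means "no limit".
    static int resolveMaxOutlinks(int maxOutlinksPerPage) {
        return (maxOutlinksPerPage < 0) ? Integer.MAX_VALUE : maxOutlinksPerPage;
    }

    // Hypothetical helper (not Nutch code): keep at most the resolved number of links.
    static List<String> capOutlinks(List<String> outlinks, int maxOutlinksPerPage) {
        int maxOutlinks = resolveMaxOutlinks(maxOutlinksPerPage);
        int keep = Math.min(outlinks.size(), maxOutlinks);
        return new ArrayList<>(outlinks.subList(0, keep));
    }

    public static void main(String[] args) {
        // Simulate a listing page with 250 links.
        List<String> links = new ArrayList<>();
        for (int i = 0; i < 250; i++) {
            links.add("http://example.com/page" + i);
        }
        System.out.println(capOutlinks(links, 100).size()); // prints 100
        System.out.println(capOutlinks(links, -1).size());  // prints 250
    }
}
```

With the default of 100, a listing page with 250 links loses 150 of them, which matches the symptom described above.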

We can set db.max.outlinks.per.page to -1 to tell Nutch to keep all outlinks.
We also need to change http.content.limit to -1 so that Nutch parses the full content of a page instead of truncating it, and raise http.timeout (the value is in milliseconds) to a larger number.
We put these changes in nutch-site.xml as below:

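<property>
  <name>db.max.outlinks.per.page</name>
  <value>-1</value>
</property>

<property>
  <name>http.content.limit</name>
  <value>-1</value>
</property>

<property>
  <name>http.timeout</name>
  <value>1000000</value>
</property>

<property>
  <name>db.ignore.internal.links</name>
  <value>false</value>
</property>

<property>
  <name>file.content.limit</name>
  <value>-1</value>
</property>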
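As a quick sanity check before a crawl, the overrides can be read back from the file. The sketch below is not part of Nutch: it uses only the JDK's DOM parser and assumes the usual nutch-site.xml layout of property elements (each with a name and a value) inside a configuration root. The class name SiteConfigCheck and the inline XML string are hypothetical stand-ins for the real file.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class SiteConfigCheck {

    // Collect every <property> as a name -> value pair, in document order.
    static Map<String, String> readProperties(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Map<String, String> props = new LinkedHashMap<>();
            NodeList nodes = doc.getElementsByTagName("property");
            for (int i = 0; i < nodes.getLength(); i++) {
                Element p = (Element) nodes.item(i);
                String name = p.getElementsByTagName("name").item(0).getTextContent().trim();
                String value = p.getElementsByTagName("value").item(0).getTextContent().trim();
                props.put(name, value);
            }
            return props;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse configuration XML", e);
        }
    }

    public static void main(String[] args) {
        // Inline stand-in for nutch-site.xml, using two of the overrides from this post.
        String xml = "<configuration>"
                + "<property><name>db.max.outlinks.per.page</name><value>-1</value></property>"
                + "<property><name>http.content.limit</name><value>-1</value></property>"
                + "</configuration>";
        Map<String, String> props = readProperties(xml);
        System.out.println(props.get("db.max.outlinks.per.page")); // prints -1
        System.out.println(props.get("http.content.limit"));       // prints -1
    }
}
```

To check the real file, read its content into a string (for example with java.nio.file.Files) and pass that to readProperties instead of the inline XML.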

