Nutch2: Parse All Content and Get All Outlinks

Some links on our documentation site are generated dynamically, notably the left-side menu. As a result, neither Nutch2 nor Google can crawl all the pages on the site. So we decided to add one invisible link to a page that lists every page on the site.

But Nutch2 was unable to get all the outlinks from this invisible listing-all-pages page.
In org.apache.nutch.parse.ParseUtil.process(String, WebPage), Nutch uses the parameter db.max.outlinks.per.page to cap the number of outlinks it takes from a page:
int maxOutlinksPerPage = conf.getInt("db.max.outlinks.per.page", 100);
maxOutlinks = (maxOutlinksPerPage < 0) ? Integer.MAX_VALUE : maxOutlinksPerPage;
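The effect of that ternary can be sketched in isolation. The following is a minimal, self-contained illustration and not Nutch's actual code: the class OutlinkLimitSketch and the helpers resolveMaxOutlinks and keepOutlinks are hypothetical names, and the real ParseUtil also does other work on the links before storing them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: shows how a negative db.max.outlinks.per.page
// turns into "no limit". This is not code from Nutch itself.
public class OutlinkLimitSketch {

    // Same rule as the snippet above: a negative value means unlimited.
    static int resolveMaxOutlinks(int configured) {
        return (configured < 0) ? Integer.MAX_VALUE : configured;
    }

    // Keep at most the resolved number of outlinks, in their original order.
    static List<String> keepOutlinks(List<String> found, int configured) {
        int max = resolveMaxOutlinks(configured);
        int end = Math.min(max, found.size());
        return new ArrayList<>(found.subList(0, end));
    }

    public static void main(String[] args) {
        List<String> found = Arrays.asList("a.html", "b.html", "c.html");
        System.out.println(keepOutlinks(found, 2));   // limited to two links
        System.out.println(keepOutlinks(found, -1));  // all links kept
    }
}
```

With the default of 100, a listing page that holds several hundred links would lose everything past the first hundred, which matches the behaviour described above.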

We can set db.max.outlinks.per.page to -1 to tell Nutch to get all outlinks.
We also need to change http.content.limit to -1 so that Nutch parses the whole content of a page, and raise http.timeout to a larger number.
We put these changes in nutch-site.xml as shown below:

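<property>
  <name>db.max.outlinks.per.page</name>
  <value>-1</value>
</property>

<property>
  <name>http.content.limit</name>
  <value>-1</value>
</property>

<property>
  <name>http.timeout</name>
  <value>1000000</value>
</property>

<property>
  <name>db.ignore.internal.links</name>
  <value>false</value>
</property>

<property>
  <name>file.content.limit</name>
  <value>-1</value>
</property>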

Linux Notes All In One

Add third-party yum-repositories
rpm -Uvh http://repo.webtatic.com/yum/centos/5/latest.rpm
yum install --enablerepo=webtatic git-all
Using CentOS 5 Repos in RHEL 5 Server
wget http://mirrors.nl.kernel.org/centos/5/os/x86_64/CentOS/centos-release-notes-5.10-0.x86_64.rpm
wget http://mirrors.nl.kernel.org/centos/5/os/x86_64/CentOS/centos-release-5-10.el5.centos.x86_64.rpm
rpm -e redhat-release-notes-5Server redhat-release-5Server --nodeps
rpm -ivh centos-release-notes-5.10-0.x86_64.rpm centos-release-5-10.el5.centos.x86_64.rpm
Installing RPMforge
rpm --import http://apt.sw.be/RPM-GPG-KEY.dag.txt
wget http://pkgs.repoforge.org/rpmforge-release/rpmforge-release-0.5.3-1.el5.rf.x86_64.rpm
rpm -ivh rpmforge-release-0.5.3-1.el5.rf.x86_64.rpm
yum search java | grep 'java-'
cd /etc/yum.repos.d

Open a file browser from bash
nautilus --browser .
Exit from git diff
Press q (git diff shows its output in a pager, usually less)
VNC Server Setup
/etc/sysconfig/vncservers
VNCSERVERS="1:vncuser 2:vncuser2 3:vncuser3"
VNCSERVERARGS[1]="-geometry 1600x1200"

service vncserver start|stop|restart
Create xstartup scripts
vi ~/.vnc/xstartup
Uncomment the following two lines (remove the "#" characters):
unset SESSION_MANAGER
exec /etc/X11/xinit/xinitrc
Managing your VNC sessions
vncserver -kill :1
List all VNC server sessions

ls ~/.vnc/*.pid
Check vnc version in redhat
rpm -qa | grep vnc-server
rpm -qf /usr/bin/vncserver

Copy and paste stops working in VNC session
Run vncconfig &

Install Java in Redhat
Add the CentOS repository
yum search java-1.7
yum install java-1.7**
alternatives --display java
/usr/sbin/alternatives --config java

alternatives --install /usr/bin/java java
