Friday, December 28, 2012

Nutch 2.1


Setting up Nutch 2.1 with MySQL to handle UTF-8


These instructions assume Ubuntu 12.04 with Java 6 or 7 installed and JAVA_HOME configured.
Install MySQL Server and MySQL Client using the Ubuntu software center or sudo apt-get install mysql-server mysql-client at the command line.
As MySQL defaults to latin1 (are we still in the 1990s?), edit the configuration with sudo vi /etc/mysql/my.cnf and add the following under [mysqld]:
innodb_file_format=barracuda
innodb_file_per_table=true
innodb_large_prefix=true
character-set-server=utf8mb4
collation-server=utf8mb4_unicode_ci
The InnoDB options work around MySQL's small index key size limit: an index key is normally capped at 767 bytes, and a varchar(767) primary key in 4-byte utf8mb4 far exceeds that, so the barracuda file format, per-table files, large index prefixes and a compressed row format are all needed. Restart your machine for the changes to take effect.
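If you would rather not reboot the whole machine, restarting just the MySQL service should be enough to pick up the new settings:
sudo service mysql restart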
Check that MySQL is running by typing sudo netstat -tap | grep mysql; you should see something like:
tcp        0      0 localhost:mysql         *:*                     LISTEN
We need to set up the nutch database manually, as the db schema currently generated by Nutch/Gora/MySQL defaults to latin. Log into MySQL at the command line using the id and password you set up earlier:
mysql -u xxxxx -p
then at the mysql prompt type the following:
CREATE DATABASE nutch DEFAULT CHARACTER SET utf8mb4 DEFAULT COLLATE utf8mb4_unicode_ci;
press enter, followed by
use nutch;
press enter again, and then copy and paste the following in one go:
CREATE TABLE `webpage` (
`id` varchar(767) NOT NULL,
`headers` blob,
`text` mediumtext DEFAULT NULL,
`status` int(11) DEFAULT NULL,
`markers` blob,
`parseStatus` blob,
`modifiedTime` bigint(20) DEFAULT NULL,
`score` float DEFAULT NULL,
`typ` varchar(32) CHARACTER SET latin1 DEFAULT NULL,
`baseUrl` varchar(767) DEFAULT NULL,
`content` longblob,
`title` varchar(2048) DEFAULT NULL,
`reprUrl` varchar(767) DEFAULT NULL,
`fetchInterval` int(11) DEFAULT NULL,
`prevFetchTime` bigint(20) DEFAULT NULL,
`inlinks` mediumblob,
`prevSignature` blob,
`outlinks` mediumblob,
`fetchTime` bigint(20) DEFAULT NULL,
`retriesSinceFetch` int(11) DEFAULT NULL,
`protocolStatus` blob,
`signature` blob,
`metadata` blob,
PRIMARY KEY (`id`)
) ENGINE=InnoDB
ROW_FORMAT=COMPRESSED
DEFAULT CHARSET=utf8mb4;
Press enter and you are done setting up the MySQL database for Nutch.
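As a quick sanity check you can confirm the table picked up the intended settings:
SHOW CREATE TABLE nutch.webpage;
The output should end with something like ENGINE=InnoDB ROW_FORMAT=COMPRESSED DEFAULT CHARSET=utf8mb4.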

Set up Nutch 2.1 by downloading the latest version from http://www.apache.org/dyn/closer.cgi/nutch/. Untar the file you just downloaded; going forward we will refer to the resulting folder as ${APACHE_NUTCH_HOME}.
From inside the nutch folder, ensure the MySQL dependency for Nutch is available by uncommenting the following in ${APACHE_NUTCH_HOME}/ivy/ivy.xml:
<!-- Uncomment this to use MySQL as database with SQL as Gora store. -->
<dependency org="mysql" name="mysql-connector-java" rev="5.1.18" conf="*->default"/>
Edit the ${APACHE_NUTCH_HOME}/conf/gora.properties file, either deleting or commenting out (with #) the default SqlStore properties. Then add the MySQL properties below, replacing xxxxx with the user and password you set up when installing MySQL earlier.
###############################
# MySQL properties            #
###############################
gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver
gora.sqlstore.jdbc.url=jdbc:mysql://localhost:3306/nutch?createDatabaseIfNotExist=true
gora.sqlstore.jdbc.user=xxxxx
gora.sqlstore.jdbc.password=xxxxx
Edit the ${APACHE_NUTCH_HOME}/conf/gora-sql-mapping.xml file changing the length of the primarykey from 512 to 767 in both places.
<primarykey column="id" length="767"/>
Configure ${APACHE_NUTCH_HOME}/conf/nutch-site.xml, putting a name in the value field under http.agent.name; it can be anything but must not be left blank. Add additional languages if you want (I have added Japanese, ja-jp, below) and set utf-8 as the default encoding as well. You must also specify SqlStore as the storage class.
<property>
<name>http.agent.name</name>
<value>Your Nutch Spider</value>
</property>
<property>
<name>http.accept.language</name>
<value>ja-jp, en-us,en-gb,en;q=0.7,*;q=0.3</value>
<description>Value of the "Accept-Language" request header field.
This allows selecting a non-English language as the default one to retrieve.
It is a useful setting for search engines built for a certain national group.
</description>
</property>
<property>
<name>parser.character.encoding.default</name>
<value>utf-8</value>
<description>The character encoding to fall back to when no other information
is available</description>
</property>
<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.sql.store.SqlStore</value>
<description>The Gora DataStore class for storing and retrieving data.
Currently the following stores are available: ….
</description>
</property>
Install ant using the Ubuntu software center or sudo apt-get install ant at the command line.
From the command line, cd to your nutch folder and type ant runtime.
This may take a few minutes to compile.

Start your first crawl by typing the lines below at the terminal (replace 'http://nutch.apache.org/' with whatever site you want to crawl):

cd ${APACHE_NUTCH_HOME}/runtime/local
mkdir -p urls
echo 'http://nutch.apache.org/' > urls/seed.txt
bin/nutch crawl urls -depth 3 -topN 5
You can easily add more urls to seed.txt by hand if you want. For the crawl command, depth is the number of rounds of generate/fetch/parse/update to run (not link depth, as you might think at first), and topN is the maximum number of links to actually fetch and parse in each round; the individual steps are sketched below. Note, however, that Nutch records every link it encounters in the webpage table and only limits how many it parses to topN, so don't be surprised to see many more rows in the webpage table than topN would suggest.
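If you are curious what the crawl command is doing, it is (roughly, as I understand it) equivalent to running the individual steps yourself, repeated once per depth round:
bin/nutch inject urls
bin/nutch generate -topN 5
bin/nutch fetch -all
bin/nutch parse -all
bin/nutch updatedb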
Check your crawl results by looking at the webpage table in the nutch database.

mysql -u xxxxx -p
use nutch;
SELECT * FROM nutch.webpage;
You should see the results of your crawl (around 159 rows). The columns are hard to read at the command line, so you may want to install MySQL Workbench via sudo apt-get install mysql-workbench and view the data there instead. You may also want to run select * from webpage where status = 2; to limit the rows to urls that were actually parsed.
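To see the breakdown across all of the status codes described further down in this post:
SELECT status, COUNT(*) FROM webpage GROUP BY status;
Status 2 rows were actually fetched; status 1 rows are links Nutch knows about but has not fetched yet.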

Set up and index with Solr

If you are using Nutch 2.1 at this time you are on the bleeding edge and probably want the latest version of Solr, 4.0, as well. Untar it to $HOME/apache-solr-4.0.0-XXXX. This folder will now be referred to as ${APACHE_SOLR_HOME}.
Download http://nlp.solutions.asia/wp-content/uploads/2012/08/schema.xml  and use it to replace ${APACHE_SOLR_HOME}/example/solr/collection1/conf/schema.xml.
From the terminal start solr:
cd ${APACHE_SOLR_HOME}/example
java -jar start.jar
You can check this is running by opening http://localhost:8983/solr in your web browser.
Leave that terminal running and from a different terminal type the following:
cd ${APACHE_NUTCH_HOME}/runtime/local/
bin/nutch solrindex http://127.0.0.1:8983/solr/ -reindex
You can now run Solr queries against your crawled content. Open http://localhost:8983/solr/#/collection1/query and, assuming you have crawled nutch.apache.org, enter text:nutch in the input box titled "q" and run the search; your crawled pages should come back as results.
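If you prefer the command line, the same query can be sent straight to the Solr HTTP API (assuming the default collection1 core):
curl 'http://localhost:8983/solr/collection1/select?q=text:nutch&wt=json&indent=true'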

There remains a lot to configure to get a good web search going but you are at least started.

Fields / Columns in Webpage 

id – the url of the page, used as the primary key.

headers – the HTTP response headers (e.g. Set-Cookie, Content-Length, Content-Encoding, Content-Type, Via, Date, Server, Vary); they are stored as a serialized blob, so the values run together when viewed raw.

text – the parsed text of the page.
status – Fetch field used to store whether the link was actually fetched:
1 = unfetched (links not yet fetched due to limits set in regex-urlfilter.txt, the -topN crawl parameter, etc.)
2 = fetched (page was successfully fetched)
3 = gone (the page no longer exists)
4 = redir_temp (temporary redirection – see reprUrl below for more details)
5 = redir_perm (permanent redirection – see reprUrl below for more details)
34 = retry
38 = notmodified

markers – job markers, e.g. _ftcmrk_, _gnmrk_ and __prsmrk__ (fetch, generate and parse marks), each paired with a batch id such as 1344476998-339713229.

parseStatus – Parse field, normally null until parsing is attempted. For the list of codes see ParseStatusCodes.html.
example (3 bytes): 02 00 00

modifiedTime – Fetch field – supposed to be the last time the signature changed (there may be a defect causing it to become 0 after multiple crawls).
example: 1344597206693

score – DbUpdate field ranking a given url/page's importance. Higher is better. See NewScoring.
example: 0.0183057

typ – Fetch field containing the MIME (Internet media) type of the document, such as text/html or application/pdf. Note that some MIME types are excluded by default; this can be modified in conf/regex-urlfilter.txt.
example: text/html

baseUrl – Fetch field. The base url for relative links contained in the content. May differ from the url if the request was redirected.
example: http://gora.apache.org/

content – Fetch field – the raw content of the url.
example:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"> 
<html> 
<head> 
<META http-equiv="Content-Type" content="text/html; charset=UTF-8"> 
<meta content="Apache Forrest" name="Generator"> 
<meta name="Forrest-version" content="0.9"> 
<meta name="Forrest-skin-name" content="nutch"> 
<title>Welcome to Apache Nutch&#153;</title> 
<link type="text/css" href="skin/basic.css" rel="stylesheet"> 
<link media="screen" type="text/css" href="skin/screen.css" rel="stylesheet"> 
<link media="print" type="text/css" href="skin/print.css" rel="stylesheet">
...



title – Parse field – The text in the title tags of the HTML head.
example: Welcome to Apache Nutch™

reprUrl – Fetch field for representative urls used for redirects. The default behaviour is that the fetcher won't immediately follow redirected URLs; instead it records them for fetching during the next round. The documentation indicates this can be changed to follow redirects immediately by copying the http.redirect.max property from conf/nutch-default.xml to conf/nutch-site.xml and setting a value greater than 0 (see the sketch after the example below). However, this is not yet implemented for Nutch 2.0 at this time, and every redirect is handled during the next fetch regardless of the http.redirect.max property.
example: http://www.apachecon.eu/c/aceu2009/sessions/136
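For reference, the override described above would look like this in conf/nutch-site.xml (the value 3 is just an illustration; any value greater than 0 is supposed to enable immediate redirects):
<property>
<name>http.redirect.max</name>
<value>3</value>
</property>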

fetchInterval – Fetch field containing the interval until the next fetch, in seconds; the default of 2592000 is 30 days (30 × 24 × 3600). See the fetchTime field for how the default schedule uses it. It can be set per url at inject time, which is why the field is necessary (see nutch_inject).
example: 2592000

prevFetchTime – Fetch field – the previous fetch time, or null if not available. This is the previous Nutch fetch time, not to be confused with modifiedTime, which is when the content itself was modified. See the fetchTime field default explanation.
example: 1347093015591

inlinks – DbUpdate field with inbound links, useful for LinkRank. See Webgraph at NewScoring.
example: http://blog.foofactory.fi/2007/03/twice-speed-half-size.html paired with its anchor text ("Website up"); the blob stores url/anchor pairs, so stray length bytes appear when it is viewed raw.

prevSignature – Parse field containing the previous signature. For more details see signature further down.
example (16 bytes): 25 59 5c 73 03 09 bb ed a0 98 5e b6 5e 0c 89 63

outlinks – DbUpdate field – outbound links
example:  http://www.adobe.com/jp/products/acrobat/readstep2.html

fetchTime – Fetch field used by the mapper to decide whether it is time to fetch this url. See how-to-re-crawl-with-nutch for a well written overview, and the Nutch API documentation for AbstractFetchSchedule. The default re-fetch schedule is somewhat simplistic: whether or not the page has changed, fetchInterval remains unchanged and the updated page's fetchTime is always set to fetchTime + fetchInterval * 1000 (fetchTime is in milliseconds, fetchInterval in seconds). See DefaultFetchSchedule. A better implementation for most cases is AdaptiveFetchSchedule. The FetchSchedule implementation can be changed by copying the db.fetch.schedule.class property from conf/nutch-default.xml to conf/nutch-site.xml and changing the value, as sketched after the example below.
example: 1347093160403
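The override in conf/nutch-site.xml to switch to the adaptive schedule would look like this:
<property>
<name>db.fetch.schedule.class</name>
<value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
</property>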
retriesSinceFetch – Fetch field counting retries due to (hopefully transient) errors since the last successful fetch. See AbstractFetchSchedule.
example: 2
protocolStatus – Fetch field – see ProtocolStatusCodes
example (3 bytes): 02 00 00
ACCESS_DENIED 17
BLOCKED 23
EXCEPTION 16
FAILED 2
GONE 11
MOVED 12
NOTFETCHING 20
NOTFOUND 14
NOTMODIFIED 21
PROTO_NOT_FOUND 10
REDIR_EXCEEDED 19
RETRY 15
ROBOTS_DENIED 18
SUCCESS 1
TEMP_MOVED 13
WOULDBLOCK 22
signature – This Parse field contains a signature calculated every time a page is fetched, so that Nutch knows on the next fetch whether the page has changed. The default implementation calculates the signature from both content and headers. For various reasons (etags, etc.) the headers can change without the actual content changing, making the default less than optimal for most requirements. For those looking to save bandwidth on a current-status crawl, or those implementing archival crawling (which requires more changes than just this), the TextProfileSignature implementation is more appropriate. The signature implementation can be changed by copying the db.signature.class property from conf/nutch-default.xml to conf/nutch-site.xml and changing the value to org.apache.nutch.crawl.TextProfileSignature, as shown after the example below.
example (16 bytes): e1 f7 cc cc 49 7a 45 6b e7 fc 05 68 9a e8 ea 93
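The corresponding override in conf/nutch-site.xml:
<property>
<name>db.signature.class</name>
<value>org.apache.nutch.crawl.TextProfileSignature</value>
</property>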
metadata – A mixed catch-all field for metadata; see metadata-package-summary.html for more information, though it is unclear how much of it works with 2.x. Note that the IndexMetatags plugin does not currently work in Nutch 2.0 or 2.1.
example: a _csh_ entry whose raw bytes display as garbage characters, plus a language entry with value en


Nutch Properties 



File properties

1. file.content.limit   = 65536
2. file.content.ignored = true 
3. file.crawl.parent = true 

HTTP Properties 

1. http.agent.name = ? 
2. http.robots.agents = * 
3. http.robots.403.allow = true 
4. http.agent.description = ? 
5. http.agent.url = ? 
6. http.agent.email = ? 
7. http.agent.version = Nutch-2.1 
8. http.agent.host = ? 
9. http.timeout = 10000 
10. http.max.delays = 100
11. http.content.limit = 65536
12. http.proxy.host = ? 
13. http.proxy.port = ? 
14. http.proxy.username = ? 
15. http.proxy.password = ? 
16. http.proxy.realm = ?
17. http.auth.file = httpclient-auth.xml
18. http.verbose = false 
19. http.useHttp11 = false 
20. http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3
21. http.accept = text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
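Any of these can be overridden by copying the property from conf/nutch-default.xml into conf/nutch-site.xml and editing the value. For example, pages are truncated to 64KB by default, so if you are losing content you could raise http.content.limit (the value below is just an illustration):
<property>
<name>http.content.limit</name>
<value>262144</value>
</property>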

FTP properties 

1. ftp.username = anonymous
2. ftp.password = anonymous@example.com
3. ftp.content.limit  = 65536
4. ftp.timeout = 60000
5. ftp.server.timeout = 100000
6. ftp.keep.connection = false 
7. ftp.follow.talk = false 


Web DB properties

1. db.default.fetch.interval = 30
2. db.fetch.interval.default = 60
3. db.fetch.interval.max = 7776000 ( 90 days ) 
4. db.fetch.schedule.class = org.apache.nutch.crawl.DefaultFetchSchedule
5. db.fetch.schedule.adaptive.inc_rate = 0.4
6. db.fetch.schedule.adaptive.dec_rate = 0.2
7. db.fetch.schedule.adaptive.min_interval = 60 ( 1 minute )
8. db.fetch.schedule.adaptive.max_interval = 31536000.0 ( 365 days )
9. db.fetch.schedule.adaptive.sync_delta = true 
10. db.fetch.schedule.adaptive.sync_delta_rate = 0.3
11. db.update.additions.allowed = true 
12. db.update.max.inlinks = 10000
13. db.ignore.internal.links = true 
14. db.ignore.external.links = false 
15. db.score.injected = 1.0 
16. db.score.link.external = 1.0
17. db.score.link.internal = 1.0 
18. db.score.count.filtered = false 
19. db.max.inlinks = 10000
20. db.max.outlinks.per.page = 100 
21. db.max.anchor.length = 100 
22. db.parsemeta.to.crawldb = ? 
23. db.fetch.retry.max = 3
24. db.signature.class = org.apache.nutch.crawl.MD5Signature 
25. db.signature.text_profile.min_token_len = 2 
26. db.signature.text_profile.quant_rate = 0.01 


Generate Properties

1. generate.max.count = -1 
2. generate.max.distance = -1 
3. generate.count.mode = host or domain
4. generate.update.crawldb = false 


URL partitioner properties

1. partition.url.mode = byHost  ( others byDomain, byIP)
2. crawl.gen.delay  = 604800000 ( 7 days ) 

Fetcher properties

1. fetcher.server.delay = 5.0
2. fetcher.server.min.delay = 0.0
3. fetcher.max.crawl.delay = 30 ( seconds ) 
4. fetcher.threads.fetch = 10
   desc: a. each FetcherThread handles one connection
         b. total number of threads running = number of fetcher threads × number of nodes
         c. the Fetcher has one map task per node

5. fetcher.threads.per.queue = 1 
6. fetcher.queue.mode = byHost 
7. fetcher.verbose = false 
8. fetcher.parse = false 
9. fetcher.store.content = true 
10. fetcher.timelimit.mins = -1 
11. fetcher.max.exceptions.per.queue = -1 
12. fetcher.throughput.threshold.pages = -1 
13. fetcher.throughput.threshold.sequence = 5
14. fetcher.throughput.threshold.check.after = 5 
15. fetcher.queue.depth.multiplier= 50 


Indexing filter plugin properties

1. indexingfilter.order = ? 
2. indexer.score.power = 0.5 

BasicIndexingFilter plugin properties

1. indexer.max.title.length = 100

MoreIndexingFilter plugin properties

1. moreIndexingFilter.indexMimeTypeParts = true

AnchorIndexingFilter plugin properties

1. anchorIndexingFilter.deduplicate = false 
