Technology: July 2012

Saturday, July 28, 2012

Mahout Clustering

./bin/mahout seqdirectory -i /home/venkat/Downloads/cj4test/newscluster/reuters-out/ -o /home/venkat/Downloads/cj4test/newscluster/reuters-out-seqdir -c UTF-8 -chunk 5
./bin/mahout seq2sparse -i /home/venkat/Downloads/cj4test/newscluster/reuters-out-seqdir/ -o /home/venkat/Downloads/cj4test/newscluster/reuters-out-seqdir-sparse -ng 2 -nv

./bin/mahout canopy -i /home/venkat/Downloads/cj4test/newscluster/reuters-out-seqdir-sparse/tfidf-vectors/ -o /home/venkat/Downloads/cj4test/newscluster/canopy-output -t1 0.8 -t2 0.7 -ow -xm sequential -dm org.apache.mahout.common.distance.CosineDistanceMeasure
./bin/mahout canopy -i /home/venkat/Downloads/cj4test/newscluster/reuters-out-seqdir-sparse/tfidf-vectors/ -o /home/venkat/Downloads/cj4test/newscluster/canopy-output -t1 3.0 -t2 2.8 -ow -xm sequential -dm org.apache.mahout.common.distance.CosineDistanceMeasure
./bin/mahout canopy -i /home/venkat/Downloads/cj4test/newscluster/reuters-out-seqdir-sparse/tfidf-vectors/ -o /home/venkat/Downloads/cj4test/newscluster/canopy-output -t1 0.5 -t2 1.0 -ow -xm sequential -dm org.apache.mahout.common.distance.CosineDistanceMeasure
./bin/mahout canopy -i /home/venkat/Downloads/cj4test/newscluster/reuters-out-seqdir-sparse/tfidf-vectors/ -o /home/venkat/Downloads/cj4test/newscluster/canopy-output -t1 2.0 -t2 1.0 -ow -xm sequential -dm org.apache.mahout.common.distance.CosineDistanceMeasure
./bin/mahout canopy -i /home/venkat/Downloads/cj4test/newscluster/reuters-out-seqdir-sparse/tfidf-vectors/ -o /home/venkat/Downloads/cj4test/newscluster/canopy-output -t1 0.2 -t2 0.1 -ow -xm sequential -dm org.apache.mahout.common.distance.CosineDistanceMeasure
./bin/mahout canopy -i /home/venkat/Downloads/cj4test/newscluster/reuters-out-seqdir-sparse/tfidf-vectors/ -o /home/venkat/Downloads/cj4test/newscluster/canopy-output -t1 0.1 -t2 0.1 -ow -xm sequential -dm org.apache.mahout.common.distance.CosineDistanceMeasure
./bin/mahout canopy -i /home/venkat/Downloads/cj4test/newscluster/reuters-out-seqdir-sparse/tfidf-vectors/ -o /home/venkat/Downloads/cj4test/newscluster/canopy-output -t1 0.8 -t2 0.8 -ow -xm sequential -dm org.apache.mahout.common.distance.CosineDistanceMeasure
./bin/mahout canopy -i /home/venkat/Downloads/cj4test/newscluster/reuters-out-seqdir-sparse/tfidf-vectors/ -o /home/venkat/Downloads/cj4test/newscluster/canopy-output -t1 3.0 -t2 2.8 -ow -xm sequential -dm org.apache.mahout.common.distance.EuclideanDistanceMeasure
./bin/mahout canopy -i /home/venkat/Downloads/cj4test/newscluster/reuters-out-seqdir-sparse/tfidf-vectors/ -o /home/venkat/Downloads/cj4test/newscluster/canopy-output -t1 3.0 -t2 2.8 -ow -xm sequential -dm org.apache.mahout.common.distance.ManhattanDistanceMeasure
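The effect of the -t1/-t2 values tried above can be seen in a minimal Python sketch of canopy clustering. This is a simplified textbook version, not Mahout's implementation; the function names are made up for illustration, and it assumes the usual convention that T1 is the loose threshold and T2 the tight one (T1 > T2).

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity for two equal-length dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0

def canopy(points, t1, t2, distance=cosine_distance):
    """Return a list of (center, members) canopies. Assumes t1 > t2."""
    if t1 <= t2:
        raise ValueError("t1 (loose threshold) must be greater than t2 (tight threshold)")
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining.pop(0)          # next unclaimed point seeds a canopy
        members = [center]
        still_remaining = []
        for p in remaining:
            d = distance(center, p)
            if d < t1:
                members.append(p)          # within the loose threshold: joins this canopy
            if d >= t2:
                still_remaining.append(p)  # outside the tight threshold: stays available
        remaining = still_remaining
        canopies.append((center, members))
    return canopies

points = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
print(len(canopy(points, 0.8, 0.7)))   # two well-separated groups
print(len(canopy(points, 3.0, 2.8)))   # everything lands in one canopy
```

Cosine distance never exceeds 2 (and never exceeds 1 for non-negative TF-IDF vectors), so thresholds such as 3.0/2.8 put every document into a single canopy when combined with CosineDistanceMeasure. Thresholds of that size only make sense with unbounded measures such as Euclidean or Manhattan distance.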

./bin/mahout kmeans -i /home/venkat/Downloads/cj4test/newscluster/reuters-out-seqdir-sparse/tfidf-vectors/ -c /home/venkat/Downloads/cj4test/newscluster/canopy-output/clusters-0 -o /home/venkat/Downloads/cj4test/newscluster/reuters-kmeans -x 10 -ow -cl -dm org.apache.mahout.common.distance.CosineDistanceMeasure
./bin/mahout kmeans -i /home/venkat/Downloads/cj4test/newscluster/reuters-out-seqdir-sparse/tfidf-vectors/ -c /home/venkat/Downloads/cj4test/newscluster/canopy-output -o /home/venkat/Downloads/cj4test/newscluster/reuters-kmeans -x 10 -k 2 -ow -cl -dm org.apache.mahout.common.distance.CosineDistanceMeasure
./bin/mahout kmeans -i /home/venkat/Downloads/cj4test/newscluster/reuters-out-seqdir-sparse/tfidf-vectors/ -c /home/venkat/Downloads/cj4test/newscluster/clusters -o /home/venkat/Downloads/cj4test/newscluster/reuters-kmeans -x 10 -k 4 -ow -cl -dm org.apache.mahout.common.distance.CosineDistanceMeasure
./bin/mahout kmeans -i /home/venkat/Downloads/cj4test/newscluster/reuters-out-seqdir-sparse/tfidf-vectors/ -c /home/venkat/Downloads/cj4test/newscluster/canopy-output/clusters-0 -o /home/venkat/Downloads/cj4test/newscluster/reuters-kmeans -x 10 -ow -cl -dm org.apache.mahout.common.distance.EuclideanDistanceMeasure
./bin/mahout kmeans -i /home/venkat/Downloads/cj4test/newscluster/reuters-out-seqdir-sparse/tfidf-vectors/ -c /home/venkat/Downloads/cj4test/newscluster/canopy-output/clusters-0 -o /home/venkat/Downloads/cj4test/newscluster/reuters-kmeans -x 10 -ow -cl -dm org.apache.mahout.common.distance.ManhattanDistanceMeasure
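What the kmeans runs above do with the initial clusters (-c) and the iteration cap (-x 10) can be sketched as plain Lloyd-style k-means. This is an illustrative sketch only, not Mahout's code; the function names and the squared Euclidean default are assumptions.

```python
def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, centroids, max_iter=10, distance=squared_euclidean):
    """Refine the given initial centroids for at most max_iter iterations."""
    centroids = [tuple(c) for c in centroids]
    assignment = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        assignment = [
            min(range(len(centroids)), key=lambda k: distance(p, centroids[k]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:
                new_centroids.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centroids.append(old)   # keep an empty cluster where it was
        if new_centroids == centroids:      # converged before the iteration cap
            break
        centroids = new_centroids
    return centroids, assignment

points = [(0, 0), (0, 1), (10, 10), (10, 11)]
centers, labels = kmeans(points, [(0, 0), (10, 10)])
print(centers, labels)
```

Seeding from canopy centers, as in the commands that pass canopy-output/clusters-0 to -c, replaces the random choice of k starting points that -k would otherwise trigger.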

Sunday, July 22, 2012

Top 6 Magazines That Every Entrepreneur Must Read

1. Fortune Magazine
2. Forbes
3. Inc. Magazine
4. Harvard Business Review
5. Entrepreneur Magazine
6. Fast Company

Great Blogs Every Entrepreneur Must Read

1. Quora
2. PandoDaily
3. LinkedIn Today
4. Both Sides of the Table
5. Steve Blank
6. Penelope Trunk's Brazen Careerist
7. Wise Bread
8. Church of the Customer
9. How to change the world
10. AllBusiness.com
11. A VC
12. Entrepreneur Daily Dose
13. Small business brief
14. Blog Maverick
15. Venture Hacks
16. The entrepreneurial mind

7 Blockbuster Movies that Inspire Every Entrepreneur

1. Guru
2. Rocket Singh
3. The Social Network
4. Wall Street
5. Forrest Gump
6. The Bridge on the River Kwai
7. Charlie Bartlett

Every Startup Team Must Have These 6 Skills

Revenue Driver and a Technical Founder
Experience matters
Lead the leaders
Don’t be afraid
Titles can fool you
Managing expectations

Friday, July 20, 2012

Command Line commands for Kmeans Clustering

./bin/mahout seqdirectory -i ./examples/bin/work/reuters-out/ -o ./examples/bin/work/reuters-out-seqdir -c UTF-8 -chunk 5

./bin/mahout seq2sparse -i ./examples/bin/work/reuters-out-seqdir/ -o ./examples/bin/work/reuters-out-seqdir-sparse -ng 2 -nv

./bin/mahout kmeans -i ./examples/bin/work/reuters-out-seqdir-sparse/tfidf-vectors/ -c ./examples/bin/work/clusters -o ./examples/bin/work/reuters-kmeans -x 10 -k 2 -ow -cl

./bin/mahout clusterdump -d ./examples/bin/work/reuters-out-seqdir-sparse/dictionary.file-0 -dt sequencefile -s ./examples/bin/work/reuters-kmeans/clusters-3/part-r-00000 -n 20 -b 100 -p ./examples/bin/work/reuters-kmeans/clusteredPoints

./bin/mahout seqdumper -s ./examples/bin/work/reuters-kmeans/clusteredPoints/part-m-00000 | more

./bin/mahout rowid -i ./examples/bin/work/reuters-out-seqdir-sparse/tfidf-vectors/part-r-00000 -o ./examples/bin/work/reuters-matrix

./bin/mahout rowid -Dmapred.input.dir=/home/venkat/Desktop/mahout/mahout-distribution-0.5/examples/bin/work/reuters-out-seqdir-sparse/tfidf-vectors/part-r-00000 -Dmapred.output.dir=/home/venkat/Desktop/mahout/mahout-distribution-0.5/examples/bin/work/reuters-matrix

4 rows and 1073 columns

./bin/mahout seqdumper -s /home/venkat/Desktop/mahout/mahout-distribution-0.5/examples/bin/work/reuters-matrix/matrix | more

./bin/mahout rowsimilarity -i /home/venkat/Desktop/mahout/mahout-distribution-0.5/examples/bin/work/reuters-matrix/matrix \
-o /home/venkat/Desktop/mahout/mahout-distribution-0.5/examples/bin/work/reuters-named-similarity \
-r 1073 \
--similarityClassname SIMILARITY_COSINE -m 10

./bin/mahout rowsimilarity -i /home/venkat/Desktop/mahout/mahout-distribution-0.5/examples/bin/work/reuters-matrix/matrix -o /home/venkat/Desktop/mahout/mahout-distribution-0.5/examples/bin/work/reuters-named-similarity -r 1073 --similarityClassname SIMILARITY_COOCCURRENCE -m 10 --tempDir /home/venkat/Desktop/tmp

SIMILARITY_COOCCURRENCE,
SIMILARITY_EUCLIDEAN_DISTANCE,
SIMILARITY_LOGLIKELIHOOD,
SIMILARITY_PEARSON_CORRELATION,
SIMILARITY_TANIMOTO_COEFFICIENT,
SIMILARITY_UNCENTERED_COSINE,
SIMILARITY_UNCENTERED_ZERO_ASSUMING_COSINE,
SIMILARITY_CITY_BLOCK
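To make a couple of the measure names above concrete, here is a minimal Python sketch of cosine and Tanimoto similarity over sparse rows, plus a "keep the top m per row" step like the -m 10 option. These are the textbook definitions; Mahout's exact formulas for each SIMILARITY_* name may differ, and the function names and dict-based row format are assumptions for illustration.

```python
import math

def dot(a, b):
    """Dot product of two sparse rows stored as {column_index: weight}."""
    if len(a) > len(b):
        a, b = b, a
    return sum(v * b.get(k, 0.0) for k, v in a.items())

def cosine(a, b):
    denom = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / denom if denom else 0.0

def tanimoto(a, b):
    ab = dot(a, b)
    denom = dot(a, a) + dot(b, b) - ab
    return ab / denom if denom else 0.0

def top_similar(rows, row_id, measure=cosine, m=10):
    """Return the m rows most similar to row_id as (other_id, score) pairs."""
    scores = [(other, measure(rows[row_id], vec))
              for other, vec in rows.items() if other != row_id]
    return sorted(scores, key=lambda s: s[1], reverse=True)[:m]

rows = {0: {1: 1.0, 2: 1.0}, 1: {1: 1.0, 2: 1.0}, 2: {3: 1.0}}
print(top_similar(rows, 0, measure=tanimoto, m=1))
```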

./bin/mahout seqdumper -s /home/venkat/Desktop/mahout/mahout-distribution-0.5/examples/bin/work/reuters-matrix/docIndex

Canopy

./bin/mahout canopy -i ./examples/bin/work/reuters-out-seqdir-sparse/tfidf-vectors/ -o ./examples/bin/work/canopy-output -t1 3.0 -t2 2.8 -t3 3.0 -t4 2.8 -ow -xm sequential

./bin/mahout kmeans -i ./examples/bin/work/reuters-out-seqdir-sparse/tfidf-vectors/ -c ./examples/bin/work/canopy-output -o ./examples/bin/work/reuters-kmeans -x 10 -k 3 -ow
./bin/mahout kmeans -i ./examples/bin/work/reuters-out-seqdir-sparse/tfidf-vectors/ -c ./examples/bin/work/canopy-output -o ./examples/bin/work/reuters-kmeans -x 10 -ow

./bin/mahout seqdumper -s ./examples/bin/work/reuters-kmeans/clusters-1/part-r-00000

./bin/mahout canopy \
-i ./examples/bin/work/reuters-out-seqdir-sparse/tfidf-vectors/ \
-o ./examples/bin/work/canopy-output \
-dm org.apache.mahout.common.distance.ManhattanDistanceMeasure \
-t1 3.0 \
-t2 2.8 \
-t3 3.0 \
-t4 2.8 \
-ow \
-cl \
-xm sequential

Usage:
 [--minSupport <minSupport> --analyzerName <analyzerName> --chunkSize <chunkSize>
  --output <output> --input <input> --minDF <minDF> --maxDFPercent <maxDFPercent>
  --weight <weight> --norm <norm> --minLLR <minLLR> --numReducers <numReducers>
  --maxNGramSize <ngramSize> --overwrite --help
  --sequentialAccessVector --namedVector --logNormalize]
Options
  --minSupport (-s) minSupport          (Optional) Minimum Support. Default Value: 2
  --analyzerName (-a) analyzerName      The class name of the analyzer
  --chunkSize (-chunk) chunkSize        The chunkSize in MegaBytes. 100-10000 MB
  --output (-o) output                  The output directory
  --input (-i) input                    Input dir containing the documents in sequence file format
  --minDF (-md) minDF                   The minimum document frequency. Default is 1
  --maxDFPercent (-x) maxDFPercent      The max percentage of docs for the DF. Can be used to remove
                                        really high frequency terms. Expressed as an integer between
                                        0 and 100. Default is 99.
  --weight (-wt) weight                 The kind of weight to use. Currently TF or TFIDF
  --norm (-n) norm                      The norm to use, expressed as either a float or "INF" if you
                                        want to use the Infinite norm. Must be greater or equal to 0.
                                        The default is not to normalize
  --minLLR (-ml) minLLR                 (Optional) The minimum Log Likelihood Ratio (Float). Default is 1.0
  --numReducers (-nr) numReducers       (Optional) Number of reduce tasks. Default Value: 1
  --maxNGramSize (-ng) ngramSize        (Optional) The maximum size of ngrams to create
                                        (2 = bigrams, 3 = trigrams, etc). Default Value: 1
  --overwrite (-ow)                     If set, overwrite the output directory
  --help (-h)                           Print out help
  --sequentialAccessVector (-seq)       (Optional) Whether output vectors should be
                                        SequentialAccessVectors. If set true else false
  --namedVector (-nv)                   (Optional) Whether output vectors should be NamedVectors.
                                        If set true else false
  --logNormalize (-lnorm)               (Optional) Whether output vectors should be logNormalize.
                                        If set true else false
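The options above (weight, minDF, maxNGramSize) are easier to follow with a small worked example. The Python sketch below builds n-grams and TF-IDF weights using the textbook formula tf * log(N / df). It is an assumption-laden illustration, not Mahout's code: Mahout's weighting formula and its log-likelihood filtering of n-grams (minLLR) are not reproduced here, and the function names are made up.

```python
import math
from collections import Counter

def ngrams(tokens, max_n=1):
    """All n-grams from size 1 up to max_n (max_n=2 adds bigrams)."""
    out = []
    for n in range(1, max_n + 1):
        out.extend(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return out

def tfidf(docs, max_n=1, min_df=1):
    """Return one sparse {term: weight} vector per document."""
    term_lists = [ngrams(d.lower().split(), max_n) for d in docs]
    # Document frequency: in how many documents each term occurs.
    df = Counter(t for terms in term_lists for t in set(terms))
    n_docs = len(docs)
    vectors = []
    for terms in term_lists:
        tf = Counter(terms)
        vectors.append({t: c * math.log(n_docs / df[t])
                        for t, c in tf.items() if df[t] >= min_df})
    return vectors

docs = ["oil prices rise", "oil prices fall"]
for vec in tfidf(docs, max_n=2):
    print(vec)
```

Terms that occur in every document ("oil", "prices") get weight 0 under this formula, which is the same intuition behind maxDFPercent: very frequent terms carry little information for clustering.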

Monday, July 16, 2012

	Hadoop

	Hadoop Distributed File System

1	A scalable, fault-tolerant, high-performance distributed file system (storage)
2	Asynchronous replication
3	Write-once, read-many (WORM)
4	Data compression (BZIP2)
5	Hadoop cluster with 3 nodes minimum
6	Data is divided into 64 MB blocks (the default block size)
7	Each block is replicated 3 times (default)
8	No RAID required
9	Access from RESTful, Java, FUSE
10	Name Node holds filesystem metadata
11	Files are broken up and spread over the DataNodes
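The block and replication figures in the list above translate into simple arithmetic. The sketch below assumes the 64 MB block size and replication factor of 3 quoted in the list; the function name is made up for illustration.

```python
import math

BLOCK_SIZE_MB = 64   # block size quoted in the list above
REPLICATION = 3      # default replication factor quoted in the list above

def hdfs_footprint(file_size_mb, block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """How many blocks a file is split into and how much raw disk it consumes."""
    blocks = math.ceil(file_size_mb / block_size_mb)
    return {
        "blocks": blocks,                             # tracked by the NameNode
        "block_replicas": blocks * replication,       # spread over the DataNodes
        "raw_storage_mb": file_size_mb * replication, # the last block is not padded
    }

print(hdfs_footprint(1000))
# {'blocks': 16, 'block_replicas': 48, 'raw_storage_mb': 3000}
```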

	Map Reduce

1	Software framework for distributed computation
2	Input | map() | copy/sort | reduce() | output
3	JobTracker schedules and manages jobs on the NameNode
4	TaskTracker executes individual map() and reduce() tasks on each DataNode

5	Map Phase – Raw data is analyzed and converted to key/value pairs
6	Shuffle Phase – All key/value pairs are sorted and grouped by their keys
7	Reduce Phase – All values associated with a key are processed for results

8	NameNode & JobTracker on the same server
9	DataNode & TaskTracker on the same server
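The map, shuffle and reduce phases listed above can be imitated in a single Python process. This is only a toy word count to show the data flow; real Hadoop jobs are written against the Hadoop API and run across many TaskTrackers, and the function names here are made up.

```python
from collections import defaultdict

def map_phase(document):
    """Map: turn raw text into (key, value) pairs."""
    for word in document.lower().split():
        yield word, 1

def shuffle_phase(pairs):
    """Shuffle: sort the pairs and group all values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))

def reduce_phase(key, values):
    """Reduce: process all values associated with one key."""
    return key, sum(values)

def word_count(documents):
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle_phase(pairs).items())

print(word_count(["a b a", "b c"]))
# {'a': 2, 'b': 2, 'c': 1}
```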

Best Virtual Private Servers (VPS) Hosting

Provider	Plan	Rate	RAM	Storage	Bandwidth	OS	CPU	IPs	Comments	Swap

GoDaddy	Economy	Rs.1,620.00	1 GB	40 GB	1000 GB/month	Centos	?
	Value	Rs.2,160.00	2 GB	60 GB	2000 GB/month	centos	?

Hostgator	Level 1	$20	384 MB	10 GB	250 GB/month	Centos	0.57 GHz	2	claim to host around 250,000 domains
	Level 2	$30	576 MB	22 GB	375 GB/month	Centos	0.85 GHz	2
	Level 3	$40	768 MB	30 GB	500 GB	Centos	1.13 GHz	2
	Level 4	$70	1344 MB	59 GB	1050 GB	Centos	1.98 GHz	2
	Level 5	$95	1824 MB	80 GB	1425 GB	Centos	2.69 GHz	2
	Level 6	$120	2304 MB	102 GB	1800 GB	Centos	3.4 GHz	2
	Level 7	$150	3168 MB	165 GB	2250 GB	Centos	4.25 GHz	2
	Level 8	$180	3801 MB	198 GB	2700 GB	Centos	5.09 GHz	2
	Level 9	$210	4435 MB	231 GB	3150 GB	Centos	5.94 GHz	2

Inmotionhosting	VPS-1000	$40	512 MB	40 GB	750 GB	Centos	?	2	Best VPS
	VPS-2000	$65	1 GB	80 GB	1500 GB	Centos	?	5
	VPS-3000	$130	2 GB	160 GB	2500 GB	Centos	?	10

Liquid Web	Smart VPS 1 GB	$50	880 MB	75 GB	2 TB	Linux	1 CPU		Reliable VPS
	Smart VPS 2 GB	$90	1750 MB	150 GB	2 TB	Linux	1 CPU
	Smart VPS 4 GB	$150	3500 MB	300 GB	2 TB	Linux	2 CPUs
	Smart VPS 8 GB	$270	6900 MB	600 GB	2 TB	Linux	4 CPUs

MDDHosting	Basic	$99.50 + $25 sup	1 GB	500 GB	?	Centos	2 Cores @ 2.4+ GHz	2
	Intermediate	$124.50 + $25	1.5 GB	1000 GB	?	Centos	2 Cores @ 2.4+ GHz	2
	Advanced	$149.50 + $25	2 GB	1500 GB	?	Centos	2 Cores @ 2.4+ GHz	4


arvixe	VPS Class	$40	1 GB	40 GB	UL	Linux	2 CPU cores	2
	VPS Class Pro	$70	2 GB	80 GB	UL	Linux	4 CPU cores	5


downtownhost	Copper	$20	384 MB	10 GB	200 GB	Linux	?	2		768 MB
	Bronze	$30	512 MB	20 GB	300 GB	Linux	?	2		1024 MB
	Silver	$50	1.25 GB	30 GB	400 GB	Linux	?	2		2.5 GB
	Gold	$60	1.5 GB	40 GB	500 GB	Linux	?	2		3 GB
	Platinum	$70	2 GB	50 GB	700 GB	Linux	?	2		4 GB


1 and 1	Virtual Server L	$29	512 MB	20 GB	1000 GB/month	Linux	?
	Virtual Server XL	$49	1 GB	40 GB	2000 GB/month	Linux	?
	Virtual Server XXL	$69	2 GB	80 GB	2500 GB	Linux	?


lunarpages	Linux VPS	$45	512 MB	30 GB	1000 GB/month	Linux		1


	List of Women's Colleges in Kerala

	Universities	: 13
	Medical Colleges	: 19
	Teacher Training Colleges	: 66
	Engineering Colleges	: 74
	Arts and Science Colleges	: 195
	Polytechnic	: 54
	Others	: 323


	Engineering Colleges

1	LBS Institute of Technology for Women
	Poojappura, Tiruvananthapuram. Pin 695012,
	Phone : 471

	Computer Science & Engineering, Electronics & Communication Engg., Applied Electronics & Instrumentation , Information Technology

2	Younus College of Engineering for Women
	Pallimukku
	Vadakkevila P.O
	Kollam

	Arts and Science Colleges

1	Anvarul Islam Arabic College for Women
	Anvarul Islam Arabic College for Women, Monga
	BA, BSc.

2	B K College for Women
	B K College for Women, Amalagiri, Athirampuzh
	BA,BSC,MA,MCom.

3	College for Women
	College for Women, Thiruvananthapuram
	BSc,BA,BCom,MA,MSc.

4	Krishna Menon Memorial Govt. Women's College
	Krishna Menon Memorial Govt. Women's College, C
	BA.

5	Mar Thoma College for Women
	Perumbavoor - 683 542.
	Pre Degree.

6	Providence Womens College
	Providence Womens College. Calicut
	BA,BCom,MCom

7	Sacred Heart College for Women
	Sacred Heart College for Women, Chalakudy
	BA.

8	Sree Narayanan College for Women
	Sree Narayanan College for Women, Kollam
	BA,BSc,BCom,MSc.

9	St Joseph College for Women
	St Joseph College for Women, Alapuzha
	BSc,BCom,BA.

10	St Xavier's College for Women
	Alwaye, Ernakulam
	Pre Degree


	MCA Courses

1	College of Engineering
	Thiruvananthapuram Kerala
	MBA, MCA
	Management University of Kerala

2	Dayapuram Institute of Management Arts & Technology
	Dayapuram Institute of Management Arts & Technology Dayapuram-District:Rec Calicut Kerala
	MCA

3	ER & DCI Institute of Technology
	ER & DCI-Campus Vellayambalam Thiruvananthapuram Kerala
	MCA

4	M.E.S. College
	Marampally Aluva Kerala
	MCA

5	Mar Athanasios College for Advanced Studies
	Tiruvalla
	MCA, MBA

6	Marian College
	P.O. Kuttikkanam-Peermade Kerala
	MCA

7	Rajiv Gandhi Institute of Technology
	Kottayam-Velloor P.O. Kottayam
	MCA

8	Regional Centre School of Technology & Applied
	Near Government High School Edappally Kochi Kerala
	MCA

9	Union Christian College U.C. College
	P.O.:- Aluva Kerala
	MCA


	Engineering Colleges

	Younus College of Engineering for Women, Kollam
	L.B.S. Institute of Technology for Women, Thiruvananthapuram
	Indira Gandhi Institute of Engineering & Technology for Women, Ernakulam
	K.M.C.T College of Engineering for Women, Calicut
	K.R. Gouri Amma College of Engineering for Women, Alappuzha
	Mount Zion College of Engineering for Women, Chengannur
	Prime College of Engineering for Women, Palakkad
	Sree Buddha College of Engineering for Women, Pathanamthitta
	St. Thomas Institute for Science and Technology, Thiruvananthapuram

	Arts and Science Colleges

	Govt. College for Women, Trivandrum
	Krishna Menon Memorial Women's College, Kannur
	Ansar Women's College, Thrissur
	Chinmaya Arts and Science for Women, Kannur
	Dayapuram Arts and Science College for Women, Calicut
	Marthoma College for Women, Cochin
	N.S.S. College for Women, Peringamala
	Providence Women's College, Calicut
	S.N. College for Women, Kottiyam
	Sacred Heart College for Women, Chalakkudy
	St. Joseph College for Women, Cherthala
	St. Xaviers College for Women, Aluva
	Unity Women's College, Manjeri

Google Architecture

How Google Serves Data from Multiple Datacenters

Google App Engine uses master/slave replication between datacenters
lowish latency writes
datacenter failure survival
strong consistency guarantees.

Google Architecture ( 2008)

Sorting 1 PB with MapReduce took 6:02 hrs on 4000 computers.
Results were replicated thrice on 48,000 disks.
100k MapReduce jobs are executed each day.
> 20 petabytes of data are processed per day.
> 10k MapReduce programs have been implemented.
Machines are dual processor with gigabit ethernet and 4-8 GB of memory.

Stats

450,000 (4.5 lakh) low-cost commodity servers in 2006.
Indexed 8 billion web pages in 2005.
200 GFS clusters, a cluster can have 1000 or even 5000 machines
Aggregate read/write throughput can be as high as 40 gigabytes/second across the cluster.
6000 MapReduce applications

Stack

Distributed Systems Infrastructure: GFS, MapReduce, and BigTable.
Spend more money on hardware to not lose log data, but spend less on other types of data.

Google File System

core storage platform
large distributed log structured file system
high reliability across data centers
scalability to thousands of network nodes
huge read/write bandwidth requirements
support for large blocks of data which are gigabytes in size.
efficient distribution of operations across nodes to reduce bottlenecks
System has master and chunk servers.
Master servers keep metadata on the various data files
Data are stored in the file system in 64MB chunks
Each chunk is replicated across 3 different chunk servers
Key is enough infrastructure to make sure people have choices for their application

MapReduce

GFS -> Map -> Shuffle -> Reduction -> Store Results back into GFS.
Nice way to partition tasks across lots of machines.
Handle machine failure.
Works across different application types, like search and ads.
MapReduce system has 3 different types of servers.
Master Server, Map Server, Reduce Server
The Master server assigns user tasks to map and reduce servers. It also tracks the state of the tasks.
The Map servers accept user input and perform map operations on it. The results are written to intermediate files.
The Reduce servers accept intermediate files produced by map servers and perform reduce operations on them.
For example, suppose you want to count the number of words in all web pages. You would feed all the pages stored on GFS into MapReduce. This would all happen on thousands of machines simultaneously, and all the coordination, job scheduling, failure handling, and data transport would be done automatically.
The Google indexing pipeline has about 20 different map reductions
One problem is stragglers. A straggler is a computation that is going slower than others which holds up everyone. Stragglers may happen because of slow IO (say a bad controller) or from a temporary CPU spike. The solution is to run multiple of the same computations and when one is done kill all the rest.
Data transferred between map and reduce servers is compressed. The idea is that because servers aren't CPU bound it makes sense to spend on data compression and decompression in order to save on bandwidth and I/O.
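The straggler remedy described above (run the same computation several times and keep whichever copy finishes first) can be sketched in Python with threads. This is a toy illustration: the function name and the artificial delays are assumptions, and since Python threads cannot be killed, the slow copy is simply ignored rather than terminated.

```python
import time
from concurrent.futures import ThreadPoolExecutor, FIRST_COMPLETED, wait

def run_with_backups(task, arg, delays):
    """Run one replica of task(arg) per delay and return the first result.

    Each delay (in seconds) simulates how slow the machine running that
    replica is; a large delay plays the role of a straggler.
    """
    def replica(delay):
        time.sleep(delay)
        return task(arg)

    pool = ThreadPoolExecutor(max_workers=len(delays))
    futures = [pool.submit(replica, d) for d in delays]
    done, _ = wait(futures, return_when=FIRST_COMPLETED)
    # Do not block on the stragglers; they finish in the background and
    # their results are discarded.
    pool.shutdown(wait=False)
    return next(iter(done)).result()

print(run_with_backups(sum, [1, 2, 3], delays=[0.2, 0.0]))
# 6
```

All replicas compute the same answer, so it does not matter which one wins; the only gain is that the caller no longer waits for the slowest machine.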

Big Table

BigTable is a large scale, fault tolerant, self managing system that includes terabytes of memory and petabytes of storage. It can handle millions of reads/writes per second.
BigTable is a distributed hash mechanism built on top of GFS. It is not a relational database. It doesn't support joins or SQL type queries.
It provides a lookup mechanism to access structured data by key. GFS stores opaque data, and many applications need data with structure.
Machines can be added and deleted while the system is running and the whole system just works.
Each data item is stored in a cell which can be accessed using a row key, column key, and timestamp.
Each row is stored in one or more tablets. A tablet is a sequence of 64KB blocks in a data format called SSTable.
BigTable has three different types of servers: Master Server, Tablet Server, Lock Server
The Master servers assign tablets to tablet servers. They track where tablets are located and redistribute tasks as needed.
The Tablet servers process read/write requests for tablets. They split tablets when they exceed size limits (usually 100MB - 200MB). When a tablet server fails, 100 tablet servers each pick up 1 new tablet and the system recovers.
The Lock servers form a distributed lock service. Operations like opening a tablet for writing, Master arbitration, and access control checking require mutual exclusion.
A locality group can be used to physically store related bits of data together for better locality of reference.
Tablets are cached in RAM as much as possible.
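The data model described above, cells addressed by row key, column key and timestamp, can be mimicked with a small in-memory class. This is purely an illustrative sketch with made-up names; it is not BigTable's API and leaves out tablets, SSTables and distribution entirely.

```python
class TinyTable:
    """Toy versioned key/value table: (row, column, timestamp) -> value."""

    def __init__(self):
        self._cells = {}  # (row, column) -> list of (timestamp, value), newest first

    def put(self, row, column, timestamp, value):
        versions = self._cells.setdefault((row, column), [])
        versions.append((timestamp, value))
        versions.sort(key=lambda tv: tv[0], reverse=True)

    def get(self, row, column, timestamp=None):
        """Newest value, or the newest value at or before the given timestamp."""
        for ts, value in self._cells.get((row, column), []):
            if timestamp is None or ts <= timestamp:
                return value
        return None

    def scan(self, start_row, end_row):
        """Rows are kept in key order, so a range scan is a sorted walk."""
        for row, column in sorted(self._cells):
            if start_row <= row < end_row:
                yield row, column, self.get(row, column)

table = TinyTable()
table.put("com.example.www", "contents:", 1, "<html>v1</html>")
table.put("com.example.www", "contents:", 3, "<html>v3</html>")
print(table.get("com.example.www", "contents:"))               # newest version
print(table.get("com.example.www", "contents:", timestamp=2))  # version as of t=2
```

A contiguous range of row keys, as returned by scan, is the unit a tablet would hold; splitting a tablet amounts to cutting that sorted range in two.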

Hardware

Use ultra-cheap commodity hardware and build software on top to handle its failure.
Linux, in-house rack design, PC class mother boards, low end storage.

Lessons Learned

Infrastructure can be a competitive advantage
Spanning multiple data centers is still an unsolved problem.
Take a look at Hadoop
Build self-managing systems that work without having to take the system down.
Create a Darwinian infrastructure. Perform time-consuming operations in parallel and take the winner.
Consider compression. Compression is a good option when you have a lot of CPU to throw around and limited IO.
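The compression trade-off can be tried out with Python's standard zlib module. The sketch below packs a list of key/value pairs into bytes, compresses them, and restores them; the tab/newline record format and the function names are assumptions made for illustration.

```python
import zlib

def pack(pairs):
    """Serialize (key, value) string pairs and return (raw_bytes, compressed_bytes)."""
    raw = "\n".join(f"{k}\t{v}" for k, v in pairs).encode("utf-8")
    return raw, zlib.compress(raw)

def unpack(blob):
    """Inverse of pack: decompress and split back into (key, value) pairs."""
    text = zlib.decompress(blob).decode("utf-8")
    return [tuple(line.split("\t")) for line in text.split("\n")] if text else []

pairs = [("the", "1")] * 1000          # repetitive data, typical of word counts
raw, blob = pack(pairs)
print(len(raw), len(blob))             # compressed size is far smaller
print(unpack(blob) == pairs)
```

The CPU time spent in compress and decompress is the price paid for moving fewer bytes over the network and disk, which is worthwhile when I/O is the bottleneck.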

Google+

Stack

- Java servlets

- Javascript

- closure framework ( closure's JavaScript compiler and template system )

- HTML5 History API

- BigTable

- Colossus/GFS

- MapReduce

Closure is a suite of JavaScript tools consisting of a library, a compiler and templates.
The library is a modular, cross-browser JavaScript library.
The compiler is a true compiler for JavaScript that makes JavaScript download and run faster.
Templates is a server-side templating system that helps you dynamically build reusable HTML and UI elements.

HBase = BigTable
Hadoop
MapReduce
Colossus is Google's next-generation file system, a replacement for GFS (GFS = HDFS)
OpenStack – cloud like infrastructure glue
Google uses a custom Java Servlet container
MessagePack, JSON, Hadoop, jQuery, MongoDB
jQuery vs Closure

Hadoop Projects

1. Hadoop Common

2. Hadoop Distributed File System (HDFS)

3. Hadoop MapReduce

1. Avro - A data serialization system.

2. Cassandra - A scalable multi-master database with no single points of failure.

3. Chukwa – A data collection system for managing large distributed systems.

4. HBase - A scalable, distributed database that supports structured data storage for large tables.

5. Hive - A data warehouse infrastructure that provides data summarization and ad hoc querying.

6. Mahout - A scalable machine learning and data mining library.

7. Pig - A high-level data-flow language and execution framework for parallel computation.

8. ZooKeeper - A high-performance coordination service for distributed applications.

10 NoSQL Systems

considerations

the ability to add new machines to a live cluster transparently to your applications
support for multiple datacenters
data model
Query API
Persistence design
Scalability

Databases

Cassandra (j)
CouchDB (Erlang)
HBase (j)
MongoDB (C++)
Neo4j (j)
Redis
Riak
Scalaris
Tokyo Cabinet
Voldemort (j)