Saturday, July 28, 2012

Mahout Clustering 


./bin/mahout seqdirectory -i /home/venkat/Downloads/cj4test/newscluster/reuters-out/ -o /home/venkat/Downloads/cj4test/newscluster/reuters-out-seqdir -c UTF-8 -chunk 5
./bin/mahout seq2sparse -i /home/venkat/Downloads/cj4test/newscluster/reuters-out-seqdir/ -o /home/venkat/Downloads/cj4test/newscluster/reuters-out-seqdir-sparse -ng 2 -nv

./bin/mahout canopy -i /home/venkat/Downloads/cj4test/newscluster/reuters-out-seqdir-sparse/tfidf-vectors/  -o /home/venkat/Downloads/cj4test/newscluster/canopy-output  -t1 0.8 -t2 0.7   -ow  -xm sequential -dm org.apache.mahout.common.distance.CosineDistanceMeasure
./bin/mahout canopy -i /home/venkat/Downloads/cj4test/newscluster/reuters-out-seqdir-sparse/tfidf-vectors/  -o /home/venkat/Downloads/cj4test/newscluster/canopy-output  -t1 3.0 -t2 2.8   -ow  -xm sequential -dm org.apache.mahout.common.distance.CosineDistanceMeasure
./bin/mahout canopy -i /home/venkat/Downloads/cj4test/newscluster/reuters-out-seqdir-sparse/tfidf-vectors/  -o /home/venkat/Downloads/cj4test/newscluster/canopy-output  -t1 0.5 -t2 1.0   -ow  -xm sequential -dm org.apache.mahout.common.distance.CosineDistanceMeasure
./bin/mahout canopy -i /home/venkat/Downloads/cj4test/newscluster/reuters-out-seqdir-sparse/tfidf-vectors/  -o /home/venkat/Downloads/cj4test/newscluster/canopy-output  -t1 2.0 -t2 1.0   -ow  -xm sequential -dm org.apache.mahout.common.distance.CosineDistanceMeasure
./bin/mahout canopy -i /home/venkat/Downloads/cj4test/newscluster/reuters-out-seqdir-sparse/tfidf-vectors/  -o /home/venkat/Downloads/cj4test/newscluster/canopy-output  -t1 0.2 -t2 0.1   -ow  -xm sequential -dm org.apache.mahout.common.distance.CosineDistanceMeasure
./bin/mahout canopy -i /home/venkat/Downloads/cj4test/newscluster/reuters-out-seqdir-sparse/tfidf-vectors/  -o /home/venkat/Downloads/cj4test/newscluster/canopy-output  -t1 0.1 -t2 0.1   -ow  -xm sequential -dm org.apache.mahout.common.distance.CosineDistanceMeasure
./bin/mahout canopy -i /home/venkat/Downloads/cj4test/newscluster/reuters-out-seqdir-sparse/tfidf-vectors/  -o /home/venkat/Downloads/cj4test/newscluster/canopy-output  -t1 0.8 -t2 0.8   -ow  -xm sequential -dm org.apache.mahout.common.distance.CosineDistanceMeasure
./bin/mahout canopy -i /home/venkat/Downloads/cj4test/newscluster/reuters-out-seqdir-sparse/tfidf-vectors/  -o /home/venkat/Downloads/cj4test/newscluster/canopy-output  -t1 3.0 -t2 2.8   -ow  -xm sequential -dm org.apache.mahout.common.distance.EuclideanDistanceMeasure
./bin/mahout canopy -i /home/venkat/Downloads/cj4test/newscluster/reuters-out-seqdir-sparse/tfidf-vectors/  -o /home/venkat/Downloads/cj4test/newscluster/canopy-output  -t1 3.0 -t2 2.8   -ow  -xm sequential -dm org.apache.mahout.common.distance.ManhattanDistanceMeasure
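
The -t1 and -t2 flags in the runs above are the loose and tight canopy thresholds, and canopy clustering is normally defined with T1 > T2. Below is a minimal Python sketch of the idea, assuming plain lists as vectors. It is an illustration only, not Mahout's implementation; the function names and the ValueError check are choices made for this sketch.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity; stays in [0, 1] for non-negative vectors such as TF-IDF."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def canopy(points, t1, t2, distance=cosine_distance):
    """Return a list of (center, members) canopies. This sketch insists on t1 > t2."""
    if t1 <= t2:
        raise ValueError("t1 must be greater than t2")
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining.pop(0)          # next unassigned point becomes a center
        members = [center]
        still_remaining = []
        for p in remaining:
            d = distance(center, p)
            if d < t1:                     # loosely close: joins this canopy
                members.append(p)
            if d >= t2:                    # not tightly close: may still seed or join others
                still_remaining.append(p)
        remaining = still_remaining
        canopies.append((center, members))
    return canopies

docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
result = canopy(docs, t1=0.8, t2=0.7)
print(len(result))  # 2
```

Since cosine distance never exceeds 1 for non-negative vectors, thresholds such as 3.0 and 2.8 put every point inside the first canopy when used with the cosine measure; they only make sense with unbounded measures like Euclidean or Manhattan.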



./bin/mahout kmeans -i /home/venkat/Downloads/cj4test/newscluster/reuters-out-seqdir-sparse/tfidf-vectors/ -c /home/venkat/Downloads/cj4test/newscluster/canopy-output/clusters-0 -o /home/venkat/Downloads/cj4test/newscluster/reuters-kmeans -x 10  -ow -cl -dm org.apache.mahout.common.distance.CosineDistanceMeasure
./bin/mahout kmeans -i /home/venkat/Downloads/cj4test/newscluster/reuters-out-seqdir-sparse/tfidf-vectors/ -c /home/venkat/Downloads/cj4test/newscluster/canopy-output -o /home/venkat/Downloads/cj4test/newscluster/reuters-kmeans -x 10 -k 2 -ow -cl -dm org.apache.mahout.common.distance.CosineDistanceMeasure
./bin/mahout kmeans -i /home/venkat/Downloads/cj4test/newscluster/reuters-out-seqdir-sparse/tfidf-vectors/ -c /home/venkat/Downloads/cj4test/newscluster/clusters -o /home/venkat/Downloads/cj4test/newscluster/reuters-kmeans -x 10 -k 4 -ow -cl  -dm org.apache.mahout.common.distance.CosineDistanceMeasure
./bin/mahout kmeans -i /home/venkat/Downloads/cj4test/newscluster/reuters-out-seqdir-sparse/tfidf-vectors/ -c /home/venkat/Downloads/cj4test/newscluster/canopy-output/clusters-0 -o /home/venkat/Downloads/cj4test/newscluster/reuters-kmeans -x 10  -ow -cl -dm org.apache.mahout.common.distance.EuclideanDistanceMeasure
./bin/mahout kmeans -i /home/venkat/Downloads/cj4test/newscluster/reuters-out-seqdir-sparse/tfidf-vectors/ -c /home/venkat/Downloads/cj4test/newscluster/canopy-output/clusters-0 -o /home/venkat/Downloads/cj4test/newscluster/reuters-kmeans -x 10  -ow -cl -dm org.apache.mahout.common.distance.ManhattanDistanceMeasure

Sunday, July 22, 2012



Top 6 Magazines That Every Entrepreneur Must Read

1. Fortune Magazine
2. Forbes
3. Inc. Magazine
4. Harvard Business Review
5. Entrepreneur Magazine
6. Fast Company

Great Blogs Every Entrepreneur Must Read

1. Quora
2. PandoDaily
3. LinkedIn Today
4. Both Sides of the Table
5. Steve Blank
6. Penelope Trunk's Brazen Careerist
7. Wise Bread
8. Church of the Customer
9. How to change the world
10. AllBusiness.com
11. A VC
12. Entrepreneur Daily Dose
13. Small business brief
14. Blog Maverick
15. Venture Hacks
16. The entrepreneurial mind


7 Blockbuster Movies that Inspire Every Entrepreneur

1. Guru
2. Rocket Singh
3. The Social Network
4. Wall Street
5. Forrest Gump
6. The Bridge on the River Kwai
7. Charlie Bartlett


Every Startup Team Must Have These 6 Skills

Revenue Driver and a Technical Founder
Experience matters
Lead the leaders
Don’t be afraid
Titles can fool you
Managing expectations





Friday, July 20, 2012


Command Line commands for Kmeans Clustering 

./bin/mahout seqdirectory -i ./examples/bin/work/reuters-out/ -o ./examples/bin/work/reuters-out-seqdir -c UTF-8 -chunk 5

./bin/mahout seq2sparse -i ./examples/bin/work/reuters-out-seqdir/ -o ./examples/bin/work/reuters-out-seqdir-sparse -ng 2 -nv

./bin/mahout kmeans -i ./examples/bin/work/reuters-out-seqdir-sparse/tfidf-vectors/ -c ./examples/bin/work/clusters -o ./examples/bin/work/reuters-kmeans -x 10 -k 2 -ow -cl

./bin/mahout clusterdump -d ./examples/bin/work/reuters-out-seqdir-sparse/dictionary.file-0  -dt sequencefile -s ./examples/bin/work/reuters-kmeans/clusters-3/part-r-00000 -n 20  -b 100 -p ./examples/bin/work/reuters-kmeans/clusteredPoints

./bin/mahout seqdumper -s ./examples/bin/work/reuters-kmeans/clusteredPoints/part-m-00000  | more

./bin/mahout rowid -i ./examples/bin/work/reuters-out-seqdir-sparse/tfidf-vectors/part-r-00000  -o ./examples/bin/work/reuters-matrix

./bin/mahout rowid -Dmapred.input.dir=/home/venkat/Desktop/mahout/mahout-distribution-0.5/examples/bin/work/reuters-out-seqdir-sparse/tfidf-vectors/part-r-00000 -Dmapred.output.dir=/home/venkat/Desktop/mahout/mahout-distribution-0.5/examples/bin/work/reuters-matrix

4 rows and 1073 columns

./bin/mahout seqdumper -s /home/venkat/Desktop/mahout/mahout-distribution-0.5/examples/bin/work/reuters-matrix/matrix | more

./bin/mahout rowsimilarity -i /home/venkat/Desktop/mahout/mahout-distribution-0.5/examples/bin/work/reuters-matrix/matrix -o /home/venkat/Desktop/mahout/mahout-distribution-0.5/examples/bin/work/reuters-named-similarity -r 1073 --similarityClassname SIMILARITY_COSINE -m 10

./bin/mahout rowsimilarity -i /home/venkat/Desktop/mahout/mahout-distribution-0.5/examples/bin/work/reuters-matrix/matrix  -o /home/venkat/Desktop/mahout/mahout-distribution-0.5/examples/bin/work/reuters-named-similarity -r 1073  --similarityClassname SIMILARITY_COOCCURRENCE -m 10 --tempDir /home/venkat/Desktop/tmp

SIMILARITY_COOCCURRENCE
SIMILARITY_EUCLIDEAN_DISTANCE
SIMILARITY_LOGLIKELIHOOD
SIMILARITY_PEARSON_CORRELATION
SIMILARITY_TANIMOTO_COEFFICIENT
SIMILARITY_UNCENTERED_COSINE
SIMILARITY_UNCENTERED_ZERO_ASSUMING_COSINE
SIMILARITY_CITY_BLOCK
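
For reference, here is a rough Python sketch of how three of the measures named above are commonly defined for two rows of a matrix. The function names are made up for this note, and Mahout's own classes may differ in detail (for example in how a distance is turned into a similarity score).

```python
def cooccurrence(a, b):
    """Number of columns where both rows have a non-zero entry."""
    return sum(1 for x, y in zip(a, b) if x != 0 and y != 0)

def tanimoto(a, b):
    """Tanimoto coefficient: dot(a, b) / (|a|^2 + |b|^2 - dot(a, b))."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sum(x * x for x in a) + sum(y * y for y in b) - dot)

def city_block(a, b):
    """City block (Manhattan) distance: sum of absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

row_a = [1.0, 2.0, 0.0]
row_b = [1.0, 0.0, 3.0]
print(cooccurrence(row_a, row_b))  # 1
print(tanimoto(row_a, row_b))      # 1/14, about 0.0714
print(city_block(row_a, row_b))    # 5.0
```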

./bin/mahout seqdumper -s /home/venkat/Desktop/mahout/mahout-distribution-0.5/examples/bin/work/reuters-matrix/docIndex

Canopy 

./bin/mahout canopy -i ./examples/bin/work/reuters-out-seqdir-sparse/tfidf-vectors/  -o ./examples/bin/work/canopy-output  -t1 3.0 -t2 2.8  -t3 3.0  -t4 2.8 -ow  -xm sequential

./bin/mahout kmeans -i ./examples/bin/work/reuters-out-seqdir-sparse/tfidf-vectors/ -c ./examples/bin/work/canopy-output -o ./examples/bin/work/reuters-kmeans -x 10 -k 3 -ow
./bin/mahout kmeans -i ./examples/bin/work/reuters-out-seqdir-sparse/tfidf-vectors/ -c ./examples/bin/work/canopy-output -o ./examples/bin/work/reuters-kmeans -x 10  -ow


./bin/mahout seqdumper -s ./examples/bin/work/reuters-kmeans/clusters-1/part-r-00000


# -dm takes a fully qualified distance measure class name
# -cl runs input vector clustering after computing the canopies
bin/mahout canopy \
    -i ./examples/bin/work/reuters-out-seqdir-sparse/tfidf-vectors/ \
    -o ./examples/bin/work/canopy-output \
    -dm org.apache.mahout.common.distance.ManhattanDistanceMeasure \
    -t1 3.0 \
    -t2 2.8 \
    -t3 3.0 \
    -t4 2.8 \
    -ow \
    -cl \
    -xm sequential



Usage:                                                                        
 [--minSupport <minSupport> --analyzerName <analyzerName> --chunkSize          
<chunkSize> --output <output> --input <input> --minDF <minDF> --maxDFPercent  
<maxDFPercent> --weight <weight> --norm <norm> --minLLR <minLLR> --numReducers
<numReducers> --maxNGramSize <ngramSize> --overwrite --help                    
--sequentialAccessVector --namedVector --logNormalize]                        
Options                                                                        
  --minSupport (-s) minSupport        (Optional) Minimum Support. Default      
                                      Value: 2                                
  --analyzerName (-a) analyzerName    The class name of the analyzer          
  --chunkSize (-chunk) chunkSize      The chunkSize in MegaBytes. 100-10000 MB
  --output (-o) output                The output directory                    
  --input (-i) input                  input dir containing the documents in    
                                      sequence file format                    
  --minDF (-md) minDF                 The minimum document frequency.  Default
                                      is 1                                    
  --maxDFPercent (-x) maxDFPercent    The max percentage of docs for the DF.  
                                      Can be used to remove really high        
                                      frequency terms. Expressed as an integer
                                      between 0 and 100. Default is 99.        
  --weight (-wt) weight               The kind of weight to use. Currently TF  
                                      or TFIDF                                
  --norm (-n) norm                    The norm to use, expressed as either a  
                                      float or "INF" if you want to use the    
                                      Infinite norm.  Must be greater or equal
                                      to 0.  The default is not to normalize  
  --minLLR (-ml) minLLR               (Optional)The minimum Log Likelihood    
                                      Ratio(Float)  Default is 1.0            
  --numReducers (-nr) numReducers     (Optional) Number of reduce tasks.      
                                      Default Value: 1                        
  --maxNGramSize (-ng) ngramSize      (Optional) The maximum size of ngrams to
                                      create (2 = bigrams, 3 = trigrams, etc)  
                                      Default Value:1                          
  --overwrite (-ow)                   If set, overwrite the output directory  
  --help (-h)                         Print out help                          
  --sequentialAccessVector (-seq)     (Optional) Whether output vectors should
                                      be SequentialAccessVectors. If set true  
                                      else false                              
  --namedVector (-nv)                 (Optional) Whether output vectors should
                                      be NamedVectors. If set true else false  
  --logNormalize (-lnorm)             (Optional) Whether output vectors should
                                      be logNormalize. If set true else false  



Monday, July 16, 2012



Hadoop



Hadoop Distributed File System


1 A scalable, fault-tolerant, high-performance distributed file system (storage)
2 Asynchronous replication
3 Write-once, read-many (WORM)
4 Data compression (BZIP2)
5 Hadoop cluster with 3 nodes minimum
6 Data divided into multiple 64 MB blocks
7 Each block is replicated 3 times (default)
8 No RAID required
9 Access via REST, Java, or FUSE
10 NameNode holds filesystem metadata
11 Files are broken up and spread over the DataNodes
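
A quick worked example of the block and replication numbers above, as a Python sketch. The helper name hdfs_footprint is invented for this note; the 64 MB block size and replication factor of 3 are the defaults listed above.

```python
import math

BLOCK_SIZE_MB = 64   # default block size from the notes above
REPLICATION = 3      # default replication factor from the notes above

def hdfs_footprint(file_size_mb, block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Return (blocks, block replicas, raw MB stored) for one file."""
    blocks = math.ceil(file_size_mb / block_size_mb)
    # The last block only occupies as much space as the data it holds,
    # so raw storage is file size times replication, not blocks * block size.
    raw_mb = file_size_mb * replication
    return blocks, blocks * replication, raw_mb

blocks, replicas, raw_mb = hdfs_footprint(1000)
print(blocks, replicas, raw_mb)  # 16 48 3000
```

So a 1000 MB file becomes 16 blocks (15 full ones plus a partial one), 48 block replicas spread over the DataNodes, and about 3000 MB of raw disk.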



Map Reduce


1 Software framework for distributed computation
2 Input | Map() | copy/sort | reduce() | output
3 JobTracker schedules and manages jobs on the NameNode
4 TaskTracker executes individual map() and reduce() tasks on each DataNode


5 Map Phase – Raw data is analyzed and converted to key/value pairs
6 Shuffle Phase – All key/value pairs are sorted and grouped by their keys
7 Reduce Phase – All values associated with a key are processed for results


8 NameNode & JobTracker on the same server
9 DataNode & TaskTracker on the same server
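
The three phases listed above can be imitated in a single process. This is only a toy Python sketch of the data flow (word count), not Hadoop's API; the function names are made up for this note.

```python
from collections import defaultdict

def map_phase(lines):
    """Map: turn raw lines into (key, value) pairs, here (word, 1)."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1

def shuffle_phase(pairs):
    """Shuffle: sort and group all values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))

def reduce_phase(grouped):
    """Reduce: process all values of each key into one result."""
    return {key: sum(values) for key, values in grouped.items()}

lines = ["hadoop stores data", "hadoop processes data"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts)  # {'data': 2, 'hadoop': 2, 'processes': 1, 'stores': 1}
```

In a real cluster the map and reduce functions run as separate tasks on the TaskTrackers, and the shuffle moves data between machines.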


Best Virtual Private Servers (VPS) Hosting 


Host             Plan                Rate              RAM      Storage  Bandwidth      OS      CPU                 IP  Swap     Comments
GoDaddy          Economy             Rs.1,620.00       1 GB     40 GB    1000 GB/month  CentOS  ?                   -   -        -
                 Value               Rs.2,160.00       2 GB     60 GB    2000 GB/month  CentOS  ?                   -   -        -
Hostgator        Level 1             $20               384 MB   10 GB    250 GB/month   CentOS  0.57 GHz            2   -        claim to host around 250,000 domains
                 Level 2             $30               576 MB   22 GB    375 GB/month   CentOS  0.85 GHz            2   -        -
                 Level 3             $40               768 MB   30 GB    500 GB         CentOS  1.13 GHz            2   -        -
                 Level 4             $70               1344 MB  59 GB    1050 GB        CentOS  1.98 GHz            2   -        -
                 Level 5             $95               1824 MB  80 GB    1425 GB        CentOS  2.69 GHz            2   -        -
                 Level 6             $120              2304 MB  102 GB   1800 GB        CentOS  3.4 GHz             2   -        -
                 Level 7             $150              3168 MB  165 GB   2250 GB        CentOS  4.25 GHz            2   -        -
                 Level 8             $180              3801 MB  198 GB   2700 GB        CentOS  5.09 GHz            2   -        -
                 Level 9             $210              4435 MB  231 GB   3150 GB        CentOS  5.94 GHz            2   -        -
Inmotionhosting  VPS-1000            $40               512 MB   40 GB    750 GB         CentOS  ?                   2   -        Best VPS
                 VPS-2000            $65               1 GB     80 GB    1500 GB        CentOS  ?                   5   -        -
                 VPS-3000            $130              2 GB     160 GB   2500 GB        CentOS  ?                   10  -        -
Liquid Web       Smart VPS 1 GB      $50               880 MB   75 GB    2 TB           Linux   1 CPU               -   -        Reliable VPS
                 Smart VPS 2 GB      $90               1750 MB  150 GB   2 TB           Linux   1 CPU               -   -        -
                 Smart VPS 4 GB      $150              3500 MB  300 GB   2 TB           Linux   2 CPUs              -   -        -
                 Smart VPS 8 GB      $270              6900 MB  600 GB   2 TB           Linux   4 CPUs              -   -        -
MDDHosting       Basic               $99.50 + $25 sup  1 GB     500 GB   ?              CentOS  2 Cores @ 2.4+ GHz  2   -        -
                 Intermediate        $124.50 + $25     1.5 GB   1000 GB  ?              CentOS  2 Cores @ 2.4+ GHz  2   -        -
                 Advanced            $149.50 + $25     2 GB     1500 GB  ?              CentOS  2 Cores @ 2.4+ GHz  4   -        -
arvixe           VPS Class           $40               1 GB     40 GB    UL             Linux   2 CPU cores         2   -        -
                 VPS Class Pro       $70               2 GB     80 GB    UL             Linux   4 CPU cores         5   -        -
downtownhost     Copper              $20               384 MB   10 GB    200 GB         Linux   ?                   2   768 MB   -
                 Bronze              $30               512 MB   20 GB    300 GB         Linux   ?                   2   1024 MB  -
                 Silver              $50               1.25 GB  30 GB    400 GB         Linux   ?                   2   2.5 GB   -
                 Gold                $60               1.5 GB   40 GB    500 GB         Linux   ?                   2   3 GB     -
                 Platinum            $70               2 GB     50 GB    700 GB         Linux   ?                   2   4 GB     -
1 and 1          Virtual Server L    $29               512 MB   20 GB    1000 GB/month  Linux   ?                   -   -        -
                 Virtual Server XL   $49               1 GB     40 GB    2000 GB/month  Linux   ?                   -   -        -
                 Virtual Server XXL  $69               2 GB     80 GB    2500 GB        Linux   ?                   -   -        -
lunarpages       Linux VPS           $45               512 MB   30 GB    1000 GB/month  Linux   -                   1   -        -
















List of Women's Colleges in Kerala 




Universities : 13

Medical Colleges : 19

Teacher Training Colleges : 66

Engineering Colleges : 74

Arts and Science Colleges : 195

Polytechnic : 54

Others : 323







Engineering Colleges



1 LBS Institute of Technology for Women

Poojappura, Tiruvananthapuram. Pin 695012,

Phone : 471




Computer Science & Engineering, Electronics & Communication Engg., Applied Electronics & Instrumentation , Information Technology



2 Younus College of Engineering for Women

pallimukku

Vadakkevila P.O

Kollam




Arts and Science Colleges



1 Anvarul Islam Arabic College for Women

Anvarul Islam Arabic College for Women, Monga

BA, BSc.



2 B K College for Women

B K College for Women, Amalagiri, Athirampuzh

BA,BSC,MA,MCom.



3 College for Women

College for Women, Thiruvananthapuram

BSc,BA,BCom,MA,MSc.



4 Krishna Menon Memorial Govt. Women's College

Krishna Menon Memorial Govt. Women's College, C

BA.



5 Mar Thoma College for Women

Perumbavoor - 683 542.

Pre Degree.



6 Providence Womens College

Providence Womens College. Calicut

BA,BCom,MCom



7 Sacred Heart College for Women

Sacred Heart College for Women, Chalakudy

BA.



8 Sree Narayanan College for Women

Sree Narayanan College for Women, Kollam

BA,BSc,BCom,MSc.



9 St Joseph College for Women

St Joseph College for Women, Alapuzha

BSc,BCom,BA.



10 St Xavier's College for Women

Alwaye, Ernakulam

Pre Degree







MCA Courses



1 College of Engineering

Thiruvananthapuram Kerala

MBA, MCA

Management University of Kerala



2 Dayapuram Institute of Management Arts & Technology

Dayapuram Institute of Management Arts & Technology Dayapuram-District:Rec Calicut Kerala

MCA



3 ER & DCI Institute of Technology

ER & DCI-Campus Vellayambalam Thiruvananthapuram Kerala

MCA



4 M.E.S. College

Marampally Aluva Kerala

MCA



5 Mar Athanasios College for Advanced Studies

Tiruvalla

MCA, MBA



6 Marian College

P.O. Kuttikkanam-Peermade Kerala

MCA



7 Rajiv Gandhi Institute of Technology

Kottayam-Velloor P.O. Kottym

MCA



8 Regional Centre School of Technology & Applied

Near Government High School Edappally Kochi Kerala

MCA



9 Union Christian College U.C. College

P.O.:- Aluva Kerala

MCA







Engineering Colleges




Younus College of Engineering for Women, Kollam

L.B.S. Institute of Technology for Women, Thiruvananthapuram

Indira Gandhi Institute of Engineering & Technology for Women, Ernakulam

K.M.C.T College of Engineering for Women, Calicut

K.R. Gouri Amma College of Engineering for Women, Alappuzha

Mount Zion College of Engineering for Women, Chengannur

Prime College of Engineering for Women, Palakkad

Sree Buddha College of Engineering for Women, Pathanamthitta

St. Thomas Institute for Science and Technology, Thiruvananthapuram




Arts and Science Colleges




Govt. College for Women, Trivandrum

Krishna Menon Memorial women's College, Kannur

Ansar Women's College, Thrissur

Chinmaya Arts and Science for Women, Kannur

Dayapuram Arts and Science College for Women, Calicut

Marthoma College for Women, Cochin

N.S.S. College for Women, Peringamala

Providence Women's College, Calicut

S.N. College for Women, Kottiyam

Sacred Heart College for Women, Chalakkudy

St. Joseph College for Women, Cherthala

St. Xaviers College for Women, Aluva

Unity Women's College, Manjeri



Google Architecture

How Google Serves Data from Multiple Datacenters

  • Google App Engine uses master/slave replication between datacenters:
    - lowish-latency writes
    - datacenter failure survival
    - strong consistency guarantees




    Google Architecture (2008)
    • Sorting 1 PB with MapReduce took 6:02 hrs on 4000 computers.
    • Results were replicated thrice on 48,000 disks.
    • 100k MapReduce jobs are executed each day.
    • > 20 petabytes of data are processed per day.
    • > 10k MapReduce programs have been implemented.
    • Machines are dual processor with gigabit ethernet and 4-8 GB of memory.
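
To put the sort numbers above in perspective, here is a small back-of-the-envelope calculation in Python. It assumes 1 PB = 10^15 bytes and reads 6:02 as 6 hours 2 minutes; it looks only at the 1 PB of sorted data and ignores the extra writes for the threefold replication.

```python
PETABYTE = 10**15                 # assumption: decimal petabyte
elapsed_s = 6 * 3600 + 2 * 60     # 6:02 hrs = 21,720 seconds
machines = 4000
disks = 48_000

aggregate_bps = PETABYTE / elapsed_s         # bytes sorted per second, whole cluster
per_machine_bps = aggregate_bps / machines   # bytes sorted per second, one machine
disks_per_machine = disks // machines

print(round(aggregate_bps / 1e9, 1))    # 46.0 (GB/s across the cluster)
print(round(per_machine_bps / 1e6, 1))  # 11.5 (MB/s per machine)
print(disks_per_machine)                # 12
```

So the headline figure works out to roughly 46 GB/s for the cluster but only about 11.5 MB/s of sorted output per machine, each of which has 12 disks.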
Stats
  • About 4.5 lakh (450,000) low-cost commodity servers in 2006.
  • indexed 8 billion web pages in 2005.
  • 200 GFS clusters, a cluster can have 1000 or even 5000 machines
  • Aggregate read/write throughput can be as high as 40 gigabytes/second across the cluster.
  • 6000 MapReduce applications
Stack
  • Distributed Systems Infrastructure: GFS, MapReduce, and BigTable.
  • Spend more money on hardware to not lose log data, but spend less on other types of data.
Google File System
  • core storage platform
  • large distributed log structured file system
  • high reliability across data centers
  • scalability to thousands of network nodes
  • huge read/write bandwidth requirements
  • support for large blocks of data which are gigabytes in size.
  • efficient distribution of operations across nodes to reduce bottlenecks
  • System has master and chunk servers.
  • Master servers keep metadata on the various data files
  • Data are stored in the file system in 64MB chunks
  • Each chunk is replicated across 3 different chunk servers.
  • Key is enough infrastructure to make sure people have choices for their application
MapReduce
  • GFS -> Map -> Shuffle -> Reduction -> Store Results back into GFS.
  • Nice way to partition tasks across lots of machines.
  • Handle machine failure.
  • Works across different application types, like search and ads.
  • MapReduce system has 3 different types of servers.
  • Master Server, Map Server, Reduce Server
  • The Master server assigns user tasks to map and reduce servers. It also tracks the state of the tasks.
  • The Map servers accept user input and perform map operations on it. The results are written to intermediate files.
  • The Reduce servers accept intermediate files produced by map servers and perform reduce operations on them.
  • Example: to count the number of words in all web pages, you would feed all the pages stored on GFS into MapReduce. This would all happen on 1000s of machines simultaneously, and all the coordination, job scheduling, failure handling, and data transport would be done automatically.
  • The Google indexing pipeline has about 20 different map reductions
  • One problem is stragglers. A straggler is a computation that runs slower than the others and holds everyone up. Stragglers may happen because of slow IO (say a bad controller) or from a temporary CPU spike. The solution is to run multiple copies of the same computation and, when one finishes, kill all the rest.
  • Data transferred between map and reduce servers is compressed. The idea is that because servers aren't CPU bound it makes sense to spend on data compression and decompression in order to save on bandwidth and I/O.
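
The straggler fix described above (run the same computation more than once and take whichever copy finishes first) can be imitated with threads. This is a toy Python sketch: the worker names and delays are invented, and Python threads cannot be killed, so the slow copy is simply ignored here, whereas a real scheduler would kill the remaining tasks.

```python
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

def run_task(worker, delay):
    """Pretend to do the same computation on a given worker; delay models its speed."""
    time.sleep(delay)
    return worker

def first_finisher(delays):
    """Start one copy of the task per worker and return the name of the first to finish."""
    with ThreadPoolExecutor(max_workers=len(delays)) as pool:
        futures = [pool.submit(run_task, name, d) for name, d in delays.items()]
        done, pending = wait(futures, return_when=FIRST_COMPLETED)
        for future in pending:
            future.cancel()  # only stops copies that have not started yet
        return next(iter(done)).result()

winner = first_finisher({"slow-disk": 0.3, "healthy": 0.01})
print(winner)  # healthy
```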

Big Table
  • BigTable is a large scale, fault tolerant, self managing system that includes terabytes of memory and petabytes of storage. It can handle millions of reads/writes per second.
  • BigTable is a distributed hash mechanism built on top of GFS. It is not a relational database. It doesn't support joins or SQL type queries.
  • It provides a lookup mechanism to access structured data by key. GFS stores opaque data, and many applications need data with structure.
  • Machines can be added and deleted while the system is running and the whole system just works.
  • Each data item is stored in a cell which can be accessed using a row key, column key, or timestamp.
  • Each row is stored in one or more tablets. A tablet is a sequence of 64KB blocks in a data format called SSTable.
  • BigTable has three different types of servers: Master Server, Tablet Server, Lock Server
  • The Master servers assign tablets to tablet servers. They track where tablets are located and redistribute tasks as needed.
  • The Tablet servers process read/write requests for tablets. They split tablets when they exceed size limits (usually 100MB - 200MB). When a tablet server fails, 100 tablet servers each pick up 1 new tablet and the system recovers.
  • The Lock servers form a distributed lock service. Operations like opening a tablet for writing, Master arbitration, and access control checking require mutual exclusion.
  • A locality group can be used to physically store related bits of data together for better locality of reference.
  • Tablets are cached in RAM as much as possible.
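
As a mental model for the cell addressing described above (row key, column key, timestamp), here is a tiny in-memory Python sketch. The class name TinyTable and the sample keys are invented for this note; it shows only the lookup idea, not tablets, SSTables or distribution.

```python
class TinyTable:
    """Toy model: each cell is addressed by (row key, column key) and holds timestamped versions."""

    def __init__(self):
        self._cells = {}  # (row_key, column_key) -> {timestamp: value}

    def put(self, row, column, timestamp, value):
        self._cells.setdefault((row, column), {})[timestamp] = value

    def get(self, row, column, timestamp=None):
        """Return the newest value, or the newest one at or before the given timestamp."""
        versions = self._cells.get((row, column), {})
        candidates = [t for t in versions if timestamp is None or t <= timestamp]
        if not candidates:
            return None
        return versions[max(candidates)]

table = TinyTable()
table.put("com.example/index", "contents:html", 1, "<html>v1</html>")
table.put("com.example/index", "contents:html", 5, "<html>v2</html>")

print(table.get("com.example/index", "contents:html"))               # <html>v2</html>
print(table.get("com.example/index", "contents:html", timestamp=3))  # <html>v1</html>
```

There are no joins and no SQL here either: every read is a lookup by key, which matches the "not a relational database" point above.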
Hardware
  • Use ultra-cheap commodity hardware and build software on top to handle its failure.
  • Linux, in-house rack design, PC-class motherboards, low-end storage.
Lessons Learned
  • Infrastructure can be a competitive advantage
  • Spanning multiple data centers is still an unsolved problem.
  • Take a look at Hadoop
  • Build self-managing systems that work without having to take the system down.
  • Create a Darwinian infrastructure. Perform time-consuming operations in parallel and take the winner.
  • Consider compression. Compression is a good option when you have a lot of CPU to throw around and limited IO.
Google+
    Stack
    - Java servlets
    - Javascript
    - closure framework ( closure's JavaScript compiler and template system )
    - HTML5 History API
    - BigTable
    - Colossus/GFS
    - MapReduce
    • Closure is a suite of JavaScript tools consisting of a library, a compiler, and templates.
    • The library is a modular, cross-browser JavaScript library.
    • The compiler is a true compiler for JavaScript that makes JavaScript download and run faster.
    • Templates is a server-side templating system that helps you dynamically build reusable HTML and UI elements.
    • HBase is the Hadoop counterpart of BigTable
    • Hadoop
    • MapReduce
    • Colossus is Google's next-generation file system, a replacement for GFS (whose Hadoop counterpart is HDFS)
    • OpenStack – cloud-like infrastructure glue
    • Google uses a custom Java servlet container
    • MessagePack, JSON, Hadoop, jQuery, MongoDB
    • jQuery vs Closure
Hadoop Projects
    1. Hadoop Common
    2. Hadoop Distributed File System (HDFS)
    3. Hadoop MapReduce

    1. Avro - A data serialization system.
    2. Cassandra - A scalable multi-master database with no single points of failure.
    3. Chukwa – A data collection system for managing large distributed systems.
    4. HBase - A scalable, distributed database that supports structured data storage for large tables.
    5. Hive - A data warehouse infrastructure that provides data summarization and ad hoc querying.
    6. Mahout - A scalable machine learning and data mining library.
    7. Pig - A high-level data-flow language and execution framework for parallel computation.
    8. ZooKeeper - A high-performance coordination service for distributed applications.
10 NoSQL Systems

Considerations
    • The ability to add new machines to a live cluster transparently to your applications
    • Support for multiple datacenters
    • Data model
    • Query API
    • Persistence design
    • Scalability
Databases
      1. Cassandra (Java)
      2. CouchDB (Erlang)
      3. HBase (Java)
      4. MongoDB (C++)
      5. Neo4j (Java)
      6. Redis
      7. Riak
      8. Scalaris
      9. Tokyo Cabinet
      10. Voldemort (j)