Google Architecture
How Google Serves Data from Multiple Datacenters
- Google App Engine uses master/slave replication between datacenters
- lowish-latency writes
- datacenter failure survival
- strong consistency guarantees
Google Architecture (2008)
- Sorting 1 PB with MapReduce took 6 hours 2 minutes on 4,000 computers.
- Results were replicated three times across 48,000 disks.
- 100,000 MapReduce jobs are executed each day.
- More than 20 petabytes of data are processed per day.
- More than 10,000 distinct MapReduce programs have been implemented.
- Machines are dual-processor with gigabit Ethernet and 4-8 GB of memory.
Stats
- 450,000 low-cost commodity servers in 2006.
- Indexed 8 billion web pages in 2005.
- 200 GFS clusters; a single cluster can have 1,000 or even 5,000 machines
- Aggregate read/write throughput can be as high as 40 gigabytes/second across the cluster.
- 6000 MapReduce applications
Stack
- Distributed Systems Infrastructure: GFS, MapReduce, and BigTable.
- Spend more money on hardware to not lose log data, but spend less on other types of data.
Google File System
- core storage platform
- large distributed log structured file system
- high reliability across data centers
- scalability to thousands of network nodes
- huge read/write bandwidth requirements
- support for large blocks of data which are gigabytes in size.
- efficient distribution of operations across nodes to reduce bottlenecks
- System has master and chunk servers.
- Master servers keep metadata on the various data files.
- Data is stored in the file system in 64 MB chunks.
- Each chunk is replicated across 3 different chunk servers.
- The key is having enough infrastructure so that people have choices for their applications.
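The master/chunk-server split described above can be sketched as a toy model. The class and method names, the round-robin placement, and the five-server cluster here are illustrative assumptions, not GFS's actual API; the real master also handles leases, rack-aware placement, and re-replication.

```python
# Toy sketch of the GFS master's metadata role: map (file, chunk index)
# to the chunk servers holding each 64 MB replica. Clients ask the
# master where a chunk lives, then read/write the chunk servers directly.
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB chunks
REPLICAS = 3                   # each chunk lives on 3 chunk servers

class Master:
    def __init__(self, chunk_servers):
        self.chunk_servers = chunk_servers
        self.metadata = {}     # (filename, chunk_index) -> list of servers

    def locate(self, filename, offset):
        """Return the chunk index for a byte offset and its replica servers."""
        index = offset // CHUNK_SIZE
        key = (filename, index)
        if key not in self.metadata:
            # Assign 3 distinct servers; real placement considers load and racks.
            start = hash(key) % len(self.chunk_servers)
            self.metadata[key] = [
                self.chunk_servers[(start + i) % len(self.chunk_servers)]
                for i in range(REPLICAS)
            ]
        return index, self.metadata[key]

master = Master(["cs%d" % i for i in range(5)])
# A read 200 MB into the file falls in chunk 3 (0-indexed, 64 MB chunks).
index, replicas = master.locate("/logs/web.log", 200 * 1024 * 1024)
```

The point of the split is that the master only moves small metadata; the gigabytes of chunk data never pass through it, which is how one master can front thousands of nodes.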
MapReduce
- GFS -> Map -> Shuffle -> Reduction -> Store Results back into GFS.
- Nice way to partition tasks across lots of machines.
- Handle machine failure.
- Works across different application types, like search and ads.
- The MapReduce system has 3 different types of servers: Master servers, Map servers, and Reduce servers.
- The Master server assigns user tasks to map and reduce servers. It also tracks the state of the tasks.
- The Map servers accept user input and perform map operations on it. The results are written to intermediate files.
- The Reduce servers accept intermediate files produced by the map servers and perform reduce operations on them.
- Say you want to count the number of words in all web pages. You would feed all the pages stored in GFS into MapReduce. This would happen on thousands of machines simultaneously, and all the coordination, job scheduling, failure handling, and data transport would be done automatically.
- The Google indexing pipeline has about 20 different MapReduce steps.
- One problem is stragglers. A straggler is a computation that is going slower than the others and holds up everyone. Stragglers may happen because of slow IO (say, a bad controller) or from a temporary CPU spike. The solution is to run multiple copies of the same computation and, when one is done, kill all the rest.
- Data transferred between map and reduce servers is compressed. The idea is that because servers aren't CPU bound it makes sense to spend on data compression and decompression in order to save on bandwidth and I/O.
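The word-count example above can be sketched in a single process. This is only the programming model, not the distributed system: in the real thing each phase runs on thousands of machines and the intermediate pairs flow through local disk and the network, not an in-memory dict.

```python
# Toy word count in the MapReduce style: map emits (word, 1) pairs,
# shuffle groups the pairs by key, reduce sums each group.
from collections import defaultdict

def map_phase(doc_id, text):
    # Emit one (word, 1) pair per word occurrence.
    for word in text.split():
        yield word.lower(), 1

def shuffle(pairs):
    # Group all values emitted for the same key.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(word, counts):
    # Collapse each group to a single (word, total) result.
    return word, sum(counts)

pages = {"p1": "the web the index", "p2": "the crawl"}
pairs = [kv for doc, text in pages.items() for kv in map_phase(doc, text)]
counts = dict(reduce_phase(w, c) for w, c in shuffle(pairs).items())
# counts == {"the": 3, "web": 1, "index": 1, "crawl": 1}
```

Because map and reduce are pure functions over key/value pairs, the framework is free to rerun any piece on another machine after a failure, or run duplicates to beat stragglers.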
BigTable
- BigTable is a large scale, fault tolerant, self managing system that includes terabytes of memory and petabytes of storage. It can handle millions of reads/writes per second.
- BigTable is a distributed hash mechanism built on top of GFS. It is not a relational database. It doesn't support joins or SQL type queries.
- It provides a lookup mechanism to access structured data by key. GFS stores opaque data, and many applications need data with structure.
- Machines can be added and deleted while the system is running and the whole system just works.
- Each data item is stored in a cell which can be accessed using a row key, column key, and timestamp.
- Each row is stored in one or more tablets. A tablet is a sequence of 64KB blocks in a data format called SSTable.
- BigTable has three different types of servers: Master Server, Tablet Server, Lock Server
- The Master servers assign tablets to tablet servers. They track where tablets are located and redistribute tasks as needed.
- The Tablet servers process read/write requests for tablets. They split tablets when they exceed size limits (usually 100 MB - 200 MB). When a tablet server fails, 100 tablet servers each pick up one new tablet and the system recovers.
- The Lock servers form a distributed lock service. Operations like opening a tablet for writing, master arbitration, and access control checking require mutual exclusion.
- A locality group can be used to physically store related bits of data together for better locality of reference.
- Tablets are cached in RAM as much as possible.
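The (row key, column key, timestamp) data model above can be sketched as a small in-memory map. The class and method names are assumptions for illustration, not BigTable's API, and this ignores tablets, SSTables, and persistence entirely; it only shows how versioned cell lookups behave.

```python
# Toy model of BigTable cells: each (row, column) holds a list of
# timestamped versions, and a read without a timestamp returns the
# most recent one.
import bisect

class Table:
    def __init__(self):
        self.cells = {}  # (row, column) -> sorted list of (timestamp, value)

    def write(self, row, column, timestamp, value):
        versions = self.cells.setdefault((row, column), [])
        bisect.insort(versions, (timestamp, value))

    def read(self, row, column, timestamp=None):
        versions = self.cells.get((row, column), [])
        if timestamp is None:
            return versions[-1][1] if versions else None
        # Latest version at or before the requested timestamp.
        i = bisect.bisect_right(versions, (timestamp, chr(0x10FFFF)))
        return versions[i - 1][1] if i else None

t = Table()
t.write("com.example/index", "contents:", 100, "v1")
t.write("com.example/index", "contents:", 200, "v2")
# A plain read sees "v2"; a read as of timestamp 150 sees "v1".
```

Keeping multiple timestamped versions per cell is what lets readers ask "what did this page look like at time T" without any relational machinery.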
Hardware
- Use ultra-cheap commodity hardware and build software on top to handle machine death.
- Linux, in-house rack design, PC-class motherboards, low-end storage.
Lessons Learned
- Infrastructure can be a competitive advantage
- Spanning multiple data centers is still an unsolved problem.
- Take a look at Hadoop
- Build self-managing systems that work without having to take the system down.
- Create a Darwinian infrastructure. Perform time-consuming operations in parallel and take the winner.
- Consider compression. Compression is a good option when you have a lot of CPU to throw around and limited IO.
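The compression lesson above can be sketched with Python's standard `zlib`: spend CPU to shrink the bytes that cross the wire. The payload and the resulting ratio here are illustrative, not Google's numbers; repetitive log-like data happens to compress extremely well.

```python
# Trading CPU for bandwidth: compress before transfer, decompress after.
import zlib

# Repetitive log-like data, the kind of payload that compresses well.
payload = b"GET /index.html 200 1234\n" * 10_000

compressed = zlib.compress(payload, level=6)   # CPU cost paid here...
ratio = len(compressed) / len(payload)         # ...bandwidth saved here
restored = zlib.decompress(compressed)
assert restored == payload                     # lossless round trip
```

The tradeoff only wins when the machines are IO-bound rather than CPU-bound, which is exactly the situation the note describes for the map/reduce data transfer.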
- Google+ Stack: Java servlets, JavaScript, Closure framework (Closure's JavaScript compiler and template system), HTML5 History API, BigTable, Colossus/GFS, MapReduce
- Closure is a suite of JavaScript tools consisting of a library, a compiler, and templates
- The library is a modular, cross-browser JavaScript library
- The compiler is a true compiler for JavaScript that makes JavaScript download and run faster
- Templates is a server-side templating system that helps you dynamically build reusable HTML and UI elements
- HBase = BigTable
- Hadoop
- MapReduce
- Colossus is Google's next-generation file system, a replacement for GFS (HDFS is the open-source equivalent)
- OpenStack – cloud like infrastructure glue
- Google uses a custom Java Servlet container
- MessagePack, JSON, Hadoop, jQuery, MongoDB
- jQuery vs. Closure
- Hadoop Projects:
- Hadoop Common
- Hadoop Distributed File System (HDFS)
- Hadoop MapReduce
- Other Hadoop-related projects:
- Avro - a data serialization system.
- Cassandra - a scalable multi-master database with no single points of failure.
- Chukwa - a data collection system for managing large distributed systems.
- HBase - a scalable, distributed database that supports structured data storage for large tables.
- Hive - a data warehouse infrastructure that provides data summarization and ad hoc querying.
- Mahout - a scalable machine learning and data mining library.
- Pig - a high-level data-flow language and execution framework for parallel computation.
- ZooKeeper - a high-performance coordination service for distributed applications.
- 10 NoSQL Systems
Considerations
- the ability to add new machines to a live cluster transparently to your applications
- support for multiple datacenters
- data model
- Query API
- Persistence design
- Scalability
Databases
- Cassandra (Java)
- CouchDB (Erlang)
- HBase (Java)
- MongoDB (C++)
- Neo4j (Java)
- Redis
- Riak
- Scalaris
- Tokyo Cabinet
- Voldemort (Java)