Google Architecture
How Google Serves Data from Multiple Datacenters
- Google App Engine uses master/slave replication between datacenters
- lowish-latency writes
- datacenter failure survival
- strong consistency guarantees
Google Architecture (2008)
- Sorting 1 PB with MapReduce took 6 hours 2 minutes on 4,000 computers.
- Results were replicated three times across 48,000 disks.
- 100,000 MapReduce jobs are executed each day.
- More than 20 petabytes of data are processed per day.
- More than 10,000 distinct MapReduce programs have been implemented.
- Machines are dual-processor with gigabit Ethernet and 4-8 GB of memory.
Stats
- 450,000 low-cost commodity servers in 2006.
- Indexed 8 billion web pages in 2005.
- 200 GFS clusters; a single cluster can have 1,000 or even 5,000 machines
- Aggregate read/write throughput can be as high as 40 gigabytes/second across the cluster.
- 6000 MapReduce applications
Stack
- Distributed Systems Infrastructure: GFS, MapReduce, and BigTable.
- Spend more money on hardware to not lose log data, but spend less on other types of data.
Google File System
- core storage platform
- large distributed log structured file system
- high reliability across data centers
- scalability to thousands of network nodes
- huge read/write bandwidth requirements
- support for large blocks of data which are gigabytes in size.
- efficient distribution of operations across nodes to reduce bottlenecks
- System has master and chunk servers.
- Master servers keep metadata on the various data files.
- Data is stored in the file system in 64 MB chunks.
- Each chunk is replicated across 3 different chunk servers.
- The key is having enough infrastructure so that people have choices for their applications.
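The master/chunk-server split described above can be sketched as a toy model. The class and method names, the round-robin placement, and the five-server cluster here are illustrative assumptions, not GFS's actual API; the real master also handles leases, rack-aware placement, and re-replication.

```python
# Toy sketch of the GFS master's metadata role: map (file, chunk index)
# to the chunk servers holding each 64 MB replica. Clients ask the
# master where a chunk lives, then read/write the chunk servers directly.
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB chunks
REPLICAS = 3                   # each chunk lives on 3 chunk servers

class Master:
    def __init__(self, chunk_servers):
        self.chunk_servers = chunk_servers
        self.metadata = {}     # (filename, chunk_index) -> list of servers

    def locate(self, filename, offset):
        """Return the chunk index for a byte offset and its replica servers."""
        index = offset // CHUNK_SIZE
        key = (filename, index)
        if key not in self.metadata:
            # Assign 3 distinct servers; real placement considers load and racks.
            start = hash(key) % len(self.chunk_servers)
            self.metadata[key] = [
                self.chunk_servers[(start + i) % len(self.chunk_servers)]
                for i in range(REPLICAS)
            ]
        return index, self.metadata[key]

master = Master(["cs%d" % i for i in range(5)])
# A read 200 MB into the file falls in chunk 3 (0-indexed, 64 MB chunks).
index, replicas = master.locate("/logs/web.log", 200 * 1024 * 1024)
```

The point of the split is that the master only moves small metadata; the gigabytes of chunk data never pass through it, which is how one master can front thousands of nodes.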
MapReduce
- GFS -> Map -> Shuffle -> Reduction -> Store Results back into GFS.
- Nice way to partition tasks across lots of machines.
- Handle machine failure.
- Works across different application types, like search and ads.
- The MapReduce system has 3 different types of servers: Master servers, Map servers, and Reduce servers.
- The Master server assigns user tasks to map and reduce servers. It also tracks the state of the tasks.
- The Map servers accept user input and perform map operations on it. The results are written to intermediate files.
- The Reduce servers accept intermediate files produced by the map servers and perform reduce operations on them.
- Say you want to count the number of words in all web pages. You would feed all the pages stored in GFS into MapReduce. This would happen on thousands of machines simultaneously, and all the coordination, job scheduling, failure handling, and data transport would be done automatically.
- The Google indexing pipeline has about 20 different MapReduce steps.
- One problem is stragglers. A straggler is a computation that is going slower than the others and holds up everyone. Stragglers may happen because of slow IO (say, a bad controller) or from a temporary CPU spike. The solution is to run multiple copies of the same computation and, when one is done, kill all the rest.
- Data transferred between map and reduce servers is compressed. The idea is that because servers aren't CPU bound it makes sense to spend on data compression and decompression in order to save on bandwidth and I/O.
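The word-count example above can be sketched in a single process. This is only the programming model, not the distributed system: in the real thing each phase runs on thousands of machines and the intermediate pairs flow through local disk and the network, not an in-memory dict.

```python
# Toy word count in the MapReduce style: map emits (word, 1) pairs,
# shuffle groups the pairs by key, reduce sums each group.
from collections import defaultdict

def map_phase(doc_id, text):
    # Emit one (word, 1) pair per word occurrence.
    for word in text.split():
        yield word.lower(), 1

def shuffle(pairs):
    # Group all values emitted for the same key.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(word, counts):
    # Collapse each group to a single (word, total) result.
    return word, sum(counts)

pages = {"p1": "the web the index", "p2": "the crawl"}
pairs = [kv for doc, text in pages.items() for kv in map_phase(doc, text)]
counts = dict(reduce_phase(w, c) for w, c in shuffle(pairs).items())
# counts == {"the": 3, "web": 1, "index": 1, "crawl": 1}
```

Because map and reduce are pure functions over key/value pairs, the framework is free to rerun any piece on another machine after a failure, or run duplicates to beat stragglers.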
BigTable
- BigTable is a large scale, fault tolerant, self managing system that includes terabytes of memory and petabytes of storage. It can handle millions of reads/writes per second.
- BigTable is a distributed hash mechanism built on top of GFS. It is not a relational database. It doesn't support joins or SQL type queries.
- It provides a lookup mechanism to access structured data by key. GFS stores opaque data, and many applications need data with structure.
- Machines can be added and deleted while the system is running and the whole system just works.
- Each data item is stored in a cell which can be accessed using a row key, column key, and timestamp.
- Each row is stored in one or more tablets. A tablet is a sequence of 64KB blocks in a data format called SSTable.
- BigTable has three different types of servers: Master Server, Tablet Server, Lock Server
- The Master servers assign tablets to tablet servers. They track where tablets are located and redistribute tasks as needed.
- The Tablet servers process read/write requests for tablets. They split tablets when they exceed size limits (usually 100 MB - 200 MB). When a tablet server fails, 100 tablet servers each pick up one new tablet and the system recovers.
- The Lock servers form a distributed lock service. Operations like opening a tablet for writing, master arbitration, and access control checking require mutual exclusion.
- A locality group can be used to physically store related bits of data together for better locality of reference.
- Tablets are cached in RAM as much as possible.
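The (row key, column key, timestamp) data model above can be sketched as a small in-memory map. The class and method names are assumptions for illustration, not BigTable's API, and this ignores tablets, SSTables, and persistence entirely; it only shows how versioned cell lookups behave.

```python
# Toy model of BigTable cells: each (row, column) holds a list of
# timestamped versions, and a read without a timestamp returns the
# most recent one.
import bisect

class Table:
    def __init__(self):
        self.cells = {}  # (row, column) -> sorted list of (timestamp, value)

    def write(self, row, column, timestamp, value):
        versions = self.cells.setdefault((row, column), [])
        bisect.insort(versions, (timestamp, value))

    def read(self, row, column, timestamp=None):
        versions = self.cells.get((row, column), [])
        if timestamp is None:
            return versions[-1][1] if versions else None
        # Latest version at or before the requested timestamp.
        i = bisect.bisect_right(versions, (timestamp, chr(0x10FFFF)))
        return versions[i - 1][1] if i else None

t = Table()
t.write("com.example/index", "contents:", 100, "v1")
t.write("com.example/index", "contents:", 200, "v2")
# A plain read sees "v2"; a read as of timestamp 150 sees "v1".
```

Keeping multiple timestamped versions per cell is what lets readers ask "what did this page look like at time T" without any relational machinery.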
Hardware
- Use ultra-cheap commodity hardware and build software on top to handle machine death.
- Linux, in-house rack design, PC-class motherboards, low-end storage.
Lessons Learned
- Infrastructure can be a competitive advantage
- Spanning multiple data centers is still an unsolved problem.
- Take a look at Hadoop
- Build self-managing systems that work without having to take the system down.
- Create a Darwinian infrastructure. Perform time-consuming operations in parallel and take the winner.
- Consider compression. Compression is a good option when you have a lot of CPU to throw around and limited IO.
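The compression lesson above can be sketched with Python's standard `zlib`: spend CPU to shrink the bytes that cross the wire. The payload and the resulting ratio here are illustrative, not Google's numbers; repetitive log-like data happens to compress extremely well.

```python
# Trading CPU for bandwidth: compress before transfer, decompress after.
import zlib

# Repetitive log-like data, the kind of payload that compresses well.
payload = b"GET /index.html 200 1234\n" * 10_000

compressed = zlib.compress(payload, level=6)   # CPU cost paid here...
ratio = len(compressed) / len(payload)         # ...bandwidth saved here
restored = zlib.decompress(compressed)
assert restored == payload                     # lossless round trip
```

The tradeoff only wins when the machines are IO-bound rather than CPU-bound, which is exactly the situation the note describes for the map/reduce data transfer.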
- Google+ Stack: Java servlets, JavaScript, Closure framework (Closure's JavaScript compiler and template system), HTML5 History API, BigTable, Colossus/GFS, MapReduce
- Closure is a suite of JavaScript tools consisting of a library, a compiler, and templates
- The library is a modular, cross-browser JavaScript library
- The compiler is a true compiler for JavaScript that makes JavaScript download and run faster
- Templates is a server-side templating system that helps you dynamically build reusable HTML and UI elements
- HBase = BigTable
- Hadoop
- MapReduce
- Colossus is Google's next-generation file system, a replacement for GFS (HDFS is the open-source equivalent)
- OpenStack – cloud like infrastructure glue
- Google uses a custom Java Servlet container
- MessagePack, JSON, Hadoop, jQuery, MongoDB
- jQuery vs. Closure
- Hadoop Projects:
- Hadoop Common
- Hadoop Distributed File System (HDFS)
- Hadoop MapReduce
- Other Hadoop-related projects:
- Avro - a data serialization system.
- Cassandra - a scalable multi-master database with no single points of failure.
- Chukwa - a data collection system for managing large distributed systems.
- HBase - a scalable, distributed database that supports structured data storage for large tables.
- Hive - a data warehouse infrastructure that provides data summarization and ad hoc querying.
- Mahout - a scalable machine learning and data mining library.
- Pig - a high-level data-flow language and execution framework for parallel computation.
- ZooKeeper - a high-performance coordination service for distributed applications.
- 10 NoSQL Systems
Considerations
- the ability to add new machines to a live cluster transparently to your applications
- support for multiple datacenters
- data model
- Query API
- Persistence design
- Scalability
Databases
- Cassandra (Java)
- CouchDB (Erlang)
- HBase (Java)
- MongoDB (C++)
- Neo4j (Java)
- Redis
- Riak
- Scalaris
- Tokyo Cabinet
- Voldemort (Java)