It is no secret that Google is the largest search engine today; what was a 'secret' until now is how much storage being the largest search engine requires.
Some Google statistics, presented in the OSDI '06 paper:
Understanding Bigtable:
Bigtable is a distributed storage system for managing structured data, designed to scale to very large sizes: petabytes of data distributed across thousands of servers. Many projects at Google store their data in Bigtable, including web indexing (Google Search), Google Earth, and Google Finance. [ ... ]
The paper also covers the challenges and problems Google faced while implementing Bigtable.
Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers. Many projects at Google store data in Bigtable, including web indexing, Google Earth, and Google Finance. These applications place very different demands on Bigtable, both in terms of data size (from URLs to web pages to satellite imagery) and latency requirements (from backend bulk processing to real-time data serving).
As of August 2006, there are 388 non-test Bigtable clusters running in various Google machine clusters, with a combined total of about 24,500 tablet servers. [ ... ]
We have described Bigtable, a distributed system for storing structured data at Google. Bigtable clusters have
been in production use since April 2005, and we spent roughly seven person-years on design and implementation
before that date. As of August 2006, more than sixty projects are using Bigtable. Our users like the performance
and high availability provided by the Bigtable implementation, and that they can scale the capacity of their
clusters by simply adding more machines to the system as their resource demands change over time.
The search engine, Google Search, as the following table shows, takes up no less than ~850 TB, with a compression ratio between 11% and 33%.
Google Analytics takes up ~220 TB across two tables:
The raw click table (~200 TB) maintains a row for each end-user session. The row name is a tuple containing the website’s name and the time at which the session was created. This schema ensures that sessions that visit the same web site are contiguous, and that they are sorted chronologically. This table compresses to 14% of its original size.
The summary table (~20 TB) contains various predefined summaries for each website. This table is generated from the raw click table by periodically scheduled MapReduce jobs. Each MapReduce job extracts recent session data from the raw click table. The overall system’s throughput is limited by the throughput of GFS.
This table compresses to 29% of its original size.
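To see why a row name made of the website's name followed by the session creation time keeps sessions grouped and ordered, here is a minimal Python sketch. It is not Bigtable code: the `row_key` helper, the `#` separator, and the timestamp format are assumptions made up for illustration. The only thing taken from the quote is the idea that rows are kept sorted by row name.

```python
from datetime import datetime, timezone


def row_key(website: str, session_start: datetime) -> str:
    """Hypothetical row name: website name, then a fixed-width UTC timestamp.

    A fixed-width timestamp makes lexicographic order match chronological order.
    """
    stamp = session_start.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%S")
    return f"{website}#{stamp}"


# Made-up sessions, deliberately listed out of order.
sessions = [
    ("example.org", datetime(2006, 8, 2, 9, 30, tzinfo=timezone.utc)),
    ("example.com", datetime(2006, 8, 1, 12, 0, tzinfo=timezone.utc)),
    ("example.org", datetime(2006, 8, 1, 8, 15, tzinfo=timezone.utc)),
    ("example.com", datetime(2006, 8, 1, 7, 45, tzinfo=timezone.utc)),
]

# Sorting by row name groups each website's sessions together,
# and orders them by creation time within each group.
sorted_keys = sorted(row_key(site, start) for site, start in sessions)

for key in sorted_keys:
    print(key)
```

With keys built this way, reading every session for one website is a scan over a contiguous range of rows, which is the property the quoted schema description points at.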
Google Earth, for its part, takes up about 70 TB:
The preprocessing pipeline uses one table to store raw imagery. During preprocessing, the imagery is cleaned and consolidated into final serving data. This table contains approximately 70 terabytes of data and therefore is served from disk. The images are efficiently compressed already, so Bigtable compression is disabled.
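As a rough back-of-the-envelope check, the quoted figures can be turned into approximate on-disk sizes. The sketch below assumes that the quoted table sizes are uncompressed sizes and that "compresses to N% of its original size" means the stored size is N% of that figure. The sizes themselves are approximate in the source, so the results are only ballpark numbers. The `compressed_size_tb` helper is a name invented for this example.

```python
# (table, approximate uncompressed size in TB, fraction of original size after compression)
tables = [
    ("Analytics raw click table", 200.0, 0.14),
    ("Analytics summary table", 20.0, 0.29),
    ("Earth raw imagery table", 70.0, 1.00),  # Bigtable compression disabled
]


def compressed_size_tb(original_tb: float, ratio: float) -> float:
    """Estimated stored size, assuming `ratio` is compressed size / original size."""
    return original_tb * ratio


total_original = sum(size for _, size, _ in tables)
total_on_disk = sum(compressed_size_tb(size, ratio) for _, size, ratio in tables)

for name, size, ratio in tables:
    print(f"{name}: {size:.0f} TB -> about {compressed_size_tb(size, ratio):.1f} TB")

print(f"Total: {total_original:.0f} TB -> about {total_on_disk:.1f} TB")
```

Under these assumptions the raw click table shrinks to roughly 28 TB and the summary table to roughly 6 TB, while the imagery table stays at about 70 TB because its data is already compressed.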
Google Labs reveals some interesting figures in its paper: Bigtable: A Distributed Storage System for Structured Data
I suppose most of us could host and manage the same amount on our desktop computers.
More at [ El rincón de Tolito ] | [ Google Operating System ]









