Percolator notes

Percolator only runs over a single datacenter.
Provides snapshot-isolation properties over the data.
Incremental processing over a petabyte-scale database.
Observers are registered on user table data. When something is written to an observed part of the table, the observer runs on it. Its results are written back to the table for the next observer to pick up. An external process makes the initial data write, which lets the first observer start processing.
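The observer chain above can be sketched as a toy in-memory table. This is only an illustration, not Percolator's API: `Table`, `observe`, `write`, and the column names `raw`, `parsed`, `word_count` are all made up here, and observers run synchronously on write, whereas Percolator runs them later from worker scans.

```python
from collections import defaultdict


class Table:
    """Toy table: cells keyed by (row, column), observers keyed by column (hypothetical)."""

    def __init__(self):
        self.cells = {}
        self.observers = defaultdict(list)  # column -> callbacks

    def observe(self, column, callback):
        """Register a callback to run whenever `column` is written."""
        self.observers[column].append(callback)

    def write(self, row, column, value):
        self.cells[(row, column)] = value
        # Simplification: run observers immediately instead of asynchronously.
        for callback in self.observers[column]:
            callback(self, row, value)


def parse(table, row, raw):
    # First observer: normalizes the raw text and writes the result.
    table.write(row, "parsed", raw.strip().lower())


def count_words(table, row, text):
    # Second observer: triggered by the first observer's write.
    table.write(row, "word_count", len(text.split()))


table = Table()
table.observe("raw", parse)
table.observe("parsed", count_words)

# The "external process" makes the initial write that starts the chain.
table.write("doc1", "raw", "  Hello Percolator World ")
print(table.cells[("doc1", "word_count")])  # 3
```

Each observer only knows about the column it watches and the column it writes, which is what lets the stages be chained without a central pipeline definition.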
Main abstractions: ACID transactions, Observers.
Every machine: Percolator worker, Bigtable tablet server, GFS chunkserver.
Observers linked into the Percolator worker, which scans the Bigtable for changed columns.
Programmer’s perspective: Percolator repository, consisting of a small number of tables. Each table is a collection of ‘cells’ indexed by row and column. Each ‘cell’ contains a value (array of bytes).
Relaxed latency. Lazy approach for cleaning up locks (e.g. locks left by transactions running on failed machines). No global deadlock detector (higher latency, but more scalable).
Percolator explicitly maintains locks. Uses Bigtable to store locks in special in-memory columns.
Data organized into Bigtable rows and columns, with additional Percolator metadata stored alongside in special columns
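A rough sketch of that layout, assuming a plain dictionary stands in for Bigtable. `CellStore`, `meta`, the `column:kind` naming scheme, and the sample row and values are illustrative assumptions, not names taken from these notes.

```python
class CellStore:
    """Toy stand-in for a table: each cell is bytes, indexed by (row, column)."""

    def __init__(self):
        self._cells = {}

    def put(self, row, column, value):
        if not isinstance(value, bytes):
            raise TypeError("cell values are arrays of bytes")
        self._cells[(row, column)] = value

    def get(self, row, column):
        return self._cells.get((row, column))

    def columns_of(self, row):
        """Return the sorted column names present in `row`."""
        return sorted(c for (r, c) in self._cells if r == row)


def meta(column, kind):
    # Hypothetical naming scheme: metadata lives in a sibling column.
    return f"{column}:{kind}"


store = CellStore()
row = "com.example/index.html"
store.put(row, "contents", b"<html>hello</html>")           # user data
store.put(row, meta("contents", "lock"), b"txn-42")          # lock metadata
store.put(row, meta("contents", "notify"), b"1")             # dirty marker

print(store.columns_of(row))
```

Keeping metadata in the same row as the data means a single row read returns the value together with its lock and notification state.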
Cross-row, cross-table transactions with ACID snapshot-isolation semantics.
Uses ‘far more resources to process a fixed amount of data than a traditional DBMS would’, but is more scalable. Result: roughly double the resources to process the same crawl rate when compared to MapReduce. However, clustering latency is much lower: MapReduce runs in complete batches, so no results are available until the whole batch is ready, whereas Percolator pages are continually ready (the median document is parsed over 100x faster than with MapReduce).

How notifications are set: when a transaction writes an ‘observed’ cell, it also sets a corresponding ‘notify cell’ (in the Bigtable notify column, with an entry for each dirty cell). Workers perform a distributed scan over the notify column to find and process dirty cells. After the observer runs, the notify cell is removed.
The scan is randomized: a worker picks a random tablet, then a random key in that tablet, i.e. a random region. If a scanning thread finds itself scanning the same region as another thread, it picks a new random location.
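A minimal single-threaded sketch of the notify-and-scan loop, assuming one sorted list of row keys stands in for the tablets. `NotifyTable`, `scan_once`, `drain`, and the `chunk` parameter are hypothetical names for illustration. The rule about threads avoiding each other's regions is not modelled here.

```python
import bisect
import random


class NotifyTable:
    """Toy table with a notify set marking rows whose observed cell is dirty."""

    def __init__(self):
        self.rows = []       # sorted row keys, standing in for the key space
        self.data = {}
        self.notify = set()  # one entry per dirty row

    def write_observed(self, row, value):
        if row not in self.data:
            bisect.insort(self.rows, row)
        self.data[row] = value
        self.notify.add(row)  # set the notify entry along with the write


def scan_once(table, observer, rng, chunk=3):
    """Scan `chunk` consecutive rows from a random start; process dirty ones."""
    if not table.rows:
        return 0
    start = rng.randrange(len(table.rows))
    processed = 0
    for row in table.rows[start:start + chunk]:
        if row in table.notify:
            observer(row, table.data[row])
            table.notify.discard(row)  # remove the notify entry after the observer runs
            processed += 1
    return processed


def drain(table, observer, seed=0):
    """Keep scanning from random positions until no dirty rows remain."""
    rng = random.Random(seed)
    while table.notify:
        scan_once(table, observer, rng)


table = NotifyTable()
for i in range(10):
    table.write_observed(f"row{i:02d}", i)

seen = []
drain(table, lambda row, value: seen.append(row))
print(sorted(seen) == table.rows, len(table.notify))
```

Starting each scan at a random position spreads workers across the key space without any coordination between them.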