After a successful Hadoop summit, we went along to the HBase meetup (#11) at Facebook to see how things are going. It turns out, pretty well. The community seems stronger than ever. Some of the main commiters are Facebook, StumbleUpon and Cloudera and they’re pushing significant work.
The next version will be 0.90. It will be a reliability release, but also includes performance gains. The version change will break from hadoop version numbers. 0.90 was chosen as there’s a belief it is maturing towards a 1.0 release.
The main points I picked up are:
* New batch importing allows writing hfiles directly and then just telling hbase where they are.
* Taking advantage of appends in hdfs for genuine durability.
* The namenode single point of failure is being addressed, facebook is planning to release their HA namenode.
* Replication between clusters. Allows cross data center replication. Eventually consistent.
* Tighter integration with zookeeper through a master rewrite.
* Significant work to have less temperamental behaviour during compaction and splits.
* Facebook are planning to release their distribution of hadoop and their highly available namenode.
All in all it’s very encouraging progress. I think there’s a case for us at Last.fm to look at HBase again soon.
Here are my notes from the meetup…
===Reliability===
* Master overhaul. Facebook is making it highly available.
* Test framework.
* Testing failure scenarios.
* Ops team friendly tools.
* HBase fsck.
* More performance metrics (in progress)
===HBase Master Rewrite===
Why?
* Master failover does not always work.
* ZK is patched on.
* Master to region server communication is inefficient.
* The code is a bitch.
The crux
* Better zookeeper integration
* Use zookeeper to track how far operations have progressed. eg: moving a region to another region server.
* Cleaner code.
Opens up future enhanchments
* Master need not do META edits.
* Region servers can recover their own regions on restarts.
* Reporting on shutdown.
* Limit concurrent major compactions across cluster. Because we know the progress.
Why not put META in zk?
* Could be great. It’ll take some work yet. Seems likely in the future.
===HBase cluster replication===
What?
* Fully integrated replication between clusters.
* Ships edits.
* Can be cross data centre.
* Eventually consistent across clusters.
* Master-slave, master-master, circular.
How?
* Master push.
* Write ahead log shipping.
* Meta data stored in zookeeper.
* Logs get shipped in batch and applied to the new region server locally using the htable client.
* Cute stuff.
* Timestamp based
Also…
* There is a new seperate utility program. A distcp-like map reduce job for copying tables between clusters.
===Bloom Filters, they’re back!===
Why they were originally removed?
* Tricky bugs during compaction.
** Solution: Fixed with cleverness.
* We had to estimate key size ahead of time, and under certain conditions the memory use bloated enough to make them counter productive.
** Solution: Fold / compress / cleverness.
Now.
* They don’t bloat memory uselessly.
* Defaulting to 1% error rate. Configureable.
* Great for exact queries (not good for scans, or if you know what you are always querying for already exists they’re a waste).
Usage.
* super granluarity, defaults to off.
* tweak max fold rate for compression if you know how big your rows are.
* property for turning them off at any time.
* you can turn them on on pre-existing rows, and they get added as compactions happen.
See HBASE-1200 for more information.
===HBase Bulk Loads===
* Better than the last one.
* Skips rpc paths.
* 10x faster then api use.
* Writes mapred output directly into hfiles on hdfs.
* They can be loaded into hbase easily. You just have to just tell it where they are.
steps
* Run mapred job.
* bin/hbase completebulkload /output-path/ tablename
* also new importtsv tool.
===Miscellaneous (fluffy stuff)===
Maven now the build system, all the way.
* it’s working.
Logo change.
* no one seems too attached to the bass cleff symbol.
HBASE-50 Facebook summer of code project.
* implementing snapshots (they’ll be a transaction).
* Design plan and implementation is looking so good, Stack says the committers should read it as they might learn something.
===Performance (in progress)===
Reduce IO around splits.
* Currently only triggered after compactions. So you rewrite the data both sides of the split.
* looks at the sum of all storefiles, so compactions don’t have to happen.
* checked after flush not compaction.
* Much faster.
* Avoids 50% of the io during the split.
Reduce time regions go offline.
* Splits, load balancing, region server failover.
* Fixes
** make splits faster
** Double flush memstore of region close. One before the close, one after, so that the flush after close is super fast. Which means reassigning regions wont make a region go offline for many seconds.
** Using zookeeper for more intelligent region movement.
Concurrency and priorities.
* Added multi threading to flushes and compactions.
* Multi threading of master messages.
* Flushes and compactions now have priorities associated with them.
HFile seek/reseek.
* Projections.
** seek to columns you want, not the start of the row.
** seek the the versions you want not the start of the column.
Configurable WAL
* HDFS appends to wal, provides durability.
* Optional deferred log flush that does not block requests, but constantly appends (aka scrobble server). For when 3 seconds data loss in a failure is acceptable.
* You can always disable WAL completely for speed.
Other stuff.
* Internally storing min/max timestamp of each file, for allowing skipping files that don’t over lap.
* Faster enable/disable/drop of tables.

