database sharding

general ideas

distribute data across multiple databases through some heuristic that splits data evenly.
- sometimes based on user ids, last names, or geographical location.
has similar advantages to database-federation because each individual database is smaller.
high-level disadvantages.
- adds complexity in application code.
- data can become lopsided because of power users or inconsistent hashing.
  - rebalancing adds even more complexity.
  - consistent-hashing can help.
- joining data is more complex.
- adds more hardware.

can’t perform range queries with a key other than the one used to partition.
can have uneven distribution in traffic based on which key was used to partition.

apply a hash function to an attribute to determine which partition a piece of data goes into.
we do this for every new record, and use the same hashing function to find out where to find an existing record.
- we hash the value in the attribute, then do $mod n$ where $n$ is the number of nodes we have.
this method can be improved with consistent hashing.

can’t perform range queries at all without storing additional data.
adding or removing nodes causes repartitioning across all nodes in the system.
consistent hashing.
- adding or removing nodes only affects neighboring nodes on the ring.