somewhat serious

Graphs All The Way Down

I find the separation of graph database management systems (GDBMSs) and relational database management systems (RDBMSs) peculiar. They don't differ in the things that matter most; as far as I can tell, the two mainly differ in their marketing.

For those unfamiliar with the graph as an abstract data type, see the Wikipedia article for an overview.

How RDBMSs Model Graphs

Relational database management systems typically impose a strict order on a graph via a declared schema. A schema may declare a set of relations to other schemas, or to itself.

Here's an example of how accounts, profiles, and friends might be declared, represented with some boxes and arrows. Boxes are tables; arrows are relations:

  ┌──────────────────┐   
  │accounts          │   
  │---               │   
  │                  │   
┌─┤id                │   
│ │email             │   
│ │                  │   
│ └──────────────────┘   
│                        
│ ┌──────────────────┐   
│ │profiles          │   
│ │---               │   
│ │                  │   
│ │id                │   
└►│account_id        ├─┬┐
  │name              │ ││
  │                  │ ││
  └──────────────────┘ ││
                       ││
  ┌──────────────────┐ ││
  │friends           │ ││
  │---               │ ││
  │                  │ ││
  │profile_left      │◄┘│
  │profile_right     │◄─┘
  │                  │   
  └──────────────────┘                                    

It is typical to weave values and references together in tables, of which I am (conceptually) not a fan. More on that in another post.

What is important about the above symbolic representation is that it specifies a graph. A "node" or "vertex" is a tuple in a schema (denoted by a box), and an "edge" is a column in the tuple which references another box (denoted by the arrows).

To make the above boxes tangible:

postgres=# \e
-- in embedded vim

begin;

create table accounts (
  id    int,
  email text,

  -- constraints
  primary key(id),
  unique(email)
);

create table profiles (
  id   int,
  account_id int references accounts(id) on delete cascade,
  name text,

  -- constraints
  primary key (id),
  unique(name)
);

create table friends (
  profile_left int references profiles(id) on delete cascade,
  profile_right int references profiles(id) on delete cascade,

  -- constraints
  primary key(profile_left, profile_right)
);

commit;

-- :wq
BEGIN
CREATE TABLE
CREATE TABLE
CREATE TABLE
COMMIT

postgres=# -- create accounts
postgres=# insert into accounts (id, email) values (1, 'brent@example.com'), (2, 'bnert@swedish.com');
INSERT 0 2
postgres=# -- create profiles
postgres=# insert into profiles (id, account_id, name) values (10, 1, 'brent'), (11, 2, 'bnert');
INSERT 0 2
postgres=# -- create a bi-directional edge between the two profiles
postgres=# insert into friends (profile_left, profile_right) values (10, 11), (11, 10);
INSERT 0 2

-- query: find emails for friends
postgres=# \e
-- in embedded vim
select
  accounts.id    as source_id,
  accounts.email as source_email,
  profiles.name  as source_name,
  a.id           as target_id,
  a.email        as target_email,
  p.name         as target_name
from
  accounts
  left join profiles
    on accounts.id = profiles.account_id
  left join friends
    on profiles.id = friends.profile_left
  left join profiles p
    on friends.profile_right = p.id
  left join accounts a
    on p.account_id = a.id
order by
  source_id;
-- :wq

 source_id |   source_email    | source_name | target_id |   target_email    | target_name
-----------+-------------------+-------------+-----------+-------------------+-------------
         1 | brent@example.com | brent       |         2 | bnert@swedish.com | bnert
         2 | bnert@swedish.com | bnert       |         1 | brent@example.com | brent
(2 rows)

In the last query, we're having to manually "walk the graph" per the schema specification. Even though SQL can be quite manual about the "how" of walking a graph, a graph is nonetheless being walked.
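
To underscore how manual the walking gets at depth, here is a minimal sketch of reaching friends-of-friends with a recursive CTE, using the same schema as above. The two-hop limit and the starting profile id are just for illustration:

with recursive walk (profile_id, depth) as (
  -- start from brent's profile (id 10)
  select profile_right, 1
  from friends
  where profile_left = 10

  union all

  -- follow friendship edges out to the next profile, up to two hops
  select friends.profile_right, walk.depth + 1
  from friends
  join walk
    on friends.profile_left = walk.profile_id
  where walk.depth < 2
)
select distinct profiles.name, walk.depth
from walk
join profiles
  on profiles.id = walk.profile_id;

The depth bound is doing double duty here: it limits how far we walk and keeps the recursion from cycling forever over the bi-directional edges.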

How GDBMSs Model Graphs

Given the popularity of neo4j and its query language Cypher, I'll futz my way through creating nodes and relationships (née edges), using the following as a symbolic representation:

  ┌──────────────────┐   
┌─┤accounts          │   
│ │---               │   
│ │                  │   
│ │id                │   
│ │email             │   
│ │                  │   
│ └──────────────────┘   
│                        
│ ┌──────────────────┐   
└►│profiles          ├──┐
  │---               │◄─┘
  │                  │   
  │id                │   
  │name              │   
  │                  │   
  │                  │   
  └──────────────────┘   

Notice how, with neo4j, there is no need to specify a schema for relationships; they are baked into the model.

Now, to make the symbolic representation tangible:

@neo4j> // create accounts
@neo4j> CREATE (brent:Account { id: 1, email: 'brent@example.com' }), (bnert:Account { id: 2, email: 'bnert@swedish.com' });
0 rows
ready to start consuming query after 56 ms, results consumed after another 0 ms
Added 2 nodes, Set 4 properties, Added 2 labels

@neo4j> // create profiles
@neo4j> CREATE (brent:Profile { id: 10, name: 'brent' }), (bnert:Profile { id: 11, name: 'bnert' });
0 rows
ready to start consuming query after 55 ms, results consumed after another 0 ms
Added 2 nodes, Set 4 properties, Added 2 labels

@neo4j> // relate brent account to brent profile
@neo4j> MATCH (acct:Account { id: 1 }), (profile:Profile { id: 10 }) CREATE (acct)-[:REL]->(profile);
0 rows
ready to start consuming query after 31 ms, results consumed after another 0 ms
Created 1 relationships

@neo4j> // relate bnert account to bnert profile
@neo4j> MATCH (acct:Account { id: 2 }), (profile:Profile { id: 11 }) CREATE (acct)-[:REL]->(profile);
0 rows
ready to start consuming query after 33 ms, results consumed after another 0 ms
Created 1 relationships

@neo4j> // relate brent profile to bnert profile
@neo4j> MATCH (left:Profile { id: 10 }), (right:Profile { id: 11 }) CREATE (left)-[:REL]->(right);
0 rows
ready to start consuming query after 63 ms, results consumed after another 0 ms
Created 1 relationships

@neo4j> // query: find emails for friends
@neo4j> MATCH (source:Account)-[]-(sourceProfile:Profile)-[]-(targetProfile:Profile)-[]-(target:Account)
        RETURN
          source.id as source_id,
          source.email as source_email,
          sourceProfile.name as source_name,
          target.id as target_id,
          target.email as target_email,
          targetProfile.name as target_name
        ORDER BY
          source_id;
+-----------------------------------------------------------------------------------------------+
| source_id | source_email        | source_name | target_id | target_email        | target_name |
+-----------------------------------------------------------------------------------------------+
| 1         | "brent@example.com" | "brent"     | 2         | "bnert@swedish.com" | "bnert"     |
| 2         | "bnert@swedish.com" | "bnert"     | 1         | "brent@example.com" | "brent"     |
+-----------------------------------------------------------------------------------------------+

2 rows
ready to start consuming query after 115 ms, results consumed after another 2 ms

Note that nodes are written without an explicit schema (i.e. schema on read); each node "document" carries an associated label, which enables clients to infer the resulting data structure.

The query language for navigating the graph is still quite mechanical, given the desired "walk path" needs to be specified; however, Cypher is a little clearer once the syntax is understood.
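
For instance, once the pattern syntax clicks, a multi-hop walk can be sketched with a variable-length pattern. This assumes the small graph created above, and the 1..2 hop bound is just for illustration:

// profiles reachable from brent's profile within two hops
MATCH (source:Profile { id: 10 })-[*1..2]-(other:Profile)
RETURN DISTINCT other.name;

The equivalent walk in SQL would mean another join (or a recursive CTE) per hop, whereas here the hop range is part of the pattern itself.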

In Sum

We're all attempting to model graphs. If you're an OOP class of person (pun intended), object hierarchies are graphs. If you're more of a maps kind of person, a nested data structure implies a relation and, therefore, implies a graph. Relational tuples are graphs. Everything is essentially a graph. It's graphs all the way down.

Therefore, we should optimize for interfaces which make the impedance mismatch between our graphs and our data structures as small as possible; ideally, they are transparent.