Property Graph Exchange Format (PG)

Note

THIS IS WORK IN PROGRESS. See also https://github.com/orgs/pg-format/discussions and https://github.com/pg-format/pg-formatter/wiki for discussion and references. The issue tracker for this specification document is at https://github.com/pg-format/specification/issues.

1 Introduction

1.1 Why property graphs?

Property Graphs (also known as Labeled Property Graphs) are used as an abstract data structure in graph databases and related applications.

Implementations of property graphs differ slightly in their support of data types, restrictions on labels, etc. The definition of property graphs used in this specification is intended to be a superset of the property graph models of common graph databases and formats. The model and its serializations were first proposed by Hirokazu Chiba, Ryota Yamanaka, and Shota Matsumoto (2019, 2022) and revised into this specification together with Jakob Voß.

1.2 About this document

The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “NOT RECOMMENDED”, “MAY”, and “OPTIONAL” in this document are to be interpreted as described in BCP 14 (RFC 2119 and RFC 8174) when, and only when, they appear in all capitals, as shown here.

Normative parts of this specification are limited to sections 2 to 6 and the normative references.

2 Data Model

A property graph consists of nodes and edges between these nodes. Each node has a unique node identifier. Each edge is either directed or undirected and can have an optional edge identifier. Each of the nodes and edges can have properties and labels. Properties are mappings from keys to non-empty lists of values. Node identifiers, labels, and keys are non-empty Unicode strings. A value is a Unicode string, a boolean value, or a number as defined by RFC 8259.

Extended graph features that are not part of this data model include graph attributes, hierarchies, hyper-edges, and the semantics of individual labels and property keys.
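The data model above can be sketched as plain data classes. This is a minimal, non-normative illustration: the names `Node`, `Edge`, `Value`, and `violations` are hypothetical and not defined by this specification.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Union

# A value is a string, a boolean, or a number (RFC 8259).
Value = Union[str, bool, int, float]


@dataclass
class Node:
    id: str
    labels: List[str] = field(default_factory=list)
    properties: Dict[str, List[Value]] = field(default_factory=dict)


@dataclass
class Edge:
    source: str
    target: str
    directed: bool = True
    id: Optional[str] = None  # edge identifiers are optional
    labels: List[str] = field(default_factory=list)
    properties: Dict[str, List[Value]] = field(default_factory=dict)


def violations(nodes: List[Node], edges: List[Edge]) -> List[str]:
    """Report violations of the constraints stated in the data model."""
    problems = []
    ids = [n.id for n in nodes]
    if len(set(ids)) != len(ids):
        problems.append("node identifiers are not unique")
    for item in [*nodes, *edges]:
        if isinstance(item, Node) and item.id == "":
            problems.append("empty node identifier")
        if "" in item.labels or "" in item.properties:
            problems.append("empty label or property key")
        if any(len(values) == 0 for values in item.properties.values()):
            problems.append("empty value list")
    for e in edges:
        if e.source not in ids or e.target not in ids:
            problems.append("edge references unknown node")
    return problems
```

The sketch stores labels as a list so that their order can be preserved; uniqueness of labels is handled at the serialization level described in section 3.5.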

3 PG format

A PG format document allows writing down a property graph in a compact textual form. A PG format document is a Unicode string that conforms to the grammar and rules defined in this specification.

3.1 Basic structure

A PG format document encodes a property graph as a Unicode string. The document MUST be encoded in UTF-8 (RFC 3629). Unicode codepoints can also be given by escape sequences in quoted strings.

The document consists of a sequence of statements, each defining a node or an edge, or being empty. Statements are separated from each other with a line break. Optional spaces and/or a comment at the end of a statement are ignored.

3.2 Identifiers

An identifier is a string used to uniquely identify a node, an edge, a label, or the name of a property. An identifier can be given as a quoted string or as an unquoted identifier.

An unquoted identifier is a non-empty string not including control codes U+0000 to U+0020 (tabulator, line breaks, space…), nor any of the characters “<” (U+003C), “>” (U+003E), ‘"’ (U+0022), “{” (U+007B), “}” (U+007D), “|” (U+007C), “\” (U+005C), “^” (U+005E), and “`” (U+0060).1

Example 1: Several unquoted identifiers
abc
~2

dc:title
http://example.org/?a=-&c=0#x
-':

An unquoted identifier MUST NOT start with a colon (U+003A) or comma (U+002C).

Example 2: Invalid unquoted identifier
:id

An unquoted identifier MUST NOT start with a hash (“#”) because this starts a comment, nor with an apostrophe (“'”) or quotation mark (“"”) because these start a quoted string. Colon, hash, comma, and apostrophe are allowed in an unquoted identifier after its first character.
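The character rules for unquoted identifiers can be checked with a regular expression. The following is a non-normative sketch; the function name `is_unquoted_identifier` is hypothetical, and the check covers only the character-level rules of this section, not the context in which an identifier appears within a statement.

```python
import re

# Characters that may never appear: U+0000..U+0020 and < > " { } | \ ^ `
_FORBIDDEN = r'\x00-\x20<>"{}|\\^`'
# The first character must additionally not be colon, comma, hash, or apostrophe.
_UNQUOTED = re.compile("[^" + _FORBIDDEN + ":,#'][^" + _FORBIDDEN + "]*")


def is_unquoted_identifier(s: str) -> bool:
    """Return True if s satisfies the character rules for unquoted identifiers."""
    return _UNQUOTED.fullmatch(s) is not None
```

Applied to the examples above, the identifiers of Example 1 are accepted and `:id` from Example 2 is rejected.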

3.3 Nodes

A node consists of the following elements, given in this order and separated by delimiting whitespace:

  • a REQUIRED node identifier
  • an OPTIONAL list of labels
  • an OPTIONAL list of properties

Example 3: Some node statements
id :label key:value
42 :answer
"node id with spaces"

3.3.1 Node merging

A node can be defined with multiple statements having the same node identifier: a node is merged with an existing node by appending labels and property values.

Example 4: One node defined by multiple statements
a :x k:1 m:true
a :y k:2

Example 5: Same node defined by one statement
a :x :y k:1,2 m:true
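Node merging can be sketched as follows. This is a non-normative illustration using plain dictionaries; the function name `merge_node` and the dictionary layout are assumptions, not part of this specification.

```python
def merge_node(nodes, node_id, labels=(), properties=None):
    """Merge one node statement into `nodes`, a dict keyed by node identifier."""
    node = nodes.setdefault(node_id, {"labels": [], "properties": {}})
    for label in labels:
        if label not in node["labels"]:  # repeated labels are ignored
            node["labels"].append(label)
    for key, values in (properties or {}).items():
        # value lists of the same key are concatenated, duplicates are kept
        node["properties"].setdefault(key, []).extend(values)
    return node


nodes = {}
merge_node(nodes, "a", ["x"], {"k": [1], "m": [True]})  # a :x k:1 m:true
merge_node(nodes, "a", ["y"], {"k": [2]})               # a :y k:2
```

After both calls, `nodes["a"]` holds the labels `x` and `y` and the values `1, 2` for key `k`, matching Example 5.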

3.3.2 Implicit nodes

Nodes can also be defined implicitly as part of an edge: node identifiers referenced in edges imply the existence of nodes with these identifiers.

Example 6: Simple graph with two nodes and one edge
a -> b

Example 7: Same graph with explicit node statements
a
b
a -> b
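A non-normative sketch of how an application might collect implicit nodes: every node identifier referenced by an edge is added unless it was already defined. The function name `with_implicit_nodes` and the representation of edges as pairs are assumptions made for this illustration.

```python
def with_implicit_nodes(node_ids, edges):
    """Return node identifiers in first-seen order, including implicit nodes.

    `edges` is an iterable of (source, target) pairs of node identifiers.
    """
    seen = dict.fromkeys(node_ids)  # dict keys preserve insertion order
    for source, target in edges:
        seen.setdefault(source)
        seen.setdefault(target)
    return list(seen)
```

With this sketch, Example 6 and Example 7 yield the same list of node identifiers.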

3.4 Edges

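An edge consists of the following elements, given in this order and separated by delimiting whitespace:

  • an OPTIONAL edge identifier
  • a REQUIRED identifier of the source node
  • a REQUIRED direction
  • a REQUIRED identifier of the target node
  • an OPTIONAL list of labels
  • an OPTIONAL list of properties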

Example 8: Some edge statements
a -> b
a -- b key:value
1: a -> b :label key:value

The following statement does not define an edge but a node with identifier “a--b”:

Example 9: Not an edge statement
a--b

The following is not valid:

Example 10: Invalid edge statement
a->b

3.4.1 Edge identifiers

An edge identifier is an identifier as first element of an edge statement, directly followed by a colon (U+003A).

Example 11: Graph with two equivalent edges, differentiated by edge identifiers
1: a -> b :follows since:2024
"x": a -> b :follows since:2024

Colons are not forbidden in edge identifiers:

Example 12: Edge identifiers with colon
x::  a -> b  # edge identifier "x:"
":": a -> b  # edge identifier ":"

Edge identifiers MUST NOT be repeated.

Example 13: The second statement is invalid because of repeated edge identifier
1: a -> b :follows
1: a -> b since:2024

No space is allowed between edge identifier and its colon:

Example 14: Invalid statement
1 : a -> b

3.4.2 Edge directions

The direction element of an edge is either the character sequence -> for a directed edge or the character sequence -- for an undirected edge.

3.4.3 Loops

Edges can connect a node to itself.

Example 15: Directed and undirected loop
a -> a
a -- a

3.4.4 Multi-edges

The Property Graph Data Model allows for multiple edges between the same nodes.

Example 16: Graph with two indistinguishable edges
a -> b :follows since:2024
a -> b :follows since:2024

Edge identifiers can be used to identify and reference individual multi-edges.
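The interplay of multi-edges and edge identifiers can be sketched as a simple check: edges without an identifier may be repeated freely, while a repeated edge identifier is an error. The function name `repeated_edge_ids` is hypothetical, and edges without an identifier are represented as `None` for this illustration.

```python
def repeated_edge_ids(edge_ids):
    """Return edge identifiers that occur more than once, ignoring None."""
    seen, repeated = set(), []
    for edge_id in edge_ids:
        if edge_id is None:  # edge without identifier: always allowed
            continue
        if edge_id in seen:
            repeated.append(edge_id)
        seen.add(edge_id)
    return repeated
```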

3.5 Labels

A label is an identifier following a colon (U+003A). Spaces between colon and label identifier are OPTIONAL but NOT RECOMMENDED.

Labels of a node or an edge are unique: repeated labels are ignored. Applications SHOULD preserve the order of labels of a node or an edge.

Example 17: Repeated labels on same node or edge are ignored
a :label1 :label2 :label1   # label1 is repeated
a :label1 :label2           # equivalent statement
a : label1 : label2         # equivalent statement

Colons are not forbidden in labels:

Example 18: Labels with colons
a :b:c                      # label "b:c"
a :http://example.org/      # label "http://example.org/"

3.6 Properties

A property consists of the following elements, given in this order:

  • a REQUIRED property key, being an identifier
  • a colon (U+003A)
  • a non-empty list of property values, separated by comma (U+002C)

Each property value MAY be preceded and followed by delimiting whitespace.

Example 19: Property with optional spaces and/or whitespace
a key: value        # spaces before value
a key:value         # short form
a "key":value       # key can be quoted string
a key:              # delimiting whitespace between colon and value
  value             

a key: 1,2          # short form of a list
a key: 1            # delimiting whitespace... 
  ,                 # ...after value 1 and before value 2
  2

Example 20: Invalid property
a key               # delimiting whitespace not allowed before colon
  : value

3.6.1 Property values

A property value is one of

  • a number value, given as defined in section 6 of RFC 8259. As mentioned there, implementations MAY set limits on the range and precision of numbers and double precision (IEEE754) is the most likely common limit.
  • a boolean value, given as one of the literal character sequences true and false
  • a string value, given as one of

The data type of a property value in PG format is either string, or number, or boolean.2 Applications MAY internally map these types to other type systems. Values of the same property are allowed to have different data types.

Example 21: Property values
x n: 1,-1,2e+3          # numbers 
  b: true, false        # boolean values       
  s: hello,"true",""    # strings
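The distinction between the three data types can be sketched as a classifier for a single value token. This is a non-normative illustration: the function name `value_type` is hypothetical, the number pattern follows the grammar in section 6 of RFC 8259, and treating every other unquoted token as a string is an assumption based on Example 21.

```python
import re

# Number grammar of RFC 8259, section 6: [minus] int [frac] [exp]
_NUMBER = re.compile(r"-?(0|[1-9][0-9]*)(\.[0-9]+)?([eE][+-]?[0-9]+)?")


def value_type(token: str) -> str:
    """Classify a property value token as "number", "boolean", or "string"."""
    if token[:1] in ('"', "'"):
        return "string"  # quoted strings are always strings
    if _NUMBER.fullmatch(token):
        return "number"
    if token in ("true", "false"):
        return "boolean"
    return "string"  # assumption: any other unquoted token is a string
```

Note that `"true"` in quotes is classified as a string, while the bare token `true` is a boolean.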

3.6.2 Property merging

Value lists of properties of the same property key are concatenated. Value lists are not sets: the same value can be included multiple times.

Example 22: Three nodes with same properties
a x:1,2,3       # property values given as list
b x:1 x:2,3     # property values given as two lists
c x:1           # property values given...
c x:2 x:3       # ...in two node statements

3.7 Quoted Strings

A quoted string starts with an apostrophe (“'”) or quotation mark (“"”) and ends with the same character. In between, all Unicode characters are allowed, except for the characters that MUST be escaped:

  • apostrophe, when the string is quoted with apostrophe
  • quotation mark, when the string is quoted with quotation mark
  • reverse solidus (\ U+005C)
  • control characters U+0000 through U+001F except line feed (U+000A), carriage return (U+000D), and tab (U+0009)

All characters can be escaped as defined by the JSON specification (RFC 8259, section 7) with the addition of the two-character escape sequence \' to escape an apostrophe. Quoted strings in PG format further differ from JSON by allowing string quoting with apostrophe in addition to quotation mark and by allowing unescaped line feed, carriage return, and tab.

Example 23: The same string given in multiple quoted forms
"hello,\nworld"
'hello,\u000Aworld'
"hello,
world"

Example 24: Invalid string escape sequences
"h\ello\u21"
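The quoting rules can be sketched as a decoder that turns a quoted string into its value. This is a non-normative illustration; the function name `unquote` is hypothetical, and the sketch does not combine UTF-16 surrogate pairs given as two \u escapes.

```python
import re

# Two-character escapes of RFC 8259, section 7, plus \' for the apostrophe.
_SIMPLE = {'"': '"', "'": "'", "\\": "\\", "/": "/",
           "b": "\b", "f": "\f", "n": "\n", "r": "\r", "t": "\t"}


def unquote(quoted: str) -> str:
    """Decode a quoted string, raising ValueError if it is not valid."""
    if len(quoted) < 2 or quoted[0] not in "\"'" or quoted[-1] != quoted[0]:
        raise ValueError("not a quoted string")
    quote, body = quoted[0], quoted[1:-1]
    out, i = [], 0
    while i < len(body):
        ch = body[i]
        if ch == "\\":
            esc = body[i + 1:i + 2]
            if esc == "u":
                digits = body[i + 2:i + 6]
                if not re.fullmatch(r"[0-9A-Fa-f]{4}", digits):
                    raise ValueError("invalid \\u escape sequence")
                out.append(chr(int(digits, 16)))
                i += 6
            elif esc in _SIMPLE:
                out.append(_SIMPLE[esc])
                i += 2
            else:
                raise ValueError("invalid escape sequence")
        elif ch == quote or (ch < " " and ch not in "\n\r\t"):
            raise ValueError("character must be escaped")
        else:
            out.append(ch)
            i += 1
    return "".join(out)
```

All three forms of Example 23 decode to the same value, and the string of Example 24 is rejected at the escape sequence `\e`.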

3.8 Whitespace

A line break is either a line feed (U+000A) or a carriage return (U+000D) optionally followed by a line feed.

Spaces are a non-empty sequence of space (U+0020) and/or tab (U+0009) characters.

A comment begins with a hash (# = U+0023) and it ends before the next line break or at the end of the document.

Delimiting whitespace separates elements of a statement. Delimiting whitespace consists of an optional sequence of spaces, comments, and/or line breaks and it ends with spaces. The inclusion of line breaks in delimiting whitespace is also called line folding.

Example 25: Line folding
a :x  # node id and label
  # this and the following line are empty 

  :y  # another label of the same node at continuation line

Example 26: Same graph as above
a :x :y
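Line folding can be approximated by grouping physical lines: because delimiting whitespace ends with spaces, a line that starts with a space or tab continues the previous statement. The following non-normative sketch relies on that reading; the function name `group_lines` is hypothetical, and the sketch ignores quoted strings that contain line breaks, which a full parser has to handle.

```python
import re

# A line break is a line feed, or a carriage return optionally followed by a line feed.
_LINE_BREAK = re.compile(r"\r\n?|\n")


def group_lines(text: str):
    """Group physical lines into lists of lines belonging to one statement."""
    groups = []
    for line in _LINE_BREAK.split(text):
        # Empty lines and lines starting with space or tab are attached
        # to the preceding group; any other line starts a new statement.
        if groups and (line == "" or line[0] in " \t"):
            groups[-1].append(line)
        else:
            groups.append([line])
    return groups
```

Applied to Example 25, all four lines end up in a single group, so the label `:y` belongs to node `a`.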

3.9 Grammar

Note

A formal grammar of PG format will be provided.

4 PG-JSON

A PG-JSON document serializes a property graph in JSON. A PG-JSON document is a JSON document (RFC 8259) with a JSON object with exactly two fields:

  • nodes an array of nodes
  • edges an array of edges

Each node is a JSON object with exactly three fields:

  • id the node identifier, being a non-empty string. Node identifiers MUST be unique per PG-JSON document.
  • labels an array of labels, each being a non-empty string. Labels MUST be unique per node.
  • properties a JSON object mapping non-empty strings as property keys to non-empty arrays of scalar JSON values (string, number, boolean) as property values.

Each edge is a JSON object with two optional and four mandatory fields:

  • id (optional) the edge identifier, being a non-empty string. Edge identifiers MUST be unique per PG-JSON document.
  • undirected (optional) a boolean value whether the edge is undirected
  • from an identifier of the source node from nodes array
  • to an identifier of the target node from nodes array
  • labels and properties as defined above at nodes
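The structural rules of PG-JSON can be sketched as a small checker operating on an already parsed JSON document. This is a non-normative illustration: the function name `pg_json_errors` is hypothetical, and the sketch does not check the types of identifiers, labels, or property values.

```python
def pg_json_errors(doc):
    """Return a list of messages for violated structural rules of PG-JSON."""
    if set(doc) != {"nodes", "edges"}:
        return ["document must have exactly the fields nodes and edges"]
    errors = []
    node_ids = set()
    for node in doc["nodes"]:
        if set(node) != {"id", "labels", "properties"}:
            errors.append("node must have exactly id, labels, and properties")
            continue
        if node["id"] in node_ids:
            errors.append("repeated node identifier " + repr(node["id"]))
        node_ids.add(node["id"])
        if len(set(node["labels"])) != len(node["labels"]):
            errors.append("repeated label on node " + repr(node["id"]))
    edge_ids = set()
    for edge in doc["edges"]:
        if {"from", "to", "labels", "properties"} - set(edge):
            errors.append("edge lacks a mandatory field")
            continue
        for end in ("from", "to"):
            if edge[end] not in node_ids:
                errors.append("edge references unknown node " + repr(edge[end]))
        if "id" in edge:
            if edge["id"] in edge_ids:
                errors.append("repeated edge identifier " + repr(edge["id"]))
            edge_ids.add(edge["id"])
    return errors


# Hypothetical example document: two nodes and one directed edge.
example = {
    "nodes": [
        {"id": "a", "labels": ["x"], "properties": {"k": [1]}},
        {"id": "b", "labels": [], "properties": {}},
    ],
    "edges": [{"from": "a", "to": "b", "labels": [], "properties": {}}],
}
```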

5 PG-JSONL

A PG-JSONL document or stream serializes a property graph in JSON Lines format, also known as newline-delimited JSON. A PG-JSONL document is a sequence of JSON objects, separated by line separator (U+000A) and optional whitespace (U+0020, U+0009, and U+000D) around JSON objects, and an optional line separator at the end. Each object is

  • either a node with field type having string value "node" and the same mandatory node fields from PG-JSON format,
  • or an edge with field type having string value "edge" and the same mandatory edge fields from PG-JSON format.

Node objects SHOULD be given before their node identifiers are referenced in an edge object, but applications MAY also create implicit node objects for these cases. Applications MAY allow multiple node objects with the same node identifier in PG-JSONL but they MUST make clear whether nodes with repeated identifier are ignored, merged into existing nodes, or replace existing nodes.
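Converting a parsed PG-JSON document into PG-JSONL can be sketched as follows. This is a non-normative illustration; the function name `to_pg_jsonl` is hypothetical. Nodes are written before edges so that every node identifier is defined before it is referenced.

```python
import json


def to_pg_jsonl(doc) -> str:
    """Serialize a parsed PG-JSON document as PG-JSONL, nodes first."""
    lines = [json.dumps({"type": "node", **node}) for node in doc["nodes"]]
    lines += [json.dumps({"type": "edge", **edge}) for edge in doc["edges"]]
    # json.dumps escapes line breaks inside strings, so each object is one line.
    return "".join(line + "\n" for line in lines)
```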

6 Robustness principle

Applications MAY automatically convert documents not fully conforming to the specification of PG-JSON and/or PG-JSONL to valid form, for instance by:

  • creation of implicit nodes for node identifiers referenced in edges
  • addition of missing empty fields labels and/or properties
  • removal or mapping of invalid property values such as null and JSON objects
  • mapping of numeric node identifiers and edge identifiers to strings
  • removal of additional fields not defined in this specification
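The conversions listed above can be sketched as a normalization step applied to a parsed document. This is a non-normative illustration of one possible behavior: the function names `normalize`, `_clean`, and `_as_id` are hypothetical, and dropping a property whose value list becomes empty is an assumption made so that value lists stay non-empty.

```python
def _as_id(value):
    """Map numeric identifiers to strings, leave other values unchanged."""
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        return str(value)
    return value


def _clean(item, allowed):
    """Drop unknown fields, add missing labels/properties, filter values."""
    out = {k: v for k, v in item.items() if k in allowed}
    out["labels"] = list(out.get("labels") or [])
    properties = {}
    for key, values in (out.get("properties") or {}).items():
        # keep strings, numbers, and booleans; drop null, objects, arrays
        kept = [v for v in values if isinstance(v, (str, int, float))]
        if kept:  # assumption: drop properties left without values
            properties[key] = kept
    out["properties"] = properties
    return out


def normalize(doc):
    nodes, ids, edges = [], set(), []
    for node in doc.get("nodes", []):
        node = _clean(node, {"id", "labels", "properties"})
        node["id"] = _as_id(node["id"])
        nodes.append(node)
        ids.add(node["id"])
    for edge in doc.get("edges", []):
        edge = _clean(edge, {"id", "undirected", "from", "to", "labels", "properties"})
        for name in ("id", "from", "to"):
            if name in edge:
                edge[name] = _as_id(edge[name])
        for end in ("from", "to"):
            if edge[end] not in ids:  # create implicit node
                nodes.append({"id": edge[end], "labels": [], "properties": {}})
                ids.add(edge[end])
        edges.append(edge)
    return {"nodes": nodes, "edges": edges}
```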

7 JSON Schemas

The PG-JSON format can be validated with a non-normative JSON Schema file pg-json.json in this repository. Rules not covered by the JSON schema include:

  • nodes referenced in edges must be defined (no implicit nodes)
  • node identifiers must be unique per graph
  • edge identifiers must be unique per graph

The PG-JSONL format can be validated with a non-normative JSON Schema file pg-jsonl.json in this repository. Validation is limited in the same way as validation of PG-JSON with its JSON Schema.

8 References

8.1 Normative References

8.2 Informative references

9 Changes

9.1 Version 1.0.0

not published yet

Introduced comments, line folding, edge identifiers. Aligned property values with JSON syntax. Added more formal rules for quoted strings and unquoted identifiers. Added PG-JSONL. Changed node identifiers to be strings.

9.2 Version 0.3

Less formal specification published in 2019. See latest version from 2020.

Footnotes

  1. This definition is equivalent to the definition of IRI references in SPARQL and in Turtle excluding empty strings and escape sequences. Forbidden characters are reserved for extensions of PG format.

  2. This is identical to scalar JSON values (string, number, boolean) and every serialized JSON scalar is a valid property value in PG format.