1 Introduction
1.1 Motivation
Property Graphs (also known as Labeled Property Graphs) are used to structure data in graph databases and related applications.
Implementations of property graphs differ slightly in their support for data types, restrictions on labels, etc. The definition of property graphs used in this specification aims to be a superset of the property graph models of common graph databases and formats. The model and its serializations were first proposed by Hirokazu Chiba, Ryota Yamanaka, and Shota Matsumoto (2019, 2020) and revised into this specification together with Jakob Voß.
1.2 Terminology
The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “NOT RECOMMENDED”, “MAY”, and “OPTIONAL” in this document are to be interpreted as described in BCP 14 (RFC 2119 and RFC 8174) when, and only when, they appear in all capitals, as shown here.
1.3 Robustness principle
Applications MAY automatically convert documents not fully conforming to the specification of PG-JSON and/or PG-JSONL to a valid form, for instance by:
- creation of implicit nodes for node identifiers referenced in edges
- addition of missing empty fields labels and/or properties
- removal or mapping of invalid property values such as null and JSON objects
- mapping of numeric node identifiers and edge identifiers to strings
- removal of additional fields not defined in this specification
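The conversions listed above can be sketched in Python for PG-JSON. This is a minimal illustration, not a reference implementation: the function names `as_id`, `clean`, and `normalize_pg_json` are hypothetical, invalid property values are removed rather than mapped, and wrapping a single non-array value into a list is an extra assumption not listed above.

```python
def as_id(value):
    # Map numeric identifiers to strings; leave everything else untouched.
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        return str(value)
    return value


def clean(obj, allowed):
    # Drop fields not defined in the specification.
    out = {k: v for k, v in obj.items() if k in allowed}
    # Add missing (or null) labels as an empty array.
    out["labels"] = list(obj.get("labels") or [])
    props = {}
    for key, values in (obj.get("properties") or {}).items():
        if not isinstance(values, list):
            values = [values]  # assumption: treat a single value as a one-element list
        # Remove invalid property values such as null and JSON objects.
        kept = [v for v in values if isinstance(v, (str, int, float, bool))]
        if kept:
            props[key] = kept
    out["properties"] = props
    return out


def normalize_pg_json(doc):
    nodes, seen = [], set()
    for raw in doc.get("nodes", []):
        node = clean(raw, {"id", "labels", "properties"})
        node["id"] = as_id(node.get("id"))
        nodes.append(node)
        seen.add(node["id"])
    edges = []
    for raw in doc.get("edges", []):
        edge = clean(raw, {"id", "undirected", "from", "to", "labels", "properties"})
        for field in ("id", "from", "to"):
            if field in edge:
                edge[field] = as_id(edge[field])
        # Create implicit nodes for node identifiers referenced in edges.
        for ref in (edge.get("from"), edge.get("to")):
            if ref is not None and ref not in seen:
                nodes.append({"id": ref, "labels": [], "properties": {}})
                seen.add(ref)
        edges.append(edge)
    return {"nodes": nodes, "edges": edges}
```

An application that prefers strict behaviour can skip such a step and reject non-conforming documents instead, since the conversions are only permitted (MAY), not required.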
2 Data Model
A property graph consists of nodes and edges between these nodes. Each node has a unique node identifier. Each edge is either directed or undirected and can have an optional edge identifier. Each of the nodes and edges can have properties and labels. Properties are mappings from keys to non-empty lists of values. Node identifiers, labels, and keys are non-empty Unicode strings. A value is a Unicode string, a boolean value, or a number as defined by RFC 8259.
Extended graph features that are not part of this data model include graph attributes, hierarchies, hyper-edges, and semantics of individual labels and property keys.
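One possible in-memory representation of this data model is sketched below in Python. The class and field names (`Node`, `Edge`, `Graph`, `source`, `target`, `directed`) are illustrative assumptions and not defined by this specification; representing numbers as `int` or `float` is likewise a choice of the sketch.

```python
from dataclasses import dataclass, field
from typing import Optional, Union

# A value is a string, a boolean, or a number.
Value = Union[str, bool, int, float]


@dataclass
class Node:
    id: str                                        # unique, non-empty
    labels: list[str] = field(default_factory=list)
    properties: dict[str, list[Value]] = field(default_factory=dict)


@dataclass
class Edge:
    source: str                                    # node identifier
    target: str                                    # node identifier
    directed: bool = True
    id: Optional[str] = None                       # edge identifiers are optional
    labels: list[str] = field(default_factory=list)
    properties: dict[str, list[Value]] = field(default_factory=dict)


@dataclass
class Graph:
    nodes: dict[str, Node] = field(default_factory=dict)
    edges: list[Edge] = field(default_factory=list)

    def add_node(self, node: Node) -> None:
        if not node.id or node.id in self.nodes:
            raise ValueError("node identifier must be non-empty and unique")
        self.nodes[node.id] = node

    def add_edge(self, edge: Edge) -> None:
        if edge.source not in self.nodes or edge.target not in self.nodes:
            raise ValueError("edge references an unknown node")
        self.edges.append(edge)
```

Keeping edges in a list rather than a set or a dictionary keyed by node pair allows loops and multiple edges between the same nodes.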
3 PG Format
A PG Format document allows writing down a property graph in a compact textual form. A PG Format document is a Unicode string that conforms to the grammar and rules defined in this specification.
3.1 Basic structure
A PG Format document encodes a property graph as a Unicode string. The document MUST be encoded in UTF-8 (RFC 3629). Unicode codepoints can also be given by escape sequences in quoted strings.
The document consists of a sequence of statements, each defining a node or an edge, or being empty. Statements are separated from each other by a line break. Optional spaces and/or a comment at the end of a statement are ignored.
3.2 Identifiers
An identifier is a string used to uniquely identify a node, an edge, a label, or the name of a property. An identifier can be given as a quoted string or as an unquoted identifier.
An unquoted identifier is a non-empty string not including control codes U+0000 to U+0020 (tabulator, line breaks, space…), nor any of the characters “<” (U+003C), “>” (U+003E), ‘"’ (U+0022), “{” (U+007B), “}” (U+007D), “|” (U+007C), “\” (U+005C), “^” (U+005E), and “`” (U+0060). An unquoted identifier MUST NOT start with a colon (U+003A), comma (U+002C), minus (U+002D), hash (“#”), apostrophe (“'”), or quotation mark (“"”). Colon, hash, comma, and apostrophe are allowed in an unquoted identifier after its first character.[1]
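The rule for unquoted identifiers can be checked with a regular expression. The following Python sketch mirrors the character restrictions described above; the function name `is_unquoted_identifier` is a hypothetical helper, not part of the specification.

```python
import re

# Characters never allowed in an unquoted identifier:
# U+0000 to U+0020 and < > " { } | \ ^ `
_FORBIDDEN = r'\x00-\x20<>"{}|\\^`'

# The first character additionally must not be ' : # , or -
# (the quotation mark is already excluded by _FORBIDDEN).
_UNQUOTED = re.compile(rf"[^{_FORBIDDEN}':#,\-][^{_FORBIDDEN}]*")


def is_unquoted_identifier(text: str) -> bool:
    """Return True if text can be written as an unquoted identifier."""
    return _UNQUOTED.fullmatch(text) is not None
```

For example, `is_unquoted_identifier("a:b")` and `is_unquoted_identifier("person#1")` are true, while `":a"`, `"-x"`, `"a b"`, and the empty string are rejected and would have to be written as quoted strings.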
3.3 Nodes
A node consists of the following elements, given in this order and separated by delimiting whitespace:
- a REQUIRED identifier
- an OPTIONAL list of labels
- an OPTIONAL list of properties
3.3.1 Node merging
A node can be defined with multiple statements having the same node identifier: a node is merged with an existing node by appending labels and property values.
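Node merging can be sketched as follows in Python, assuming nodes are kept in a dictionary keyed by node identifier. The helper name `merge_node` and the dictionary layout are assumptions of this example. Repeated labels are ignored and property value lists are concatenated, as described in the sections on labels and property merging.

```python
def merge_node(nodes, node_id, labels=(), properties=None):
    """Add a node statement to `nodes`, merging with an existing node if present."""
    node = nodes.setdefault(node_id, {"id": node_id, "labels": [], "properties": {}})
    for label in labels:
        # Labels are unique: a repeated label is ignored, order is preserved.
        if label not in node["labels"]:
            node["labels"].append(label)
    for key, values in (properties or {}).items():
        # Value lists are not sets: values are appended even if already present.
        node["properties"].setdefault(key, []).extend(values)
    return node
```

Calling `merge_node(nodes, "a", ["x"], {"k": [1]})` followed by `merge_node(nodes, "a", ["y", "x"], {"k": [1, 2]})` leaves node `a` with labels `["x", "y"]` and property `k` with values `[1, 1, 2]`.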
3.3.2 Implicit nodes
Nodes can also be defined implicitly as part of an edge: node identifiers referenced in edges imply the existence of nodes with these identifiers.
3.4 Edges
An edge consists of the following elements, given in this order and separated by delimiting whitespace:
- an OPTIONAL edge identifier
- a REQUIRED source node identifier
- a REQUIRED direction
- a REQUIRED target node identifier
- an OPTIONAL list of labels
- an OPTIONAL list of properties
3.4.1 Edge identifiers
An edge identifier is an identifier as the first element of an edge statement, directly followed by a colon (U+003A).
Colons are not forbidden in edge identifiers: in the statement a:b: x -> y the edge identifier is a:b.
Edge identifiers MUST NOT be repeated.
No space is allowed between the edge identifier and its colon: the statement a : x -> y does not define an edge with identifier a.
3.4.2 Edge directions
The direction element of an edge is either the character sequence -> for a directed edge or the character sequence -- for an undirected edge.
3.4.3 Loops
Edges can connect a node to itself.
3.4.4 Multi-edges
The Property Graph Data Model allows for multiple edges between the same nodes.
Edge identifiers can be used to identify and reference individual multi-edges.
3.5 Labels
A label is an identifier following a colon (U+003A). Spaces between colon and label identifier are OPTIONAL but NOT RECOMMENDED.
Labels of a node or an edge are unique: repeated labels are ignored. Applications SHOULD preserve the order of labels of a node or an edge.
Colons are not forbidden in labels: for example, the statement a :b:c defines a node a with the single label b:c.
3.6 Properties
A property consists of the following elements, given in this order:
- a REQUIRED property key, being an identifier
- a colon (U+003A)
- a non-empty list of property values, separated by comma (U+002C)
Each property value MAY be preceded and followed by delimiting whitespace. If the property key is an unquoted identifier and no delimiting whitespace is given before the first value, then the property key MUST NOT contain a colon.
3.6.1 Property values
A property value is one of
- a number value, given as defined in section 6 of RFC 8259. As mentioned there, implementations MAY set limits on the range and precision of numbers and double precision (IEEE754) is the most likely common limit.
- a boolean value, given as one of the literal character sequences true and false
- a string value, given as one of
  - a quoted string
  - an unquoted identifier not including a comma
The data type of a property value in PG Format is either string, number, or boolean.[2] Applications MAY internally map these types to other type systems. Values of the same property are allowed to have different data types.
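The distinction between the three data types can be sketched in Python for a value token that is not quoted. The function name `parse_unquoted_value` is hypothetical, the number pattern follows section 6 of RFC 8259, and mapping numbers to either `int` or `float` is an application-level choice of this sketch (PG Format itself only knows "number"). Quoted strings are always string values and are not handled here.

```python
import re

# Number syntax as defined in RFC 8259, section 6.
_NUMBER = re.compile(r"-?(?:0|[1-9][0-9]*)(?:\.[0-9]+)?(?:[eE][+-]?[0-9]+)?")


def parse_unquoted_value(token: str):
    """Map an unquoted property value token to bool, int, float, or str."""
    if token == "true":
        return True
    if token == "false":
        return False
    if _NUMBER.fullmatch(token):
        # Assumption of this sketch: integers stay int, everything else is float.
        return float(token) if any(c in token for c in ".eE") else int(token)
    # Anything else is a string value given as an unquoted identifier.
    return token
```

With this helper, `parse_unquoted_value("42")` yields the number 42, `parse_unquoted_value("true")` the boolean true, and `parse_unquoted_value("hello")` the string "hello".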
3.6.2 Property merging
Value lists of properties with the same property key are concatenated. Value lists are not sets: the same value can be included multiple times.
3.7 Quoted Strings
A quoted string starts with an apostrophe (“'”) or quotation mark (“"”) and ends with the same character. In between, all Unicode characters are allowed, except for the characters that MUST be escaped:
- apostrophe, when the string is quoted with apostrophe
- quotation mark, when the string is quoted with a quotation mark
- reverse solidus (\, U+005C)
- control characters U+0000 through U+001F except line feed (U+000A), carriage return (U+000D), and tab (U+0009)
All characters can be escaped as defined by the JSON specification (RFC 8259, section 7) with the addition of the two-character escape sequence \' to escape an apostrophe. Quoted strings in PG Format further differ from JSON by allowing string quoting with apostrophes in addition to quotation marks and by allowing unescaped line feed, carriage return, and tab.
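Decoding a quoted string can be sketched in Python as follows. The function name `unquote` is hypothetical. The sketch follows the escaping rules described above but, as a known limitation, does not combine UTF-16 surrogate pairs written as two consecutive \u escapes into a single character.

```python
import re

_SIMPLE = {'"': '"', "'": "'", "\\": "\\", "/": "/",
           "b": "\b", "f": "\f", "n": "\n", "r": "\r", "t": "\t"}
# A backslash followed by uXXXX or by one of the single-character escapes.
_ESCAPE = re.compile(r'''\\(u[0-9a-fA-F]{4}|["'\\/bfnrt])''')


def unquote(text: str) -> str:
    """Decode a PG Format quoted string, including its surrounding quotes."""
    if len(text) < 2 or text[0] not in ("'", '"') or text[-1] != text[0]:
        raise ValueError("not a quoted string")
    quote, body = text[0], text[1:-1]
    out, i = [], 0
    while i < len(body):
        ch = body[i]
        if ch == "\\":
            match = _ESCAPE.match(body, i)
            if match is None:
                raise ValueError("invalid escape sequence")
            esc = match.group(1)
            # \uXXXX has five characters after the backslash, others have one.
            out.append(chr(int(esc[1:], 16)) if len(esc) == 5 else _SIMPLE[esc])
            i = match.end()
        elif ch == quote or (ch < " " and ch not in "\n\r\t"):
            # The enclosing quote character and most control characters
            # must be escaped; line feed, carriage return and tab need not be.
            raise ValueError("character must be escaped")
        else:
            out.append(ch)
            i += 1
    return "".join(out)
```

Note that the other quote character may appear unescaped: a quotation mark inside an apostrophe-quoted string is accepted as is, and vice versa.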
3.8 Whitespace
A line break is either a line feed (U+000A) or a carriage return (U+000D) optionally followed by a line feed.
Spaces are a non-empty sequence of space (U+0020) and/or tab (U+0009).
A comment begins with a hash (#, U+0023) and ends before the next line break or at the end of the document.
Delimiting whitespace separates elements of a statement. Delimiting whitespace consists of an optional sequence of spaces, comments, and/or line breaks, and it ends with spaces. The inclusion of line breaks in delimiting whitespace is also called line folding.
3.9 Grammar
The formal grammar of PG Format is specified in the EBNF notation used in the specification of XML, with the addition of a negative lookahead operator (!A B matches any expression B that does not start with expression A) and the terminal symbol END denoting the end of a document.
/* 3.1 Basic Structure */
PG ::= ( Statement? Empty LineBreak )* Statement? Empty
Statement ::= Edge | Node
/* 3.2 Identifiers */
Identifier ::= QuotedId | UnquotedStart UnquotedChar*
UnquotedChar ::= [^#x00-#x20<>"{}|^`\]
UnquotedStart ::= !['":#,-] UnquotedChar
/* 3.3 Nodes & 3.4 Edges */
Node ::= Identifier Labels Properties
Edge ::= ( EdgeIdentifier )? EdgeNodes Labels Properties
EdgeIdentifier ::= QuotedKey DWS | UnquotedKey DWS
EdgeNodes ::= Identifier DWS Direction DWS Identifier
Direction ::= "--" | "->"
/* 3.5 Labels */
Labels ::= ( DWS Label )*
Label ::= ":" Spaces? Identifier
/* 3.6 Properties */
Properties ::= ( DWS Property )*
Property ::= Key ValueList
Key ::= QuotedKey | UnquotedKey DWS | PlainKey
QuotedKey ::= QuotedId ":"
UnquotedKey ::= UnquotedStart ( ( !":" UnquotedChar )* ":" )+
PlainKey ::= UnquotedStart ( !":" UnquotedChar )* ":"
/* 3.6.1 Property Values */
ValueList ::= DWS? Value ( DWS? "," DWS? Value )*
Value ::= Number | Boolean | QuotedString | UnquotedValue
Number ::= "-"? ("0" | [1-9] [0-9]*) ( "." [0-9]+ )? ([eE] [+-]? [0-9]+)?
Boolean ::= "true" | "false"
UnquotedValue ::= UnquotedStart (!"," UnquotedChar)*
/* 3.7 Quoted Strings */
QuotedString ::= "'" SingleQuoted* "'" | '"' DoubleQuoted* '"'
QuotedId ::= "'" SingleQuoted+ "'" | '"' DoubleQuoted+ '"'
SingleQuoted ::= Unescaped | '"' | Escaped
DoubleQuoted ::= Unescaped | "'" | Escaped
Unescaped ::= [^#x00-#x08#x0B#x0C#x0E-#x1F"'\]+
Escaped ::= "\" ( '"' | "'" | "\" | "/" | [bfnrt] | "u" Hex Hex Hex Hex )
Hex ::= [0-9a-fA-F]
/* 3.8 Whitespace */
Spaces ::= [#x20#x09]+
LineBreak ::= [#x0A] | [#x0D] [#x0A]?
Comment ::= "#" [^#x0D#x0A]*
Empty ::= Spaces? Comment?
DWS ::= (Empty LineBreak)* Spaces
4 PG-JSON
A PG-JSON document serializes a property graph in JSON. A PG-JSON document is a JSON document (RFC 8259) with a JSON object with exactly two fields:
- nodes: an array of nodes
- edges: an array of edges
Each node is a JSON object with exactly three fields:
- id: the node identifier, being a non-empty string. Node identifiers MUST be unique per PG-JSON document.
- labels: an array of labels, each being a non-empty string. Labels MUST be unique per node. The array SHOULD be sorted by Unicode codepoints.
- properties: a JSON object mapping non-empty strings as property keys to non-empty arrays of scalar JSON values (string, number, boolean) as property values.
Each edge is a JSON object with two optional and four mandatory fields:
- id: (optional) the edge identifier, being a non-empty string, or the value null, equivalent to no edge identifier. Edge identifiers MUST be unique per PG-JSON document.
- undirected: (optional) a boolean value indicating whether the edge is undirected
- from: the identifier of the source node from the nodes array
- to: the identifier of the target node from the nodes array
- labels and properties: as defined above for nodes
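The PG-JSON rules above can be checked on a parsed document as sketched below in Python. This is an illustrative validator, not the JSON Schema from the specification repository: the function names and error messages are assumptions, and the sketch assumes that nodes and edges are arrays of JSON objects (as returned by `json.loads`).

```python
SCALARS = (str, int, float, bool)
NODE_FIELDS = {"id", "labels", "properties"}
EDGE_REQUIRED = {"from", "to", "labels", "properties"}
EDGE_OPTIONAL = {"id", "undirected"}


def _check_labels_and_properties(obj, where, errors):
    labels = obj.get("labels")
    if not isinstance(labels, list) or not all(isinstance(x, str) and x for x in labels):
        errors.append(f"{where}: labels must be an array of non-empty strings")
    elif len(set(labels)) != len(labels):
        errors.append(f"{where}: labels must be unique")
    props = obj.get("properties")
    if not isinstance(props, dict):
        errors.append(f"{where}: properties must be an object")
        return
    for key, values in props.items():
        # Keys are non-empty strings, values non-empty arrays of scalars.
        if not (key and isinstance(values, list) and values
                and all(isinstance(v, SCALARS) for v in values)):
            errors.append(f"{where}: invalid property {key!r}")


def _check_id(value, seen, where, errors):
    if not isinstance(value, str) or not value:
        errors.append(f"{where}: id must be a non-empty string")
    elif value in seen:
        errors.append(f"{where}: duplicate id {value!r}")
    else:
        seen.add(value)


def validate_pg_json(doc):
    """Return a list of error messages; an empty list means no problem was found."""
    if not isinstance(doc, dict) or set(doc) != {"nodes", "edges"}:
        return ["document must be an object with exactly the fields nodes and edges"]
    errors, node_ids, edge_ids = [], set(), set()
    for i, node in enumerate(doc["nodes"]):
        where = f"nodes[{i}]"
        if set(node) != NODE_FIELDS:
            errors.append(f"{where}: expected exactly the fields id, labels, properties")
        _check_id(node.get("id"), node_ids, where, errors)
        _check_labels_and_properties(node, where, errors)
    for i, edge in enumerate(doc["edges"]):
        where = f"edges[{i}]"
        if not EDGE_REQUIRED <= set(edge) <= EDGE_REQUIRED | EDGE_OPTIONAL:
            errors.append(f"{where}: missing or unknown fields")
        if edge.get("id") is not None:  # null is equivalent to no identifier
            _check_id(edge["id"], edge_ids, where, errors)
        if not isinstance(edge.get("undirected", False), bool):
            errors.append(f"{where}: undirected must be a boolean")
        for end in ("from", "to"):
            if edge.get(end) not in node_ids:
                errors.append(f"{where}: {end} does not reference a node in nodes")
        _check_labels_and_properties(edge, where, errors)
    return errors
```

A document with one node `a` and an edge from `a` to an undefined node `c` would yield a single error for the `to` field, because PG-JSON has no implicit nodes.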
5 PG-JSONL
A PG-JSONL document or stream serializes a property graph in JSON Lines format, also known as newline-delimited JSON. A PG-JSONL document is a sequence of JSON objects, separated by line separator (U+000A) and optional whitespace (U+0020, U+0009, and U+000D) around JSON objects, and an optional line separator at the end. Each object is
- either a node with field type having the string value "node" and the same mandatory node fields from PG-JSON format,
- or an edge with field type having the string value "edge" and the same mandatory edge fields from PG-JSON format.
Node objects SHOULD be given before their node identifiers are referenced in an edge object, but applications MAY also create implicit node objects for these cases. Applications MAY allow multiple node objects with identical node identifier in PG-JSONL but they MUST make clear whether nodes with repeated identifiers are ignored, merged into existing nodes, or replace existing nodes.
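Reading PG-JSONL into a PG-JSON structure can be sketched in Python as follows. The function name `read_pg_jsonl` and its `on_repeat` parameter are hypothetical; the parameter makes the handling of repeated node identifiers explicit, as required above. The merge behaviour (unique labels, concatenated value lists) is borrowed from PG Format node merging as an assumption, and the sketch tolerates empty lines as a leniency.

```python
import json


def read_pg_jsonl(text, on_repeat="merge"):
    """Parse PG-JSONL text; on_repeat is one of "ignore", "merge", "replace"."""
    if on_repeat not in ("ignore", "merge", "replace"):
        raise ValueError("unknown policy for repeated node identifiers")
    nodes, edges, implicit = {}, [], set()
    for line in text.split("\n"):
        line = line.strip(" \t\r")
        if not line:
            continue  # leniency of this sketch: skip empty lines
        obj = json.loads(line)
        kind = obj.pop("type", None)
        if kind == "node":
            node_id = obj["id"]
            if node_id not in nodes or node_id in implicit or on_repeat == "replace":
                # First explicit definition, or replacement of an earlier one.
                nodes[node_id] = obj
                implicit.discard(node_id)
            elif on_repeat == "merge":
                old = nodes[node_id]
                old["labels"] += [x for x in obj["labels"] if x not in old["labels"]]
                for key, values in obj["properties"].items():
                    old["properties"].setdefault(key, []).extend(values)
            # on_repeat == "ignore": keep the existing node unchanged
        elif kind == "edge":
            # Create implicit nodes for identifiers not seen so far.
            for ref in (obj["from"], obj["to"]):
                if ref not in nodes:
                    nodes[ref] = {"id": ref, "labels": [], "properties": {}}
                    implicit.add(ref)
            edges.append(obj)
        else:
            raise ValueError("each object needs type \"node\" or \"edge\"")
    return {"nodes": list(nodes.values()), "edges": edges}
```

Because each line is processed on its own, the same loop also works on a stream by iterating over lines as they arrive instead of splitting a complete string.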
6 References
6.1 Normative References
Bradner, S.: Key words for use in RFCs to Indicate Requirement Levels. BCP 14, RFC 2119, March 1997, http://www.rfc-editor.org/info/rfc2119.
Bray, T.: The JavaScript Object Notation (JSON) Data Interchange Format. RFC 8259, December 2017. https://tools.ietf.org/html/rfc8259
Bray, T. et al: Section 6 (Simple Extended Backus-Naur Form (EBNF) notation). In: W3C Extensible Markup Language (XML) 1.0 (Fifth Edition). November 2008. https://www.w3.org/TR/REC-xml/#sec-notation
Leiba, B.: Ambiguity of Uppercase vs Lowercase in RFC 2119 Key Words. BCP 14, RFC 8174, May 2017, http://www.rfc-editor.org/info/rfc8174.
The Unicode Consortium: The Unicode Standard. http://www.unicode.org/versions/latest/
Yergeau, F.: UTF-8, a transformation format of ISO 10646. RFC 3629, November 2003. https://tools.ietf.org/html/rfc3629
6.2 Informative references
Property Graph Exchange Format Homepage https://pg-format.github.io/ including PG Test Suite, PG Syntax Highlighting, PG Format discussion forum, and links to implementations.
JSON Schema schema language
Chiba, H., Yamanaka, R., Matsumoto, S.: Property Graph Exchange Format. 2019
Appendices
The following information is non-normative.
JSON Schemas
The PG-JSON format can be validated with a non-normative JSON Schema file pg-json.json in the specification repository. Rules not covered by the JSON schema include:
- nodes referenced in edges must be defined (no implicit nodes)
- node identifiers must be unique per graph
- edge identifiers must be unique per graph
The PG-JSONL format can be validated with a non-normative JSON Schema file pg-jsonl.json in the specification repository. Validation is limited in the same way as validation of PG-JSON with its JSON Schema.
Changes
This document is managed in a revision control system at https://github.com/pg-format/specification, including an issue tracker.
Version 1.0.0
Introduced comments, line folding, edge identifiers. Aligned property values with JSON syntax. Added more formal rules for quoted strings and unquoted identifiers. Added PG-JSONL. Changed node identifiers to be strings.
Version 0.3
Less formal specification first published in 2019. See latest draft from 2020.
Acknowledgements
Many thanks to Ryota Yamanaka (Meer Consulting Group) and Shota Matsumoto (Lifematics Inc.) for their contribution to the first versions of PG Format.
Footnotes
[1] This definition is equivalent to the definition of IRI references in SPARQL and in Turtle excluding empty strings, escape sequences and forbidding some start characters.
[2] This is identical to scalar JSON values (string, number, boolean) and every serialized JSON scalar is a valid property value in PG Format.