Bulk APOC JSON Load


For a faster bulk JSON load, avoid embedding the JSON directly in a Cypher query (e.g. via WITH/UNWIND) and instead look at the apoc.load.json and apoc.periodic.iterate APOC procedures.

From the APOC User Guide:-

With apoc.periodic.iterate you provide two statements: the first, outer statement provides a stream of values to be processed; the second, inner statement processes one element at a time, or, with iterateList: true, a whole batch at a time.
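As a minimal sketch of that shape (the ner_processed property here is made up purely for illustration):

CALL apoc.periodic.iterate(
    "MATCH (c:Crime) RETURN c",          // outer statement: stream of values
    "SET c.ner_processed = true",        // inner statement: applied per element/batch
    {batchSize: 10000, iterateList: true}
);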

With apoc.load.json it is now very easy to load JSON data from any file or URL, which avoids inserting the JSON directly into a script.
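For example, a quick way to peek at a file before committing to a full load (assuming ner.json, as shown further down, has been placed in Neo4j's import directory):

CALL apoc.load.json('file:///ner.json')
YIELD value
RETURN value.id AS id, size(value.entities) AS entityCount
LIMIT 5;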

First, run a Neo4j instance:-

docker run -e NEO4J_AUTH=none \
  -e NEO4J_dbms_security_procedures_unrestricted=apoc.\\\* \
  -e NEO4J_apoc_import_file_enabled=true \
  -e NEO4J_dbms_memory_pagecache_size=4G \
  -e NEO4J_dbms_memory_heap_max__size=4G \
  --rm \
  --name img \
  --publish=7474:7474 \
  --publish=7687:7687 \
  -v "$(pwd)"/proj/data:/data \
  -v "$(pwd)"/proj/import:/var/lib/neo4j/import \
  -v "$(pwd)"/proj/plugins:/plugins \
  -v "$(pwd)"/proj/conf:/var/lib/neo4j/conf \
  neo4j
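This assumes the APOC jar has already been dropped into ./proj/plugins. With auth disabled as above, a quick sanity check that APOC actually loaded (cypher-shell ships inside the official image):

docker exec img cypher-shell "RETURN apoc.version();"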

Combining both of these APOC calls, we can take an input (in this case a list of named-entity recognition results):-

[
  {
    "confidence": 99.0,
    "entities": [
      {
        "entity": ["NEO4J"],
        "tag": "I-ORG"
      }
    ],
    "id": "11819",
    "locale": "en",
    "read_bytes": 1214
  },
  {
    "confidence": 99.0,
    "entities": [
      {
        "entity": ["ATLASSIAN"],
        "tag": "I-ORG"
      },
      {
        "entity": ["APPLE"],
        "tag": "I-ORG"
      }
    ],
    "id": "11820",
    "locale": "en",
    "read_bytes": 1186
  }
]
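Note that each entity value is a list of tokens; in the query below they are concatenated into a single name with reduce. A throwaway illustration of that trick, not part of the load itself:

WITH ['SAN', 'FRANCISCO'] AS tokens
RETURN trim(reduce(s = '', x IN tokens | s + x + ' ')) AS name
// -> "SAN FRANCISCO"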

And load it like so:-

CALL apoc.periodic.iterate("
    CALL apoc.load.json('file:///ner.json')
    YIELD value AS parsedResponse RETURN parsedResponse
", "
MATCH (c:Crime) WHERE c.source_key = parsedResponse.id

FOREACH(entity IN parsedResponse.entities |

    // Person
    FOREACH(_ IN CASE WHEN entity.tag = 'I-PER' THEN [1] ELSE [] END |
        MERGE (p:NER_Person {name: trim(reduce(s = \"\", x IN entity.entity | s + x + \" \"))})
        MERGE (p)<-[:CONTAINS_ENTITY]-(c)
    )

    // Organization
    FOREACH(_ IN CASE WHEN entity.tag = 'I-ORG' THEN [1] ELSE [] END |
        MERGE (o:NER_Organization {name: trim(reduce(s = \"\", x IN entity.entity | s + x + \" \"))})
        MERGE (o)<-[:CONTAINS_ENTITY]-(c)
    )

    // Location
    FOREACH(_ IN CASE WHEN entity.tag = 'I-LOC' THEN [1] ELSE [] END |
        MERGE (l:NER_Location {name: trim(reduce(s = \"\", x IN entity.entity | s + x + \" \"))})
        MERGE (l)<-[:CONTAINS_ENTITY]-(c)
    )
)",
{
    batchSize: 10000,
    iterateList: true
}
);
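For the MATCH on c.source_key and the MERGEs on the NER_* names to stay fast across batches, the relevant properties should be indexed before the load runs. A sketch, assuming Neo4j 4.x+ syntax and made-up index names:

CREATE INDEX crime_source_key IF NOT EXISTS FOR (c:Crime) ON (c.source_key);
CREATE INDEX ner_person_name IF NOT EXISTS FOR (p:NER_Person) ON (p.name);
CREATE INDEX ner_org_name IF NOT EXISTS FOR (o:NER_Organization) ON (o.name);
CREATE INDEX ner_loc_name IF NOT EXISTS FOR (l:NER_Location) ON (l.name);

Afterwards, a quick way to confirm the load did what was expected:

MATCH (c:Crime)-[:CONTAINS_ENTITY]->(o:NER_Organization)
RETURN o.name AS organisation, count(*) AS mentions
ORDER BY mentions DESC
LIMIT 10;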

This completes much more quickly than the alternatives and commits the data gradually in batches rather than in one huge transaction, which matters especially for large datasets.