For a faster bulk JSON load, avoid embedding the JSON directly in a Cypher query (e.g. via WITH/UNWIND) and instead look at the apoc.load.json and apoc.periodic.iterate APOC procedures.
From the APOC User Guide:-
With apoc.periodic.iterate you provide 2 statements: the first, outer statement provides a stream of values to be processed; the second, inner statement processes one element at a time, or with iterateList:true the whole batch at a time.
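As a minimal sketch of that two-statement shape (the :Person label and processed flag here are illustrative assumptions, not part of this dataset):-

CALL apoc.periodic.iterate(
  // outer statement: streams the values to process
  "MATCH (p:Person) RETURN p",
  // inner statement: applied to each element (or each batch with iterateList:true)
  "SET p.processed = true",
  {batchSize: 10000, iterateList: true}
);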
With apoc.load.json, it’s now very easy to load JSON data from any file or URL and avoid inserting the JSON directly into a script.
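For example, to stream a document and inspect the first few parsed values (sample.json is a hypothetical file sitting in the import directory):-

CALL apoc.load.json('file:///sample.json')
YIELD value
RETURN value
LIMIT 5;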
First, run a Neo4j instance:-
docker run -e NEO4J_AUTH=none \
    -e NEO4J_dbms_security_procedures_unrestricted=apoc.\\\* \
    -e NEO4J_apoc_import_file_enabled=true \
    -e NEO4J_dbms_memory_pagecache_size=4G \
    -e NEO4J_dbms_memory_heap_max__size=4G \
    --rm \
    --name img \
    --publish=7474:7474 \
    --publish=7687:7687 \
    -v "$PWD"/proj/data:/data \
    -v "$PWD"/proj/import:/var/lib/neo4j/import \
    -v "$PWD"/proj/plugins:/plugins \
    -v "$PWD"/proj/conf:/var/lib/neo4j/conf \
    neo4j
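A couple of assumptions baked into this: the APOC jar needs to already be in ./proj/plugins, and ner.json should be dropped into ./proj/import, since that is where file:/// URLs resolve. One way to confirm APOC is loaded once the container is up:-

echo "CALL apoc.help('load.json');" | docker exec -i img cypher-shell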
Combining both of these APOC calls, we can take an input (in this case a list of named-entity recognition results):-
[
{
"confidence": 99.0,
"entities": [
{
"entity": ["NEO4J"],
"tag": "I-ORG"
}
],
"id": "11819",
"locale": "en",
"read_bytes": 1214
},
{
"confidence": 99.0,
"entities": [
{
"entity": ["ATLASSIAN"],
"tag": "I-ORG"
},
{
"entity": ["APPLE"],
"tag": "I-ORG"
}
],
"id": "11820",
"locale": "en",
"read_bytes": 1186
}
]
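Note that the load query below MATCHes existing :Crime nodes on source_key rather than creating them, so those nodes need to be in the graph first. A minimal setup sketch (the sample nodes are illustrative assumptions; the indexes are optional but speed up the MATCH and MERGE lookups considerably during a bulk load, shown here in Neo4j 3.x syntax):-

CREATE INDEX ON :Crime(source_key);
CREATE INDEX ON :NER_Person(name);
CREATE INDEX ON :NER_Organization(name);
CREATE INDEX ON :NER_Location(name);
MERGE (:Crime {source_key: '11819'});
MERGE (:Crime {source_key: '11820'});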
And load it like so:-
CALL apoc.periodic.iterate("
  CALL apoc.load.json('file:///ner.json')
  YIELD value AS parsedResponse RETURN parsedResponse
", "
  MATCH (c:Crime) WHERE c.source_key = parsedResponse.id
  FOREACH(entity IN parsedResponse.entities |
    // Person
    FOREACH(_ IN CASE WHEN entity.tag = 'I-PER' THEN [1] ELSE [] END |
      MERGE (p:NER_Person {name: trim(reduce(s = '', x IN entity.entity | s + x + ' '))})
      MERGE (p)<-[:CONTAINS_ENTITY]-(c)
    )
    // Organization
    FOREACH(_ IN CASE WHEN entity.tag = 'I-ORG' THEN [1] ELSE [] END |
      MERGE (o:NER_Organization {name: trim(reduce(s = '', x IN entity.entity | s + x + ' '))})
      MERGE (o)<-[:CONTAINS_ENTITY]-(c)
    )
    // Location
    FOREACH(_ IN CASE WHEN entity.tag = 'I-LOC' THEN [1] ELSE [] END |
      MERGE (l:NER_Location {name: trim(reduce(s = '', x IN entity.entity | s + x + ' '))})
      MERGE (l)<-[:CONTAINS_ENTITY]-(c)
    )
  )",
  {
    batchSize: 10000,
    iterateList: true
  }
);
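Once the load finishes, a quick sanity check might be to count the entities attached to each crime:-

MATCH (c:Crime)-[:CONTAINS_ENTITY]->(e)
RETURN c.source_key AS crime, labels(e) AS entityType, count(e) AS entities
ORDER BY crime;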
This completes much more quickly than the alternatives and loads the data gradually in batches, which makes a real difference on large datasets.