Graph databases excel at modeling data where relationships are first-class citizens. The relationships exist as a permanent part of the data set rather than as an ephemeral join created only when a query is run. This opens up a wide range of analytical opportunities that aren’t natural in a traditional “relational” database. Once you understand graph data structures, you start to see graphs everywhere.
Neo4j is a vendor on the cutting edge of the graph database world. Their Cypher query language makes building and querying graphs easy, and their browser application is one of the most visually appealing database interfaces I’ve ever worked with. In conjunction with GrapheneDB, a Neo4j managed hosting provider, any developer can get up and running with a graph database in about 20 minutes.
We at IP Street view patents not only as assets which convey rights but also as government-verified relationships between people and technologies, technologies and companies, and people and companies. These relationships are far more robust than the self-reported relationships on LinkedIn or Facebook, and they provide deep insight into what people are good at and what companies are interested in developing. Graph databases are a great way to model this domain, and the IP Street patent data API makes this exceedingly easy to do.
To show off how cool graph databases can be, we wanted to provide a code walkthrough of how to access patent data with the IP Street API, parse it to nodes and relationships, write them to Neo4j, and then run some interesting queries to identify a worrying case of Hit-By-A-Bus Syndrome at Magic Leap. More on that below.
For this walkthrough, we are going to be specifically looking at inventor networks using IP Street’s new Python SDK and Neo4j v3 running on a GrapheneDB managed instance. We are going to build a small database which contains companies, inventors, and patents, and the relationships between them.
Our SDK and this full code example can be found on our GitHub repository. You will need an IP Street API key to run the example but you can start a trial account by signing up here.
The first step is to model the data in our domain. One way to do this is to describe the entities and relationships in our data in words. For our inventor network example, we can say:
"Inventors invented Patents"
"Companies own Patents"
The nouns in our textual description should be treated as nodes, and the verbs should be treated as relationships. In Neo4j, relationships can also be directional, depending on whether the verb’s action flows only one way or holds in either direction. Ryan Boyd from the Neo4j developer relations team has a great video on graph database modeling best practices. For our example, writing out this description suggests that we have three nodes, “inventors”, “patents”, and “companies”, and two relationships, “invented” and “own”.
Python’s class system is great for implementing these nodes and relationships in code. Keeping it simple, we’ll just say that Person and Company nodes have only two attributes, node_type and full_name. Patents have a node_type but also have a title, a grant_number, a publication_number, and an application_number.
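Here's a minimal sketch of what those node classes could look like. It uses dataclasses purely for brevity, which is our choice for this illustration and not necessarily how the example repository defines them. The attribute names come from the description above; the sample title and the empty publication and application numbers are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Person:
    full_name: str
    node_type: str = "Person"


@dataclass(frozen=True)
class Company:
    full_name: str
    node_type: str = "Company"


@dataclass(frozen=True)
class Patent:
    title: str
    grant_number: str
    publication_number: str
    application_number: str
    node_type: str = "Patent"


# Placeholder values, only here to show how the classes are instantiated.
inventor = Person(full_name="Jon Doe")
patent = Patent(
    title="Placeholder title",
    grant_number="7477713",
    publication_number="",
    application_number="",
)
```

Freezing the dataclasses makes the nodes hashable, which is handy later if you want to drop duplicates with a set before writing.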
After creating node classes, we need a way to connect to our Neo4j database. Earlier this year, Neo4j launched its new Bolt connection protocol, which uses a compact binary encoding over TCP or WebSockets for higher throughput and lower latency. Fortunately for us, Neo4j has also created a Python driver for Bolt, making it easy to wrap in a writer client class.
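A writer client along those lines might look like the sketch below. The class name Neo4jWriter and its methods are hypothetical, not the names used in our repository. It assumes a current release of the official neo4j Python package, where GraphDatabase.driver opens the connection and session.run sends a statement. The URI and credentials in the usage comment are placeholders.

```python
class Neo4jWriter:
    """Thin wrapper that sends Cypher statements over a Bolt driver."""

    def __init__(self, driver):
        # `driver` is any object whose .session() returns a context manager
        # with a .run(query, **params) method, such as the official driver.
        self._driver = driver

    @classmethod
    def connect(cls, uri, user, password):
        # Imported lazily so the class can be exercised without the package.
        from neo4j import GraphDatabase

        return cls(GraphDatabase.driver(uri, auth=(user, password)))

    def run(self, query, **params):
        # Materialize the records before the session closes.
        with self._driver.session() as session:
            return list(session.run(query, **params))

    def close(self):
        self._driver.close()


# Usage (placeholder URI and credentials):
# writer = Neo4jWriter.connect("bolt://localhost:7687", "neo4j", "password")
# writer.run("MERGE (p:Person {full_name: $name})", name="Jon Doe")
# writer.close()
```

Passing the driver into the constructor keeps the class easy to test with a stand-in driver, so you can check your queries before pointing it at a real database.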
With our nodes defined and our Neo4j client set up, it’s now time to query some data from IP Street. Our Python SDK is a wrapper around the IP Street API and it does a good job of abstracting away issues like pagination and repeating queries. Virtual Reality is hot these days so, for our example, let’s get all the patents owned by Magic Leap, Inc.
To run any query, first instantiate a client object with an API key. Next, instantiate a query then add parameters to your query. Finally, send your query with the client. A pythonic list of dicts will be returned ready to work with.
For every patent in the results, we want to write a person node, a patent node, and a company node. Writes to the Neo4j database are expressed in its own query language, called Cypher. Cypher is intended to visually represent the nodes and relationships you are querying. It feels like pictograms: () describes a node and --> describes a relationship. It’s very human readable and most people catch on really quickly.
Just like SQL, Cypher queries start with a keyword. We use MERGE in this case instead of CREATE because we don’t want duplicate nodes: MERGE only writes a new node if a matching one doesn’t already exist. Implementing these node MERGE transactions can be done in our Python script by instantiating a node class and passing it to the appropriate Neo4j client write method.
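To make the MERGE idea concrete, here is a small sketch that builds a MERGE statement for one node. The helper name merge_node_statement is hypothetical. It uses query parameters with the $name syntax of current Neo4j releases, which is a swap on our part, because it avoids quoting problems when a name or title contains a quote character.

```python
def merge_node_statement(label, key, properties):
    """Build a parameterized MERGE for a single node.

    MERGE matches on the `key` property only, so re-running the load does
    not create duplicates; the remaining properties are set afterwards.
    """
    # Labels and property names cannot be sent as parameters, so they are
    # checked before being placed into the query text.
    if not label.isidentifier() or not key.isidentifier():
        raise ValueError("label and key must be plain identifiers")
    query = f"MERGE (n:{label} {{{key}: $key_value}}) SET n += $props"
    params = {"key_value": properties[key], "props": dict(properties)}
    return query, params


query, params = merge_node_statement(
    "Person", "full_name", {"full_name": "Jon Doe", "node_type": "Person"}
)
print(query)
# MERGE (n:Person {full_name: $key_value}) SET n += $props
```

Merging on a single identifying property (full_name for people and companies, a patent number for patents) is a design choice of this sketch. Merging on every property at once would create a second node whenever any one attribute differs.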
After all the nodes are written, we can write the relationships between them. In Cypher, relationships are created by using MATCH to find two nodes based on some query parameters and assigning them to temporary variables. In this case, we find the Person node with the attribute full_name == “Jon Doe” and assign it to variable “a”. Then we find the Patent node with the attribute grant_number == “7477713” and assign it to variable “b”. The final line creates a new relationship from “a” to “b” with the relationship type “Invented” and assigns a priority_date attribute to the relationship.
These relationships can be written by passing two nodes into the appropriate Neo4j client write method. For our database this can be accomplished by passing a Person node and Patent node to write_person_to_patent and a Company node and Patent node to write_company_to_patent. These methods simply use string formatting and concatenation to create the correct Cypher query and then send it as a transaction with the Bolt driver.
Putting it all together at runtime, we instantiate the Neo4j client, instantiate the IP Street client, send the IP Street query, and then loop over the results, writing all nodes and relationships for every patent.
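The relationship queries described above can be sketched as follows. The “Invented” type and the priority_date attribute come from the description. The “Owns” type, the parameter names, and the helper function names are assumptions made for this illustration. Unlike the string formatting approach just mentioned, the sketch passes values as query parameters.

```python
# Find the two end nodes, then create the relationship between them.
PERSON_INVENTED_PATENT = """
MATCH (a:Person {full_name: $full_name}),
      (b:Patent {grant_number: $grant_number})
CREATE (a)-[:Invented {priority_date: $priority_date}]->(b)
"""

# "Owns" is an assumed relationship type name.
COMPANY_OWNS_PATENT = """
MATCH (a:Company {full_name: $full_name}),
      (b:Patent {grant_number: $grant_number})
CREATE (a)-[:Owns]->(b)
"""


def person_to_patent(full_name, grant_number, priority_date):
    """Return the query and parameters linking an inventor to a patent."""
    params = {
        "full_name": full_name,
        "grant_number": grant_number,
        "priority_date": priority_date,
    }
    return PERSON_INVENTED_PATENT, params


def company_to_patent(full_name, grant_number):
    """Return the query and parameters linking an owner to a patent."""
    return COMPANY_OWNS_PATENT, {
        "full_name": full_name,
        "grant_number": grant_number,
    }


# The date below is a placeholder, not real patent data.
query, params = person_to_patent("Jon Doe", "7477713", "2000-01-01")
# With the official driver this would be sent as: session.run(query, params)
```

If you expect to re-run the load, swapping CREATE for MERGE on the last line of each query keeps the relationships from being duplicated.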
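The loop itself can be sketched without touching either service. The record keys used below (grant_number, owner, inventors) are assumptions about the shape of each result, not the documented IP Street response format, and the sample record is made up. The function only plans the writes, nodes first and relationships second, so the ordering is easy to check before anything is sent to Neo4j.

```python
def plan_writes(results):
    """Turn result records into an ordered plan: unique nodes, then edges."""
    nodes, edges = [], []
    for record in results:
        patent = ("Patent", record["grant_number"])
        company = ("Company", record["owner"])
        nodes += [patent, company]
        edges.append((company, "Owns", patent))
        for name in record["inventors"]:
            person = ("Person", name)
            nodes.append(person)
            edges.append((person, "Invented", patent))
    # dict.fromkeys drops duplicate nodes while keeping first-seen order.
    return list(dict.fromkeys(nodes)), edges


# Made-up sample record in the assumed shape.
sample = [
    {
        "grant_number": "0000001",
        "owner": "Example Corp",
        "inventors": ["Jane Roe", "Jon Doe"],
    }
]
nodes, edges = plan_writes(sample)
for label, key in nodes:
    print("MERGE node:", label, key)
for start, rel_type, end in edges:
    print("CREATE edge:", start, rel_type, end)
```

Writing every node before any relationship matters because the relationship queries MATCH both end nodes first. If one is missing, nothing is created.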
Now comes the fun part. Let’s explore our data by querying it in the Neo4j web interface. You can find the web interface to your database by going to <url>. To run queries, you can just type Cypher directly into the search bar at the top.
Let’s start with a basic question: how many patents does Magic Leap own? Similar to your standard SQL query, this can be run with a one-liner.
Now let’s look at inventor networks by starting with Magic Leap’s founder, Rony Abovitz. How many patents does he own? What are they?
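Both questions can also be asked from Python. The sketch below assumes the labels and relationship types used earlier (“Owns” is our assumed name), and that names are stored in full_name exactly as written here. The helper fetch is hypothetical. It relies on the official driver's session.run and on each record's data() method.

```python
# How many patents does a company own?
COUNT_COMPANY_PATENTS = """
MATCH (:Company {full_name: $company})-[:Owns]->(p:Patent)
RETURN count(DISTINCT p) AS patents
"""

# Which patents is a given inventor named on?
PATENTS_BY_INVENTOR = """
MATCH (:Person {full_name: $inventor})-[:Invented]->(p:Patent)
RETURN p.grant_number AS grant_number, p.title AS title
"""


def fetch(session, query, **params):
    """Run a read query and return its rows as plain dicts."""
    return [record.data() for record in session.run(query, **params)]


# Usage with an open driver session:
# fetch(session, COUNT_COMPANY_PATENTS, company="Magic Leap, Inc.")
# fetch(session, PATENTS_BY_INVENTOR, inventor="Rony Abovitz")
```

In the web interface you can paste the same Cypher with each $parameter replaced by a quoted literal.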
Wow! Magic Leap owns 186 patents, and Rony Abovitz has his fingers in 169 of them. If Rony is really so innovative and at the center of nearly all innovation at the company, Magic Leap seems to suffer from a very serious Hit-By-A-Bus problem.
I wonder what areas of technology Rony’s fingers haven’t got into? Again, surprising. Of the 800+ employees at Magic Leap, only 8 have invented patents without Abovitz's oversight.
Not being an insider, it's really hard to say. It may just be that Mr. Abovitz likes to put his name on patent applications in a Dilbert-esque way. Either way, there's probably something wrong with this picture.
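A query behind that kind of count might look like the sketch below. It finds everyone who is named on at least one patent that the founder is not named on. The labels, the relationship type, and the parameter name are the same assumptions as before, and this is not necessarily the exact query we ran. The small Python function applies the same logic to an in-memory mapping, which is handy for sanity-checking the idea on made-up data.

```python
INVENTORS_WITHOUT_FOUNDER = """
MATCH (p:Person)-[:Invented]->(pat:Patent)
WHERE NOT (:Person {full_name: $founder})-[:Invented]->(pat)
RETURN DISTINCT p.full_name AS inventor
ORDER BY inventor
"""


def inventors_without(founder, inventors_by_patent):
    """Names on at least one patent that `founder` is not named on."""
    return sorted(
        {
            name
            for names in inventors_by_patent.values()
            if founder not in names
            for name in names
        }
    )


# Made-up sample: patent id mapped to the set of inventor names.
sample = {
    "P1": {"Founder", "A"},
    "P2": {"B"},
    "P3": {"A", "C"},
}
print(inventors_without("Founder", sample))
# ['A', 'B', 'C']
```

Note that “A” still shows up even though “A” also shares a patent with the founder. The question is whether a person has any patent without the founder, not whether all of their patents are.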