Detect Data Islands of Nodes and Edges using SQL on SQL Server

SQL developers can query hierarchy data using recursive CTE expression on SQL Server databases easily. Although SQL recursive CTE queries are advanced techniques for database developers which is not supported by all database providers like SAP HANA, it seems to be an option to work with connected nodes just like graph data. For example, to separate nodes connected to each other by edges into groups like islands seperating connected ones from others, many SQL programmers get stuck in recursive SQL queries.

In this SQL tutorial, I want to share a solution for SQL database developers showing how to group nodes which are connected into islands. I use the island term as grouped nodes which are related to each other but totally separated from nodes in other islands.

Following image is showing a group of nodes and relation between nodes (fromNode connected to toNode).
This fromNode and toNode naming calls a similarity for hierarchy data. On the other hand, it seems to be impossible for me at least to solve this grouping nodes into islands problem with recursive CTE query.

As seen in below image, nodes with same color have a relation and there is a path (ignoring the arrow direction and assuming relation is bidirectional) from one node in island to an arbitrary node in the same island.

edge and nodes grouped into data islands by sql code

My sample data has four islands, now I want to build a SQL script that will identify these four islands including the nodes and edges within each data island.

Before you execute the given SQL code script, maybe I can tell you how it works for you to make you understand it easily.

First of all, I create a temporary table and store my sample data in it.
This is the ##Edges table where I insert data rows representing edges (relation of fromNodes and toNodes)

A second temp table I use for calculating groups in SQL database is ##RelatedNodes.
This table has nodes field where I add every node (fromNode, toNode) which are related to each other.
If we continue to read the SQL code, you will see I create a SQL cursor and loop through all edges of my sample data and try to place these nodes in one of the rows of ##RelatedNodes temp table.

Basically, ##RelatedNodes temporary table is a sandbox where I insert independent edge data as a separate row.

If there is a relation of the current cursor row nodes with only one of the rows of the temp table, then I concatenate the loop nodes to the related data group in ##RelatedNodes temp table

One of the cases that database programmers cannot handle is the situation where the current row of the cursor has a relation with two of the rows (two previously calculated islands).
In this case the current loop data (fromNode and toNode) is a bridge between two different nodes.
This case requires removal of two different islands by replacing them with a single island representation.

All these calculations are managed within the SQL cursor.
The number of relation of the loop data is calculated into @cnt variable.

If you place your pseudo-code within the loop or query ##RelatedNodes temp table to understand how SQL script code works, you will see there are repeating nodes in table's nodes field concatenated and separated by ","" comma character.'
I did not follow a path to remove these repeats since it will cause more complexity in SQL code and might cause the performance be worse.

And one last note for performance. Although I mentioned a few times that recursive SQL CTE query is not the way to solve this problem, running it once and grouping hierarhical data at the beginning of the script can gain performance. After the first detection of related nodes, you will see recursive CTE fails to combine some data. At this point the following structure can be executed to fetch additional relations, to detect bridges and to merge different data islands into a single group of nodes.

Here is the SQL codes that can be executed on SQL Server database


set nocount on



-- sample data

create table ##Edges (

 NodeFrom int,

 NodeTo int

)



insert into ##Edges values (1,4),(10,4),(10,16),(16,17),(10,11),(3,11),(19,11),(19,20)

insert into ##Edges values (5,2),(2,21),(2,9),(14,9),(15,14)

insert into ##Edges values (7,12),(12,8),(8,13)

insert into ##Edges values (6,18)



-- required temporary table

create table ##RelatedNodes (id int identity(1,1), nodes varchar(max))



-- variables

declare @NodeFrom int

declare @NodeTo int

declare @cnt int

declare @Nodes1 varchar(max)

declare @Nodes2 varchar(max)

declare @Id1 int

declare @Id2 int



----------------------- SQL CURSOR ---------------------

DECLARE sqlcursor CURSOR FAST_FORWARD FOR

select

 NodeFrom, NodeTo

from ##Edges

Order By NodeFrom, NodeTo



OPEN sqlcursor



FETCH NEXT FROM sqlcursor INTO @NodeFrom, @NodeTo



WHILE @@FETCH_STATUS = 0

BEGIN

--- SQL Code in Cursor



 select

  Id

 into ##tblId

 from ##RelatedNodes

 where 

  nodes like ('%,' + convert(varchar(max),@NodeFrom) + ',%') or 

  nodes like ('%,' + convert(varchar(max),@NodeTo) + ',%')



 select @cnt = count(*) from ##tblId



 if @cnt = 0

 begin

  insert into ##RelatedNodes 

  select 

   case when @NodeFrom < @NodeTo 

    then (',' + convert(varchar(max),@NodeFrom) + ',' + convert(varchar(max),@NodeTo) + ',')

    else (',' + convert(varchar(max),@NodeTo) + ',' + convert(varchar(max),@NodeFrom) + ',')

   end

 end

 else if @cnt = 1

 begin



  update ##RelatedNodes

  set nodes = nodes + convert(varchar(max),@NodeFrom) + ',' + convert(varchar(max),@NodeTo) + ','

  from ##RelatedNodes rn

  inner join ##tblId t

   on rn.id = t.id

  

 end

 else if @cnt = 2

 begin



  select top 1

   @Id1 = rn.id,

   @Nodes1 = nodes

  from ##RelatedNodes rn

  inner join ##tblId t

   on rn.id = t.id

  order by t.id



  select top 1

   @Id2 = rn.id,

   @Nodes2 = nodes 

  from ##RelatedNodes rn

  inner join ##tblId t

   on rn.id = t.id

  order by t.id desc



  set @Nodes1 = @Nodes1 + @Nodes2



  update ##RelatedNodes

  set nodes = @Nodes1

  where Id = @Id1



  delete ##RelatedNodes where Id = @Id2



 end



 drop table ##tblId



--- SQL Code in Cursor

 FETCH NEXT FROM sqlcursor INTO @NodeFrom, @NodeTo

END



CLOSE sqlcursor

DEALLOCATE sqlcursor 

----------------------- SQL CURSOR ---------------------



select

 NodeFrom, NodeTo, GroupId

from ##Edges

inner join (

 select nodes, row_number() over (order by id) as GroupId from ##RelatedNodes

) g

 on g.nodes like ('%,' + convert(varchar(max),NodeFrom) + ',%')

order by GroupId, NodeFrom, NodeTo



drop table ##RelatedNodes

drop table ##Edges



set nocount off

Code

When above SQL code is executed on a SQL Server database, it will create sample data, calculate node relations in SQL cursor code and display grouped data as islands identified with GroupId in execution output

SQL Server graph data grouped as islands

To summarize, I know executing SQL scripts with cursors is not a best practise for database developers. I have to admit that I will accept a solution using a set based operation instead of SQL cursors and looping through each data row to perform grouping and building relations between nodes. But I worked to solve this problem, seperating nodes and edges into data islands as I shared in the image at the beginning of this SQL tutorial but failed to resolve the issue without the use of SQL cursors.

Please feel free to inform me for additional solutions of this SQL problem. I will be pleased especially if there is a graph solution of this grouping and categorizing of nodes and edges.