Find all cycles (or the initial node of each cycle) in a very large directed graph (12m nodes, 6.7m edges)

7 views (last 30 days)
Hi if anyone could help me out with this problem I have been stuck for a while and would be really grateful for your help,
Context
I have about 7 million transactions between accounts, where a transaction is a single row that describes who the buyer account and seller account are, and the object that was sold. So an example of one row would be: [B:0x3, S:0x1, O:Object1] or to better understand 0x1 sold Object1 to 0x3
I have concatenated the object to the buyer and seller aka 0x1-Object1 -> 0x3-Object1 (because an account can sell multiple different objects) and created a directed graph from this data to create a history of how all objects have transferred between accounts.
Problem:
I want a list of all the accounts where an object has returned back to the account it originally sold from, so for example:
If I had the transactions such that: Account1-Object1 -> Account2-Object1 -> Account1-Object1 then the output should be: Account1-Object1
because the asset eventually cycled back to Account1.
Current Non-Feasible Solution:
this is interpreted as: 0xc sold Object1 to 0x5, 0x5 sold Object2 to 0xc, 0xc sold Object3 to 0x0, 0x5 sold Object5 to 0x0, etc
My code so far:
%concatenate Object to buyer and seller
sales.BuyerAndObject = strcat(string(sales.Buyer),"-",string(sales.Object));
sales.SellerAndObject= strcat(string(sales.Seller),"-",string(sales.Object));
buyer = transpose(sales.BuyerAndObject);
seller = transpose(sales.SellerAndObject);
%Directed Graph: Seller -> Buyer
G = digraph(seller,buyer);
%I have then been trying to use allcycles(G) to find all the cycles,
cycles = allcycles(G)
then return the first node of each cycle, but the data is too massive as I have about 12 Million nodes and 6.7 MIllion edges and the time to completion is beyond reasonable. I am not great at graph theory and at a loss for how to go about what I need to do, and any help is greatly apreciated.

Accepted Answer

Christine Tobler
Christine Tobler on 9 Aug 2021
Edited: Christine Tobler on 9 Aug 2021
Have you tried using 'MaxNumCycles' to only compute the first 100 / 1000 / 1e4 cycles? That could give some indication if it's computing each cycle that's slower than acceptable for your problem, or if there are just very many cycles.
It might also be worthwhile to take a look at some of these cycles, to make sure they are all legitimate results (that is, perhaps you're seeing several cycles that can be recombined in different ways, which could cause there to be many more cycles returned than you'd expect).
  7 Comments
Christine Tobler
Christine Tobler on 10 Aug 2021
Thank you for posting on this! It's always good to see comments on newer functions; I'll see if we can't make allcycles be faster for this case in a future release - we may not have thought of the case when a graph has millions of nodes but very few cycles when checking performance of this function.
Christine Tobler
Christine Tobler on 9 May 2022
Here is the bug report for this issue:
The performance problem in allcycles for very large graphs with few cycles was resolved for R2021b Update 1 and for R2022a from the initial release.

Sign in to comment.

More Answers (0)

Categories

Find more on Graph and Network Algorithms in Help Center and File Exchange

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!