Parallel Indexing using RabbitMQ
By parallelizing the indexing process, you can significantly reduce the time required to index blockchain data
Parallelizing Indexing with RabbitMQ
Indexing many entities can take significant time per block. To achieve real-time indexing, the work can be split so that each worker indexes a smaller set of entities. After a worker indexes a block, it sends the resulting objects to RabbitMQ. Each entity type has its own routing key, which ensures the message is delivered to the appropriate queue while preserving indexing order.
For example, indexing logs requires blocks and transactions. The first worker sends indexed blocks and transactions to RabbitMQ with the routing key "block". The second worker, which has its own queue bound to the routing key "block", receives this message and starts indexing logs using the already indexed blocks and transactions. Meanwhile, the first worker continues indexing the next set of blocks.
Steps for Parallel Indexing
Set Up RabbitMQ Queues and Routing Keys
Define queues for each entity type.
Assign a routing key for each entity type to ensure messages are delivered to the correct queue.
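The queue and routing-key setup above can be sketched as follows. This is a minimal illustration, not the project's actual setup code: it assumes a pika-style channel object (exchange_declare, queue_declare, and queue_bind are pika BlockingChannel methods), a direct exchange, and a hypothetical "<exchange>.<entity>" queue naming scheme. Only the "block" routing key comes from this page; any other entity name is a placeholder.

```python
from typing import Any, Iterable


def declare_topology(channel: Any, exchange: str, entity_types: Iterable[str]) -> list[str]:
    """Declare one durable queue per entity type and bind it by routing key.

    `channel` is expected to behave like a pika BlockingChannel.
    The exchange type ("direct") and the queue naming are assumptions.
    """
    channel.exchange_declare(exchange=exchange, exchange_type="direct", durable=True)
    queues = []
    for entity in entity_types:
        queue = f"{exchange}.{entity}"  # hypothetical naming scheme
        channel.queue_declare(queue=queue, durable=True)
        # The routing key equals the entity type, so each message
        # reaches only the queue that was declared for that entity.
        channel.queue_bind(queue=queue, exchange=exchange, routing_key=entity)
        queues.append(queue)
    return queues
```

With a real broker, the channel would come from something like pika.BlockingConnection(...).channel(); passing the channel in keeps the function easy to exercise with a stub.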
Configure Workers
Assign specific entities to be indexed by each worker.
Ensure each worker listens to the appropriate RabbitMQ queue.
There must be one worker for the RabbitMQ messaging trigger.
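One way to express the worker configuration is a small declarative structure that can be checked before any worker starts. The sketch below is hypothetical: WorkerConfig and validate_workers are illustrative names, and the "one worker for the messaging trigger" requirement is read here as "exactly one worker has no input queue and emits the first messages", which is an interpretation rather than something this page states explicitly.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class WorkerConfig:
    name: str
    entities: tuple[str, ...]   # entities this worker indexes
    consumes: Optional[str]     # routing key of its input queue; None for the trigger worker
    publishes: Optional[str]    # routing key for its output; None if it publishes nothing


def validate_workers(workers: Sequence[WorkerConfig]) -> None:
    """Raise ValueError if the worker graph cannot run as a pipeline."""
    triggers = [w for w in workers if w.consumes is None]
    if len(triggers) != 1:
        raise ValueError(f"expected exactly one trigger worker, found {len(triggers)}")
    produced = {w.publishes for w in workers if w.publishes is not None}
    for w in workers:
        if w.consumes is not None and w.consumes not in produced:
            raise ValueError(f"{w.name} listens to '{w.consumes}', which no worker publishes")


# Example mirroring the two-worker setup: blocks and transactions first, logs second.
WORKERS = [
    WorkerConfig("block_worker", ("block", "transaction"), consumes=None, publishes="block"),
    WorkerConfig("log_worker", ("log",), consumes="block", publishes=None),
]
validate_workers(WORKERS)
```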
Indexing Workflow
Worker 1: Indexes blocks and transactions, then sends the indexed data to RabbitMQ with the routing key "block".
Worker 2: Listens to the queue for the "block" routing key, receives the indexed blocks and transactions, and begins indexing logs.
Worker 1: Continues to index the next set of blocks and transactions in parallel.
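The hand-off between the two workers can be simulated without a broker. In this sketch a standard-library queue.Queue stands in for the RabbitMQ queue bound to the "block" routing key, and the actual indexing work is omitted. It is only meant to show that the first worker keeps producing while the second consumes, and that a single consumer on a FIFO queue sees blocks in the order they were indexed. The function name run_pipeline is made up for the example.

```python
import queue
import threading


def run_pipeline(block_numbers):
    """Simulate worker 1 (blocks/transactions) feeding worker 2 (logs)."""
    block_queue = queue.Queue()  # stand-in for the queue bound to routing key "block"
    logs_indexed = []            # block numbers whose logs worker 2 has processed
    done = None                  # sentinel: block numbers are ints, so None is unambiguous

    def worker_1():
        for number in block_numbers:
            # Index the block and its transactions here (omitted),
            # then notify downstream workers and move on immediately.
            block_queue.put(number)
        block_queue.put(done)

    def worker_2():
        while True:
            number = block_queue.get()
            if number is done:
                break
            # Index logs for this block using the already indexed data (omitted).
            logs_indexed.append(number)

    threads = [threading.Thread(target=worker_1), threading.Thread(target=worker_2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return logs_indexed
```

Calling run_pipeline([100, 101, 102]) returns the block numbers in the order worker 2 handled them, which matches the order worker 1 published them.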
This approach allows for efficient and scalable real-time indexing, leveraging RabbitMQ for task distribution and coordination among workers.
By parallelizing the indexing process, you can significantly reduce the time required to index blockchain data, ensuring timely and accurate updates across your GuruNetwork.ai products.
How GURU runs it
The attached diagram illustrates the data flow and the critical role of RabbitMQ in enabling parallel indexing. This workflow has proven to be the most efficient for our operations at dex.guru. Here's a detailed breakdown:
stream.py:
Responsible for streaming blocks, transactions, and logs.
Outputs the data to RabbitMQ with the exchange ethereumetl_{chain_id}.
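As a rough illustration of the publishing side, the sketch below builds the exchange name from the chain ID and publishes a message with the "block" routing key. It is not taken from stream.py: the function names and the JSON body shape are assumptions, and the channel is expected to behave like a pika BlockingChannel (basic_publish is a pika method).

```python
import json
from typing import Any


def exchange_name(chain_id: int) -> str:
    """Build the per-chain exchange name, e.g. chain 1 -> 'ethereumetl_1'."""
    return f"ethereumetl_{chain_id}"


def publish_block_indexed(channel: Any, chain_id: int, block_number: int) -> None:
    """Announce that a block has been indexed (hypothetical message format)."""
    body = json.dumps({"block_number": block_number}).encode("utf-8")
    channel.basic_publish(
        exchange=exchange_name(chain_id),
        routing_key="block",
        body=body,
    )
```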
RabbitMQ:
Manages the routing of messages based on routing keys.
Workers:
Token Transfer Worker: Listens to the "block" routing key, processes token transfers, and outputs data to ClickHouse and RabbitMQ.
DEX Pool Worker: Listens to the "block" routing key, processes DEX pools, and outputs data to ClickHouse.
Geth Trace Worker: Listens to the "block" routing key, processes geth traces, and outputs data to ClickHouse and RabbitMQ.
Subsequent Processing:
Balance and Token Workers: Listen to token transfer messages and process balances and tokens, outputting data to ClickHouse.
Enriched Data Workers: Process enriched DEX trades and transfers.
Native Balance and Contract Workers: Listen to internal transfers and process native balances and contracts, outputting data to ClickHouse.
Dead Letter Queue (DLQ):
Messages whose processing fails with an unhandled error are routed to the DLQ for further inspection and retry.
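A common way to get this behavior in RabbitMQ is to declare the work queue with the x-dead-letter-exchange argument and to reject failed messages without requeueing. The sketch below assumes that approach; this page does not say how the DLQ is actually wired. The names "dlx" and "dlq" are placeholders, and the channel is expected to behave like a pika BlockingChannel.

```python
from typing import Any, Callable


def declare_queue_with_dlq(channel: Any, queue: str, dlx: str = "dlx", dlq: str = "dlq") -> None:
    """Declare a work queue whose rejected messages end up in a dead letter queue."""
    # A fanout exchange ignores routing keys, so every dead-lettered
    # message lands in the single DLQ bound to it.
    channel.exchange_declare(exchange=dlx, exchange_type="fanout", durable=True)
    channel.queue_declare(queue=dlq, durable=True)
    channel.queue_bind(queue=dlq, exchange=dlx)
    channel.queue_declare(
        queue=queue,
        durable=True,
        arguments={"x-dead-letter-exchange": dlx},
    )


def make_callback(handler: Callable[[bytes], None]):
    """Wrap a handler so unhandled errors dead-letter the message."""

    def on_message(channel, method, properties, body):
        try:
            handler(body)
        except Exception:
            # requeue=False makes the broker dead-letter the message
            # instead of putting it back on the work queue.
            channel.basic_nack(delivery_tag=method.delivery_tag, requeue=False)
        else:
            channel.basic_ack(delivery_tag=method.delivery_tag)

    return on_message
```

The returned function has the signature pika expects for on_message_callback in basic_consume, with automatic acknowledgement left off so the explicit ack and nack calls take effect.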
This system ensures that each type of entity is processed by a dedicated worker, allowing for efficient and scalable real-time indexing.
Improving the Messaging Mechanism
Currently, the messaging system operates primarily as a trigger, and the necessary data for the worker is fetched from ClickHouse. While this setup is functional, there is significant room for improvement. By transitioning to a system where data is directly retrieved from the messages themselves, we can enhance efficiency and reduce latency.
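The difference between the two approaches can be sketched from the consumer's point of view. Both message shapes below are hypothetical, as are the function names; fetch_block stands in for whatever ClickHouse query a worker runs today.

```python
import json
from typing import Any, Callable


def handle_trigger_message(body: bytes, fetch_block: Callable[[int], Any]) -> Any:
    """Current style: the message only names the block; data is read from storage."""
    block_number = json.loads(body)["block_number"]
    return fetch_block(block_number)  # extra round trip to the database


def handle_payload_message(body: bytes) -> Any:
    """Proposed style: the message carries the indexed data itself."""
    return json.loads(body)["block"]
```

The second form removes the database read from the worker's critical path, at the cost of larger messages passing through the broker.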
We invite the community to contribute to this enhancement. Your participation in developing and refining this mechanism will be greatly appreciated and beneficial to all users of GuruNetwork.ai products.