PyCylon is the Python binding for LibCylon (C++ Cylon). What sets PyCylon apart is that it can be used either as a library or as a framework. As a library, PyCylon integrates seamlessly with PyArrow, which in turn makes it compatible with Pandas, NumPy, and tensors. As a framework, it supports distributed relational algebra operations using MPI as the distributed backend.
The PyCylon API is a Pandas-like DataFrame API that supports fast, scalable, distributed-memory, parallel operations.
In a Cylon program that uses the MPI backend, the distributed environment must be initialized as follows:
Using Pandas and converting to a PyCylon Table
A Cylon Table can be converted to a PyArrow Table, a Pandas DataFrame, or a NumPy array.
Local operations in PyCylon are backed by a high-performance C++ core and can be executed as follows.
The same operations can be executed in a distributed environment by simply passing the CylonEnv to the same function.
This example shows how data can be loaded into Cylon using its built-in APIs as well as through other frameworks such as Pandas. When loading from Pandas, NumPy, or Apache Arrow into Cylon, there is no additional data-copying overhead. When running in a distributed environment, data can either be pre-partitioned and loaded based on the worker ID, or, if all workers are configured to read from the same source, Cylon provides additional flags to partition the data.
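The two loading strategies for a distributed run can be illustrated without Cylon itself. The following is a minimal plain-Python sketch, not Cylon's API: `rank` and `world_size` stand in for the worker ID and the number of workers, and the function names and the `data_{rank}.csv` naming pattern are hypothetical.

```python
def prepartitioned_path(rank, pattern="data_{rank}.csv"):
    """Strategy 1: each worker reads its own pre-split file (hypothetical naming)."""
    return pattern.format(rank=rank)


def slice_shared_source(rows, rank, world_size):
    """Strategy 2: all workers see the same rows; each keeps one contiguous slice."""
    base, extra = divmod(len(rows), world_size)
    # The first `extra` workers take one additional row each.
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return rows[start:stop]


rows = list(range(10))
# Simulate three workers splitting the same ten-row source.
partitions = [slice_shared_source(rows, r, 3) for r in range(3)]
```

In the sketch, every row ends up on exactly one worker, and the slice sizes differ by at most one row.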
The Concat operation is analogous to the Union operation in databases when applied across axis 0. When applied across axis 1, it is similar to a Join.
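The difference between the two axes can be sketched in plain Python. This is an illustration of the semantics only, not PyCylon code: tables are modeled as lists of `(index, row)` pairs, the helper names are hypothetical, and the axis 1 variant assumes rows are matched on their index label with unmatched rows kept.

```python
def concat_rows(left, right):
    """Axis 0: stack the rows of both tables (duplicates kept, like UNION ALL)."""
    return left + right


def concat_columns(left, right):
    """Axis 1: put columns side by side, matching rows on their index label."""
    left_by_idx, right_by_idx = dict(left), dict(right)
    out = []
    for idx in sorted(left_by_idx.keys() | right_by_idx.keys()):
        # Merge the columns of both sides for this index label.
        row = {**left_by_idx.get(idx, {}), **right_by_idx.get(idx, {})}
        out.append((idx, row))
    return out


a = [(0, {"x": 1}), (1, {"x": 2})]
b = [(1, {"x": 3}), (2, {"x": 4})]
c = [(1, {"y": 10}), (2, {"y": 20})]

stacked = concat_rows(a, b)          # same columns, more rows
side_by_side = concat_columns(a, c)  # same index space, more columns
```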
The Join operation merges two DataFrames on their index columns. Cylon currently supports two join algorithms (Sort Join and Hash Join) and four join types (Left, Right, Inner, Full Outer).
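To make the four join types concrete, here is a minimal textbook hash join in plain Python. It is a sketch of the general technique, not Cylon's implementation or API: rows are modeled as `(key, value)` pairs, and the `hash_join` function and its `how` argument are hypothetical names.

```python
from collections import defaultdict


def hash_join(left, right, how="inner"):
    """Join two lists of (key, value) rows on key using a hash table."""
    # Build phase: index the right table by key.
    table = defaultdict(list)
    for key, value in right:
        table[key].append(value)

    result = []
    matched = set()
    # Probe phase: look up each left row in the hash table.
    for key, lvalue in left:
        if key in table:
            matched.add(key)
            for rvalue in table[key]:
                result.append((key, lvalue, rvalue))
        elif how in ("left", "outer"):
            # Left and full outer joins keep unmatched left rows.
            result.append((key, lvalue, None))

    # Right and full outer joins also keep right rows that never matched.
    if how in ("right", "outer"):
        for key, rvalue in right:
            if key not in matched:
                result.append((key, None, rvalue))
    return result


left = [(1, "a"), (2, "b")]
right = [(2, "x"), (3, "y")]
inner = hash_join(left, right, "inner")
full_outer = hash_join(left, right, "outer")
```

Only key 2 appears on both sides, so the inner join yields one row while the full outer join yields three, with `None` filling the missing side.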
Unlike Join, Merge can be applied on non-index columns. Like Join, Merge can be performed using two join algorithms (Sort Join and Hash Join) and four join types (Left, Right, Inner, Full Outer).
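The sort-based alternative can be sketched in the same spirit. The following plain-Python example performs an inner sort-merge join on an ordinary (non-index) column. It illustrates the general algorithm only and makes no claim about how Cylon implements it; rows are modeled as dicts, and `sort_merge_join` and its `on` parameter are hypothetical names.

```python
def sort_merge_join(left, right, on):
    """Inner join of two lists of dict rows on column `on`, via sorting."""
    left = sorted(left, key=lambda r: r[on])
    right = sorted(right, key=lambda r: r[on])
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][on], right[j][on]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of right rows sharing this key.
            j_end = j
            while j_end < len(right) and right[j_end][on] == lk:
                j_end += 1
            # Pair every left row with this key against that run.
            while i < len(left) and left[i][on] == lk:
                for r in right[j:j_end]:
                    out.append({**left[i], **r})
                i += 1
            j = j_end
    return out


cities = [{"city": "B", "pop": 2}, {"city": "A", "pop": 1}]
zips = [{"city": "A", "zip": 10}, {"city": "C", "zip": 30}, {"city": "A", "zip": 11}]
joined = sort_merge_join(cities, zips, on="city")
```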
The Sort operation rearranges the rows of a DataFrame based on one or more columns. If two (or more) columns are specified, rows are sorted on the first column, and rows with equal values in the first column are then sorted on the second column.
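The multi-column ordering described above is the usual lexicographic order. A minimal plain-Python sketch, with rows modeled as dicts and `sort_rows` as a hypothetical helper rather than a PyCylon function:

```python
def sort_rows(rows, by):
    """Sort dict rows by the columns in `by`, earlier columns taking priority."""
    return sorted(rows, key=lambda row: tuple(row[col] for col in by))


rows = [
    {"dept": 2, "age": 30},
    {"dept": 1, "age": 40},
    {"dept": 2, "age": 25},
    {"dept": 1, "age": 35},
]
ordered = sort_rows(rows, by=["dept", "age"])
```

Rows are grouped by `dept` first, and `age` only decides the order among rows with the same `dept`.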
Group By works similarly to the GROUP BY operator in databases. It should be coupled with an aggregate operation such as min, max, or std.
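The pairing of grouping with an aggregate can be sketched in plain Python. This is an illustration of the concept, not PyCylon's API; `group_by` and its parameters are hypothetical names, and rows are modeled as dicts.

```python
from collections import defaultdict


def group_by(rows, key, column, agg):
    """Group dict rows by `key` and reduce each group's `column` with `agg`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[column])
    # Apply the aggregate once per group.
    return {k: agg(values) for k, values in groups.items()}


scores = [
    {"team": "a", "score": 1},
    {"team": "b", "score": 5},
    {"team": "a", "score": 3},
]
lowest = group_by(scores, "team", "score", min)
highest = group_by(scores, "team", "score", max)
```

Without an aggregate, each group would remain a list of values, which is why the operation needs a reducer to produce one row per group.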
PyCylon is backed by a C++ implementation that accelerates its operations. The C++ implementation writes logs to the console for debugging purposes. By default, logging from C++ is disabled in PyCylon. However, it can be enabled by setting the CYLON_LOG_LEVEL environment variable as follows.
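One way to set the variable from Python is to prepare the environment for a child process that runs the PyCylon script. The sketch below uses only the standard library. The helper name is hypothetical, and the level `0` is a placeholder, since the accepted values are not listed here.

```python
import os


def env_with_cylon_logging(level, base=None):
    """Return a copy of the environment with CYLON_LOG_LEVEL set."""
    env = dict(os.environ if base is None else base)
    # Environment variable values must be strings.
    env["CYLON_LOG_LEVEL"] = str(level)
    return env


# Placeholder level; consult the PyCylon docs for the supported values.
child_env = env_with_cylon_logging(0, base={"PATH": "/usr/bin"})
# The result can be passed to a child process, for example:
# subprocess.run([sys.executable, "my_script.py"], env=child_env)
```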
Additionally, this can be done programmatically as follows.
Use the link below to navigate to the PyCylon API docs.
Python API docs