PyCylon

PyCylon is the Python binding for LibCylon (C++ Cylon). PyCylon can be used either as a library or as a framework. As a library, it integrates with PyArrow, which makes it compatible with Pandas, NumPy and tensors. As a framework, it supports distributed relational algebra operations using MPI as the distributed backend.

Dataframe#

The PyCylon API is a Pandas-like DataFrame API that supports fast, scalable, distributed-memory parallel operations.

Initialize#

In a Cylon program that uses the MPI backend, the distributed environment must be initialized as follows:

from pycylon import DataFrame, CylonEnv
from pycylon.net import MPIConfig
env = CylonEnv(config=MPIConfig())
Note: In the current release, Cylon supports only MPI as a distributed backend.

Load a Table#

Using Cylon

from pycylon import DataFrame, read_csv
df = read_csv('path/to/csv')

Using Pandas and converting to a PyCylon DataFrame

from pycylon import DataFrame, read_csv
import pandas as pd
df = DataFrame(pd.read_csv("http://path/to/csv"))

A Cylon Table can be converted to a PyArrow Table, a Pandas DataFrame or a NumPy array:

pyarrow_tb = cylon_tb.to_arrow()
pandas_df = cylon_tb.to_pandas()
numpy_arr = cylon_tb.to_numpy()

PyCylon Operations#

Local Operations

Local operations in PyCylon are backed by a high-performance C++ core and can be executed as follows.

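import random

from pycylon import DataFrame

# Each DataFrame is built from two lists of 50 distinct random integers
df1 = DataFrame([random.sample(range(10, 100), 50),
                 random.sample(range(10, 100), 50)])
df2 = DataFrame([random.sample(range(10, 100), 50),
                 random.sample(range(10, 100), 50)])
# Use column 0 of df2 as its index, then join df1 with df2 on column 0
df2.set_index([0], inplace=True)
df3 = df1.join(other=df2, on=[0])
print(df3)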

Distributed Operations

The same operations can be executed in a distributed environment by passing the CylonEnv to the same function.

import random

from pycylon import DataFrame, CylonEnv
from pycylon.net import MPIConfig

# Initialize the distributed environment with the MPI backend
env = CylonEnv(config=MPIConfig())
# Each worker samples 5 distinct integers per list from a rank-dependent range
df1 = DataFrame([random.sample(range(10*env.rank, 15*(env.rank+1)), 5),
                 random.sample(range(10*env.rank, 15*(env.rank+1)), 5)])
df2 = DataFrame([random.sample(range(10*env.rank, 15*(env.rank+1)), 5),
                 random.sample(range(10*env.rank, 15*(env.rank+1)), 5)])
df2.set_index([0], inplace=True)
# Passing env makes the join run across all workers
df3 = df1.join(other=df2, on=[0], env=env)
print(df3)
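
The expression `range(10*env.rank, 15*(env.rank+1))` gives every worker a different but overlapping value range. The sketch below evaluates it in plain Python for a few hypothetical ranks, without MPI or PyCylon; `worker_range` is an illustrative helper name and not part of the PyCylon API.

```python
def worker_range(rank):
    """Value range that the distributed example samples from on a given rank."""
    return range(10 * rank, 15 * (rank + 1))


# Evaluate the range for three hypothetical ranks (0, 1 and 2)
for rank in range(3):
    r = worker_range(rank)
    print(f"rank {rank}: values {r.start}..{r.stop - 1} ({len(r)} candidates)")

# Values that both rank 0 and rank 1 may draw
shared = set(worker_range(0)) & set(worker_range(1))
print(sorted(shared))
```

Because neighbouring ranges overlap, the same key value can be drawn on more than one worker, so matching rows for a join may sit on different workers.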

PyCylon Examples#

  1. Data Loading

This example shows how data can be loaded into Cylon using its built-in APIs and also using other frameworks like Pandas. When loading from Pandas, NumPy or Apache Arrow into Cylon, there is no additional data copying overhead. When running in a distributed environment, data can either be pre-partitioned and loaded based on the worker ID, or, if all workers are configured to read from the same source, partitioned by Cylon using additional flags.

  2. Concat

The Concat operation is analogous to the Union operation in databases when applied across axis 0. When applied across axis 1, it is similar to a Join.

  3. Join

The Join operation merges two DataFrames across their index columns. Cylon currently supports two join algorithms (Sort Join and Hash Join) and four join types (Left, Right, Inner, Full Outer).

  4. Merge

Unlike Join, Merge can be applied on non-index columns. Like Join, Merge can be performed using two join algorithms (Sort Join and Hash Join) and four join types (Left, Right, Inner, Full Outer).

  5. Sort

The Sort operation rearranges the rows of a DataFrame based on one or more columns. If two (or more) columns are specified, rows are sorted on the first column first, and rows with the same value in the first column are then sorted on the second column.

  6. Group By

Group By works similarly to the GROUP BY operator in databases. It should be coupled with an aggregate operation such as min, max or std.
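
The Join, Sort and Group By descriptions above follow standard relational semantics. As an illustration only, the plain-Python sketch below works on lists of tuples and makes no PyCylon calls; `hash_join`, `sort_join` and `group_by` are hypothetical helper names, and the code is not a description of how Cylon implements these operations. It shows that a hash-based and a sort-based inner join return the same rows, how a two-column sort orders ties, and how Group By pairs with an aggregate.

```python
from collections import defaultdict
from operator import itemgetter


def hash_join(left, right, key=0):
    """Inner join: build a hash table on `right`, then probe it with `left`."""
    table = defaultdict(list)
    for row in right:
        table[row[key]].append(row)
    return [l + r for l in left for r in table.get(l[key], [])]


def sort_join(left, right, key=0):
    """Inner join: sort both inputs on the key, then merge runs of equal keys."""
    left = sorted(left, key=itemgetter(key))
    right = sorted(right, key=itemgetter(key))
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][key], right[j][key]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Locate the run of rows in `right` that share this key
            j_end = j
            while j_end < len(right) and right[j_end][key] == lk:
                j_end += 1
            # Pair every left row with this key against that run
            while i < len(left) and left[i][key] == lk:
                out.extend(left[i] + r for r in right[j:j_end])
                i += 1
            j = j_end
    return out


def group_by(rows, key, value, agg):
    """Group rows on column `key` and aggregate column `value` with `agg`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[value])
    return {k: agg(v) for k, v in groups.items()}


left = [(3, "c"), (1, "a"), (2, "b"), (1, "z")]
right = [(1, 10.0), (3, 30.0), (4, 40.0)]

# Both algorithms produce the same set of joined rows (order may differ)
joined_hash = sorted(hash_join(left, right))
joined_sort = sorted(sort_join(left, right))

# Two-column sort: column 0 first, ties broken by column 1
two_column_sort = sorted(left, key=itemgetter(0, 1))

# Group By coupled with an aggregate (here: max of column 1 per key)
max_per_key = group_by([("x", 3), ("y", 5), ("x", 7)], key=0, value=1, agg=max)
```

The choice between the two join algorithms affects how the result is computed, not which rows it contains, which is why the sketch compares the outputs after sorting them.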

Logging#

PyCylon is backed by a C++ implementation to accelerate operations. The C++ implementation writes logs to the console for debugging purposes. By default, logging from C++ is disabled in PyCylon, but it can be enabled by setting the CYLON_LOG_LEVEL environment variable as follows.

export CYLON_LOG_LEVEL=<log_level_flag>
python python/examples/dataframe/join.py

Log Level    Flag
INFO         0
WARN         1
ERROR        2
FATAL        3

Additionally, this can be done programmatically as follows.

from pycylon.util.logging import log_level, disable_logging
log_level(0) # set the log level (0 = INFO)
disable_logging() # disable logging completely
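
When a PyCylon script is launched from Python instead of a shell, the `export` step can be mirrored by passing an environment to the child process. The sketch below is one possible way to do that, using the flag values from the log level table; `LOG_LEVEL_FLAGS` and `env_with_log_level` are hypothetical names introduced here and are not part of PyCylon.

```python
import os

# Flag values copied from the log level table in this section
LOG_LEVEL_FLAGS = {"INFO": 0, "WARN": 1, "ERROR": 2, "FATAL": 3}


def env_with_log_level(level, base=None):
    """Return a copy of the environment with CYLON_LOG_LEVEL set for `level`."""
    env = dict(os.environ if base is None else base)
    env["CYLON_LOG_LEVEL"] = str(LOG_LEVEL_FLAGS[level.upper()])
    return env


# Example: the environment a child process would receive for WARN logging
warn_env = env_with_log_level("warn", base={})
print(warn_env)
```

The returned dictionary can then be handed to `subprocess.run([...], env=env_with_log_level("WARN"))` when starting a script such as `python/examples/dataframe/join.py`.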

Python API docs#

Use the link below to navigate to the PyCylon API docs.

Python API docs