With the growing availability of three-dimensional (3D) scanning technology, point clouds are being used as a rich representation of everyday scenes. However, conventional techniques generate point clouds that are unstructured and contain a variable number of data points, which makes them difficult to process using machine learning algorithms. Some conventional techniques process the point clouds by applying voxelization, which increases the amount of data stored while at the same time reducing detail through discretization. Other conventional techniques process the point clouds directly using deep learning models with hand-tailored architectures designed by experts. However, these architectures use an increased number of parameters and are computationally inefficient.
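The storage and detail trade-off of voxelization noted above can be illustrated with a minimal sketch. It assumes points in the unit cube and a dense occupancy grid. The names `voxelize`, `count_occupied`, `grid_dim`, and `voxel_size` are hypothetical and chosen only for illustration; they are not drawn from any particular conventional technique.

```python
from math import floor


def voxelize(points, voxel_size, grid_dim):
    """Mark each voxel of a dense grid_dim^3 occupancy grid that contains a point."""
    grid = [[[0] * grid_dim for _ in range(grid_dim)] for _ in range(grid_dim)]
    for x, y, z in points:
        # Quantize each coordinate to a voxel index, clamped to the grid bounds.
        i, j, k = (
            min(grid_dim - 1, max(0, floor(c / voxel_size))) for c in (x, y, z)
        )
        grid[i][j][k] = 1
    return grid


def count_occupied(grid):
    """Count the voxels that contain at least one point."""
    return sum(cell for plane in grid for row in plane for cell in row)


# Illustrative data (assumed): four points in the unit cube.
points = [
    (0.10, 0.10, 0.10),
    (0.12, 0.11, 0.09),  # near the first point, so it falls in the same voxel
    (0.90, 0.20, 0.40),
    (0.55, 0.55, 0.55),
]

grid_dim = 8
voxel_size = 1.0 / grid_dim

grid = voxelize(points, voxel_size, grid_dim)
stored_cells = grid_dim ** 3       # cells held by the dense grid
stored_coords = 3 * len(points)    # values held by the raw point cloud
occupied = count_occupied(grid)    # distinct voxels that survive discretization
```

In this sketch the dense grid holds 512 cells to represent a point cloud that needs only 12 coordinate values, and the two nearby points collapse into a single occupied voxel, so the grid can no longer distinguish them.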