A memory-balanced and communication-efficient model-parallel implementation of a FullyConnected layer with CrossEntropyLoss in PyTorch
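
As a rough illustration of the general idea (not this repository's actual code), the sketch below simulates one common way to split a fully connected classifier across ranks by class dimension and still compute an exact cross-entropy loss. Each simulated rank holds only its slice of the weight matrix and produces only its slice of the logits; the loss then needs just a few per-sample scalars (the global max, the global sum of exponentials, and the target logit) rather than the full logit vector. The function names `sharded_fc_cross_entropy` and `dense_cross_entropy` are hypothetical, the "all-reduce" steps are plain Python reductions standing in for real collective calls, and the code uses only the standard library so it runs without PyTorch.

```python
import math


def sharded_fc_cross_entropy(x, weight_shards, target):
    """Simulate a class-sharded FC layer + cross-entropy for one sample.

    x             -- list of input features
    weight_shards -- one non-empty weight matrix per simulated rank; each
                     matrix is a list of rows, one row per class owned by
                     that rank (hypothetical layout, assumed for this sketch)
    target        -- global index of the correct class
    """
    # Each rank computes logits only for the classes it owns.
    local_logits = [
        [sum(w * xi for w, xi in zip(row, x)) for row in shard]
        for shard in weight_shards
    ]

    # Simulated all-reduce (max): one scalar, used for numerical stability.
    global_max = max(max(logits) for logits in local_logits)

    # Simulated all-reduce (sum): one scalar, the softmax denominator.
    global_sum = sum(
        sum(math.exp(v - global_max) for v in logits)
        for logits in local_logits
    )

    # Only the rank that owns the target class contributes its logit.
    offset = 0
    target_logit = 0.0
    for logits in local_logits:
        if offset <= target < offset + len(logits):
            target_logit = logits[target - offset]
        offset += len(logits)

    # loss = log(sum_j exp(z_j)) - z_target, with the max factored out.
    return math.log(global_sum) + global_max - target_logit


def dense_cross_entropy(x, weight, target):
    """Single-device reference used to check the sharded result."""
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in weight]
    m = max(logits)
    return math.log(sum(math.exp(v - m) for v in logits)) + m - logits[target]


if __name__ == "__main__":
    x = [0.5, -1.0, 2.0]
    weight = [
        [0.1, 0.2, 0.3],
        [-0.4, 0.5, 0.6],
        [0.7, -0.8, 0.9],
        [1.0, 1.1, -1.2],
        [0.0, 0.3, 0.2],
    ]
    shards = [weight[:2], weight[2:]]  # two simulated ranks
    print(sharded_fc_cross_entropy(x, shards, target=3))
    print(dense_cross_entropy(x, weight, target=3))
```

The two printed values should agree up to floating-point rounding, since the sharded computation is algebraically the same loss. In a real multi-GPU setting the two reductions would be collective operations across devices, and the backward pass would need matching treatment, which this sketch does not cover.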