Concatenating simple functions to form deep neural networks (DNNs) has provided flexible models with excellent approximation properties, especially in high dimensions. However, realizing these theoretical approximation properties in a practical task requires training the DNN weights, which can be challenging. The training problem is typically posed as a stochastic optimization problem with respect to the DNN weights. With millions of weights, a non-convex and non-smooth objective function, and many hyperparameters to tune, solving the training problem well requires many iterations, many trials, and significant time and computational resources. In this joint talk with my collaborator Elizabeth Newman, we exploit the separability of commonly used DNN architectures to simplify the training process. We call a DNN separable if the weights of its final layer are applied linearly, which is the case for most state-of-the-art networks. We then present algorithms that exploit this structure in two settings. First, we approximate the stochastic optimization problem via a sample average approximation (SAA). Here, we can eliminate the linear weights through partial optimization, a method affectionately known as Variable Projection (VarPro). Second, we consider a powerful iterative sampling approach in the stochastic approximation (SA) setting, which notably incorporates automatic regularization parameter selection methods. Throughout the talk, we demonstrate the efficacy of these two approaches to exploiting separability using numerical examples.
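To make the separability idea concrete, the following is a minimal sketch of the VarPro elimination step, assuming a toy separable model in which `phi` stands in for the nonlinear inner layers and the final linear weights `W` are recovered by least squares for each fixed set of inner weights `theta`. All names and the random-search outer loop are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3))                 # toy inputs
Y = np.sin(X @ rng.standard_normal((3, 1)))       # toy targets

def phi(X, theta):
    # Stand-in for the nonlinear inner layers of a separable DNN.
    return np.tanh(X @ theta)

def reduced_loss(theta):
    # VarPro idea: for fixed inner weights theta, the optimal final
    # linear layer W solves a linear least-squares problem, so it can
    # be eliminated from the outer optimization over theta.
    Phi = phi(X, theta)
    W, *_ = np.linalg.lstsq(Phi, Y, rcond=None)
    residual = Phi @ W - Y
    return 0.5 * np.sum(residual**2)

# The outer optimization now runs only over the nonlinear weights
# theta; a crude random search is used here purely for illustration.
best_theta, best_loss = None, np.inf
for _ in range(50):
    theta = rng.standard_normal((3, 8))
    loss = reduced_loss(theta)
    if loss < best_loss:
        best_theta, best_loss = theta, loss
```

In practice the outer problem would be solved with a derivative-based method (e.g. Gauss–Newton on the reduced objective), but the key point survives in the sketch: the linear weights never appear as optimization variables.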