Joint Attention (JA) is a fundamental social interaction capability, developed in early childhood, that consists of sharing a common focus point with others. However, this skill is commonly affected in children with Autism Spectrum Disorders (ASD). In this context, social robots have emerged as a tool for developing novel JA intervention strategies that support therapists. This work presents a study to determine the best combination of robot and therapist participation in JA therapies based on following the gaze of the mediator with verbal and pointing instructions. Sixteen subjects with ASD participated in seven sessions, divided into a robot-assisted group (RAG) and a control group (CG) that performed equivalent intervention sessions. A focus visual system measured quantitative and reliable JA metrics, while the robot's participation in the RAG was progressively increased until it managed all the instructions in the last session. Results show that participants in the RAG had better JA scores than those in the CG during sessions 6 and 7 ($p=0.029$ and $p=0.018$, respectively). The RAG also paid more visual attention and presented more social engagement behaviours in the sessions ($p<0.05$). Moreover, the best JA performance was achieved when the therapist and the robot issued similar instructions during the activities.