If I understand correctly, you could present the image as the first stimulus with a time value of 0ms. Then present the 32 sound stimuli on the same trial, each with the time value:
rt:50,51,52,53,60>-2,61>-2,62>-2,63>-2,64>-2,65>-2,66>-2,67>-2,68>-2,87>-2,88>-2
The min-max value for the style of the trial should be 0-1000, so that if no response occurs to a tone within 1000ms, the next tone plays, and so on. The negative skip values tell DirectRT to skip out of the trial immediately because a response to the visual stimulus has been given.
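In case it helps to see the logic spelled out, here is a rough Python sketch of how I'd expect the trial to flow. This is not DirectRT syntax and not how DirectRT is implemented; the function names are made up for illustration, and it assumes that a key press not listed in the time value, or one arriving after the max time, is simply treated as no response.

```python
# Illustration only: hypothetical helpers, not part of DirectRT.
TIME_VALUE = ("rt:50,51,52,53,60>-2,61>-2,62>-2,63>-2,64>-2,"
              "65>-2,66>-2,67>-2,68>-2,87>-2,88>-2")


def parse_rt_value(value):
    """Split an 'rt:' time value into {key_code: skip_or_None}."""
    prefix, _, body = value.partition(":")
    if prefix != "rt":
        raise ValueError("expected a value starting with 'rt:'")
    keys = {}
    for item in body.split(","):
        code, sep, skip = item.partition(">")
        keys[int(code)] = int(skip) if sep else None
    return keys


def run_trial(keys, responses, n_tones=32, max_ms=1000):
    """Return (tones_presented, ended_by_skip).

    responses maps a tone index to a (key_code, latency_ms) pair; tones
    without an entry are treated as timing out after max_ms.
    """
    for tone in range(n_tones):
        response = responses.get(tone)
        if response is None:
            continue  # no response within max_ms: the next tone plays
        code, latency = response
        if code not in keys or latency > max_ms:
            continue  # assumption: invalid or late presses count as a timeout
        skip = keys[code]
        if skip is not None and skip < 0:
            return tone + 1, True  # negative skip: leave the trial at once
    return n_tones, False
```

For example, run_trial(parse_rt_value(TIME_VALUE), {2: (61, 450)}) stops after the third tone, while an empty responses dict runs through all 32 tones.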
Does that help?