For example, say I'm given the number 1736 and I have 100 .wav files (0.wav, 1.wav, etc.). How should I concatenate the audio so that it sounds more "fluid"? Most of the time there is a gap between the numbers and the result sounds very "hard". I want it to sound as if a real person were saying it, or at least as close as possible (excluding the sound quality).
This can be in any language, PHP, Python, etc. I just need the logic/algorithm.
I'm not sure whether this question is too vague. If it is, feel free to tell me and I'll remove it.
Thanks.
The issue you're likely having is intonation.
When speaking, the rising and falling tones help indicate phrasing. If I say, "one, seven, three, six", and end with a falling tone (pitch going down), it sounds final and the listener knows they've heard all the digits. If I end with a rising tone (pitch going up), it sounds like I'm asking a question, which is weird to the listener since the numbers aren't a question.
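To make this sound more natural, at a minimum you'll need to record each number with different intonations (at least a non-final version and a version with a falling, final tone) and put them together so that the falling version comes last.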
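A minimal sketch of that selection logic, under some loud assumptions: the number is read digit by digit (as in "one, seven, three, six"), and each digit has been recorded twice under the hypothetical names `<digit>_mid.wav` (non-final) and `<digit>_final.wav` (falling tone). Neither the naming scheme nor the helper names `plan_files` and `concat_wavs` come from the question; they are only for illustration. All clips are assumed to share the same sample rate, sample width, and channel count.

```python
import wave


def plan_files(number):
    """Pick one recording per digit: the falling take for the last digit,
    the non-final take for every digit before it (hypothetical file names)."""
    if not number.isdigit():
        raise ValueError("expected a string of digits")
    last = len(number) - 1
    return [
        f"{digit}_final.wav" if i == last else f"{digit}_mid.wav"
        for i, digit in enumerate(number)
    ]


def concat_wavs(paths, out_path):
    """Append the audio frames of each clip to one output file.
    Assumes every clip uses the same format as the first one."""
    if not paths:
        raise ValueError("no clips to concatenate")
    with wave.open(out_path, "wb") as out:
        for i, path in enumerate(paths):
            with wave.open(path, "rb") as clip:
                if i == 0:
                    # Copy channels, sample width and rate from the first clip.
                    out.setparams(clip.getparams())
                out.writeframes(clip.readframes(clip.getnframes()))
```

For 1736 this would play `1_mid.wav`, `7_mid.wav`, `3_mid.wav`, then `6_final.wav`, so only the last digit carries the falling tone.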
There's another problem, though, with the phrasing. Speech sounds best when the speaker keeps air moving continuously and uses articulation to enunciate the words. If you were to record the sound of a radio announcer and play it back while filtering out all of the higher frequencies so that you couldn't hear the articulation, you would hear something close to a continuous tone that changes a bit in pitch. This isn't something you'll get by concatenating audio files together. The best you can do is have a proper speech engine speak.
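If you stay with concatenation anyway, the audible gaps can at least be reduced by trimming the silence at the edges of each clip and overlapping neighbouring clips with a short crossfade. The sketch below is only that: it works on plain lists of samples scaled to -1.0..1.0, the threshold and overlap values are guesses to tune by ear, and the function names are made up for illustration. It will not produce the continuous pitch contour described above.

```python
def trim_silence(samples, threshold=0.02):
    """Drop leading and trailing samples quieter than the threshold."""
    start = 0
    while start < len(samples) and abs(samples[start]) < threshold:
        start += 1
    end = len(samples)
    while end > start and abs(samples[end - 1]) < threshold:
        end -= 1
    return samples[start:end]


def crossfade(a, b, overlap):
    """Join a and b, fading a out linearly while b fades in
    over the last/first `overlap` samples."""
    overlap = min(overlap, len(a), len(b))
    if overlap == 0:
        return a + b
    mixed = []
    for i in range(overlap):
        weight = (i + 1) / (overlap + 1)  # ramps from near 0 to near 1
        mixed.append(a[len(a) - overlap + i] * (1 - weight) + b[i] * weight)
    return a[:len(a) - overlap] + mixed + b[overlap:]


def join_clips(clips, overlap=200, threshold=0.02):
    """Trim every clip, then chain them together with crossfades."""
    out = []
    for clip in clips:
        out = crossfade(out, trim_silence(clip, threshold), overlap)
    return out
```

At a typical sample rate an overlap of a few hundred samples is only a few milliseconds, so it smooths the joins without blurring the digits into each other.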