Music Boxes and Chatterbots
The player piano was a significant advance over its predecessor, the music box.
The music box could play only one song, and for a voice it had only little metal strips that were plucked to make a tune. More advanced music boxes came along later that could play more than one song.
People didn't stop making music boxes.
A player piano was an output device for playing back human behavior. That's right, those little paper rolls contained holes that transcribed the pianist's keyboard articulations into something you could hold in your hand.
And you could recreate the pianist's behavior later, on another device, over and over, as many times as you liked. Once the recording was made, the pianist wasn't needed again.
The roll of paper played more than just strips of metal; it had an entire piano at its disposal! It was a form of software, a way to program a player piano. And these piano rolls could be bought and sold, just like any other commodity.
And then Edison introduced the phonograph. Not only did it record some representation of the pianist's behavior, it reproduced the original sound heard at the initial recording. That sound could come from pianos, flutes, trombones, or simply the human voice: anything that could make a sound.
People stopped making player pianos.
Computers have historically been limited by the amount of audio and video they could store. Good-sounding audio can take on the order of 10 million bytes per minute, depending on the format, and by old-school standards audio and video recordings make huge files.
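To put a rough number on that, here is a minimal back-of-the-envelope sketch in Python, assuming uncompressed CD-quality audio (44.1 kHz, 16-bit samples, stereo); other formats will differ:

    # Rough storage estimate for one minute of uncompressed CD-quality audio.
    SAMPLE_RATE = 44_100      # samples per second
    BYTES_PER_SAMPLE = 2      # 16-bit samples
    CHANNELS = 2              # stereo

    bytes_per_second = SAMPLE_RATE * BYTES_PER_SAMPLE * CHANNELS
    bytes_per_minute = bytes_per_second * 60
    print(bytes_per_minute)   # 10584000 bytes, roughly 10 MB per minute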
Yet hardware keeps getting cheaper while at the same time becoming faster and more capable.
If people's typing behavior can be recorded, why not their audiovisual behavior as well? If typing behavior can be simulated, why not audiovisual behavior?
The threshold for this comes when we can compare two audio segments and determine whether they are "equivalent." Perhaps that capability is already here.
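One minimal sketch of such a comparison, assuming the third-party librosa and numpy packages, illustrative file names, and a purely illustrative similarity threshold (not a real speech-matching system), might average MFCC features from each clip and compare them:

    # Minimal sketch: are two audio clips roughly "equivalent"?
    import librosa
    import numpy as np

    def mfcc_fingerprint(path):
        # Load the clip and average its MFCC features over time.
        y, sr = librosa.load(path, sr=None)
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
        return mfcc.mean(axis=1)

    def roughly_equivalent(path_a, path_b, threshold=0.95):
        # Cosine similarity of the two fingerprints, against an arbitrary threshold.
        a, b = mfcc_fingerprint(path_a), mfcc_fingerprint(path_b)
        similarity = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
        return similarity >= threshold

    print(roughly_equivalent("clip_a.wav", "clip_b.wav"))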
In some ways it makes sense to keep recording text and working with that, using converters to go from voice to text and from text to voice. But then again, if a recording of behavior is to be played back, wouldn't it be nice to keep the original speaker's inflections, tonality, and "voice"?
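A minimal sketch of that text-as-intermediate pipeline, assuming the third-party speech_recognition and pyttsx3 packages and hypothetical file names, makes the trade-off concrete: the words survive the round trip, but the original speaker's delivery does not.

    # Minimal sketch of a voice-to-text-to-voice round trip.
    import speech_recognition as sr
    import pyttsx3

    # Voice to text: transcribe a recorded clip.
    recognizer = sr.Recognizer()
    with sr.AudioFile("question.wav") as source:
        audio = recognizer.record(source)
    text = recognizer.recognize_google(audio)   # uses Google's free web API

    # Text to voice: speak the transcript back with a synthetic voice;
    # the words are preserved, the original inflections are not.
    engine = pyttsx3.init()
    engine.save_to_file(text, "reply.wav")
    engine.runAndWait()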
People are still making chatterbots.