Oral-History:James L. Flanagan


About James L. Flanagan

James Flanagan

Dr. James Flanagan was born in 1925 in Greenwood, Mississippi. He attended Mississippi State University, where he received his EE bachelor's degree in 1948. Flanagan then started graduate school at MIT, where he worked in the Acoustics Laboratory. He received his master's degree from MIT in 1950 and taught for two years at Mississippi State before returning to MIT to begin work on his Ph.D. Flanagan received his doctorate in 1955, having done much of his research work at the Acoustics Lab. He studied bandwidth compression of speech communication and built a spectral analyzer for tracking voice signals, work that helped shape the signal processing field. In 1957 Dr. Flanagan joined Bell Laboratories, where he worked on speech and image processing research. He helped shape the digital signal processing field through his work on converting analog communications signals to digital form with early computers and digital filters. Flanagan helped develop the phase vocoder and researched possible future uses for voice mimicry and speech recognition technology, as well as tactile interfaces for networked computer users. He was head of Bell Labs' Acoustics Research Department. He wrote over ninety technical publications and is the author of thirty patents and two books. His professional activities included service as president of the Acoustical Society of America, membership in the National Academy of Engineering, and presidency of the IEEE Acoustics, Speech and Signal Processing Society. Dr. Flanagan is an IEEE Fellow [Fellow award for "contributions to reduced bandwidth speech communications systems and to the fundamental understanding of human hearing"] and has received numerous awards from the IEEE and other organizations, including the IEEE Centennial Medal in 1984, the IEEE Edison Medal in 1986, the Gold Medal of the Acoustical Society of America, and the National Medal of Science in 1996, among others. Dr. Flanagan died 25 August 2015, a day short of his 90th birthday.

The interview discusses Flanagan's education and career, focusing on his experiences at MIT and Bell Labs. He explains his work in digital signal processing, and how his career evolved from his early experiences with radar during World War II and his childhood love of science and mathematics. Flanagan discusses his work in the MIT Acoustics Lab in speech communications, and explains his decision to get his doctorate and then to work at Bell Labs. He recalls his work in image and sound processing research, including his work with Fourier transform programs for spectral analysis, sending voice signals over cable, speech compression and synthesis, vocoder research, converting analog data to digital, and future uses for speech signal processing technology. Flanagan explains his research exchanges with Japanese acoustics researchers, and discusses cutting-edge computer interface software technology and speaker-sensitive electronic equipment. The interview concludes with his predictions for the future of speech recognition techniques.

Other interviews covering digital signal processing and speech research include Maurice Bellanger Oral History, Leo Beranek Oral History (1996), Leo Beranek Oral History (2005), C. Sidney Burrus Oral History, James W. Cooley Oral History, Ben Gold Oral History, Robert M. Gray Oral History, Alfred Fettweis Oral History, Fumitada Itakura Oral History, James Kaiser Oral History, William Lang Oral History, Wolfgang Mecklenbräuker Oral History, Russel Mersereau Oral History, Alan Oppenheim Oral History, Lawrence Rabiner Oral History, Charles Rader Oral History, Ron Schafer Oral History, Hans Wilhelm Schuessler Oral History, and Tom Parks Oral History.

About the Interview

JAMES L. FLANAGAN: An Interview Conducted by Frederik L. Nebeker, IEEE History Center, 8 April 1997

Interview #332 for the IEEE History Center, The Institute of Electrical and Electronics Engineers, Inc.

Copyright Statement

This manuscript is being made available for research purposes only. All literary rights in the manuscript, including the right to publish, are reserved to the IEEE History Center. No part of the manuscript may be quoted for publication without the written permission of the Director of the IEEE History Center.

Request for permission to quote for publication should be addressed to the IEEE History Center Oral History Program, IEEE History Center, 445 Hoes Lane, Piscataway, NJ 08854 USA or ieee-history@ieee.org. It should include identification of the specific passages to be quoted, anticipated use of the passages, and identification of the user.

It is recommended that this oral history be cited as follows:

James L. Flanagan, Electrical Engineer, an oral history conducted in 1997 by Frederik L. Nebeker, IEEE History Center, Piscataway, NJ, USA.

Interview

Interview: James L. Flanagan

Interviewer: Frederik Nebeker

Date: 8 April 1997

Place: Rutgers University, New Jersey

Childhood, family, and education

Nebeker:

Could I ask you first to tell us where and when you were born and a little bit about your family?

Flanagan:

I was born August 26, 1925 on a cotton farm out from Greenwood, Mississippi, a family farm. I grew up there, went to high school in Greenwood, riding school buses.

Nebeker:

Is that upstate Mississippi?

Flanagan:

It's the very center of the state, in the Delta region, which is primarily cotton farming, also soybeans and grain. Cotton is the main crop. I graduated from Greenwood High School. This was 1943, and I went immediately into the summer session at Mississippi State University. I entered the Engineering School, not really certain at that time what I wanted to specialize in.

Nebeker:

Had you always been interested in science?

Flanagan:

Yes. I had been interested in science from secondary school, where I had a very good science teacher in, I guess it was about the eighth or ninth grade. Following that I was lucky enough to have an exceptional teacher in chemistry and physics, plus a lady who taught me algebra, geometry, trigonometry. She was also a rather exceptional person. So that interest was fueled at an early time.

Military service; electrical engineering studies

Flanagan:

This was during World War II. I joined the Air Force when I was seventeen years old. They called me in to serve as soon as I turned eighteen, and I left the university—it was called Mississippi State College then, it's now a university—at the mid-term. I spent a little more than two and a half years in the Army. I was eventually assigned to the Air Force and trained on airborne communications, and later some of the first radar equipment that the Air Force had for ground control approach. It's a blind landing system. It had two systems: a 3-centimeter radar for glide slope and azimuth, and a 10-centimeter system—a search system—for traffic control and vectoring to a final approach.

Nebeker:

Where were you when you were working these radars?

Flanagan:

Various places. Let's see. The radio and radar in one case was upstate Wisconsin, Truax Field out near Madison. Then I was in Illinois at Urbana-Champaign, a place called Chanute Field. I don't know if it still exists. I was a longer time at Boca Raton, Florida, at an Air Force base there. I don't believe that base exists anymore. I got interested in communications a little more from that experience, and when I got out of the Army I went back to the university and decided to study electrical engineering, which I did.

Nebeker:

Again at Mississippi State?

Flanagan:

Yes, Mississippi State. I finished there. I went during all the summers to try to catch up, to make up for the two and a half years I'd been out, and I finished there in '48. The head of the Electrical Engineering Department, a very exceptional person, urged me to try for graduate schools and in fact helped me get information on them.

Graduate studies

Flanagan:

I applied to MIT for a graduate assistantship in one of the research laboratories so I could earn tuition, and I was fortunate enough to get in. I was appointed in the Acoustics Laboratory at MIT, which at the time was headed by Richard Bolt, who was the Director, and Leo Beranek, who was the Technical Director. Both of these are fairly famous individuals. Richard Bolt is a physicist; Leo Beranek is an electrical engineer. They jointly headed the MIT Acoustics Lab.

Nebeker:

Was it in '48 that you started at MIT?

Flanagan:

Yes, I worked in the laboratory full time as a graduate assistant, and I took school about half time. So I got my master's degree in 1950. I was considering what to do at that point. Lacking money, I decided I would first earn some money and then pick up graduate school again. I was given a teaching job at Mississippi State in 1950 and taught there two years as Instructor and Assistant Professor. During that time I was lucky enough to get a Rockefeller Foundation Fellowship for Doctoral Study, whereupon I returned to MIT and finished my degree in 1955.

Nebeker:

You went back to the Acoustics Lab?

Flanagan:

Yes. My thesis supervisors were Leo Beranek and Kenneth Stevens. Leo Beranek was one of the founders of Bolt, Beranek and Newman; and Stevens is still a professor at MIT. Others on my thesis committee were Robert Fano, the information theorist, and, let me see, I believe Sam Mason, an electronic circuits professor. There may have been one or two others that I don't presently recall.

Nebeker:

What was your thesis work?

Flanagan:

It was in speech communication, because I had been assigned in the Acoustics Laboratory to work on an Air Force contract with MIT that related to bandwidth compression of speech communications. This is for narrow-band transmission over radio channels. There was one more thing that fueled my interest in communications, particularly voice communication. That was a contract that had been made with the Air Force Cambridge Research Center, which was located in Cambridge. Now I think it is located at Hanscom Field, near Lexington and Lincoln.

Nebeker:

Was this a vocoder-type compression?

Flanagan:


Audio File (MP3): 332 - flanagan - clip 1.mp3


Yes, sort of. By virtue of work that had gone on with the sound spectrograph at Bell Labs and the pattern playback machine at Haskins Laboratories by Frank Cooper, the ways in which information is conveyed in the changing resonances of the vocal tract (the so-called formants) was beginning to be understood. There was speculation about whether it would be possible to make electronic equipment that would, in real time, take continuous speech and extract the formant functions for narrow-band transmission and, ultimately, synthesis at the receiving end.

My thesis was on an automatic formant tracker which involved building a real-time spectrum analyzer. This was essentially a custom-designed filter bank to do the spectrum analysis, an electromechanical scanner that would convert the spectral envelope into signals that could be analyzed, and then an electronic logic set that would attempt to identify the resonant peaks in that spectrum. Part of the thesis was then to evaluate how well that formant tracker did and how well one could synthesize speech from that information.
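
[Editor's note: a minimal modern sketch of the peak-picking idea described here. Flanagan's 1955 tracker was an analog filter bank with an electromechanical scanner and vacuum-tube logic; this digital version, with illustrative frame length and frequency range, shows only the principle of finding resonant peaks in a smoothed spectral envelope.]

```python
# Hypothetical digital sketch of formant-candidate picking from a smoothed
# spectrum; the 1955 tracker did this with an analog filter bank, a scanner,
# and vacuum-tube logic.  All numeric values here are illustrative.
import numpy as np

def formant_candidates(frame, fs, nfft=1024, max_formants=3):
    """Return up to max_formants spectral peak frequencies (Hz) for one frame."""
    windowed = frame * np.hamming(len(frame))
    spectrum = np.abs(np.fft.rfft(windowed, nfft))
    # Smooth the magnitude spectrum to approximate the envelope.
    kernel = np.ones(9) / 9.0
    envelope = np.convolve(spectrum, kernel, mode="same")
    freqs = np.fft.rfftfreq(nfft, d=1.0 / fs)
    peaks = []
    for i in range(1, len(envelope) - 1):
        if envelope[i] > envelope[i - 1] and envelope[i] >= envelope[i + 1]:
            peaks.append((envelope[i], freqs[i]))
    # Keep the strongest peaks in the usual formant range, lowest first.
    peaks = [(a, f) for a, f in peaks if 90.0 < f < 4000.0]
    peaks.sort(reverse=True)
    return sorted(f for _, f in peaks[:max_formants])

# Example: a synthetic frame with resonances near 500, 1500, and 2500 Hz.
fs = 8000
t = np.arange(240) / fs
frame = sum(np.sin(2 * np.pi * f * t) for f in (500.0, 1500.0, 2500.0))
print(formant_candidates(frame, fs))
```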

Nebeker:

Was this actually built?

Flanagan:

Yes, over about a three-year period. It ended up being all vacuum tubes. This was before integrated circuits obviously, even before transistors. It must have been about four, or maybe five, six-foot relay racks of electronic equipment, with lots of heat generated.

Nebeker:

It would take some miniaturization before this could be put into an airplane.

Flanagan:

Oh yes. But we weren't worried about that at the time; we were worried more about finding out if it was possible under any circumstances.

Nebeker:

Did it work?

Flanagan:

Yes. I transmitted speech by formant vocoding this way. Ancillary to it of course was pitch tracking, which had to be done, and voiced/unvoiced detection, similar to what is done in the channel vocoder, which had been researched earlier at Bell Laboratories by Homer Dudley. So that was a reasonably successful venture into automatic formant tracking.

Nebeker:

Did this work of yours attract much attention?

Flanagan:

Yes, I think it did. I defended my thesis on it, and successfully. I wrote, I think, two papers that were published, one on the design of the system and one on the evaluation of the system. I believe they were published jointly in the same issue of the Journal of the Acoustical Society of America. The thesis was also published as a technical report from the MIT Acoustics Laboratory. So this generated some interest in places.

Nebeker:

In 1955 you got your degree?

Flanagan:

Yes, and then I stayed about a year and a half longer at MIT to finish up some reporting and evaluation tests of the work. I published some more of the thesis work on difference limens—just-perceptible differences in formant frequency—for which I still get requests.

That's a sort of interesting story. This was a third paper, fairly short. I sent it in to the Acoustical Society journal, and the reviewer didn't like it very much, saying there wasn't enough statistical analysis of the perceptual data. I had graphed just what I had measured from human listeners, that is, raw data without a lot of statistics. Since the reviewer said it was not acceptable for publication, I spoke to the editor. I said I found the data useful because for the first time it gave a fidelity criterion for transmitting low bit-rate speech information, mainly formant data. It gave for the first time an indication of what accuracy you needed to transmit it, particularly if you aspired to represent it in digital form. So he said, "Let's publish it as a letter to the editor," which he did. I have had more requests for that publication than any other thing I've published. I occasionally still get requests; I am amazed.

Flanagan:

During that time, recruiters came to MIT. I was thinking about what to do over the coming time, so I interviewed with a number of these folks, some from the west coast, some from New England. I had long studied the good work at Bell Laboratories, so when they offered me a job I was very pleased to accept it.

Nebeker:

That was about 1956?

Flanagan:

That was '57.

Acoustics Lab at MIT

Nebeker:

Before we go on with that, could I ask you to describe the Acoustics Lab at MIT in that period?

Flanagan:


Audio File (MP3): 332 - flanagan - clip 2.mp3


The remarkable thing that was very ahead of its time was that it was a multidisciplinary laboratory. I do not think it exists any longer. Professor Bolt was a physicist, Professor Beranek was an electrical engineer, and Professor Newman, who was one of the senior leaders in the laboratory, was an architect. They were the trio that formed the Bolt, Beranek and Newman Company. I was one of their earliest employees, back when the company occupied one room at Harvard Square and had a part-time secretary. Now it's in a high-rise building in Cambridge and world famous.

The laboratory you ask about was multi-disciplinary with people from physics, mathematics, engineering, psychology and architecture. All these folks worked together in a very effective way. The primary thrust of the laboratory was physical acoustics and communications acoustics. Of course, acoustics is an interdisciplinary area, and it was rather enlightening and stimulating to have this mix of people under one roof. We occupied one wing of what was called Building 20.

I believe the building still exists; it was put up during World War II for the Radiation Laboratory (radar research), and I haven't heard about it being torn down. With this mix of talent we would get diverse seminars, with people coming from, for example, ultrasonics. Ted Hueter was a famous ultrasound person who joined the laboratory, and we started learning a little bit about biological uses of ultrasound. There were people that were interested in underwater sound propagation; the laboratory did some transducer work and signal processing work for the Navy.

Nebeker:

Was the work of this lab connected to certain application areas?

Flanagan:

Yes, I think you could say that, although I don't recollect any overt industrial attachment. The contract work was mostly for the Department of Defense. The Air Force interest was in narrow-bandwidth voice communications and how to achieve secure transmission of voice. Some of the Navy interest was in how to make high-intensity sound underwater.

Nebeker:

Some of the research was suggested or stimulated by contractors?

Flanagan:

Yes. There was a strong interest in architectural acoustics and how auditoria could be designed better. This was what Bolt, Beranek and Newman initially built their consulting business on. Much later, that business branched into information sciences, networking, and computing. We did things like mufflers for quieting ventilation ducts. One that I worked on was called the Soundstream Muffler; it had sinusoidal absorbers in the stream flow, which achieved a fair amount of noise attenuation. So noise control was an interest.

We had some collaborations and some students who were in the so-called 6A Co-Op Program that worked with General Radio. General Radio was an instrument manufacturer in Cambridge that made things like electronic noise generators for acoustic measurements. This was primarily an amplified gas diode inside a magnetic field. They made wave analyzers, spectrum analyzers, and sound level meters. So we had an instrumentation activity where some of the 6A students did their co-op at General Radio, which was only a couple of blocks away.

Nebeker:

How large was the Acoustics Lab, with faculty, staff, and students?

Flanagan:

I don't know. I could dig up some old technical reports and look it up, but a ballpark figure would be fifty or sixty people. Half of those probably were graduate students in various phases of training. The other half were either postdocs or faculty members from the participating departments. That's probably about the size that it was. We had a large anechoic chamber, which was unusual at the time. I believe a Navy contract supported it. We had an interesting array of small loudspeakers; they must have been about 4- or 5-inch loudspeakers in a 16 x 16 array in a port in one side of this large anechoic room. So with the proper electronics you could simulate plane waves with any angle of incidence. You could also simulate cylindrical waves. And this worked pretty well as long as the loudspeakers didn't lose their calibration. Sometimes they would age and change a little bit.

Bell Labs

Employment

Nebeker:

Where were you initially at Bell Labs?

Flanagan:

I was recruited into the research department. Well, I actually had offers from two departments. One was concerned with underwater sound signal processing at the Whippany location, and one was the research division at Murray Hill. I elected to go into the research division, and that's where I spent my entire career at Bell Labs.

Nebeker:

Who was your supervisor?

Flanagan:

Ed David initially. In fact he came and interviewed me at Cambridge. He also was an MIT graduate, and he was one of the recruiters. He took me to a very fine lunch at Harvard Square to convince me what a good place Bell Labs was. I didn't tell him that I was already convinced and was dying to come to Bell Labs. I was very happy when I had the offer. I worked in his lab, I think it was called Speech and Image Processing initially—the name changed. But its work was primarily analyzing information-content in voice and image signals and how to transmit this content effectively.

Nebeker:

Was the image processing related to work on a Picturephone?

Flanagan:

This was prior to the Picturephone. It was related to telephonic transmission of video with good quality. The Picturephone was an experiment a little later to try to see if putting image on telephones was feasible.

Digital computation; spectral analysis of speech

Flanagan:

Of course, when I joined in '57 digital computation hadn't developed very much. Just as I was finishing up at MIT one of the first digital computers was being implemented, a computer called Whirlwind, which I believe evolved into the DEC PDP line. First it evolved into the TX-0, then into a spin-off that was essentially the PDP-1 of Digital Equipment Corporation. I got to Bell Labs in '57, and very shortly after that they got their first digital computer, an IBM 650.

Nebeker:

Was the first one at Bell Labs in the research division?

Flanagan:

I believe it was available to all research and development activity at Murray Hill. It was a vacuum-tube machine and was designed as a bi-quinary architecture, literally an electronic abacus. In fact, the display registers on the front looked like an abacus, 1-2, 1-2-3-4, and the lights and the logic looked like an abacus when you cycled it. It had a magnetic drum store, but no compiler, no assembler. I had to learn binary code for the machine. My first program was a Fourier transform program written in binary. Very tedious.

Nebeker:

Was it you that requested that Bell Labs acquire this computer?

Flanagan:

No. Bell Labs got it as a general purpose machine. The researchers could not operate it hands-on; you had to submit jobs from a punch-card deck.

Once I had my Fourier transform program, I took recordings of speech signals off an oscillogram, from a pen recorder that had sufficient bandwidth, or else photographed them from an oscilloscope and then measured the amplitude values with a pair of dividers at a rate that would satisfy the Nyquist sampling theorem.
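
[Editor's note: an illustrative sketch of the computation described here, a direct discrete Fourier transform of hand-sampled amplitude values. The FFT did not yet exist, so each term of the sum was evaluated explicitly; the sample rate and test tone in the example are hypothetical.]

```python
# Illustrative direct Fourier transform of hand-sampled amplitude values,
# the kind of computation the early binary-coded program performed
# (no FFT yet, so the sum is evaluated term by term).
import math

def dft_magnitudes(samples, sample_rate):
    """Return (frequency_hz, magnitude) pairs for a list of real samples."""
    n = len(samples)
    result = []
    for k in range(n // 2 + 1):          # only non-negative frequencies
        re = sum(samples[m] * math.cos(2 * math.pi * k * m / n) for m in range(n))
        im = -sum(samples[m] * math.sin(2 * math.pi * k * m / n) for m in range(n))
        result.append((k * sample_rate / n, math.hypot(re, im)))
    return result

# Example: 64 samples of a 1 kHz tone read off at an 8 kHz sampling rate.
fs = 8000
samples = [math.sin(2 * math.pi * 1000 * m / fs) for m in range(64)]
for freq, mag in dft_magnitudes(samples, fs):
    if mag > 1.0:
        print(f"{freq:7.1f} Hz  magnitude {mag:.1f}")
```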

Nebeker:

That was something you did by hand?

Flanagan:

Yes.

Nebeker:

And then you'd punch them in on cards?

Flanagan:

Then I punched the amplitude value on the cards. There were no A-to-D converters at that time, so I calculated spectra for speech signals in this way.

Nebeker:

Had such hand calculations of the spectral analysis of speech been done before?

Flanagan:

There were analog spectrum analyzers like the sound spectrograph. I don't know whether there were any digital versions.

Nebeker:

So this may have been the first digital spectral analysis?

Flanagan:

Well, it was probably among the first. But it's a slow way to do it. It's better to have a fast analog-to-digital converter.

Spectral shape of vocal cord sound; bandwidth compression

Nebeker:

What came out of this?

Flanagan:

One of the things that I was looking at, besides the formant structure of speech signals, was the spectral shape of the vocal cord sound. I generated waveforms for this first as area functions of the oscillatory vocal-cord opening obtained from high-speed photography and a dental mirror. This was from film that had been taken by D.W. Farnsworth. I think it was four thousand frames per second with a Fastax camera. I generated area functions and also tried to measure the volume velocity with a hot-wire anemometer. It was a little tricky putting that thing down your throat, but we did get some useful data. From the area function I was able to estimate the acoustic volume flow through the cords. So from that I could get some estimate of what the source spectrum for the vocal-cord source was. And that's important for speech synthesis.

Nebeker:

That was your goal at that time, to do synthesis better?

Flanagan:

To understand the parameters and how they influenced synthesis in an analysis-synthesis system; especially in a low bit-rate transmission system where you had to analyze in terms of certain parameters and then synthesize.

Nebeker:

Was this kind of bandwidth compression a major activity in that Bell Labs group?

Flanagan:

Bandwidth compression of various types was. Others were looking at pitch-tracking techniques. As far as I know I was the only one looking at a vocal cord source.

Nebeker:

The big techniques that you were trying to improve were in speech compression?

Flanagan:

That was what was driving our speech effort up to about 1970, until computers developed enough to get signals in and out easily and until there were techniques for analyzing and representing information through digital analysis. Prior to around 1970 the driving objective was bandwidth compression, that is, making high-quality transmission as efficient as you could make it. Another element was getting a secure transmission over an ordinary voice channel—how to scramble it without multiplying the bandwidth.

Voice over telegraph cable; vocoders and voders

Nebeker:

One entry in your time-line is the goal of voice over telegraph cable.

Flanagan:

That's what started the whole vocoder business. The first telegraph cables that were put across the Atlantic, I don't remember when those were.

Nebeker:

Around 1860.

Flanagan:

The bandwidth of the telegraph cable was very, very small, perhaps one hundred to two hundred hertz. You cannot get conventional voice waveform signals over that. So the question was whether there was some way to compress the voice information and transmit it over a 200 hertz bandwidth, and that's what drove the original channel vocoder that Homer Dudley did at Bell Labs.

Nebeker:

Wasn't it also the prospect that you could put a large number of voice channels over a regular telephone channel?

Flanagan:

Yes, multiply the number of voices. Looking at the time-line, I see commercial telegraph dated at 1844, transatlantic telegraph 1858, and telephone shortly thereafter. The transcontinental telephone, 1915. But Dudley and Riesz and others were looking at analog methods of compressing, and they did the vocoder and voder. I had some old pictures of the voder in operation.

Nebeker:

I've seen pictures from the '39 World's Fair.

Flanagan:

I got '39 World's Fair pictures out of some archive at Bell Labs. When I was appointed the department head, after a few years at Bell Labs, Homer Dudley was in my department, and I was very honored to have him in my research department. We helped get him an IEEE Society award several years following his retirement. I'm trying to read the date on the badge he has on here; it looks like about 1966. This is an IEEE meeting sponsored by the section originally devoted to audio and acoustics, I believe.

[Pointing to a photograph] Frank Cooper is right there. I'm sitting here. This is Ben Gold, John Pierce, Homer Dudley, Reg Kaenel. My recollection is this was an IEEE meeting in Boston or Cambridge when he got that award.

When Homer retired from Bell Labs, I found one of the original ladies trained to operate the voder for the World's Fair. She had retired earlier. We found one of the voders in the basement of Bell Labs and cleaned it up. We got some very antique vacuum tubes at one of the distributors on Route 22, from their attic or someplace. We got it working, and she agreed to come to Homer Dudley's retirement. We asked, "Do you think you can make this thing talk?" She said, "Oh sure. What do you want it to say?" "Say, 'hello, how are you?'" She sat down and gave a virtuoso performance on the voder.

Nebeker:

And it worked?

Flanagan:

Just incredible. I never thought I'd see it.

Digitally encrypted voice

Nebeker:

You have another note here that I'm curious about: "digitally encrypted."

Flanagan:

Besides the goal of sending voice over telegraph cable, World War II gave a big impetus to get encrypted voice over ordinary voice channels. I guess transatlantic voice cable did not come until much later, but there was the radio telephone. The vocoder with scrambled signals was put on radio telephones. Churchill and Roosevelt spoke over that.

Nebeker:

That was because when you have a signal digitally encoded, it is easier to scramble and unscramble it.

Flanagan:

It’s more secure. But I think the first encryption was not really digital. It was something like an analog one-time pad, where you synchronized scrambling signals with a disk.

Nebeker:

Disks that they use at each end?

Flanagan:

Yes. I'm always astonished that voice-over-undersea cable didn't come until 1956, the year before I went to Bell Labs. That was the first transatlantic cable.

Nebeker:

It was the difficulty of repeaters under the sea?

Flanagan:

Yes.

Expanding role of computers in digital signal processing after 1970

Nebeker:

This time-line is very helpful. If we look at this period, you've drawn a dividing line at 1970. What is to the left of that?

Flanagan:


Audio File (MP3): 332 - flanagan - clip 3.mp3


To the left of that line the driver was a low bit-rate representation, transmitting high-quality signals as efficiently as you could make them, as well as eventually getting digital encryption over voice circuits. By 1970 computers were emerging rather rapidly. I should point out that to the left of 1970, if I have the time right, we went from the IBM 650, which was a drum machine, to the very fast IBM 704, which was a vacuum-tube and magnetic-core machine, and then shortly thereafter to an IBM 7090, which was a discrete transistor machine. But in the late '60s and early '70s, we started trying to acquire small computers to use in the laboratory as dedicated signal processors. To get A-to-D and D-to-A conversion on those we used things like the PDP-8, the DDP-24 (that company doesn't exist anymore), the DDP-224, and the DDP-516, which was the first integrated circuit machine that I got in my lab. It looked like a washing machine. It was made by 3C Corporation, which was later bought by Honeywell. I don't remember who owned it right then, maybe Honeywell owned it, but it was the first integrated-circuit machine that I had ever gotten my hands on. It had a marvelous 8-K word store, half of which was used by the FORTRAN compiler, so you had 4-K words of memory to compute with.

We were in heaven. It was a one microsecond machine. Now we have personal computers and signal-processor chips that are 200 megahertz or more.

Nebeker:

In looking at your 1965 book I was very impressed by how much you yourself used computers in those early years, both to do calculations and to simulate phenomena. You must have been very early in those applications.

Flanagan:

We found that you could get useful solutions for just about any signal process that you could turn into difference equations. That's the way we did most of the speech synthesis, with vocal tract simulation by differential equations, then solving them simultaneously. Not in real time, obviously. This early speech processing work required sampled data, and the understanding of sampled data signals. But it also needed all kinds of filtering and spectral analysis.
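
[Editor's note: a minimal sketch of synthesis by difference equations, assuming a single vocal-tract resonance modeled as a two-pole digital resonator driven by an impulse train at the pitch period. The actual simulations solved many such equations simultaneously; all parameter values here are illustrative.]

```python
# Minimal sketch of synthesis by difference equations: one vocal-tract
# resonance (formant) modeled as a two-pole digital resonator driven by an
# impulse train at the pitch period.  Parameter values are illustrative only.
import math

def resonator_coeffs(formant_hz, bandwidth_hz, fs):
    """Two-pole resonator y[n] = b*x[n] + a1*y[n-1] + a2*y[n-2]."""
    r = math.exp(-math.pi * bandwidth_hz / fs)
    a1 = 2.0 * r * math.cos(2.0 * math.pi * formant_hz / fs)
    a2 = -r * r
    b = 1.0 - a1 - a2          # unity gain at DC
    return b, a1, a2

def synthesize(fs=8000, dur=0.5, pitch_hz=100.0, formant_hz=500.0, bw_hz=60.0):
    b, a1, a2 = resonator_coeffs(formant_hz, bw_hz, fs)
    period = int(fs / pitch_hz)
    y1 = y2 = 0.0
    out = []
    for n in range(int(dur * fs)):
        x = 1.0 if n % period == 0 else 0.0      # glottal impulse train
        y = b * x + a1 * y1 + a2 * y2            # the difference equation
        out.append(y)
        y1, y2 = y, y1
    return out

samples = synthesize()
print(len(samples), "samples, peak", max(abs(s) for s in samples))
```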

Influence of speech processing on digital signal processing

Flanagan:

I think the whole area of digital signal processing, particularly digital filter design, was driven by the speech processing community. I made a mark here.

Roger Golden and I did something called the phase vocoder in 1966. This required simulation of electrical filters. We had some infinite impulse response filters that approximated Bessel characteristics. We hadn't thought about finite impulse response filters very much then—they were developed a bit later—but we used the IIR designs to good effect. The whole business of having to do filtering of signals, spectral analysis, and algorithmic operations on sampled data, of recognizing what happens when you square a signal or take a square root, or watching what happens to the bandwidth, this all drove the development of digital signal processing at that time. There might have been a parallel in image processing that I do not know about, but speech processing was a research activity that galvanized digital signal techniques.
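
[Editor's note: a compressed, modern sketch of the phase-vocoder idea, short-time analysis into magnitudes and phases followed by resynthesis with modified timing. The 1966 Flanagan-Golden implementation simulated a bank of bandpass filters; this FFT-based form and its parameters are only illustrative.]

```python
# Modern FFT-based sketch of the phase-vocoder principle (analysis into
# magnitude and phase, resynthesis with a different hop so the signal is
# time-stretched without changing pitch).  Parameters are illustrative.
import numpy as np

def phase_vocoder_stretch(x, stretch=1.5, n_fft=512, hop=128):
    """Time-stretch signal x by 'stretch' while preserving pitch."""
    window = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * window
              for i in range(0, len(x) - n_fft, hop)]
    spectra = [np.fft.rfft(f) for f in frames]

    bin_freqs = 2.0 * np.pi * np.arange(n_fft // 2 + 1) / n_fft
    out_hop = int(hop * stretch)
    phase = np.angle(spectra[0])
    y = np.zeros(out_hop * len(spectra) + n_fft)

    for k in range(1, len(spectra)):
        mag = np.abs(spectra[k])
        # Phase advance between analysis frames, unwrapped around the bin freq.
        dphi = np.angle(spectra[k]) - np.angle(spectra[k - 1]) - bin_freqs * hop
        dphi = (dphi + np.pi) % (2.0 * np.pi) - np.pi
        true_freq = bin_freqs + dphi / hop
        phase = phase + true_freq * out_hop      # advance by the synthesis hop
        frame = np.fft.irfft(mag * np.exp(1j * phase)) * window
        start = k * out_hop
        y[start:start + n_fft] += frame
    return y

fs = 8000
t = np.arange(fs) / fs
tone = np.sin(2 * np.pi * 440 * t)
stretched = phase_vocoder_stretch(tone)
print(len(tone), "->", len(stretched), "samples")
```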

Nebeker:

So as far as you know, these digital techniques were being developed there in that lab?

Flanagan:

I'm sure there were others that were doing similar things. I think of the people at Lincoln Laboratory, such as Ben Gold.

Nebeker:

Also in speech processing?

Flanagan:

Yes, he was doing vocoders.

Nebeker:

What about work elsewhere on digital filters?

Flanagan:

If it was going on at that time, I didn't know about it. I certainly would have used it if I had found it. But I'm sure there were others studying the problem. It was not easy to pull the filter design off the shelf.

Nebeker:

It must have been a fairly common thing in those years for people to work on ADCs in order to make use of the computational capacities of the computer.

Flanagan:

Yes, although they were horribly expensive initially, not very fast, and not giving very many bits.

Nebeker:

Were you able to buy an ADC off the shelf, in those years?

Flanagan:

No. I believe we designed and built it ourselves. It was fairly expensive. We put it on a digital tape transport. We put the speech signal through the A-to-D, wrote a tape, pulled the tape off, took it to the computer center, and then handed them the tape and let them process it and write the result. Then we could bring it back. The D-to-A was easier.

Nebeker:

There must have been a lot of areas where they wanted to digitize data, but treat it as they had before. You wanted to be able to do with digital techniques things that were done before with analog circuits, like filters, and of course you needed the D-to-A conversion in certain applications.

Flanagan:

Right. Initially I guess we tried to duplicate the analog domain digitally, which was making work for ourselves. We were trying to duplicate elliptic filters, Bessel filters, all these infinite impulse response things. Some of this can be done more easily with finite impulse response filters, which have only zeros, whereas the IIR filters try to put a pole constellation where you want it.
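
[Editor's note: an illustrative windowed-sinc FIR low-pass design, an all-zero filter specified directly by its impulse response, in contrast to the pole-based elliptic and Bessel designs mentioned above. The cutoff and sampling rate are examples only.]

```python
# Illustrative windowed-sinc FIR low-pass: an all-zero filter given directly
# by its impulse response.  Cutoff and sampling rate are examples only.
import math

def fir_lowpass(cutoff_hz, fs, num_taps=51):
    """Return FIR coefficients for a linear-phase low-pass filter."""
    fc = cutoff_hz / fs                      # normalized cutoff (cycles/sample)
    mid = (num_taps - 1) / 2.0
    taps = []
    for n in range(num_taps):
        k = n - mid
        h = 2.0 * fc if k == 0 else math.sin(2.0 * math.pi * fc * k) / (math.pi * k)
        w = 0.54 - 0.46 * math.cos(2.0 * math.pi * n / (num_taps - 1))  # Hamming
        taps.append(h * w)
    return taps

def fir_filter(x, taps):
    """Direct-form convolution y[n] = sum_k taps[k] * x[n-k]."""
    y = []
    for n in range(len(x)):
        acc = 0.0
        for k, t in enumerate(taps):
            if n - k >= 0:
                acc += t * x[n - k]
        y.append(acc)
    return y

fs = 8000.0
taps = fir_lowpass(cutoff_hz=1000.0, fs=fs)
x = [math.sin(2 * math.pi * 500 * n / fs) + math.sin(2 * math.pi * 3000 * n / fs)
     for n in range(400)]
y = fir_filter(x, taps)
# After the filter settles, the 3 kHz component is strongly attenuated.
print("input peak:", round(max(x[100:]), 2), " output peak:", round(max(y[100:]), 2))
```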

Nebeker:

In other notes you've made for this early period, you indicate the fast Fourier transform. What was the initial application of this in your experience?

Flanagan:

For spectral analysis of speech, to get formant information, spectral envelope, and pitch information. This tool was again, I think, driven by speech processing, and possibly by seismic analysis. There was an early workshop at Columbia's Arden House in upstate New York. It was organized by Bill Lang, who was at IBM at the time, heading their Acoustics Laboratory. He was very active in IEEE in the acoustics and speech section. He was instrumental in getting a lot of new application work on the FFT.
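
[Editor's note: one illustrative FFT use in speech analysis, not necessarily the method used at the time: estimating pitch from the autocorrelation, computed efficiently as the inverse FFT of the power spectrum. The frame and search range are hypothetical.]

```python
# Illustrative FFT-based pitch estimation: autocorrelation via the power
# spectrum (Wiener-Khinchin), then a peak search in a plausible pitch range.
# Not necessarily the method used at the time; values are hypothetical.
import numpy as np

def pitch_estimate(frame, fs, fmin=60.0, fmax=400.0):
    """Estimate fundamental frequency (Hz) of one voiced frame."""
    frame = frame - np.mean(frame)
    n = 1
    while n < 2 * len(frame):
        n *= 2
    spectrum = np.fft.rfft(frame, n)
    autocorr = np.fft.irfft(np.abs(spectrum) ** 2)   # inverse FFT of power spectrum
    lag_min = int(fs / fmax)
    lag_max = int(fs / fmin)
    lag = lag_min + int(np.argmax(autocorr[lag_min:lag_max]))
    return fs / lag

fs = 8000
t = np.arange(400) / fs
voiced = np.sign(np.sin(2 * np.pi * 120 * t))        # crude 120 Hz "voiced" frame
print(round(pitch_estimate(voiced, fs), 1), "Hz")
```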

IEEE Signal Processing Society; growth of DSP field

Nebeker:

Could I ask you about a characterization I read in Rabiner's article giving an early history of the Society [Editor’s Note: IEEE Signal Processing Society].

Of course in the early days the Society was the Professional Group on Audio of the IRE. He wrote that in the mid- or late '60s the Professional Group attracted the speech processing community, which had largely been outside the PGA earlier.

Flanagan:

I think that is right. Most of the players here belonged to a couple of societies, one of which was the Acoustical Society. The thrust of the Acoustical Society was basic acoustics work, a little different from what I like to call communications acoustics, which is why I labeled my chapter "Communications Acoustics." The communications acoustics side got a lot more into signal processing: how to represent information, transmit it, and reconstruct it. The Acoustical Society—I still go to most of their meetings as I go to those of the IEEE—was and is heavy on the physics of sound propagation: the behavior of sound in enclosures, transducers, how to make microphones, loudspeakers, how to calibrate them, sound absorbing materials for architectural acoustics, atmospheric propagation, and noise control. Also in basic psychoacoustics and physiological acoustics. Those things were somewhat outside the signal processing community, at least initially, so the people doing computer simulation and trying to develop signal processing techniques gravitated towards the Professional Group on Audio, which was later called Speech and Acoustics. An early stimulus was this banding together in the workshop on the fast Fourier transform. It brought all these people together to talk about how to represent filters and how to do spectral analysis.

Nebeker:

Is it right to say that in the '60s there wasn't a field thought of as signal processing?

Flanagan:

Yes, I think that’s true, in the sense of a recognized, established discipline.

Nebeker:

Here's what's curious to an outsider. If one looks at this huge 50th anniversary volume the IRE did in '62, the Professional Group on Audio commissioned six or seven chapters, and there were all these things like loudspeaker design, magnetic recording, all audio things, and nothing in there suggests speech processing or signal processing as it's thought of now.

Flanagan:

No.

Nebeker:

Yet by the late '60s, the society has attracted or incorporated the speech processing community.

Flanagan:

My perception is that DSP (digital signal processing) grew out of this community that was doing mostly speech processing and especially the engineering applications of speech processing. There might have been a parallel in image and data too, but I don't know about it.

Nebeker:

So we have a picture of DSP emerging in the late '60s and finding a home in the Professional Group on Audio and Electroacoustics, and then expanding enormously in its many areas of application.

Fumitada Itakura and speech recognition; Bell Labs/NTT exchanges

Flanagan:

Well, we went on. Once we had computers that could do reasonably fast simulations and get signals in and out, we could design things like adaptive differential PCM, which is now deployed in the telephone system. This technique conserves bandwidth and doubles the capacity of PCM channels. We also started automatic speech recognition, which had been an interest earlier at Bell Labs. They had built analog digit recognizers. So we had computers that were a lot more sophisticated. One of the early techniques was dynamic time warping to match spectral patterns.
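
[Editor's note: a toy ADPCM-style coder, quantizing the difference between the signal and a prediction and adapting the quantizer step size. The ADPCM actually deployed in telephony is considerably more refined; this sketch only illustrates why adaptive differential coding can halve the bit rate of ordinary PCM.]

```python
# Toy ADPCM-style coder: quantize the difference between the signal and a
# running prediction with 4 bits, and adapt the step size to the signal.
# The deployed telephone-network ADPCM is far more refined; this is a sketch.
import math

def adpcm_encode(samples):
    predicted, step = 0.0, 1.0
    codes = []
    for s in samples:
        diff = s - predicted
        code = max(-8, min(7, int(round(diff / step))))   # 4-bit code
        codes.append(code)
        predicted += code * step                          # decoder's reconstruction
        step = max(0.5, step * (1.3 if abs(code) > 4 else 0.9))  # adapt step size
    return codes

def adpcm_decode(codes):
    predicted, step = 0.0, 1.0
    out = []
    for code in codes:
        predicted += code * step
        out.append(predicted)
        step = max(0.5, step * (1.3 if abs(code) > 4 else 0.9))
    return out

fs = 8000
x = [100.0 * math.sin(2 * math.pi * 200 * n / fs) for n in range(200)]
y = adpcm_decode(adpcm_encode(x))
err = max(abs(a - b) for a, b in zip(x, y))
print("max reconstruction error:", round(err, 1))
```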

Nebeker:

Is that the work of Fumitada Itakura?

Flanagan:

Yes. After some effort, I got agreement to make some exchanges of scientists in our research organization at Bell Labs and those at NTT in Japan. The two companies would send scientists back and forth to work on fundamental problems. They would be paid by and remain employees of the parent company. Over the years, this has continued and has been a very interesting stimulus, getting new views into your work. Itakura, who is now a famous professor in Japan, was the first person I got to come from NTT.

Nebeker:

That was your initiative?

Flanagan:

Yes.

Nebeker:

Was it Itakura in particular that you wanted, or was the idea just to have some kind of exchange?

Flanagan:

Earlier I had been to the International Congress on Acoustics, which happened to be held at that time in Tokyo. It was the first time I'd ever been to Tokyo.

Itakura gave a paper on low bit-rate coding of speech which was exceedingly impressive. He and a senior researcher named Saito developed a linear predictive technique that had an exceedingly fine result. So I was quite interested in trying to have him come to the US. We finally got that arranged and, by mutual agreement, he joined my department. Speech recognition had lain dormant with us for quite a while, and he came with an interest in trying some limited vocabulary speech recognition on one of our laboratory machines. It was a Data General Nova computer, as I recall. We later upgraded that with Eclipses and Alliants and other machines. Itakura made a so-called dynamic time-warp pattern-matching recognizer, using a distance measure that came to be called the Itakura Distance Metric that was very successful. His work, along with some speaker identification work that we had been carrying on at the same time, got us back into the speech recognition business.
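
[Editor's note: a minimal dynamic time-warping sketch that aligns two sequences of per-frame feature vectors and returns the cumulative distortion. A plain Euclidean frame distance stands in here for the LPC-based Itakura distance used in the original recognizer; the toy templates are invented.]

```python
# Minimal dynamic time warping: align a test utterance to each stored template
# and keep the template with the smallest cumulative distortion.  A Euclidean
# frame distance stands in for the LPC-based Itakura distance; data are toys.
import math

def frame_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dtw(ref, test):
    """Cumulative distance of the best time-warped alignment of test to ref."""
    n, m = len(ref), len(test)
    INF = float("inf")
    d = [[INF] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = frame_distance(ref[i - 1], test[j - 1])
            d[i][j] = cost + min(d[i - 1][j],      # stretch the test frame
                                 d[i][j - 1],      # skip a test frame
                                 d[i - 1][j - 1])  # match one-to-one
    return d[n][m]

# Toy templates: the 'test' word is a time-stretched version of template A.
template_a = [[0.0], [1.0], [2.0], [3.0], [2.0], [1.0]]
template_b = [[3.0], [3.0], [0.0], [0.0], [3.0], [3.0]]
test_word = [[0.0], [0.5], [1.0], [2.0], [2.5], [3.0], [2.0], [1.0]]
scores = {"A": dtw(template_a, test_word), "B": dtw(template_b, test_word)}
print("best match:", min(scores, key=scores.get), scores)
```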

Nebeker:

What is the explanation for the dormancy in work on speech recognition?

Flanagan:

It was a judgment on the part of our research management that our efforts would be more constructive elsewhere.

Nebeker:

That it wasn't a practical technology?

Flanagan:

That it was too early to be practical. The talker verification work did lead to some patents, as a matter of fact.

Nebeker:

Has speaker verification found much application?

Flanagan:

I don't know of any wide application, yet. There's a start-up company that has developed it for keeping track of incarcerated individuals. Instead of wearing the ankle bracelet they call in and are identified. I think there is a latent market for verification when used in combination with speech recognition, a latent market that has not yet been tapped. I think as we see speech recognition widely deployed, we'll see speaker verification included, since the same technology and machinery support both. There are also special applications of speaker ID. I was interested to see one of these patents written up by Stacy Jones, who used to write up patents in The New York Times. This one was from 1972 for an automatic verifier that ran on the DDP-516 computer in real time. An interesting thing about this photo is that down here in the right-hand corner, we see that the Dow Jones average was at 946.

Computers and speech signal processing

Flanagan:

About the same time, working in related directions, there were L.R. Rabiner and Ron Schafer, who is now a professor at Georgia Tech. We were doing formant synthesis for computer voice response, also on that early machine, the first integrated-circuit machine that we had in the lab.

Nebeker:

It sounds like your research was very much influenced by what computers you had.

Flanagan:

Yes, indeed. It was dependent on what the computers were able to do. You didn't go into a huge project that had no chance of computing a result in your lifetime. I did get close to that, however, with a colleague named Kenzo Ishizaka, who is a professor in Japan now. He worked with us for a number of years. We tried to make a voice mimic system—it's still a current research problem—in which the computer models the articulatory system, and synthesizes a signal that tries to match arbitrary natural speech input. It looks at the spectral difference between the natural input and the synthetic version.

Then it tries to drive the error to zero by adjusting the parameters of the articulatory model. It's a gradient descent algorithm. We ran our first version of this on a Cray-1, and the synthesizer ran something like several hundred times real time. So now we have it down to five or six times real time on a Cray C-90. So we're still a ways from putting it in a little box. But you have to find out if it's possible first, so you use all the computer cycles you can get your hands on to find out if the idea works.
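
[Editor's note: a schematic analysis-by-synthesis loop in the spirit described here: adjust synthesizer parameters by numerical gradient descent so the synthetic spectrum matches a target. The 'synthesizer' below is a two-parameter stand-in; the real articulatory model drives a full vocal-tract simulation and is vastly more expensive per evaluation.]

```python
# Schematic analysis-by-synthesis: adjust parameters by numerical gradient
# descent to drive the spectral error toward zero.  The "synthesizer" is a
# two-parameter stand-in, not the articulatory vocal-tract model.
import numpy as np

def toy_synthesizer(params, n_bins=64):
    """Stand-in spectral model: weighted sum of two fixed resonance shapes."""
    f = np.linspace(0.0, 1.0, n_bins)
    hump1 = np.exp(-((f - 0.25) ** 2) / 0.002)
    hump2 = np.exp(-((f - 0.62) ** 2) / 0.004)
    return params[0] * hump1 + params[1] * hump2

def spectral_error(params, target):
    return float(np.sum((toy_synthesizer(params) - target) ** 2))

def mimic(target, start, lr=0.02, eps=1e-5, steps=400):
    params = np.array(start, dtype=float)
    for _ in range(steps):
        base = spectral_error(params, target)
        grad = np.zeros_like(params)
        for i in range(len(params)):                 # numerical gradient
            bumped = params.copy()
            bumped[i] += eps
            grad[i] = (spectral_error(bumped, target) - base) / eps
        params -= lr * grad                          # drive the error downward
    return params

true_params = [1.0, 0.6]
target_spectrum = toy_synthesizer(true_params)
estimate = mimic(target_spectrum, start=[0.1, 0.1])
print("recovered parameters:", np.round(estimate, 3),
      " residual error:", round(spectral_error(estimate, target_spectrum), 6))
```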

Nebeker:

If computing power increases enough that that becomes economically possible in real time, what would the application be for mimicking the speech of an individual at the receiving end?

Flanagan:

In this case you speak into this articulatory mimic that models the vocal tract. It has a very small set of orthonormal, slowly varying parameters, hence a very low bit-rate. So it's almost the ultimate in bandwidth compression.

Nebeker:

And at the same time you'd have something that really sounds like that person at the other end.

Flanagan:

Yes, that's the ideal. Also, if you could close this loop at a discrete symbol level, say printed text level, you've solved the voice typewriter problem. You speak into the microphone, and it prints out the text. Or you transmit the text and synthesize speech from it at the receiver. So if you solve that problem you've solved speech coding, speech synthesis, and speech recognition, because they coalesce into the same problem.

Nebeker:

How successful is it so far?

Flanagan:

Fair. I'll play you a tape before you leave. It's just vowel consonant sequences and one sentence, and it doesn't run nearly in real time.

Other speech research at Bell Labs and Rutgers; Defense Department project

Flanagan:

But I haven't touched on speech research dimensions other than signal processing. We did a lot of work in microphone systems, transducers, teleconferencing systems, microphone arrays—we're still working on that. We started some things at Bell Labs, including a system called HuMaNet, which was a distributed, collaborative conferencing system. I think it is at the frontier of multi-modal interfaces for human users who are trying to collaborate over networked computers. They may be geographically separated, they may be using a high-speed network connecting them, they have shared work space on their screens, but they need to communicate with the computers and with each other in natural ways, hands-free, by sight, sound, and touch.

Nebeker:

Touch. How would that be?

Flanagan:

The tactile interface we are working on here at Rutgers is a force-feedback glove that senses joint motion. There is a Polhemus coil on the back of the wrist so you know what the absolute position is. It has pneumatic thrusters on the fingers to supply force feedback and to detect finger motion. So, you can program some virtual object on the computer. You put on 3-D glasses and see a three-dimensional display, put your hand in the glove, reach in, and feel the shape and compliance of it. Or, you can play handball with it: catch the ball and feel when you've grasped it. Using this tactile force-feedback device, we're working with the medical school here on training medical students in palpating body tissue. They don't get to work on live patients very much, so we have the digital human, a digitized torso, and we can put abnormalities in different places. They put their hand in the glove and try to detect whether there is something not right.

Nebeker:

This multimodal system, is that HuMaNet?

Flanagan:

That was a first early effort at it.

Nebeker:

That was a particular project?

Flanagan:

It had hands-free control of a speech recognizer that you could use to set up a conference call. Other participants could come up on the video; you could talk to them, and you could control the features of the system by commands to the speech recognizer. There were data displays and image displays from a central server. We're trying to elaborate more in that direction. We just got a research contract from the National Science Foundation to do multimodal interfaces for collaborative distributed networks, and we are carrying this forward at Rutgers.

Nebeker:

The work on HuMaNet was at Bell Labs?

Flanagan:

Yes.

Nebeker:

When did you leave Bell Labs?

Flanagan:

1990.

Nebeker:

But you've carried such work to Rutgers?

Flanagan:

Some of the things I've done here on microphone arrays were things the National Science Foundation wanted done. We have one speech recognition project for the Defense Department; we have done speaker identification for the Defense Department; and we have a sizable contract on distributed computing for the Defense Advanced Research Projects Agency (DARPA). We are doing the multimodal interface work on a contract we just received from the National Science Foundation. We are continuing in the direction of integrating interface technologies based on image processing, eye tracking, gaze determination, hands-free sound capture, conversational interaction with a machine (speech recognition, speech synthesis answer back) and tactile interaction.

Imagine a conference system—the Defense Department is interested in this—where you have a mission planning session by participants who are geographically separated. They all see the same situation map, and you may have to move icons on the map. You can reach with the tactile glove and grab one, pick it up, and put it over there; or you can say, "Move object A to point X," and the speech recognizer does it, or you could use gesture and speech recognition, "Move this to there," pointing and speaking simultaneously. Then you've got to fuse the input of voice and gesture. All of these are imperfect technologies, so you need a software agent that will examine the sensor inputs—all of which are imperfect and error-susceptible—and try to make a reliable decision from it.

Nebeker:

I'm particularly interested in the recent period, because we have more sources for the earlier development of signal processing. Could we look at your timeline here for the last decade or so?

Flanagan:

One thing that deserves mentioning is that we did some of the first autodirective array microphones. If you want to see one in operation, go to the auditorium at Murray Hill. We have a large, two-dimensional array of microphones up on the front ceiling. Its 400 electret microphones are steered by a computer that you don't see. Four-hundred people can sit in the audience, and you can hold an interlocation conference. If somebody in the audience wants to ask a question or speak to somebody remotely, the array will locate the sound source, point the beam there, and capture a reasonably good signal. So you don't have to give everybody a hand-held or clip-on microphone.

Now we are trying to do even better than that. We have the new generation microphone system here at Rutgers, and we've got a lot more processing power now. We can afford almost a DSP chip per microphone, and we're doing much more than just beam forming. We're doing matched-filter processing of every sensor in a 400-microphone array. There are some great advantages to that, not the least being that you get three-dimensional spatial selectivity, rather than just the two-dimensional selectivity of a single beam. You can capture sound from a spatial volume.
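
[Editor's note: a basic delay-and-sum beamforming sketch toward a known source point, delaying each microphone signal so sound from that point adds coherently. The matched-filter processing described above goes further, filtering each sensor with an estimate of its room impulse response; the geometry and signals here are invented.]

```python
# Basic delay-and-sum beamforming toward a known source point: delay each
# microphone signal so that sound from that point adds coherently, then sum.
# Geometry and signals are invented; a real system interpolates fractional
# delays and may apply a matched filter per sensor instead of a pure delay.
import numpy as np

SPEED_OF_SOUND = 343.0   # m/s

def delay_and_sum(signals, mic_positions, source_xyz, fs):
    """signals: list of equal-length 1-D arrays, one per microphone."""
    mic_positions = np.asarray(mic_positions, dtype=float)
    source = np.asarray(source_xyz, dtype=float)
    distances = np.linalg.norm(mic_positions - source, axis=1)
    # Advance each channel by its relative delay (integer-sample approximation).
    delays = np.round((distances - distances.min()) / SPEED_OF_SOUND * fs).astype(int)
    length = len(signals[0])
    out = np.zeros(length)
    for sig, d in zip(signals, delays):
        out[:length - d] += sig[d:]
    return out / len(signals)

# Tiny example: 4 microphones on a wall, source about 2 m in front of them.
fs = 16000
mics = [(0.0, 0.0, 0.0), (0.3, 0.0, 0.0), (0.0, 0.3, 0.0), (0.3, 0.3, 0.0)]
source = (0.5, 0.1, 2.0)
t = np.arange(1024) / fs
clean = np.sin(2 * np.pi * 500 * t)
dists = [np.linalg.norm(np.array(m) - np.array(source)) for m in mics]
sigs = [np.roll(clean, int(round(d / SPEED_OF_SOUND * fs))) for d in dists]
beam = delay_and_sum(sigs, mics, source, fs)
print("beam output RMS:", round(float(np.sqrt(np.mean(beam ** 2))), 3))
```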

Nebeker:

These systems are called autodirective arrays?

Flanagan:

Yes. There is another component of the system that finds the source. There are two arrays on orthogonal walls, little quads, for which the time delay of arrival is computed for every possible pair—giving six delay estimates for each quad.

We now find the XYZ coordinate of the source that produces that set of time delays of arrival. Each delay estimate defines a paraboloid; where the surfaces intersect is taken as the source location. Generally this is not a perfect point, but it's usually fairly accurate. It's over-determined, but you need some extra reliability. The sound source may be a talker moving about the room, and the system tracks the source. We have a video camera slaved to the source locator, as well as the beam-forming microphone array, so everything keeps pointing at the speaker. Thus, if you are in a video conference in a lecture hall and the talker paces back and forth at the chalkboard or something, the camera and microphone follow.

Nebeker:

That technique of sound location was started in World War I to locate artillery, taking the very slight differences of arrival times.

Flanagan:

Right. It's the same principle as Loran, just a different medium. Except we take all of the time-delay-of-arrival estimates from the two quads and then do a gradient search. It finds the XYZ position by minimizing the squared differences to all of the delay values. And that runs in real time on a single DSP32C chip hosted by a PC, such as a Pentium.
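
[Editor's note: a sketch of locating a talker from time-delay-of-arrival estimates: predict the delay for each microphone pair at candidate XYZ points and keep the point that best matches the measurements in the least-squares sense. A coarse grid search stands in here for the real-time gradient search described above; the geometry and delays are invented.]

```python
# Least-squares source location from time-delay-of-arrival (TDOA) estimates.
# A coarse grid search stands in for the real-time gradient search; the
# microphone geometry, room size, and delays below are invented.
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s

def pair_tdoas(point, mic_pairs):
    """Predicted TDOA (seconds) for each (mic_a, mic_b) pair, source at 'point'."""
    p = np.asarray(point, dtype=float)
    return np.array([(np.linalg.norm(p - a) - np.linalg.norm(p - b)) / SPEED_OF_SOUND
                     for a, b in mic_pairs])

def locate(measured, mic_pairs, room=(3.0, 3.0, 3.0), step=0.2):
    """Return the grid point whose predicted TDOAs best match 'measured'."""
    best_point, best_cost = None, np.inf
    for x in np.arange(0.0, room[0] + 1e-9, step):
        for y in np.arange(0.0, room[1] + 1e-9, step):
            for z in np.arange(0.0, room[2] + 1e-9, step):
                cost = np.sum((pair_tdoas((x, y, z), mic_pairs) - measured) ** 2)
                if cost < best_cost:
                    best_point, best_cost = (x, y, z), cost
    return best_point

# Two four-microphone 'quads' on orthogonal walls; six pairs per quad.
quad1 = [np.array(p, float) for p in [(0, 0.5, 0.5), (0, 2.5, 0.5), (0, 0.5, 2.5), (0, 2.5, 2.5)]]
quad2 = [np.array(p, float) for p in [(0.5, 0, 0.5), (2.5, 0, 0.5), (0.5, 0, 2.5), (2.5, 0, 2.5)]]
pairs = [(a, b) for quad in (quad1, quad2)
         for i, a in enumerate(quad) for b in quad[i + 1:]]
true_source = (2.0, 1.8, 1.2)
measured = pair_tdoas(true_source, pairs)
print("estimated source:", tuple(round(v, 2) for v in locate(measured, pairs)))
```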

Nebeker:

Could you explain the hands-free speech recognition?

Flanagan:

Most speech recognizers presently are trained on close-talking speech, with a microphone positioned right at the talker’s mouth. It takes hours and hours to train the statistical model. The technique used is called the hidden Markov model.

If you want to use that speech recognizer from some other location, say standing back from the microphone, from your speaker phone or cellular phone in a car, the recognition performance is diminished. The recognition technique is not robust enough. So we have one project for DARPA where we're using a microphone array and a neural-net preprocessor that learns the multipath distortion in the room by comparison with a very short, initial segment of close-talking speech, and then you can operate the speech recognizer reliably at a distance from the microphone array.
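
[Editor's note: a schematic of the preprocessor idea: from a short segment where both the distant-array signal and a close-talking reference are available, learn a compensation mapping and apply it to later distant speech before recognition. A short FIR filter fit by least squares stands in here for the neural-network preprocessor; the 'room' and signals are synthetic.]

```python
# Learn a compensation filter from a short calibration segment where both the
# distant (reverberated) signal and a close-talking reference are available,
# then apply it to later distant speech.  A least-squares FIR filter stands in
# for the neural-net preprocessor; the room response and signals are synthetic.
import numpy as np

def fit_compensator(distant, close, taps=32):
    """Least-squares FIR filter mapping the distant signal onto the close one."""
    rows = [distant[n - taps + 1:n + 1][::-1] for n in range(taps - 1, len(distant))]
    X = np.array(rows)
    y = close[taps - 1:len(distant)]
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coeffs

def apply_compensator(signal, coeffs):
    return np.convolve(signal, coeffs, mode="full")[:len(signal)]

rng = np.random.default_rng(0)
close_talk = rng.standard_normal(4000)                 # stand-in "clean" speech
room = np.zeros(20)
room[0], room[7], room[15] = 1.0, 0.5, 0.3             # toy multipath response
distant = np.convolve(close_talk, room, mode="full")[:len(close_talk)]

coeffs = fit_compensator(distant[:2000], close_talk[:2000])   # short calibration
restored = apply_compensator(distant, coeffs)
err_before = float(np.mean((distant - close_talk) ** 2))
err_after = float(np.mean((restored - close_talk) ** 2))
print("mean-square error before/after:", round(err_before, 3), round(err_after, 3))
```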

Nebeker:

Thanks very much for the interview.