Oral-History:Azriel Rosenfeld

From ETHW

About Azriel Rosenfeld


Azriel Rosenfeld is a pioneer in the fields of signal processing and computer vision, as well as a founder of the field of digital geometry. A self-described theoretician, Rosenfeld originally specialized in physics before changing emphasis to concentrate on discrete mathematics. At the University of Maryland he was on the committee that initially designed degree programs in computer science and was the first professor at the university to teach courses on pattern recognition, artificial intelligence, and image processing. Rosenfeld served as president of the International Association for Pattern Recognition and ran the first International Conference on Computer Vision in 1987, which was sponsored by IEEE. In addition to his work as a mathematician, scientist, and university professor, Rosenfeld is also an ordained rabbi who recently published a book entitled The World According to Halachah: An Introduction to Jewish Law, which has been developed into an internet course. The interview ends with Rosenfeld’s predictions about the future of computer vision systems into the next century.

About the Interview

PROFESSOR AZRIEL ROSENFELD: An Interview Conducted by Michael Geselowitz, IEEE History Center, 12 July 1998

Interview #344 for the IEEE History Center, The Institute of Electrical and Electronics Engineers, Inc.

Copyright Statement

This manuscript is being made available for research purposes only. All literary rights in the manuscript, including the right to publish, are reserved to the IEEE History Center. No part of the manuscript may be quoted for publication without the written permission of the Director of IEEE History Center.

Request for permission to quote for publication should be addressed to the IEEE History Center Oral History Program, IEEE History Center, 445 Hoes Lane, Piscataway, NJ 08854 USA or ieee-history@ieee.org. It should include identification of the specific passages to be quoted, anticipated use of the passages, and identification of the user.

It is recommended that this oral history be cited as follows:

Azriel Rosenfeld, an oral history conducted in 1998 by Michael Geselowitz, IEEE History Center, Piscataway, NJ, USA.

Interview

Interview: Professor Azriel Rosenfeld

Interviewer: Michael Geselowitz

Date: 12 July 1998, Sunday

Place: University of Maryland, College Park

History of signal processing, computer vision

Geselowitz:

Have you seen the booklet that the IEEE History Center did for the Signal Processing Society?

Rosenfeld:

I’m not sure.

Geselowitz:

I should have brought it. I’ll arrange to send you a copy. Basically they formed a panel, and our Senior Researcher, Frederik Nebeker, researched the history of both the Society and the technology. The Society named some prominent society members who were also prominent engineers in the signal processing field. They were interviewed, their interviews were incorporated, and Nebeker also wrote a monograph on signal processing based on that same research. It occurred to him after the dust had already cleared that there were areas of signal processing, or tangential to signal processing, that were not necessarily directly the province of the Signal Processing Society, and that there might be some very important individuals who would have something to say about signal processing who had not been included.

Rosenfeld:

Yes. In the special issue of Signal Processing Magazine there was a huge article on the history of image processing/multi-dimensional signal processing. But I noticed that other topics like sonar are being treated in subsequent issues. This is evidently a multi-part effort.

Geselowitz:

Right. I think it’s supposed to be the whole year.

Rosenfeld:

The ASSP society's multi-dimensional signal processing committee got started in the ’80s. Mike Ekstrom, it seems to me, was the lead chair of it at the Schlumberger-Doll Research Center. I was a member of that for a while and I even chaired their third workshop, which we held in Leesburg, Virginia.

Geselowitz:

So that was at Schlumberger. Were a lot of the geophysics people involved? Like Enders Robinson?

Rosenfeld:

No. This very broadly covered multi-dimensional signal processing. They had a number of specialties, and I’m not sure I can remember what all of them were. But the truth of the matter is, multi-dimensional signal processing is (at least now) peripheral to my interests so I’m really not staying plugged into it.

Geselowitz:

All right, and could you define your field?

Rosenfeld:


Audio File
MP3 Audio
(344 - rosenfeld - clip 1.mp3)


What I can do is recite the beginnings of the survey paper that I just handed you, which is going to be submitted very shortly to the IEEE Annals of the History of Computing. It purports to be the history of the field that is now called Computer Vision, but which in the early days was not called any such thing. It begins as follows. When computers began to become available people rapidly realized that you could use them to process images. This immediately split into two thrusts, one of which stays with signal processing, while the other breaks away. The one that stays with signal processing is where you can process images (meaning you do things to them and the result is a better image in some sense—more compact, prettier, whatever the buzz words are). But on the other hand, a vast amount of effort which also started in the mid ’50s as computers became available was the game of trying to extract other kinds of information from images. This was either a description problem or a classification problem as a special case of description. And it cut across a large number of application areas. Now the people who went into image processing, the image to image stuff, they had the enormous advantage that an image is just a two-dimensional signal, possibly discrete. Anyhow, a two-dimensional generalization of signal processing. The generalization is not entirely straightforward. There are no doubt significant complications that take place in 2-D that just weren’t there in 1-D, but there was a tremendous amount of commonality. So the image processing people were able to start right off with a major amount of underlying preexisting theory, which they just had to extend from one dimension to two dimensions, and perhaps from continuous two-dimensional signals to discretized ones since our images are usually digitized. The image analysis people, that’s what one would call the other half of the field, they’re the people that wanted to get something else out of images, not to just turn them into other images. Those people had no such luck. They had to feel their own way. They were motivated by a number of major application areas, which I will review briefly. In the initial days you could probably safely say that each application area was a culture unto itself. Nobody, or hardly anybody, looked across several of them to ask what is common to all of this? Is there an image analysis paradigm? By the mid ’60s it appeared, at least to me, and probably to others as well, that this was a field, and one could work at some kind of paradigm which is common to all the various application areas.

Image analysis applications and theory

Rosenfeld:

The application areas are so different that at first glance you wouldn’t think there is anything common to them. Let me touch on three or four of them. One is what has all these years been called optical character recognition. Not because it’s done optically, but to distinguish it from something called magnetic ink character recognition where the sensing is totally different. Optical character recognition simply means you have characters that you can see. The aim is to scan this page image and read the characters. Why was this considered in the ’50s to be a tremendously important thing to do? Well, for the reason that they wanted to abolish key punching. Instead of trying to input data into a computer by keying it in, if it already exists in hard copy form, human-readable form, it would be so nice to attempt to scan it in directly. So OCR, optical character recognition, was one of the early, mid 1950s areas.

There were two others that I know of that were quite popular in the ’50s. One, which you find mostly in reports that are very hard to lay hands on any more, was computer analysis of aerial photographs to determine what kinds of cultural features were present on them—buildings, roads, bridges, things like that. Why was that an interesting problem? Well, simply because the Defense Department was interested in it. And the Defense Department, as always, had a lot of money, so there naturally then began to be people studying this problem.

Geselowitz:

Is this what we might call remote sensing?

Rosenfeld:

No, not yet, because the satellites didn’t go up until the end of the ’50s or the beginning of the ’60s, which is when the term remote sensing came into existence. Someone at the Office of Naval Research Geography Division coined the term remote sensing of the environment. I do not remember when the remote sensing of environment conferences got started, but as far as I know they’re still going on and they’ve been held for decades. That area really blossomed when the satellites went up. With aerial photos there was much less immediate interest on the part of the mapping community. How did they map the earth? They flew and collected stereo pairs and plotted terrain relief, and then it used to take somebody a month, working on a 9 x 9 aerial photograph, to extract all the so-called cultural features that were present. This was tedious.

And although the Defense Mapping Agency finally did get interested in going digital, and they have indeed gone digital and are now part of the National Imagery and Mapping Agency, the trouble is that they also tend to use so-called National Imagery which is so highly classified that you are not even allowed to say what the channels are that you need to access to see it. Aerial photos were far more innocuous. In the ’50s they were flying airplanes, the airplanes flew aerial cameras, the cameras took pictures, and the question was, what can you get out of the pictures automatically? Just as early was the microscopy community, the people who were looking at chromosome spreads, blood smears, pap smears. In the United States alone, there must be on the order of hundreds of millions of these images looked at a year, because almost everybody has at least one such image looked at every year. If it isn’t a microscope smear of some kind, it’s a chest x-ray or a dental x-ray. Strangely the radiology community didn’t get turned onto this stuff until the ’60s and there was very little literature until the ’70s. But in microscopy there was a lot of activity as far back as the ’50s.

The National Institutes of Health, I dare say it was another rich agency, began to fund black boxes that were supposed to scan microscope slides and count cells, with limited success. But today there are companies making a living doing that sort of thing. It still needs some post-editing, but it’s cost effective. With the biomedical images it’s perhaps a little trickier because you can’t afford certain types of errors. Those are some of the big areas. There is another one which really blossomed in the ’60s (and probably even earlier), but seemed to taper off rapidly in the ’70s. Sometime around the middle of the century various devices were invented to allow you to track nuclear particles. The first one was the Wilson cloud chamber, but it was superseded some years later by the so-called bubble chamber (I forget who invented that one) and the spark chamber. So there were a number of devices out there such that if you allowed sub-atomic particles to pass through them, they would leave tracks. And what could you tell by looking at them? The particles were invisible, but they left streams of bubbles trailing behind them. If you put the chamber in a magnetic field, then the trajectory of the particle has a certain curvature that tells you something about its mass and its charge. And if a particle track suddenly changes, say it suddenly appears to have bounced off of something, you know that an "event" has taken place.

Many millions of bubble chamber pictures were being processed, and every major physics establishment did its own thing trying to develop an automatic bubble chamber picture scanning system. This application area is again quite different. You’re looking at particle tracks. You probably have to do it on stereo pairs of images because the tracks are three-dimensional. It’s quite different from aerial photographs or microscope slides or document pages. So the first question is, is there anything in common to such diverse motivations for trying to analyze images? Do these problems possibly have any common threads?

Geselowitz:

So you are saying you were one of the people who began to look at them to see if there was a common thread?

Rosenfeld:

Let’s just say that since I was trained as a mathematician, I was looking for what you might call a general theory. Being a mathematician I had to painfully learn that there was already a general theory of image processing. I managed to acquire some rudiments of two-dimensional signal processing theory, but my real interest was in the other side of the business: How do you extract information from images? Well, what kind of information might there be in images? It really is quite varied, but let’s do it in terms of the four or five kinds of examples that I just described. On a printed page you might want to read the characters: once you can successfully extract and recognize them, you can represent them by ASCII codes. So the task is to extract them and recognize them. On a microscope slide, you want to count various kinds of cells. You’re looking at a blood smear that contains lots of red cells, platelets, occasional white cells and so on, and you want to scan the smear and count how many of each kind there are, because that’s what the hematologist does, or the cyto-technician.

Geselowitz:

But you also want to look for irregularities?

Rosenfeld:

Conceivably. Much of the information you want to get is straight counting of things. But you’re undoubtedly right. If the red blood cells look crescent-shaped, you’re in trouble, so indeed some shape analysis is needed. So that was a second example. Let’s just stop at these two examples and ask what they have in common. First, you have to extract certain pieces of the image. In the case of the piece of paper, the document, the pieces happen to have ink on them and they are funny shaped; they look like characters of some alphabet that we can read. In the blood smear case, the image is more cluttered, and the pieces overlap. It's considerably more complicated, but your goal is still to find certain pieces, recognize them, describe their shapes. If it’s an x-ray, maybe you’re looking for a tumor. In the bubble track case too, you need to extract the tracks and measure their geometric properties. Notice that common to all this is a step that says there are interesting parts in this image. The image is, after all, just a big square, 512 x 512 if it’s the standard binary-number, TV kind of resolution. The image is 512 x 512 pixels.

Geselowitz:

The pixels are either off or on individually.

Rosenfeld:


Audio File
MP3 Audio
(344 - rosenfeld - clip 2.mp3)


Yes, but what you’re looking for is certain pieces of the image in which the pixels are meaningful in some way. A set of pixels might be dark, and you want to look at the shape of that set and see if it resembles some character. And similarly for the other situations. Of course, I said "dark," which applies to the document; it’s not quite as easy for other situations. Curiously, the first theoretical treatment of extracting pixels because they had characteristic ranges of gray levels was done for a cytology application. The character recognition people were doing it already, but the first one who raised it as a scientific question and posed it as "what's the theory of what we’re doing" was somebody in the cytology business. When you look at stained cells on a microscope slide, typically the background is very light because it doesn’t pick up any stain. The cell body contrasts somewhat with the background, and there are certain nucleotides that pick up the stain and become very dark. So what you end up with is that the nucleus is quite dark, the cell body is medium, and the background is light. You end up with an image that can in fact be sub-divided into parts based on the concept that the parts have different ranges of pixel gray levels. That was a piece of science done in the late ’50s. It illustrates the fact that the same piece of science applies to different domains. I’m not saying how a tumor looks different from the rest of the lung in an x-ray, because radiology has its own set of criteria. Certainly the bubble tracks are different from the background, so extracting them is once again a simple matter of selecting pixels that have a characteristic gray level range. The common thread that leads to a paradigm appears to be: one, extract meaningful parts from the image; two, describe the parts; and then (possibly) describe relationships among the parts. This gives you some sort of structural description of the image content. That is the classic paradigm.
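(As a concrete illustration of the gray-level-range idea Rosenfeld describes, the following is a minimal Python sketch of segmenting an image into background, cell body, and nucleus classes by characteristic gray levels. The threshold values and the synthetic image are purely illustrative assumptions, not taken from any system he mentions.)

# Minimal sketch of segmentation by characteristic gray-level ranges,
# in the spirit of the cytology example: light background, medium cell
# body, dark nucleus. Threshold values are illustrative only.
import numpy as np

def segment_by_gray_level(image, t_dark=80, t_light=180):
    """Label each pixel of an 8-bit grayscale image as background (0),
    cell body (1), or nucleus (2) according to its gray-level range."""
    labels = np.empty(image.shape, dtype=np.uint8)
    labels[image >= t_light] = 0                        # light pixels -> background
    labels[(image >= t_dark) & (image < t_light)] = 1   # medium pixels -> cell body
    labels[image < t_dark] = 2                          # dark pixels -> nucleus
    return labels

# Example: a synthetic 512 x 512 image with a medium "cell" and a dark "nucleus".
img = np.full((512, 512), 220, dtype=np.uint8)   # light background
img[200:300, 200:300] = 130                      # cell body
img[240:260, 240:260] = 40                       # nucleus
seg = segment_by_gray_level(img)
print(np.bincount(seg.ravel()))                  # pixel counts per class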

This paradigm makes sense only if you assume that the image is a faithful representation of what it’s an image of. In the case of a document, the document is two-dimensional to begin with and so is the image, so the image is "isomorphic" to the document itself. If you can find meaningful parts in the image they correspond one for one with the parts of the document. The same is true about a microscope slide, because microscopes have something called depth of field that is very shallow, so in fact a microscope image is referred to as an optical section because it’s essentially a slice through something. Depending on how the microscope is focused, you can select which slice you’re looking at. But again, the slice is two-dimensional. So the image that you collect by pointing a TV down a microscope tube is a faithful representation of the section. Of course there is a lot of out of focus stuff that’s also corrupting it, but that’s a different issue. At least the in-focus section is something you can hope to analyze from a single image. Now, take an aerial photograph. If you are flying high enough, terrain relief becomes negligible. If you go up to satellite altitudes, all the more so. An awful lot of analysis can be done ignoring terrain relief, and so you’re essentially looking at a two-dimensional scene and you’re getting a two-dimensional image of it. Another thing about the aerial photo is that you also know which way your camera was pointed, for instance, straight down, so you not only have something that is fairly free of relief, but you are also looking at it from a known viewpoint. The same is true for most x-rays. You know where the x-ray source was and where the film was, and so you know which projection of the tissue is casting its shadow onto the x-ray film. So you’re looking at the tissue from a known viewpoint, even though the tissue itself is three-dimensional.

Geselowitz:

Which would be the same with the cloud chamber. It is three-dimensional, but you know exactly from what angle you’re capturing the image and can make some calculations.

Rosenfeld:

True. But in the case of the cloud chamber there is also the fact that you supposedly know which way the beam of particles is supposed to be going. It’s clear that there are other domains in which you cannot afford this luxury of assuming either that the Earth is flat, that the scene is two dimensional, or that you know from which direction you’re looking at it. In the ’60s the artificial intelligence community got interested in doing robot vision. That was one of their targets, thanks to decisions made by the fathers of AI. So they immediately posed the question “If I have a robot manipulator and I give it a camera and it’s looking at objects in a workspace, how can it make sense of what it sees so that it can tell the manipulator where to reach and where to grab and so forth?” Or, if you’ve got a mobile robot, and those are almost as old, and it’s cruising around on the floor of your laboratory and it sees various things, you want to get it to make sense out of them, detect them somehow, steer around them. What was the classic example? The robot cleaning machine, which has to distinguish between things that should be swept up, things that should be picked up, and things that should be steered around. The old joke is that this resembles the advice that they used to give to recruits in the Army: if it moves, salute it; if it doesn’t move, pick it up; and if it’s too big to pick up, paint it. You notice that these are almost the same three cases. It’s unfortunately not as simple as that. Anyway, when the AI community came along it was no longer possible to make this very simple two dimensional paradigm apply. Actually, if you’ve got my paper there across the table, you might flip to the back page, which has some figures.

Figure 3 shows the paradigm that we’re describing: the image comes in and gets segmented into parts. The parts get described by properties and perhaps relations. That goes into detail that I don’t know if you want to hear. That is the 2-D image analysis paradigm. The 3-D scene analysis paradigm is considerably more complicated. Before I move on to it, let me recapitulate by saying that in the 2-D image analysis paradigm, what we discovered was the important topic called segmentation. Then there began to be all sorts of specializations within that topic. What are the methods used for segmenting images? What right do you have to believe that they’re going to work? I could give a mini-lecture on the history of the early days of segmenting images. Given that you have extracted parts from an image, now you can say all sorts of things about them. What kinds of things? That is, what is the taxonomy of image part properties? That leads to another lecture.

Picture Processing by Computer, 1969

Rosenfeld:

The earliest book on image processing and analysis happens to be by me. It’s called Picture Processing by Computer. It appeared in 1969. It has also got image processing in it, a topic that I probably had no right to write about.

Geselowitz:

I also noticed that that’s the same year that Gold & Rader published Digital Signal Processing.

Rosenfeld:

Oh well, but after all that’s signal processing.

Geselowitz:

Right. It’s the same year though.

Rosenfeld:

The fact is that I beat Harry Andrews by six months or so, because six months or so later, with the same publisher in fact, he published his image processing book, which covers only image processing and coding but not image analysis. So I stuck my neck out and invaded his turf somewhat. I probably mucked it up too. But in my book, there is a chapter on image property taxonomies and descriptions of images, and of course on segmentation. I don’t know who invented the word segmentation. I notice in the old literature that there are two or three application-oriented papers that use the word segmentation in their titles. Probably they were in the character recognition area. It seemed to me at that time that segmentation was an unfortunate word because it reminded me of line segment. It sounded very one dimensional. In fact, curiously, the signal processing people who deal with time signals, which are one-dimensional, don’t use the term segmentation. An attempt was made at one point to get them interested in it, but they have their own analysis tools, which tend to be things like Fourier analysis. They graduated from that to still hairier sets of basis functions and they tend to ignore the time domain. We don’t ignore the space domain because that is what we’re looking at and it’s a finite piece of space.

Two-dimensional Fourier analysis has its limitations because you’re looking at a space-bounded signal. In any case, I've drifted off from the point I was trying to make. But I didn’t like the word segmentation because it smelled one dimensional to me. At the last minute, in the galley proofs in fact, in the Picture Processing by Computer book, I re-titled one of the chapters segmentation, and of course that has been the standard word ever since. But I didn’t invent it. It was used in a couple of papers that appeared in the ’60s. So segmentation is a fundamental topic. But when you go to computer vision kinds of situations, which involve scenes that are not very two dimensional, segmentation is highly questionable. I’ll warm up in a few minutes and start talking about that.

Geselowitz:

Would this then be a good place to go back and talk about your background? Your earlier background, because you mentioned a couple of times about being a mathematician. And so I was wondering if you could tell me about your early education. Before you got to the university, what you did when you first got to the university, and how you drifted into this field.

Educational background; physics, mathematics, and digital geometry

Rosenfeld:

Well, as a matter of fact, as an undergraduate I was a physics major. I decided that physics wasn’t to my taste. It may have been the fact that I took all sorts of fairly advanced physics courses, like Methods of Mathematical Physics, when I barely knew calculus. And so it really raked me over. I was a physics major, but I ended up taking more mathematics than physics, and I sort of liked it better. So when I went to graduate school I switched to mathematics.

On the other hand, antithetical to the background for mathematical physics, I was more interested in things like abstract algebra. So I was more interested in what nowadays is called discrete mathematics rather than the analytical kind. In fact, I guess I’ve been on a crusade for 30 or 40 years in the image business to argue for the fact that since our images are digital, we ought to be thinking in terms of the analysis of discrete arrays of data rather than pretend that the images are analytic two-dimensional signals. The question is do our digital images simply serve as good approximations to continuous signals? In some situations no, that is absolutely not the case. In some situations we are interested in things in our images which are small or thin, which in other words are comparable in size to the pixels. And treating properties of such digital objects as though they were approximations to properties of continuous objects is just not adequate.

I’m a founder of a field which is sometimes called Digital Geometry, which is the study of geometric properties of subsets of digital images. This isn’t quite as prominent as computational geometry, which has a totally different slant on discrete geometric things. Computational geometry tends to still deal with points and lines. It deals with finite numbers of them. But it pretends that they are still points and lines in the mathematical sense. It still pretends that two lines intersect at a point. You notice that in a digital image a line is a stream of pixels. It’s one pixel thick; that’s as thin as it can get. How do two lines intersect? Not necessarily in a pixel. They can actually cross and miss each other if you use the right kind of connectivity for the lines, or the wrong kind. Otherwise, in general if two lines intersect they run along together for a while and then separate. Think about it. If they meet head on then they intersect in a pixel. If they make a more oblique angle, they don’t intersect in a pixel. So understanding that kind of thing becomes essential. The computational geometry people still pretend that they’re dealing with finite sets of points and lines in the continuous Euclidean plane. And although they do worry occasionally about the effects of limited precision on their computations, they still think that their computations can be made as precise as necessary. In the digital geometry business, which some people got interested in, you worry about how to study subsets of digital images. Or in other words, you study the geometry of lattice points in the plane, if you want to be that fancy. In any event, that was a tangent from what my own background was. I was kind of interested in discrete mathematics. I wrote my thesis in a very odd area. I can’t understand my thesis anymore, so there is little point in trying to persuade you of what the area was. Algebraic geometry is the study of solutions to sets of algebraic equations. Although these are algebraic equations with finite numbers of terms and so on, right, it’s sums of powers of variables. But the space in which these things are operating is still continuous Euclidean space. It gets even worse when you go to the field that I was into, which was called differential algebra, which is the study of the sets of solutions of algebraic differential equations. An algebraic differential equation is like an algebraic equation, but instead of the variables being x, y, z, the variables are x, x', x'', and so on.
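(The following small Python sketch illustrates the digital-geometry point above: two 8-connected digital line segments whose continuous counterparts cross can fail to share any pixel. Bresenham's algorithm is used here only as a convenient, standard way to digitize the segments; the example is an illustration, not anything drawn from Rosenfeld's own work.)

# Two digital lines that cross geometrically but share no pixel.
def bresenham(x0, y0, x1, y1):
    """Return the pixels of the digital (8-connected) line segment from (x0, y0) to (x1, y1)."""
    dx, dy = abs(x1 - x0), -abs(y1 - y0)
    sx = 1 if x0 < x1 else -1
    sy = 1 if y0 < y1 else -1
    err = dx + dy
    pixels = []
    while True:
        pixels.append((x0, y0))
        if (x0, y0) == (x1, y1):
            break
        e2 = 2 * err
        if e2 >= dy:
            err += dy
            x0 += sx
        if e2 <= dx:
            err += dx
            y0 += sy
    return pixels

a = bresenham(0, 0, 5, 5)   # one diagonal: (0,0), (1,1), ..., (5,5)
b = bresenham(0, 5, 5, 0)   # the other diagonal: (0,5), (1,4), ..., (5,0)
print(set(a) & set(b))      # empty set: the lines cross at (2.5, 2.5) yet share no pixel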

Geselowitz:

The xdy and the xdz?

Rosenfeld:

Well, you have the variables and their derivatives. So you’ve got differential equations, but of a nice kind. You don’t allow something like sine of y. It’s an algebraic differential equation and you can say some things about the kinds of functions that are solutions of such things. But as I said, I can’t understand my thesis anymore.

Jewish Studies education and publication

Geselowitz:

I’ve observed two things about your resume. First of all, that you seem to have pursued the study of Hebrew and Hebraic literature in parallel with the physics math story, and that you got smicha in 1952. That’s one thing I would like you to maybe remark on. And the other thing is that despite being a physicist turned mathematician, your first job seems to have been as what might be called an engineer.

Rosenfeld:

Well, if you’re working out in the world, you have to be called an engineer; very few companies can afford the luxury of employing physicists and mathematicians. Let me touch on all the things that you just mentioned. Yes, in parallel with my education in physics and mathematics I was also educated in Jewish studies. And if you will read my resume carefully you’ll also discover that I have another doctorate in Rabbinical Literature. So aside from Rabbinical Ordination, I also was extremely interested in the historical sides of the business. It happens to be a field that I continue to work in. So if you count my embarrassing number of publications, you’ll discover approximately 5% of them are on Jewish subjects, but you’ve got to search for them. Near the beginning they were probably a higher percentage.

Geselowitz:

Well, I noticed just last year you seem to have had two major publications. One was an article in that book HAL's Legacy that Stork edited, and the other was The World of Halachah, An Introduction to Jewish Law.

Rosenfeld:

That is actually a course, which is being given on the internet.

Geselowitz:

On the Project Genesis part of the internet.

Rosenfeld:

That’s right. Project Genesis is run by my son-in-law. I am now one of his faculty. But the fact is that this Halachah overview thing that he is offering, serving up a chapter a week, is a thing that I wrote for one of my kids a couple of years before his Bar Mitzvah. So it was sitting around. In fact, how we got it onto the web was an interesting odyssey: we scanned it in. Speaking of optical character recognition, we scanned it in and post-edited it. I still occasionally find a typo that we missed. But I can guarantee that the OCR was pretty good. And it came from plain ordinary electric typewriter typescript. Probably a xerox copy. No, I think we actually used the original for scanning in. That was an interesting adventure which occupied a few months doing the scanning and post-editing and marking up for HTML and so on.

Geselowitz:

We actually use it a great deal. Surprising in this day and age, but somehow it’s easier for someone to send a fax and for us to scan it than to send the electronic file. Thanks, I don’t want to blame the computer industry for lack of standards or what have you.

Rosenfeld:

That’s interesting. At least once a week I get a piece of e-mail that we really scratch our heads over how to turn into something readable because people have their favorite software packages that they encode things in. This too shall pass. It used to be that just the process of getting the stuff in and out was so unreliable, but it’s become pretty stable. But now we have a tremendous cacophony of different languages being spoken out there, and I guess eventually that will settle down so that automatic translation, so to speak, will take place.

Early employment

Fairchild

Geselowitz:

The other part of my question was, do you want to say a few remarks about what you did at Fairchild and so forth, did it have anything to do with your later career, and if so how?

Rosenfeld:

Well, it seemed right not to go to school forever. Although it seemed for a while as though I was just going to earn degree after degree. But it seemed like a good idea to go out and work for a living. So the question was, work in what? In the first job I happened to find, I was basically working as an optical engineer because I had a course in optics as an undergraduate. I was tinkering with ultrasonic light modulators. I was doing lens design. This was before software for lens design was available, so you had to sit and pound it out on a desk calculator. And I also did some analysis of fancy mirror systems.

Geselowitz:

This was at Fairchild in the ’50s?

Ford Instrument Company; automatic stereo plotter patent

Rosenfeld:

Exactly. I was there for a couple of years. I moved on from there to Ford Instrument Company; I doubt that it exists anymore. It was a division of Sperry-Rand, and I suppose that Sperry-Rand doesn’t exist anymore. That’s the way it goes. But I took the job at Ford Instrument again because I had these optics credentials, and I was involved in the design of optical radar simulators. That’s actually what began to get me into the present field, because I moved from that into stuff having to do with aerial reconnaissance interpretation. At one point I had a couple of photo interpreters working for me and we were making photo interpretation keys for various kinds of images of non-photographic natures. Some of the stuff was even classified. The last year I was there I got into Doppler inertial navigation, which didn’t enormously interest me. So I guess for one reason or another I moved away from there and joined what became the electronics division of the Budd Company. Budd bought out an outfit called Lewyt Manufacturing. Lewyt used to make vacuum cleaners, but they also had a defense electronics division. Lewyt Electronics was into this and that; one of their areas was weather radar. I’d have to look at my files to see what I was doing there the first couple of years. But the research manager there hired me basically as a general purpose guy who knew a lot of mathematics and physics and so on, and who would therefore be valuable in various aspects of things including proposal writing. After a year he left and I inherited his job, so I became the research manager there. Then I was a little more free to go in directions that I was interested in. The company was making displays. Well, displays are just the opposite end of image processing. Image display is the output end, if you wish. It made it justifiable for me to look into various things having to do with image display and with processing of images before they get displayed.

Geselowitz:

I see it was 1960 when you got promoted to research manager. Right around the time you arrived, I see that you received your first patent, which was for an automatic stereo plotter.

Rosenfeld:

That was a Ford Instrument patent.

Geselowitz:

It came out of Ford Instrument.

Rosenfeld:

I had a couple of patents there. Ford was interested in patenting things. It is amusing to have a 1950s patent on an automatic stereo plotter, because automatic stereo is still very much an area of great interest to the computer vision community, except that nowadays they can do it at frame rates. And of course in our days, well, the photogrammetry community was doing automatic stereo very slowly. Basically you looked at a stereo pair through binocular eyepieces and tried to track the terrain relief using what they used to call a floating dot. You moved this floating dot within the stereo model, which you could see through the binocular eyepieces, and tried to move it so that it stays right on the surface, and in that way plot terrain contours. That’s the sort of thing that in those days people were trying to automate. So in fact, in the ’40s, ’50s, ’60s, there were automated stereo plotting systems out there and companies that specialized in the business. The AI community didn’t get interested in this stuff until the ’70s. I was able to pin down who wrote the first thesis on automatic stereo. In the AI community it was Marsha Jo Hannah at Stanford University, and I don’t think her thesis was ever even published. But it’s on record as Stanford AI Memo number so-and-so.

Pattern recognition and image modeling research, 1960s-1970s

Geselowitz:

So that was in the early ’60s when you started paying attention to more of this sort of mathematical aspect.

Rosenfeld:

Since I was into navigation a bit it occurred to me that it ought to be possible to do stellar navigation by recognizing star patterns. That is a simple point pattern recognition problem. So I had some papers on that sort of thing.

Geselowitz:

I see in 1962 you had a paper entitled "Automatic Recognition of Basic Terrain Types."

Rosenfeld:

That’s the other side of the coin. That’s looking down instead of up and there I was getting quite interested in that.

I was thinking that they were slighting the background, looking only for the targets. In the target recognition business, which continues to be a controversial and highly pursued area, there has been considerable progress using techniques of 3-D computer vision. There has been considerable progress in modeling the target and what it’s going to look like on the image. But we are far less well established in understanding what the background looks like. This can lead us into the next big area of discussion. As I mentioned, toward the end of the ’70s I ran the first workshop on image modeling. That’s a reasonable topic these days, and there have been books on it. The first one was the proceedings of my workshop, which had a whole flock of interesting people coming around and giving talks.

Geselowitz:

And where was the workshop?

Rosenfeld:

Chicago. It was attached to one of the Image Processing and Pattern Recognition meetings. It was a satellite of a conference.

Geselowitz:

What year was that?

Rosenfeld:

1979. But the fact is, image modeling has not provided, and maybe never will provide, sufficient understanding of what backgrounds look like in the kinds of images in which we’re trying to find targets. The target recognition community uses this expressive term, “clutter.” The image contains clutter. Clutter is anything that isn’t a target. Now if you’re going to tell me that the best way to model that is IID Gaussian, that is clearly insufficient. You’re not going to be able to use simplistic, even two-dimensional Markov models and hope that you can characterize what terrain looks like. And if you don’t like the example of terrain, go to any one of the other domains, like the chest x-ray with the tumor in it. You know, with man-made objects it’s not hard to model them pretty thoroughly. In fact in industrial computer vision it’s just been a matter of how much computer power you are willing to throw at the problem, because they still want it to cost pennies, and it’ll be another few years before what they need costs pennies. But after all, you know what the objects look like, you know all about their geometry, all about their surface finish, what kind of work space they appear in. If there is any difficulty it’s your fault. You really have a complete understanding of the three-dimensional nature of the scene. The surface properties, the object geometry.

In many of the domains in which we are still naively trying to think that we can understand images, we’re dealing with classes of problems that are not mathematically well defined. We don’t have good models for tumors in chest x-rays. We don’t have good models for the lung and the tumor. The literature, which studies these things, doesn’t attempt to formulate such models. They would be of a highly specialized nature. You would have to understand the microstructure of the lung tissue, which is pretty complicated tissue. You'll need a model for how the tumor grows. And in fact tumor growth depends an awful lot on the environment in which the tumor grows, so the growth of the tumor itself is influenced by the microstructure of tissue in which it’s growing. It gets complicated and highly specialized, and it’s perhaps beyond the state of the art at the moment to teach a tumor recognizer as much as it would need to know in order to become an optimal tumor detector.

Geselowitz:

I guess an opposite approach would be to take an AI approach, where hypothetically, if you could develop a recognizer that could learn in some way the way that a novice lab technician learns, then you could just let it observe a lot of tumors and categorize them.

Rosenfeld:

I heard an interesting story just last month. I can tell you what the story was and what conclusion I drew from it. Someone ran a set of target recognition algorithms on a very large set of images. I forget whether this was visible or infrared or what. They ran the algorithms on tens of thousands of images where apparently the truth was known. It’s called "ground truth" in this business. The truth was known about what was there on the ground, which images had targets and which ones didn’t. What they found was the so-called model-based target recognizers consistently performed not as well as several of the neural net based algorithms.

The reason for this is possibly the following. When you train the neural net, and it had better be a big enough neural net, when you train it on tens of thousands of images, it actually begins to learn what the background looks like. I was saying the background does not satisfy any mathematically or statistically simple models, and those are the only kinds that we like to use in our papers with all the equations in them. The neural net doesn’t know from that, and it is apparently quite capable of learning grotesque models, which of course, who knows if they generalize to the next 70,000 images. But it is apparently quite capable of learning grotesque background models, and maybe that’s what you need. Then on the other hand I can point to something which I called in one of my papers the Principle of Diminishing Returns in Modeling. You need lots of samples and the models have to get quite complicated. A model with a few parameters is not going to hack it. You have vast amounts of data and you’re going to try to learn a model that has many parameters. Well, the more parameters, the larger the amount of training data you need. And there is some diminishing returns issue here. If the model gets sufficiently complicated, it’s impractical to ever train it. I’m rather impressed that these neural nets did so well on the target problem. Undoubtedly they still didn’t do anywhere near 100%.

I think what we’re seeing is that neural nets are beginning to learn non-standard models for things, but I would say that the chances are the world is still too complicated even for the neural nets. On the other hand, they get bigger all the time. No doubt that they can be trained faster all the time. So things are eventually going to start to stabilize. We’re constantly creeping up in our levels of performance.

Computer vision and robot vision, 1960s-1970s; texture gradients

Rosenfeld:

So I would think the next issue to talk about is what happened to computer vision when the AI labs at MIT and Stanford and SRI began to get interested in robot vision. One of the things that happened was a small fiasco in the mid ’60s called the Summer Vision Project at MIT. They were teaching the robot to stack blocks. This was called Blocks World Vision in those days. The world consisted of simple painted blocks on the table top, and they were managing to teach the robot to look at a scene consisting of stacks of blocks, figure it out, decide which was the top block, reach out, pick it up, put it some place else, etc. So that was beginning to be under control. There was a very nice demo at MIT called the Copy Demo in which the job of the robot was to copy a stack of blocks that it was shown.

Geselowitz:

Whose lab was doing that work?

Rosenfeld:


Audio File
MP3 Audio
(344 - rosenfeld - clip 3.mp3)


This was the MIT AI lab. And the people who participated in the Copy Demo are all quite respectable senior people these days. I think Pat Winston, the head of the AI lab, was one of them. Anyhow, one summer, summer 1966 I think, Seymour Papert, who was at the MIT AI lab, decided that since they were getting a bunch of high school kids in for the summer to work at the lab, and they loved to program, let’s just get them to write a program that will recognize a large number of common objects such as hand tools. So I guess they acquired a tool box full of hand tools, and attempts were made to write such programs. Anyway, it never happened. The reason it never happened, as I was told verbally at one point, was that they didn’t have enough computer memory. Could be, but the point is it never happened, and it remained a challenge. What I said a little while ago was that if you really know your hand tools, they’re not too battered and dirty and whatever, and if you have good manufacturer's specs on them, nowadays there should be no reason why we could not recognize hand tools, but the attempt to do it in the ’60s was quite premature.

Why was robot vision a new kettle of fish? Why was it a new level of difficulty? For the two reasons that I suggested earlier, you don’t know which way you’re looking, you’re looking at objects from unknown viewpoints. Furthermore the objects are truly three dimensional. You can’t by any stretch of imagination say the Earth is flat or the microscope slide is a section or the document is flat. You must come to grips with the 3-D nature of the world.

This could lead you to range sensing. In the late ’70s range sensors began to become very popular for robotics applications, and there are now many kinds of them. The analysis of range images is a very nice branch of image analysis. A range sensor gives you an honest to goodness two and a half dimensional image. That is to say, you have two coordinates, angular or Cartesian kinds of coordinates, that say which way you’re looking, and then the distance to the nearest surface in that direction. This is two and a half dimensional because you still cannot see the backsides of things. You see things from your viewpoint. You can get topography, but only your side of the topography, not the backside, and so there are going to be things that are occluded. So range sensing is a nice art, and the analysis of range images is a nice art, and all of that blossomed in the last 20 years. Stereo, which had been crawling along with some limited degree of automation in the aerial photography applications, finally reached the robotics labs in the ’70s and they began to think about how to do close-range stereo. Now it’s going great guns at frame rates and so on.
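(To make the "two coordinates plus a distance" description concrete, here is a minimal Python sketch that converts a range image, assumed to be sampled on a grid of azimuth and elevation angles, into 3-D points. All parameter names and values are illustrative assumptions; note that only surfaces facing the sensor are recovered, which is exactly the "two and a half dimensional" limitation.)

# A range image stores, for each viewing direction, the distance to the
# nearest visible surface. Converting it to 3-D points recovers only the
# sensor-facing side of the scene; backsides and occluded regions are absent.
import numpy as np

def range_image_to_points(r, az, el):
    """Convert a range image r[i, j] (distances) sampled on a grid of
    azimuth angles az[j] and elevation angles el[i] into an N x 3 array."""
    AZ, EL = np.meshgrid(az, el)                 # same shape as r
    x = r * np.cos(EL) * np.cos(AZ)
    y = r * np.cos(EL) * np.sin(AZ)
    z = r * np.sin(EL)
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)

# Example: a flat wall 2 meters away, scanned over a small angular window.
az = np.linspace(-0.2, 0.2, 64)                  # radians
el = np.linspace(-0.1, 0.1, 32)
AZ, EL = np.meshgrid(az, el)
r = 2.0 / (np.cos(EL) * np.cos(AZ))              # ranges to the plane x = 2
points = range_image_to_points(r, az, el)
print(points.shape, points[:, 0].min(), points[:, 0].max())  # x is 2.0 everywhere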

But there was an initial challenge formulated in the ’60s, which was, we do well even with one eye shut. We can see things monocularly and we don’t have trouble reaching out and picking them up and so on. So there ought to be a way of recovering some of this two and a half dimensional information from single images. And so there grew up a specialty which went by the name of recovery, which was short for recovering three dimensional information from images, from single images, more specifically. This blossomed into or fragmented into a number of subspecialties, depending on what kind of clues in the original image you were using in order to extract your two and a half D, your depth information. They go by names such as shape from shading, shape from texture, shape from contour, or more generally shape from X where X is practically anything. Let me say a few words about the histories of these specialties. This will then in turn lead me to Figure 4 of that historical paper, which has a block in it called recovery. So a few words about shape from X. It was suggested in 1950 by James J. Gibson, who was a perception psychologist. Gibson said that one of the ways we see depth in the world is through so-called texture gradients. What this amounts to is the following. Suppose you have an object whose surface is regularly textured. No, I don’t necessarily mean periodically, just that the texture is spatially stationary. So there is detail on the surface of this object, and the detail doesn’t alter systematically as you move across the surface. The surface is uniformly textured, whatever that means. If you look at the surface front on then you see a uniformly textured region in an image.

If on the other hand the surface is slanted relative to your direction of view, then you get what is called a texture gradient. Why? Because the texture contains all this little image detail. The detail is big when the surface is closer to you and it gets smaller as the surface recedes from you, and on top of that the detail is foreshortened in the direction of the slant. So there is foreshortening, so it’s non-isotropic because of the slant, and there is also a gradient because of the change in range. So Gibson, who was not very inclined toward speaking that kind of language, talked about texture gradients. He never even said what it was about texture that you could have a gradient of. After a while I guess people put words in his mouth as to exactly what it meant. But the texture gradient business is one of the important clues to the two and a half dimensional depth in a scene that you can hypothetically extract from a single image.
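(A small numerical sketch of the texture-gradient idea, under simple illustrative assumptions: a pinhole camera at height h above a uniformly textured ground plane, looking horizontally, with square texels of side s. The across-tilt projected size of a texel falls off as 1/z with depth, while the along-tilt, foreshortened size falls off as 1/z^2; that systematic change across the image is the gradient.)

# Projected texel sizes on a receding, uniformly textured ground plane.
# All numbers are illustrative assumptions, not measurements.
f = 500.0   # focal length in pixels
h = 1.5     # camera height above the ground plane, meters
s = 0.1     # texel side length, meters

for z in (2.0, 4.0, 8.0, 16.0):          # depth of the texel along the view direction
    width = f * s / z                    # across-tilt projected size (pixels), ~1/z
    height = f * h * s / z**2            # along-tilt (foreshortened) size (pixels), ~1/z^2
    print(f"z = {z:5.1f} m   width = {width:6.2f} px   height = {height:6.2f} px")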

The community was actually sensitive to the fact that Gibson was a smart guy, so as early as the early ’60s there was a group at General Electric that attempted to build a black box that would detect texture gradients—not completely successfully. By the ’70s, however, it was being done reasonably well, so I’m not going to go into the detailed history of all the tricks, but this is an illustration of shape from X. Also, about that far back, at least the late ’60s, apparently someone suggested in Marvin Minsky’s lab at MIT that it ought to be possible to derive shape from shading. If you have a featureless object, not textured but a smooth-surfaced object, when you look at it you tend to be able to get an idea what its relief is. Why? Because you see that the brightness of the surface changes gradually as you move across the surface, so there is what you would call a gray level gradient. This gradual change of gray level is called shading. Somehow from the shading it ought to be possible to infer something about the three dimensional topography. So Minsky gave that problem to one of the smartest students in the AI lab, Berthold Horn, who is a full professor at MIT now. Curiously Horn never published his thesis. His thesis is an MIT AI memo, but the first citable publication that I was able to find on Horn’s work was the chapter that he had in Patrick Winston’s book The Psychology of Computer Vision. Winston edited a book containing half a dozen chapters on important things related to computer vision, all done of course at MIT. One of them was Horn’s work.

Geselowitz:

And that book was when?

Rosenfeld:

1975 or 1976. It’s five years after the thesis itself. It’s strange that Horn didn’t publish the thesis, but by the later ’70s he was publishing major pieces of research on understanding of shading and reflection and so on. Horn became the outstanding authority on that stuff. Later on he went into time varying imagery. In fact there is a tremendous blossoming of literature at the beginning of the ’80s on recovery and time varying imagery. In ’81 there was a special volume of the journal Artificial Intelligence, just about every paper of which is a major milestone in three-dimensional and time varying computer vision.

Geselowitz:

Now I notice, in looking at your CV, that 1973 was the first time I could find the use of the name computer vision.

Rosenfeld:

I wonder which particular use that was.

Geselowitz:

I was only going by the titles.

Rosenfeld:

Of the papers.

Geselowitz:

I can find it here, if you’re curious which one it was, but I’m wondering when that emerged? Was it right around then, when the AI people got interested, that they started talking about computer vision? Because they viewed what computers were doing as analogous to what you were doing?

Rosenfeld:

Actually, it isn’t clear that the term computer vision was used that early.

Geselowitz:

Here we go. You had a paper Non-Purposive Perception in Computer Vision.

Rosenfeld:

That paper was actually aimed at the AI community and that’s one reason why we used that particular language. But look, computer vision is, as the old catch phrase goes, providing eyes for computers. Computer vision is providing visual input to computers so that they can make the same sense of it that people and animals do. But in the earlier days of even the robot vision work, what they were talking about was scene analysis, and they made a careful distinction (or they could have or eventually did) between image analysis and scene analysis. The image analysis business really blossomed in connection with these rather two dimensional applications. But when you have to recognize that you’re looking at an image of a three-dimensional scene, possibly from an unknown viewpoint, meaning your viewpoint is unknown or the objects are oriented in unknown ways, which is what robots run into, then you have to figure out how to analyze the scene. Analyzing the image is not the be all and end all. You’re trying to figure out what’s out there. Computer vision later came about as a generic term for the thing. There was another thing that happened in the mid ‘70s called the ARPA Image Understanding program, which succeeded an earlier ARPA program called Image Processing. And what was image understanding? Well, yet another synonym for computer vision or scene analysis. But you always need new buzz words in this business. When the image understanding program was started it was suggested that what we’re looking for is some way of getting from the image to a description of the image. The description of the image is a symbolic description. It’s a kind of description in language or whatever. The image, on the other hand, is a two-dimensional signal. So the challenge that was posed at the beginning of the ARPA Image Understanding program was, how do you do the signal to symbol mapping? In fact, they took a number of groups that were already into robot vision and they threw a couple of people into the pot who they thought might be able to contribute to the understanding of how you map signals into symbols. These were people who had some image processing or more likely image analysis credentials. In fact specifically they got the idea that the work that was being done on syntactic analysis of images, an idea actually suggested by Minsky in a 1961 paper, would lead to a way of translating the signal into the symbolic description. And it’s not clear that worked. Minsky suggested that just as you can parse sentences as consisting of clauses which are made up of phrases which are made up of words and so on, you can parse an image as consisting of parts which are made up of subparts and so on. That was an interesting suggestion.
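(As an illustration of the kind of symbolic output the "signal to symbol" framing aims at, here is a small Python sketch, not taken from any particular system, of a hierarchical structural description: parts carry properties, may have subparts in the spirit of Minsky's parsing analogy, and stand in named relations to one another.)

# An illustrative data structure for a structural scene description.
from dataclasses import dataclass, field

@dataclass
class Part:
    label: str                                        # e.g. "cell", "nucleus", "character"
    properties: dict = field(default_factory=dict)    # e.g. area, mean gray level, shape measures
    subparts: list = field(default_factory=list)      # parse-like hierarchy: parts made of subparts

@dataclass
class SceneDescription:
    parts: list
    relations: list                                   # named relations among parts

nucleus = Part("nucleus-1", {"area_px": 410, "mean_gray": 42})
cell = Part("cell-1", {"area_px": 9800, "mean_gray": 131}, subparts=[nucleus])
description = SceneDescription(parts=[cell], relations=[("inside", "nucleus-1", "cell-1")])
print(description)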

Geselowitz:

And this was done by ARPA?

Rosenfeld:

No, this is still the ’60s. In the ’60s a lot of people got excited about the thought of generalizing formal languages from strings, which is what formal languages deal with normally, generalizing them to deal with two dimensional arrays or even fancier kinds of constructs. So in fact that’s one of the major areas that I got into when I came here from industry, looking into how you can generalize formal languages from one dimension to higher dimensions.

Budd Company; University of Maryland

Geselowitz:

Maybe this is a good place to go back and catch us up on your personal career on this.

Rosenfeld:

What it amounts to is, the Budd Company decided to consolidate its operations. In effect it shut down its electronics division and it opened up something in McLean, Virginia, called the Information Sciences Center. And they offered a limited number of people the opportunity to move and work there. Since I was their research manager it’s not surprising that I was offered this opportunity. And I didn’t really want to leave New York, but since I couldn’t find a job as good as the one I had with the Budd Company I decided to move to this area and went to work for them. Since it was a corporate center, it had very little ongoing actual research. I know it had a couple of small contracts on analyzing image patterns, on visual pattern perception, and on analyzing satellite cloud images. Some of that stuff came with me to Virginia. A lot of it was proposal writing.

In any event, when I was in New York, I was also teaching part time. I was teaching mathematics at Yeshiva University nights. So when I moved down here I started looking around for how I could get a part time teaching job. I guess some people just like to be the center of attention and get in front of a class and talk. An old friend, Russell Kirsch, at the National Bureau of Standards, now the National Institute of Standards and Technology, was one of the few people in the image field who is senior to me and who has some really major papers in the early years. He has a very important 1957 paper on digital image processing. That’s the year I got my Ph.D., so clearly I wasn’t writing important papers in those days. He suggested that I contact a guy at Maryland who was trying to start a computer science department. That is, even prior to offering any degrees in computer science he was trying to bring together faculty that would launch the field of computer science here. We, like many other universities, tried to get into it.

Many universities got into it by just branching out from electrical engineering. We’re one of the places where the people who started it were mainly mathematicians who decided that they wanted to move into this kind of stuff. So I managed to get a part-time job here. I was basically being paid for having the privilege of having access to the computer here. And after a while I decided I might really like it better if I came here full time, so I took a pay cut to come here full time and started working with, it wasn’t even students exactly, I started working with research assistants of various kinds, writing papers on the theory of image processing or image analysis.

Geselowitz:

I don’t know if you saw this op-ed piece in the Times on Wednesday by a software engineer claiming that when she went into computer engineering everyone had Ph.D.'s in anthropology, sociology, political science, anything except computer science.

Rosenfeld:

Well obviously, because the computer science faculty at any university were all initially people who had degrees in other things. Most of those people are by now retired, because people started coming out with Ph.D.'s in computer science even before 1970. That’s long enough ago that computer science faculty are now themselves computer scientists.

Geselowitz:

And she’s not necessarily in favor of it. I don’t know how you feel about it.

Rosenfeld:

I don’t know.

Professional societies

Geselowitz:

Before we get back to the work you did when you came to Maryland, let me just ask you to clarify a couple more points in your resume. Because again, this is sort of a two-pronged interview: you’re telling me about the field, which you’re probably the person most uniquely qualified to do, and you’ve just written, or are in the process of writing, this review article; but we’re also interested in your individual experience as a computer scientist and as an engineer. And since I work for IEEE, I observed that it was around the time that you were promoted to manager of research at Budd that you first joined the AIEE.

Rosenfeld:

I don’t remember how many societies I joined. I was a member of a peculiar collection of societies when I was working at Budd.

Geselowitz:

Now was that because they were willing to pay for any memberships you wanted, where you felt the journals would be of use to you?

Rosenfeld:

I don’t know. As a matter of fact, you said the AIEE, but it was more likely the IRE, the other predecessor of the IEEE.

Geselowitz:

They had a technical committee on computing at that point as I recall, a professional group.

Rosenfeld:

What is now the Computer Society, which I guess was then the Computer Group, had a bunch of technical committees, and there was, I think, a processing committee that spawned off something called the Pattern Recognition Sub-Committee, and I was a charter member of that. So we ran the first Pattern Recognition Workshop in Puerto Rico in 1966. That’s already after I’m here, but I guess I got asked to join it; I’m not sure if I was even quite here yet. But I knew a lot of the people who were in the business. I mean, people like the late Leon Harmon, at the time at Bell Labs, and Russ Kirsch of the Bureau of Standards. So I knew a number of people who, when asked "would you like to serve on this committee," tended to recommend their friends so they could avoid the chore. You probably also noticed that by 1972 I was a Fellow of IEEE.

Geselowitz:

In essence you were a Fellow of the then-merged IRE and AIEE, and, as you pointed out, a member of a number of other diverse societies that we may or may not come back to. But I was also curious: it seems to be around this time that you were involved with the formation of the Association of Orthodox Jewish Scientists.

Rosenfeld:

I don’t think I should take credit for the formation of that.

Geselowitz:

You were involved then.

Rosenfeld:

This takes us way back. The society was formed in the winter of 1947-1948. In fact, although they didn’t want to take undergraduates, I did manage to get into it as an associate member; a couple of us did manage to get in. So I was an early member of that thing. But I guess the society proceeded along its quiet way for twelve or fifteen years, and then decided it needed a shot in the arm, so it tried to get more people interested and more people active in it and so on. So there was a bit of turmoil in attracting new people into it. I guess that’s around the time that I became seriously active in it. In fact I was probably secretary one year and vice-president the next year and president the third year and chairman of the board of governors after another couple of years. The truth of the matter is, although I’m still a member of that organization, I’m not very active in it. That society has continued, but it’s become kind of slanted. It was founded mostly by physical scientists in the mid ’40s, but it’s now overwhelmingly slanted toward health care. That is another reason why I’m less thoroughly involved in it.

Geselowitz:

But it’s still going, more or less? Because this would be their 50th anniversary, then.

Rosenfeld:

That’s true. As a matter of fact they just ran a little symposium earlier this month in Jerusalem because a number of the founding members are in Israel now, probably all retired. Israel is one of the places people can retire to.

University of Maryland teaching; writing algebra text

Geselowitz:

For those who choose to retire. So now we have set you back at the University of Maryland and you started teaching here full time.

Rosenfeld:

Well, when I came here I voluntarily taught some math courses because I was writing a book on modern algebra, so I taught algebra for a couple of semesters just to debug the book.

Geselowitz:

And that was An Introduction to Algebraic Structures?

Rosenfeld:

That’s the one. That’s my first book. By that time I was trying to put together what the story is on this computer vision business, or I should say image analysis business. So I made a number of attempts to organize the several hundred papers that I had seen into some consistent story. Gradually I recognized that there is image processing and there is image analysis, and that image analysis has some sort of a standard paradigm. So by the time that the literature reached the enormous size of 400 papers I managed to get a book written, and almost simultaneously a survey paper and bibliography with pretty much the same references, but also very concisely describing what the field was about in my opinion. I had made several previous attempts to interest various journals in taking some sort of a paper that says here is an overview of what people are doing with images and computers.

It was hard, but I finally found someplace that would sit still for my particular slant, which I used to think of as a mathematician’s-eye view of what people are doing with images, because I was extremely interested in defining some terms. Just what are the basic processes? What are the alternatives? What is the taxonomy of possible things you can do? The taxonomy of image properties is, for example, a question I was particularly interested in. I wasted-- I won’t say wasted, but I devoted a lot of thought to such a profound question. I guess a lot of that thought has left some sort of lasting impact on the field, either because my thought reflected what the field was already doing or because what I codified had some influence on people who got into the field afterwards.

Governmental funding and research

Geselowitz:

I notice that soon after you arrived here (again, you’ve kindly supplied such a detailed CV that there are a lot of clues there for interesting angles), soon after you took your full-time position here, you record getting your first grant from the government, and it was for development of a parallel picture processing language. And I noted that it was not from the Department of Defense, which would have been my first guess, or from the NSF, which would have been my second guess, but from the NIH, the National Institutes of Health. I was wondering if you can recall, given the different application areas we’ve talked about, what people called the work you were doing that would have caused the NIH to be interested.

Rosenfeld:

First let me correct you, because I don’t think that was the first piece of funding. First of all, the funding that is mentioned in my resume is only funding that I got here, because obviously I had funding before I was here. But that just doesn’t enter because this is a Maryland resume. One of these days I will probably take the trouble to add to this resume the pre-Maryland stuff. The first piece of funding I recall getting at Maryland was from the Office of Naval Research.

Geselowitz:

Find a book through each agency and look at the earliest date.

Rosenfeld:

You’ll find Navy, Unified Theory of Image Processing. That’s the one that started in February 1966.

Geselowitz:

So that was a year earlier.

Rosenfeld:

And in fact it was under that grant that I wrote that book, Picture Processing by Computer, which was published in 1969. Then there's the one that you just mentioned; that was from NIH and was for a very small amount of money. When we tried to process images we had to read them in from tape three rows at a time and perform some local operation on the pixels of the second row, then output the first row and throw it away, read in the fourth row, do your 3 x 3 operation on the next row, and so on. It was observed at the University of Illinois, where they were developing the Illiac 3 computer, that computers are capable of doing parallel bit operations on all the bits of a word. There used to be things called words. At IBM words were 32 bits, but we happened to have a Univac where it was 36 bits. No matter which computer you had, the data was chunked into things called words, which were not the well-known 8-bit byte, but something somewhat bigger, typically in the range of 30 to 60 bits.
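
As a concrete illustration of that row-at-a-time processing, here is a minimal sketch in present-day Python (the function names and the toy image are invented for illustration; this is not the software of that era): only three rows are held in memory at once, a 3 x 3 operation is applied to the middle row, and the three-row window then slides down by one row.

```python
# Minimal sketch of three-row buffering (hypothetical names; a toy image, not period code).
import numpy as np

def local_3x3_average(prev_row, cur_row, next_row):
    """Return the 3 x 3 neighborhood average for each interior pixel of cur_row;
    edge pixels are left unchanged."""
    stack = np.vstack([prev_row, cur_row, next_row]).astype(float)
    out = cur_row.astype(float)
    for j in range(1, len(cur_row) - 1):
        out[j] = stack[:, j - 1:j + 2].mean()
    return out

def process_image_row_by_row(read_row, n_rows):
    """read_row(i) plays the role of reading row i from tape; only three rows
    are ever held in memory at once."""
    rows = [read_row(0), read_row(1), read_row(2)]
    results = []
    for i in range(1, n_rows - 1):
        results.append(local_3x3_average(*rows))   # operate on the middle row
        if i + 2 < n_rows:                         # slide the three-row window down
            rows = [rows[1], rows[2], read_row(i + 2)]
    return np.vstack(results)

# Toy usage: a 6 x 8 image whose rows are fetched one at a time.
image = np.arange(48).reshape(6, 8)
smoothed = process_image_row_by_row(lambda i: image[i], image.shape[0])
print(smoothed.shape)   # (4, 8): the four interior rows have been processed
```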

The way the hardware was configured, it was possible to do bitwise operations on all 30 to 60 bits of a word simultaneously. So it was thought that this was a great way of using conventional computers to do somewhat parallelized processing of big data arrays. The Illiac 3 computer was actually designed to do honest-to-goodness parallel processing on 36 x 36 two-dimensional arrays, but they decided that they were going to simulate it on some mainframe one row at a time, because you could take the 36 bits of one word (I forget what kind of machine they had at Illinois) and process the 36 x 36 array one row at a time in parallel. So if you look at the title of that grant you mentioned, it had something to do with PAX. PAX stands for the simulator of the Pattern Articulation Unit for Illiac 3, and Illiac 3 was built to process bubble chamber pictures. The creator of Illiac 3 was Bruce McCormick at Illinois; I believe he is no longer alive. He was in Texas for the last 10 or 15 years. McCormick’s machine was intended to take a 36 x 36 piece of a bubble chamber image, which means a big enough piece that you can read an event into it: particle tracks colliding, branching, whatever. You have enough bits to get a good look at such a piece of a bubble chamber image, and then you segment it into the individual tracks, describe the shapes of the tracks, and so on. Since they were interested in processing literally millions of bubble chamber images a year, they wanted to be able to do it fast rather than hire armies of students to read the images by hand. So they were building Illiac 3, which I don’t think ever ran. They had some setbacks, including a fire. But they built Illiac 3 to do this in 36 x 36-fold parallel. This 36 x 36 array was 1000 bits deep, so that you could store in this 1000-bit-deep 36 x 36 stack large numbers of Boolean images representing information about the input image. Take the original input image: let’s say you threshold it so that it’s just a single bit per pixel, but then you do various processing to it because you want to discover the tracks, thin them, discover which way they’re pointing, discover the events where the tracks branch or collide or something. This requires many successive operations, and you could imagine storing the results of these successive operations in a stack of bit planes, and that’s how the thing was designed. So someone there noticed that if you looked at the hardware architecture of conventional mainframes, you could maybe not do 36 x 36 in parallel, but you could do 36 in parallel. That led them to create a language called PAX, the simulator for the pattern articulation unit, which was the parallel processing unit of Illiac 3. That having been done, other people got interested: gee, why can’t we do that too?
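
Purely as an illustration of that word-parallel trick, here is a minimal sketch (the names and the toy row are hypothetical; this is not PAX code): pack one 36-pixel row of a thresholded image into a 36-bit word, and a single bitwise expression then performs a neighborhood operation on all 36 pixels of the row at once. Stacking many such Boolean words is the software analogue of the bit-plane stack described above.

```python
# Minimal sketch of word-parallel bit operations (hypothetical names; not PAX itself).
WORD = 36                     # one row of a 36 x 36 binary image packed into one word
MASK = (1 << WORD) - 1        # keep results within 36 bits

def erode_row(row_bits):
    """Keep a pixel only if it and both of its horizontal neighbors are 1.
    One bitwise expression operates on all 36 pixels of the row at once."""
    left = (row_bits << 1) & MASK
    right = row_bits >> 1
    return row_bits & left & right

def show(bits):
    return format(bits, '036b')

row = int('000111111000001110000000111111000000', 2)
print(show(row))              # original row of pixels
print(show(erode_row(row)))   # each run of 1s shrinks by one pixel at each end
```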

Geselowitz:

What was NIH interested in at this point? Would that be blood smears, radiology, all of them?

Rosenfeld:

NIH was interested in many different medical applications. In this case, it was more likely cytology. Not necessarily blood smears, but it was more likely microscope images. The Division of Computer Research and Technology at NIH, which is an inter-institutional entity, was providing computing facilities for all the different institutes, and there was a lot of interest at NIH in image processing because of the numerous important medical applications. So they decided they would like to install PAX there. We also wanted it here, and we had already looked at it, and they gave us some tiny amount of money to provide a version for them. Because I think we had a Univac and they had an IBM, we ended up writing several different versions of PAX. There was a Control Data version too. As a matter of fact there are still boxes of punch cards sitting in this office, which are the source decks for Control Data PAX and Univac PAX and IBM PAX. In fact there was a PAX users' newsletter for a couple of years, because there were people in various parts of the world who were interested in picking this up. It seems silly nowadays, because nowadays you can do all sorts of things in real time on your laptop.

University of Maryland departmental structure; graduate teaching and interdisciplinarity

Rosenfeld:

We began to hire faculty at the beginning of the ’60s here. Once you had faculty, they began to think "we ought to give courses," and so they began to give for-credit courses on campus. The first step usually taken in these matters is to start a masters program, because that’s easier than starting an undergraduate program and certainly easier than starting a Ph.D. program. So the masters program started here in the late ’60s, the Ph.D. program I think around 1970, plus or minus, and then by 1974 they actually had a bachelors program. In order to have a bachelors program they had to form a computer science department. When they did so they put most of the faculty in it. But some of us had been hired with no teaching loads because we pre-dated all the teaching, and so they couldn’t force us into it. And in fact when the department was formed, there were three full professors who stayed out of it. But one left almost immediately to become director of one of the NASA centers, at Langley. Another one stayed around for about five years and then took a chair at the University of Pittsburgh. And that left only me. So I was sitting around with no teaching load, but lots of Ph.D. students. Most of my students were from the computer science department. That department gave me a zero percent appointment.

Geselowitz:

Did you continue to teach any math courses at that time?

Rosenfeld:

No, only long before then. I was on the committee that designed the degree programs in computer science and I was the first one here to teach pattern recognition, artificial intelligence and image processing.

Geselowitz:

At the graduate level.

Rosenfeld:

At various levels, but of course it started at the graduate level and eventually moved down to the undergraduate level. We have five faculty in computer vision here now. I’m the only one who is still not in any department. But we have one professor of Electrical Engineering and three professors of Computer Science.

Geselowitz:

But I notice that you’re affiliated with various departments.

Rosenfeld:

That has something to do with the fact that this Center was intended to be quite interdisciplinary. Maybe at some point even more so than it is now, but clearly we have interests in common with both electrical engineering and computer science. Do you see something in my bookcase that says Robotics at Maryland? It was an attempt to interest the administration in the fact that robotics is really neither engineering nor computer science. It went by this nice little definition: The intelligent connection of perception to action. I happen to be interested in robotics as a stand-alone discipline. Of course, I’m primarily interested in robot vision. We ended up with a list of about 45 faculty at the time that claimed an interest in robotics. Carnegie-Mellon started a Robotics Ph.D. program, which has not been welcomed with open arms by the academic community but it hasn’t flopped either. Carnegie-Mellon is of course famous for starting its huge Robotics Institute, and naturally that’s the logical place to start a degree program in robotics.

In my mind robotics falls so squarely between all sorts of other disciplines that it has no business being sucked up by any of them. But after all, at a lot of universities computer science was never let go from electrical engineering. This is not one of them, but you can point to MIT, Berkeley and whatever, at which CS is part of EE only, not stand-alone. Computer scientists naturally would like to feel that there are branches of that field that are so unlike electrical engineering that it’s ludicrous to call them that. Many EE departments therefore had to call themselves Electrical and Computer Engineering, because otherwise they would be a laughingstock; they would be doing so many things that are by no stretch of the imagination engineering. Anyway, from a conceptual viewpoint, the eyes-and-ears-and-arms-and-legs-for-computers business requires engineering, but it’s a subject conceptually big enough to be its own field.

Geselowitz:

I guess you could say that it actually makes sense from a robotics point of view to have a joint ECE department, because if for institutional reasons you’re not going to allow a separate robotics department, then it makes more sense to have computer scientists and electrical engineers in the same building to work together on robotics; if you completely separated the departments it would make that work more difficult.

Rosenfeld:

Well, a very nice thing of the last five years at this university is that computer science and electrical engineering are in the same building. But that was never the case until the last five years. It’s actually worked out quite well. It certainly works out well for areas that straddle the fields, such as this computer vision business.

Geselowitz:

Just for example, at Rutgers, with which we’re affiliated, there clearly is some tension between the computer science department and the electrical and computer engineering department. So actually the EEs have kept computer engineering and there is a separate computer science department.

Rosenfeld:

I think it’s very reasonable that there should be a field called computer engineering. Anyway, this is the politics of computer science, if you will. I don’t know if we need to go into it. Computers were busting out all over. After all, how often is a new academic discipline started, and then one which grows like gangbusters until, for instance, the majority of the people who get degrees here in the sciences are in computer science? Computer science was something you just couldn’t ignore.

Scene analysis and image analysis; disciplinary terminologies

Geselowitz:

So I guess that gets us up to the ’70s.

Rosenfeld:


Audio File
MP3 Audio
(344 - rosenfeld - clip 4.mp3)


I would like to add a few more bullets on why scene analysis is different from image analysis, and in what ways. The first group of ways was called recovery, and these techniques were sometimes known, in this cutesy way, as shape from X, which meant obtaining depth information from cues in a single image. So I mentioned using texture as such a cue, and that goes way back. I mentioned using shading as such a cue, and that represents Horn’s milestone thesis on shape from shading. The third one that goes with the first two is sometimes called shape from contour. It’s the deduction, from clues about boundary curves, of which things are in front of which. It can give you relative depth information; this was pioneered by Adolfo Guzman at MIT in the mid ’60s.

Guzman came up with all sorts of clever rules for looking at a line drawing, at a blocks-world line drawing. (Later attempts were made to extend it to general line drawings.) You look at the various kinds of junctions and you ask: does this junction appear to come just from a view of a single object, where you happen to be looking at an edge or a corner of it, or does this look as though there is one object occluding another? What kind of junction have we got here? One of Guzman’s early heuristics was that if you find T junctions you should be very suspicious of occlusion. In a T junction you have the following sort of situation: above the crossbar of the T is one region; below the crossbar of the T are two regions. The hypothesis is that the region above the crossbar is in front, and its edge is covering up something which itself consists of two regions; the reason you have got a T is that the boundary between the two farther-back regions is partially hidden by the nearer region. This is the heuristic of T’s suggesting occlusion, which is one of several rules that Guzman came up with.
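
As a rough illustration of that T-junction heuristic, here is a minimal sketch (the function, the angles, and the tolerance are hypothetical and are not Guzman’s actual rules): test whether two of the three edges at a junction are roughly collinear, treat them as the crossbar, take the third as the stem, and then hypothesize that the region on the far side of the crossbar from the stem is the occluding one.

```python
# Minimal sketch of T-junction detection (hypothetical; not Guzman's program).
def classify_t_junction(edge_angles, tol_deg=15.0):
    """Given the directions (in degrees) of the three edges leaving a junction,
    return the stem angle if two of the edges are roughly collinear (the crossbar)
    and the third is the stem; otherwise return None."""
    assert len(edge_angles) == 3
    for i in range(3):
        a, b = [edge_angles[j] for j in range(3) if j != i]
        # Angular difference between the two candidate crossbar edges, in [0, 180].
        diff = abs(((a - b + 180.0) % 360.0) - 180.0)
        if abs(diff - 180.0) <= tol_deg:      # nearly opposite directions: a crossbar
            return edge_angles[i]             # edge i is the stem
    return None

# A stem pointing "up" meeting a roughly horizontal crossbar.
stem = classify_t_junction([90.0, 0.0, 182.0])
if stem is not None:
    # Guzman-style hypothesis: the region on the side of the crossbar away from the
    # stem is in front; the stem is the partly hidden boundary between the two
    # farther-back regions.
    print(f"T junction, stem at {stem} degrees: suspect occlusion")
```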

Several people in the early ’70s took Guzman’s rules and put them on a more firm foundation. So there is a lot more understanding now of what you can infer or try to infer about three dimensional layouts of a scene from this kind of cue having to do with apparent interpositions of objects and their boundaries. So the so-called recovery business got to be a major sub-specialty of robot vision and scene analysis.

Computer vision is a good generic term. I might parenthetically say there are a lot of these semi-synonymous terms. A lot of people like to use image understanding because they wanted to feel a part of the ARPA funding even though they didn’t have any of the money. There are people who call it computational vision, people with different tastes in the phraseology. The people who call it computational vision seem to be mainly interested in studying models for biological visual systems. The people who call it image understanding maybe are interested in treating vision as an inverse problem of inferring information about a scene from an image. The people who call it machine vision maybe are interested in applications, but these are all maybes. These terms are used quite interchangeably by different schools.

Multi-dimensional analysis

Geselowitz:

I actually made notes to myself that you could almost make a grid: sometimes the word computer is used, sometimes machine, and sometimes automated. And sometimes you’re talking about an image, sometimes a picture, and sometimes a pattern. And sometimes you’re understanding, sometimes you’re recognizing, sometimes you’re processing, and sometimes you have vision. It’s sort of a cube, you know, a three-dimensional matrix.

Rosenfeld:

You could almost take one from column A and one from column B. For a number of years I leaned toward using the word picture, because I didn’t like the idea that image suggested an optical image or a radar image or a sonar image. Take a document, for instance: of course you have to scan the document into the computer, but you’re not starting with something which was acquired by an imaging sensor. You’re starting with an honest-to-goodness flat object and you just feed it in and turn it into a digital image. So I didn’t like the idea, but image won out. The funny thing is, "pixel" also won out, and that is short for picture element. So in some sense picture survived. No one says "imel"; it’s hard to pronounce. "Pixel," which was a Jet Propulsion Lab coinage, is the one that won out. But image processing and analysis is what the field is called.

Anyway, you have these various terminologies in the field because there are multiple motivations for being interested in how to find out about scenes from images. One is the challenge of how biological organisms do it. Another one is that it’s got lots of good practical applications. But the third one, in the middle, is that there is a scientific problem here. It’s an inverse problem. The scene is three-dimensional. We have a two-dimensional projection of it, assuming it’s that kind of image (if it is radar, it’s some other kind of imaging geometry). But speaking only of optical images, you have a two-dimensional projection of the scene, and the question is how much information about the scene you can manage to recover from the image. So recovery became a central scientific part of so-called image understanding. As I mentioned, there is a special volume, I think September ’81, of the Artificial Intelligence Journal, which was also published as a book edited by Mike Brady, called probably Computer Vision. This was a collection of real milestone papers on things such as recovery processes.

Recovery, remember, only provides you with two-and-a-half-dimensional information about the world. It only tells you about the world from your viewpoint, but how do you infer what is actually out there? How do you know that the objects you see aren’t just false fronts which, if you went around behind them, would turn out to be hollow shells? Of course you don’t really know it, but how do you at least attempt to infer it? You need to look at the appearance of what you can see and ask what out there in the scene could have given rise to this image. The generic name for that sort of thing is back projection: what is there back in the scene that could have projected onto our observed image? If you have some idea, some models for what might be out there in the scene, you can attempt to account for the image in terms of what’s out there. That’s the second half of the paradigm, the first half being: can you get the depth of what you can see? It’s not always essential to do so, but it’s a challenge to do so. The second half is, given what has projected onto your retina, what was out there in the world that could have given rise to it? So the scene analysis paradigm is in several respects richer than the image analysis paradigm. As you gather, it began to take off in the mid ’60s, so about ten years later than the two-dimensional image analysis stuff, and it developed a large art and a large science. In all of this we’ve been deliberately assuming that we’re looking at a single two-dimensional static image. There is the entire discipline of understanding depth images of various kinds, whether you get them from a single range sensor or by putting them together.

Geselowitz:

Slices?

Rosenfeld:

Well, slices is yet another thing.

Geselowitz:

You’re not talking about tomography.

Rosenfeld:

No, I’m not talking about tomography. That’s yet another development. Tomography is a tremendous triumph of image processing. You’re taking projections from various directions, massaging them and coming out with cross sections. So it’s a many-image-to-many-image kind of processing. And tomography has provided the image analysis community with a new dimension, so to speak, because now there are all sorts of people who try to understand three-dimensional images. But understanding three-dimensional images is not ferociously hard; just as I was saying, processing two-dimensional images is a lot like processing one-dimensional signals. It’s up one dimension, but there are enormous analogies.

Analogously, analyzing three-dimensional images is a lot like analyzing two-dimensional images. I mean the pure 2-D images, not the 2-D images of 3-D scenes where you’re missing so much. On a microscope slide you can do a lot with this two-dimensional image, and that’s really all you’ve got. When you’ve got a 3-D MRI reconstruction of some part of the body, you have a three-dimensional image composed of three-dimensional "pixels" called voxels, short for volume elements, and everything you can ask about boundary detection, segmentation, property measurement and so on all scales up quite naturally. Scaling up 2-D image analysis to 3-D image analysis was not a big breakthrough. It was just a matter of waiting until the 3-D images were adequately available. In fact, a particular pain in the neck was that usually when they did tomography they spaced the slices fairly wide apart, so the Z resolution was much lower than the X and Y resolutions. It took a while until they could afford the computer power to construct the 3-D images. For example, in the Visible Human Project, they’re taking those frozen cadavers and analyzing them very thoroughly as well as slicing them so they can see what the actual cross sections look like. Quite a lot of data. A real multi-terabyte project.
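
To make "it all scales up quite naturally" concrete, here is a minimal numpy sketch (the function name and the toy volume are hypothetical, for illustration only): the 2-D idea of marking object pixels that touch the background through one of their four neighbors is restated, essentially unchanged, for the six face-neighbors of a voxel.

```python
import numpy as np

def boundary_voxels(volume):
    """Mark object voxels that touch the background through any of their six
    face-neighbors: the direct 3-D analogue of 4-neighbor boundary detection
    in a 2-D binary image.  `volume` is a 3-D boolean array (True = object)."""
    padded = np.pad(volume, 1, constant_values=False)
    core = padded[1:-1, 1:-1, 1:-1]
    neighbors_all_object = (
        padded[:-2, 1:-1, 1:-1] & padded[2:, 1:-1, 1:-1] &
        padded[1:-1, :-2, 1:-1] & padded[1:-1, 2:, 1:-1] &
        padded[1:-1, 1:-1, :-2] & padded[1:-1, 1:-1, 2:]
    )
    # Object voxels with at least one background face-neighbor.
    return core & ~neighbors_all_object

# Toy example: a solid 3 x 3 x 3 cube inside a 5 x 5 x 5 volume.
vol = np.zeros((5, 5, 5), dtype=bool)
vol[1:4, 1:4, 1:4] = True
print(int(boundary_voxels(vol).sum()))   # 26: every cube voxel except the center is on the boundary
```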

Geselowitz:

And who is working on that?

Rosenfeld:

The Visible Human Project is at NIH or the National Library of Medicine. It’s not as big as the Human Genome Project. They took a couple of cadavers out of the morgue that nobody wanted, one male and one female. First they x-rayed the hell out of them so that they could do all of their reconstructions, but then they also froze them and sliced them.

Geselowitz:

To compare that to the reconstructive sections.

Rosenfeld:

Exactly. This way they get a major human body atlas from real, complete models, and also with ground truth. Let’s see. We’re a little bit sidetracked. We were talking about single images and stereo pairs, and then we went off a bit into 3-D imaging, which is yet another important source. Again, it was not very available until the last couple of decades, so it’s a more recent kind of thing in our business. But we began very early to generalize all the basic 2-D operators to 3-D, so that sort of thing was a fairly straightforward generalization.

The other thing I want to say is that traditionally it was all we could do to cope with a single image or maybe a stereo pair of images, a very small amount of information. At some point, I guess you have to say beginning in the earlier ’70s, people began to be challenged with the fact that the world doesn’t stand still. Okay? You can look at the dynamic world with your camera and get time-varying images. Worse than that, your camera itself can move around, so you can get time-varying image sequences obtained by a moving sensor. So now you have a tremendously dynamic, very high dimensional data challenge. At the first workshop on Time Varying Image Analysis, which was held at the University of Pennsylvania in ’75 or ’76, I was chairing one of the sessions and I made the cynical remark that you guys won’t give us any peace. For all these years we’ve been struggling to get enough computer power to process single images, and now that we’re more or less comfortable with doing that, you’re demanding that we process image sequences. Of course in those days it wasn’t in real time, but the challenge still was, can we process image sequences? That became, from the mid ’70s to recent times, one of the richest areas. I mean it really blossomed from the mid ’70s until sometime in the ’90s as the hottest area in terms of numbers of papers being published. Another area, incidentally, from the late ’70s to the early ’90s, was computer architecture for image processing. And the reason this is no longer such a hot area, though it continues a little bit, is that general-purpose computers are now so powerful. When we talked 30 years ago about how we could process images, we were struggling to do it with machines that were designed primarily for business applications and were totally unsuitable for processing arrays of numeric data, and so we made do, writing awkward software to do these kinds of things, somehow managing to get the images into the computer. You know that in the ’50s they had to punch the images in, but by the late ’50s people were actually using flying-spot scanners and drum scanners and so on. So one of the things that happened in the ’60s was that at least we managed to get images into computers in a reasonably effective way. But then to actually get them into main memory and process them, that was still slow and painful. There has been so much progress in this that recalling the old days is almost embarrassing.

I have a slightly cynical view. There have been absolute marvels of mathematical elegance and algorithmic beauty and all sorts of lovely things developed for and inspired by the problems of analyzing images and image sequences and scenes. But what we actually have running at frame rates turns out to be the old stand-bys, relatively brute force. The most you can say is that some of the fancy stuff that is now in the literature will be regarded 20 years from now as brute force, and people will be doing it.

Professional affiliations in image analysis; IEEE

Geselowitz:

If I could just turn back to the society affiliation kind of issue, I’m curious. There are now a number of international symposia and so forth that are maybe small and relatively focused, but in terms of major societies, where are the image analysis people by and large? Are they in the Computer Society, or are they in the ACM? Both of those?

Rosenfeld:

It's very interesting that you should bring that up. I started publishing in the Journal of the ACM; in fact I was an associate editor for a few years, partly because this place called itself computer science. But the IEEE has had the greatest involvement with image processing and computer vision. The ACM has become identified with areas such as computer graphics and computational geometry, both of which have something to do with images, but not in the same sense that image processing and computer vision do. Those two areas, by the way, are coming a little bit closer together in the virtual reality business, particularly in acquiring the models. Seth Teller at MIT has a mobile robot wandering around Cambridge and building a model of Cambridge. He is funded by ARPA. He’s part of what is now the successor to the ARPA Image Understanding Program. That program lasted 20 years, and finally ARPA got embarrassed by the fact that ARPA programs aren’t supposed to last 20 years.

There is now a program underway called Image Understanding for Battlefield Awareness, which is the same thing, but it’s expected to deliver. Anyway, there are some interfaces between the disciplines of computer graphics, visualization, virtual reality, image analysis, scene analysis, computer vision, whatever. But society-wise, the ACM certainly was one of the places where papers in these areas began to be published, but less and less and less, I think. The ACM has rarely run an image processing or computer vision kind of conference. The IEEE does an enormous amount of that. On an international level, I realize that the IEEE is an international society, but that’s not exactly how it’s perceived by people in most of the world. In the early ’70s there was an effort to get people from all over the world together, and at that point the International Association for Pattern Recognition was founded.

Geselowitz:

In 1978.

Rosenfeld:

Yes.

Geselowitz:

And you were very active in that.

Rosenfeld:

Yes, I was president of that too. I was also in the Society of Manufacturing Engineers.

Geselowitz:

Was that through the work here to do computer vision for industry, like we discussed earlier?

Rosenfeld:

It wasn’t so much related to what we were actually doing here. I'll tell you what motivated it originally. We in the IEEE Computer Society were constantly griping to ourselves, “Why is it that application after application comes out, grows and takes off to do its own thing? How come we’re not like Siggraph? Why don't we have a meeting with 10,000 people coming?” Of course the real answer to that is the computer vision meetings have been dominated by academics and therefore they’re guaranteed to stay small. But as to why the applications break off, we were thinking, “How can we stop this from happening?” So we asked ourselves, can we see an application that is about to emerge and try to hold onto it, and the one that we thought of was manufacturing, and then we discovered that the SME was beginning to start a Machine Vision Association. So what I decided to do was infiltrate. So I became a founding member of their board of directors. But I gradually dropped out of it.

They also run an annual Machine Vision conference under various umbrellas, sometimes as part of a robotics conference. In fact, it’s more or less at the same time of year that IEEE runs its annual conference. But it's very application oriented. They’re not interested in theoretical papers. They want to see reports on the effectiveness of working systems, working applications. That's quite understandable. The fact of the matter is, you can’t expect to have both. I don’t know how the computer graphics people succeeded in doing it. They succeeded in keeping control of the theory of how you do synthesis of realistic images, and at the same time they attract tens of thousands of people to their big annual show. It’s partly because they’re doing such beautiful art.

One of the things about computer vision is that it’s very undramatic. In fact for years we used to gripe that it’s hard to get funding for what we do, because any one-year-old child does it better than the computers do, and the funding agencies can’t bring themselves to understand what’s so hard about it. If a dog can do it, why can’t a computer do it? Why should we pour all this money down a bottomless funnel to do it better? But the investment has paid off, because we can now do it much better. So it’s really at the point of usefulness in many domains. Take the documents domain. Computer reading of unconstrained handwriting is still a major research issue, but there are large parts of the documents domain where OCR works quite well. They used to worry about how you do multi-font character recognition; that’s simply no longer a problem. It’s brute force: you throw a thousand templates at each character. No one knows exactly what the recognition logic is because it’s a trade secret of the company. But that’s the size of it. They’re not doing anything enormously different from what was appearing in the open literature in the ’50s and the ’60s.

Geselowitz:

What about the interaction with IEEE Societies other than the Computer Society? Particularly I’m thinking of the Signal Processing people, the people doing image processing. Now I notice, looking at your Festschrift from a couple of years ago, from ’96, that the only prominent SPS person who was involved was Tom Huang.

Rosenfeld:

Tom Huang is both.

Geselowitz:

He was with you on the technical committee on pattern analysis and machine intelligence.

Rosenfeld:

Tom Huang, in other words, is active in both worlds. He started off as an image coding and processing person, but at some stage got very much into the image analysis/computer vision business.

Geselowitz:

So the people who bridge the two fields are the exceptions. These are pretty much separate worlds.

Rosenfeld:

I would say people tend to publish in one place or the other, but not both. When the multi-dimensional signal processing committee was started, as I mentioned, I infiltrated that too. Actually I was invited because I had some connection with Schlumberger, so Mike Ekstrom got me in on it. I was their token image analysis person, or something like that perhaps. In fact, Ekstrom put out a rather nice book consisting of chapters on various aspects of image processing, and I wrote the obligatory image analysis chapter. But really, just as graphics and computer vision have certain interests in common, so do image processing and computer vision.

Geselowitz:

I see you also infiltrated the American Association for Artificial Intelligence?

Rosenfeld:

I don’t know about infiltrated. But there is another piece of politics there. The Artificial Intelligence conferences were asked repeatedly, would you mind letting go of computer vision? It was always the case that when you went to an AI conference there were a certain number of vision sessions, and no one but the vision people went to them, and vice versa: the vision people didn’t go to any of the others. There has been extremely little cross-fertilization between symbolic AI and computer vision. There were a couple of people who were active in the AI societies who stubbornly insisted that we have to keep vision, and so they did. But we broke away and started the International Conference on Computer Vision. I ran the first conference in 1987, partly because we felt the quality of what was being done in pattern recognition wasn’t high enough and partly because we felt that it was time the artificial intelligence people let go of computer vision and let it be its own thing. But that didn’t work. AI kept hold of it. And at the pattern recognition conferences the majority of the papers are on images.

If you count all the applications, on the order of 90% of the papers in a major pattern recognition conference are on what you could call image pattern recognition, on image analysis, on pictorial pattern recognition. (It went by all of those names in the old days.) Pattern recognition never let go of image analysis, even though some specialties have broken off. AI doesn’t let go of computer vision either. I guess nobody wants to let go of things. Well, it’s nice for us in the sense that we can really hobnob with quite a diversity of other groups all of whom claim to be doing our kind of thing. I think you will find significant overlap in who is publishing under these different umbrellas. There is a certain pecking order, but you’re very much going to find that the good people in the field are able to give papers at the AI conferences as well as the pattern recognition conferences and the computer vision conferences.

The 1987 computer vision conference has become a regular thing, held every two years. The last one was held just about the first or second of January in India this year. But now there are also regional conferences. There has always been a US conference. When it was started it was called Pattern Recognition and Image Processing, but after a few years it changed its name to Computer Vision and Pattern Recognition. (People were griping, why do you have to be fashionable? But the popular demand was to change it to Computer Vision and Pattern Recognition.) Now there are also a European conference on computer vision and an Asian conference on computer vision.

Geselowitz:

Under whose auspices are these?

Rosenfeld:

I’m not sure. The International Conference on Computer Vision is an IEEE operation. And the European Conference is certainly not restricted to European papers. At the first one the two biggest countries represented were France and the US, and I forget in which order.

Physics education

Geselowitz:

Before I wrap up I wanted to ask you, going back to early in your career, when you were at Yeshiva and switched from physics to math: what got you interested in physics, or made you think you might be interested in physics? And did being Jewish have any effect on that, in terms of the advice that you received or how you were perceived?

Rosenfeld:

I don’t know.

Geselowitz:

Was it because of Einstein?

Rosenfeld:

I’m not totally sure why I decided to major in physics. I knew I was going to major in something mathematical. I’m not sure what I thought about in those days, but I was sure it was going to be one of the sciences. It may well be that it was because physics is a lot cleaner than chemistry and biology. Physics is more abstract. Physics was the cleanest science if I was going to get into something scientific.

Geselowitz:

My father told me that he was really always interested in mathematics, and was gifted in science and math in high school and was basically told by the guidance counselors that he couldn't major in mathematics. That he would never have a career that way. He ended up in electrical engineering, and was fortunate enough to get back to a branch where he was doing more theoretical mathematical type work. So I was just wondering if there was a similar story operating. Physics seems pretty abstract also. I mean engineering was something you could sink your teeth into.

Rosenfeld:

I think, knowing my tastes in mathematics versus physics, I am a theoretician, and I guess physics was interesting because there was theoretical physics. I certainly remember as a kid being interested in theoretical physics. So quite likely that may have gotten me interested in majoring in physics, but then I found that what I really loved was mathematics, which is even more theoretical and abstract. And of course Yeshiva University doesn’t have an engineering school. So once I was locked into mathematics I went to graduate school in mathematics.

Theory, practice, and progress

Geselowitz:

Is there anything else you would like to say about where you think the field is or where it’s going to wrap up?

Rosenfeld:

That paper goes through the ’70s, and then it’s got some concluding remarks, and I’ve actually already made them. It says something like: there has been a lot of beautiful theory developed, but what is actually working in practice is very simple, possibly even brute-force, stuff. On the other hand, I’m probably going to add another sentence after that saying, “But what is going to be done 20 years from now is quite likely going to be some sort of reflection of the fancier stuff that the field has been doing.” So what it amounts to is that there is a time delay. There are certain things you can afford to play with in a university because they don’t have to be practical and they don’t have to be real time. You play with them, you get some ideas, and then you wait for the next couple of orders of magnitude of increasing computer power, so someone can actually put them out there and get them working on real images, on real data, in the real world. That’s very much the history of the field. Hindsight is easy. I forget what year this was; it’s buried in my resume somewhere. The AI vision people decided that vision is a very hard problem, and that they should get some people together for brainstorming. So they held some computer vision workshops in a beach house near Monterey. I was invited to the second one, and they asked me to be discussion leader on the topic of low level vision. When I got the floor, I said, “I’m a little worried that in calling it low level vision you’re making a certain value judgment; I think I would rather call it front end vision, unlike what you guys are doing, which is hindsight.”

Geselowitz:

It probably was a popular remark.

Rosenfeld:


Audio File
MP3 Audio
(344 - rosenfeld - clip 5.mp3)


It went over very well. But it’s also true. The AI community has always talked about low level vision, intermediate level vision, and high level vision. Low level vision is what we here call image analysis. High level vision is where you’ve already got some sort of symbolic representation so that you can reason about it; in some sense no one ever gets that far. So intermediate level vision grew and grew; all the recovery stuff and so on got to be called intermediate level vision because it’s clearly not high level. We’re still dealing with numeric data, yet it’s far above the two-dimensional image, so they called it intermediate level vision. In those days, they were complaining about the “neck of the hourglass” phenomenon. You have all these bits of data, all these bytes in the two-dimensional signals, and in the other half of the hourglass you have combinatorial search problems and trying to figure out what is going on from a symbolic manipulation point of view. So you’re going to run into all sorts of search problems in trying to understand the scene contents. But the neck of the hourglass is that somehow you have to take the vast amounts of numeric data and map them into a symbolic representation, after which you can try to make sense out of this representation. And the trouble has always been that this begs the question, because if you can correctly map it, you’ve already solved the problem. Segmentation has to somehow be meaningful in terms of the scene. We’re trying, and that’s very hard. Everyone has always said, how about top down: given that you have some idea of what’s out there, can’t you use that knowledge to control the image analysis processes? Some of the early work at MIT was designed to do just that sort of thing. If you know that what you’re analyzing is a blocks-world image, then you don’t just go looking for edges in the image; you look for edges that connect up into line drawings that could be perspective projections of polyhedral objects. You can do this because you have a lot of knowledge of what constrains what the images could look like. So there was early work by various people; some names come to mind: Shirai, who is now at Osaka University, and Griffith (I don’t know whatever happened to him, but he was one of the early MIT Ph.D.'s). A number of people looked seriously at how to use knowledge about the blocks world to control the analysis of the image. What you really want, and what nobody has produced, is a front end that finds chunks of possibly meaningful data in the image and immediately transfers these chunks somewhere where they trigger hypotheses about what might be in the scene. That sort of thing has not yet been done.

I wrote a paper ten years ago called "Recognizing Unexpected Objects." The point of the paper was something like this. Suppose I showed you a slide show in which the slides had no connection with one another whatsoever; they were just slides that I’ve collected in my career. Say the first slide is the Eiffel Tower and the next slide is an octopus. The point is they have very little to do with one another. It wouldn’t take you more than two seconds to recognize the Eiffel Tower and three seconds to recognize the octopus. You have thousands of models stored in your head, including familiar objects like the Eiffel Tower and familiar classes of objects like octopi. Somehow or another you managed to fish in and you thought of the word octopus three seconds after the light hit your eye. This has been called the hundred cycle challenge: how many neural events take place from the time your retinal receptor cells are activated until you think of the word octopus? If this only takes two or three seconds, there is only time for something like a hundred successive neuron firings. Somehow, waves of information are being triggered in your head, ending up with the activation of the word octopus over on the vocabulary side. That’s truly amazing. That is the kind of performance that no one has come close to attempting to implement: how do we map from the pictures to the names?

A fancier way of saying it is, how do you extract fragments from the images, which are distinctive enough that they trigger object hypotheses, but not all possible object hypotheses. We need to know why octopus tentacles are not easily confused with doorknobs. You don’t think of cats and dogs when you see an octopus, you think of an octopus. What sorts of primitives are sufficient for indexing into your database of object models? Children acquire the ability to recognize new objects almost every hour. When a child learns to talk, the child has a limited vocabulary. You show him a picture. Maybe he can tell whether it’s a pussy cat or a doggy. By the time the kid is 12 years old he knows the names of on the order of 10,000 different objects. And how do I know that? You can count the entries in a picture dictionary, where there are thousands of different pictures, all of which have different names, and a reasonably educated eighth grader can recognize them correctly. How do we learn to tell octopi from typewriters? We evidently learn objects at a tremendous rate. It appears that we are pretty good at doing this at certain ages. Probably if you had a blind adult whose sight was restored at age 25, he would have a lot of trouble learning to recognize 10,000 objects. So the brain is doing it in a very developmental way. But we in the computer vision business have never dreamt of implementing anything as powerful as that. The computer vision systems of the next century are still faced with this kind of challenge. Have I talked enough?

Geselowitz:

Yes, I think that’s great. Thank you very much.