riddle me this, Batman

So now we have a very easy way to understand the differential geometry.

For that, we need to understand the difference between the way computers do things and the way the brain does things.

First of all, we need to establish that a vector is a geometric invariant. It looks the same in any coordinate system; it's just that the coefficients change depending on our choice of basis vectors.

For example - you're used to the Cartesian coordinate system, where the basis vectors are the same everywhere in space. Let's call the basis vectors E. In Cartesian coordinates then, E(x) is (1,0) and E(y) is (0,1), that kind of thing.

But in polar coordinates, the basis vectors change at every point in space. We can see this from the transformation formulas x = r cos theta and y = r sin theta, where r and theta are the polar coordinates, each with its own basis vector.
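Here's a minimal NumPy sketch of that point (my own toy code, nothing brain-specific): the Cartesian basis is the same everywhere, while the polar basis vectors rotate and stretch as you move around.

```python
import numpy as np

def polar_basis(r, theta):
    """Polar basis vectors written in Cartesian components.
    They are the columns of the Jacobian of x = r cos(theta), y = r sin(theta)."""
    e_r = np.array([np.cos(theta), np.sin(theta)])               # d(x, y)/dr
    e_theta = np.array([-r * np.sin(theta), r * np.cos(theta)])  # d(x, y)/dtheta
    return e_r, e_theta

# The Cartesian basis is (1,0) and (0,1) at every point; the polar basis is not:
print(polar_basis(1.0, 0.0))        # e_r = (1, 0),  e_theta = (0, 1)
print(polar_basis(2.0, np.pi / 2))  # e_r = (0, 1),  e_theta = (-2, 0)
```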

To determine the distance between two points on a curved surface, we need the metric tensor. Which is why we need differential geometry. In differential geometry, the basis vectors become the derivative operators, in the direction of the coordinates. This is why we use the tangent plane TpM at a point p, because in a small neighborhood around p the surface is "approximately flat", which means we can use calculus to obtain the tangent vectors in any given direction.

So in this case, if we have Cartesian coordinates and we want to translate them to polar coordinates, we use the Jacobian matrix (actually its transpose, but I'll just call it the Jacobian to keep it simple).
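For the curious, here's a minimal numerical sketch of that Jacobian (assuming NumPy; whether you call this matrix or its transpose "the Jacobian" is just the naming convention above):

```python
import numpy as np

def jacobian_polar_to_cartesian(r, theta):
    """J = d(x, y)/d(r, theta) for x = r cos(theta), y = r sin(theta)."""
    return np.array([[np.cos(theta), -r * np.sin(theta)],
                     [np.sin(theta),  r * np.cos(theta)]])

J = jacobian_polar_to_cartesian(2.0, np.pi / 3)
J_inv = np.linalg.inv(J)   # the inverse Jacobian handles the opposite direction
print(J @ J_inv)           # identity matrix, up to floating-point error
```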

And similarly, if we start with polar coordinates and want to convert to Cartesian, we can use the inverse Jacobian.

To get the metric tensor though, we need the dot product operation, and we can do that in one of two ways. If we know the lengths and angles, we can calculate the dot product from the norms and the cosine of the angle between the vectors. Or, we can use the metric tensor to calculate the dot product using matrix multiplication, according to the formula g(v,w), where g is the metric tensor and v and w are any two vectors. The metric itself transforms as g_polar = J^T g_Cartesian J, where g_Cartesian is the metric in Cartesian coordinates (the identity matrix), g_polar is the metric in polar coordinates, and J is the Jacobian matrix that converts between them.
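A quick sketch of both routes to the dot product, assuming NumPy (the vectors here are arbitrary, picked just for illustration):

```python
import numpy as np

def jacobian(r, theta):
    return np.array([[np.cos(theta), -r * np.sin(theta)],
                     [np.sin(theta),  r * np.cos(theta)]])

r, theta = 2.0, np.pi / 4
J = jacobian(r, theta)

# The metric in polar coordinates: g_polar = J^T g_Cartesian J, with g_Cartesian = I
g_polar = J.T @ np.eye(2) @ J
print(g_polar)                       # approximately [[1, 0], [0, r**2]]

# Two vectors given by their polar components (dr, dtheta)
v = np.array([1.0, 0.2])
w = np.array([0.5, -0.1])

# Way 1: dot product via the metric tensor, g(v, w) = v^T g w
dot_metric = v @ g_polar @ w

# Way 2: lengths and angles -- convert to Cartesian components, use norms and cosine
v_c, w_c = J @ v, J @ w
cos_angle = (v_c @ w_c) / (np.linalg.norm(v_c) * np.linalg.norm(w_c))
dot_angle = np.linalg.norm(v_c) * np.linalg.norm(w_c) * cos_angle

print(np.isclose(dot_metric, dot_angle))   # True
```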

So for example - in the retinotopic mapping from the thalamus to the first area of the visual cortex V1, we have a complex log spatial mapping that converts polar coordinates to Cartesian coordinates, as mentioned earlier. This means we lose our polar coordinates, and if we need them again we have to recalculate them. It also means that feedback from the cortex to the thalamus has to perform the inverse mapping if we want it to remain retinotopic.
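To make the complex log mapping concrete, here's a minimal sketch in the spirit of the standard w = log(z + a) model (the constant a is my modelling assumption here, included only to avoid the singularity at the fovea):

```python
import numpy as np

def complex_log_map(x, y, a=1.0):
    """Map a retinal point (x, y) to model 'cortical' coordinates via w = log(z + a)."""
    z = x + 1j * y
    w = np.log(z + a)
    return w.real, w.imag   # ~log(eccentricity) on one axis, polar angle on the other

# Changing eccentricity moves you along one cortical axis, changing angle along the other:
print(complex_log_map(1.0, 0.0))
print(complex_log_map(10.0, 0.0))
print(complex_log_map(0.0, 10.0))
```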

It is certainly easier to calculate dot products in Cartesian coordinates, where the metric tensor is just the identity matrix everywhere in space - which is why the brain does it that way. The purpose of the dot products is to compute the projections of surface vectors into the coordinate system. We need this to calculate binocular disparity, and for subsequent calculation of surface boundaries in 3 dimensions.

But here's the twist: the complex log mapping doesn't change the motion information. It remains encoded in polar coordinates from the retina. For that reason, the orientation columns are not simple alignment vectors; they also process spatial frequency, which plays the role of the co-vectors. From a tensor algebra point of view, if you have the vectors and co-vectors you can generate any tensor, which includes any linear map.
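That last claim is just linear algebra, and easy to check numerically: any linear map can be rebuilt as a sum of outer products of basis vectors with basis co-vectors (a minimal NumPy sketch, nothing brain-specific):

```python
import numpy as np

# An arbitrary 2x2 linear map (a rank-2 tensor with one vector and one co-vector slot)
A = np.array([[2.0, 1.0],
              [0.5, 3.0]])

basis_vectors   = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]  # e_i
basis_covectors = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]  # e^j (dual basis)

# Rebuild A as sum_ij A[i, j] * (e_i outer e^j)
A_rebuilt = sum(A[i, j] * np.outer(basis_vectors[i], basis_covectors[j])
                for i in range(2) for j in range(2))

print(np.allclose(A, A_rebuilt))   # True
```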

So you can see why things are the way they are. If we want to calculate the Jacobian and inverse Jacobian (which we need because they're our forward and backward transforms between coordinate systems), we need the projections of arbitrary surface vectors, which means we need the sines and cosines. To translate these back to polar coordinates (to extract and map the motion information) we need the metric tensor. The reason we can't just scrape these from the retina is that we need the integrated "cyclopean" view in 3 dimensions, and getting that "directly" in polar coordinates would be computationally difficult (and time consuming).

So the brain does the clever thing: it first maps to Cartesian coordinates (using a hardware mapping which is computationally "free"), where calculations are much easier because the basis vectors are spatially consistent. Then it extracts the vectors and co-vectors using edge detectors and spatial frequency detectors. Then it uses those to build the surfaces by aligning the input from the two eyes (which can now be mapped by co-moving edges and spatial frequencies). Finally it assigns motion to the surfaces, the information for which has been multiplexed into the communication channels all the way from the retina, through every stage of processing.

This whole chain of computation is very quick, because it only uses matrix math. Everything that's computationally expensive is done in hardware, including change of coordinates, determination of angles and distances, and extraction of vectors and co-vectors. All of these things are done in a single alpha cycle, by V1 and V2 using TTFS (discussed earlier). A second alpha cycle is then required to calculate surfaces ("objects") from binocular disparity. A third alpha cycle is only needed for object recognition - so it makes perfect sense that a P300 should occur at the third alpha cycle (and not before) when the object information is nonsensical or surprising.

What is missing from this description is the role of synchronization - or more specifically its inverse, desynchronization. The short story is we just don't know. So far it looks like it has something to do with attention, and something to do with memory. We do know that visual hot spots (important stimuli that require attention) can drive the cortex into criticality (extreme desynchronization). And, the amount and precision of information being processed in such a state is 100x greater than normal. No one knows what this means, yet.

But the rest of it is becoming cut and dried. Research on visual processing by neurons started in 1959, so it's taken 65 years to get this far. Thousands of rats, cats, and monkeys had to give their lives to make it happen. Now we can do it with machines, on sub-nanosecond time scales. The next frontier is photonic computing using micro-ring resonators which requires practically no energy, and when combined with quantum memristors the memory can be made permanent at zero processing cost.
Sure.
 
Are we supposed to proofread your first draft of________?

You're supposed to read and digest and learn. This is the science forum, ain't it?

It's not "my" science, it's SCIENCE.

Let's try something different. I will give you links to the science.

Let's talk about why we have micro-saccadic eye movements. In a nutshell we have them because of rapid retinal adaptation.

But this has a profound influence on the subsequent architecture. Time to first spike encodes luminance, whereas everything else is encoded later during bursting.
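As a toy illustration of time-to-first-spike coding (my own simple integrate-to-threshold model, not a claim about the actual retinal circuit), brighter inputs cross threshold sooner, so the latency itself carries the luminance:

```python
import numpy as np

def ttfs_latency(luminance, tau=0.01, threshold=0.2):
    """Toy TTFS encoder: a constant input of the given luminance (0..1] charges a
    simple integrator that fires on crossing threshold.  Brighter -> earlier spike."""
    luminance = np.clip(luminance, 1e-6, 1.0)
    return tau * threshold / luminance   # latency in seconds for this linear model

print(ttfs_latency(1.0))    # bright pixel: short latency
print(ttfs_latency(0.25))   # dim pixel: 4x the latency
```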

After we fixate an image (after an eye movement), retinal adaptation occurs very quickly - in two phases, one with a time constant of 0.1 seconds and another with a time constant of 10 seconds.


Micro-saccades occur 1 to 3 times a second with a duration of 30 msec "or so".


So basically, as soon as the rapid phase of retinal adaptation dulls the contrast of a visual image, a micro-saccade occurs to bring it back again.
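Putting the numbers above together in a rough simulation (toy parameters: a 0.1 s and a 10 s adaptation component, a micro-saccade roughly every 0.3 s), you can see why the contrast signal never has time to fade:

```python
import numpy as np

dt = 0.001                       # 1 ms steps
t = np.arange(0.0, 2.0, dt)
tau_fast, tau_slow = 0.1, 10.0   # the two adaptation time constants quoted above
saccade_interval = 0.3           # ~1-3 micro-saccades per second

response = np.zeros_like(t)
since_saccade = 0.0
for i in range(len(t)):
    if since_saccade >= saccade_interval:
        since_saccade = 0.0      # micro-saccade: fresh contrast on the receptors
    # contrast response decays with a fast and a slow component after each fixation
    response[i] = (0.7 * np.exp(-since_saccade / tau_fast)
                   + 0.3 * np.exp(-since_saccade / tau_slow))
    since_saccade += dt

print(response.min())            # never decays far before the next micro-saccade resets it
```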

Micro-saccades are associated with bursting in visual cortex V1.


The size of a human micro-saccade is in the range of 0.03 - 2 degrees; typically it is less than 15 minutes of arc.


This means the saccade traverses several retinal cells in the fovea, but less than an entire receptive field in the periphery.

Bursting in V1 occurs preferentially (7 times more likely) when a visual stimulus turns on during a micro-saccade, rather than when the stimulus is already present.


So, we would like to know specifically what information is being processed during micro-saccades, relative to ordinary major eye movements. In context,


Based on previous posts, you will realize that the 100-200 msec window cited in this link is perceptual - the neural signal from the retina to V1 arrives much more quickly, in the 40-80 msec range.

To understand what the micro-saccades are doing, reference this simple model of the V1 visual system:


Note the very (trivially) simple extraction of sin and cos from the visual signal, and note especially the construction of the arctan which happens to be one of the key conversion factors for regenerating polar coordinates from Cartesian coordinates.
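In code, that reconstruction is one line per coordinate (a trivial sketch, just to pin down what the arctan is doing):

```python
import numpy as np

def cartesian_to_polar(x, y):
    """Regenerate polar coordinates from Cartesian ones: r from the norm, theta from arctan."""
    r = np.hypot(x, y)
    theta = np.arctan2(y, x)   # arctan(y / x) with the quadrant handled correctly
    return r, theta

print(cartesian_to_polar(1.0, 1.0))   # (sqrt(2), pi/4)
```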

I wish to point out to you that NO MACHINE FUNCTIONS THIS WAY. There is absolutely no equivalent for this in any kind of machine learning so far - not in artificial neural networks, not in AI, and not in robotics.

There does not exist an algorithm today, which can extract a metric tensor field in 200 msec or less, from a pair of cameras. But the brain can do it, it does it 10 times a second, and from scratch after every eye movement.

HOW it does it, is the point of this thread.

It does it using an algebraic form of differential geometry, where the calculations only occur on the most essential time varying information, and the rest of the mundane computations are performed in hardware.

Think of this task: you are shown a map, on a piece of paper. A location is identified, where a reward will be found. (X marks the spot). Now you are placed in the actual maze that was shown on the map, and you are told to go recover the reward.

To do this, you must extract the 3 dimensional structure of the maze from its 2 dimensional representation on the piece of paper. Then you must translate from the allocentric coordinates shown on the map, to egocentric position when you're actually navigating the maze. And, you must take into account body position and head angle whenever you're referencing any landmark shown on the map.
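The allocentric-to-egocentric step in that task is, at its core, a translation plus a rotation by the negative of your heading. Here's a minimal sketch under the assumption that body position and heading are already known (which is exactly what the brain has to estimate):

```python
import numpy as np

def allocentric_to_egocentric(landmark_xy, body_xy, heading_rad):
    """Express a map (allocentric) landmark in body-centred (egocentric) coordinates:
    translate so the body is at the origin, then rotate by -heading."""
    dx, dy = np.asarray(landmark_xy) - np.asarray(body_xy)
    c, s = np.cos(-heading_rad), np.sin(-heading_rad)
    return np.array([c * dx - s * dy, s * dx + c * dy])

# A landmark 3 m north of a body facing north ends up straight ahead (+x in this convention):
print(allocentric_to_egocentric([0.0, 3.0], [0.0, 0.0], np.pi / 2))   # ~[3, 0]
```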

Humans can do this very easily. Waymo can not. Waymo's allocentric coordinates come from a GPS, not from a metric tensor. Same for the self-driving cars. Ever listen to the Uber instructions? "In a quarter mile turn right on Main St". They're using your GPS coordinates to determine where you are, and matching them against the GPS coordinates of your destination to determine the shortest route. They don't know or care if you have to climb a hill to get there, or drive through the middle of a lake. Same for Waymo, it doesn't do stairs.

The human processing ability is uniquely impressive. Still. You can memorize a route after traveling it ONCE. No computer can do that. Artificial neural networks require thousands of training cycles to memorize a route, a face, or an equation. Even a "physics-informed network" will not give you a metric tensor; it'll only tell you if you're doing proper math.
 
So hopefully you're getting a very clear picture of what successive retinal frames actually do.

Surfaces are natively 2d entities living in 3d space. The third dimension is reconstructed from binocular disparity. In doing so, the visual cortex generates a complete description based on differential geometry. This includes tangent and normal vectors used in subsequent matrix math with the metric tensor (or equivalently, its inverse).
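As a concrete (and deliberately simplified) sketch of what that description contains: given a depth map z(x, y) recovered from disparity, the tangent vectors, unit normals, and the metric (the first fundamental form) all fall out of simple derivatives. Assuming NumPy:

```python
import numpy as np

def surface_geometry(z, dx=1.0, dy=1.0):
    """From a depth map z(x, y): tangent vectors, unit normals, and the metric g = J^T J
    (the first fundamental form) at every pixel."""
    zy, zx = np.gradient(z, dy, dx)                 # partial derivatives of depth
    t_x = np.stack([np.ones_like(z), np.zeros_like(z), zx], axis=-1)   # d(surface)/dx
    t_y = np.stack([np.zeros_like(z), np.ones_like(z), zy], axis=-1)   # d(surface)/dy
    n = np.cross(t_x, t_y)
    n /= np.linalg.norm(n, axis=-1, keepdims=True)  # unit normal field
    g = np.stack([np.stack([np.sum(t_x * t_x, -1), np.sum(t_x * t_y, -1)], -1),
                  np.stack([np.sum(t_x * t_y, -1), np.sum(t_y * t_y, -1)], -1)], -2)
    return t_x, t_y, n, g

z = np.fromfunction(lambda i, j: 0.1 * i + 0.05 * j, (32, 32))   # a tilted plane
_, _, n, g = surface_geometry(z)
print(n[16, 16], g[16, 16])   # constant normal and metric, as expected for a plane
```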

The rapid micro-saccadic activity generates successive frames that are in slightly different positions. They're not the same, the positions vary by a few minutes of arc. I will call this activity "micro-scanning".

What good is micro-scanning? The answer takes us right back to the convolutional neural network architecture. Micro-scanning forces the convolutional layers to extract the invariants. This is how surface motion is distinguished from retinal motion. The invariants can be related to a Mobius transformation based on four points of reference. This is why it helps to have eye position "known", which is why there are collaterals from the eye movement system into the visual system.
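The "four points of reference" remark lines up with the classical fact that the cross-ratio of four points is the invariant of a Möbius transformation. A minimal numerical check (arbitrary points and coefficients, chosen only for illustration):

```python
import numpy as np

def mobius(z, a, b, c, d):
    """Mobius transformation w = (a z + b) / (c z + d), with ad - bc != 0."""
    return (a * z + b) / (c * z + d)

def cross_ratio(z1, z2, z3, z4):
    return ((z1 - z3) * (z2 - z4)) / ((z1 - z4) * (z2 - z3))

pts = np.array([0.1 + 0.2j, 1.0 - 0.5j, -0.3 + 0.8j, 0.7 + 0.7j])   # four reference points
a, b, c, d = 2.0 + 1.0j, 0.5, 0.3 - 0.2j, 1.0                       # arbitrary coefficients

before = cross_ratio(*pts)
after = cross_ratio(*mobius(pts, a, b, c, d))
print(np.isclose(before, after))   # True: the cross-ratio is the invariant
```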

Micro-scanning occurs within the larger context of saccadic eye movements. After a saccade the visual system requires a reset, whereas this is not the case with micro-saccades. With the first visual frame after an ordinary saccade, the visual system generates an expectation matrix, which is carried back to the thalamus (LGN) via the projection neurons in layer 6 of V1. In cats, 100% of L6 projection neurons receive direct input from LGN. In monkeys and humans, only a small distinct subpopulation (50%) receives direct input. This subpopulation is very fast, considerably faster than the rest of the L6 output. It causes LGN neurons to acquire tuning properties similar to V1 (orientation, spatial frequency, etc). This situation only obtains on SECOND AND SUBSEQUENT micro-saccadic frames, specifically not on the first. Thus, the second and subsequent frames process more than simple luminance, whereas the first frame is mainly about contrast. The first frame "primes" the convolutional network for subsequent expectation.

And note that the complex log projection mapping from LGN polar coordinates to V1 Cartesian coordinates is automatically reciprocated by the retinotopic feedback path.


We can surmise with confidence that the pyramidal cells in layer 5 of visual cortex V1 behave the same way as the pyramidal cells in the hippocampus. That is: TTFS generates an initial spike, which travels in both directions, orthodromically down the axon and antidromically backwards into the dendritic tree, where it temporarily inhibits dendritic spiking.

Upon reactivation the pyramids burst, based on dendritic spiking and any additional inputs. After the burst there is another phase of quiet based on lateral inhibition. All of this occurs in the rising phase of the alpha cycle, which means layer 5 pyramids can phase-encode an additional parameter of the visual signal. We don't know what that is yet. This is as far as we've gotten with research so far.
 
Let's consider for a moment the development of the human visual system.

Unlike deep learning machines, the visual system has no trainer, no supervisor, and no "desired output" for comparison with the input. It is a self organizing map. It organizes itself in two ways: before birth, it uses spontaneous rhythmic and non-rhythmic activity by neurons. After birth, it uses the real time visual input.

Evidence shows that retinotopy is fully developed before birth, as is a rudimentary form of orientation sensitivity in V1. Since there is no visual input before birth, this can only come from chemical markers and/or spontaneous neural activity.

There are multiple chemical markers in the pathway between the retina and V1. Some of the same markers used to organize cells in the retina are fed forward through the entire pathway. A simple example is the predominance of Y ganglion cells in the periphery, and X ganglion cells around the fovea.

And, before birth, there are multiple types of spontaneous neural activity in the retina. Individual neuron firing is stochastic and mostly uncorrelated, however there is also correlated wave-like activity that travels from one side of the retina to the other (and depends on the neurotransmitter acetylcholine).

As mentioned, the visual pathways are locally convolutional - that is, not "all to all". However, the basics of neural network self-organization are detailed in this introductory summary:



During development, learning ("plasticity") is turned on and off at well-defined times. It turns out that the spontaneous activity in the optic nerve is sufficient to create orientation-selective patches in V1. Retinal amacrine cells are crucial for the generation and maintenance of three types of spontaneous rhythmic patterns: concentric and circular like a drum head, and side to side from the outer edge towards the nose, as if the drum were being tapped at its outer edge.

A key player in the development of orientation specificity is lateral inhibition, which is non-directional. Lateral inhibition determines the width of the orientation response.
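Here's a deliberately simplified sketch of that last point (a toy model, not the actual developmental circuitry): start with a broad feed-forward orientation response and let each column inhibit its neighbours with a Gaussian profile. The strength and reach of the inhibition set the final tuning width.

```python
import numpy as np

def sharpen_with_lateral_inhibition(response, inhibition_strength=0.8, sigma_inhib=30.0):
    """Toy model: each orientation column inhibits its neighbours with a Gaussian profile.
    Stronger/wider inhibition -> narrower final orientation tuning."""
    angles = np.arange(-90, 90)                   # preferred orientations (degrees)
    diff = angles[:, None] - angles[None, :]
    diff = (diff + 90) % 180 - 90                 # wrap orientation differences
    kernel = np.exp(-diff**2 / (2 * sigma_inhib**2))
    kernel /= kernel.sum(axis=1, keepdims=True)
    inhibited = response - inhibition_strength * (kernel @ response)
    return np.clip(inhibited, 0, None)

angles = np.arange(-90, 90)
broad = np.exp(-angles**2 / (2 * 40.0**2))        # broad feed-forward tuning around 0 degrees

def half_width(r):
    return np.sum(r > 0.5 * r.max()) / 2.0        # crude half-width at half height, in degrees

print(half_width(broad))                                    # wide
print(half_width(sharpen_with_lateral_inhibition(broad)))   # narrower
```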

After birth, stereopsis and binocularity depend on visual input, as per the famous (infamous?) Held-Hein experiments. It turns out this is not due to V1 at all; rather, it is due to V2 as previously described. Things must be this way, because the nuances of eye position are specific to individual organisms. (Whereas orientation selectivity is merely a grid superimposed on the visual input - eye position is only needed to align the two grids). However, binocular disparity is also a self-organizing process - but instead of depending on internal rhythms it depends on visual input (which consists mostly of edges and surfaces).

You can calculate approximately how many frames are needed for the learning of binocular alignment. At 10 frames a second for 3 months, it is approximately 10*60*60*24*90, which comes to roughly 78 million frames.

As detailed in the cite, noise (micro-saccades) is helpful for the learning process. Once learning is complete, plasticity is turned off. A little of it remains, but there is no further axon death or dendrite pruning.

Here are some details about human visual development, and some references:

 