[toc]
Problem
Many predictive domains involve weak, noisy, and highly entangled signals, where the relationship between input features $x_i$ and outcomes is difficult to formalize. Standardized validation procedures are often lacking, making systematic hypothesis testing and knowledge accumulation challenging.
Astrology provides a representative example of such a domain. Historically practiced in scholarly contexts, including 16th–17th century European universities, it combined observation, computation, and structured predictive methods. Today, however, the field largely relies on heuristic reasoning, and few methods exist to empirically evaluate combinations of features $x_i$. Existing studies [1-2] focus on isolated correlations, leaving the broader problem of predictive performance unaddressed.
This paper proposes a machine learning framework for systematic hypothesis testing in weak-signal domains. By applying this framework to astrological data, we show how traditionally qualitative systems can be reformulated in terms of measurable inputs and probabilistic models, enabling a transition from interpretative practice to empirical analysis. In particular, the accumulated techniques of Renaissance predictive astrology [3] provide a rich set of features $x_i$ and procedural frameworks, which serve as an ideal case study for demonstrating the methodology.
Over many years of observation, these techniques—encompassing Primary directions (predicting the year of an event), Revolution charts (narrowing down the month and details), and Transits (pinpointing the day)—form a structured predictive system. In this article, we formalize the mathematical apparatus required to build computer models capable of making predictions, deriving all formulas from first principles. The result is a ready-to-use algorithm for training an AI system.
Methodology
Event Prediction and the Bernoulli Formula
Let us start with the simplest case. A predictive technique forecasts whether a clearly identifiable event will occur within a given time period (usually a day or a month) — for example, the birth of a child, a legal marriage, or emigration to another country.
The expected event has two possible outcomes: it either occurs (label $y = 1$) or it does not (event label $y = 0$).
Suppose the probability that the event occurs in the specified period is $p.$ Then the probability that it does not occur in that same period is $1 - p.$
We can express both outcomes in a single formula:
$$ P(y) = p^y \cdot (1-p)^{1-y} $$
We read this as: the formula gives the probability of occurrence (label $y=1$) or non-occurrence (label $y=0$) of a specific event.
Indeed, if the event occurs, the probability of that outcome is $P(y = 1) = p^1 \cdot (1-p)^{0} = p,$ and if the event does not occur, the probability is $P(y=0) = p^0 \cdot (1-p)^{1 - 0} = 1 - p.$
This is the Bernoulli distribution — a probabilistic model for a binary outcome.
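As a quick sanity check, the formula can be evaluated directly (a minimal sketch; the function name is ours):

```python
def bernoulli_pmf(y: int, p: float) -> float:
    """P(y) = p^y * (1 - p)^(1 - y) for a binary outcome y in {0, 1}."""
    return p ** y * (1 - p) ** (1 - y)

print(bernoulli_pmf(1, 0.3))  # 0.3  (the event occurs with probability p)
print(bernoulli_pmf(0, 0.3))  # 0.7  (it does not, with probability 1 - p)
```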
Conditional Probability
Now suppose the observed event depends on a number of astronomical factors. For example, the probability of a violent death on a given day increases the closer Mars is to an angle of $x=90°$ with the degree of the zodiacal circle that was rising at the moment of a person's birth.
We can then write the probability of violent death as $P(y = 1 \mid x)$. Here $P$ denotes probability, the label $y=1$ denotes the positive outcome of the event (in our example — a violent death in a given period), and $\mid x$ is read as "given that we know the parameter $x,$ which influences the outcome." In our example, $x$ is the angular distance between Mars and the degree of the ecliptic rising at birth.
Let us denote the conditional probability of the positive outcome as $\hat{p} = P(y = 1 \mid x)$. The Bernoulli formula then becomes:
$$ P(y \mid x) = \hat{p}^y \cdot (1 - \hat{p})^{1-y} $$
Usually an event depends not on one, but on several astronomical factors or features. For example, a violent death might depend on $x_1$ — the angular distance between Mars and the rising degree of the ecliptic at birth, on $x_2$ — the sector of the sky (the so-called house) occupied by the so-called ruler of the ascendant, and so on.
We thus obtain a set of astronomical features $\{x_1, x_2, \ldots, x_d\}$ on which the probability $P$ of the future event presumably depends.
It is important to understand that the feature set $\{x_1, x_2, \ldots, x_d\}$ is simply a collection of numbers. These numbers are often normalized to lie in the range from 0 to 1. For example:
- The ecliptic angular distance between planets, known as an aspect, can range from 0° to 180°. So an aspect is expressed as a fraction of 180°. For example, a square aspect (90°) equals 0.5, and we write $x_1 = 0.5$.
- A planet's celestial condition in astrology can range from −5 (detriment) to +5. To normalize, we add 5 to this value and express the result as a fraction of 10 (scale from −5 to +5). For example, a planet in exaltation (corresponding to 4 points on the −5 to +5 scale) has astrological feature $x_2 = 0.9$.
- A planet's position in a sector of the sky (house) can be expressed as the house number divided by the total number of sectors.
- And so on. Each feature $x_i$ is simply a number between 0 and 1.
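The normalization rules above can be sketched as small helper functions (the function names are ours, and the house count of 12 is an assumption):

```python
def normalize_aspect(angle_deg: float) -> float:
    """Ecliptic angular distance (0°..180°) as a fraction of 180°."""
    return angle_deg / 180.0

def normalize_condition(score: float) -> float:
    """Celestial condition on the -5..+5 scale, shifted and scaled to 0..1."""
    return (score + 5) / 10.0

def normalize_house(house: int, n_houses: int = 12) -> float:
    """House number as a fraction of the total number of sectors."""
    return house / n_houses

# A square aspect (90°), a planet in exaltation (+4), a planet in the 9th house:
x = [normalize_aspect(90), normalize_condition(4), normalize_house(9)]
print(x)  # [0.5, 0.9, 0.75]
```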
We can represent the feature set $x_i$ as a vector with coordinates:
$$ \vec{\mathbf{x}} = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_d \end{pmatrix} $$
The general Bernoulli formula then takes the form:
$$ P(y \mid \vec{\mathbf{x}}) = \hat{p}^y \cdot (1 - \hat{p})^{1-y} $$
which is read as the probability of occurrence ($y=1$) or non-occurrence ($y=0$) of an event given the astrological features $\vec{\mathbf{x}}.$
Probability of a Single Prediction
For a computer to make a successful prediction, it must be able to compute the function $\hat{p}^\text{pred} = f(\vec{\mathbf{x}})$, which returns a number between 0 and 1 (the probability of a single event occurring):
$$ \hat{p}^\text{pred} = f(\vec{\mathbf{x}}) \in (0, 1) $$
Here the notation $\in (0,1)$ means that the result of the function lies in the interval from 0 to 1.
Computing this function reduces to two steps:
- Transform the collection of astrological features $\vec{\mathbf{x}}$ into a single real number $z$.
- Express $f(z)$ as a simple mathematical construction that maps any number $z \in (-\infty, +\infty)$ to a value (a probability) $\in (0, 1)$.
Step 1. Affine Transformation
To transform the vector $\vec{\mathbf{x}}$ into a single real number $z$, we take a weighted sum of the values $x_i$ (i.e., $w_1x_1 + w_2x_2 + \ldots$) and add an extra number $b.$
We can write this as:
$$ z = w_1x_1 + w_2x_2 + \ldots + w_dx_d + b $$
This type of transformation of the vector $\vec{\mathbf{x}}$ into a number $z$ is called an affine transformation. The values $\{w_1, w_2, \ldots\}$ are called weights, and the parameter $b$ is called the bias.
If the weights $\{w_1, w_2, \ldots\}$ are viewed as coordinates of a vector $\vec{\mathbf{w}},$ the transformation can be written in vector form (as the dot product of two vectors):
$$ z = \vec{\mathbf{w}}^\text{T} \vec{\mathbf{x}} + b $$
Here the letter T denotes the transposed vector: the column of values $w_i$ rotated 90° to form a row.
Upper and lower indices
For future convenience, we will use lower and upper indices. Specifically, quantities denoted by $a_i$ (lower index) represent the coordinates of a vector arranged vertically (a column vector), while $a^i$ (upper index) denote numbers arranged horizontally (referred to as a transposed vector, dual vector, or row vector).
Then the product $\vec{\mathbf{w}}^{\text{T}} \vec{\mathbf{x}} = \sum_{i=1}^{d} w^i x_i$. In this notation, our affine transformation takes the following form:
$$ z = \sum_{i=1}^d w^i x_i + b $$
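In NumPy the affine transformation is a one-liner; the numbers below are purely illustrative:

```python
import numpy as np

x = np.array([0.5, 0.9, 0.75])   # astrological features
w = np.array([1.2, -0.4, 0.8])   # weights
b = 0.1                          # bias

z = w @ x + b                    # w^T x + b: a single real number
print(round(float(z), 6))        # 0.94
```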
Step 2. The Sigmoid
Recall that our goal is to express the prediction $\hat{p}^\text{pred} = f(\vec{\mathbf{x}}) \in (0, 1).$
In the previous step we expressed the collection of features $x_i$ as a real number $z = \sum_i w^i x_i + b.$ Now we need to construct a simple mathematical function that maps any number $z \in (-\infty, +\infty)$ to a value in the interval $(0, 1)$ — i.e., to a probability.
A linear function $f(z) = z$ will not work — it can be negative or greater than 1. We need to "squeeze" the output into the interval $(0, 1)$.
Let us build such a function. Consider the odds of the event occurring versus not occurring:
$$ O(\hat{p}) = \frac{\hat{p}}{1 - \hat{p}} $$
For any $\hat{p}\in(0,1),$ the function $O(\hat{p})$ lies in the interval from 0 (as $\hat{p} \to 0$) to $+\infty$ (as $\hat{p} \to 1$).
Taking the natural logarithm of this function "stretches" it from $-\infty$ to $+\infty:$
$$ z = \ln \frac{\hat{p}}{1 - \hat{p}} $$
Visually, for any value of $z$ the graph is bounded along the $\hat{p}$ axis between 0 and 1.
If we simply swap the vertical and horizontal axes, we get the graph we are looking for — for any value of $z$ the quantity $\hat{p}$ stays within $(0, 1)$. This is the mathematical construct that maps any real number to a value $\in (0, 1),$ i.e., to a probability.
We therefore need to express $\hat{p}$ as a function of $z.$
Taking the exponential of the previous formula:
$$ e^z = \frac{\hat{p}}{1 - \hat{p}} $$
We immediately find the desired function: $\hat{p} = f(z) = \frac{1}{1 + e^{-z}}.$ This function is called the sigmoid and is denoted by the Greek letter sigma: $$ \boxed{\sigma(z) = \frac{1}{1 + e^{-z}}} $$
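A direct implementation of the sigmoid (a minimal sketch using NumPy):

```python
import numpy as np

def sigmoid(z):
    """sigma(z) = 1 / (1 + e^(-z)): maps any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))                      # 0.5
print(sigmoid(np.array([-10.0, 10.0])))  # values very close to 0 and 1
```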
Intermediate Summary
To summarize: for a computer to make a successful prediction, it must compute the function $\hat{p}^\text{pred} = f(\vec{\mathbf{x}})$ from the given astrological features $x_i$, returning a number between 0 and 1 (the probability that a single event occurs).
We can now write this as:
$$ \hat{p}^\text{pred} = \sigma\left(\sum_i w^i x_i + b\right) \in (0, 1) $$
In this form the task is easy to formalize. We need to find numerical values of the weights $w^i$ and the bias $b$ such that for any astrological features $x_i$, the predicted probability $\hat{p}^\text{pred}$ is as close as possible to the actually observed probability. The weights $w^i$ and bias $b$ are the same for any feature vector $\vec{\mathbf{x}}$.
The numerical values $(w^i, b)$ are called the parameters of the model, and the process of computing them is called machine learning.
Total Probability over Multiple Predictions
For a single prediction with astrological features $\vec{\mathbf{x}}$ and event label $y$, the Bernoulli formula is:
$$ P(y \mid \vec{\mathbf{x}}, \vec{\mathbf{w}}, b) = \hat{p}^{y} \cdot (1 - \hat{p})^{1 - y} $$
where $\hat{p} = \sigma(\vec{\mathbf{w}}^\text{T} \vec{\mathbf{x}} + b).$
We read this as: the model with parameters $\vec{\mathbf{w}}$ and $b$ predicts the probability of occurrence ($y=1$) or non-occurrence ($y=0$) of a single event given the astrological features $\vec{\mathbf{x}}.$
Training a model on a single astrological example makes no sense. The occurrence of a single event under given features $x_i$ could be a pure coincidence. Such a model would be unable to predict the occurrence or non-occurrence of the same event on a different set of astrological features.
A good model therefore trains on many examples of both successful and failed predictions, in order to find optimal weights and bias such that for any astrological features $x_i$ the predicted probability is as close as possible to the real one.
The Likelihood Function
Suppose we have selected $M$ arbitrary horoscopes from the AstroDataBank database with documented relocation dates. For each horoscope, an astrologer computes a set of astrological feature vectors $\vec{\mathbf{x}}_1, \vec{\mathbf{x}}_2, \ldots, \vec{\mathbf{x}}_M$ at the time of relocation. Such features might include:
- Normalized angular distance between the ruler of the ascending sign and the cusp of the 9th house
- Normalized ecliptic degree of the ruler of the 9th house
- Normalized house number occupied by the ascendant ruler
- Other numerical values from 0 to 1, collectively forming the feature vector $\vec{\mathbf{x}}_i$
The astrologer also selects an equal number of negative examples (random horoscopes and dates when no relocation occurred) and computes the corresponding feature vectors with event label $y = 0$. In total we have $M$ positive examples $(\vec{\mathbf{x}}_1, y=1), (\vec{\mathbf{x}}_2, y=1), \ldots, (\vec{\mathbf{x}}_M, y=1)$ and $M$ negative examples $(\vec{\mathbf{x}}_{M+1}, y=0), (\vec{\mathbf{x}}_{M+2}, y=0), \ldots, (\vec{\mathbf{x}}_{2M}, y=0)$, for a total of $N$ examples.
In the general case we can say that we have a dataset of $N = 2M$ examples: $(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_N, y_N),$ on which the model will be trained to predict relocations from astrological indicators.
We assume the examples are independent of one another (a standard assumption). Then the probability of observing the entire dataset at once is the product of the individual probabilities:
$$ P(\text{all data} \mid \vec{\mathbf{w}}, b) = \prod_{i=1}^{N} P(y_i \mid \vec{\mathbf{x}}_i, \vec{\mathbf{w}}, b) = \prod_{i=1}^{N} \hat{p}_i^{y_i}(1-\hat{p}_i)^{1-y_i} $$
where $\hat{p}_i = \sigma(\vec{\mathbf{w}}^\text{T} \vec{\mathbf{x}}_i + b).$
This product is called the likelihood function $\mathcal{P}(\vec{\mathbf{w}}, b)$. Its meaning: it shows at which parameter values $\vec{\mathbf{w}}, b$ the observed data are most plausible.
The machine learning task: find $(\vec{\mathbf{w}}, b)$ under which the total probability $\mathcal{P}(\vec{\mathbf{w}}, b)$ of observing the entire dataset exactly as it is comes as close to 1 as possible.
Log-Likelihood
A product of a million numbers between 0 and 1 rapidly approaches $10^{-1000000}$ — a computer quickly hits the limits of numerical precision. We therefore work with the logarithm of the likelihood function.
Why the logarithm?
- First, the logarithm is a monotone function: the maximum of $\ln \mathcal{P}$ is achieved at the same $(\vec{\mathbf{w}}, b)$ as the maximum of $\mathcal{P}$.
- Second, the logarithm of a product becomes a sum. And a sum of a million numbers between 0 and 1 does not approach zero, unlike their product.
$$ \ln \mathcal{P}(\vec{\mathbf{w}}, b) = \sum_{i=1}^{N}\left[y_i \ln \hat{p}_i + (1 - y_i)\ln(1 - \hat{p}_i)\right] $$
where $\hat{p}_i = \sigma(\vec{\mathbf{w}}^\text{T} \vec{\mathbf{x}}_i + b).$
This is the log-likelihood. Our task is to find $(\vec{\mathbf{w}}, b)$ that maximizes this function.
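The numerical difference between the two forms is easy to demonstrate (a small sketch):

```python
import numpy as np

probs = np.full(1_000_000, 0.5)   # a million probabilities of 0.5

print(np.prod(probs))             # 0.0 -- the product underflows to zero
print(np.sum(np.log(probs)))      # about -693147.2 -- perfectly representable
```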
By convention in machine learning we minimize a loss function, so we negate the log-likelihood and normalize by the number $N$ of examples so that the loss does not grow with the size of the dataset:
$$ \mathcal{L}(\vec{\mathbf{w}}, b) = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \ln \hat{p}_i + (1 - y_i)\ln(1 - \hat{p}_i)\right] $$
where $\hat{p}_i = \sigma(\vec{\mathbf{w}}^\text{T} \vec{\mathbf{x}}_i + b).$
This is Binary Cross-Entropy (BCE) — the loss function for the entire dataset. The smaller this function, the smaller the loss (i.e., the smaller the total deviation of the predicted event labels ${y_1, y_2, \ldots y_N}$ from those actually observed).
Gradient Descent
To better understand the machine learning process, let us suppose we have only one astrological feature $x.$ Then $\hat{p}_i = \sigma(w x + b),$ so the model has just two parameters — one weight $w$ and one bias $b.$ In this case the loss function is a function of two variables $\mathcal{L}(w, b),$ i.e., a two-dimensional surface. This surface has a depression: for some values of $w$ and $b$ the function has a minimum. Our task is to find the coordinates $(w_0, b_0)$ of that minimum.
Let us visualize this function and the process of finding its minimum.
We start the search from the point $(0, 0).$ We construct tangents to the surface along the $w$ and $b$ axes. The resulting vector (the sum of the two tangents) points in the direction along which we "roll downhill" toward the minimum.
Since the tangents along $w$ and $b$ are mathematically defined by the partial derivatives $\partial \mathcal{L}(w, b) / \partial w$ and $\partial \mathcal{L}(w, b) / \partial b$ (the slope of the surface along $w$ and $b$), the resulting vector is the gradient of the surface (the direction of steepest ascent). We take a small step in the opposite direction (against the gradient).
At the new point we again construct two tangents and move downward along the resulting vector. We repeat this procedure until the tangents become nearly parallel to the $(w, b)$ plane — this signals that we have arrived at the minimum, i.e., we have found the coordinates $(w_0, b_0).$
This procedure is known as gradient descent.
In the general case with $n$ weights $\{w_1, w_2, \ldots, w_n\}$ and one bias $b$, visualizing the surface in $(n+2)$-dimensional space is harder. However, the algorithm remains the same: we construct tangents along $w_1, w_2, \ldots, w_n, b$ and, stepping against the gradient, converge toward the minimum $(\vec{\mathbf{w}}_0, b_0).$ The coordinates of that point are the result of machine learning: the internal parameters of a model that can predict future events.
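A minimal sketch of this procedure for a single feature, assuming synthetic data in which the event becomes more likely as $x$ grows:

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data: one feature x in [0, 1]; events are more likely for large x.
x = rng.random(500)
y = (rng.random(500) < x).astype(float)

w, b = 0.0, 0.0          # start the search from the point (0, 0)
step = 0.5               # size of the step against the gradient

for _ in range(2000):
    p_hat = 1.0 / (1.0 + np.exp(-(w * x + b)))   # sigmoid(wx + b)
    dw = np.mean((p_hat - y) * x)                # dL/dw of the BCE loss
    db = np.mean(p_hat - y)                      # dL/db of the BCE loss
    w -= step * dw                               # roll downhill
    b -= step * db

print(w > 0)  # True: the model learned the positive dependence on x
```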
Neural Network Implementation
However, the model described above has a fundamental limitation. A single affine transformation $z = \sum_i w^ix_i + b$ is essentially a weighted sum of astrological features. Such a model assumes that each feature influences the probability of an event linearly and independently of the others.
For example, if feature $x_1$ is the angular distance (aspect) between two planets, then the closer this aspect is to 180°, the more strongly that feature pushes the model toward predicting the event — regardless of all other factors.
But astrological tradition says otherwise. The same aspect between planets works very differently depending on the house the planet occupies, its speed, and the other aspects it simultaneously forms with other planets and the observer's horizon. In other words, the features interact nonlinearly.
A linear model cannot capture this — no matter how precise the weights $w^i$, it always sums the features into a single total, ignoring their mutual influence.
The solution is to add intermediate affine transformations (layers) to the model: these first form new composite features from the originals (e.g., "Mars in an angular house while retrograde") and then combine and sum those composite features. Each such layer is another affine transformation $z = \sum_i w^ix^\text{prev}_i + b,$ applied to the output of the previous layer rather than to the raw input.
After the intermediate result $z$ of an affine transformation, a nonlinear function $f(z)$ is applied — the sigmoid or an analogue — to prevent composite features from collapsing back into a single linear sum. Two consecutive affine transformations without a nonlinearity between them are mathematically equivalent to one, so the whole depth of the network collapses to a single layer.
This multi-layer mathematical structure is called a neural network.
For example, neurons in the first hidden layer may learn to respond to the combination of an aspect and a planet's speed simultaneously; neurons in the second layer may respond to combinations of these composite features with house position. Which combinations turn out to be significant — the model determines on its own during training, without explicit instructions from the astrologer.
The more layers, the more sensitive the model becomes to nonlinear dependencies among features.
Neural Network Architecture
The network has $L$ layers (stages of affine transformation), numbered from $1$ to $L.$ Each layer contains not one, but several independent transformations. Each transforming unit within a layer is called a neuron.
Each neuron in layer $l$ receives inputs $\{h_1^{(l-1)}, h_2^{(l-1)}, \ldots, h^{(l-1)}_k\}$ from neurons $1, 2, \ldots, k$ of the previous layer $(l-1)$ and transforms them via the affine transformation $z = \sum_i w^ih^{(l-1)}_i + b,$ with its own individual weights $w^i$ and bias $b$.
The resulting value $z$ is then passed through a nonlinear activation function $f(z)$ and the result is forwarded to the next layer $(l+1)$.
By analogy with the human brain, the inputs to a neuron are called signals, and $f(z)$ is called the activation function. We say the neuron is activated by the weighted sum of its incoming signals.
The figure below shows a simple neural network with two hidden layers containing 2 and 3 neurons respectively, as well as the input and output layers.
Let us trace how a signal passes through the neural network.
Input (zeroth) "layer" — simply takes the input data, i.e., the set of astrological features, and passes them on without any modification:
$$ \vec{\mathbf{h}}^{(0)} = \vec{\mathbf{x}} \in \mathbb{R}^{d} $$
Here $\in \mathbb{R}^{d}$ means that the dimension of the column vector $\vec{\mathbf{h}}^{(0)}$ of the zeroth layer equals the number $d$ of astrological features. This is equivalent to writing:
$$ h^{(0)}_i = x_i, \quad i = 1, 2, \ldots, d. $$
Hidden layers $l = 1, \ldots, L-1$:
Each subsequent layer $l$ has $n^{(l)}$ neurons. Each neuron takes the entire vector $\vec{\mathbf{h}}^{(l-1)}$ from the previous layer, forms the linear combination $z = \sum_i w^{(l)\,i} h^{(l-1)}_i + b$ of its coordinates, and applies the activation function $f(z)$.
The full set of neurons in layer $l$ first produces a set of linear combinations $z_k = \sum_i w^{(l)\,i}_k h^{(l-1)}_i + b_k,$ and then applies the activation function to these values, forming the layer's total output: $h_k = f(z_k),$ where $k = 1, 2, \ldots, n^{(l)}.$
The resulting action of the entire layer $l$ in matrix form is:
$$ \vec{\mathbf{h}}^{(l)} = f\left(W^{(l)} \vec{\mathbf{h}}^{(l-1)} + \vec{\mathbf{b}}^{(l)}\right) $$
where $W^{(l)} \in \mathbb{R}^{n^{(l)} \times n^{(l-1)}}$ is the weight matrix of layer $l$ with dimensions $[n^{(l)} \times n^{(l-1)}],$ $\vec{\mathbf{b}}^{(l)} \in \mathbb{R}^{n^{(l)}}$ is the bias vector of layer $l$, and $f$ is the activation function applied element-wise.
Single-Example vs Batch Notation
In this section, all definitions $h^{(l)}, z^{(l)}, W^{(l)}, b^{(l)}$ refer to a single input vector $x_i$. Later, when we describe gradient descent and training, these variables will be generalized to handle multiple examples at once:
- $H^{(l)}$ — activations of layer l for an entire batch of $N$ examples, each column corresponding to one example.
- $Z^{(l)}$ — pre-activations of layer $l$ for the batch.
This allows fully vectorized computation in NumPy while preserving the same mathematical formalism.
```python
# Example of defining network architecture for a single input
input_dim = len(astrological_features)
hidden_dims = [32, 16]  # two hidden layers
net = BernoulliNet(
    input_dim=input_dim,
    hidden_dims=hidden_dims,
)
```
Activation function in hidden layers:
For hidden layers it is convenient to use the Rectified Linear Unit (ReLU):
$$ f(z) = \max(0, z) $$
Why is this more convenient than the sigmoid?
- For hidden layers (unlike the final output layer) there is no requirement that they return values between 0 and 1, i.e., probabilities. Hidden layers serve a different purpose — they simply capture nonlinear interactions among astrological features.
- Unlike the sigmoid, ReLU does not approach 1 for large $z,$ so ReLU gradients do not vanish as $z$ grows.
- It is fast to compute.
- Despite its simplicity it is nonlinear (it clips all values $z < 0$) — all of this makes training fast and stable.
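ReLU and its derivative take one line each in NumPy (a sketch; `d_relu` matches the helper name used in the backward-pass snippet later in the article):

```python
import numpy as np

def relu(z):
    """Rectified Linear Unit: max(0, z), element-wise."""
    return np.maximum(0.0, z)

def d_relu(z):
    """Derivative of ReLU: 1 where z > 0, 0 otherwise."""
    return (z > 0).astype(float)

z = np.array([-2.0, 0.0, 3.0])
print(relu(z))    # [0. 0. 3.]
print(d_relu(z))  # [0. 0. 1.]
```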
Output layer $l = L$:
The output layer has a single neuron, and its activation function is the sigmoid — because we want the network's output to be the predicted probability $P(y=1 \mid \vec{\mathbf{x}})$ of a single future event given the astrological features $x_i$:
$$ \hat{p}^\text{pred} = \sigma\left(z^{(L)}\right) = \frac{1}{1 + e^{-z^{(L)}}} $$
The sigmoid maps any real number to a probability.
Vectorized Implementation
While the above definitions describe the activation functions for a single input vector, in practice they are applied to entire batches of data using vectorized operations.
For a batch of $N$ examples, the pre-activation values of the output layer form a row vector:
$$ Z^{(L)} \in \mathbb{R}^{1 \times N} $$
The sigmoid function is then applied element-wise: $$ \sigma(Z^{(L)}) = \frac{1}{1 + e^{-Z^{(L)}}} $$
In code, this corresponds to a fully vectorized operation:
```python
ZL = WL @ H_prev + bL[:, None]  # shape: (1, N)
P = 1 / (1 + np.exp(-ZL))       # sigmoid applied element-wise
```
A similar vectorization applies to hidden layers, where the ReLU activation is applied element-wise to matrices $Z^{(l)}$.
This formulation allows efficient computation over entire batches while remaining consistent with the single-example notation introduced above.
Gradient Descent for the Neural Network
We established earlier the loss function for the full dataset. In practice, the loss function is computed over a batch of examples using vectorized operations:
$$ \mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \ln \hat{p}_i + (1 - y_i)\ln(1 - \hat{p}_i)\right] $$
where $\hat{p}, y \in \mathbb{R}^{1 \times N}$
In implementation, this corresponds to a vectorized computation:
```python
def loss(self, p_hat: Batch, y: Batch) -> float:
    """
    Binary cross-entropy (Bernoulli log-likelihood)
    """
    return float(
        -np.mean(
            y * np.log(p_hat + 1e-9)
            + (1 - y) * np.log(1 - p_hat + 1e-9)
        )
    )
```
Here p_hat and y are row vectors of shape $[1 \times N]$. A small constant (e.g., $10^{-9}$) is added inside the logarithm to ensure numerical stability.
We discussed that to apply gradient descent we must sequentially compute gradients — i.e., partial derivatives of $\mathcal{L}(\vec{\mathbf{w}}, b)$ along the axes $w_1, w_2, \ldots, w_d, b$ — which will lead us to the minimum of the function at the point $(\vec{\mathbf{w}}_0, b_0).$ The coordinates of that point are the parameters of the event-predicting model.
In the case of a neural network the loss function formula remains unchanged. However, whereas before the probability $\hat{p} = \sigma(\vec{\mathbf{w}}^\text{T} \vec{\mathbf{x}} + b)$ of an event on a single example depended directly on the input features $x_i$, in a neural network the probability $\hat{p} = \sigma\left(\left(\vec{\mathbf{w}}^{(L)}\right)^\text{T} \cdot \vec{\mathbf{h}}^{(L-1)} + b^{(L)}\right)$ now depends on the output of the last hidden layer $L-1$, whose output in turn depends on the output of layer $L-2$, and so on.
In essence, before we introduced the neural network we worked with a single neuron — we immediately formed a linear combination of astrological features and fed it to the sigmoid to obtain the output. The loss function $\mathcal{L}(\vec{\mathbf{w}}, b)$ depended on one weight vector $\vec{\mathbf{w}}$ and one bias $b.$
In a neural network, each layer has many neurons, each with its own weight vector and bias. The collection of all weight vectors $\vec{\mathbf{w}}_i$ in each layer $l$ forms the weight matrix $W^{(l)}$ of the layer, and the collection of all biases $b_i$ forms the bias vector $\vec{\mathbf{b}}^{(l)}.$ The loss function now depends on these parameters: $\mathcal{L}(W^{(l)}, \vec{\mathbf{b}}^{(l)})$ for each layer $l$.
For the neural network we therefore seek partial derivatives of $\mathcal{L}$ with respect to these parameters for each layer: $\partial \mathcal{L} / \partial W^{i\;(l)}_j$ and $\partial \mathcal{L} / \partial b^{(l)}_j,$ where index $j$ (row) runs over all neurons in layer $l$, and index $i$ (column) runs over all neurons in the previous layer. The matrix element $W^{i\;(l)}_j$ is the weight coefficient of the connection from the $i$-th neuron of the previous layer to the $j$-th neuron of the current layer.
Gradients are often denoted by the nabla symbol: $\left(\nabla_{W^{(l)}} \mathcal{L}\right)^i_{j}$ and $\left(\nabla_{b^{(l)}} \mathcal{L}\right)_j$.
Layer Delta
To compute these gradients, let us introduce for convenience the quantity: $$ \delta^{(l)} = \frac{\partial \mathcal{L}}{\partial z^{(l)}} $$ Recall that for the output layer $z^{(L)} = \left(\vec{\mathbf{w}}^{(L)}\right)^\text{T}\cdot \vec{\mathbf{h}}^{(L-1)} + b^{(L)},$ and for each $k$-th neuron of a hidden layer $l:$ $z^{(l)}_k = \left(\vec{\mathbf{w}}^{(l)}_k\right)^\text{T} \cdot \vec{\mathbf{h}}^{(l-1)} + b^{(l)}_k.$
Output layer
First we compute $\delta^{(L)}$ for the output layer. By the chain rule:
$$ \delta^{(L)} = \frac{\partial \mathcal{L}}{\partial z^{(L)}} = \frac{\partial \mathcal{L}}{\partial \hat{p}} \cdot \frac{\partial \hat{p}}{\partial z^{(L)}} $$
We compute each factor.
- Derivative of the BCE loss with respect to $\hat{p}$ (for one example, without the sum):
$$ \frac{\partial \mathcal{L}}{\partial \hat{p}} = -\frac{y}{\hat{p}} + \frac{1 - y}{1 - \hat{p}} $$
- Derivative of the sigmoid:
$$ \frac{\partial \hat{p}}{\partial z^{(L)}} = \sigma'(z^{(L)}) = \sigma(z^{(L)})\left(1 - \sigma(z^{(L)})\right) = \hat{p}(1 - \hat{p}) $$
- Multiplying these two derivatives — everything cancels beautifully:
$$ \delta^{(L)} = \left(-\frac{y}{\hat{p}} + \frac{1 - y}{1 - \hat{p}}\right)\hat{p}(1 - \hat{p}) = -y(1 - \hat{p}) + (1 - y)\hat{p} = \hat{p} - y $$
Averaging the loss over a sample of $N$ examples (i.e., over the full dataset) scales each example's contribution by $1/N$, so for the $i$-th example:
$$ \boxed{\delta^{(L)}_i = \frac{1}{N}(\hat{p}_i - y_i)} $$
The sigmoid derivative and the BCE logarithm cancel each other elegantly — this is exactly why sigmoid and BCE form a canonical pair.
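This cancellation is easy to verify numerically: the derivative of BCE composed with the sigmoid, taken by central finite differences, matches $\hat{p} - y$ (a small sketch with illustrative values):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce(p_hat, y):
    """Binary cross-entropy for a single example."""
    return -(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))

z, y = 0.7, 1.0
analytic = sigmoid(z) - y                     # delta^(L) = p_hat - y

eps = 1e-6                                    # central finite difference
numeric = (bce(sigmoid(z + eps), y) - bce(sigmoid(z - eps), y)) / (2 * eps)

print(abs(analytic - numeric) < 1e-6)  # True
```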
Batch Implementation
For a batch of $N$ examples, the output layer delta is represented as a row vector:
$$ \vec{\boldsymbol{\delta}}^{(L)} \in \mathbb{R}^{1 \times N} $$
where each element corresponds to a single example.
In vectorized form, this can be written as:
$$ \vec{\boldsymbol{\delta}}^{(L)} = \vec{\hat{\mathbf{p}}} - \vec{\mathbf{y}} $$
and averaging over the batch is implemented by scaling:
```python
delta: Batch = (p_hat - y) / N
```
Here, subtraction is performed element-wise, and division by $N$ corresponds to averaging over the batch.
Hidden layers — the recurrence formula
We compute $\delta^{(L-1)}_i$ for the $i$-th neuron of layer $(L-1)$. By the chain rule:
$$ \delta^{(L-1)}_i = \frac{\partial \mathcal{L}}{\partial z^{(L-1)}_i} = \frac{\partial \mathcal{L}}{\partial z^{(L)}} \cdot \frac{\partial z^{(L)}}{\partial h^{(L-1)}_i} \cdot \frac{\partial h^{(L-1)}_i}{\partial z^{(L-1)}_i} $$
Expanding each factor:
- $\dfrac{\partial \mathcal{L}}{\partial z^{(L)}} = \delta^{(L)}$ — the already-known delta of the output layer
- $\dfrac{\partial z^{(L)}}{\partial h^{(L-1)}_i} = w^{i\;(L)}$ — the $i$-th weight in the weight row from the formula $z^{(L)} = \left(\vec{\mathbf{w}}^{(L)}\right)^\text{T}\cdot \vec{\mathbf{h}}^{(L-1)} + b^{(L)}$ or $z^{(L)} = \sum_j w^{j\,(L)}h_j^{(L-1)} + b^{(L)}$.
- $\dfrac{\partial h^{(L-1)}_i}{\partial z^{(L-1)}_i} = f'(z^{(L-1)}_i)$ — the derivative of the ReLU activation function: one if $z^{(L-1)}_i > 0,$ zero otherwise.
Substituting, we obtain:
$$ \delta^{(L-1)}_i = w^{i\,(L)}\, \delta^{(L)} \cdot f'(z^{(L-1)}_i) $$
Since, taken together over all values of $i$, the set of values $w^{i\,(L)}$ forms a row vector $\left(\vec{\mathbf{w}}^{(L)}\right)^{\text{T}}$, we can express $w^{i\,(L)}$ as $\left(w_i^{(L)}\right)^{\text{T}}$.
or, in matrix form for all neurons in the layer:
$$ \boxed{\vec{\boldsymbol{\delta}}^{(L-1)} = \left(\vec{\mathbf{w}}^{(L)}\right)^\text{T}\delta^{(L)} \odot f'(\vec{\mathbf{z}}^{(L - 1)})} $$
where
- $\vec{\mathbf{w}}^{(L)}$ — a column vector of weights of the output neuron.
- $\delta^{(L)}$ — a scalar
- $f'(\vec{\boldsymbol{z}}^{(l)})$ for ReLU: a vector of zeros and ones — ones where $z^{(l)}_i > 0$, zeros otherwise.
- $\odot$ — element-wise product of two vectors.
By analogy we compute $\delta^{(L-2)}_i$ for the $i$-th neuron of layer $(L-2)$. By the chain rule:
$$ \delta^{(L-2)}_i = \sum_j \frac{\partial \mathcal{L}}{\partial z^{(L-1)}_j} \cdot \frac{\partial z^{(L-1)}_j}{\partial h^{(L-2)}_i} \cdot \frac{\partial h^{(L-2)}_i}{\partial z^{(L-2)}_i} $$
The summation here runs over all neurons of layer $(L-1)$. Expanding each factor:
- $\dfrac{\partial \mathcal{L}}{\partial z^{(L-1)}_j} = \delta^{(L-1)}_j$ — the already-known delta of layer $(L-1)$
- $\dfrac{\partial z^{(L-1)}_j}{\partial h^{(L-2)}_i} = W^{i\;(L-1)}_j$ — the weight of the connection from neuron $i$ of layer $(L-2)$ to neuron $j$ of layer $(L-1)$
- $\dfrac{\partial h^{(L-2)}_i}{\partial z^{(L-2)}_i} = f'(z^{(L-2)}_i)$ — the derivative of the activation function
Substituting:
$$ \delta^{(L-2)}_i = \sum_j W^{i\;(L-1)}_j\, \delta^{(L-1)}_j \cdot f'(z^{(L-2)}_i) $$
Similarly, we can represent $W^{i\;(L-1)}_j$ as an element $\left(W^\text{T}\right)^{j\;(L-1)}_i$ of the transposed matrix, that is,
$$ \delta^{(L-2)}_i = \sum_j \left(W^\text{T}\right)^{j\;(L-1)}_i\, \delta^{(L-1)}_j \cdot f'(z^{(L-2)}_i) $$
Here the summation runs over index $j$, which runs over the coordinates of the row vector $\vec{\boldsymbol{\delta}}^{(L-1)}$. This formula shows the connection between two adjacent layers — if we know $\vec{\boldsymbol{\delta}}^{(L-1)},$ we now know how to find $\vec{\boldsymbol{\delta}}^{(L-2)},$ and from that the next one, all the way back to layer 1. In the general case this recurrence can be written as:
$$ \delta^{(l-1)}_i = \sum_j \left(W^\text{T}\right)^{j\;(l)}_i\, \delta^{(l)}_j \cdot f'(z^{(l-1)}_i) $$
In matrix form for the full layer, this recurrence reads:
$$ \boxed{\vec{\boldsymbol{\delta}}^{(l-1)} = \left(W^{(l)}\right)^\text{T} \vec{\boldsymbol{\delta}}^{(l)} \odot f'(\vec{\mathbf{z}}^{(l-1)})} $$
where
- $W^{(l)}$ — the weight matrix of the $l$-th layer, with dimensions $[n^{(l)} \times n^{(l-1)}]$.
- $f'(\vec{\boldsymbol{z}}^{(l-1)})$ for ReLU: a vector of zeros and ones — ones where $z^{(l-1)}_i > 0$, zeros otherwise.
- $\odot$ — element-wise product.
Batch Implementation
For a batch of $N$ examples, the delta vectors generalize to matrices:
$$ \Delta^{(l)} \in \mathbb{R}^{n^{(l)} \times N} $$
where each column corresponds to the delta vector for a single example.
The recurrence relation becomes:
$$ \Delta^{(l-1)} = \left(W^{(l)}\right)^T \Delta^{(l)} \odot f'(Z^{(l-1)}) $$
where:
- $Z^{(l)} \in \mathbb{R}^{n^{(l)} \times N}$ — matrix of pre-activations
- $f'(Z^{(l)})$ — element-wise derivative (for ReLU: 0 or 1)
In NumPy, this is implemented as:
```python
for layer in range(len(self.W) - 1, 0, -1):
    delta = (self.W[layer].T @ delta) * d_relu(z_cache[layer - 1])
```
Gradient with Respect to Layer Weight
Having derived the delta vectors $\vec{\boldsymbol{\delta}}^{(l)}$, we can now express the gradients with respect to the weights of the layer in a compact and computationally efficient form.
By the chain rule:
$$ \frac{\partial \mathcal{L}}{\partial W^{i\;(l)}_j} = \frac{\partial \mathcal{L}}{\partial z^{(l)}_j} \cdot \frac{\partial z^{(l)}_j}{\partial W^{i\;(l)}_j} $$
Expanding the factors, the first factor $\dfrac{\partial \mathcal{L}}{\partial z^{(l)}_j}$ is, by definition, $\delta^{(l)}_j$. The second factor:
$$ \frac{\partial z^{(l)}_j}{\partial W^{i\;(l)}_j} = h^{(l-1)}_i, \quad \text{since} \quad z^{(l)}_j = \sum_i W^{i\;(l)}_j h^{(l-1)}_i + b^{(l)}_j $$
That is,
$$ \frac{\partial \mathcal{L}}{\partial W^{i\;(l)}_j} = \delta^{(l)}_j\, h^{(l-1)}_i $$
Or equivalently in matrix form:
$$ \nabla_{W^{(l)}} \mathcal{L} = \vec{\boldsymbol{\delta}}^{(l)} \left(\vec{\mathbf{h}}^{(l-1)}\right)^\text{T} $$
The gradient has the same dimensions as $W^{(l)}$.
Batch Implementation
For a batch of $N$ examples, the gradient generalizes naturally by replacing vectors with matrices:
$$ \Delta^{(l)} \in \mathbb{R}^{n^{(l)} \times N}, \quad H^{(l-1)} \in \mathbb{R}^{n^{(l-1)} \times N} $$
The gradient becomes:
$$ \boxed{ \nabla_{W^{(l)}} \mathcal{L} = \Delta^{(l)} \cdot \left(H^{(l-1)}\right)^\text{T} } $$
This corresponds to multiplying:
- $\Delta^{(l)}$ — matrix of deltas for all examples
- $\left(H^{(l-1)}\right)^T$ — transposed activations of the previous layer
In NumPy, this is implemented as:
```python
dW_l = delta @ H_cache[layer - 1].T
```
where:
- `delta` corresponds to $\Delta^{(l)}$
- `H_cache[layer - 1]` stores $H^{(l-1)}$
The resulting matrix has the same dimensions as $W^{(l)}$.
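A quick numerical check illustrates the batch gradient formula; the layer widths and batch size below are arbitrary example values:

```python
import numpy as np

rng = np.random.default_rng(0)
n_l, n_prev, N = 4, 3, 8               # example layer widths and batch size
delta = rng.normal(size=(n_l, N))       # Δ^(l), deltas for all examples
H_prev = rng.normal(size=(n_prev, N))   # H^(l-1), previous-layer activations

dW = delta @ H_prev.T                   # ∇_{W^(l)} L, shape [n^(l) × n^(l-1)]
print(dW.shape)
```

The matrix product is equivalent to summing the per-example outer products $\vec{\boldsymbol{\delta}}^{(l)}_k \left(\vec{\mathbf{h}}^{(l-1)}_k\right)^\text{T}$ over the batch.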
Gradient with Respect to Layer Bias
Similarly, for the gradient with respect to the bias vector of each layer, the chain rule gives (using $\partial z^{(l)}_i / \partial b^{(l)}_i = 1$):
$$ \frac{\partial \mathcal{L}}{\partial b^{(l)}_i} = \frac{\partial \mathcal{L}}{\partial z^{(l)}_i} \cdot \frac{\partial z^{(l)}_i}{\partial b^{(l)}_i} = \delta^{(l)}_i $$
or in matrix form:
$$ \boxed{\nabla_{\vec{\mathbf{b}}^{(l)}} \mathcal{L} = \vec{\boldsymbol{\delta}}^{(l)} \in \mathbb{R}^{n^{(l)}}} $$
Batch Implementation
For a batch of $N$ examples, the bias gradient is obtained by summing the deltas over all examples:
$$ \boxed{ \nabla_{\vec{\mathbf{b}}^{(l)}} \mathcal{L} = \Delta^{(l)} \cdot \vec{\mathbf{1}} } $$
where $\vec{\mathbf{1}} \in \mathbb{R}^{N}$ is a vector of ones.
In practice, this is computed efficiently by summing over the batch dimension:
```python
db_l = delta.sum(axis=1, keepdims=True)
```
Here:
- `delta` has shape $[n^{(l)} \times N]$
- summation over `axis=1` corresponds to summing over all examples
- `keepdims=True` preserves the column vector shape $[n^{(l)} \times 1]$
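A minimal check of these shapes, and of the equivalence with the matrix form $\Delta^{(l)} \cdot \vec{\mathbf{1}}$ (layer width and batch size chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)
delta = rng.normal(size=(5, 32))             # Δ^(l), shape [n^(l) × N]

db = delta.sum(axis=1, keepdims=True)        # sum over the batch dimension
print(db.shape)
```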
Step-by-Step Gradient Descent Algorithm
We can now formalize the gradient descent algorithm for minimizing the loss function of the full neural network.
1. For each layer $l$ we **initialize** the weight matrix $W^{(l)}$ and the bias vector $\vec{\mathbf{b}}^{(l)}$ with random values.
2. We then traverse all hidden layers — from 1 to $(L-1)$ — computing $\vec{\mathbf{z}}^{(l)}$ and $\vec{\mathbf{h}}^{(l)}.$ We save these values — they will be needed later. This step is known as the *forward pass*.
3. We then compute:
   - $\delta^{(L)} = \frac{1}{N}\sum_{i=1}^{N}(\hat{p}_i - y_i)$ — for the output layer
   - $\vec{\boldsymbol{\delta}}^{(L-1)} = \left(\vec{\mathbf{w}}^{(L)}\right)^{\text{T}}\delta^{(L)} \odot f'(\vec{\mathbf{z}}^{(L - 1)})$ — for the pre-output layer
   - $\vec{\boldsymbol{\delta}}^{(l-1)} = \left(W^{(l)}\right)^T \vec{\boldsymbol{\delta}}^{(l)} \odot f'(\vec{\mathbf{z}}^{(l-1)})$ — for all remaining layers (moving toward the first)

   In parallel, for each layer we compute gradients with respect to weights $\nabla_{W^{(l)}} \mathcal{L} = \vec{\boldsymbol{\delta}}^{(l)} \cdot \left(\vec{\mathbf{h}}^{(l-1)}\right)^\text{T}$ and with respect to biases $\nabla_{\vec{\mathbf{b}}^{(l)}} \mathcal{L} = \vec{\boldsymbol{\delta}}^{(l)}.$ This step is known as the *backward pass* — computation from the last layer to the first.
4. The gradients point in the direction of growth of the loss function, so we take a small step in the opposite direction:
$$ W^{(l)} \leftarrow W^{(l)} - \lambda\, \nabla_{W^{(l)}} \mathcal{L}, \qquad \vec{\mathbf{b}}^{(l)} \leftarrow \vec{\mathbf{b}}^{(l)} - \lambda\, \nabla_{\vec{\mathbf{b}}^{(l)}} \mathcal{L} $$

In step 4 we effectively update/correct the weights and biases of all neurons in every layer. This corresponds to a small step of size $\lambda$ along the surface of $\mathcal{L}$ toward the coordinates of its minimum.
Through successive iterations the loss function begins to approach its minimum, and the weights and biases of all neurons converge to values at which the model predicts the probabilities of future events as close as possible to what we observe in the training examples.
Implementation
In practice, the forward and backward passes are executed sequentially on a batch of data, while caching intermediate activations for reuse during backpropagation:
```python
p_hat, H_cache = forward(Xb)
dW, db = backward(p_hat, yb, H_cache)
```
Here:
- `Xb` — batch of input features
- `p_hat` — predicted probabilities for the batch
- `H_cache` — stored activations $H^{(l)}$ from all layers
- `dW`, `db` — gradients for all layers
This corresponds exactly to the forward and backward passes described above.
Parameter updates are performed simultaneously for all layers:
```python
W = [W_l - lr * dW_l for W_l, dW_l in zip(W, dW)]
b = [b_l - lr * db_l for b_l, db_l in zip(b, db)]
```
(The loop variables are renamed here to avoid shadowing the lists `W` and `b` inside the comprehensions.)
This operation applies the gradient descent step to each layer, updating all weights and biases in the direction of decreasing loss.
The full training procedure consists of repeating these steps over multiple iterations (epochs), each time sampling batches from the dataset. As the number of iterations increases, the parameters converge toward values that minimize the loss function and maximize predictive accuracy.
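The full loop above can be sketched end to end on synthetic data. This is a minimal illustration under simplifying assumptions (toy features and labels, one hidden ReLU layer, sigmoid output); the names `forward` and `backward` mirror the text, but the bodies are this sketch's own:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for real features: y = 1 when the feature mean exceeds 0.5
X = rng.uniform(0.0, 1.0, size=(6, 200))                  # features x examples
y = (X.mean(axis=0, keepdims=True) > 0.5).astype(float)   # labels, 1 x N

n_in, n_hid = X.shape[0], 8
W = [rng.normal(0, np.sqrt(2 / n_in), (n_hid, n_in)),     # He initialization
     rng.normal(0, np.sqrt(2 / n_hid), (1, n_hid))]
b = [np.zeros((n_hid, 1)), np.zeros((1, 1))]

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(Xb):
    z1 = W[0] @ Xb + b[0]
    h1 = np.maximum(0.0, z1)             # ReLU hidden layer
    p_hat = sigmoid(W[1] @ h1 + b[1])    # output probability
    return p_hat, (Xb, z1, h1)

def backward(p_hat, yb, cache):
    Xb, z1, h1 = cache
    N = Xb.shape[1]
    delta2 = (p_hat - yb) / N                      # output-layer delta
    delta1 = (W[1].T @ delta2) * (z1 > 0)          # backpropagated delta
    dW = [delta1 @ Xb.T, delta2 @ h1.T]
    db = [delta1.sum(axis=1, keepdims=True),
          delta2.sum(axis=1, keepdims=True)]
    return dW, db

lr, batch_size = 0.5, 32
for epoch in range(200):
    idx = rng.permutation(X.shape[1])              # shuffle each epoch
    X, y = X[:, idx], y[:, idx]
    for s in range(0, X.shape[1], batch_size):
        Xb, yb = X[:, s:s + batch_size], y[:, s:s + batch_size]
        p_hat, cache = forward(Xb)
        dW, db = backward(p_hat, yb, cache)
        W = [w - lr * dw for w, dw in zip(W, dW)]
        b = [b_ - lr * db_ for b_, db_ in zip(b, db)]

p_hat, _ = forward(X)
acc = float(((p_hat > 0.5) == (y > 0.5)).mean())
print(f"training accuracy: {acc:.2f}")
```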
Practical Steps
Below are the practical steps for training an artificial intelligence to make astrological predictions.
Data Preparation
A group of astrologers begins by selecting a set of predictive techniques from a single astrological school/tradition whose combined application allows a forecast to be made. These may be directions, revolution charts, and transits, or progressions, profections, and solar returns — this choice is left to the astrologers' discretion.
The astrological features are then converted to numerical values. For example, the ruler of a house can be expressed as a number where each number corresponds to a planet; the current firdaria can be expressed as a sequential index in the firdaria list; an aspect in degrees; a planet's house position as the house number; dignities and debilities as numerical scores; the conjunction (or lack thereof) of house cusps in overlaid charts as 0 or 1.
Each astrological feature is then normalized so that its value lies in the interval from 0 to 1. For example, house and sign numbers are divided by 12, zodiacal degrees by 360. The result is dozens of features whose combinations lead to the predicted event.
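The normalization step can be sketched as follows; the feature names, example values, and the dignity-score range here are illustrative assumptions, not prescriptions from the text:

```python
import numpy as np

# Hypothetical raw astrological features for one example
house = 10          # house number, 1..12
sign = 7            # zodiac sign number, 1..12
longitude = 213.5   # zodiacal degree, 0..360
dignity = 3         # dignity score, assumed range 0..5

# Normalize each feature into [0, 1] by dividing by its maximum value
x = np.array([
    house / 12,
    sign / 12,
    longitude / 360,
    dignity / 5,
])
assert np.all((0 <= x) & (x <= 1))   # every feature lies in [0, 1]
```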
Once the feature list is ready, an event is chosen as the training target — relocation, marriage, birth of a child, and so on. This is what the model will learn to predict.
From the AstroDataBank database, all available horoscopes with documented times of the chosen event (e.g., year, day, and month of a child's birth) are selected. Astrological techniques are applied to each such event and the astrological feature vector $x_i$ is filled in with event label $y = 1$. These features might include, for example, planetary positions in houses and signs (expressed as numbers from 0 to 1) in the solar return chart, mutual aspects when overlaying the lunar return chart on the natal chart (also expressed as numbers from 0 to 1), planetary dignities and debilities expressed as numbers (from 0 to 1), and so on — according to the feature list.
An equal number of negative examples must then be selected — random horoscopes and dates when the event did not occur — and for each such example the feature vector $x_i$ is constructed with event label $y = 0$.
Batch Size Selection
Since the dataset may be very large, training is best performed in small groups of $N$ examples called mini-batches. The number of examples processed in a single training step is the batch size.
In practice, training proceeds by iterating over batches sampled from the dataset. For each batch, a forward and backward pass is performed, followed by a parameter update.
The choice of batch size represents a trade-off:
- Smaller batches introduce stochasticity, which can help escape local minima
- Larger batches provide more stable gradient estimates but require more memory
Data Splitting
A common practice is to split the dataset into training (70%), validation (15%), and test (15%) subsets.
1. Training data (the main dataset) — the astrological feature vectors and event labels $(\vec{\mathbf{x}}_1, y_1), (\vec{\mathbf{x}}_2, y_2), \ldots$ fed to the neural network so that it can find the optimal weights and biases for making good predictions.
2. Validation data — a control set used to check how well the model predicts events. It may turn out that the model predicts events well only on the training data but misses badly when presented with unfamiliar input features and labels.
In that case the model's global parameters (hyperparameters) must be tuned so that the prediction accuracy on the training set is similar to that on the validation set. These parameters include:
- Number of layers
- Number of neurons per layer
- Batch size
- Gradient descent step size $\lambda$ (learning rate)
- Number of passes over the full training dataset (number of epochs)
3. Test data is used last, once optimal hyperparameters have been determined. The model makes predictions on the test data. If successful, the model is considered trained.
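The 70/15/15 split can be sketched as below; the proportions come from the text, while the array layout (features × examples) and the function name are this sketch's assumptions:

```python
import numpy as np

def split_data(X, y, seed=0):
    """Shuffle examples and split into train/val/test (70/15/15)."""
    rng = np.random.default_rng(seed)
    N = X.shape[1]
    idx = rng.permutation(N)                     # shuffle example indices
    n_train, n_val = int(0.70 * N), int(0.15 * N)
    tr = idx[:n_train]
    va = idx[n_train:n_train + n_val]
    te = idx[n_train + n_val:]
    return (X[:, tr], y[:, tr]), (X[:, va], y[:, va]), (X[:, te], y[:, te])
```

Shuffling before splitting ensures that no subset inherits an ordering bias from how the horoscopes were collected.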
This procedure defines the full training loop of the neural network, combining data preparation, stochastic sampling, and iterative optimization.
Implementation
In practice, training is organized into epochs, where each epoch corresponds to one full pass over the training dataset.
At the beginning of each epoch, the dataset is shuffled to prevent the model from learning spurious ordering effects:
```python
N_total: int = X.shape[1]
for epoch in range(1, self.epochs + 1):
    idx = np.random.permutation(N_total)
    X, y = X[:, idx], y[:, idx]
```
The dataset is then split into mini-batches:
```python
for s in range(0, N_total, self.batch_size):
    Xb: Batch = X[:, s: s + self.batch_size]
    yb: Batch = y[:, s: s + self.batch_size]
```
Each batch $(X_b, y_b)$ is used to perform a forward and backward pass, followed by a parameter update, as described in the previous section.
Gradient Descent
After the data are split, the training dataset is shuffled thoroughly and divided into mini-batches of $N$ examples each.
Before the first pass, the weight matrix $W^{(l)}$ is initialized with random numbers (see the Appendix), and the bias vector $\vec{\mathbf{b}}^{(l)}$ is set to zero for each layer $l$.
One gradient descent step (forward and backward pass) is then performed on the first batch. Then the next batch is taken and gradient descent is repeated. This continues until all batches are exhausted.
A complete pass over all batches is called a training epoch. When one epoch ends, the training dataset is shuffled again, split into new batches, and one gradient descent step is performed per batch.
Training epochs are repeated many times (typically dozens or hundreds). Throughout training we continuously monitor the loss function as it gradually decreases toward its minimum; this behavior is called loss convergence.
Data Validation
After the model has completed its first training run, we take the validation data and compare the model's predictions with the actual distribution of event labels. If the discrepancies are significant, the model's global parameters must be adjusted:
- Batch size
- Gradient descent step size (learning rate, $\lambda$)
- Number of neurons per layer
- Number of layers
- Number of epochs
We adjust these parameters and retrain the model until it passes validation.
Model Testing
The final step is to give the trained model the test data — astrological features $\vec{\mathbf{x}}_i$ — and compare the model's predictions with the actual distribution of event labels $y_i$. If the preceding steps were carried out correctly, the test almost always passes.
If the test fails:
- Check for data leakage — identical examples appearing in both training and test/validation sets. If found, remove them and retrain from scratch.
- Check data quality — are all features included, correctly computed, and properly normalized? If errors are found, correct them and retrain.
- If no errors are found, shuffle all examples thoroughly, re-split the data into training, validation, and test sets, and repeat training with the same global parameters.
Applying the Model
Once the model passes testing, it is considered trained and can be applied to new data.
For a given individual, we construct a sequence of astrological feature vectors $\vec{\mathbf{x}}_t$ corresponding to different time points (e.g., years, months, or days). The trained model maps each feature vector to a predicted probability:
$$ \hat{p}_t = P(y = 1 \mid \vec{\mathbf{x}}_t) $$
The output is therefore a time series of probabilities, which can be interpreted as a signal profile over time.
Rather than treating predictions as deterministic forecasts, this framework allows us to analyze the structure of the signal:
- Peaks in $\hat{p}_t$ correspond to time intervals where the model detects patterns similar to those associated with the target event in the training data
- Low values indicate absence of such patterns
- The overall distribution of $\hat{p}_t$ reflects the model's confidence across time
This interpretation is particularly important in weak-signal domains. Individual predictions may be noisy or uncertain, but consistent patterns in the probability distribution may indicate the presence of a genuine underlying signal.
In this sense, the model functions not only as a predictor, but as a statistical filter, transforming complex feature spaces into interpretable probability landscapes.
Such probability profiles can be analyzed further:
- by identifying statistically significant deviations from baseline levels
- by comparing distributions across individuals or event types
- by evaluating temporal clustering of high-probability intervals
This shifts the focus from isolated predictions to systematic signal detection, which is the primary objective in domains where ground truth is sparse or noisy.
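A minimal sketch of such profile analysis, identifying contiguous high-probability intervals in a series $\hat{p}_t$; the default threshold (two standard deviations above the series mean) is an illustrative choice, not prescribed by the text:

```python
import numpy as np

def high_probability_intervals(p, threshold=None):
    """Return (start, end) index pairs where p_t stays above threshold.

    By default the threshold is two standard deviations above the
    series mean (an illustrative baseline-deviation criterion).
    End indices are exclusive.
    """
    p = np.asarray(p)
    if threshold is None:
        threshold = p.mean() + 2 * p.std()
    above = p > threshold
    # Rising and falling edges of the boolean mask mark interval bounds
    edges = np.diff(above.astype(int), prepend=0, append=0)
    starts = np.flatnonzero(edges == 1)
    ends = np.flatnonzero(edges == -1)
    return list(zip(starts, ends))
```

Such intervals can then be compared across individuals or event types, or tested for temporal clustering.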
Appendix. Parameter Initialization Before the First Pass
We mentioned that before applying the gradient descent algorithm we initialize the weight matrices and bias vectors for each neural layer with arbitrary values. Two negative scenarios must be avoided.
Scenario 1. Very Large Weights
If neurons in the network start with very large (and diverse) weights, the vector $\vec{\mathbf{z}}^{(l)}$ (the result of the affine transformation at layer $l$) will have wildly different coordinates — for example, −1000, 549, −3261. In other words, the variance of the coordinates will be very large. Large values in one layer are amplified by large weights in the next, so by the final neuron the signals $\vec{\mathbf{h}}^{(L-1)}$ are enormous. Their linear combination $z = w^{i\;(L)}h_i^{(L-1)}$ produces large positive and negative values of $z$.
As you may recall from the graphs, the sigmoid $\sigma(z)$ saturates quickly — for large positive $z$ it approaches 1 and barely grows, while for large negative $z$ it approaches zero.
As a consequence, the output-layer delta $\delta^{(L)} = \frac{1}{N}\sum_{i=1}^{N}(\hat{p}_i - y_i)$ takes values close to $-1$, $0$, or $1.$
Since $\nabla_{W^{(l)}} \mathcal{L} = \vec{\boldsymbol{\delta}}^{(l)} \cdot \left(\vec{\mathbf{h}}^{(l-1)}\right)^{\text{T}}$, the weight gradients become the product of potentially large $\vec{\mathbf{h}}^{(l-1)}$ values and nearly discrete $\vec{\boldsymbol{\delta}}^{(l)}$ values. The gradients jump erratically — the loss surface is alternately nearly flat and nearly vertical. It is almost impossible to descend smoothly to the minimum over such a "jagged" surface. This negative scenario is called exploding variance.
Scenario 2. Very Small Weights
If we initialize the weights and biases to zero or very small values, $\vec{\mathbf{z}}^{(l)}$ will have very similar coordinates — for example, −0.00005, 0.00001, 0.00002. In other words, the variance of the coordinates will be very small. Small values in one layer are further reduced by small weights in the next, so by the final neuron the signals $\vec{\mathbf{h}}^{(L-1)}$ are tiny. Their linear combination $z = w^{i\;(L)}h_i^{(L-1)}$ produces near-zero values of $z$.
Near zero, the sigmoid behaves almost linearly: $$ \sigma(z) \approx 0.5 + \frac{z}{4} $$ So the model's predictions $\hat{p}_i$ for all examples are close to 0.5, barely distinguishable from each other.
As a consequence, the output-layer delta $\delta^{(L)} = \frac{1}{N}\sum_{i=1}^{N}(\hat{p}_i - y_i)$ is small in magnitude, since $\hat{p}_i \approx 0.5$ and the difference $(\hat{p}_i - y_i)$ is bounded and does not produce a strong error signal.
During backpropagation the deltas $\vec{\boldsymbol{\delta}}^{(l)}$ shrink further at each layer, since they are sequentially multiplied by small weights. By the time the error reaches the early layers it is extremely weak.
Since $$ \nabla_{W^{(l)}} \mathcal{L} = \vec{\boldsymbol{\delta}}^{(l)} \cdot \left(\vec{\mathbf{h}}^{(l-1)}\right)^{\text{T}}, $$ the weight gradients are the product of small $\vec{\boldsymbol{\delta}}^{(l)}$ and small $\vec{\mathbf{h}}^{(l-1)},$ resulting in near-zero gradients throughout. The loss surface is nearly flat: changing the weights barely changes the loss. Gradient descent "cannot sense" the direction of movement and makes infinitesimally small steps.
On such an almost horizontal surface, movement toward the minimum is extremely slow or practically halts. This negative scenario is called vanishing gradients.
Solution
We need initial weight values such that the variance $\text{Var}(\vec{\mathbf{h}}^{(l)})$ of the signal is preserved when passing from layer to layer — without explosive accumulation or decay: $$ \text{Var}(\vec{\mathbf{h}}^{(l)})\approx \text{Var}(\vec{\mathbf{h}}^{(l-1)}) $$ Because the ReLU activation clips approximately half the values of $\vec{\mathbf{z}}^{(l)}$ (since $\vec{\mathbf{h}}^{(l)} = \max(0, \vec{\mathbf{z}}^{(l)})$, roughly half the coordinates of $\vec{\mathbf{z}}^{(l)}$ are zeroed out): $$ \mathrm{Var}\left(h^{(l)}_j\right) \approx \frac{1}{2} \mathrm{Var}\left(z^{(l)}_j\right) $$ Since $z_j^{(l)} = W^{i\;(l)}_j h_i^{(l-1)}$ is a sum of $n^{(l-1)}$ products of (approximately independent) terms, and using the rules that the variance of a sum equals the sum of variances, and the variance of a product of zero-mean variables equals the product of variances: $$ \mathrm{Var}\left(z^{(l)}_j\right) \approx n^{(l-1)}\, \mathrm{Var}\left(W^{(l)}\right) \mathrm{Var}\left(h^{(l-1)}\right) $$
Combining:
$$ \mathrm{Var}\left(h^{(l)}_j\right) \approx \frac{1}{2}\, n^{(l-1)}\, \mathrm{Var}\left(W^{(l)}\right) \mathrm{Var}\left(h^{(l-1)}\right) $$
Our variance-preservation requirement can now be expressed as:
$$ \frac{1}{2}\, n^{(l-1)}\, \mathrm{Var}\left(W^{(l)}\right) = 1 $$
This means that the weights in layer $l$ should be random numbers with zero mean and variance $2/n^{(l-1)}$, where $n^{(l-1)}$ is the number of neurons in the previous layer.
This is easily achieved with a normal distribution:
$$ W^{(l)} \sim \mathcal{N}\left(0, \frac{2}{n^{(l-1)}}\right) $$
This means we draw from a Gaussian distribution centered at 0 with variance $2/n^{(l-1)}$, i.e. standard deviation $\sqrt{2/n^{(l-1)}}$. This scheme is known as He initialization, and it makes training stable.
Implementation
In practice, this initialization is implemented using a normal distribution with zero mean and standard deviation $\sqrt{2 / n^{(l-1)}}$:
```python
W_layer: Matrix = np.random.normal(
    loc=0.0,
    scale=np.sqrt(2.0 / n_prev),
    size=(n_curr, n_prev),
).astype(np.float64)
```
Here:
- `n_prev` $= n^{(l-1)}$ — number of neurons in the previous layer
- `n_curr` $= n^{(l)}$ — number of neurons in the current layer
The bias vector is initialized to zeros:
```python
b_layer: Vector = np.zeros(
    (n_curr, 1),
    dtype=np.float64,
)
```
This choice ensures stable signal propagation during the first forward pass and prevents both exploding and vanishing gradients.
These initialization principles are applied independently at each layer, ensuring that the variance of activations remains approximately constant throughout the network. This provides a stable starting point for gradient-based optimization and significantly improves convergence behavior in practice.
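This variance-preservation property can be checked numerically. The sketch below (layer widths and sample count chosen arbitrarily) pushes a random signal through several He-initialized ReLU layers and prints the activation variance at each one; after the first layer the variance stays roughly constant:

```python
import numpy as np

rng = np.random.default_rng(42)
widths = [512, 512, 512, 512]                 # arbitrary layer sizes
h = rng.normal(0.0, 1.0, (widths[0], 1000))   # unit-variance input signal

vars_ = []
for l in range(1, len(widths)):
    # He initialization: zero mean, variance 2 / n^(l-1)
    W = rng.normal(0.0, np.sqrt(2.0 / widths[l - 1]), (widths[l], widths[l - 1]))
    h = np.maximum(0.0, W @ h)                # ReLU forward pass
    vars_.append(h.var())
    print(f"layer {l}: Var(h) = {h.var():.3f}")
```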
References
[1] Press, N. (1978). An astrological suicide study. *Journal of Geocosmic Research*, 2(2), 23–33.
[2] Dean, G., & Mather, A. (1977). *Recent Advances in Natal Astrology*. Analogic, pp. 207, 252.
[3] Morinus, J. B. (1661). *Astrologia Gallica*, Books 16–26. The Hague.