Примерно с 13:30 про Arian-5 :
Дмитрий Дагаев писал(а):
Восемь смертных грехов профессиональных программистов
Дагаев Дмитрий Викторович, главный эксперт, АО «РАСУ», Росатом, Москва
https://youtu.be/5govjItgS3o В ходе обсуждения, было признано полезным результаты "раскопок в стиле цифровой археологии" на январь 2019 (
https://stackoverflow.com/questions/527 ... 0#54345040 ) опубликовать в данном подразделе форума.
In short: it hasn't be pure programming problem, it's more complex problem
The defect on Ariane 5 was not caused by a single cause. Throughout the development and testing processes, there were many stages at which this defect could be identified.
- The software module was reused in a new environment where operating conditions differed from the requirements of the software module. These requirements have not been revised.
- The system has identified and recognized the error. Unfortunately, the specification of the error handling mechanism was inconsistent and caused final destruction.
- An erroneous module has never been properly tested in a new environment — neither at the hardware level, nor at the system integration level. Therefore, the fallacy of development and implementation was not detected.
See diff with correct code:
Код:
--- LIRE_DERIVE.ads 000 Tue Jun 04 12:00:00 1996
+++ LIRE_DERIVE.ads Fri Jan 29 13:50:00 2010
@@ -3,10 +3,17 @@
if L_M_BV_32 > 32767 then
P_M_DERIVE(T_ALG.E_BV) := 16#7FFF#;
elsif L_M_BV_32 < -32768 then
P_M_DERIVE(T_ALG.E_BV) := 16#8000#;
else
P_M_DERIVE(T_ALG.E_BV) := UC_16S_EN_16NS(TDB.T_ENTIER_16S(L_M_BV_32));
end if;
-P_M_DERIVE(T_ALG.E_BH) :=
- UC_16S_EN_16NS (TDB.T_ENTIER_16S ((1.0/C_M_LSB_BH) * G_M_INFO_DERIVE(T_ALG.E_BH)));
+L_M_BH_32 := TBD.T_ENTIER_32S ((1.0/C_M_LSB_BH) * G_M_INFO_DERIVE(T_ALG.E_BH));
+
+if L_M_BH_32 > 32767 then
+ P_M_DERIVE(T_ALG.E_BH) := 16#7FFF#;
+elsif L_M_BH_32 < -32768 then
+ P_M_DERIVE(T_ALG.E_BH) := 16#8000#;
+else
+ P_M_DERIVE(T_ALG.E_BH) := UC_16S_EN_16NS(TDB.T_ENTIER_16S(L_M_BH_32));
+end if;
More detailed info:
A)
The “Operand Error” error occurred due to an unexpectedly large value of BH (Horizontal Bias) calculated by an internal function based on the value of the “horizontal speed” measured by the sensors on the Platform.
The value of BH served as an indicator of the accuracy of the Platform positioning. the BH value was much larger than expected because the flight path of the Ariane 5 at an early stage was significantly different from the flight path of the Ariane 4 (where this software module was used earlier), which led to a significantly higher "horizontal speed".
The final action, which had fatal consequences, was the termination of the processor. Accordingly, the entire Navigation System has ceased to function. It was technically impossible to resume her actions.
B)
However, Ariane 5, unlike the previous model, had a fundamentally different discipline for performing pre-flight actions - so different that the work of the fatal program module after the launch time did not make sense at all. However, the module was reused without any modifications .
The investigation revealed that there were as many as seven variables involved in type conversion operations in this software module. It turned out that the developers conducted an analysis of all operations that could potentially generate an exception for vulnerability.
It was their very conscious decision to add proper protection to four variables, and leave three - including BH - unprotected. The reason for this decision was the conviction that for these three variables the occurrence of an overflow situation is impossible in principle .
This confidence was supported by calculations showing that the expected range of physical flight parameters, on the basis of which the values of these variables are determined, is such that it cannot lead to an undesirable situation. And that was true - but for the trajectory calculated for the Ariane 4 model.
And the rocket of the new generation Ariane 5 was launched on a completely different trajectory, for which no evaluations were performed. Meanwhile, it (together with a high initial acceleration) was such that the “horizontal speed” exceeded the calculated (for Ariane 4) more than five times.
Protection for all seven (including BH) variables was not provided, because the maximum workload of 80% was declared for the IRS computer. Developers had to look for ways to reduce unnecessary computational costs and they weakened protection where a theoretically undesirable situation could not arise. When it arose, such an exception handling mechanism came into effect, which turned out to be completely inadequate.
This mechanism included the following three basic actions.
- Information about the occurrence of an emergency situation must be transmitted via the bus to the onboard computer OBC.
- In parallel, she - along with the entire context - was recorded in the reprogrammable EEPROM memory (which, during the investigation, it was possible to restore and read its contents).
- The IRS processor was supposed to crash.
The last action turned out to be fatal - it was he who happened in a situation that actually was normal (despite the software exception generated due to an unprotected overflow), which led to the catastrophe.
C)
Velocity was represented as a 64-bit float
A conversion into a 16-bit signed integer caused an overflow
The current velocity of Ariane 5 was too high to be represented as a 16-bit integer
Error handling was suppressed for performance reasons
(
Protection for all seven (including BH) variables was not provided, because the maximum workload of 80% was declared for the IRS computer.
Developers had to look for ways to reduce unnecessary computational costs and they weakened protection where a theoretically undesirable situation could not arise.
)
According to a presentation by Jean-Jacques Levy (who was part of the
team who searched for the source of the problem), the actual source code
in Ada that caused the problem was as follows:
Код:
-- Vertical velocity bias as measured by sensor
L_M_BV_32 := TBD.T_ENTIER_32S ((1.0/C_M_LSB_BV) * G_M_INFO_DERIVE(T_ALG.E_BV));
-- Check, if measured vertical velocity bias ban be
-- converted to a 16 bit int. If so, then convert
if L_M_BV_32 > 32767 then
P_M_DERIVE(T_ALG.E_BV) := 16#7FFF#;
elsif L_M_BV_32 < -32768 then
P_M_DERIVE(T_ALG.E_BV) := 16#8000#;
else
P_M_DERIVE(T_ALG.E_BV) := UC_16S_EN_16NS(TDB.T_ENTIER_16S(L_M_BV_32));
end if;
-- Horizontal velocity bias as measured by sensor -- is converted to a 16 bit int without checking P_M_DERIVE
P_M_DERIVE(T_ALG.E_BH) := UC_16S_EN_16NS (TDB.T_ENTIER_16S ((1.0/C_M_LSB_BH) * G_M_INFO_DERIVE(T_ALG.E_BH)));
The last line (shown here as two lines of text) caused the overflow,
where the conversion from 64 bits to 16 bits unsigned is not protected.
The code before is protected by testing before the assignment if the
number is too big.
The correct code would have been:
Код:
L_M_BV_32 := TBD.T_ENTIER_32S ((1.0/C_M_LSB_BV) * G_M_INFO_DERIVE(T_ALG.E_BV));
if L_M_BV_32 > 32767 then
P_M_DERIVE(T_ALG.E_BV) := 16#7FFF#;
elsif L_M_BV_32 < -32768 then
P_M_DERIVE(T_ALG.E_BV) := 16#8000#;
else
P_M_DERIVE(T_ALG.E_BV) := UC_16S_EN_16NS(TDB.T_ENTIER_16S(L_M_BV_32));
end if;
L_M_BH_32 := TBD.T_ENTIER_32S ((1.0/C_M_LSB_BH) * G_M_INFO_DERIVE(T_ALG.E_BH));
if L_M_BH_32 > 32767 then
P_M_DERIVE(T_ALG.E_BH) := 16#7FFF#;
elsif L_M_BH_32 < -32768 then
P_M_DERIVE(T_ALG.E_BH) := 16#8000#;
else
P_M_DERIVE(T_ALG.E_BH) := UC_16S_EN_16NS(TDB.T_ENTIER_16S(L_M_BH_32));
end if;
in other words, the same overflow check should have been present for the horizontal part of the calculation (E_BH) as was already present for the vertical part of the calculation (E_BV).
'Source:
http://moscova.inria.fr/~levy/talks/10e ... slongo.pdf( On 2019-01-24 can be found at:
http://para.inria.fr/~levy/talks/10ensl ... slongo.pdf)
https://habr.com/ru/company/pvs-studio/blog/306748/----
prospero78 писал(а):
Не хотелось бы повторения даже на уровне "Ариан-5".
А там был даже не С++. Там была (. . .) Ада!!
В обощённом виде:
VM390 писал(а):
Всему виной дополнительным требования по загрузке компьютера на уровне 80% по мощности (хотя 100% загрузка бортового компьютера РН Ариан, в принципе тоже норма), из-за которых пришлось вводить ограничения на проверку параметров в коде. Плюс, грубая ошибка в использовании принципа «повторноиспользуемости кода», при отсутствии тестирования интеграции в условиях новых входных данных.