Optimization of an algorithm

2022-04-09

There is a script in our project that downloads the translations from the internal translation tool into files inside a folder in the web project, and it takes a long time (~10-15 min), so I decided to take a look at what was happening. After a quick read I spotted a fragment of code with four nested loops, and my intuition told me there might be a bottleneck there.

The code had a structure similar to this (simplified for easier understanding):

$customTranslationsDataStructure = [
    'es_ES' => [
        'MESSAGES' => [
            'filename1' => [
                'translationId1' => 'translationValue1',
                'translationId2' => 'translationValue2'
            ],
            'filename2' => [
                'translationId1' => 'translationValue1',
                'translationId2' => 'translationValue2'
            ]
        ]
    ],
    'en_US' => [
        'MESSAGES' => [
            'filename3' => [
                'translationId1' => 'translationValue1',
                'translationId2' => 'translationValue2'
            ],
            'filename4' => [
                'translationId1' => 'translationValue1',
                'translationId2' => 'translationValue2'
            ]
        ]
    ]
];

foreach ($customTranslationsDataStructure as $locale => $subFolders) {
    foreach ($subFolders as $subFolder => $files) {
        foreach ($files as $file => $translations) {
            // One file per locale/subfolder/filename combination
            $fileName = $destinationPath.'/'.$locale.'/'.$subFolder.'/'.$file;
            $this->openFile($fileName);
            $this->write($fileName, $this->getFileHeaders());
            // One write call per translation
            foreach ($translations as $translationId => $translationValue) {
                $this->write($fileName, $this->formatTranslation($translationId, $translationValue));
            }
            $this->closeFile($fileName);
        }
    }
}

Before changing anything, I decided to gather metrics to confirm my investigation, so I added a few simple lines of code that print some system information (sketched right after this list):

  • Memory usage
  • Memory peak usage
  • Execution time
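
A minimal sketch of that instrumentation, using PHP's built-in timing and memory functions (the original lines are not shown here, so treat this as an approximation):

$startTime = microtime(true);

// ... run the translation export ...

// Print the metrics as JSON, mirroring the output shown below
echo json_encode([
    'Total time in seconds' => microtime(true) - $startTime,
    'Current memory usage in MB' => memory_get_usage(true) / 1024 / 1024,
    'Peak memory usage in MB' => memory_get_peak_usage(true) / 1024 / 1024,
], JSON_PRETTY_PRINT);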

Here are the results:

{
"Total time in seconds" : 597.04154491425, # 9,95 min
"Function time in seconds" : 445.33128285408, # 7,41667 min
"Peak memory usage in MB" : 79.081642150879
}

In addition, I discovered that $subFolder was always the same constant value ('MESSAGES'), so the second loop was O(1) and the complexity of the algorithm was O(n) * O(1) * O(f) * O(t) = O(n * f * t), with n locales, f files and t translations per file. So I wondered if I could reduce it to a single O(n) loop, because if you read the code, it boils down to "write some content into a translation file".

Iteration 1:

For the first iteration I decided to flatten the keys of the array so that each key represents the file path. By doing this I could remove two loops from the code.
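
The flattening itself isn't shown in the post (presumably the keys were built where the data is originally assembled, so no extra loops were added); a hypothetical helper that would produce the structure below could look like this:

// Hypothetical helper, for illustration only: collapses the three key levels
// (locale, subfolder, filename) into a single path-like key
function flattenTranslations(array $nested): array
{
    $flat = [];
    foreach ($nested as $locale => $subFolders) {
        foreach ($subFolders as $subFolder => $files) {
            foreach ($files as $file => $translations) {
                $flat[$locale.'/'.$subFolder.'/'.$file] = $translations;
            }
        }
    }

    return $flat;
}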

$customTranslationsDataStructure = [
    'es_ES/MESSAGES/filename1' => [
        'translationId1' => 'translationValue1',
        'translationId2' => 'translationValue2'
    ],
    'es_ES/MESSAGES/filename2' => [
        'translationId1' => 'translationValue1',
        'translationId2' => 'translationValue2'
    ],
    'en_US/MESSAGES/filename3' => [
        'translationId1' => 'translationValue1',
        'translationId2' => 'translationValue2'
    ],
    'en_US/MESSAGES/filename4' => [
        'translationId1' => 'translationValue1',
        'translationId2' => 'translationValue2'
    ]
];

foreach ($customTranslationsDataStructure as $file => $translations) {
    $fileName = $destinationPath.'/'.$file;
    $this->openFile($fileName);
    $this->write($fileName, $this->getFileHeaders());
    foreach ($translations as $translationId => $translationValue) {
        $this->write($fileName, $this->formatTranslation($translationId, $translationValue));
    }
    $this->closeFile($fileName);
}

To my surprise, this "optimization" did not work: the duration was almost the same and the memory usage actually increased.

{
"Total time in seconds" : 590.02654569548, # 9,83 min vs 9,95 min
"Function time in seconds" : 405.33128285408, # 6,75 min vs 7,41667 min
"Peak memory usage in MB" : 83.700309753418 # 84 MB vs 79 MB
}

Iteration 2:

For the second iteration I decided to keep the flattened keys and change the value from an array to a string representing the whole content of the file.

Note: this implies modifying other loops that already run to structure the data, so I did not have to add new ones (see the sketch after the code below).

$customTranslationsDataStructure = [
    'es_ES/MESSAGES/filename1' => 'some string that represents the content of the file',
    'es_ES/MESSAGES/filename2' => 'some string that represents the content of the file',
    'en_US/MESSAGES/filename3' => 'some string that represents the content of the file',
    'en_US/MESSAGES/filename4' => 'some string that represents the content of the file'
];

foreach ($customTranslationsDataStructure as $file => $fileContent) {
    $fileName = $destinationPath.'/'.$file;
    $this->openFile($fileName);
    $this->write($fileName, $fileContent);
    $this->closeFile($fileName);
}
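
To illustrate the note above, here is a hypothetical sketch of how the already-existing structuring loops could concatenate the file content while assembling the data ($rawTranslations and the surrounding code are assumptions, not the project's actual code):

// Build the flattened path => content map inside the loops that already
// structure the data, so no additional passes are needed
$customTranslationsDataStructure = [];
foreach ($rawTranslations as $locale => $subFolders) {
    foreach ($subFolders as $subFolder => $files) {
        foreach ($files as $file => $translations) {
            $content = $this->getFileHeaders();
            foreach ($translations as $translationId => $translationValue) {
                $content .= $this->formatTranslation($translationId, $translationValue);
            }
            $customTranslationsDataStructure[$locale.'/'.$subFolder.'/'.$file] = $content;
        }
    }
}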

Benchmarks:

{
"Total time in seconds" : 201.28873109818, # 3,35 min
"Function time in seconds" : 40.77, # 445/40 ≈ x11 times faster
"Peak memory usage in MB" : 91.849418640137 # 91/79 ≈ x1,15 increase of memory usage
}

Aha! This actually worked as expected:

  • Reduced the loops to a single O(n) one
  • The total time was reduced significantly
  • As a bonus, the write operations per file went from (t + 1) down to a single one

But as always, there are trade-offs:

  • The memory usage increased (but this is not a problem, because it is still low)
  • The time of the function that builds $customTranslationsDataStructure with the file contents increased a little (not enough to worry about: less than 10 s)

What about duplicated translations?

Yes, another surprise I encountered: something that was "supposedly impossible" to happen. But the code does not lie!

Example:

[
    'translationId1' => 'my awesome translation',
    'translationId1' => 'whatever translation' // duplicated id with a different value
]

So…

Iteration 3:

How could I handle this new scenario? Which value should I keep? Two possible answers came to mind:

  1. Using strpos() (a substring check) to see whether the translation is already "concatenated"
  2. Having some “cache”

After implementing both proposals, I decided to go with the second one, since access to an array (hash) is O(1). The hash key was the concatenation of the $filename and the $translationId, so duplicates could be skipped. This solution raised the memory usage a bit, but it also decreased the execution time thanks to the skipped translations.
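
A minimal sketch of that "cache", assuming the content is concatenated as in the previous iteration (variable names are hypothetical):

// Hash "cache": O(1) lookup on file + translationId to skip duplicates,
// keeping the first value encountered
$alreadyWritten = [];
foreach ($translations as $translationId => $translationValue) {
    $cacheKey = $fileName.'#'.$translationId;
    if (isset($alreadyWritten[$cacheKey])) {
        continue; // duplicated translation id for this file: skip it
    }
    $alreadyWritten[$cacheKey] = true;
    $content .= $this->formatTranslation($translationId, $translationValue);
}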

{
"Total time in seconds" : 188.28873109818, # 3,13 minutes
"Function time in seconds" : 35.77, # 445/36 ≈ x12 times faster
"Peak memory usage in MB" : 209.34106445312 # 209/79 ≈ x2,6 increase of memory usage
}

At this point, I found that these duplicated translations had different values, and with this solution I take the first one instead of the last one (which is what the original code kept, since later values overwrote earlier ones). So the fix for this new "problem" is to normalize all translations.

Validations

How to validate that the actual output (remember that these are files with some content) is the same as the previous one, so the old algorithm can safely be replaced? Well, I found a command-line utility that helped me (-r compares the directories recursively and -C 1 prints one line of context around each difference):

diff -r -C 1 --suppress-common-lines directory1 directory2

And when there are differences, it shows a message like:

*** es_ES/MESSAGES/filename1
--- es_ES/MESSAGES/filename1
***************
*** 52,54 ****
"my awesome translation"
--- 52,54 ----
! "whatever translation"

Conclusion

It has been an entertaining exercise, full of unexpected surprises and learnings, and I enjoyed it a lot. I also got to put my knowledge of algorithm design into practice and validate it with real numbers. I'm very happy with the final solution.

