
#430 [CRASH, ADD, CORRUPTION] new rustX test suite loop

Closed on May 16, 2021
tankf33der on May 15, 2021

I once saw a "Pristine corrupt" alarm, and now I have a crash report from the add command.

All of this comes from running the rustX test suite in a loop.

tankf33der on May 15, 2021
$ cat /tmp/report-12de1f12-b805-47c7-aef2-2796b00b3efd.toml
name = 'pijul'
operating_system = 'unix:Unknown'
crate_version = '1.0.0-alpha.48'
explanation = '''
Panic occurred in file '/home/mpech/.cargo/registry/src/github.com-1ecc6299db9ec823/sanakirja-core-1.2.7/src/btree/page_unsized/put.rs' at line 232
'''
cause = '''
assertion failed: HDR + hdr.n() as usize * L::OFFSET_SIZE + L::OFFSET_SIZE + size <
    data as usize'''
method = 'Panic'
backtrace = '''

   0: 0x5633d7a226fd - core::panicking::panic::h5db36e5a1d6d2297
   1: 0x5633d7784ca7 - sanakirja_core::btree::page_unsized::put::put::h4e6a3c163ed00340
   2: 0x5633d775d55e - sanakirja_core::btree::put::put::ha2fe645091586d7b
   3: 0x5633d77cc6c2 - <libpijul::pristine::sanakirja::GenericTxn<sanakirja::environment::muttxn::MutTxn<alloc::sync::Arc<sanakirja::environment::Env>,()>> as libpijul::pristine::TreeMutTxnT>::put_tree::h7fc979828da740c5
   4: 0x5633d73abe5c - libpijul::fs::make_new_child::he1d8f6f878779a72
   5: 0x5633d73af367 - libpijul::fs::add_inode::h698ff6ec4ae7dfb5
   6: 0x5633d738ed06 - libpijul::working_copy::filesystem::FileSystem::add_prefix_rec::hb200e2066ed11e4f
   7: 0x5633d74b25f7 - pijul::run::{{closure}}::h1addda3139a1a578
   8: 0x5633d744d3e8 - <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll::h17c33ee9c519857f
   9: 0x5633d728be72 - tokio::runtime::thread_pool::ThreadPool::block_on::h08771bb26e14a89a
  10: 0x5633d7427ac0 - tokio::runtime::Runtime::block_on::h0a6547192b68156a
  11: 0x5633d73e61b6 - pijul::main::h6ddd0222d5aa7776
  12: 0x5633d73c3b63 - std::sys_common::backtrace::__rust_begin_short_backtrace::hcd09ee0b4eeec1e8
  13: 0x5633d73c3f99 - std::rt::lang_start::{{closure}}::h41872dd1011dfad7
  14: 0x5633d79fd0e8 - std::rt::lang_start_internal::h35b6f595e94b11f4
  15: 0x5633d73e6272 - main
  16: 0x7ff0cc399a03 - <unresolved>'''
$
tankf33der on May 15, 2021

The "Pristine corrupt" message, again from the add command:

SKIP
....
+ pijul record -am89c58eac682
+ for G in $(git log  --pretty=format:%h | head -64)
+ git checkout -q c7cb72828d2
+ pijul add -r src
[2021-05-15T05:40:12Z ERROR libpijul::pristine::sanakirja] IO(Os { code: 2, kind: NotFound, message: "No such file or directory" })
Error: Pristine corrupt
delta:~/rustX $ 
pmeunier on May 15, 2021

Congrats! If you can, I’d be interested in a tarball of the repository.

tankf33der on May 15, 2021

Latest facts of one loop:

  • 64 iterations = 64 commits
  • ~26k files
  • size of .pijul/ dir ~80MB

On corruption, .pijul is 9GB, so it ate all the disk space on the VPS and left no room for a crash report in /tmp.

The corrupted repo became hugely bloated.

tankf33der on May 15, 2021

Going from 80MB to all disk space in one step looks like an integer overflow, and you are going to fill the whole Universe.

Somewhere a calculation was not checked by an assertion and produced a huge number:

assertion failed: HDR + hdr.n() as usize * L::OFFSET_SIZE + L::OFFSET_SIZE + size <
    data as usize'''
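
For readers unfamiliar with this kind of check, here is a rough sketch of what an assertion of this shape expresses: the header, the offset array (with one extra slot for the new entry) and the new entry itself must end before the start of the data region. The constants `HDR` and `OFFSET_SIZE` and the function `entry_fits` are hypothetical stand-ins, not the actual sanakirja-core definitions. The sketch uses checked arithmetic so that an overflow would be reported as "does not fit" rather than silently wrapping.

```rust
// Hypothetical values for illustration only; the real constants live in
// sanakirja-core and may differ.
const HDR: usize = 16; // assumed page header size in bytes
const OFFSET_SIZE: usize = 2; // assumed size of one slot in the offset array

/// Returns true if a new entry of `size` bytes fits on a page that already
/// holds `n` entries and whose data region starts at byte offset `data`.
/// Mirrors the shape of the failed assertion:
///   HDR + n * OFFSET_SIZE + OFFSET_SIZE + size < data
/// Any arithmetic overflow is treated as "does not fit".
fn entry_fits(n: usize, size: usize, data: usize) -> bool {
    n.checked_mul(OFFSET_SIZE)
        .and_then(|offsets| offsets.checked_add(HDR))
        .and_then(|used| used.checked_add(OFFSET_SIZE))
        .and_then(|used| used.checked_add(size))
        .map_or(false, |needed| needed < data)
}

fn main() {
    // 10 entries, data region starts at byte 4000: 16 + 20 + 2 + 100 = 138 < 4000.
    assert!(entry_fits(10, 100, 4000));
    // Same entry, but the data region already starts at byte 120: 138 >= 120.
    assert!(!entry_fits(10, 100, 120));
    // An absurdly large size overflows and is rejected instead of wrapping.
    assert!(!entry_fits(10, usize::MAX, 4000));
}
```

Under this reading, a failed assertion means the page had no room left for the entry, which does not by itself prove an integer overflow.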
pmeunier on May 15, 2021

My guess is that it tries to read from an unallocated page (possibly because of a double free). Unfortunately, problems in the tree table aren’t easily reproducible, because the indices are mostly random. I’m starting the test at the moment.

How often do you get the error?

tankf33der on May 15, 2021

I am just running it in a loop: while [ true ]; do ./runme.sh || break; done

tankf33der on May 15, 2021

My collection of errors:

+ for G in $(cat ../rustX/commits64)
+ git checkout -q 21e92b97309
+ pijul add -r src
+ pijul record -am21e92b97309
+ for G in $(cat ../rustX/commits64)
+ git checkout -q 48517460a5b
+ pijul add -r src
+ pijul record -am48517460a5b
+ for G in $(cat ../rustX/commits64)
+ git checkout -q 57291b8c5ee
+ pijul add -r src
Error: No such file or directory (os error 2)
$ 


+ pijul record -am3db335b934d
+ for G in $(cat ../rustX/commits64)
+ git checkout -q d2df620789c
+ pijul add -r src
+ pijul record -amd2df620789c
+ for G in $(cat ../rustX/commits64)
+ git checkout -q c61e8face09
+ pijul add -r src
+ pijul record -amc61e8face09
+ for G in $(cat ../rustX/commits64)
+ git checkout -q 2a245e0226c
+ pijul add -r src
[2021-05-15T10:27:58Z ERROR libpijul::pristine::sanakirja] IO(Os { code: 2, kind: NotFound, message: "No such file or directory" })
Error: Pristine corrupt
$
tankf33der on May 15, 2021

I got it

https://envs.sh/lU.xz

after this type of error:

[2021-05-15T10:27:58Z ERROR libpijul::pristine::sanakirja] IO(Os { code: 2, kind: NotFound, message: "No such file or directory" })
Error: Pristine corrupt
pmeunier on May 16, 2021

I made everything deterministic and will send a patch on Pijul to remove the dependency on rand completely. I was able to reproduce and fix this bug, which was a very special case of a deletion in Sanakirja. This is fixed in sanakirja-core 1.2.8, feel free to reopen if it reappears.

pmeunier closed this discussion on May 16, 2021
tankf33der on May 16, 2021

Please keep the dependencies up to date and update all required Cargo.toml files.

pmeunier on May 16, 2021

Right, the Cargo.lock is outdated indeed. I’m doing it now, sorry about that.

pmeunier added a change on May 16, 2021
XZYSNXG4RJNWDD466KMNB5IEWF5KLQCU6AEINDHWA7CX3DQ237NAC
main
tankf33der on May 16, 2021

The whole thrussh family is far behind too.

pmeunier on May 16, 2021

Yes, but only in the Cargo.toml, which isn’t what dictates the actual versions, just the minimal requirements. As long as the Cargo.lock is up to date, everything is fine.

tankf33der on May 16, 2021

Passed an 8+ hour loop.

pmeunier on May 17, 2021

Wow. Was that the last bug in Sanakirja? It was incredibly specific: Sanakirja is a B tree, and that bug only happened in the following case:

Let’s call the root block A, its last child B, and then two consecutive children of B, C and D.

The bug happened when C and D were rebalanced, and the entry between them after the rebalance occupied more space than the entry between them before the rebalance, and the extra space was large enough that there wasn’t enough room on B anymore to hold the new entry. This would cause B to split, which was handled fine in all cases, except when A was the root and B was A’s last child, which would cause entries to be inserted in the wrong order, and to reference freed pages.
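
To make the "last child" case concrete, here is a toy in-memory sketch of hoisting a separator into a parent when one of its children splits. It is not Sanakirja's code: the `Node` type, `split_child` and `in_order` are hypothetical, and leaves are modelled as plain sorted vectors. The point is only that the separator goes at index `i` and the new right half at index `i + 1`, including when `i` is the last child, so that an in-order walk stays sorted.

```rust
/// Toy internal node (hypothetical, for illustration only).
/// Invariant: `children.len() == keys.len() + 1`, and `keys[i]` separates
/// `children[i]` (smaller keys) from `children[i + 1]` (larger keys).
#[derive(Debug)]
struct Node {
    keys: Vec<u64>,
    children: Vec<Vec<u64>>, // leaves modelled as sorted key lists
}

impl Node {
    /// Split the leaf `children[i]` around its median and hoist the median
    /// into this node. The leaf must hold at least one key.
    fn split_child(&mut self, i: usize) {
        let leaf = &mut self.children[i];
        let mid = leaf.len() / 2;
        let right = leaf.split_off(mid + 1);
        let sep = leaf.pop().expect("leaf must be non-empty");
        // When `i` is the last child, `i == keys.len()`, so the separator is
        // appended at the end and the right half becomes the new last child.
        self.keys.insert(i, sep);
        self.children.insert(i + 1, right);
    }

    /// Walk children and separators in order.
    fn in_order(&self) -> Vec<u64> {
        let mut out = Vec::new();
        for (i, child) in self.children.iter().enumerate() {
            out.extend_from_slice(child);
            if let Some(k) = self.keys.get(i) {
                out.push(*k);
            }
        }
        out
    }
}

fn main() {
    let mut root = Node {
        keys: vec![10, 20],
        children: vec![vec![1, 2, 3], vec![11, 12, 13], vec![21, 22, 23, 24, 25]],
    };
    root.split_child(2); // split the last child
    let walk = root.in_order();
    assert!(walk.windows(2).all(|w| w[0] < w[1]));
    println!("{:?}", root);
}
```

If the separator or the new child were inserted at the wrong index in this toy model, the in-order walk would no longer be sorted, which is the kind of ordering problem described above.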

Additionally, this took me a little while to figure out because my debugging code was showing the keys (which are integers) in little-endian base32, which isn’t very helpful to spot ordering problems.
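
A small sketch of why such a display hides ordering problems, assuming the 5-bit digits are printed least-significant first with the RFC 4648 alphabet (the exact debug format used in Pijul may differ, and `base32_le` is a hypothetical helper): the first character then reflects the lowest bits of the key, so sorting the strings does not sort the integers.

```rust
const ALPHABET: &[u8; 32] = b"ABCDEFGHIJKLMNOPQRSTUVWXYZ234567";

/// Encode `x` as 13 base32 digits, least-significant digit first
/// (an assumed format, for illustration only).
fn base32_le(mut x: u64) -> String {
    let mut out = String::with_capacity(13);
    // 64 bits need 13 five-bit digits (13 * 5 = 65 >= 64).
    for _ in 0..13 {
        out.push(ALPHABET[(x & 31) as usize] as char);
        x >>= 5;
    }
    out
}

fn main() {
    // Numerically 1 < 32, but the encodings compare the other way around,
    // because the first character encodes the lowest five bits.
    assert!(1u64 < 32u64);
    assert!(base32_le(1) > base32_le(32));

    // Sorting keys by their printed form scrambles the numeric order.
    let mut keys = vec![1u64, 2, 32, 33];
    keys.sort_by_key(|k| base32_le(*k));
    assert_eq!(keys, vec![32, 1, 33, 2]);

    for k in &keys {
        println!("{:>2} -> {}", k, base32_le(*k));
    }
}
```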

I don’t think there are many other edge cases in Sanakirja; it isn’t a very complex data structure.

tankf33der on May 17, 2021

Passed all the test suites I have so far.