3K2FVFHEU5R4HBCGYR5QTAJ33MZ3GVMCGHWT2GFNJPLE74AV53LAC
Basic support for this added in 3K2FVF.
This is invoked as pijul log -- <file_or_dir>*
. Directory filters are recursive, so passing a filter my_dir
will show the logs for bothmy_dir
and for its contents. In the future you may want to support more fancy glob patterns, but for now this seems to work pretty well.
VFNKNHZRWSXIBPVBPER2NFD2FGRMKLZS7MAMWF7H6IAOSY6JMQEAC
Hi @ammkrn. Your changes are a good start, but there’s a much more efficient way (the performance of log
is quite critical, actually): if you look at pull, you will see that Pijul maintains a table mapping “inodes” to the changes that affected them, and the reverse table.
This is likely to be many, many times faster than opening each changefile (especially if changefiles are large).
If you look at pull, you will see that Pijul maintains a table mapping “inodes” to the changes that affected them, and the reverse table.
I spent a fair amount of time on this today but ultimately wasn’t able to get anything done by emulating the code in push/pull, there seems to be quite a bit of work between point A and point B here. Using the txn/graph methods is returning file not found errors that are difficult to make heads or tails of without any documentation. ChangeId and Position are currently extra opaque since they’re both hidden and undocumented.
You know how sorry I am about that situation. Since we discussed documentation last week, I’ve been thinking about where to start to have the highest impact (where the 20% are, in Pareto terms).
There are really many concepts involved:
Hash
is a long external identifier of a patch. It is essentially a cryptographic hash of the byte representation of a patch, with a few subtleties to make patches detachable from their content (and avoid downloading large obsolete files).
ChangeId
is an internal identifier of a patch, computed by initially taking the first 64 bits of the hash, and incrementing that if there’s a collision. There are two tables mapping Hash
to ChangeId
in the database (we could probably save some disk space by only registering collisions in these tables, but I’m not sure how to do that, or how much space would actually be saved).
The main internal datastructure in Pijul is a graph of vertices. The representation of that structure, and the basic operations on that structure, are quite straightforward, and actually form a rather small nucleus inside Pijul. Because of performance issues linked to the underlying on-disk representation, and because of a very large number of edge cases, they took forever to debug, but now seem relatively stable. The type of a vertex in that graph is Vertex<ChangeId>
, which is given by a ChangeId
(a wrapper around u64
), a start position and an end position (both u64
), which are byte offsets inside the change given by ChangeId
. A Position<ChangeId>
is a position inside a change. Some vertices of the graph are empty, meaning that their start and end position are equal. In that case, there’s an identity between positions and these vertices.
“inodes” in my sentence above referred to the empty vertices at the start of each folder or file, used in the graph to make edits inside a folder or file commute with operations higher up in the folder hierarchy. Even though these empty vertices are actually also positions, Pijul overloads the term “inode” with another meaning: identifiers of a file in the user’s file system. This distinction is necessary to remember to-be-recorded files after a pijul add
, as well as to output conflicts on file paths (imagine the case where Alice and Bob both rename the same file f
into a
and b
, respectively, or conversely, the case where the give two different files a
and b
the same name f
). Pijul maintains a table mapping the two kinds of “inodes” to each other. Formally, the “graph kind of inodes” is called Position<ChangeId>
, and the filesystem identitiers are Inode
s.
Patches contain information about which inodes (in the graph sense) they change. This is useful to do partial clones, for example. Pijul pristines contain a table mapping touched inodes to the change that touched them.
Thanks for the help, I apologize if it sounded like I was blaming you or being accusatory
You didn’t sound like that, I totally understand the frustration, most of the things I use are new/experimental and hence poorly documented. So, even if you had sounded accusatory, I am willing to take that blame, don’t worry!
Documentation is definitely an area I have to work a lot on. One of my projects is to try and convince a friend of mine, who researched proof theory and language design for years, to help me prove the algorithms formally, which would give an even better documentation. I have a manually written pseudo-proof, which helped me debug the system a lot, but it isn’t machine-checkable.
I don’t understand how to get from a txn
and a libpijul::Inode
representing a file to the entries in the “table mapping ‘inodes’ to the changes that affected them”. Everything seems to either be an iterator over IE all of the file inodes (iter_inodes, looping over cursor_iter_next, same for the tree iters), or it’s a mapping of scalars to scalars like Inode <-> Position<ChangeId>
.
As a side note, would it be possible in the future to change the associated types for things like InodesCursor
to take libpijul::pristine::Cursor
, or to have a to/from implementation? Getting a libpijul::pristine::Cursor<.., Self::InodesCursor, ..>
and then having to fish out the inner sanakirja cursor to use the iterators was hard to figure out since there are macros and big generics between them.
I don’t understand how to get from a txn and a libpijul::Inode representing a file to the entries in the “table mapping ‘inodes’ to the changes that affected them”. Everything seems to either be an iterator over IE all of the file inodes (iter_inodes, looping over cursor_iter_next, same for the tree iters), or it’s a mapping of scalars to scalars like Inode <-> Position.
That one is a bit tricky: Inode
is a type representing a file in the filesystem, and these are never actually “touched” by changes, since changes only modify the graph. After a number of changes are applied, new files are output to the working copy, and get assigned an Inode
in the process. For example, Inode
s are purely local, they are never shared between different instances of a repository. In some cases, the same file in different channels might even get assigned different Inode
s (however, if you switch between two channels that have the same file, I believe the Inode
stays the same).
One extra detail is that Inode
always correspond to a single file and a single path (which is also a valid path in the working copy), whereas conflicts might cause multiple files to have the same path in the graph, or a single file to have multiple different paths.
So, what you want to do here is ambiguous, and there are two interpretations (I even think both might coexist):
If you actually have an Inode
, it means you have already output the graph into the working copy, and the path is unambiguous. This is probably the preferred way, since it’s the most intuitive for the user. In this case, you would need to find which vertex represents the file in the graph, which you can do with the get_inodes
and get_revinodes
methods (see libpijul::output::output
and libpijul::record
for examples of how to use these).
If all you have is a path, and you can’t map it to an Inode
, you might want to map it directly to a graph vertex. I believe users should need a special CLI flag to make sure this is really what they want to do, since the graph is supposed to be transparent and hence not accessible directly. For thath reason, it is perfectly OK, in a first implementation, to just reject the request. Another possibility is to use the functions in libpijul::fs
to get one candidate for the graph vertex representing that path (follow_oldest_path
does this: it gives the only possible candidate when there are no conflicts, and the oldest one else, which is generally what users expect, when recording for example).
In both cases, you can use the get_touched_files
and get_rev_touched_files
methods, from the following macros:
table_get!(touched_files, Position<ChangeId>, ChangeId, DepsError);
table_get!(rev_touched_files, ChangeId, Position<ChangeId>, DepsError);
This
As a side note, would it be possible in the future to change the associated types for things like InodesCursor to take libpijul::pristine::Cursor, or to have a to/from implementation?
I need to remember my “plan for the day associated type constructors would become a reality in Rust”, since it seems this has happened. However:
Getting a libpijul::pristine::Cursor<.., Self::InodesCursor, ..> and then having to fish out the inner sanakirja cursor to use the iterators was hard to figure out since there are macros and big generics between them.
Don’t do that! Sanakirja is terribly hard to use and has footguns everywhere. I have plans to make a higher-level library based on the experience I have with Pijul. While this is probably not too hard for general cases, the specific case of Pijul channels would be hard to deal with automatically and efficiently.
Instead, Cursor<…>
implements Iterator
directly (via the initialized_cursor!
macro, defined in libpijul::pristine
). For example uses, see how pijul::commands::protocol
does this, for example:
for x in txn.log(&*channel.read(), from)? {
let (n, (h, m)) = x?;
let h_int = txn.get_internal(h)?.unwrap();
if paths.is_empty()
|| paths.iter().any(|x| {
x.change == *h_int
|| txn.get_touched_files(x, Some(h_int)).unwrap().is_some()
})
{
let h: Hash = h.into();
let m: Merkle = m.into();
writeln!(o, "{}.{}.{}", n, h.to_base32(), m.to_base32())?
}
}
Also, libpijul::fs::get_latest_touch
has a more general way of doing this, by iterating over the changes
and touched
table at the same time, stopping at the first match, so that the complexity is in O(min(age, changes))
, where age
is the number of changes applied after the last one that touched the file, and changes
is the total number of changes that touched the file.
2J2AST52AMZEYVGUAIET5XRSXI25M2OUJXODAL3AALTYGIP3VJIAC
2J2AST is an attempt just based on trial and error; the iterator yielded from iter_touched
seems to always start with the relevant files, but usually contains superfluous elements as well. This patch doesn’t always yield the filtered items in order; this can be pretty easily solved by going over reverse_log once to check the order, but I’m not sure what the performance implications of that are (I assume they’re not bad since that’s what’s usually done for log
).
Looks good. Just one comment, why do you use the scary-looking
let mut touched_cursor = txn.iter_touched(inode_position)?;
while let Some((position, touched_change_id)) =
txn.cursor_touched_files_next(&mut touched_cursor.cursor)?
{
Instead of
for Some((position, touched_change_id)) in txn.iter_touched(inode_position)? {
P5LC4V2SJCWXUPSB27A5PDY6263DZB6EXKEGA46X4OLRHMMRXUDAC
Touche. Is there a reason to have both the Iterator implementation and an explicit next()
method then? PSLC4V fixes the filtered-by-file changes not always being in order by going over the revlog, but only as much as it needs to. Iterating over the revlog for the main pijul repo only takes a few milliseconds on my old laptop so I don’t anticipate that being a problem.
This discussion can be closed. For posterity, this feature has been implemented in pijul:main
This reminds me that I didn’t answer your last question: the reason is that we want zero-copy iterators, which in some cases requires associated type constructors. I should probably take another look at this issue, now that the traits are stable.
E.g.
would only show changes affecting PATH (file or directory).
BTW right now there is an error message that advice to use the very syntax that was entered!