pijul/pijul - Discussion #353 - [Feature request] Filter log by prefixes.

#353 [Feature request] Filter log by prefixes.

Closed on September 15, 2021

unidual on February 21, 2021

E.g.

% pijul log -- <PATH>

would only show changes affecting PATH (file or directory).

BTW right now there is an error message that advice to use the very syntax that was entered!

% pijul log -- README.md
error: Found argument 'README.md' which wasn't expected, or isn't valid in this context

If you tried to supply `README.md` as a PATTERN use `-- README.md`

pmeunier added tag Feature request on April 29, 2021

ammkrn added a change on May 25, 2021

Add log filtering by file and directory name(s)

3K2FVFHEU5R4HBCGYR5QTAJ33MZ3GVMCGHWT2GFNJPLE74AV53LAC

ammkrn on May 25, 2021

Basic support for this added in 3K2FVF.

This is invoked as pijul log -- <file_or_dir>*. Directory filters are recursive, so passing a filter my_dir will show the logs for bothmy_dir and for its contents. In the future you may want to support more fancy glob patterns, but for now this seems to work pretty well.

ammkrn added a change on June 2, 2021

Add log filtering by file and directory name(s)

VFNKNHZRWSXIBPVBPER2NFD2FGRMKLZS7MAMWF7H6IAOSY6JMQEAC

pmeunier on July 4, 2021

Hi @ammkrn. Your changes are a good start, but there’s a much more efficient way (the performance of log is quite critical, actually): if you look at pull, you will see that Pijul maintains a table mapping “inodes” to the changes that affected them, and the reverse table.

This is likely to be many, many times faster than opening each changefile (especially if changefiles are large).

ammkrn on July 17, 2021

If you look at pull, you will see that Pijul maintains a table mapping “inodes” to the changes that affected them, and the reverse table.

I spent a fair amount of time on this today but ultimately wasn’t able to get anything done by emulating the code in push/pull, there seems to be quite a bit of work between point A and point B here. Using the txn/graph methods is returning file not found errors that are difficult to make heads or tails of without any documentation. ChangeId and Position are currently extra opaque since they’re both hidden and undocumented.

pmeunier on July 18, 2021

You know how sorry I am about that situation. Since we discussed documentation last week, I’ve been thinking about where to start to have the highest impact (where the 20% are, in Pareto terms).

There are really many concepts involved:

Hash is a long external identifier of a patch. It is essentially a cryptographic hash of the byte representation of a patch, with a few subtleties to make patches detachable from their content (and avoid downloading large obsolete files).
ChangeId is an internal identifier of a patch, computed by initially taking the first 64 bits of the hash, and incrementing that if there’s a collision. There are two tables mapping Hash to ChangeId in the database (we could probably save some disk space by only registering collisions in these tables, but I’m not sure how to do that, or how much space would actually be saved).
The main internal datastructure in Pijul is a graph of vertices. The representation of that structure, and the basic operations on that structure, are quite straightforward, and actually form a rather small nucleus inside Pijul. Because of performance issues linked to the underlying on-disk representation, and because of a very large number of edge cases, they took forever to debug, but now seem relatively stable. The type of a vertex in that graph is Vertex<ChangeId>, which is given by a ChangeId (a wrapper around u64), a start position and an end position (both u64), which are byte offsets inside the change given by ChangeId. A Position<ChangeId> is a position inside a change. Some vertices of the graph are empty, meaning that their start and end position are equal. In that case, there’s an identity between positions and these vertices.
“inodes” in my sentence above referred to the empty vertices at the start of each folder or file, used in the graph to make edits inside a folder or file commute with operations higher up in the folder hierarchy. Even though these empty vertices are actually also positions, Pijul overloads the term “inode” with another meaning: identifiers of a file in the user’s file system. This distinction is necessary to remember to-be-recorded files after a pijul add, as well as to output conflicts on file paths (imagine the case where Alice and Bob both rename the same file f into a and b, respectively, or conversely, the case where the give two different files a and b the same name f). Pijul maintains a table mapping the two kinds of “inodes” to each other. Formally, the “graph kind of inodes” is called Position<ChangeId>, and the filesystem identitiers are Inodes.
Patches contain information about which inodes (in the graph sense) they change. This is useful to do partial clones, for example. Pijul pristines contain a table mapping touched inodes to the change that touched them.

ammkrn on July 18, 2021

@pmeunier Thanks for the help, I apologize if it sounded like I was blaming you or being accusatory. I’ve been keeping some notes that I think would make a good alpha version of a contributor’s guide, I’ll add these and ask for additional input on Zulip.

pmeunier on July 18, 2021

Thanks for the help, I apologize if it sounded like I was blaming you or being accusatory

You didn’t sound like that, I totally understand the frustration, most of the things I use are new/experimental and hence poorly documented. So, even if you had sounded accusatory, I am willing to take that blame, don’t worry!

Documentation is definitely an area I have to work a lot on. One of my projects is to try and convince a friend of mine, who researched proof theory and language design for years, to help me prove the algorithms formally, which would give an even better documentation. I have a manually written pseudo-proof, which helped me debug the system a lot, but it isn’t machine-checkable.

ammkrn on August 3, 2021

@pmeunier

I don’t understand how to get from a txn and a libpijul::Inode representing a file to the entries in the “table mapping ‘inodes’ to the changes that affected them”. Everything seems to either be an iterator over IE all of the file inodes (iter_inodes, looping over cursor_iter_next, same for the tree iters), or it’s a mapping of scalars to scalars like Inode <-> Position<ChangeId>.

As a side note, would it be possible in the future to change the associated types for things like InodesCursor to take libpijul::pristine::Cursor, or to have a to/from implementation? Getting a libpijul::pristine::Cursor<.., Self::InodesCursor, ..> and then having to fish out the inner sanakirja cursor to use the iterators was hard to figure out since there are macros and big generics between them.

pmeunier on August 4, 2021

I don’t understand how to get from a txn and a libpijul::Inode representing a file to the entries in the “table mapping ‘inodes’ to the changes that affected them”. Everything seems to either be an iterator over IE all of the file inodes (iter_inodes, looping over cursor_iter_next, same for the tree iters), or it’s a mapping of scalars to scalars like Inode <-> Position.

That one is a bit tricky: Inode is a type representing a file in the filesystem, and these are never actually “touched” by changes, since changes only modify the graph. After a number of changes are applied, new files are output to the working copy, and get assigned an Inode in the process. For example, Inodes are purely local, they are never shared between different instances of a repository. In some cases, the same file in different channels might even get assigned different Inodes (however, if you switch between two channels that have the same file, I believe the Inode stays the same).

One extra detail is that Inode always correspond to a single file and a single path (which is also a valid path in the working copy), whereas conflicts might cause multiple files to have the same path in the graph, or a single file to have multiple different paths.

So, what you want to do here is ambiguous, and there are two interpretations (I even think both might coexist):

If you actually have an Inode, it means you have already output the graph into the working copy, and the path is unambiguous. This is probably the preferred way, since it’s the most intuitive for the user. In this case, you would need to find which vertex represents the file in the graph, which you can do with the get_inodes and get_revinodes methods (see libpijul::output::output and libpijul::record for examples of how to use these).
If all you have is a path, and you can’t map it to an Inode, you might want to map it directly to a graph vertex. I believe users should need a special CLI flag to make sure this is really what they want to do, since the graph is supposed to be transparent and hence not accessible directly. For thath reason, it is perfectly OK, in a first implementation, to just reject the request. Another possibility is to use the functions in libpijul::fs to get one candidate for the graph vertex representing that path (follow_oldest_path does this: it gives the only possible candidate when there are no conflicts, and the oldest one else, which is generally what users expect, when recording for example).

In both cases, you can use the get_touched_files and get_rev_touched_files methods, from the following macros:

    table_get!(touched_files, Position<ChangeId>, ChangeId, DepsError);
    table_get!(rev_touched_files, ChangeId, Position<ChangeId>, DepsError);

This

As a side note, would it be possible in the future to change the associated types for things like InodesCursor to take libpijul::pristine::Cursor, or to have a to/from implementation?

I need to remember my “plan for the day associated type constructors would become a reality in Rust”, since it seems this has happened. However:

Getting a libpijul::pristine::Cursor<.., Self::InodesCursor, ..> and then having to fish out the inner sanakirja cursor to use the iterators was hard to figure out since there are macros and big generics between them.

Don’t do that! Sanakirja is terribly hard to use and has footguns everywhere. I have plans to make a higher-level library based on the experience I have with Pijul. While this is probably not too hard for general cases, the specific case of Pijul channels would be hard to deal with automatically and efficiently.

Instead, Cursor<…> implements Iterator directly (via the initialized_cursor! macro, defined in libpijul::pristine). For example uses, see how pijul::commands::protocol does this, for example:

                for x in txn.log(&*channel.read(), from)? {
                    let (n, (h, m)) = x?;
                    let h_int = txn.get_internal(h)?.unwrap();
                    if paths.is_empty()
                        || paths.iter().any(|x| {
                            x.change == *h_int
                                || txn.get_touched_files(x, Some(h_int)).unwrap().is_some()
                        })
                    {
                        let h: Hash = h.into();
                        let m: Merkle = m.into();
                        writeln!(o, "{}.{}.{}", n, h.to_base32(), m.to_base32())?
                    }
                }

Also, libpijul::fs::get_latest_touch has a more general way of doing this, by iterating over the changes and touched table at the same time, stopping at the first match, so that the complexity is in O(min(age, changes)), where age is the number of changes applied after the last one that touched the file, and changes is the total number of changes that touched the file.

ammkrn added a change on August 6, 2021

Add log filtering by file and directory name(s)

2J2AST52AMZEYVGUAIET5XRSXI25M2OUJXODAL3AALTYGIP3VJIAC

ammkrn on August 6, 2021

@pmeunier

2J2AST is an attempt just based on trial and error; the iterator yielded from iter_touched seems to always start with the relevant files, but usually contains superfluous elements as well. This patch doesn’t always yield the filtered items in order; this can be pretty easily solved by going over reverse_log once to check the order, but I’m not sure what the performance implications of that are (I assume they’re not bad since that’s what’s usually done for log).

pmeunier on August 7, 2021

Looks good. Just one comment, why do you use the scary-looking

        let mut touched_cursor = txn.iter_touched(inode_position)?;
        while let Some((position, touched_change_id)) =
            txn.cursor_touched_files_next(&mut touched_cursor.cursor)?
        {

Instead of

        for Some((position, touched_change_id)) in txn.iter_touched(inode_position)? {

ammkrn added a change on August 7, 2021

Add log filtering by file and directory name(s)

P5LC4V2SJCWXUPSB27A5PDY6263DZB6EXKEGA46X4OLRHMMRXUDAC

ammkrn on August 7, 2021

@pmeunier

Touche. Is there a reason to have both the Iterator implementation and an explicit next() method then? PSLC4V fixes the filtered-by-file changes not always being in order by going over the revlog, but only as much as it needs to. Iterating over the revlog for the main pijul repo only takes a few milliseconds on my old laptop so I don’t anticipate that being a problem.

ammkrn on September 15, 2021

@pmeunier

This discussion can be closed. For posterity, this feature has been implemented in pijul:main

pmeunier on September 15, 2021

This reminds me that I didn’t answer your last question: the reason is that we want zero-copy iterators, which in some cases requires associated type constructors. I should probably take another look at this issue, now that the traits are stable.

pmeunier closed this discussion on September 15, 2021