---
name: bulk-import-setup
description: Set up bulk import for acoustic monitoring datasets - fuzzy match existing location NAMES, create new ones with GPS parsing if log.txt present in folder, and generate import CSV with comprehensive logging
---
# Bulk Import Setup for Acoustic Monitoring Datasets
This skill helps you prepare acoustic monitoring datasets for bulk import by analyzing locations, matching existing records, creating new locations with optional GPS parsing, and generating import CSV files.
## When to Use This Skill
Use this skill when the user needs to:
- Import a batch of acoustic recordings organized in location folders
- Create multiple location records for a dataset
- Match folder names to existing database locations (fuzzy matching)
- Parse GPS coordinates from log files
- Generate CSV files for bulk file import
## Workflow Overview
1. **Analyze** directory structure and existing locations
2. **Fuzzy match** folder names to database locations (NAME ONLY - never GPS!)
3. **Parse GPS log.txt files** from folders
4. **Prepare CSV** with date-range cluster names and file counts
5. **Check with user** before creating any new locations
6. **Create locations** (if confirmed) with GPS or fallback coordinates
7. **Update existing locations** with GPS data (if they have assumed coords)
8. **Update CSV** with final location IDs
9. **Verify** and provide bulk import command
## Critical Rules
### 1. Match on Names, NEVER on GPS Coordinates
**Never match on GPS coordinates!** Recorders can be physically close but represent different locations.
```python
# ❌ WRONG - Don't compare GPS distances
if abs(loc1_lat - loc2_lat) < 0.001:  # NO!
    return "match"
# ✅ CORRECT - Only compare normalized names
normalized_name1 = normalize_location_name(folder_name)
normalized_name2 = normalize_location_name(existing_name)
score = SequenceMatcher(None, normalized_name1, normalized_name2).ratio()
```
### 2. GPS Log Files are INSIDE Folders
**Parse log files by folder structure**, not by separate log file list.
```python
# ✅ CORRECT - Log file is inside the folder
folder_path = Path("/path/to/mok_bl76/")
log_matches = list(folder_path.glob("*log.txt"))  # Inside the folder!
log_path = log_matches[0] if log_matches else None
# Extract folder name from path for mapping
folder_name = Path(folder_path).name # "mok_bl76"
gps_data[folder_name] = parse_gps_log(log_path)
```
### 3. Update Matched Locations with GPS Data
If fuzzy matching finds an existing location with "assumed" coordinates, **update it** with parsed GPS data when available.
```python
if matched_location:
    desc = matched_location['description']
    # Check if location has assumed Takaka coordinates
    if "assumed" in desc or "Takaka" in desc or "pending GPS" in desc:
        # We have GPS data - UPDATE the location!
        if gps_data:
            update_location(matched_location['id'], gps_data['lat'], gps_data['lon'])
```
### 4. Cluster Names = Date Ranges (NOT Location Names!)
**Cluster names represent time periods**, not locations.
```python
# ❌ WRONG - Using location name
cluster_name = "mok_bl76"
# ✅ CORRECT - Using date range from files
first_file = "mok_bl76_20250422_173008.wav"
last_file = "mok_bl76_20250503_064508.wav"
cluster_name = "2025-04-22 to 2025-05-03"
```
### 5. CSV Must Have 6 Columns (Including file_count)
The bulk import expects **exactly 6 columns**:
```csv
location_name,location_id,directory_path,date_range,sample_rate,file_count
mok_bl11,HA4TDv3DlhjX,/media/david/Misc-2/Manu o Kahurangi kiwi survey (3)/Data/K3_mokbl_boulder lake track/mok_bl11,2024-12,8000,68
```
**Missing file_count will cause**: `Error: CSV row 2 has insufficient columns (expected 6, got 5)`
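A quick pre-flight check catches this before the import runs. This is a sketch; it uses Python's `csv` module purely for reading, and `validate_import_csv` is a helper name introduced here, not part of the skraak tooling:

```python
import csv

EXPECTED_COLUMNS = 6  # location_name, location_id, directory_path, date_range, sample_rate, file_count

def validate_import_csv(csv_path: str) -> list:
    """Return a list of error strings; an empty list means the CSV passes the check."""
    errors = []
    with open(csv_path, newline='') as f:
        for row_num, row in enumerate(csv.reader(f), start=1):
            if len(row) != EXPECTED_COLUMNS:
                errors.append(f"Row {row_num} has {len(row)} columns (expected {EXPECTED_COLUMNS})")
            elif row_num > 1 and not row[5].isdigit():
                errors.append(f"Row {row_num}: file_count '{row[5]}' is not an integer")
    return errors
```

Run it right after generating the CSV, before handing the user the bulk import command.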
## Step 1: Analyze Directory Structure
First, understand the organization:
```python
# Key questions to answer:
# - How many location folders?
# - Naming patterns (dash/underscore)?
# - GPS log files present?
# - File counts per location?
# - Date formats in filenames?
```
**Check for:**
- Log files: `LOCATION-log.txt` or `log.txt`
- GPS data format: `GPS (lat,long): -40.123456,172.123456`
- Filename date patterns: `YYYYMMDD` or `DDMMYY` or `YYMMDD`
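These patterns can be probed up front with a couple of regexes. A sketch: the GPS pattern matches the log format shown above; the filename pattern is an assumption covering both 8-digit and 6-digit date tokens at the start of, or after an underscore in, the filename:

```python
import re

# GPS line as it appears in the log files, e.g. "GPS (lat,long): -40.123456,172.123456"
GPS_RE = re.compile(r'GPS \(lat,long\): (-?\d+\.\d+),(-?\d+\.\d+)')

# Date token in filenames: 8 digits (YYYYMMDD) or 6 digits (DDMMYY / YYMMDD)
DATE_RE = re.compile(r'(?:^|_)(\d{6}|\d{8})_')

def probe_gps_line(line: str):
    """Return (lat, lon) floats if the line carries a GPS fix, else None."""
    m = GPS_RE.search(line)
    return (float(m.group(1)), float(m.group(2))) if m else None
```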
## Step 2: Fuzzy Match Existing Locations
**IMPORTANT**: Always query existing locations first to avoid duplicates!
### Fuzzy Matching Rules
**CRITICAL: Use strict threshold (0.90) to avoid false matches!**
```python
import re
from difflib import SequenceMatcher
from typing import Dict, List, Optional

def normalize_location_name(name: str) -> str:
    """Normalize location name for fuzzy matching."""
    name = name.lower()
    name = re.sub(r'[_\-\s]+', '', name)     # Remove underscores, hyphens, spaces
    name = re.sub(r'(low|high)$', '', name)  # Remove Low/High suffix
    return name

def fuzzy_match_score(s1: str, s2: str) -> float:
    """Calculate fuzzy match score between two strings."""
    return SequenceMatcher(None, normalize_location_name(s1), normalize_location_name(s2)).ratio()

def fuzzy_match_location(folder_name: str, existing_locations: List[Dict]) -> Optional[Dict]:
    """
    Fuzzy match folder name to existing location BY NAME ONLY.
    IMPORTANT:
    - Returns match ONLY if similarity > 0.90
    - NEVER compares GPS coordinates (locations can be close but different!)
    This threshold prevents false matches like:
    - "mok_bl6P" matching to "mok_bl16" (score: 0.86) ✗ Different locations!
    - "mok_tuna51" matching to "mok_tuna31" (score: 0.89) ✗ Different locations!
    But allows:
    - "mok_bl6P_Low" matching to "mok_bl6P" (score: 1.00) ✓ Same location
    """
    normalized_folder = normalize_location_name(folder_name)
    best_match = None
    best_score = 0.0
    for loc in existing_locations:
        normalized_loc = normalize_location_name(loc['name'])
        score = SequenceMatcher(None, normalized_folder, normalized_loc).ratio()
        if score > best_score:
            best_score = score
            best_match = loc
    # Strict threshold - only match if very similar
    if best_score > 0.90:
        print(f"  Matched '{folder_name}' -> '{best_match['name']}' (score: {best_score:.3f})")
        # Check if matched location needs GPS update
        if 'description' in best_match:
            desc = best_match['description'].lower()
            if 'assumed' in desc or 'takaka' in desc or 'pending gps' in desc:
                print("  ⚠️ Location has assumed coords - check for GPS data to update")
        return best_match
    return None
```
**Why 0.90 threshold?**
- Too low (0.80-0.85): False matches → wrong location assignments
- 0.90: Only matches very similar names (case/punctuation differences)
- Example: `mok_bl6P_Low` (score: 1.00) ✓ vs `mok_bl16` (score: 0.86) ✗
### Query Existing Locations
```python
cmd = [
    SKRAAK_BIN, "sql",
    "--db", db_path,
    f"SELECT name, id FROM location WHERE dataset_id = '{dataset_id}' AND active = true",
    "--limit", "500"
]
```
**Note**: If the production DB is locked, ask whether to use the backup database for querying, but note that locations will be created in production later.
**Note**: Don't forget to check sample rates on a few files using the skraak metadata command, or sox.
## Step 3: Prepare CSV with Analysis
**Create preparation script** that:
1. Matches folders to existing locations
2. Identifies which locations need to be created
3. Analyzes file metadata (count, sample rate, date range)
4. Generates CSV with placeholder IDs for new locations
5. Writes comprehensive log file
### CSV Format
**CRITICAL: Must have exactly 6 columns with date-range cluster names!**
```csv
location_name,location_id,directory_path,date_range,sample_rate,file_count
mok_bl11,HA4TDv3DlhjX,/media/david/Misc-2/Manu o Kahurangi kiwi survey (3)/Data/K3_mokbl_boulder lake track/mok_bl11,2024-12,8000,68
```
**Key points:**
- `location_name`: Folder name (used for matching and reference)
- `location_id`: Actual 12-character nanoid, or `<CREATE:name>` placeholder for new locations
- `directory_path`: Full absolute path to the folder
- `date_range`: **Date range** (e.g., "2025-04-22 to 2025-05-03") - NEVER a location name!
- `sample_rate`: Integer (e.g., 8000, 32000)
- `file_count`: Integer count of WAV files in folder
### Log File Format
Create detailed log showing:
- Each location processed
- Match status (existing or new)
- File counts and metadata
- Summary statistics
- Next steps
```python
LOG_FILE = BASE_DIR / "import_log.txt"
with open(LOG_FILE, 'w') as log_file:
    log(f"Processing: {folder_name}", log_file)
    if matched:
        log(f"  ✓ MATCH: {folder} → {existing} (ID: {id})", log_file)
    else:
        log(f"  + NEW: {folder} (will create)", log_file)
    log(f"  Files: {count}, Sample Rate: {rate} Hz, Date: {date}", log_file)
```
## Step 4: Check with User Before Creating Locations
**CRITICAL**: Never create locations without user confirmation!
Present summary and ask:
```
PLAN READY - AWAITING CONFIRMATION
===================================
Matched existing: 17
Need to create: 5
Total files: 8,003
Locations to be created:
- new-location-1 (420 files)
- new-location-2 (350 files)
...
Shall I proceed to create these 5 new locations?
```
**Wait for user approval** before executing any location creation!
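Building that summary from the match results is mechanical. A sketch, where `matched` and `to_create` are hypothetical dicts mapping location name to file count produced by the preparation step:

```python
def format_plan(matched: dict, to_create: dict) -> str:
    """Build the confirmation prompt shown to the user before any writes."""
    total_files = sum(matched.values()) + sum(to_create.values())
    lines = [
        "PLAN READY - AWAITING CONFIRMATION",
        "===================================",
        f"Matched existing: {len(matched)}",
        f"Need to create: {len(to_create)}",
        f"Total files: {total_files:,}",
    ]
    if to_create:
        lines.append("Locations to be created:")
        lines += [f"  - {name} ({count} files)" for name, count in sorted(to_create.items())]
        lines.append(f"Shall I proceed to create these {len(to_create)} new locations?")
    return "\n".join(lines)
```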
## Step 5: Create Locations (If Confirmed)
### GPS Parsing (If Available)
```python
def parse_gps_from_log(log_file: Path) -> Tuple[Optional[float], Optional[float], int]:
    """
    Parse and average GPS coordinates from log file.
    IMPORTANT: Use errors='ignore' to handle encoding issues!
    Some log files contain non-UTF-8 bytes that will crash without this.
    """
    if not log_file.exists():
        return None, None, 0
    gps_pattern = re.compile(r'GPS \(lat,long\): (-?\d+\.\d+),(-?\d+\.\d+)')
    latitudes = []
    longitudes = []
    # CRITICAL: errors='ignore' prevents UnicodeDecodeError
    with open(log_file, 'r', encoding='utf-8', errors='ignore') as f:
        for line in f:
            match = gps_pattern.search(line)
            if match:
                latitudes.append(float(match.group(1)))
                longitudes.append(float(match.group(2)))
    if latitudes:
        avg_lat = sum(latitudes) / len(latitudes)
        avg_lon = sum(longitudes) / len(longitudes)
        return (avg_lat, avg_lon, len(latitudes))
    return None, None, 0
```
### Location Creation
```python
# Use CLI tool (not MCP) for scripting
cmd = [
    "./skraak", "create", "location",
    "--db", db_path,
    "--dataset", dataset_id,
    "--name", location_name,
    "--lat", str(latitude),
    "--lon", str(longitude),
    "--timezone", "Pacific/Auckland",  # NZ default
    "--description", description
]
result = subprocess.run(cmd, capture_output=True, text=True, check=True)
response = json.loads(result.stdout)
location_id = response["location"]["id"]
```
### Description Format
**With GPS:**
```python
description = f"Averaged from {fix_count} GPS readings"
# Example: "Averaged from 14 GPS readings"
```
**Without GPS (fallback to Takaka if Kahurangi data, Te Anau if Fiordland data):**
```python
description = "Location assumed to be Takaka pending GPS data"
# Note: Takaka coords are -40.85085, 172.80703
```
**Generic (user preference - if provided):**
```python
description = "Friends of Cobb - Assumed location (Takaka) - no GPS data available"
```
## Step 6: File Metadata Analysis
### Count WAV Files (RECURSIVE!)
**CRITICAL: Always count recursively!** Files may be organized in daily subfolders.
```python
def count_wav_files(folder_path: str) -> int:
    """
    Count WAV files recursively (searches all subdirectories).
    IMPORTANT: Don't use glob("*.wav") - it only searches top level!
    Files are often organized in daily subfolders:
        location/
            2025-03-08/
                file1.wav
                file2.wav
            2025-03-09/
                file3.wav
    """
    count = 0
    for root, dirs, files in os.walk(folder_path):
        for file in files:
            if file.lower().endswith('.wav'):
                count += 1
    return count
```
### Extract Sample Rate (RECURSIVE!)
```python
def get_sample_rate(folder_path: str) -> int:
    """
    Get sample rate from first WAV file (searches recursively).
    IMPORTANT: Search recursively since files may be in subfolders!
    """
    for root, dirs, files in os.walk(folder_path):
        for file in files:
            if file.lower().endswith('.wav'):
                file_path = os.path.join(root, file)
                # Read WAV header (sample rate at bytes 24-27)
                try:
                    with open(file_path, 'rb') as f:
                        f.seek(24)
                        return int.from_bytes(f.read(4), byteorder='little')
                except OSError:
                    continue
    return 8000  # Default fallback
```
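As a cross-check on the raw-header read, the stdlib `wave` module reports the same value and also validates the RIFF header. A sketch (`sample_rate_via_wave` is a helper name introduced here); note it raises on non-RIFF or compressed files, so keep the byte-level read as the workhorse:

```python
import wave

def sample_rate_via_wave(wav_path: str) -> int:
    """Read the sample rate via the stdlib wave module (validates the header)."""
    with wave.open(wav_path, 'rb') as w:
        return w.getframerate()
```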
### Parse Date Range (RECURSIVE!)
**Support both formats:**
```python
def extract_date_from_filename(filename: str) -> Optional[str]:
    """Extract date from filename like mok_bl6P_20250308_200003.wav"""
    match = re.search(r'(\d{8})_', filename)
    if match:
        date_str = match.group(1)
        # Convert YYYYMMDD to YYYY-MM
        return f"{date_str[:4]}-{date_str[4:6]}"
    return None

def determine_date_range(folder_path: str) -> str:
    """
    Determine date range from WAV filenames (searches recursively).
    IMPORTANT: Search recursively since files may be in daily subfolders!
    """
    dates = set()
    for root, dirs, files in os.walk(folder_path):
        for file in files:
            if file.lower().endswith('.wav'):
                date = extract_date_from_filename(file)
                if date:
                    dates.add(date)
    if dates:
        sorted_dates = sorted(dates)
        if len(sorted_dates) == 1:
            return sorted_dates[0]
        return f"{sorted_dates[0]} to {sorted_dates[-1]}"
    return "unknown"
```
## Step 7: Update CSV and Verify
After creating locations, update CSV with real IDs:
```python
# Update TO_BE_CREATED → real location IDs
# Regenerate CSV with all real IDs
```
### Verification Queries
```sql
-- Check all locations created
SELECT id, name, latitude, longitude, SUBSTRING(description, 1, 50)
FROM location
WHERE dataset_id = 'DATASET_ID'
AND name LIKE 'PREFIX%'
AND active = true
ORDER BY name;
-- Verify row count matches expected
```
### CSV Verification
```bash
# Should be N+1 rows (header + N locations)
wc -l import.csv
# Sum file counts
cat import.csv | tail -n +2 | cut -d',' -f6 | paste -sd+ | bc
```
## Database Safety
### Use Backup for Queries if Production Locked
```python
# Query backup database for existing locations
# Check first - expect something like "/home/david/go/src/skraak/db/backup_YYYY-MM-DD_X.duckdb"
QUERY_DB = "/home/david/go/src/skraak/db/backup_YYYY-MM-DD_X.duckdb"
# Create locations in production (when unlocked)
PROD_DB = "/home/david/go/src/skraak/db/skraak.duckdb"
```
### Always Default to Test Database
⚠️ **Unless user explicitly specifies production:**
```python
DB_PATH = "/home/david/go/src/skraak/db/test.duckdb" # Default
```
## Example Usage
### Example 1: All Locations Already Exist
**User Request:**
```
Setup bulk import for Friends of Cobb dataset:
- Dataset ID: QZ0tlUrX4Nyi
- 17 location folders
- Use Takaka coordinates (-40.85085, 172.80703)
- Generate CSV at: /path/to/foc_import.csv
```
**Your Response:**
1. Query existing locations (17 found)
2. Fuzzy match all 17 folders → all matched!
3. Generate CSV with real location IDs
4. Report: "All locations exist, CSV ready for import"
5. Provide bulk import command
### Example 2: Create New Locations (MOK Import)
**User Request:**
```
Import Manu o Kahurangi kiwi survey:
- 23 location folders
- 8 have GPS log files
- 15 need Takaka default coords
- Generate CSV at: /path/to/mok_import.csv
```
**Your Workflow:**
1. **Parse GPS logs** (8 locations):
```python
GPS_DATA = {
"mok_bl6P_Low": {"lat": -40.823070, "lon": 172.582703, "fixes": 14},
"mok_bl6T_Low": {"lat": -40.797182, "lon": 172.594582, "fixes": 14},
...
}
```
2. **Query existing locations** with strict fuzzy matching (threshold 0.90):
- Result: No matches found (all new locations)
3. **Count files recursively** (files in daily subfolders):
- Total: 15,212 WAV files across 23 locations
4. **Present plan to user**:
```
PLAN: Create 23 new locations
- 8 with GPS coordinates (averaged from log files)
- 15 with Takaka defaults
Total files: 15,212
Proceed? [yes/no]
```
5. **After approval, create all locations**:
```python
for location in locations:
if location in GPS_DATA:
gps = GPS_DATA[location]
description = f"Averaged from {gps['fixes']} GPS readings"
else:
gps = {'lat': DEFAULT_LAT, 'lon': DEFAULT_LON}
description = "Location assumed to be Takaka pending GPS data"
location_id = create_location(location, gps['lat'], gps['lon'], description)
```
6. **Generate CSV** with real location IDs
7. **Verify**:
- Query database to confirm all 23 locations created
- Verify CSV has 23 rows + header
- Provide bulk import command
## Output Summary Template
```
======================================================================
BULK IMPORT SETUP COMPLETE
======================================================================
Dataset: Dataset Name (ID: abc123)
Database: /path/to/db
Locations processed: 17/17
- Existing (matched): 17
- New (created): 0
- Failed: 0
Total files: 8,003
CSV Generated: /path/to/import.csv
Log file: /path/to/import_log.txt
Ready for bulk import!
Run:
cd /home/david/go/src/skraak
./skraak import bulk --db ./db/skraak.duckdb \
--dataset DATASET_ID \
--csv "/path/to/import.csv" \
--log /path/to/bulk_import.log
```
## All-in-One Python Script Template
For complex imports with many locations, create a comprehensive Python script:
```python
#!/usr/bin/env python3
"""
Bulk import setup: Parse GPS, create locations, generate CSV
"""
import os
import re
import json
import subprocess
from pathlib import Path
from difflib import SequenceMatcher
from typing import Dict, List, Optional, Tuple
# Configuration
DATASET_ID = "abc123xyz789"
DB_PATH = "./db/skraak.duckdb"
TIMEZONE = "Pacific/Auckland"
BASE_PATH = "/path/to/data"
DEFAULT_LAT = -40.85085
DEFAULT_LON = 172.80703
# GPS log files (if available)
GPS_LOGS = ["location1/log.txt", "location2/log.txt"]
# Location folders
FOLDERS = ["location1", "location2", ...]
def parse_gps_log(log_path: str) -> Optional[Tuple[float, float, int]]:
    """Parse GPS log and return (lat, lon, fix_count)"""
    if not os.path.exists(log_path):
        return None
    gps_pattern = re.compile(r'GPS \(lat,long\): (-?\d+\.\d+),(-?\d+\.\d+)')
    coords = []
    with open(log_path, 'r', encoding='utf-8', errors='ignore') as f:
        for line in f:
            match = gps_pattern.search(line)
            if match:
                coords.append((float(match.group(1)), float(match.group(2))))
    if coords:
        avg_lat = sum(c[0] for c in coords) / len(coords)
        avg_lon = sum(c[1] for c in coords) / len(coords)
        return (avg_lat, avg_lon, len(coords))
    return None

def count_wav_files_recursive(folder_path: str) -> int:
    """Count WAV files recursively"""
    count = 0
    for root, dirs, files in os.walk(folder_path):
        count += sum(1 for f in files if f.lower().endswith('.wav'))
    return count

def get_sample_rate_recursive(folder_path: str) -> int:
    """Get sample rate from first WAV file found"""
    for root, dirs, files in os.walk(folder_path):
        for file in files:
            if file.lower().endswith('.wav'):
                try:
                    with open(os.path.join(root, file), 'rb') as f:
                        f.seek(24)
                        return int.from_bytes(f.read(4), byteorder='little')
                except OSError:
                    continue
    return 8000

def create_location(name: str, lat: float, lon: float, desc: str) -> str:
    """Create location via CLI and return ID"""
    cmd = [
        "./skraak", "create", "location",
        "--db", DB_PATH,
        "--dataset", DATASET_ID,
        "--name", name,
        "--lat", str(lat),
        "--lon", str(lon),
        "--timezone", TIMEZONE,
        "--description", desc
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return json.loads(result.stdout)['location']['id']

def main():
    # Parse GPS logs
    gps_data = {}
    for log_file in GPS_LOGS:
        result = parse_gps_log(os.path.join(BASE_PATH, log_file))
        if result:
            folder = log_file.split('/')[0]
            gps_data[folder] = {'lat': result[0], 'lon': result[1], 'fixes': result[2]}

    # Process locations
    location_map = {}
    for folder in FOLDERS:
        folder_path = os.path.join(BASE_PATH, folder)
        # Determine coordinates and description
        if folder in gps_data:
            lat = gps_data[folder]['lat']
            lon = gps_data[folder]['lon']
            desc = f"Averaged from {gps_data[folder]['fixes']} GPS readings"
        else:
            lat = DEFAULT_LAT
            lon = DEFAULT_LON
            desc = "Location assumed to be Takaka pending GPS data"
        # Create location
        location_id = create_location(folder, lat, lon, desc)
        # Get metadata
        location_map[folder] = {
            'id': location_id,
            'files': count_wav_files_recursive(folder_path),
            'sample_rate': get_sample_rate_recursive(folder_path)
        }

    # Generate CSV (use manual writes, not csv module, to avoid \r\n line endings)
    csv_path = os.path.join(BASE_PATH, "import.csv")
    with open(csv_path, 'w') as f:
        f.write("location_name,location_id,directory_path,date_range,sample_rate,file_count\n")
        for folder, info in location_map.items():
            f.write(f"{folder},{info['id']},{BASE_PATH}/{folder},2025-02,{info['sample_rate']},{info['files']}\n")

    print(f"✓ Created {len(location_map)} locations")
    print(f"✓ Generated CSV: {csv_path}")

if __name__ == '__main__':
    main()
```
## Common Patterns
### Date Format Detection
```python
# YYYYMMDD: 20231115_050004
# DDMMYY: 011123_050004
# YYMMDD: 231101_050004
# Heuristic: If first 4 chars > 2000, likely YYYYMMDD
# Can use variance to disambiguate YYMMDD/DDMMYY (many days, fewer years)
```
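A sketch of that heuristic (`classify_date_token` is a name introduced here; the variance check is an assumption: across a deployment, day digits vary far more than year digits):

```python
def classify_date_token(tokens: list) -> str:
    """
    Classify 6/8-digit filename date tokens as YYYYMMDD, DDMMYY, or YYMMDD.
    tokens: the date strings extracted from many filenames at one location.
    """
    first = tokens[0]
    if len(first) == 8 and int(first[:4]) > 2000:
        return "YYYYMMDD"
    # 6 digits: compare how much the leading vs trailing pair varies.
    # Day digits change often within a deployment; year digits hardly ever do.
    lead = {t[:2] for t in tokens}
    trail = {t[4:6] for t in tokens}
    return "DDMMYY" if len(lead) > len(trail) else "YYMMDD"
```

It needs tokens from several days of recordings to disambiguate the 6-digit forms; with a single day it defaults to YYMMDD.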
### Multiple Sample Rates
```python
# Don't assume uniform - check each location
# Common: 8000 Hz, 16000 Hz, 32000 Hz, 250000 Hz
```
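Rather than trusting the first file, a tally of every rate in a folder makes mixed-rate locations visible. A sketch (`tally_sample_rates` is a name introduced here; it assumes the same canonical RIFF header layout, sample rate at bytes 24-27, used elsewhere in this skill):

```python
import os
from collections import Counter

def tally_sample_rates(folder_path: str) -> Counter:
    """Count how many WAV files use each sample rate (recursive)."""
    rates = Counter()
    for root, _dirs, files in os.walk(folder_path):
        for name in files:
            if name.lower().endswith('.wav'):
                try:
                    with open(os.path.join(root, name), 'rb') as f:
                        f.seek(24)  # sample rate field in a canonical RIFF header
                        rates[int.from_bytes(f.read(4), 'little')] += 1
                except OSError:
                    pass
    return rates
```

If the Counter has more than one key, flag the location to the user before import rather than picking a rate silently.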
### Trip References in Names
```python
# Location names may include trip references:
# foc-thup-foc01 → Friends of Cobb, trip 1
# foc-thup-fof11 → Friends of Flora, trip 11
```
## Troubleshooting
### Production Database Locked
**Solution**: Use backup database for queries, create locations later:
```python
# Step 1: Query backup for matches
# Step 2: Prepare CSV with analysis
# Step 3: Check with user
# Step 4: Wait for prod unlock
# Step 5: Check prod agrees with backup
# Step 6: Create locations in prod
# Step 7: Update CSV with real IDs
```
### Ambiguous Fuzzy Matches
**Problem**: One database location matches multiple folders
**Solution**: Be conservative - create new locations for each folder
```python
# foc-thup-f could match:
# - foc-thup-foc01
# - foc-thup-fof11
# create new locations for each folder
```
### Zero-Byte Files
**Handling**: Count them but note in logs
```python
wav_files = list(directory.rglob("*.wav"))  # recursive, consistent with file counting
file_count = len(wav_files)
zero_byte_files = [f for f in wav_files if f.stat().st_size == 0]
if zero_byte_files:
log(f" ⚠ {len(zero_byte_files)} zero-byte files found", log_file)
```
## Key Lessons (From Production Imports)
### 🔴 CRITICAL: Stricter Fuzzy Matching (0.9 threshold)
**Problem:** Threshold of 0.80 caused false matches:
- `mok_bl6P` matched to `mok_bl16` (score: 0.86) → WRONG! Different locations
- `mok_tuna51` matched to `mok_tuna31` (score: 0.89) → WRONG! Different locations
**Solution:** Use 0.9 threshold:
```python
if best_score > 0.9:  # NOT 0.80!
    return best_match
```
### 🔴 CRITICAL: Always Search Recursively
**Problem:** Using `glob("*.wav")` missed files in subdirectories.
**Solution:** Use `os.walk()` everywhere:
```python
for root, dirs, files in os.walk(folder_path):  # NOT glob()!
    for file in files:
        if file.lower().endswith('.wav'):
            ...  # process file
```
**Applies to:**
- File counting
- Sample rate detection
- Date range parsing
### 🔴 CRITICAL: Handle Encoding Errors
**Problem:** Some GPS log files contain non-UTF-8 bytes → UnicodeDecodeError
**Solution:** Always use `errors='ignore'`:
```python
with open(log_file, 'r', encoding='utf-8', errors='ignore') as f:  # errors='ignore'!
    ...
```
### 🟡 IMPORTANT: Remove Suffixes in Normalization
Normalize location names to handle suffix variations:
```python
name = re.sub(r'[_\-\s]+', '', name.lower())  # Remove separators first
name = re.sub(r'(low|high)$', '', name)       # Then remove Low/High suffix
# mok_bl6P_Low → mokbl6p (for matching)
```
### 🟡 IMPORTANT: All-in-One Python Script
For complex imports (10+ locations), create a single Python script that:
1. Parses GPS logs
2. Counts files recursively
3. Gets metadata
4. Creates locations via CLI
5. Generates CSV
**Benefits:**
- One script to run - a single place to debug if something fails
- Easy to debug
- Reusable for similar imports
- Complete audit trail
## Guidelines
- Always check existing locations before creating new ones (avoid duplicates)
- Use **strict** fuzzy matching (0.9 threshold) to avoid false matches
- **Always search recursively** with `os.walk()` (files in subfolders!)
- Prepare CSV before creating locations (allows user review)
- Never create locations without explicit user confirmation
- Write comprehensive logs for debugging and records
- Support multiple date formats (YYYYMMDD, DDMMYY and YYMMDD)
- Handle missing GPS gracefully with fallback coordinates
- **Handle encoding errors** with `errors='ignore'` when reading logs
- Verify results with SQL queries and CSV checks
- Provide ready-to-run bulk import command
## Success Criteria
✅ No duplicate locations created
✅ All folders matched or new locations planned
✅ CSV has correct format with all required columns
✅ Log file provides audit trail
✅ User confirmed before any database writes
✅ Verification queries show expected results
✅ Bulk import command provided