Kill active build steps when builds are cancelled

Oct 31, 2016, 1:58 PM
LVQXQIYA7QMLVYOANYEFHDBTFAOSE3D2IYAVOG2DXURTASRCUNYQC

Dependencies

  • [2] TTBLPQAJ Keep track of wait time per system type
  • [3] OTNJLJHA Sort build steps
  • [4] UYUVQWXQ Fix hydra-queue-runner --build-one
  • [5] KQ3EGUQY Add some instrumentation to keep track of dispatcher cost
  • [6] UNVLTCV4 Fix showing machine name for aborted build steps
  • [7] EHEQ4AY3 Fix retry of transient failures
  • [8] OPN3PED2 Tweak
  • [9] 3BKPZ52C Disambiguate "marking build as succeeded" message
  • [10] YTAYNN7V Queue monitor: Bail out earlier if a step has failed previously
  • [11] NTEDD7T4 Provide a plugin hook for when build steps finish
  • [12] OBOTGFG6 Prevent orphaned build steps
  • [13] BRAESISH Warn if PostgreSQL appears stalled
  • [14] PH3DFCNU Render machine correctly if it doesn't contain @
  • [15] TX7Q4RAS Add page showing latest build steps
  • [16] 2GUAKGTB Fix indentation of build.tt
  • [17] ZH6B56XR Try harder to find build logs
  • [18] HJOEIMLR Refactor
  • [19] MHVIT4JY Split hydra-queue-runner.cc more
  • [20] MSIHMO45 Tweak build steps
  • [21] EYR3EW6J Keep stats for the Hydra auto scaler
  • [22] LE4VZIY5 More stats
  • [23] 73YR46NJ hydra-queue-runner: Write directly to a binary cache
  • [24] UNVMKJV5 Unify build and step status codes
  • [25] R7MDDCB2 Some unnecessary job names
  • [26] 7LWB2J2Z Periodically clear orphaned build steps
  • [27] FCTX433O Add buildStarted plugin hook
  • [28] O64P4XJS Keep per-machine stats
  • [29] FQQRJUO4 Mark builds as busy
  • [30] PMNWRTGJ Add multiple output support
  • [31] VQISTKOP hydra-queue-runner: Use substitutes
  • [32] TPNHTE5V Remove obsolete Builds columns and provide accurate "Running builds"
  • [33] 7LFMSF4K Don't show "localhost" as machine for cached failed build steps
  • [34] 5JB5DKQL Don't repeat links to build step logs
  • [35] DKJFD6JN Process Nix API changes
  • [36] BD3GRK4B * Get rid of "positive failures" and separate log phases.
  • [37] 24BMQDZA Start of single-process hydra-queue-runner
  • [38] PLOZBRTR Add command ‘hydra-queue-runner --status’ to show current status
  • [39] HUUZFPPK Fix race between the queue monitor and the builder threads
  • [40] FITVNQ2S Keep track of the time we spend copying to/from build machines
  • [41] 5AIYUMTB Basic remote building
  • [42] BG6PEOB2 Make the output size limit configurable
  • [43] 62MQPRXC Pass null values to libpqxx properly
  • [44] KBZHIGLG Record the machine used for a build step
  • [45] LJILHOJ7 Create BuildSteps race-free
  • [46] WE5Q2NVI Allow build to be bumped to the front of the queue via the web interface
  • [47] NKQOEVVP Get rid of "will retry" messages after "maybe cancelling..."
  • [48] DWFTK56E Keep track of how many threads are waiting
  • [49] UQQ4IL55 Add a error type for "unsupported system type"
  • [50] NQ2X3Y4K Don't render machine name if not applicable to step
  • [*] J5UVLXOK * Start of a basic Catalyst web interface.
  • [*] OCZ4LSGG Automatically retry aborted builds
  • [*] JGLE5BRN Add separate build step status codes for cached failures and timeouts
  • [*] N22GPKYT * Put info about logs / build products in the DB.

Change contents

  • edit in src/hydra-queue-runner/builder.cc at line 15
    [5.23]
    [13.350]
    reservation->threadId = pthread_self();
  • replacement in src/hydra-queue-runner/builder.cc at line 18
    [13.351][13.351:388]()
    MaintainCount mc(nrActiveSteps);
    [13.351]
    [13.58]
    activeSteps_.lock()->insert(reservation);
    Finally removeActiveStep([&]() {
    reservation->threadId = -1;
    activeSteps_.lock()->erase(reservation);
    });
  • replacement in src/hydra-queue-runner/builder.cc at line 73
    [13.2097][13.2097:2164]()
    thousands of builds), so we don't. */
    Build::ptr build;
    [13.2097]
    [13.2164]
    thousands of builds), so we don't.
    We don't keep a Build::ptr here to allow
    State::processQueueChange() to detect whether a step can be
    cancelled (namely if there are no more Builds referring to
    it). */
    BuildID buildId;
    Path buildDrvPath;
    unsigned int maxSilentTime, buildTimeout;
  • edit in src/hydra-queue-runner/builder.cc at line 102
    [13.3048]
    [13.0]
    Build::ptr build;
  • edit in src/hydra-queue-runner/builder.cc at line 112
    [13.3215]
    [13.3215]
    buildId = build->id;
    buildDrvPath = build->drvPath;
    maxSilentTime = build->maxSilentTime;
    buildTimeout = build->buildTimeout;
  • replacement in src/hydra-queue-runner/builder.cc at line 118
    [13.3327][13.3327:3414]()
    % step->drvPath % machine->sshName % build->id % (dependents.size() - 1));
    [13.3327]
    [13.3414]
    % step->drvPath % machine->sshName % buildId % (dependents.size() - 1));
  • replacement in src/hydra-queue-runner/builder.cc at line 121
    [13.3421][4.0:74]()
    bool quit = build->id == buildOne && step->drvPath == build->drvPath;
    [13.3421]
    [13.3460]
    bool quit = buildId == buildOne && step->drvPath == buildDrvPath;
  • replacement in src/hydra-queue-runner/builder.cc at line 132
    [13.124][12.0:86]()
    printError("marking step %d of build %d as orphaned", stepNr, build->id);
    [13.124]
    [13.124]
    printError("marking step %d of build %d as orphaned", stepNr, buildId);
  • replacement in src/hydra-queue-runner/builder.cc at line 134
    [13.179][13.179:235]()
    orphanedSteps_->emplace(build->id, stepNr);
    [13.179]
    [13.235]
    orphanedSteps_->emplace(buildId, stepNr);
  • replacement in src/hydra-queue-runner/builder.cc at line 151
    [13.4030][13.0:100]()
    stepNr = createBuildStep(txn, result.startTime, build, step, machine->sshName, bsBusy);
    [13.4030]
    [13.4224]
    stepNr = createBuildStep(txn, result.startTime, buildId, step, machine->sshName, bsBusy);
  • replacement in src/hydra-queue-runner/builder.cc at line 158
    [13.4376][13.1330:1432]()
    buildRemote(destStore, machine, step, build->maxSilentTime, build->buildTimeout, result);
    [13.4376]
    [13.2415]
    buildRemote(destStore, machine, step, maxSilentTime, buildTimeout, result);
  • edit in src/hydra-queue-runner/builder.cc at line 165
    [7.36]
    [13.4597]
    } catch (__cxxabiv1::__forced_unwind & e) {
    /* The queue monitor thread cancelled this step. */
    try {
    printInfo("marking step %d of build %d as cancelled", stepNr, buildId);
    pqxx::work txn(*conn);
    finishBuildStep(txn, result.startTime, time(0), result.overhead, buildId,
    stepNr, machine->sshName, bsCancelled, "");
    txn.commit();
    stepFinished = true;
    } catch (...) {
    ignoreException();
    }
    throw;
  • replacement in src/hydra-queue-runner/builder.cc at line 204
    [11.149][11.149:225]()
    logCompressorQueue_->push({build->id, stepNr, result.logFile});
    [11.149]
    [13.4984]
    logCompressorQueue_->push({buildId, stepNr, result.logFile});
  • replacement in src/hydra-queue-runner/builder.cc at line 224
    [13.5536][13.434:530]()
    finishBuildStep(txn, result.startTime, result.stopTime, result.overhead, build->id,
    [13.5536]
    [8.0]
    finishBuildStep(txn, result.startTime, result.stopTime, result.overhead, buildId,
  • replacement in src/hydra-queue-runner/builder.cc at line 280
    [13.620][13.173:242]()
    build->id, stepNr, machine->sshName, bsSuccess);
    [13.620]
    [13.7478]
    buildId, stepNr, machine->sshName, bsSuccess);
  • replacement in src/hydra-queue-runner/builder.cc at line 284
    [9.131][13.2665:2748](),[13.7519][13.2665:2748]()
    markSucceededBuild(txn, b, res, build != b || result.isCached,
    [9.131]
    [13.7622]
    markSucceededBuild(txn, b, res, buildId != b->id || result.isCached,
  • replacement in src/hydra-queue-runner/builder.cc at line 377
    [13.3181][13.3181:3266]()
    (result.stepStatus != bsCachedFailure && build == build2) ||
    [13.3181]
    [13.11736]
    (result.stepStatus != bsCachedFailure && buildId == build2->id) ||
  • replacement in src/hydra-queue-runner/builder.cc at line 380
    [13.11816][13.11816:11892](),[13.11892][13.3267:3361]()
    createBuildStep(txn, 0, build2, step, machine->sshName,
    result.stepStatus, result.errorMsg, build == build2 ? 0 : build->id);
    [13.11816]
    [13.11984]
    createBuildStep(txn, 0, build2->id, step, machine->sshName,
    result.stepStatus, result.errorMsg, buildId == build2->id ? 0 : buildId);
  • replacement in src/hydra-queue-runner/builder.cc at line 387
    [13.784][13.3421:3519]()
    build->id, stepNr, machine->sshName, result.stepStatus, result.errorMsg);
    [13.784]
    [12.478]
    buildId, stepNr, machine->sshName, result.stepStatus, result.errorMsg);
  • replacement in src/hydra-queue-runner/builder.cc at line 429
    [13.14020][11.342:468]()
    notificationSenderQueue_->push(NotificationItem{NotificationItem::Type::BuildFinished, build->id, dependentIDs});
    [13.14020]
    [13.14107]
    notificationSenderQueue_->push(NotificationItem{NotificationItem::Type::BuildFinished, buildId, dependentIDs});
  • replacement in src/hydra-queue-runner/hydra-queue-runner.cc at line 227
    [13.8075][11.469:540]()
    unsigned int State::allocBuildStep(pqxx::work & txn, Build::ptr build)
    [13.8075]
    [13.8254]
    unsigned int State::allocBuildStep(pqxx::work & txn, BuildID buildId)
  • replacement in src/hydra-queue-runner/hydra-queue-runner.cc at line 233
    [13.198][13.8256:8363](),[13.8256][13.8256:8363]()
    auto res = txn.parameterized("select max(stepnr) from BuildSteps where build = $1")(build->id).exec();
    [13.198]
    [13.63]
    auto res = txn.parameterized("select max(stepnr) from BuildSteps where build = $1")(buildId).exec();
  • replacement in src/hydra-queue-runner/hydra-queue-runner.cc at line 238
    [13.129][11.541:647]()
    unsigned int State::createBuildStep(pqxx::work & txn, time_t startTime, Build::ptr build, Step::ptr step,
    [13.129]
    [13.452]
    unsigned int State::createBuildStep(pqxx::work & txn, time_t startTime, BuildID buildId, Step::ptr step,
  • replacement in src/hydra-queue-runner/hydra-queue-runner.cc at line 241
    [13.339][11.648:702]()
    unsigned int stepNr = allocBuildStep(txn, build);
    [13.339]
    [13.8431]
    auto stepNr = allocBuildStep(txn, buildId);
  • replacement in src/hydra-queue-runner/hydra-queue-runner.cc at line 245
    [13.578][13.385:405]()
    (build->id)
    [13.578]
    [13.405]
    (buildId)
  • replacement in src/hydra-queue-runner/hydra-queue-runner.cc at line 261
    [13.9164][13.9164:9238]()
    (build->id)(stepNr)(output.first)(output.second.path).exec();
    [13.9164]
    [13.9238]
    (buildId)(stepNr)(output.first)(output.second.path).exec();
  • replacement in src/hydra-queue-runner/hydra-queue-runner.cc at line 287
    [13.691][13.691:736]()
    int stepNr = allocBuildStep(txn, build);
    [13.691]
    [13.736]
    auto stepNr = allocBuildStep(txn, build->id);
  • replacement in src/hydra-queue-runner/hydra-queue-runner.cc at line 577
    [13.1115][13.1115:1166]()
    root.attr("nrActiveSteps", nrActiveSteps);
    [13.1115]
    [13.1166]
    root.attr("nrActiveSteps", activeSteps_.lock()->size());
  • edit in src/hydra-queue-runner/queue-monitor.cc at line 4
    [13.1279]
    [13.1280]
    #include <cstring>
  • replacement in src/hydra-queue-runner/queue-monitor.cc at line 185
    [10.1858][10.1858:1956]()
    createBuildStep(txn, 0, build, ex.step, "", bsCachedFailure, "", propagatedFrom);
    [10.1858]
    [10.1956]
    createBuildStep(txn, 0, build->id, ex.step, "", bsCachedFailure, "", propagatedFrom);
  • replacement in src/hydra-queue-runner/queue-monitor.cc at line 317
    [13.29011][13.29011:29044]()
    auto builds_(builds.lock());
    [13.29011]
    [13.29044]
    {
    auto builds_(builds.lock());
  • replacement in src/hydra-queue-runner/queue-monitor.cc at line 320
    [13.29045][13.29045:29106](),[13.29106][13.1591:1672](),[13.1672][13.29167:29360](),[13.29167][13.29167:29360](),[13.29360][13.1673:1695]()
    for (auto i = builds_->begin(); i != builds_->end(); ) {
    auto b = currentIds.find(i->first);
    if (b == currentIds.end()) {
    printMsg(lvlInfo, format("discarding cancelled build %1%") % i->first);
    i = builds_->erase(i);
    // FIXME: ideally we would interrupt active build steps here.
    continue;
    [13.29045]
    [13.1695]
    for (auto i = builds_->begin(); i != builds_->end(); ) {
    auto b = currentIds.find(i->first);
    if (b == currentIds.end()) {
    printMsg(lvlInfo, format("discarding cancelled build %1%") % i->first);
    i = builds_->erase(i);
    // FIXME: ideally we would interrupt active build steps here.
    continue;
    }
    if (i->second->globalPriority < b->second) {
    printMsg(lvlInfo, format("priority of build %1% increased") % i->first);
    i->second->globalPriority = b->second;
    i->second->propagatePriorities();
    }
    ++i;
  • replacement in src/hydra-queue-runner/queue-monitor.cc at line 335
    [13.1705][13.1705:1940]()
    if (i->second->globalPriority < b->second) {
    printMsg(lvlInfo, format("priority of build %1% increased") % i->first);
    i->second->globalPriority = b->second;
    i->second->propagatePriorities();
    [13.1705]
    [13.1940]
    }
    {
    auto activeSteps(activeSteps_.lock());
    for (auto & activeStep : *activeSteps) {
    auto threadId = activeStep->threadId; // FIXME: use Sync or atomic?
    if (threadId == 0) continue;
    std::set<Build::ptr> dependents;
    std::set<Step::ptr> steps;
    getDependents(activeStep->step, dependents, steps);
    if (!dependents.empty()) continue;
    printInfo("cancelling thread for build step ‘%s’", activeStep->step->drvPath);
    int err = pthread_cancel(threadId);
    if (err)
    printError("error cancelling thread for build step ‘%s’: %s",
    activeStep->step->drvPath, strerror(err));
  • edit in src/hydra-queue-runner/queue-monitor.cc at line 355
    [13.1950][13.1950:1963]()
    ++i;
  • edit in src/hydra-queue-runner/state.hh at line 31
    [13.1660]
    [13.912]
    bsCancelled = 4,
  • edit in src/hydra-queue-runner/state.hh at line 300
    [13.5630][13.5630:5660]()
    counter nrActiveSteps{0};
  • edit in src/hydra-queue-runner/state.hh at line 373
    [13.2908]
    [13.2908]
    pthread_t threadId = 0;
    bool cancelled = false;
  • edit in src/hydra-queue-runner/state.hh at line 378
    [13.3027]
    [2.1480]
    nix::Sync<std::set<std::shared_ptr<MachineReservation>>> activeSteps_;
  • replacement in src/hydra-queue-runner/state.hh at line 420
    [13.3075][11.3251:3320]()
    unsigned int allocBuildStep(pqxx::work & txn, Build::ptr build);
    [13.3075]
    [13.7026]
    unsigned int allocBuildStep(pqxx::work & txn, BuildID buildId);
  • replacement in src/hydra-queue-runner/state.hh at line 422
    [13.7027][11.3321:3424]()
    unsigned int createBuildStep(pqxx::work & txn, time_t startTime, Build::ptr build, Step::ptr step,
    [13.7027]
    [13.1029]
    unsigned int createBuildStep(pqxx::work & txn, time_t startTime, BuildID buildId, Step::ptr step,
  • replacement in src/root/build.tt at line 26
    [13.53][13.11819:11941](),[3.85][13.11819:11941](),[13.2133][13.11819:11941](),[13.11819][13.11819:11941]()
    [% IF ( type == "All" ) || ( type == "Failed" && step.status != 0 ) || ( type == "Running" && step.busy == 1 ) %]
    [3.85]
    [13.54]
    [% IF ( type == "All" ) || ( type == "Failed" && step.busy == 0 && step.status != 0 ) || ( type == "Running" && step.busy == 1 ) %]
  • replacement in src/root/build.tt at line 52
    [13.932][6.0:240]()
    <td>[% IF step.busy == 1 || ((step.machine || step.starttime) && (step.status == 0 || step.status == 1 || step.status == 3 || step.status == 7)); INCLUDE renderMachineName machine=step.machine; ELSE; "<em>n/a</em>"; END %]</td>
    [13.932]
    [13.985]
    <td>[% IF step.busy == 1 || ((step.machine || step.starttime) && (step.status == 0 || step.status == 1 || step.status == 3 || step.status == 4 || step.status == 7)); INCLUDE renderMachineName machine=step.machine; ELSE; "<em>n/a</em>"; END %]</td>
  • edit in src/root/build.tt at line 60
    [53.6911]
    [54.0]
    [% ELSIF step.status == 4 %]
    <span class="error">Cancelled</span>
  • replacement in src/root/build.tt at line 247
    [13.70][13.2322:2395]()
    [% IF steps && build.buildstatus != 0 && build.buildstatus != 6 %]
    [13.70]
    [13.5185]
    [% IF steps && build.buildstatus != 0 && build.buildstatus != 4 && build.buildstatus != 6 %]
  • replacement in src/sql/hydra.sql at line 206
    [13.1560][13.1560:1637]()
    -- 4 = build cancelled (removed from queue; never built) [builds only]
    [13.1560]
    [13.1637]
    -- 4 = build or step cancelled
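
The cancellation path in this change (register the builder thread in activeSteps_, have the queue monitor call pthread_cancel, and catch __forced_unwind in the builder so the step can be recorded as cancelled) can be illustrated in isolation. The following is a minimal sketch, not the Hydra code: ActiveStep, ScopeExit, cancelAllActiveSteps and runCancellationDemo are hypothetical names, a plain std::mutex stands in for nix::Sync, and every registered step is cancelled rather than only those without dependents. It assumes Linux with glibc and libstdc++, where thread cancellation is implemented as forced stack unwinding.

```cpp
#include <pthread.h>
#include <unistd.h>
#include <cxxabi.h>   // abi::__forced_unwind (glibc/libstdc++ specific)
#include <atomic>
#include <cassert>
#include <functional>
#include <memory>
#include <mutex>
#include <set>
#include <utility>

// Hypothetical stand-in for MachineReservation: just enough to find the thread.
struct ActiveStep {
    pthread_t threadId{};
    std::atomic<bool> started{false};
};

// Registry of running steps; the change itself uses nix::Sync<std::set<...>>.
static std::mutex activeMutex;
static std::set<std::shared_ptr<ActiveStep>> activeSteps;

// Set by the worker when it observes its own cancellation.
static std::atomic<bool> sawCancellation{false};

// Minimal scope guard, standing in for the Finally helper used in the change.
struct ScopeExit {
    std::function<void()> fun;
    explicit ScopeExit(std::function<void()> f) : fun(std::move(f)) {}
    ~ScopeExit() { fun(); }
};

static void * workerMain(void * arg)
{
    // Copy the shared_ptr; the creator keeps its own copy alive until join.
    std::shared_ptr<ActiveStep> step =
        *static_cast<std::shared_ptr<ActiveStep> *>(arg);

    step->threadId = pthread_self();
    {
        std::lock_guard<std::mutex> lock(activeMutex);
        activeSteps.insert(step);
    }
    // Runs on normal exit and during forced unwinding alike.
    ScopeExit unregister([&]() {
        std::lock_guard<std::mutex> lock(activeMutex);
        activeSteps.erase(step);
    });
    step->started = true;

    try {
        for (;;) sleep(1);   // sleep() is a cancellation point
    } catch (abi::__forced_unwind &) {
        // This is where a real builder would record the step as cancelled.
        sawCancellation = true;
        throw;               // forced unwinding must be allowed to continue
    }
    return nullptr;
}

// Plays the queue monitor's role: cancel every registered step.
static int cancelAllActiveSteps()
{
    std::lock_guard<std::mutex> lock(activeMutex);
    int cancelled = 0;
    for (auto & step : activeSteps)
        if (pthread_cancel(step->threadId) == 0) ++cancelled;
    return cancelled;
}

// Starts one worker, cancels it, and reports whether cleanup happened.
bool runCancellationDemo()
{
    auto step = std::make_shared<ActiveStep>();
    pthread_t tid;
    if (pthread_create(&tid, nullptr, workerMain, &step) != 0) return false;

    while (!step->started) usleep(1000);   // wait until the step is registered

    int cancelled = cancelAllActiveSteps();

    void * res = nullptr;
    int rc = pthread_join(tid, &res);
    assert(rc == 0);

    std::lock_guard<std::mutex> lock(activeMutex);
    return cancelled == 1 && res == PTHREAD_CANCELED
        && sawCancellation && activeSteps.empty();
}
```

Calling runCancellationDemo() from main (compiled with -pthread) should return true when the worker was cancelled, its catch block ran, and the scope guard removed it from the registry. Rethrowing inside the handler matters: swallowing __forced_unwind makes glibc abort the process, which is why the hunk at line 165 ends with a bare throw.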