C.J. Collier: The WWW::Mechanize::Chrome Saga: A Comprehensive Narrative of PR #104
This document synthesizes the extensive work performed from March
13th to March 20th, 2026, to harden, stabilize, and refactor the
WWW::Mechanize::Chrome library and its test suite. This
effort involved deep dives into asynchronous programming,
platform-specific bug hunting, and strategic architectural
decisions.
Part I: The Quest for Cross-Platform Stability (March 13 – 16)
The initial phase of work focused on achieving a “green” test suite
across a variety of Linux distributions and preparing for a new release.
This involved significant hardening of the library to account for
different browser versions, OS-level security restrictions, and
filesystem differences.
Key Milestones & Engineering Decisions:
- Fedora & RHEL-family Success: A major effort was undertaken to achieve a 100% pass rate on modern Fedora 43 and CentOS Stream 10. This required several key engineering decisions to handle modern browser behavior:
  - Decision: Implement an Asynchronous DOM Serialization Fallback. Synchronous fallbacks in an async context are dangerous. To prevent `Resource was not cached` errors during `saveResources`, we implemented a fully asynchronous fallback in `_saveResourceTree`. By chaining `_cached_document` with `DOM.getOuterHTML` messages, we can reconstruct document content without blocking the event loop, even if Chromium has evicted the resource from its cache. This approach also proved resilient against Fedora’s security policies, which often block `file://` access.
  - Decision: Truncate Filenames for Cross-Platform Safety. To avoid `File name too long` errors, especially on Windows where the `MAX_PATH` limit is 260 characters, `filenameFromUrl` was hardened. The filename truncation limit was reduced to a more conservative 150 characters, leaving ample headroom for deeply nested CI temporary directories. Logic was also added to preserve file extensions during truncation and to sanitize backslashes from URI paths.
  - Decision: Expand Browser Discovery Paths. To support RHEL-based systems out of the box, `default_executable_names` was expanded to include `headless_shell`, and search paths were updated to include `/usr/lib64/chromium-browser/`.
  - Decision: Mitigate Race Conditions with Stabilization Waits and Resilient Fetching. On fast systems, `DOM.documentUpdated` events could invalidate node IDs immediately after navigation, causing XPath queries to fail with “Could not find node with given id”. A small stabilization `sleep` (0.25 s) was added after page loads to ensure the DOM has settled. Furthermore, the asynchronous DOM fetching loop was hardened to handle these errors gracefully by catching protocol errors and returning an empty string for any node invalidated during serialization, ensuring the overall process could complete.
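The truncation decision can be sketched in isolation. The helper below is illustrative only (the real `filenameFromUrl` handles full URLs and further edge cases); it assumes a bare filename, caps it at 150 characters, preserves the extension, and sanitizes backslashes:

```perl
use strict;
use warnings;
use File::Basename qw(fileparse);

# Hypothetical sketch of extension-preserving truncation; the actual
# filenameFromUrl logic in WWW::Mechanize::Chrome differs in detail.
sub truncate_filename {
    my ($name, $limit) = @_;
    $limit //= 150;                      # conservative cap, well below MAX_PATH
    $name =~ tr{\\}{/};                  # sanitize backslashes from URI paths
    return $name if length($name) <= $limit;
    my ($stem, undef, $ext) = fileparse($name, qr/\.[^.]*$/);
    my $keep = $limit - length($ext);
    $keep = 1 if $keep < 1;              # always keep at least one stem character
    return substr($stem, 0, $keep) . $ext;
}
```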
- Windows Hardening:
  - Decision: Adopt Platform-Aware Watchdogs. The test suite’s reliance on `ualarm` was a blocker for Windows, where it is not implemented. The `t::helper::set_watchdog` function was refactored to use the standard `alarm()` (seconds) on Windows and `ualarm` (microseconds) on Unix-like systems, enabling consistent test-level timeout enforcement.
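A minimal sketch of that dispatch, assuming the usual `$^O` check for Windows (the real `t::helper::set_watchdog` also installs a `$SIG{ALRM}` handler that fails the test):

```perl
use strict;
use warnings;
use Time::HiRes ();

# Hypothetical sketch of the platform-aware watchdog dispatch.
sub set_watchdog {
    my ($seconds) = @_;
    if ($^O =~ /MSWin32/i) {
        # Windows perl has no ualarm(); fall back to whole-second alarm().
        alarm(int($seconds + 0.999));
        return 'alarm';
    }
    else {
        # Unix-likes get sub-second resolution via Time::HiRes::ualarm().
        Time::HiRes::ualarm(int($seconds * 1_000_000));
        return 'ualarm';
    }
}

sub clear_watchdog {
    $^O =~ /MSWin32/i ? alarm(0) : Time::HiRes::ualarm(0);
}
```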
- Version 0.77 Release:
  - Decision: Adopt an SOP for Version Synchronization. The project maintains duplicate version strings across 24+ files. A Standard Operating Procedure was adopted: use a batch-replacement tool to update all sub-modules in `lib/`, and always run `make clean` and `perl Makefile.PL` so that `META.json` and `META.yml` reflect the new version. After achieving stability on Linux, the project version was bumped to 0.77.
- Infrastructure & Strategic Work:
  - The `ad2` Windows Server 2025 instance was restored and optimized, with Active Directory demoted and disk I/O performance improved.
  - A strategic proposal for the Heterogeneous Directory Replication Protocol (HDRP) was drafted and published.
Part II: The Great Async Refactor (March 17 – 18)
Despite success on Linux, tests on the slow `ad2` Windows host were still plagued by intermittent, indefinite hangs. This triggered a fundamental architectural shift: moving the library’s core from a mix of synchronous and asynchronous code to a fully non-blocking internal API.
Key Milestones & Engineering Decisions:
- Decision: Expose a `_future` API. Instead of hardcoding timeouts in the library, the core strategy was to refactor all blocking methods (`xpath`, `field`, `get`, etc.) into thin wrappers around new non-blocking `..._future` counterparts. This moved timeout management to the test harness, allowing flexible and explicit handling of stalls.
- Decision: Centralize Test Hardening in a Helper. A dedicated test library, `t/lib/t/helper.pm`, was created to contain all stabilization logic. “Safe” wrappers (`safe_get`, `safe_xpath`) were implemented there, using `Future->wait_any` to race asynchronous operations against a timeout, preventing tests from hanging.

  ```perl
  # Example test helper implementation
  sub safe_xpath {
      my ($mech, $query, %options) = @_;
      my $timeout = delete $options{timeout} || 5;

      my $call_f    = $mech->xpath_future($query, %options);
      my $timeout_f = $mech->sleep_future($timeout)
                           ->then(sub { Future->fail("Timeout") });

      return Future->wait_any($call_f, $timeout_f)->get;
  }
  ```
- Decision: Refactor the Node Attribute Cache. Investigation into flaky checkbox tests (`t/50-tick.t`) revealed that `WWW::Mechanize::Chrome::Node` was storing attributes as a flat list (`[key, val, key, val]`), which was inefficient for lookups and individual updates. The cache was refactored to use a HashRef, providing O(1) lookups and enabling atomic dual updates in which both the browser property (via JS) and the internal library attribute are synchronized simultaneously.
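A rough sketch of the cache shape, with invented helper names (the actual `WWW::Mechanize::Chrome::Node` accessors differ):

```perl
use strict;
use warnings;

# Convert the old flat attribute list into the HashRef cache described above.
sub attributes_to_hashref {
    my (@flat) = @_;                 # CDP-style [key, val, key, val, ...]
    die "odd attribute list" if @flat % 2;
    my %attrs = @flat;
    return \%attrs;                  # O(1) lookup per attribute name
}

# Dual update: change the cached value; the real library would also push
# the change to the live browser property via JS in the same step.
sub set_cached_attribute {
    my ($node, $name, $value) = @_;
    $node->{attributes}{$name} = $value;
    # ... CDP/JS call to update the browser would go here ...
}
```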
- Decision: Implement a Self-Cancelling Socket Watchdog. On Windows, traditional watchdog processes often failed to detect parent termination, leading to 60-second hangs after successful tests. We implemented a new socket-based watchdog in `t::helper` that listens on an ephemeral port; the background process terminates immediately when the parent’s socket closes, eliminating these cumulative delays.
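The mechanism can be sketched roughly as follows on Unix-like systems; names and details are illustrative, and the real `t::helper` watchdog differs (Windows relies on perl’s fork emulation):

```perl
use strict;
use warnings;
use IO::Socket::INET;
use POSIX ();

# Hypothetical sketch: the parent holds one end of a loopback connection;
# the watchdog child blocks on sysread. When the parent exits (or closes
# its end), sysread sees EOF and the child stands down immediately instead
# of sleeping out a fixed timeout. If the deadline passes first, the child
# kills the hung parent.
sub spawn_socket_watchdog {
    my (%args) = @_;
    my $timeout = $args{timeout} // 60;

    my $listener = IO::Socket::INET->new(
        LocalAddr => '127.0.0.1',
        LocalPort => 0,                       # ephemeral port
        Listen    => 1,
    ) or die "listen: $!";

    my $pid = fork() // die "fork: $!";
    if ($pid == 0) {
        # Child: connect back to the parent, then wait for EOF or timeout.
        my $conn = IO::Socket::INET->new(
            PeerAddr => '127.0.0.1',
            PeerPort => $listener->sockport,
        ) or POSIX::_exit(0);
        close $listener;
        my $buf;
        eval {
            local $SIG{ALRM} = sub { die "watchdog timeout\n" };
            alarm $timeout;
            sysread $conn, $buf, 1;           # blocks until parent closes
            alarm 0;
        };
        kill 'KILL', getppid() if $@;         # parent hung past the deadline
        POSIX::_exit(0);                      # parent exited cleanly: stand down
    }

    my $conn = $listener->accept or die "accept: $!";
    return { pid => $pid, keepalive => $conn };
}
```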
- Decision: Deep Recursive Refactoring & Form Selection. To make the API truly non-blocking, the entire internal call stack had to be refactored. For example, making `get_set_value_future` non-blocking required first making its dependency, `_field_by_name`, asynchronous. This culminated in refactoring the entire form selection API (`form_name`, `form_id`, etc.) to use the new asynchronous `_future` lookups, a key step in mitigating the Windows deadlocks.
- Decision: Fix Critical Regressions & Memory Cycles.
  - Evaluation Normalization: Implemented a `_process_eval_result` helper to centralize the parsing of results from `Runtime.evaluate`. This ensures consistent handling of return values and exceptions between synchronous (`eval_in_page`) and asynchronous (`eval_future`) calls.
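A simplified sketch of what such a normalization helper might look like; the real `_process_eval_result` handles more CDP result types:

```perl
use strict;
use warnings;

# Hypothetical normalizer for Runtime.evaluate payloads: JS exceptions are
# reported out-of-band in exceptionDetails, primitives arrive inline in
# result->{value}, and anything else is a remote object reference.
sub process_eval_result {
    my ($payload) = @_;
    if (my $ex = $payload->{exceptionDetails}) {
        die(($ex->{exception}{description} // $ex->{text} // 'JS exception') . "\n");
    }
    my $r = $payload->{result} // {};
    return undef       if ($r->{type} // '') eq 'undefined';
    return $r->{value} if exists $r->{value};
    return $r;
}
```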
  - Memory Cycle Mitigation: A significant memory leak was discovered: closures attached to CDP event futures (such as those for asynchronous body retrieval) would capture strong references to `$self` and the `$response` object, creating a circular reference. The established rule is now to always call `Scalar::Util::weaken` on `$self` and any other relevant objects before they are used inside a `->then` block that is stored on an object.
  - Context Propagation (`wantarray`): A major regression was discovered in which Perl’s `wantarray` context, which distinguishes between scalar and list context, was lost inside asynchronous `Future->then` blocks. This caused methods like `xpath` to return incorrect results (e.g., a count instead of a list of nodes). The solution was to adopt the “Async Context Pattern”: capture `wantarray` in the synchronous wrapper, pass it as an option to the `_future` method, and then use that captured value inside the future’s final resolution block.

    ```perl
    # Synchronous Wrapper
    sub xpath($self, $query, %options) {
        $options{ wantarray } = wantarray;                  # 1. Capture
        return $self->xpath_future($query, %options)->get;  # 2. Pass
    }

    # Asynchronous Implementation
    sub xpath_future($self, $query, %options) {
        my $wantarray = delete $options{ wantarray };       # 3. Retrieve
        # ... async logic ...
        return $doc->then(sub {
            if ($wantarray) {                               # 4. Respect
                return Future->done(@results);
            }
            else {
                return Future->done($results[0]);
            }
        });
    }
    ```
  - Asynchronous Body Retrieval & Robust Content Fallbacks: Fixed a bug in which `decoded_content()` would return empty strings, by ensuring it awaited a `__body_future`. This was implemented by storing the retrieval future directly on the response object (`$response->{__body_future}`). To make this more robust, a tiered strategy was implemented: first try to get the content from the network response; if that fails (e.g., for `about:blank` or due to cache eviction), fall back to a JavaScript `XMLSerializer` to get the live DOM content.
  - Signature Hardening: Fixed “Too few arguments” errors when using modern Perl signatures with `Future->then`. Callbacks were updated to use optional parameters (`sub ($result = undef) { ... }`) to gracefully handle futures that resolve with no value.
  - XHTML “Split-Brain” Bug: Resolved a long-standing Chromium bug (40130141) in which content provided via `setDocumentContent` is parsed differently than content loaded from a URL. A workaround was implemented: for XHTML documents, WMC now uses JavaScript-based XPath evaluation (`document.evaluate`) against the live DOM, bypassing the broken CDP search mechanism.
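The memory-cycle rule described above can be illustrated with a minimal, self-contained example (class and method names are invented):

```perl
use strict;
use warnings;
use Scalar::Util qw(weaken);

package Tab;
our $destroyed = 0;
sub new { bless {}, shift }

sub arm {
    my ($self) = @_;
    my $weak_self = $self;
    weaken $weak_self;               # break the $self -> closure -> $self cycle
    # A callback stored on the object must not hold a strong $self reference.
    $self->{on_event} = sub {
        return $weak_self ? $weak_self->name : undef;
    };
}

sub name    { 'tab-1' }
sub DESTROY { $destroyed++ }

package main;
{
    my $tab = Tab->new;
    $tab->arm;
    # Without weaken(), the closure stored in on_event would keep $tab
    # alive here and DESTROY would never run.
}
```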
Derived Architectural Rules & SOPs:
- Rule: Always provide `_future` variants. Every library method that interacts with the browser via CDP must have a non-blocking asynchronous counterpart.
- Rule: Centralize stabilization in the test layer. All timeout and retry logic should reside in the test harness (`t/lib/t/helper.pm`), not in the core library.
- Rule: Explicitly propagate `wantarray` context. Synchronous wrappers must capture the caller’s context and pass it down the `Future` chain to ensure correct scalar/list behavior.
- Rule: The entire call chain must be asynchronous. Non-blocking timeouts only work when the whole chain is non-blocking: even a single “hidden” blocking call in an otherwise asynchronous method will cause a stall.
- SOP: Reduce Library Noise. Diagnostic messages (`warn`, `note`, `diag`) should be removed from library code before commits. All such messages should be converted to the internal `$self->log('debug', ...)` mechanism, ensuring clean TAP output for CI systems.
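The noise-reduction SOP can be sketched with a toy level-gated logger (a hypothetical stand-in; the library’s actual `log` mechanism differs):

```perl
use strict;
use warnings;

package MyMech;

# Hypothetical level-gated logger: library code calls $self->log('debug', ...)
# instead of warn/note/diag, and messages are emitted only when the level is
# enabled, keeping TAP output clean by default.
my %level = (error => 0, warn => 1, info => 2, debug => 3);

sub new {
    my ($class, %args) = @_;
    return bless {
        log_level => $args{log_level} // 'warn',
        lines     => [],
    }, $class;
}

sub log {
    my ($self, $lvl, @msg) = @_;
    return if $level{$lvl} > $level{ $self->{log_level} };
    push @{ $self->{lines} }, "[$lvl] @msg";  # real code would print to a handle
}

package main;

my $quiet = MyMech->new;                       # default level: warn
$quiet->log(debug => 'DOM settled');           # suppressed

my $loud = MyMech->new(log_level => 'debug');
$loud->log(debug => 'DOM settled');            # recorded
```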
Part III: The MutationObserver Saga (March 19)
With most of the library refactored to be asynchronous, one stubborn test, `t/65-is_visible.t`, continued to fail with timeouts. This led to an ambitious, but ultimately unsuccessful, attempt to replace the `wait_until_visible` polling logic with a more “modern” `MutationObserver`.
Key Milestones & Challenges:
- The Theory: The goal was to replace an inefficient `repeat { sleep }` loop with an event-driven `MutationObserver` in JavaScript that would notify Perl immediately when an element’s visibility changed.
- Implementation & Cascade Failure: The implementation proved incredibly difficult and introduced a series of new, hard-to-diagnose bugs:
  - An incorrect function signature for `callFunctionOn_future`.
  - A critical unit mismatch: seconds were passed from Perl to JavaScript’s `setTimeout`, which expects milliseconds.
  - A fundamental hang in which the `MutationObserver`’s JavaScript `Promise` would never resolve, even after the underlying DOM element changed.
- Debugging Maze: Multiple attempts to fix the `checkVisibility` JavaScript logic inside the observer callback, including making it more robust with DOM tree traversal and extensive `console.log` tracing, failed to resolve the hang. This highlighted the opacity and difficulty of debugging complex, cross-language asynchronous interactions, especially with low-level browser APIs.
Procedural Learning: Granular Edits
The effort was plagued by procedural missteps in using automated
file-editing tools. Initial attempts to replace large code blocks in a
single operation led to accidental code loss and match failures.
- Decision: Adopt a “Delete, then Add” Workflow. Following forceful user correction, a new SOP was established for all future modifications:
  - Isolate: Break the file into small, manageable chunks (e.g., 250 lines).
  - Delete: Perform a “delete” operation by replacing the old code block with an empty string.
  - Add: Perform an “add” operation by inserting the new code into the empty space.
  - Verify: Verify each atomic step before proceeding.

  This granular process, while slower, ensured surgical precision and regained technical control over the large `Chrome.pm` module.
The consistent failure of the MutationObserver approach
eventually led to the decision to abandon it in favor of stabilizing the
original, more transparent implementation.
Part IV: Reversion and Final Stabilization (March 20)
After exhausting all reasonable attempts to fix the
MutationObserver, a strategic decision was made to revert
to the simpler, more transparent polling implementation and fix it
correctly. This proved to be the correct path to a stable solution.
Key Milestones & Engineering Decisions:
- Decision: Perform a Strategic Reversion. The `MutationObserver` implementation, when integrated via `callFunctionOn_future` with `awaitPromise`, proved fundamentally unstable: its JavaScript promise would consistently fail to resolve, causing indefinite hangs. A decision was made to revert all `MutationObserver` code from `WWW::Mechanize::Chrome.pm` and restore the original `repeat { sleep }` polling mechanism. A stable, understandable solution was prioritized over an elegant but broken one.
- Decision: Correct Timeout Delegation in the Harness. The root cause of the original timeout failure was identified as a race condition in the `t/lib/t/helper.pm` test harness. The `safe_wait_until_*` wrappers implemented their own timeout (via `wait_any` and `sleep_future`) that raced against the underlying polling function’s internal timeout, leading to intermittent failures on slow machines. The helpers were refactored to delegate all timeout management to the library’s polling functions, ensuring a single, authoritative timer controlled the operation.
- Decision: Optimize Polling Performance. At the user’s request, the polling interval was reduced from 300 ms to 150 ms. This modest performance improvement reduced the test suite’s wall-clock execution time by over a second while maintaining stability.
- Decision: Tune Test Watchdogs. The global watchdog timeout was adjusted to 12 seconds, calculated as 1.5x the observed real execution time of the optimized test. This provides a data-driven safety margin for CI.
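The “single, authoritative timer” shape can be sketched as follows; names are illustrative rather than the library’s actual API:

```perl
use strict;
use warnings;
use Time::HiRes qw(time sleep);

# The polling function owns the one and only deadline; harness wrappers
# simply pass a timeout through instead of racing a second timer.
sub wait_until {
    my ($predicate, %opt) = @_;
    my $timeout  = $opt{timeout}  // 10;
    my $interval = $opt{interval} // 0.15;   # the optimized 150 ms interval
    my $deadline = time() + $timeout;
    while (1) {
        return 1 if $predicate->();
        die "wait_until: timed out after ${timeout}s\n" if time() >= $deadline;
        sleep $interval;
    }
}

# Harness wrapper: no wait_any/sleep_future race, just delegation.
sub safe_wait_until {
    my ($predicate, %opt) = @_;
    return wait_until($predicate, timeout => $opt{timeout} // 10);
}
```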
Part V: The Last Bug – A Platform-Specific Memory Leak (March 20)
With all other tests passing, a single memory-leak failure in `t/78-memleak.t` persisted, but only in the Windows `ad2` environment. This required a different approach than the timeout fixes.
Key Milestones:
- The Bug: A strong reference cycle involving the `on_dialog` event listener was not being broken on Windows, despite multiple attempts to fix it. Fixes that worked on Linux (such as calling `on_dialog(undef)` in `DESTROY`) were not sufficient on the Windows host.
- The Diagnosis: The issue was determined to be a deep, platform-specific interaction between Perl’s garbage collector, the `IO::Async` event loop implementation on Windows, and the `Test::Memory::Cycle` module. The cycle report was identical on both platforms, but the cleanup behavior differed.
- Failed Attempts: A series of increasingly aggressive fixes were attempted to break the cycle, including:
  - Moving the `on_dialog(undef)` call from `close()` to `DESTROY()`.
  - Explicitly `delete`-ing the listener and callback properties from the object hash in `DESTROY`.
  - Swapping between `$self->remove_listener` and `$self->target->unlisten` in a mistaken attempt to find the correct un-registration method.
- Pragmatic Solution: After exhausting all reasonable code-level fixes without a resolution on Windows, the user opted to mark the failing test as a known issue for that specific platform.
- Final Fix: The single failing test in `t/78-memleak.t` was wrapped in a conditional `TODO` block that only takes effect on Windows (`if ($^O =~ /MSWin32/i)`), formally acknowledging the bug without blocking the build. This allows the test suite to pass in CI environments while flagging the issue for future, deeper investigation.
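The conditional `TODO` pattern looks roughly like this (simplified: a stand-in boolean replaces the real `Test::Memory::Cycle` check):

```perl
use strict;
use warnings;
use Test::More;

# Stand-in for the platform-specific leak check; the real test calls
# Test::Memory::Cycle's memory_cycle_ok on the mech object.
my $cycle_broken = $^O =~ /MSWin32/i ? 0 : 1;

TODO: {
    # Marking the test TODO on Windows lets it fail there without
    # failing the build, while still running and passing elsewhere.
    local $TODO = 'on_dialog reference cycle persists on Windows'
        if $^O =~ /MSWin32/i;
    ok $cycle_broken, 'no reference cycles after close()';
}

done_testing();
```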
Part VI: CI Hardening (March 20)
A final failure in the GitHub Actions CI environment revealed one
last configuration flaw.
Key Milestones:
- The Bug: The CI was running `prove --nocount --jobs 3 -I local/ -bl xt t` directly. This command was missing the crucial `-It/lib` include path, which test files need in order to locate the `t::helper` module. As a result, nearly all tests failed with `Can't locate t/helper.pm in @INC`.
- The Investigation: An analysis of `Makefile.PL` revealed a custom `MY::test` block specifically designed to inject the `-It/lib` flag into the `make test` command. This confirmed that `make test` is the correct, canonical way to run the test suite for this project.
- The Fix: The `.github/workflows/linux.yml` file was modified to replace the direct `prove` call with `make test` in the `Run Tests` step. This ensures the CI environment runs the tests exactly as a local developer would, with all necessary include paths correctly configured by the project’s build system.
Final Outcome
After this long and arduous journey, the `WWW::Mechanize::Chrome` test suite is now stable and passing on all targeted platforms, with known platform-specific issues clearly documented in the code. The project is in a vastly more robust and reliable state.
