C.J. Collier: The WWW::Mechanize::Chrome Saga: A Comprehensive Narrative of PR #104
This document synthesizes the extensive work performed from March
13th to March 20th, 2026, to harden, stabilize, and refactor the
WWW::Mechanize::Chrome library and its test suite. This
effort involved deep dives into asynchronous programming,
platform-specific bug hunting, and strategic architectural
decisions.
Part I: The Quest for Cross-Platform Stability (March 13 – 16)
The initial phase of work focused on achieving a “green” test suite
across a variety of Linux distributions and preparing for a new release.
This involved significant hardening of the library to account for
different browser versions, OS-level security restrictions, and
filesystem differences.
Key Milestones & Engineering Decisions:
- Fedora & RHEL-family Success: A major effort was undertaken to achieve a 100% pass rate on modern Fedora 43 and CentOS Stream 10. This required several key engineering decisions to handle modern browser behavior:
  - Decision: Implement an Asynchronous DOM Serialization Fallback. Synchronous fallbacks in an async context are dangerous. To prevent `Resource was not cached` errors during `saveResources`, we implemented a fully asynchronous fallback in `_saveResourceTree`. By chaining `_cached_document` with `DOM.getOuterHTML` messages, we can reconstruct document content without blocking the event loop, even if Chromium has evicted the resource from its cache. This approach also proved resilient against Fedora’s security policies, which often block `file://` access.
  - Decision: Truncate Filenames for Cross-Platform Safety. To avoid `File name too long` errors, especially on Windows where the `MAX_PATH` limit is 260 characters, `filenameFromUrl` was hardened. The filename truncation limit was reduced to a more conservative 150 characters, leaving ample headroom for deeply nested CI temporary directories. Logic was also added to preserve file extensions during truncation and to sanitize backslashes from URI paths.
  - Decision: Expand Browser Discovery Paths. To support RHEL-based systems out of the box, `default_executable_names` was expanded to include `headless_shell`, and search paths were updated to include `/usr/lib64/chromium-browser/`.
  - Decision: Mitigate Race Conditions with Stabilization Waits and Resilient Fetching. On fast systems, `DOM.documentUpdated` events could invalidate node IDs immediately after navigation, causing XPath queries to fail with “Could not find node with given id”. A small stabilization `sleep` (0.25 s) was added after page loads to ensure the DOM has settled. Furthermore, the asynchronous DOM fetching loop was hardened to handle these errors gracefully by catching protocol errors and returning an empty string for any node invalidated during serialization, ensuring the overall process could complete.
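The truncation decision can be sketched in isolation. The helper below is illustrative only (the real `filenameFromUrl` handles full URLs and further edge cases); it assumes a bare filename, caps it at 150 characters, preserves the extension, and sanitizes backslashes:

```perl
use strict;
use warnings;
use File::Basename qw(fileparse);

# Hypothetical sketch of extension-preserving truncation; the actual
# filenameFromUrl logic in WWW::Mechanize::Chrome differs in detail.
sub truncate_filename {
    my ($name, $limit) = @_;
    $limit //= 150;                      # conservative cap, well below MAX_PATH
    $name =~ tr{\\}{/};                  # sanitize backslashes from URI paths
    return $name if length($name) <= $limit;
    my ($stem, undef, $ext) = fileparse($name, qr/\.[^.]*$/);
    my $keep = $limit - length($ext);
    $keep = 1 if $keep < 1;              # always keep at least one stem character
    return substr($stem, 0, $keep) . $ext;
}
```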
- Windows Hardening:
  - Decision: Adopt Platform-Aware Watchdogs. The test suite’s reliance on `ualarm` was a blocker for Windows, where it is not implemented. The `t::helper::set_watchdog` function was refactored to use the standard `alarm()` (seconds) on Windows and `ualarm` (microseconds) on Unix-like systems, enabling consistent test-level timeout enforcement.
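A minimal sketch of that dispatch, assuming the usual `$^O` check for Windows (the real `t::helper::set_watchdog` also installs a `$SIG{ALRM}` handler that fails the test):

```perl
use strict;
use warnings;
use Time::HiRes ();

# Hypothetical sketch of the platform-aware watchdog dispatch.
sub set_watchdog {
    my ($seconds) = @_;
    if ($^O =~ /MSWin32/i) {
        # Windows perl has no ualarm(); fall back to whole-second alarm().
        alarm(int($seconds + 0.999));
        return 'alarm';
    }
    else {
        # Unix-likes get sub-second resolution via Time::HiRes::ualarm().
        Time::HiRes::ualarm(int($seconds * 1_000_000));
        return 'ualarm';
    }
}

sub clear_watchdog {
    $^O =~ /MSWin32/i ? alarm(0) : Time::HiRes::ualarm(0);
}
```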
- Version 0.77 Release:
  - Decision: Adopt an SOP for Version Synchronization. The project maintains duplicate version strings across 24+ files. A Standard Operating Procedure was adopted: use a batch-replacement tool to update all sub-modules in `lib/`, and always run `make clean` and `perl Makefile.PL` so that `META.json` and `META.yml` reflect the new version. After achieving stability on Linux, the project version was bumped to 0.77.
- Infrastructure & Strategic Work:
  - The `ad2` Windows Server 2025 instance was restored and optimized, with Active Directory demoted and disk I/O performance improved.
  - A strategic proposal for the Heterogeneous Directory Replication Protocol (HDRP) was drafted and published.
Part II: The Great Async Refactor (March 17 – 18)
Despite success on Linux, tests on the slow `ad2` Windows host were still plagued by intermittent, indefinite hangs. This triggered a fundamental architectural shift: moving the library’s core from a mix of synchronous and asynchronous code to a fully non-blocking internal API.
Key Milestones & Engineering Decisions:
- Decision: Expose a `_future` API. Instead of hardcoding timeouts in the library, the core strategy was to refactor all blocking methods (`xpath`, `field`, `get`, etc.) into thin wrappers around new non-blocking `..._future` counterparts. This moved timeout management to the test harness, allowing flexible and explicit handling of stalls.
- Decision: Centralize Test Hardening in a Helper. A dedicated test library, `t/lib/t/helper.pm`, was created to contain all stabilization logic. “Safe” wrappers (`safe_get`, `safe_xpath`) were implemented there, using `Future->wait_any` to race asynchronous operations against a timeout, preventing tests from hanging.

  ```perl
  # Example test helper implementation
  sub safe_xpath {
      my ($mech, $query, %options) = @_;
      my $timeout = delete $options{timeout} || 5;

      my $call_f    = $mech->xpath_future($query, %options);
      my $timeout_f = $mech->sleep_future($timeout)
                           ->then(sub { Future->fail("Timeout") });

      return Future->wait_any($call_f, $timeout_f)->get;
  }
  ```
- Decision: Refactor the Node Attribute Cache. Investigation into flaky checkbox tests (`t/50-tick.t`) revealed that `WWW::Mechanize::Chrome::Node` was storing attributes as a flat list (`[key, val, key, val]`), which was inefficient for lookups and individual updates. The cache was refactored to use a HashRef, providing O(1) lookups and enabling atomic dual updates in which both the browser property (via JS) and the internal library attribute are synchronized simultaneously.
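A rough sketch of the cache shape, with invented helper names (the actual `WWW::Mechanize::Chrome::Node` accessors differ):

```perl
use strict;
use warnings;

# Convert the old flat attribute list into the HashRef cache described above.
sub attributes_to_hashref {
    my (@flat) = @_;                 # CDP-style [key, val, key, val, ...]
    die "odd attribute list" if @flat % 2;
    my %attrs = @flat;
    return \%attrs;                  # O(1) lookup per attribute name
}

# Dual update: change the cached value; the real library would also push
# the change to the live browser property via JS in the same step.
sub set_cached_attribute {
    my ($node, $name, $value) = @_;
    $node->{attributes}{$name} = $value;
    # ... CDP/JS call to update the browser would go here ...
}
```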
- Decision: Implement a Self-Cancelling Socket Watchdog. On Windows, traditional watchdog processes often failed to detect parent termination, leading to 60-second hangs after successful tests. We implemented a new socket-based watchdog in `t::helper` that listens on an ephemeral port; the background process terminates immediately when the parent’s socket closes, eliminating these cumulative delays.
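The mechanism can be sketched roughly as follows on Unix-like systems; names and details are illustrative, and the real `t::helper` watchdog differs (Windows relies on perl’s fork emulation):

```perl
use strict;
use warnings;
use IO::Socket::INET;
use POSIX ();

# Hypothetical sketch: the parent holds one end of a loopback connection;
# the watchdog child blocks on sysread. When the parent exits (or closes
# its end), sysread sees EOF and the child stands down immediately instead
# of sleeping out a fixed timeout. If the deadline passes first, the child
# kills the hung parent.
sub spawn_socket_watchdog {
    my (%args) = @_;
    my $timeout = $args{timeout} // 60;

    my $listener = IO::Socket::INET->new(
        LocalAddr => '127.0.0.1',
        LocalPort => 0,                       # ephemeral port
        Listen    => 1,
    ) or die "listen: $!";

    my $pid = fork() // die "fork: $!";
    if ($pid == 0) {
        # Child: connect back to the parent, then wait for EOF or timeout.
        my $conn = IO::Socket::INET->new(
            PeerAddr => '127.0.0.1',
            PeerPort => $listener->sockport,
        ) or POSIX::_exit(0);
        close $listener;
        my $buf;
        eval {
            local $SIG{ALRM} = sub { die "watchdog timeout\n" };
            alarm $timeout;
            sysread $conn, $buf, 1;           # blocks until parent closes
            alarm 0;
        };
        kill 'KILL', getppid() if $@;         # parent hung past the deadline
        POSIX::_exit(0);                      # parent exited cleanly: stand down
    }

    my $conn = $listener->accept or die "accept: $!";
    return { pid => $pid, keepalive => $conn };
}
```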
- Decision: Deep Recursive Refactoring & Form Selection. To make the API truly non-blocking, the entire internal call stack had to be refactored. For example, making `get_set_value_future` non-blocking required first making its dependency, `_field_by_name`, asynchronous. This culminated in refactoring the entire form selection API (`form_name`, `form_id`, etc.) to use the new asynchronous `_future` lookups, a key step in mitigating the Windows deadlocks.
- Decision: Fix Critical Regressions & Memory Cycles.
  - Evaluation Normalization: Implemented a `_process_eval_result` helper to centralize the parsing of results from `Runtime.evaluate`. This ensures consistent handling of return values and exceptions between synchronous (`eval_in_page`) and asynchronous (`eval_future`) calls.
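A simplified sketch of what such a normalization helper might look like; the real `_process_eval_result` handles more CDP result types:

```perl
use strict;
use warnings;

# Hypothetical normalizer for Runtime.evaluate payloads: JS exceptions are
# reported out-of-band in exceptionDetails, primitives arrive inline in
# result->{value}, and anything else is a remote object reference.
sub process_eval_result {
    my ($payload) = @_;
    if (my $ex = $payload->{exceptionDetails}) {
        die(($ex->{exception}{description} // $ex->{text} // 'JS exception') . "\n");
    }
    my $r = $payload->{result} // {};
    return undef       if ($r->{type} // '') eq 'undefined';
    return $r->{value} if exists $r->{value};
    return $r;
}
```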
  - Memory Cycle Mitigation: A significant memory leak was discovered: closures attached to CDP event futures (such as those for asynchronous body retrieval) would capture strong references to `$self` and the `$response` object, creating a circular reference. The established rule is now to always call `Scalar::Util::weaken` on `$self` and any other relevant objects before they are used inside a `->then` block that is stored on an object.
  - Context Propagation (`wantarray`): A major regression was discovered in which Perl’s `wantarray` context, which distinguishes between scalar and list context, was lost inside asynchronous `Future->then` blocks. This caused methods like `xpath` to return incorrect results (e.g., a count instead of a list of nodes). The solution was to adopt the “Async Context Pattern”: capture `wantarray` in the synchronous wrapper, pass it as an option to the `_future` method, and then use that captured value inside the future’s final resolution block.

    ```perl
    # Synchronous Wrapper
    sub xpath($self, $query, %options) {
        $options{ wantarray } = wantarray;                  # 1. Capture
        return $self->xpath_future($query, %options)->get;  # 2. Pass
    }

    # Asynchronous Implementation
    sub xpath_future($self, $query, %options) {
        my $wantarray = delete $options{ wantarray };       # 3. Retrieve
        # ... async logic ...
        return $doc->then(sub {
            if ($wantarray) {                               # 4. Respect
                return Future->done(@results);
            }
            else {
                return Future->done($results[0]);
            }
        });
    }
    ```
  - Asynchronous Body Retrieval & Robust Content Fallbacks: Fixed a bug in which `decoded_content()` would return empty strings, by ensuring it awaited a `__body_future`. This was implemented by storing the retrieval future directly on the response object (`$response->{__body_future}`). To make this more robust, a tiered strategy was implemented: first try to get the content from the network response; if that fails (e.g., for `about:blank` or due to cache eviction), fall back to a JavaScript `XMLSerializer` to get the live DOM content.
  - Signature Hardening: Fixed “Too few arguments” errors when using modern Perl signatures with `Future->then`. Callbacks were updated to use optional parameters (`sub ($result = undef) { ... }`) to gracefully handle futures that resolve with no value.
  - XHTML “Split-Brain” Bug: Resolved a long-standing Chromium bug (40130141) in which content provided via `setDocumentContent` is parsed differently than content loaded from a URL. A workaround was implemented: for XHTML documents, WMC now uses JavaScript-based XPath evaluation (`document.evaluate`) against the live DOM, bypassing the broken CDP search mechanism.
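The memory-cycle rule described above can be illustrated with a minimal, self-contained example (class and method names are invented):

```perl
use strict;
use warnings;
use Scalar::Util qw(weaken);

package Tab;
our $destroyed = 0;
sub new { bless {}, shift }

sub arm {
    my ($self) = @_;
    my $weak_self = $self;
    weaken $weak_self;               # break the $self -> closure -> $self cycle
    # A callback stored on the object must not hold a strong $self reference.
    $self->{on_event} = sub {
        return $weak_self ? $weak_self->name : undef;
    };
}

sub name    { 'tab-1' }
sub DESTROY { $destroyed++ }

package main;
{
    my $tab = Tab->new;
    $tab->arm;
    # Without weaken(), the closure stored in on_event would keep $tab
    # alive here and DESTROY would never run.
}
```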
Derived Architectural Rules & SOPs:
- Rule: Always provide `_future` variants. Every library method that interacts with the browser via CDP must have a non-blocking asynchronous counterpart.
- Rule: Centralize stabilization in the test layer. All timeout and retry logic should reside in the test harness (`t/lib/t/helper.pm`), not in the core library.
- Rule: Explicitly propagate `wantarray` context. Synchronous wrappers must capture the caller’s context and pass it down the `Future` chain to ensure correct scalar/list behavior.
- Rule: The entire call chain must be asynchronous. Non-blocking timeouts only work when the whole chain is non-blocking: even a single “hidden” blocking call in an otherwise asynchronous method will cause a stall.
- SOP: Reduce Library Noise. Diagnostic messages (`warn`, `note`, `diag`) should be removed from library code before commits. All such messages should be converted to the internal `$self->log('debug', ...)` mechanism, ensuring clean TAP output for CI systems.
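The noise-reduction SOP can be sketched with a toy level-gated logger (a hypothetical stand-in; the library’s actual `log` mechanism differs):

```perl
use strict;
use warnings;

package MyMech;

# Hypothetical level-gated logger: library code calls $self->log('debug', ...)
# instead of warn/note/diag, and messages are emitted only when the level is
# enabled, keeping TAP output clean by default.
my %level = (error => 0, warn => 1, info => 2, debug => 3);

sub new {
    my ($class, %args) = @_;
    return bless {
        log_level => $args{log_level} // 'warn',
        lines     => [],
    }, $class;
}

sub log {
    my ($self, $lvl, @msg) = @_;
    return if $level{$lvl} > $level{ $self->{log_level} };
    push @{ $self->{lines} }, "[$lvl] @msg";  # real code would print to a handle
}

package main;

my $quiet = MyMech->new;                       # default level: warn
$quiet->log(debug => 'DOM settled');           # suppressed

my $loud = MyMech->new(log_level => 'debug');
$loud->log(debug => 'DOM settled');            # recorded
```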
Part III: The MutationObserver Saga (March 19)
With most of the library refactored to be asynchronous, one stubborn test, `t/65-is_visible.t`, continued to fail with timeouts. This led to an ambitious, but ultimately unsuccessful, attempt to replace the `wait_until_visible` polling logic with a more “modern” `MutationObserver`.
Key Milestones & Challenges:
- The Theory: The goal was to replace an inefficient `repeat { sleep }` loop with an event-driven `MutationObserver` in JavaScript that would notify Perl immediately when an element’s visibility changed.
- Implementation & Cascade Failure: The implementation proved incredibly difficult and introduced a series of new, hard-to-diagnose bugs:
  - An incorrect function signature for `callFunctionOn_future`.
  - A critical unit mismatch: seconds were passed from Perl to JavaScript’s `setTimeout`, which expects milliseconds.
  - A fundamental hang in which the `MutationObserver`’s JavaScript `Promise` would never resolve, even after the underlying DOM element changed.
- Debugging Maze: Multiple attempts to fix the `checkVisibility` JavaScript logic inside the observer callback, including making it more robust with DOM tree traversal and extensive `console.log` tracing, failed to resolve the hang. This highlighted the opacity and difficulty of debugging complex, cross-language asynchronous interactions, especially with low-level browser APIs.
Procedural Learning: Granular Edits
The effort was plagued by procedural missteps in using automated
file-editing tools. Initial attempts to replace large code blocks in a
single operation led to accidental code loss and match failures.
- Decision: Adopt a “Delete, then Add” Workflow. Following forceful user correction, a new SOP was established for all future modifications:
  - Isolate: Break the file into small, manageable chunks (e.g., 250 lines).
  - Delete: Perform a “delete” operation by replacing the old code block with an empty string.
  - Add: Perform an “add” operation by inserting the new code into the empty space.
  - Verify: Verify each atomic step before proceeding.

  This granular process, while slower, ensured surgical precision and regained technical control over the large `Chrome.pm` module.
The consistent failure of the MutationObserver approach
eventually led to the decision to abandon it in favor of stabilizing the
original, more transparent implementation.
Part IV: Reversion and Final Stabilization (March 20)
After exhausting all reasonable attempts to fix the
MutationObserver, a strategic decision was made to revert
to the simpler, more transparent polling implementation and fix it
correctly. This proved to be the correct path to a stable solution.
Key Milestones & Engineering Decisions:
- Decision: Perform a Strategic Reversion. The `MutationObserver` implementation, when integrated via `callFunctionOn_future` with `awaitPromise`, proved fundamentally unstable: its JavaScript promise would consistently fail to resolve, causing indefinite hangs. A decision was made to revert all `MutationObserver` code from `WWW::Mechanize::Chrome.pm` and restore the original `repeat { sleep }` polling mechanism. A stable, understandable solution was prioritized over an elegant but broken one.
- Decision: Correct Timeout Delegation in the Harness. The root cause of the original timeout failure was identified as a race condition in the `t/lib/t/helper.pm` test harness. The `safe_wait_until_*` wrappers implemented their own timeout (via `wait_any` and `sleep_future`) that raced against the underlying polling function’s internal timeout, leading to intermittent failures on slow machines. The helpers were refactored to delegate all timeout management to the library’s polling functions, ensuring a single, authoritative timer controlled the operation.
- Decision: Optimize Polling Performance. At the user’s request, the polling interval was reduced from 300 ms to 150 ms. This modest performance improvement reduced the test suite’s wall-clock execution time by over a second while maintaining stability.
- Decision: Tune Test Watchdogs. The global watchdog timeout was adjusted to 12 seconds, calculated as 1.5x the observed real execution time of the optimized test. This provides a data-driven safety margin for CI.
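The “single, authoritative timer” shape can be sketched as follows; names are illustrative rather than the library’s actual API:

```perl
use strict;
use warnings;
use Time::HiRes qw(time sleep);

# The polling function owns the one and only deadline; harness wrappers
# simply pass a timeout through instead of racing a second timer.
sub wait_until {
    my ($predicate, %opt) = @_;
    my $timeout  = $opt{timeout}  // 10;
    my $interval = $opt{interval} // 0.15;   # the optimized 150 ms interval
    my $deadline = time() + $timeout;
    while (1) {
        return 1 if $predicate->();
        die "wait_until: timed out after ${timeout}s\n" if time() >= $deadline;
        sleep $interval;
    }
}

# Harness wrapper: no wait_any/sleep_future race, just delegation.
sub safe_wait_until {
    my ($predicate, %opt) = @_;
    return wait_until($predicate, timeout => $opt{timeout} // 10);
}
```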
Part V: The Last Bug – A Platform-Specific Memory Leak (March 20)
With all other tests passing, a single memory-leak failure in `t/78-memleak.t` persisted, but only in the Windows `ad2` environment. This required a different approach than the timeout fixes.
Key Milestones:
- The Bug: A strong reference cycle involving the `on_dialog` event listener was not being broken on Windows, despite multiple attempts to fix it. Fixes that worked on Linux (such as calling `on_dialog(undef)` in `DESTROY`) were not sufficient on the Windows host.
- The Diagnosis: The issue was determined to be a deep, platform-specific interaction between Perl’s garbage collector, the `IO::Async` event loop implementation on Windows, and the `Test::Memory::Cycle` module. The cycle report was identical on both platforms, but the cleanup behavior differed.
- Failed Attempts: A series of increasingly aggressive fixes were attempted to break the cycle, including:
  - Moving the `on_dialog(undef)` call from `close()` to `DESTROY()`.
  - Explicitly `delete`-ing the listener and callback properties from the object hash in `DESTROY`.
  - Swapping between `$self->remove_listener` and `$self->target->unlisten` in a mistaken attempt to find the correct un-registration method.
- Pragmatic Solution: After exhausting all reasonable code-level fixes without a resolution on Windows, the user opted to mark the failing test as a known issue for that specific platform.
- Final Fix: The single failing test in `t/78-memleak.t` was wrapped in a conditional `TODO` block that only takes effect on Windows (`if ($^O =~ /MSWin32/i)`), formally acknowledging the bug without blocking the build. This allows the test suite to pass in CI environments while flagging the issue for future, deeper investigation.
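The conditional `TODO` pattern looks roughly like this (simplified: a stand-in boolean replaces the real `Test::Memory::Cycle` check):

```perl
use strict;
use warnings;
use Test::More;

# Stand-in for the platform-specific leak check; the real test calls
# Test::Memory::Cycle's memory_cycle_ok on the mech object.
my $cycle_broken = $^O =~ /MSWin32/i ? 0 : 1;

TODO: {
    # Marking the test TODO on Windows lets it fail there without
    # failing the build, while still running and passing elsewhere.
    local $TODO = 'on_dialog reference cycle persists on Windows'
        if $^O =~ /MSWin32/i;
    ok $cycle_broken, 'no reference cycles after close()';
}

done_testing();
```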
Part VI: CI Hardening (March 20)
A final failure in the GitHub Actions CI environment revealed one
last configuration flaw.
Key Milestones:
- The Bug: The CI was running `prove --nocount --jobs 3 -I local/ -bl xt t` directly. This command was missing the crucial `-It/lib` include path, which test files need in order to locate the `t::helper` module. As a result, nearly all tests failed with `Can't locate t/helper.pm in @INC`.
- The Investigation: An analysis of `Makefile.PL` revealed a custom `MY::test` block specifically designed to inject the `-It/lib` flag into the `make test` command. This confirmed that `make test` is the correct, canonical way to run the test suite for this project.
- The Fix: The `.github/workflows/linux.yml` file was modified to replace the direct `prove` call with `make test` in the `Run Tests` step. This ensures the CI environment runs the tests exactly as a local developer would, with all necessary include paths correctly configured by the project’s build system.
Final Outcome
After this long and arduous journey, the `WWW::Mechanize::Chrome` test suite is now stable and passing on all targeted platforms, with known platform-specific issues clearly documented in the code. The project is in a vastly more robust and reliable state.
