HybridHarness

Security

Summary

HybridHarness was a pytest-based testing framework developed to validate a file-based XML API used by a proprietary secure data transfer system. The project ran for five weeks and followed a graybox testing methodology: API documentation was available, but source code was not. In addition to functional testing, the project included security analysis of the underlying communication protocols through packet capture analysis, reverse engineering, and cryptographic assessment.

The framework was delivered alongside a penetration testing report, documentation suite, and a listener program for multi-device testing scenarios.

Context

The target system was a secure data transfer platform designed for enterprise environments. It used a proprietary application-layer protocol over TCP in place of TLS, and a custom authentication mechanism for node access. The system employed a zero-knowledge server architecture in which intermediary relay servers could not access data contents, and it used layered encryption for data containers.

Third-party applications interacted with the system through a file-based XML API: requests were written as XML files to monitored directories, a local client application processed them and communicated with network infrastructure, and responses appeared in output directories. Each request-response pair was tracked by a unique GUID.

HybridHarness was built to test this API as a third-party integrator would, writing XML requests, polling for responses, and validating behavior.

Architecture

The framework followed a modular architecture separating concerns into core infrastructure, request/response handling, test fixtures, and optional components.

Core Components

The RequestWriter generated XML request files from parameterized templates and wrote them to the appropriate request directories, supporting both serial and asynchronous modes.
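A minimal sketch of what such a writer might look like, using only the standard library. The element and attribute names (Request, guid, action), the one-file-per-GUID naming, and the write-then-rename step are illustrative assumptions, not the actual API schema or the framework's implementation. The rename assumes the client only picks up files ending in .xml.

```python
import uuid
import xml.etree.ElementTree as ET
from pathlib import Path


def write_request(request_dir: Path, action: str, params: dict) -> str:
    """Write one XML request file and return the GUID that identifies it."""
    guid = str(uuid.uuid4())
    # Hypothetical schema: the real API's element names are not public.
    root = ET.Element("Request", {"guid": guid, "action": action})
    for name, value in params.items():
        ET.SubElement(root, name).text = str(value)

    request_dir.mkdir(parents=True, exist_ok=True)
    final_path = request_dir / f"{guid}.xml"
    tmp_path = request_dir / f"{guid}.xml.tmp"
    # Write under a temporary name, then rename, so a directory watcher
    # is less likely to read a half-written file (assumption, see lead-in).
    ET.ElementTree(root).write(str(tmp_path), encoding="utf-8", xml_declaration=True)
    tmp_path.replace(final_path)
    return guid
```

The returned GUID is what a test would later hand to the response reader to find the matching reply.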

The ResponseReader monitored the response, notification, and error directories, polling files in FIFO order by modification time and matching responses to requests by GUID. It parsed each response file as XML and extracted the status, error messages, warnings, and response data into a structured result object that tests asserted against. Instead of fixed timeouts, the reader polled in a loop gated on the target application’s process health: a test kept waiting as long as the process was running and failed immediately on a crash, which removed the need for arbitrary timeout values.
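The polling loop described above can be sketched roughly as follows. This is an illustration under assumptions, not the framework's code: the guid attribute on the root element is hypothetical, and the health check is passed in as a plain callable (in the real framework that role was played by the psutil-based ProcessMonitor).

```python
import time
import xml.etree.ElementTree as ET
from pathlib import Path
from typing import Callable, Iterable


class TargetCrashed(RuntimeError):
    """Raised when the target process dies while a response is pending."""


def wait_for_response(
    guid: str,
    watch_dirs: Iterable[Path],
    is_alive: Callable[[], bool],
    poll_interval: float = 0.2,
) -> ET.Element:
    """Poll until a file whose root carries `guid` appears, or the target dies."""
    dirs = list(watch_dirs)
    seen = set()
    while True:
        candidates = [p for d in dirs if d.is_dir() for p in d.glob("*.xml")]
        # Oldest file first: FIFO ordering by modification time.
        for path in sorted(candidates, key=lambda p: p.stat().st_mtime):
            if path in seen:
                continue
            try:
                root = ET.parse(str(path)).getroot()
            except ET.ParseError:
                continue  # possibly still being written; retry on next pass
            seen.add(path)
            if root.get("guid") == guid:  # hypothetical attribute name
                return root
        # No fixed timeout: keep waiting only while the target is healthy.
        if not is_alive():
            raise TargetCrashed(f"target exited while waiting for {guid}")
        time.sleep(poll_interval)
```

Checking health after each scan, rather than before it, means a response written just before a crash is still picked up.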

The TestEnvironment managed the folder structure, file tracking, and cleanup lifecycle.

The ProcessMonitor used psutil to detect if the target application crashed or stopped running during test execution. On crash detection, the entire test session halted immediately to prevent cascading failures.

The PacketCapture component was a pytest plugin that integrated tshark (the Wireshark CLI) to automatically record network traffic during test execution. It launched a tshark subprocess before each test, filtered to traffic between the target application and the configured server address, and stopped the capture after the test completed. Captures could be saved as pcap files for deep inspection in the Wireshark GUI or as CSV files for scripted analysis with grep, awk, or spreadsheet tools. Two capture modes were supported: multi-file mode created one capture per test case to isolate test-specific traffic, while single-file mode recorded the entire session into one file for cross-test pattern analysis. At session end, the plugin wrote a summary file listing each capture with its size, duration, and associated test. The plugin degraded gracefully when tshark was unavailable, allowing tests to proceed without capture. Permission handling was documented for both the wireshark group approach and setcap on dumpcap.
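As a rough illustration of the capture step, the sketch below builds a tshark command line and skips capture when the binary is missing. The function names, the chosen CSV fields, and the use of a simple host capture filter are assumptions; the flags themselves (-i, -f, -w, -T fields, -E, -e) are standard tshark options.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List, Optional

# Columns for CSV output; an illustrative selection of Wireshark field names.
CSV_FIELDS = ["frame.time_relative", "ip.src", "ip.dst",
              "tcp.srcport", "tcp.dstport", "frame.len"]


def build_tshark_command(interface: str, server_ip: str,
                         pcap_path: Optional[Path] = None) -> List[str]:
    """Return argv for a pcap capture, or for CSV-on-stdout if no path is given."""
    # -i selects the interface; -f applies a BPF capture filter.
    cmd = ["tshark", "-i", interface, "-f", f"host {server_ip}"]
    if pcap_path is not None:
        cmd += ["-w", str(pcap_path)]
    else:
        # -T fields prints the selected fields to stdout, one packet per line.
        cmd += ["-T", "fields", "-E", "separator=,", "-E", "header=y"]
        for field in CSV_FIELDS:
            cmd += ["-e", field]
    return cmd


def start_capture(interface: str, server_ip: str, out_path: Path,
                  as_csv: bool = False) -> Optional[subprocess.Popen]:
    """Launch tshark, or return None so tests can proceed without capture."""
    if shutil.which("tshark") is None:
        return None  # graceful degradation
    if as_csv:
        cmd = build_tshark_command(interface, server_ip)
        with open(out_path, "w") as sink:
            return subprocess.Popen(cmd, stdout=sink, stderr=subprocess.DEVNULL)
    cmd = build_tshark_command(interface, server_ip, out_path)
    return subprocess.Popen(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
```

A per-test fixture would call start_capture before the test body and, if a process was returned, call terminate() and wait() on it afterwards.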

The TestLogger was a pytest plugin that created timestamped session directories with per-test logs, optional XML archiving, and session summaries.

The XML Template System provided template files with parameterized builders for all supported API actions, enabling consistent request generation across the test suite.

Data Flow

Test Code → RequestWriter → XML file in request directory
                                    ↓
                        [Target application processes request]
                                    ↓
                        [ProcessMonitor checks application health]
                                    ↓
ResponseReader ← XML file in response/notification/error directory
                                    ↓
                        [PacketCapture records network traffic]
                                    ↓
                        [TestLogger archives XML files to logs/]
                                    ↓
                            Test Assertions

Configuration

The framework used YAML-based configuration with environment variable overrides. Configuration covered the target application’s working directory, polling intervals, logging levels, network connection parameters, packet capture settings, and fuzzing profiles. Sensitive values like server addresses and node identifiers were kept in a gitignored config file.
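One way environment overrides could be layered on top of a loaded configuration is sketched below. The HYBRIDHARNESS_ prefix, the SECTION__KEY naming scheme, and the section names are assumptions made for illustration; the dictionary stands in for whatever PyYAML's yaml.safe_load returned from the config file.

```python
import os
from typing import Any, Dict, Mapping, Optional

Config = Dict[str, Dict[str, Any]]


def apply_env_overrides(config: Config,
                        environ: Optional[Mapping[str, str]] = None,
                        prefix: str = "HYBRIDHARNESS_") -> Config:
    """Return a copy of `config` with PREFIX + SECTION__KEY variables applied."""
    environ = os.environ if environ is None else environ
    merged = {section: dict(values) for section, values in config.items()}
    for name, raw in environ.items():
        if not name.startswith(prefix):
            continue
        section, sep, key = name[len(prefix):].lower().partition("__")
        if not sep or section not in merged:
            continue  # ignore variables that do not map to a known section
        current = merged[section].get(key)
        # Coerce to the type of the existing value so numbers stay numbers.
        if isinstance(current, bool):
            merged[section][key] = raw.lower() in ("1", "true", "yes")
        elif isinstance(current, (int, float)):
            merged[section][key] = type(current)(raw)
        else:
            merged[section][key] = raw
    return merged
```

With this scheme, setting HYBRIDHARNESS_POLLING__INTERVAL=0.1 would override the interval key of a polling section without editing the file.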

Test Categories

Smoke Tests

Basic connectivity validation and infrastructure checks. These verified the testing environment was properly configured, the folder structure existed, and the target application could communicate with network infrastructure. This included the fundamental connectivity test that confirmed the API was operational.

Sunny Day Tests

Expected use cases covering standard workflows: partner management (invitation generation, acceptance), IoT operations (device discovery, pairing), data object transfers (upload, download, sharing), and messaging.

Error Handling Tests

Validated that the API responded appropriately to invalid inputs: malformed XML (missing prolog, unclosed tags), invalid parameters (non-existent partner IDs, unreachable addresses, out-of-range ports), and missing required fields.

Fuzzing Tests

Property-based testing using the Hypothesis library. Since source code was unavailable, coverage-guided fuzzing was not possible. The fuzzing harness instead took a blackbox approach: it generated inputs based on API specification analysis and wrote them to the target application’s request directories, then observed behavior by checking whether the application process was still alive. The core invariant under test was that the target application must never crash regardless of what XML it received.

The fuzzing harness was organized into three categories. XML structure fuzzing generated random text content, large files from 100KB to 5MB, and deeply nested structures from 10 to 200 levels, targeting parser robustness and resource limits. Field value fuzzing covered invalid GUIDs, extreme port numbers across the full int32 range, and random action names, probing input validation boundaries. Boundary condition fuzzing tested empty fields and whitespace-only values to expose missing validation on required fields, and issued 2 to 10 simultaneous writes to expose race conditions.
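The kinds of inputs listed above can be illustrated with a few small generators. This sketch deliberately uses the standard library's random module as a stand-in for Hypothesis strategies, so it shows the shape of the inputs rather than the framework's actual strategy module; all function names are hypothetical.

```python
import random

INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


def nested_xml(depth: int, tag: str = "n") -> str:
    """Build a well-formed document nested `depth` levels deep."""
    return f"<{tag}>" * depth + f"</{tag}>" * depth


def invalid_guid(rng: random.Random) -> str:
    """Return a string that is close to, but never, a canonical GUID."""
    hexes = "".join(rng.choice("0123456789abcdef") for _ in range(32))
    canonical = f"{hexes[:8]}-{hexes[8:12]}-{hexes[12:16]}-{hexes[16:20]}-{hexes[20:]}"
    mutation = rng.choice(["truncate", "bad_char", "too_long", "empty"])
    if mutation == "truncate":
        return canonical[: rng.randint(1, 35)]
    if mutation == "bad_char":
        i = rng.randrange(len(canonical))
        return canonical[:i] + "Z" + canonical[i + 1:]
    if mutation == "too_long":
        return canonical + "-" + hexes[:4]
    return ""


def extreme_port(rng: random.Random) -> int:
    """Pick an int32 value, biased toward boundaries of the valid port range."""
    edges = [INT32_MIN, -1, 0, 65535, 65536, INT32_MAX]
    if rng.random() < 0.5:
        return rng.choice(edges)
    return rng.randint(INT32_MIN, INT32_MAX)
```

A seeded random.Random makes a run reproducible, which matters when the only observable outcome is whether the target process survived a given input.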

Fuzzing intensity was controlled through four YAML-configurable profiles. The quick profile ran 20 examples per test and completed in around 5 to 10 minutes, suitable for development iteration. The standard profile ran 100 examples in 20 to 30 minutes for regular CI runs. The thorough profile ran 500 examples over 1 to 2 hours for pre-release validation. The continuous profile ran 10,000 examples over hours or days, intended for overnight bug-hunting campaigns.
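The four profiles amount to a mapping from profile name to example count. The sketch below shows that mapping with a small lookup helper; the names FUZZ_PROFILES and max_examples_for are hypothetical. In a Hypothesis-based suite, each value would typically be passed as max_examples to settings.register_profile and selected with settings.load_profile.

```python
# Example counts per profile, taken from the description above.
FUZZ_PROFILES = {
    "quick": 20,           # development iteration
    "standard": 100,       # regular CI runs
    "thorough": 500,       # pre-release validation
    "continuous": 10_000,  # overnight bug-hunting campaigns
}


def max_examples_for(profile: str) -> int:
    """Look up the example budget for a profile, rejecting unknown names."""
    try:
        return FUZZ_PROFILES[profile]
    except KeyError:
        known = ", ".join(sorted(FUZZ_PROFILES))
        raise ValueError(
            f"unknown fuzz profile {profile!r}; expected one of: {known}"
        ) from None
```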

Because fuzzing generated large volumes of temporary files and could destabilize the target application, the harness included several safety mechanisms. A disk space monitor checked available space before each test and skipped execution if free space dropped below 500MB. The ProcessMonitor integration detected target application crashes after each fuzzed input and halted the entire session immediately, preventing cascading failures and preserving the input that triggered the crash for reproduction. Aggressive per-test cleanup removed all generated XML files from request, response, and error directories after each example. Per-example deadlines were tuned to local processing time only (not network round-trips), since fuzzed inputs used random addresses and never reached backend servers.
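Two of these safety mechanisms, the disk space check and the per-example cleanup, are simple enough to sketch with the standard library. The helper names are hypothetical, and the 500MB threshold is interpreted here as 500 MiB, which is an assumption.

```python
import shutil
from pathlib import Path
from typing import Iterable

MIN_FREE_BYTES = 500 * 1024 * 1024  # 500MB threshold, read as MiB (assumption)


def has_enough_disk(path: str, min_free: int = MIN_FREE_BYTES) -> bool:
    """True if the filesystem holding `path` has at least `min_free` bytes free."""
    return shutil.disk_usage(path).free >= min_free


def cleanup_xml(dirs: Iterable[Path]) -> int:
    """Delete every *.xml file in the given directories; return how many went."""
    removed = 0
    for directory in dirs:
        for xml_file in Path(directory).glob("*.xml"):
            xml_file.unlink()
            removed += 1
    return removed
```

In a pytest suite, a fixture could call has_enough_disk before each fuzz test and invoke pytest.skip when it returns False, then call cleanup_xml on the request, response, and error directories after each example.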

Vulnerability Tests

Security-focused tests for XML bombs (billion laughs), XXE injection, path traversal via file naming, and resource exhaustion scenarios. These were written to validate that the XML parser handled adversarial input securely.
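For readers unfamiliar with these attack classes, the sketch below builds the classic payloads as strings. It only constructs them; it never parses them, since feeding them to an unprotected parser is exactly the hazard being tested. The function names, the default file path, and the sample file names are illustrative, and the exact payloads used in the project are not public.

```python
def billion_laughs(levels: int = 9, fanout: int = 10) -> str:
    """Build a nested-entity XML bomb: each entity expands to `fanout` of the last."""
    entities = ['<!ENTITY lol0 "lol">']
    for i in range(1, levels + 1):
        refs = f"&lol{i - 1};" * fanout
        entities.append(f'<!ENTITY lol{i} "{refs}">')
    dtd = "\n  ".join(entities)
    return (
        '<?xml version="1.0"?>\n'
        f"<!DOCTYPE lolz [\n  {dtd}\n]>\n"
        f"<lolz>&lol{levels};</lolz>"
    )


def expanded_size(levels: int, fanout: int, base_len: int = 3) -> int:
    """Characters produced if a parser fully expands the top-level entity."""
    return base_len * fanout ** levels


def xxe_payload(path: str = "/etc/passwd") -> str:
    """Build an external-entity payload that asks the parser to read a local file."""
    return (
        '<?xml version="1.0"?>\n'
        f'<!DOCTYPE r [<!ENTITY xxe SYSTEM "file://{path}">]>\n'
        "<r>&xxe;</r>"
    )


# Request file names that try to escape the monitored directory.
TRAVERSAL_NAMES = ["../escape.xml", "..\\..\\escape.xml", "a/../../escape.xml"]
```

With the defaults, a payload of under a kilobyte would expand to three billion characters in a parser that resolves entities without limits, which is why the test only needs to observe whether the target survives.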

Integration Tests

End-to-end workflows requiring multiple steps, such as full partner lifecycle or complete data transfer workflows.

Two-Device Testing

Many API workflows required two registered clients communicating through the network. To support this, a standalone listener program was developed.

The listener ran on a second device with its own instance of the target application. It monitored incoming notification and request directories, parsed XML files in FIFO order, and dispatched events to configurable action handlers for partner invitation acceptance, IoT request acceptance, data object download, and message acknowledgment.

The listener supported both automatic response mode for automated integration testing and manual/passive mode for debugging and observation. It tracked processed files to avoid duplicates and organized all received files into timestamped log directories.
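The listener's core loop can be sketched as a handler registry plus a set of already-processed file names. The class below is an illustration under assumptions: the type attribute used to route events and the handler signature are hypothetical, and the real listener also archived files into timestamped log directories, which is omitted here.

```python
import xml.etree.ElementTree as ET
from pathlib import Path
from typing import Callable, Dict, List, Set

Handler = Callable[[ET.Element], str]


class Listener:
    def __init__(self, inbox: Path, handlers: Dict[str, Handler],
                 auto_respond: bool = True):
        self.inbox = inbox
        self.handlers = handlers
        self.auto_respond = auto_respond  # False gives passive observation
        self.processed: Set[str] = set()

    def poll_once(self) -> List[str]:
        """Process unseen files oldest-first; return one log line per file."""
        log: List[str] = []
        files = sorted(self.inbox.glob("*.xml"),
                       key=lambda p: (p.stat().st_mtime, p.name))
        for path in files:
            if path.name in self.processed:
                continue  # avoid handling the same file twice
            self.processed.add(path.name)
            try:
                root = ET.parse(str(path)).getroot()
            except ET.ParseError:
                log.append(f"{path.name}: unparseable")
                continue
            event = root.get("type", "unknown")  # hypothetical attribute
            handler = self.handlers.get(event)
            if self.auto_respond and handler is not None:
                log.append(f"{path.name}: {handler(root)}")
            else:
                log.append(f"{path.name}: observed {event}")
        return log
```

In automatic mode a handler would write the corresponding acceptance or acknowledgment request for the second device's client; in passive mode the same events are only logged.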

This enabled real end-to-end testing over actual network infrastructure without manual coordination between testers.

Security Analysis

Security analysis was conducted alongside framework development using multiple methodologies.

Packet capture analysis recorded traffic across 51 test runs over an 83-minute session. Analysis identified connection models, header structures, and authentication token patterns. Reverse engineering through static analysis of the target application binary identified cryptographic implementations and key management approaches. A cryptographic protocol assessment compared the proprietary protocol design against TLS 1.3 and industry best practices, evaluating trust establishment, forward secrecy, message authentication, cipher strength, and key management. STRIDE threat modeling systematically identified feasible attack targets given the testing methodology constraints.

Findings were scored using CVSS v3.1 and documented across severity levels. High severity findings concerned remote denial-of-service via crash conditions. Medium severity findings involved authentication predictability, protocol design choices, and local DoS. The root cause shared across DoS findings was insufficient input validation before parsing.
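For reference, CVSS v3.1 maps base scores to qualitative severity ratings using fixed bands. The helper below encodes that mapping; the function name is hypothetical and no score from the confidential report is reproduced here.

```python
def cvss31_severity(score: float) -> str:
    """Map a CVSS v3.1 base score (0.0 to 10.0) to its qualitative rating."""
    if not 0.0 <= score <= 10.0:
        raise ValueError(f"CVSS score out of range: {score}")
    if score == 0.0:
        return "None"
    if score < 4.0:
        return "Low"       # 0.1 to 3.9
    if score < 7.0:
        return "Medium"    # 4.0 to 6.9
    if score < 9.0:
        return "High"      # 7.0 to 8.9
    return "Critical"      # 9.0 to 10.0
```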

The target system proved more resilient than expected for an API still under heavy development. The application remained stable under heavy load, even when run under the Wine compatibility layer rather than natively on Windows.

All detailed findings were delivered under NDA restrictions.

Deliverables

Technical Deliverables

The primary deliverable was the HybridHarness testing framework itself, a pytest-based, modular, cross-platform harness supporting both Wine on Linux and native Windows execution. At its core were the RequestWriter and ResponseReader, which together abstracted the file-based XML API into a programmatic interface: the writer generated requests from parameterized templates covering all documented API actions, and the reader polled output directories, parsed XML responses, and returned structured result objects for test assertions. The listener program provided an automated second client for multi-device integration testing of workflows that required two registered participants on the network.

The packet capture integration delivered a full pytest plugin around tshark that recorded per-test or per-session network traffic in pcap or CSV format, with automatic session summaries, graceful degradation when tshark was unavailable, and documented permission setup. The captures collected during the project (51 test runs over 83 minutes) were the primary input for the protocol-level security analysis.

The fuzzing infrastructure delivered a complete blackbox fuzzing harness built on the Hypothesis property-based testing library. It included a strategy module with reusable input generators (malformed URLs, extreme port ranges, invalid GUIDs, special characters), four YAML-configurable intensity profiles from quick development checks to multi-day continuous runs, and integrated safety mechanisms for disk space monitoring, crash detection with input preservation, and aggressive cleanup.

Documentation Deliverables

The documentation suite consisted of a README with quick start guide, installation instructions, configuration reference, and test execution examples; a development guide covering architecture, code organization, and patterns for extending the framework; a fuzzing guide explaining methodology, profile selection, result interpretation, and adding new fuzz tests; a listener guide for two-device setup and supported workflows; and a packet capture guide covering tshark integration and capture analysis. A confidential penetration testing report with CVSS-scored findings and reproduction steps was delivered separately, along with a final report covering project overview, methodology, analysis, timeline, and lessons learned.

Tools and Technologies

Tool	Purpose
Python 3.8+	Framework implementation
pytest	Test execution, fixtures, markers, plugins
Hypothesis	Property-based fuzzing
lxml	XML parsing and validation
psutil	Process health monitoring and crash detection
PyYAML	Configuration management
tshark/Wireshark	Packet capture and protocol analysis
Wine	Cross-platform execution of Windows target on Linux
Git	Version control

Key Design Decisions

The framework monitored target application process health rather than using fixed timeouts for response waiting. Tests continued indefinitely while the process ran and failed immediately on crash. This eliminated arbitrary timeout tuning and provided clear failure diagnostics.

Without source code access, fuzzing targeted the local client application rather than backend servers. Fuzzed inputs used random or malformed addresses that never reached production infrastructure. The invariant tested was “the application never crashes regardless of input.”

Response files were preserved after test runs by default for inspection and debugging. An optional --teardown flag enabled automatic cleanup for development workflows.

All tunables including fuzzing intensity, capture settings, and connection parameters were externalized to YAML configuration, allowing different configurations per environment without code changes.

Lessons Learned

Graybox testing of proprietary systems required creative combination of multiple analysis techniques, including packet capture, reverse engineering, and behavioral observation, to compensate for limited visibility. Property-based testing effectively discovered edge cases that manual test design missed, but required careful strategy design when coverage guidance was unavailable. Security analysis of custom protocols highlighted the value of open specifications and independent review; even well-intentioned security-through-obscurity approaches could introduce vulnerabilities that open review would have identified.