Symmetric Inkball Alignment with Loopy Models

Alignment tasks generally seek to establish a spatial correspondence between two versions of a text, for example between a set of manuscript images and their transcript. This paper examines a different form of alignment problem, namely pixel-scale alignment between two renditions of a handwritten word or phrase. Using loopy inkball graph models, the proposed technique finds spatial correspondences between two text images such that similar parts map to each other. The method has applications to word spotting and signature verification, and can provide analytical tools for the study of handwriting variation.


I. INTRODUCTION
Alignment refers to the process of bringing two or more corresponding patterns into juxtaposition, such that areas of similar nature overlap. Within the realm of document analysis, one familiar form of this problem is the alignment of a document image with its transcript text. This is an example of a one-dimensional alignment problem, since the positions of the transcript characters are adjusted only along the horizontal dimension. One-dimensional sequence alignment has been well studied, and may be solved via algorithms such as dynamic time warping, hidden Markov models or connectionist temporal classification. This paper examines a different problem, namely flexible fine-grained alignment in two dimensions at once. We seek alignment at the pixel level between different versions of a handwritten word. Figure 1 shows an example of such an alignment, where each portion of one handwritten version of a word is mapped to a corresponding portion of a second. Some features of a word may be present in only one image; rather than generating spurious matches for these portions, they are flagged in red as unmatched components.

A. Related Work
Pixel-scale alignment of offline handwriting has received some attention, albeit less than other alignment problems and mostly for rigid or affine transformations. Alignment has long been significant for online signature matching [1]. In other contexts such as word spotting, pixel-scale alignment has been treated more as a means to an end than as a goal in its own right.
Some scholars have presented work which generates pixel alignment between handwriting images under a limited set of possible transformations. Manmatha et al. perform affine alignment of word images using the general cost-weighted bipartite matching algorithm of Scott and Longuet-Higgins [2], [3]. Learned-Miller also uses affine alignments of letter instance samples [4]. Work on alignment using flexible (e.g., non-affine) transformations is less common. Hassner et al. use an alignment based upon dynamic time warping to perform transcript alignment, with intermediate steps of the algorithm working at the pixel level [5]. Howe uses tree-structured inkball models as a tool for offline word spotting, and also to solve flexible keypoint alignment problems [6], [7]. Leung and Suen describe a pattern alignment algorithm similar in spirit to that proposed here [8]. It decomposes a signature into line and curve primitives, then matches them symmetrically via an iterative spatial warping process with a gradually decreasing neighborhood of influence. Fang et al. later apply this method to the signature verification task, not for directly measuring similarity but to generate synthetic training samples by interpolation [9].
This paper makes two novel contributions to research on handwriting alignment. First, it drops the requirement found in prior inkball work that all models of handwriting be structured as a tree. Since most handwriting does include loops and other cyclical structures, this allows for better handling of realistic common scenarios. Second, it searches for a match between the two structures in a symmetric bidirectional manner, whereas prior work matches in only one direction at a time. Symmetric bidirectional matching explicitly penalizes many-to-one matches, and provides a powerful corrective against many common errors seen in prior inkball work. Figure 2 shows an example.
The next section proposes a novel algorithm for symmetric flexible matching of handwriting keypoints. Section III describes experiments that explore its properties. The last section concludes by considering the meaning of the results and proposing future work.

II. ALGORITHM
What does it mean to align two images of writing at the pixel level, assuming that both images represent the same word or piece of text? For documents produced by typewriter a natural ground truth definition presents itself: ink pixels should align if they were produced by contact with the same portion of a given typewriter key. This concept may be expanded to all printed text in a single font, under the mild assumption that printed characters intended to be identical in shape are thus functionally interchangeable.
For handwritten characters, whose appearance varies with each instantiation, we can posit the existence of a canonical character form (analogous to a Platonic ideal). Each pixel in a particular written instance of a character may be attributed to an imperfect reproduction of some corresponding point on the ideal form. Pixel-scale alignment between two handwritten characters thus means pairing points from each such that both correspond to the same position on the ideal form.
In practice, ideal character forms are both hypothetical and unknown. Yet they provide conceptual guidance about the qualities of a good alignment between actual character examples: distinctive points marked by curvature, extremity, or juncture should correspond, while assignment of the less notable points between them should be smoothly continuous and evenly distributed. These considerations guide qualitative assessment of an alignment. Quantitative indicators can be derived from applying the alignment toward some other task, such as word spotting or signature verification.
With these considerations, an alignment of the ink skeleton, or even a finite set of keypoints evenly spaced along it, can stand in for a pixel-by-pixel alignment. Given an alignment of keypoints, interpolation between them extends the alignment to the entire skeleton. Points not on the skeleton may then be aligned by interpolation along the segment between their nearest skeleton point and the ink boundary. Figure 3 illustrates this process. In practice, a full alignment rarely requires computation because the keypoint alignment itself suffices for most purposes. Note that the alignment specified by a keypoint matching is more complex than warping and stretching of the 2D plane, since it allows the relative topology of the points to change.
Keypoint alignment lends itself to an inkball representation of handwriting, where written symbols are modeled as overlapping disks of ink arranged in a 2D spatial pattern. Each keypoint represents the center of a disk of ink in this model. Prior work has studied one-way matches between inkball models of handwriting and observed markings [6], [7]. It employs an asymmetric matching process: a flexible model adapts to a static target, and multiple model parts can match to the same target structure without penalty. The new algorithm proposed herein differs in that both sides of the comparison are inkball models, and the goal is to find something close to a one-to-one matching while also respecting the inherent geometry of the two sides. In this sort of symmetric bidirectional match, each side actively matches to the other, and the strongest bonds form where the attraction is mutual on both sides. Figure 2 illustrates the difference between the two approaches.
Fig. 3. Interpolation of a full pixel-scale alignment from keypoints. $P$ is aligned with $P'$ when the proportional distances between $E$ and $S$ and between $k_1$ and $k_2$ match those between $E'$ and $S'$ and between $k'_1$ and $k'_2$ respectively.
To formalize this intuition, suppose that $L$ and $R$ represent two inkball models to be compared, referred to respectively as the left and right model. Models may be derived automatically from images of handwriting by thinning to a single pixel width skeleton and placing inkballs (keypoints) at the endpoints, at junctions, and at regularly spaced intervals on all branches. Label the keypoints $k^L_i$ and $k^R_j$, where $1 \le i \le n_L$ and $1 \le j \le n_R$. (Throughout this paper the designations $L$ and $R$ will consistently indicate the model to which a particular entity belongs.) Each keypoint has a set of neighbors as indicated by the connectivity of the skeleton and denoted $H^L_i$ or $H^R_j$. A perfect alignment would pair each $k^L_i$ with exactly one $k^R_j$ and vice versa. The possibility that $n_L \ne n_R$ means that perfect alignment may not be attainable in practice. Furthermore, either image may contain extraneous structure not present in the other, meaning that some keypoints should not be assigned a corresponding node even if the numbers allow it. To handle this situation, we can define alignment as a directional bipartite match between the expanded sets $\{k^L_i\} \cup \{\emptyset^L\}$ and $\{k^R_j\} \cup \{\emptyset^R\}$, where pairings to $\emptyset^L$ and $\emptyset^R$ indicate null matches. We can represent the assignment using two functions $A^L(k^L_i) \to k^R_j$ and $A^R(k^R_j) \to k^L_i$.
Many assignments are possible within the framework just described, but few will respect the geometrical arrangement and connectivity of the original keypoints. Qualitatively speaking, the best alignment maps keypoints to locations that preserve their relative positions as compared to neighboring points. Suppose that the keypoint locations in the source image are $v^L_i$ and $v^R_j$ respectively, and their new configurations under a proposed alignment are given by $c^L_i$ and $c^R_j$. Simple rigid translation by a vector $w$ is one possible alignment that perfectly preserves geometry:
$$c^L_i = v^L_i + w \qquad (1)$$
More likely there will be some deformation $\delta$ between the default and aligned configurations, but for desirable alignments the magnitude should be small, particularly between neighboring keypoints.
The magnitudes of the observed deformations $\delta^L_{ii'}$ and $\delta^R_{jj'}$ will be used below (see Equation 10) to compute a deformation mismatch energy $E_\Delta$.
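As a rough illustration of the deformation idea, the sketch below measures, for each pair of neighboring keypoints, how much their relative offset changes between the rest positions and a proposed configuration. This particular definition of the per-edge deformation is an assumption made for illustration, not the paper's own formula, and the names `edge_deformations`, `v`, `c`, and `neighbors` are hypothetical.

```python
import math

def edge_deformations(v, c, neighbors):
    """Return {(i, h): magnitude} of the change in relative offset per edge.

    v, c: lists of (x, y) rest positions and configured positions.
    neighbors: dict mapping keypoint index -> list of neighbor indices.
    ASSUMPTION: deformation = (c_i - c_h) - (v_i - v_h); illustrative only.
    """
    out = {}
    for i, hs in neighbors.items():
        for h in hs:
            dx = (c[i][0] - c[h][0]) - (v[i][0] - v[h][0])
            dy = (c[i][1] - c[h][1]) - (v[i][1] - v[h][1])
            out[(i, h)] = math.hypot(dx, dy)
    return out

# Three keypoints on a horizontal line, spaced 5 pixels apart.
v = [(0.0, 0.0), (5.0, 0.0), (10.0, 0.0)]
neighbors = {0: [1], 1: [0, 2], 2: [1]}

# A rigid translation by w = (3, 4) leaves every edge undeformed.
rigid = [(x + 3.0, y + 4.0) for x, y in v]
rigid_def = edge_deformations(v, rigid, neighbors)

# Moving only the last keypoint up by 2 deforms the edge between 1 and 2.
bent = [(0.0, 0.0), (5.0, 0.0), (10.0, 2.0)]
bent_def = edge_deformations(v, bent, neighbors)
```

Under this assumed definition, a rigid translation yields zero deformation on every edge, which matches the observation that translation perfectly preserves geometry.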

A. Measurement
Given proposed configurations for each side, a precise definition of the alignment quality can now be given. Broadly speaking the chosen definition encompasses three different pieces: the proximity of configured nodes to a corresponding target on the opposite side, the consistency of the implied pairing given model connectivity, and the deformations embodied in the configuration itself. It is convenient to formulate quality as an energy to be minimized.
The first term is computed from the configuration vectors under the assumption that each keypoint maps to its closest point on the opposite side. This provides both the keypoint assignment and a set of distances to sum.
The second term captures the self-consistency of the implied alignment. An ideal alignment arranges keypoints in perfect pairs, e.g., $A^R(A^L(k^L_i)) = k^L_i$. Where a pairing is not mutual, the severity of the mismatch depends on the separation of the matched nodes, i.e., how far one must travel through the keypoint graph to find a return path, quantified via the geodesic distance functions $G^L$ and $G^R$.
The functions $T^L$ and $T^R$ measure the round trip distance from a keypoint to its associated node and back again via geodesic paths (always zero in the case of a perfect pairing). The third term in the energy measures how much the original structure must be deformed to achieve the chosen configuration.
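The round trip idea can be sketched with a breadth-first search on the keypoint graph. The example below assumes that the gap for a left keypoint is the number of graph hops between it and the keypoint that its partner maps back to; the exact way the paper combines the two geodesic distance functions is not reproduced here, and all function and variable names are hypothetical.

```python
from collections import deque

def geodesic_hops(neighbors, source):
    """Breadth-first search: hop count from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in neighbors[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist

def round_trip_gap(neighbors_left, a_left, a_right):
    """For each left keypoint i, hops between i and a_right[a_left[i]].

    a_left maps left index -> right index; a_right maps right -> left.
    Null matches (None or missing) are skipped.
    ASSUMPTION: hop count stands in for the geodesic distance.
    """
    gaps = {}
    for i, j in a_left.items():
        if j is None or a_right.get(j) is None:
            continue
        gaps[i] = geodesic_hops(neighbors_left, i)[a_right[j]]
    return gaps

# Left model: a 4-cycle, i.e. a loop rather than a tree.
left = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
# Keypoints 0 and 1 both map to right node 0 (a many-to-one match).
a_left = {0: 0, 1: 0, 2: 2, 3: 3}
a_right = {0: 0, 2: 2, 3: 3}
gaps = round_trip_gap(left, a_left, a_right)
```

Only keypoint 1, whose partner points back to a different node, receives a nonzero gap; mutually paired keypoints score zero, as the text requires for a perfect pairing.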
The above equations do not account for keypoints that may have no proper match in the opposite image. Without proper handling these may be assigned a spurious match at very high energy. To achieve more stable results, a modified energy formula limits the maximum effect of unmatched nodes. For brevity, we define $\xi^L_i = \sum_{h \in H^L_i} \delta_{ih}$, and also assume below that $\lambda_M = \lambda_A = \lambda_\Delta = 1$. Thresholds $\tau_1$ and $\tau_2$ set the standard maximum per-node energy contribution.

B. Optimization
Finding a good alignment under the conditions above is not easy, since moving one point in a configuration affects each of its neighbors. Loop connections in written symbols create cycles in the neighbor graph, introducing the potential for complicated feedback. Perfect optimization in similar systems has been shown to be NP-hard, yet good results have nevertheless been obtained in practice using message passing algorithms [10], [11]. This paper therefore seeks to develop a set of promising optimization heuristics, with the goal of achieving a high-quality final outcome despite a lack of theoretical guarantees.
The algorithm operates in rounds. Each keypoint maintains a state $\psi^L_i$ or $\psi^R_j$ representing a probability distribution for its possible location over 2D image space. (For both convenience and numeric stability, all 2D distributions are represented computationally as negative log probabilities sampled on a pixel-resolution grid.) During a round, each keypoint receives messages from its neighbors and also incorporates information from the oppositely aligned model to update its state. The positions of most keypoints usually stabilize after just a half dozen rounds or so, while a few take longer to settle.
1) Initialization: Because the optimization is heuristic rather than exact, the initialization of the keypoints influences the end result. A slight bias towards likely pairings can help the solution to converge quickly, but overcommitment at this stage can also lead to suboptimal results. The proposed initialization strategy starts with plausible relative probabilities for matching at each of the opposite model's keypoints, and expands this to a full 2D probability distribution via the following process: (1) interpolate squared log probabilities between keypoints on the handwriting skeleton; (2) use a generalized distance transform (GDT) [12] to extend to all other points by adding their squared distance from the skeleton as a penalty; (3) normalize the entire 2D probability distribution to sum to 1.
Let S(...) denote the function from a keypoint value set to 2D probability distribution just described. We use the difference in the local skeleton tangent angle between the left and right keypoints (denoted α below) to generate the keypoint value set, and thence the full probability distributions.
The second term $P^R(k^L_i)$ is a simple Gaussian potential, centered at the relative x and y percentile position of $k^L_i$ in the image. It favors solutions that match keypoints in similar positions.
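Steps (2) and (3) of the initialization can be sketched on a small grid. The example below uses a brute-force generalized distance transform rather than the efficient algorithm of [12], works directly in the negative log domain, and skips step (1) by assigning costs to two skeleton pixels by hand. The tangent-angle and Gaussian potentials are not modeled, and the function names are hypothetical.

```python
import numpy as np

def brute_force_gdt(cost, weight=1.0):
    """out[p] = min over q of cost[q] + weight * ||p - q||^2.

    O(N^2) reference version, for illustration on small grids only.
    """
    h, w = cost.shape
    ys, xs = np.mgrid[0:h, 0:w]
    out = np.full((h, w), np.inf)
    for qy, qx in np.argwhere(np.isfinite(cost)):
        cand = cost[qy, qx] + weight * ((ys - qy) ** 2 + (xs - qx) ** 2)
        out = np.minimum(out, cand)
    return out

def normalize_neglog(e):
    """Shift a negative-log map so that exp(-e) sums to 1."""
    m = e.min()
    log_z = np.log(np.exp(-(e - m)).sum()) - m
    return e + log_z

# Negative log scores on two skeleton pixels; off-skeleton cells are unset.
cost = np.full((5, 7), np.inf)
cost[2, 1] = 0.0   # likely match
cost[2, 5] = 1.0   # less likely match

spread = brute_force_gdt(cost)      # step (2): squared-distance penalty
state = normalize_neglog(spread)    # step (3): probabilities sum to 1
best = np.unravel_index(np.argmin(state), state.shape)
```

Keeping everything as negative log probabilities means that the squared-distance penalty is a simple addition and that the normalization reduces to a constant shift.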
2) Update: During a round, keypoints update sequentially in a randomly chosen order. Each keypoint incorporates the current state of its neighbors, translated by an offset taken from the original model configuration, diffused to account for flex in the model, and renormalized to sum to 1.
Here $N(\cdot)$ signifies normalization, and the remaining operator first translates its argument by the model offset and then applies the GDT. Note that the GDT here is used for tractability, in lieu of a more exact computation of the probability diffusion.
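A single neighbor update might look like the following sketch. It assumes that messages are combined by adding negative log maps (equivalent to multiplying probabilities) and that diffusion is a squared-distance GDT with a hypothetical flexibility weight `flex`; neither detail is confirmed by the text above, and all names are illustrative.

```python
import numpy as np

def shift(e, dy, dx, fill=np.inf):
    """Translate a 2D map by integer (dy, dx), padding exposed cells."""
    h, w = e.shape
    out = np.full((h, w), fill, dtype=float)
    src = e[max(0, -dy):min(h, h - dy), max(0, -dx):min(w, w - dx)]
    y0, x0 = max(0, dy), max(0, dx)
    out[y0:y0 + src.shape[0], x0:x0 + src.shape[1]] = src
    return out

def gdt(cost, weight):
    """Brute-force generalized distance transform (squared distance)."""
    h, w = cost.shape
    ys, xs = np.mgrid[0:h, 0:w]
    out = np.full((h, w), np.inf)
    for qy, qx in np.argwhere(np.isfinite(cost)):
        cand = cost[qy, qx] + weight * ((ys - qy) ** 2 + (xs - qx) ** 2)
        out = np.minimum(out, cand)
    return out

def normalize(e):
    """Shift a negative-log map so that exp(-e) sums to 1."""
    m = e.min()
    return e + (np.log(np.exp(-(e - m)).sum()) - m)

def update(own, neighbor_states, offsets, flex=0.5):
    """Add translated, diffused neighbor messages to own state (ASSUMED rule)."""
    total = own.copy()
    for state, (dy, dx) in zip(neighbor_states, offsets):
        total += gdt(shift(state, dy, dx), flex)
    return normalize(total)

# The neighbor is certain it sits at row 4, column 2; in the rest model
# this keypoint lies 3 pixels to the right of that neighbor.
neighbor = np.full((9, 9), np.inf)
neighbor[4, 2] = 0.0
own = normalize(np.zeros((9, 9)))   # uninformative starting state
state = update(own, [neighbor], [(0, 3)])
peak = np.unravel_index(np.argmin(state), state.shape)
```

In this toy case the updated state peaks at row 4, column 5, the neighbor's position plus the rest offset, with probability falling off around it according to the flexibility weight.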
At the end of the round, the state of every node is updated further using information from the opposite model. First we find the cross probabilities $p^L_{ij}$ and $p^R_{ji}$ by evaluating a keypoint's distribution at the locations of the nodes in the opposite model.
Here $N^*(\cdot)$ indicates that normalization is applied over the set of keypoints only. Thus $p^L_{ij}$ gives the probability that keypoint $k^R_j$ aligns with $k^L_i$. This information is used in two ways. First, all states are updated by a distribution derived from the partner probabilities.
Second, we can detect keypoints that have no corresponding partner on the other side.
The state of these nodes is updated further, mixing with a uniform distribution over all locations at total probability $q^L_i$.
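The two cross-model operations can be sketched as follows: sampling a keypoint's state at the opposite model's keypoint locations with normalization over those keypoints only, and blending a state with a uniform distribution. How the mixing probability is derived is not reproduced here, so `q` is simply passed in as a parameter; the function names are hypothetical.

```python
import numpy as np

def cross_probabilities(state_neglog, opp_locations):
    """Probability of each opposite keypoint, normalized over keypoints only."""
    vals = np.array([state_neglog[y, x] for y, x in opp_locations])
    p = np.exp(-(vals - vals.min()))
    return p / p.sum()

def mix_with_uniform(state_neglog, q):
    """Blend a normalized negative-log map with a uniform map of total mass q."""
    p = np.exp(-state_neglog)
    mixed = (1.0 - q) * p + q / p.size
    return -np.log(mixed)

# A 4x4 state with most of its mass on two cells.
prob = np.full((4, 4), 0.1 / 14)
prob[1, 1] = 0.7
prob[2, 3] = 0.2
state = -np.log(prob)

# Opposite-model keypoints sit at these (row, column) locations.
cross = cross_probabilities(state, [(1, 1), (2, 3), (0, 0)])

# A keypoint judged half likely to be unmatched (q is an ASSUMED input).
mixed = mix_with_uniform(state, 0.5)
```

Mixing with a uniform map flattens the state of a suspected unmatched keypoint, which limits how strongly it can pull on its neighbors in later rounds.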

3) Finalization: The message passing stages run for $R$ rounds, after which the marginal configuration can be read directly from the node states.
Depending upon the complexity of the patterns to be matched, the probability maps for some points may exhibit multiple competing modes even after iteration. In this case the marginal configuration may end up with large discontinuities between the final positions of neighboring nodes where the most likely mode undergoes a transition. To mitigate this, a globally consistent configuration can be estimated by fitting a traditional tree-structured inkball model, using the node states as the data term. This technique essentially applies an asymmetric process on top of the symmetric match results, so we refer to it as hybrid symmetric. It tends to produce configurations with smaller displacements between neighbors, but reintroduces a small possibility of inconsistencies between the matches in each direction.

III. EXPERIMENTS
Pixel-scale keypoint alignment can serve as a diagnostic and analytical tool for handwriting comparison. It also offers potential applications in signature verification and word spotting, although current implementations are too slow for practical use in the latter role. This paper aims to demonstrate the basic capabilities of the method in various areas. Exhaustive testing for any particular application is left as future work.

A. Word Spotting
Although current implementations are too slow for full-scale word spotting applications, the proposed technique may be useful as a tool for reranking images retrieved by another faster method. The George Washington dataset [13] makes a useful test case for this hypothesis because it is familiar and well studied. We examine one-shot single-word queries, without training. Each word image can form a query in leave-one-out mode and there is no need for a train/test split. Of the 4857 segmented words in the 20-page set, 4161 appear more than once and therefore can serve as useful test queries.
The experimental protocol begins with an initial ranking of all the target words produced using an asymmetric part-structured inkball model match, as described in prior work [6]. Following this, the top k words are reranked using six rounds of the symmetric two-way match described in Section II. The mean average precision over these k retrievals serves to measure the quality of the reranking. Figure 4 shows the relative performance of the proposed technique as compared to the original ranking, for various values of k. Both the fully symmetric and hybrid symmetric results greatly improve on the asymmetric result, with the fully symmetric match slightly ahead.
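The reranking protocol can be illustrated with a toy example. The sketch assumes that average precision is computed over the top k retrievals and normalized by the number of relevant items among them; the paper may normalize differently. The word identifiers and energy values are invented for illustration.

```python
def average_precision(relevance):
    """AP of a ranked list of booleans, normalized by relevant items present."""
    hits = 0
    total = 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

def rerank(candidates, energy):
    """Reorder candidates by ascending match energy (lower is better)."""
    return sorted(candidates, key=lambda word: energy[word])

# Hypothetical top-4 list from a first-pass ranking, with made-up energies
# from a second, slower matcher.
initial = ["a", "b", "c", "d"]
relevant = {"b", "d"}
energy = {"a": 3.0, "b": 1.0, "c": 4.0, "d": 2.0}

ap_before = average_precision([w in relevant for w in initial])
reranked = rerank(initial, energy)
ap_after = average_precision([w in relevant for w in reranked])
```

Averaging this quantity over all queries gives the mean average precision for a given k.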
Reranking using the proposed technique greatly improves on the earlier inkball method for word spotting. It does not compete with learning-based approaches to this problem, several versions of which achieve mean average precision in the high 90s [15], [16]. On the other hand, such methods typically rely on offline training with data from the target collection or something similar. By contrast, inkball methods work on a single example without offline training.

B. Signature Matching
Signatures offer a promising area for alignment analysis. While each instance is unique, genuine signatures come from a single writer and thus presumably share a strong underlying prototype. They are often complex relative to ordinary writing, and may include flourishes and complicated overwriting. These features mean that individual instances may differ markedly in topology, specifically in terms of which strokes cross each other and where.
Signature matching comes in two modes, termed online and offline. The former allows use of temporal sequence information about the strokes that form the writing, while the latter provides only images of a complete signature. Online data typically include the pen tip location over time, and may include pressure data as well, or at least pen up/pen down indications.
Online data present an opportunity to develop inkball models in a more realistic manner. With offline images, the stroke order at crossings is unclear, resulting in a model with junctions even though the actual pen trajectory is a 1D curve. Although some crossings are intentional, others may result incidentally from a flourish that can intersect at different points in each rendition of a signature. Online data offer a way to model different topologies by using the pen trajectory itself as the model. Choosing keypoints at regular intervals along the trajectory, and creating neighbor relations only with the previous and next points on the trajectory, we create a model that closely mimics the behavior of the actual writing. In particular, points that result from distant portions of the pen trace will not be strongly constrained to cross at particular locations.
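A chain model of this kind can be sketched directly from sampled pen positions. The example below works on raw samples without spline interpolation, and it assumes that pen-down samples are those whose pressure exceeds 10% of the observed pressure range. The function name, the `spacing` parameter, and the sample data are hypothetical.

```python
import math

def trace_keypoints(points, pressures, spacing, rel_threshold=0.1):
    """Pick keypoints every `spacing` units of arc length over pen-down samples.

    Returns (keypoints, neighbors). Neighbors link only consecutive
    keypoints within the same pen-down run, so crossings add no junctions.
    ASSUMPTION: pen-down means pressure above lo + rel_threshold * range.
    """
    lo, hi = min(pressures), max(pressures)
    cutoff = lo + rel_threshold * (hi - lo)
    keypoints, neighbors = [], {}
    prev_idx = None   # previous keypoint in the current stroke
    last_pt = None    # previous pen-down sample
    travelled = 0.0   # arc length since the previous keypoint
    for pt, pr in zip(points, pressures):
        if pr <= cutoff:              # pen up: break the chain
            prev_idx, last_pt = None, None
            continue
        if last_pt is not None:
            travelled += math.dist(last_pt, pt)
        if prev_idx is None or travelled >= spacing:
            idx = len(keypoints)
            keypoints.append(pt)
            neighbors[idx] = []
            if prev_idx is not None:
                neighbors[prev_idx].append(idx)
                neighbors[idx].append(prev_idx)
            prev_idx, travelled = idx, 0.0
        last_pt = pt
    return keypoints, neighbors

# Seven samples along a line, with one pen lift in the middle.
points = [(float(x), 0.0) for x in range(7)]
pressures = [0.5, 0.5, 0.5, 0.0, 0.5, 0.5, 0.5]
keypoints, neighbors = trace_keypoints(points, pressures, spacing=2.0)
```

Because each keypoint links only to its predecessor and successor, two strokes that cross on the page remain unconnected in the model.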
Given inkball models built from both online and offline signatures, it becomes possible to compare online versions to offline within a single framework. This offers an advantage in realistic applications: even if online signatures are not generally available, procuring a single online sample for each user can be achieved at a much lower cost. Matching an online sample to an offline version may offer many of the advantages of online algorithms, without as many constraints on practical implementation. The computational costs of online and offline models are similar.
We test these hypotheses on a portion of the GPDS synthetic online/offline signature data set [17], [18]. This set comprises 100 writers, with 24 genuine signatures and 30 skilled forgeries for each. In addition to the signature images, online trace information is available, including 2D position and pressure information at regular time intervals. To convert the latter to inkball models, cubic spline interpolation renders the pen trace as a skeleton of 8-connected pixels. Keypoints are selected at regular intervals covering the pen-down regions, defined as any position where pressure exceeds a low threshold (10% of the maximum range). Offline images of signatures are also converted to inkball models using the procedure described in Section II. Although the online and offline models have very different structures, they can still be matched against each other. Table I summarizes the results of several experiments run on the GPDS set, using the first ten genuine and ten forged signatures for each. The first genuine signature is used as the test in each case. Condition 1 does symmetric matching using a left model built from the non-branching online pen trace of the test signature, and a right model built from the offline target image. Condition 2 uses both left and right models built from the offline images. Conditions 3 and 4 repeat experiments using the models from 1 and 2 with a hybrid symmetric match. Finally, conditions 5 through 7 are control experiments based on prior work: conditions 5 and 6 perform two opposite asymmetric inkball matches using the models from conditions 1 and 2 and taking the maximum of their energies, while condition 7 uses only a one-way offline match. (Since asymmetric matches require a tree structure, any loops in the model are broken arbitrarily.) In each case, the model fitting gives an energy score that measures the quality of the match. 
The table shows the equal error rate accuracy of a threshold classifier based upon the fitting energy, with the appropriate threshold tuned for each writer. A right-tailed t-test indicates that condition 3 is significantly better than both condition 1 (p < 0.01) and conditions 5-6 (p < 0.025), but the improvement on conditions 2 and 4 is only marginal (p < 0.1). All conditions 1-6 are significantly better than 7 (p < 0.025).
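For one writer, an equal error rate for an energy threshold classifier can be estimated as in the sketch below. It assumes that a signature is accepted as genuine when its energy is at or below the threshold, and that the equal error rate is read off at the candidate threshold where the false accept and false reject rates are closest. The energy values are invented for illustration.

```python
def equal_error_rate(genuine, forged):
    """Return (eer, threshold) for accepting a signature when energy <= t.

    Sweeps every observed energy as a candidate threshold and keeps the
    one where false accept and false reject rates are closest.
    """
    best = None
    for t in sorted(set(genuine) | set(forged)):
        frr = sum(e > t for e in genuine) / len(genuine)   # genuine rejected
        far = sum(e <= t for e in forged) / len(forged)    # forgery accepted
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2.0, t)
    return best[1], best[2]

# Made-up fitting energies for one writer: low energy means a good match.
genuine = [1.0, 1.2, 1.5, 2.6]
forged = [2.0, 2.5, 3.0, 3.5]
eer, threshold = equal_error_rate(genuine, forged)
```

Tuning the threshold per writer, as in the experiments, amounts to running this sweep separately on each writer's genuine and forged energies.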
The results here appear better than published error rates on this data set [18], but no firm conclusions can be drawn because the experiments use just a subset of the available signatures. The one-way offline-to-offline asymmetric technique (condition 7) has previously been found to be competitive with state of the art methods on other datasets [19].
The GPDS is particularly challenging for the proposed technique because many signatures feature overwriting with closely spaced and nearly parallel lines, which are difficult for the algorithm to distinguish from each other. This may explain the difference in results observed between conditions 1 and 3, which both use models built from the online pen trace. Examination of the condition 1 results shows that the matched position occasionally jumps between parallel lines. The hybrid symmetric result in condition 3 suppresses such artifacts and therefore can produce a better result. Note that although these two conditions start with online pen traces, all models are purely geometric and do not make use of velocity information.

C. Analytical Tool
Properties of the keypoint match can be measured and used to draw useful analytical conclusions. One example is keypoint satisfaction, defined as the extent to which a keypoint's match is mutually returned.

IV. CONCLUSION
This paper has proposed to perform pixel-scale alignment via a symmetric keypoint matching algorithm, using message passing on a structural graph. Overall, the results look promising, even if the method does present some limitations. It currently runs relatively slowly due to the many image translations that must be computed during the update stage (Equation 13). This computation can be parallelized, and in the future a GPU implementation of the alignment algorithm should perform significantly faster and allow much more thorough experimentation.
Pixel-scale alignment of handwriting images deserves further study on its own merits, besides its use as a step toward accomplishing other tasks. Beyond the methods presented herein, it would be worthwhile to explore alternate techniques using other approaches, perhaps based upon deep learning or other trained methods. Reliable tools for such alignment can provide insight on handwriting style and its variations, and may yet lead to hitherto unforeseen applications.