Reservoir Sampling

Reservoir sampling is a great random sampling algorithm that every data engineer should know. It extracts a random sample of a specified size from a large, unbounded dataset – one you cannot pull into memory all at once, and whose final size you don't know while taking the sample.

Over on Wikipedia you can check out a nice explanation of the idea behind reservoir sampling. It presents the simplest version of the algorithm (known as Algorithm R). Let me regurgitate it in C#:

public static T[] TakeSample<T>(IEnumerable<T> input, int sampleCount) {
	if (sampleCount <= 0)
		throw new ArgumentOutOfRangeException ("sampleCount");

	Random rand = new Random ();
	var samples = new T[sampleCount];
	int i = 0;
	foreach (T item in input) {
		if (i < sampleCount) {
			samples [i] = item;
		} else {
			int j = rand.Next (i + 1); // pick uniformly from [0, i]
			if (j < sampleCount)
				samples [j] = item;
		}
		i++;
	}

	if (i < sampleCount)
		throw new InvalidOperationException ("Input data does not have enough items to sample");

	return samples;
}

To some this algorithm may seem botched, as the later items have the least likelihood of being selected at the moment they are encountered. However, every item ends up with exactly the same overall probability of appearing in the final sample: k/n, where k is the sample count and n is the total number of items seen. The items that are more likely to be selected early on (or selected with absolute certainty, for the first k items) also face more opportunities to be replaced by a successor, and the two effects cancel out.

Jeffrey Scott Vitter has published some alternative implementations that optimise the performance of the sampling.

Distributed Reservoir Sampling

When you are dealing with BIG data and have distributed infrastructure at your disposal, you're not going to stream the entire dataset into a single sequence. There is a crafty map-reduce approach to the sampling problem on Stack Overflow: http://stackoverflow.com/questions/2514061/how-to-pick-random-small-data-samples-using-map-reduce. I really like this approach, as it combines fundamental computer science know-how with big-data technology to get the job done!
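
Here is a minimal sketch of that idea (the method names and shapes are my own, and it assumes LINQ via System.Linq): tag every item with a uniform random key, have each mapper keep only the sampleCount items with the smallest keys, then have the reducer merge the candidates and again keep the smallest. Since the k globally smallest random keys identify a uniformly random subset, the result is an unbiased sample of size k.

public static List<KeyValuePair<double, T>> MapSample<T>(IEnumerable<T> partition, int sampleCount, Random rand) {
	// Map step: run independently on each partition/split.
	return partition
		.Select (item => new KeyValuePair<double, T> (rand.NextDouble (), item))
		.OrderBy (kv => kv.Key)
		.Take (sampleCount)
		.ToList ();
}

public static T[] ReduceSample<T>(IEnumerable<List<KeyValuePair<double, T>>> mapped, int sampleCount) {
	// Reduce step: merge the per-partition candidates and keep the k smallest keys.
	return mapped
		.SelectMany (candidates => candidates)
		.OrderBy (kv => kv.Key)
		.Take (sampleCount)
		.Select (kv => kv.Value)
		.ToArray ();
}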

Augmented Interval Tree in C#

An interval tree is an efficient ADT for storing and searching intervals. It probably doesn't make the top 10 most commonly used collections in computer science. I often see code where a list/array-like structure is used to store and search interval data. Sometimes that is fine – even preferred for the sake of simplicity – but only in cases where the code is rarely run and does not have to handle large volumes of data.

My C# implementation of the IntervalTree has the following features:

  • It uses generics.
  • It is mutable.
  • It is backed by a self balancing BST (AVL).
  • It is XML and binary serializable.
  • It supports duplicate intervals.

Originally I wrote the tree so that the end point selector was simply a lambda. However lambdas are not serializable, so I changed the selector to be an interface (and implemented my own XML serialization methods to handle the interface).

If your interval collections can afford to be immutable, then a better-performing solution is to use a centered interval tree instead. Although the centered interval tree has the same average complexity as the augmented BST, its construction is generally faster.

Here is the code:

using System;
using System.Collections;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;
using System.Runtime.Serialization;
using System.Xml;
using System.Xml.Schema;
using System.Xml.Serialization;

namespace BrookNovak.Collections
{
	/// <summary>
	/// An interval tree that supports duplicate entries.
	/// </summary>
	/// <typeparam name="TInterval">The interval type</typeparam>
	/// <typeparam name="TPoint">The interval's start and end type</typeparam>
	/// <remarks>
	/// This interval tree is implemented as a balanced augmented AVL tree.
	/// Modifications are O(log n) typical case.
	/// Searches are O(log n) typical case.
	/// </remarks>
	[Serializable]
	public class IntervalTree<TInterval, TPoint> : ICollection<TInterval>, ICollection, ISerializable, IXmlSerializable 
		where TPoint : IComparable<TPoint>
	{
		private readonly object syncRoot;
		private IntervalNode root;
		private ulong modifications;
		private IIntervalSelector<TInterval, TPoint> intervalSelector;

		/// <summary>
		/// Default ctor required for XML serialization support
		/// </summary>
		private IntervalTree()
		{
			syncRoot = new object();
		}

		public IntervalTree(IEnumerable<TInterval> intervals, IIntervalSelector<TInterval, TPoint> intervalSelector) :
			this(intervalSelector)
		{
			AddRange(intervals);
		}

		public IntervalTree(IIntervalSelector<TInterval, TPoint> intervalSelector)
			: this()
		{
			if (intervalSelector == null)
				throw new ArgumentNullException("intervalSelector");
			this.intervalSelector = intervalSelector;
		}

		/// <summary>
		/// Returns the maximum end point in the entire collection.
		/// </summary>
		public TPoint MaxEndPoint
		{
			get
			{
				if(root == null)
					throw new InvalidOperationException("Cannot determine max end point for empty interval tree");
				return root.MaxEndPoint;
			}
		}

		#region Binary Serialization

		public IntervalTree(SerializationInfo info, StreamingContext context)
			: this()
		{
			// Reset the property value using the GetValue method.
			var intervals = (TInterval[])info.GetValue("intervals", typeof(TInterval[]));
			intervalSelector = (IIntervalSelector<TInterval, TPoint>)info.GetValue("selector", typeof(IIntervalSelector<TInterval, TPoint>));
			AddRange(intervals);
		}

		public void GetObjectData(SerializationInfo info, StreamingContext context)
		{
			var intervals = new TInterval[Count];
			CopyTo(intervals, 0);
			info.AddValue("intervals", intervals, typeof(TInterval[]));
			info.AddValue("selector", intervalSelector, typeof(IIntervalSelector<TInterval, TPoint>));
		}

		#endregion

		#region IXmlSerializable

		public void WriteXml(XmlWriter writer)
		{
			writer.WriteStartElement("Intervals");
			writer.WriteAttributeString("Count", Count.ToString(CultureInfo.InvariantCulture));
			var itemSerializer = new XmlSerializer(typeof(TInterval));
			foreach (var item in this)
			{
				itemSerializer.Serialize(writer, item);
			}
			writer.WriteEndElement();

			writer.WriteStartElement("Selector");
			var typeName = intervalSelector.GetType().AssemblyQualifiedName ?? intervalSelector.GetType().FullName;
			writer.WriteAttributeString("Type", typeName);
			var selectorSerializer = new XmlSerializer(intervalSelector.GetType());
			selectorSerializer.Serialize(writer, intervalSelector);
			writer.WriteEndElement();
		}

		public void ReadXml(XmlReader reader)
		{
			reader.MoveToContent();
			reader.ReadStartElement();

			reader.MoveToAttribute("Count");
			int count = int.Parse(reader.Value);
			reader.MoveToElement();

			if (count > 0 && reader.IsEmptyElement)
				throw new FormatException("Missing tree items");
			if (count == 0 && !reader.IsEmptyElement)
				throw new FormatException("Unexpected content in tree item Xml (expected empty content)");

			reader.ReadStartElement("Intervals");

			var items = new TInterval[count];

			if (count > 0)
			{
				var itemSerializer = new XmlSerializer(typeof(TInterval));

				for (int i = 0; i < count; i++)
				{
					items[i] = (TInterval)itemSerializer.Deserialize(reader);

				}
				reader.ReadEndElement(); // </intervals>
			}

			reader.MoveToAttribute("Type");
			string selectorTypeFullName = reader.Value;
			if (string.IsNullOrEmpty(selectorTypeFullName))
				throw new FormatException("Selector type name missing");
			reader.MoveToElement();

			reader.ReadStartElement("Selector");

			var selectorType = Type.GetType(selectorTypeFullName);
			if(selectorType == null)
				throw new XmlException(string.Format("Selector type {0} missing from loaded assemblies", selectorTypeFullName));
			var selectorSerializer = new XmlSerializer(selectorType);
			intervalSelector = (IIntervalSelector<TInterval, TPoint>)selectorSerializer.Deserialize(reader);

			reader.ReadEndElement(); // </selector>

			AddRange(items);
		}

		public XmlSchema GetSchema()
		{
			return null;
		}

		#endregion

		#region IEnumerable, IEnumerable<T>

		IEnumerator IEnumerable.GetEnumerator()
		{
			return new IntervalTreeEnumerator(this);
		}

		public IEnumerator<TInterval> GetEnumerator()
		{
			return new IntervalTreeEnumerator(this);
		}

		#endregion

		#region ICollection

		public bool IsSynchronized { get { return false; } }

		public Object SyncRoot { get { return syncRoot; } }

		public void CopyTo(
			Array array,
			int arrayIndex)
		{
			if (array == null)
				throw new ArgumentNullException("array");
			PerformCopy(arrayIndex, array.Length, (i, v) => array.SetValue(v, i));
		}

		#endregion

		#region ICollection<T>

		public int Count { get; private set; }

		public bool IsReadOnly
		{
			get
			{
				return false;
			}
		}

		public void CopyTo(
			TInterval[] array,
			int arrayIndex)
		{
			if (array == null)
				throw new ArgumentNullException("array");
			PerformCopy(arrayIndex, array.Length, (i, v) => array[i] = v);
		}

		/// <summary>
		/// Tests if an item is contained in the tree.
		/// </summary>
		/// <param name="item">The item to check</param>
		/// <returns>
		/// True iff the item exists in the collection. 
		/// </returns>
		/// <remarks>
		/// This method uses the collection’s objects’ Equals and CompareTo methods on item to determine whether item exists.
		/// </remarks>
		public bool Contains(TInterval item)
		{
			if (ReferenceEquals(item, null))
				throw new ArgumentNullException("item");

			return FindMatchingNodes(item).Any();
		}

		public void Clear()
		{
			SetRoot(null);
			Count = 0;
			modifications++;
		}

		public void Add(TInterval item)
		{
			if (ReferenceEquals(item, null))
				throw new ArgumentNullException("item");

			var newNode = new IntervalNode(item, Start(item), End(item));

			if (root == null)
			{
				SetRoot(newNode);
				Count = 1;
				modifications++;
				return;
			}

			IntervalNode node = root;
			while (true)
			{
				var startCmp = newNode.Start.CompareTo(node.Start);
				if (startCmp <= 0)
				{
					if (startCmp == 0 && ReferenceEquals(node.Data, newNode.Data))
						throw new InvalidOperationException("Cannot add the same item twice (object reference already exists in tree)");

					if (node.Left == null)
					{
						node.Left = newNode;
						break;
					}
					node = node.Left;
				}
				else
				{
					if (node.Right == null)
					{
						node.Right = newNode;
						break;
					}
					node = node.Right;
				}
			}

			modifications++;
			Count++;

			// Restructure tree to be balanced
			node = newNode;
			while (node != null)
			{
				node.UpdateHeight();
				node.UpdateMaxEndPoint();
				Rebalance(node);
				node = node.Parent;
			}
		}

		/// <summary>
		/// Removes an item.
		/// </summary>
		/// <param name="item">The item to remove</param>
		/// <returns>True if an item was removed</returns>
		/// <remarks>
		/// This method uses the collection’s objects’ Equals and CompareTo methods on item to retrieve the existing item.
		/// If there are duplicates of the item, then object reference is used to remove.
		/// If <see cref="TInterval"/> is not a reference type, then the first found equal interval will be removed.
		/// </remarks>
		public bool Remove(TInterval item)
		{
			if (ReferenceEquals(item, null))
				throw new ArgumentNullException("item");

			if (root == null)
				return false;

			var candidates = FindMatchingNodes(item).ToList();

			if (candidates.Count == 0)
				return false;

			IntervalNode toBeRemoved;
			if (candidates.Count == 1)
			{
				toBeRemoved = candidates[0];
			}
			else
			{
				toBeRemoved = candidates.SingleOrDefault(x => ReferenceEquals(x.Data, item)) ?? candidates[0];
			}

			var parent = toBeRemoved.Parent;
			var isLeftChild = toBeRemoved.IsLeftChild;

			if (toBeRemoved.Left == null && toBeRemoved.Right == null)
			{
				if (parent != null)
				{
					if (isLeftChild)
						parent.Left = null;
					else
						parent.Right = null;

					Rebalance(parent);
				}
				else
				{
					SetRoot(null);
				}
			}
			else if (toBeRemoved.Right == null)
			{
				if (parent != null)
				{
					if (isLeftChild)
						parent.Left = toBeRemoved.Left;
					else
						parent.Right = toBeRemoved.Left;

					Rebalance(parent);
				}
				else
				{
					SetRoot(toBeRemoved.Left);
				}
			}
			else if (toBeRemoved.Left == null)
			{
				if (parent != null)
				{
					if (isLeftChild)
						parent.Left = toBeRemoved.Right;
					else
						parent.Right = toBeRemoved.Right;

					Rebalance(parent);
				}
				else
				{
					SetRoot(toBeRemoved.Right);
				}
			}
			else
			{
				IntervalNode replacement, replacementParent, temp;

				if (toBeRemoved.Balance > 0)
				{
					if (toBeRemoved.Left.Right == null)
					{
						replacement = toBeRemoved.Left;
						replacement.Right = toBeRemoved.Right;
						temp = replacement;
					}
					else
					{
						replacement = toBeRemoved.Left.Right;
						while (replacement.Right != null)
						{
							replacement = replacement.Right;
						}
						replacementParent = replacement.Parent;
						replacementParent.Right = replacement.Left;

						temp = replacementParent;

						replacement.Left = toBeRemoved.Left;
						replacement.Right = toBeRemoved.Right;
					}
				}
				else
				{
					if (toBeRemoved.Right.Left == null)
					{
						replacement = toBeRemoved.Right;
						replacement.Left = toBeRemoved.Left;
						temp = replacement;
					}
					else
					{
						replacement = toBeRemoved.Right.Left;
						while (replacement.Left != null)
						{
							replacement = replacement.Left;
						}
						replacementParent = replacement.Parent;
						replacementParent.Left = replacement.Right;

						temp = replacementParent;

						replacement.Left = toBeRemoved.Left;
						replacement.Right = toBeRemoved.Right;
					}
				}

				if (parent != null)
				{
					if (isLeftChild)
						parent.Left = replacement;
					else
						parent.Right = replacement;
				}
				else
				{
					SetRoot(replacement);
				}

				Rebalance(temp);
			}

			toBeRemoved.Parent = null;
			Count--;
			modifications++;
			return true;
		}

		#endregion

		#region Public methods

		public void AddRange(IEnumerable<TInterval> intervals)
		{
			if (intervals == null)
				throw new ArgumentNullException("intervals");
			foreach (var interval in intervals)
			{
				Add(interval);
			}
		}
		
		public TInterval[] this[TPoint point]
		{
			get
			{
				return FindAt(point);
			}
		}

		public TInterval[] FindAt(TPoint point)
		{
			if (ReferenceEquals(point, null))
				throw new ArgumentNullException("point");

			var found = new List<IntervalNode>();
			PerformStabbingQuery(root, point, found);
			return found.Select(node => node.Data).ToArray();
		}

		public bool ContainsPoint(TPoint point)
		{
			return FindAt(point).Any();
		}

		public bool ContainsOverlappingInterval(TInterval item)
		{
			if (ReferenceEquals(item, null))
				throw new ArgumentNullException("item");

			return PerformStabbingQuery(root, item).Count > 0;
		}

		public TInterval[] FindOverlapping(TInterval item)
		{
			if (ReferenceEquals(item, null))
				throw new ArgumentNullException("item");

			return PerformStabbingQuery(root, item).Select(node => node.Data).ToArray();
		}

		#endregion

		#region Private methods

		private void PerformCopy(int arrayIndex, int arrayLength, Action<int, TInterval> setAtIndexDelegate)
		{
			if (arrayIndex < 0)
				throw new ArgumentOutOfRangeException("arrayIndex");
			int i = arrayIndex;
			IEnumerator<TInterval> enumerator = GetEnumerator();
			while (enumerator.MoveNext())
			{
				if (i >= arrayLength)
					throw new ArgumentOutOfRangeException("arrayIndex", "Not enough elements in array to copy content into");
				setAtIndexDelegate(i, enumerator.Current);
				i++;
			}
		}

		private IEnumerable<IntervalNode> FindMatchingNodes(TInterval interval)
		{
			return PerformStabbingQuery(root, interval).Where(node => node.Data.Equals(interval));
		}

		private void SetRoot(IntervalNode node)
		{
			root = node;
			if (root != null)
				root.Parent = null;
		}

		private TPoint Start(TInterval interval)
		{
			return intervalSelector.GetStart(interval);
		}

		private TPoint End(TInterval interval)
		{
			return intervalSelector.GetEnd(interval);
		}

		private bool DoesIntervalContain(TInterval interval, TPoint point)
		{
			return point.CompareTo(Start(interval)) >= 0
				&& point.CompareTo(End(interval)) <= 0;
		}

		private bool DoIntervalsOverlap(TInterval interval, TInterval other)
		{
			return Start(interval).CompareTo(End(other)) <= 0 &&
				End(interval).CompareTo(Start(other)) >= 0;
		}

		private void PerformStabbingQuery(IntervalNode node, TPoint point, List<IntervalNode> result)
		{
			if (node == null)
				return;

			if (point.CompareTo(node.MaxEndPoint) > 0)
				return;

			if (node.Left != null)
				PerformStabbingQuery(node.Left, point, result);

			if (DoesIntervalContain(node.Data, point))
				result.Add(node);

			if (point.CompareTo(node.Start) < 0)
				return;

			if (node.Right != null)
				PerformStabbingQuery(node.Right, point, result);
		}

		private List<IntervalNode> PerformStabbingQuery(IntervalNode node, TInterval interval)
		{
			var result = new List<IntervalNode>();
			PerformStabbingQuery(node, interval, result);
			return result;
		}

		private void PerformStabbingQuery(IntervalNode node, TInterval interval, List<IntervalNode> result)
		{
			if (node == null)
				return;

			if (Start(interval).CompareTo(node.MaxEndPoint) > 0)
				return;

			if (node.Left != null)
				PerformStabbingQuery(node.Left, interval, result);

			if (DoIntervalsOverlap(node.Data, interval))
				result.Add(node);

			if (End(interval).CompareTo(node.Start) < 0)
				return;

			if (node.Right != null)
				PerformStabbingQuery(node.Right, interval, result);
		}

		private void Rebalance(IntervalNode node)
		{
			if (node.Balance > 1)
			{
				if (node.Left.Balance < 0)
					RotateLeft(node.Left);
				RotateRight(node);
			}
			else if (node.Balance < -1)
			{
				if (node.Right.Balance > 0)
					RotateRight(node.Right);
				RotateLeft(node);
			}
		}

		private void RotateLeft(IntervalNode node)
		{
			var parent = node.Parent;
			var isNodeLeftChild = node.IsLeftChild;

			// Make node.Right the new root of this sub tree (instead of node)
			var pivotNode = node.Right;
			node.Right = pivotNode.Left;
			pivotNode.Left = node;

			if (parent != null)
			{
				if (isNodeLeftChild)
					parent.Left = pivotNode;
				else
					parent.Right = pivotNode;
			}
			else
			{
				SetRoot(pivotNode);
			}
		}

		private void RotateRight(IntervalNode node)
		{
			var parent = node.Parent;
			var isNodeLeftChild = node.IsLeftChild;

			// Make node.Left the new root of this sub tree (instead of node)
			var pivotNode = node.Left;
			node.Left = pivotNode.Right;
			pivotNode.Right = node;

			if (parent != null)
			{
				if (isNodeLeftChild)
					parent.Left = pivotNode;
				else
					parent.Right = pivotNode;
			}
			else
			{
				SetRoot(pivotNode);
			}
		}

		#endregion

		#region Inner classes

		[Serializable]
		private class IntervalNode
		{
			private IntervalNode left;
			private IntervalNode right;
			public IntervalNode Parent { get; set; }
			public TPoint Start { get; private set; }
			private TPoint End { get; set; }
			public TInterval Data { get; private set; }
			private int Height { get; set; }
			public TPoint MaxEndPoint { get; private set; }

			public IntervalNode(TInterval data, TPoint start, TPoint end)
			{
				if(start.CompareTo(end) > 0)
					throw new ArgumentOutOfRangeException("end", "The supplied interval has an invalid range, where start is greater than end");
				Data = data;
				Start = start;
				End = end;
				UpdateMaxEndPoint();
			}

			public IntervalNode Left
			{
				get
				{
					return left;
				}
				set
				{
					left = value;
					if (left != null)
						left.Parent = this;
					UpdateHeight();
					UpdateMaxEndPoint();
				}
			}

			public IntervalNode Right
			{
				get
				{
					return right;
				}
				set
				{
					right = value;
					if (right != null)
						right.Parent = this;
					UpdateHeight();
					UpdateMaxEndPoint();
				}
			}

			public int Balance
			{
				get
				{
					if (Left != null && Right != null)
						return Left.Height - Right.Height;
					if (Left != null)
						return Left.Height + 1;
					if (Right != null)
						return -(Right.Height + 1);
					return 0;
				}
			}

			public bool IsLeftChild
			{
				get
				{
					return Parent != null && Parent.Left == this;
				}
			}

			public void UpdateHeight()
			{
				if (Left != null && Right != null)
					Height = Math.Max(Left.Height, Right.Height) + 1;
				else if (Left != null)
					Height = Left.Height + 1;
				else if (Right != null)
					Height = Right.Height + 1;
				else
					Height = 0;
			}

			private static TPoint Max(TPoint comp1, TPoint comp2)
			{
				if (comp1.CompareTo(comp2) > 0)
					return comp1;
				return comp2;
			}

			public void UpdateMaxEndPoint()
			{
				TPoint max = End;
				if (Left != null)
					max = Max(max, Left.MaxEndPoint);
				if (Right != null)
					max = Max(max, Right.MaxEndPoint);
				MaxEndPoint = max;
			}

			public override string ToString()
			{
				return string.Format("[{0},{1}], maxEnd={2}", Start, End, MaxEndPoint);
			}
		}

		private class IntervalTreeEnumerator : IEnumerator<TInterval>
		{
			private readonly ulong modificationsAtCreation;
			private readonly IntervalTree<TInterval, TPoint> tree;
			private readonly IntervalNode startNode;
			private IntervalNode current;
			private bool hasVisitedCurrent;
			private bool hasVisitedRight;

			public IntervalTreeEnumerator(IntervalTree<TInterval, TPoint> tree)
			{
				this.tree = tree;
				modificationsAtCreation = tree.modifications;
				startNode = GetLeftMostDescendantOrSelf(tree.root);
				Reset();
			}

			public TInterval Current
			{
				get
				{
					if (current == null)
						throw new InvalidOperationException("Enumeration has finished.");

					if (ReferenceEquals(current, startNode) && !hasVisitedCurrent)
						throw new InvalidOperationException("Enumeration has not started.");

					return current.Data;
				}
			}

			object IEnumerator.Current
			{
				get
				{
					return Current;
				}
			}

			public void Reset()
			{
				if (modificationsAtCreation != tree.modifications)
					throw new InvalidOperationException("Collection was modified.");
				current = startNode;
				hasVisitedCurrent = false;
				hasVisitedRight = false;
			}

			public bool MoveNext()
			{
				if (modificationsAtCreation != tree.modifications)
					throw new InvalidOperationException("Collection was modified.");

				if (tree.root == null)
					return false;

				// Visit this node
				if (!hasVisitedCurrent)
				{
					hasVisitedCurrent = true;
					return true;
				}

				// Go right, visit the right's left most descendant (or the right node itself)
				if (!hasVisitedRight && current.Right != null)
				{
					current = current.Right;
					MoveToLeftMostDescendant();
					hasVisitedCurrent = true;
					hasVisitedRight = false;
					return true;
				}

				// Move upward
				do
				{
					var wasVisitingFromLeft = current.IsLeftChild;
					current = current.Parent;
					if (wasVisitingFromLeft)
					{
						hasVisitedCurrent = false;
						hasVisitedRight = false;
						return MoveNext();
					}
				} while (current != null);

				return false;
			}

			private void MoveToLeftMostDescendant()
			{
				current = GetLeftMostDescendantOrSelf(current);
			}

			private IntervalNode GetLeftMostDescendantOrSelf(IntervalNode node)
			{
				if (node == null)
					return null;
				while (node.Left != null)
				{
					node = node.Left;
				}
				return node;
			}

			public void Dispose()
			{
			}
		}

		#endregion
	}

	/// <summary>
	/// Selects interval start and end points for an object of type <see cref="TInterval"/>.
	/// </summary>
	/// <typeparam name="TInterval">The type containing interval data</typeparam>
	/// <typeparam name="TPoint">The type of the interval start and end points</typeparam>
	/// <remarks>
	/// In order for the collection using these selectors to be XML serializable, your implementations of this interface must also be
	/// XML serializable (e.g. don't use delegates, and provide a default constructor).
	/// </remarks>
	public interface IIntervalSelector<in TInterval, out TPoint> where TPoint : IComparable<TPoint>
	{
		TPoint GetStart(TInterval item);
		TPoint GetEnd(TInterval item);
	}
}

Here is an example unit test:

[TestFixture]
public class ExampleTestFixture {
	
	[Test]
	public void FindAt_Overlapping ()
	{
		// ARRANGE
		var int1 = new TestInterval (20, 60);
		var int2 = new TestInterval (10, 50);
		var int3 = new TestInterval (40, 70);

		var intervalColl = new[] { 
			int1, int2, int3
		};
		var tree = new IntervalTree<TestInterval, int>(intervalColl, new TestIntervalSelector());

		// ACT
		var res1 = tree.FindAt (0);
		var res2 = tree.FindAt (10);
		var res3 = tree.FindAt (15);
		var res4 = tree.FindAt (20);
		var res5 = tree.FindAt (30);
		var res6 = tree.FindAt (40);
		var res7 = tree.FindAt (45);
		var res8 = tree.FindAt (50);
		var res9 = tree.FindAt (55);
		var res10 = tree.FindAt (60);
		var res11 = tree.FindAt (65);
		var res12 = tree.FindAt (70);
		var res13 = tree.FindAt (75);

		// ASSERT
		Assert.That (res1, Is.Empty);
		Assert.That (res2, Is.EquivalentTo(new[] { int2 }));
		Assert.That (res3, Is.EquivalentTo(new[] { int2 }));
		Assert.That (res4, Is.EquivalentTo(new[] { int1, int2 }));
		Assert.That (res5, Is.EquivalentTo(new[] { int1, int2 }));
		Assert.That (res6, Is.EquivalentTo(new[] { int1, int2, int3 }));
		Assert.That (res7, Is.EquivalentTo(new[] { int1, int2, int3 }));
		Assert.That (res8, Is.EquivalentTo(new[] { int1, int2, int3 }));
		Assert.That (res9, Is.EquivalentTo(new[] { int1, int3 }));
		Assert.That (res10, Is.EquivalentTo(new[] { int1, int3 }));
		Assert.That (res11, Is.EquivalentTo(new[] { int3 }));
		Assert.That (res12, Is.EquivalentTo(new[] { int3 }));
		Assert.That (res13, Is.Empty);
	}
}

[Serializable]
public class TestInterval 
{
	private TestInterval() {}

	public TestInterval(int low, int hi) 
	{
		if(low > hi)
		throw new ArgumentOutOfRangeException("hi", "low cannot be greater than hi");
		Low = low;
		Hi = hi;
	}

	public int Low { get; private set; }
	public int Hi { get; private set; }
	public string MutableData { get; set; }

	public override string ToString ()
	{
		return string.Format ("[Low={0}, Hi={1}, Data={2}]", Low, Hi, MutableData);
	}
}

[Serializable]
public class TestIntervalSelector : IIntervalSelector<TestInterval, int>
{
	public int GetStart (TestInterval item) 
	{
		return item.Low;
	}

	public int GetEnd (TestInterval item) 
	{
		return item.Hi;
	}
}
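
Here's a quick sketch illustrating the binary serialization support, using the test types above (BinaryFormatter drives the ISerializable constructor; assumes usings for System.IO, System.Runtime.Serialization.Formatters.Binary and BrookNovak.Collections):

var tree = new IntervalTree<TestInterval, int> (
	new[] { new TestInterval (1, 5), new TestInterval (3, 9) },
	new TestIntervalSelector ());

var formatter = new BinaryFormatter ();
using (var stream = new MemoryStream ()) {
	formatter.Serialize (stream, tree);
	stream.Position = 0;
	var copy = (IntervalTree<TestInterval, int>)formatter.Deserialize (stream);
	Console.WriteLine (copy.Count);       // 2
	Console.WriteLine (copy.MaxEndPoint); // 9
}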

Note: I have rigorously unit tested this.

PS: Sorry, I probably should have uploaded the code + tests to an online source repo (like GitHub)… but I'm too lazy.

Programming Praxis – Finding Digit Strings In Powers Of Two

Today's boredom led me to solving another Programming Praxis problem:

Search every power of two below 2^10000 and return the index of the first power of two in which a target string appears. For instance, if the target is 42, the correct answer is 19 because 2^19 = 524288, in which the target 42 appears as the third and fourth digits, and no smaller power of two contains the string 42.

The naive solution is simple: keep doubling a number (starting at 1) until you find a digit sequence that contains the target. Of course, once you get past 2^64, storing the number in a primitive type will not suffice, so you need a way to represent larger numbers.
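
For reference, here is a minimal sketch of that naive approach (the naming is my own), using System.Numerics.BigInteger for the arbitrary-precision arithmetic:

public static int FindMinExponentNaive(string target) {
	System.Numerics.BigInteger n = 1;
	for (int exponent = 0; exponent <= 10000; exponent++) {
		if (n.ToString ().Contains (target))
			return exponent;
		n *= 2; // Move on to the next power of two
	}
	throw new ArgumentOutOfRangeException ("target");
}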

I improved on the naive approach by caching the digit sequences for each exponent. For the cache I used a type of suffix tree:

private class SuffixTreeNode {
	public short MinExponent;
	public SuffixTreeNode[] Children;
}

Unlike your traditional suffix tree, this one does not compress character sequences. It's really just a trie, where the digit at each position in a sequence is stored implicitly as its index into Children (i.e. these arrays are of length 10). However the data structure is populated and accessed like a suffix tree: every suffix of a number's digit sequence is inserted into the tree. Each node in the tree is annotated with the smallest exponent in whose digits the sequence can be found.

Here is the full code:

public static class DigitStringPowerTwoSearch
{
	private static SuffixTreeNode suffixRoot;
	private static short currentExponent;
	private static List<int> currentDigits;

	static DigitStringPowerTwoSearch () {
		FlushCache ();
	}

	internal static void FlushCache() {
		// Exposed as internal really for unit tests only
		currentExponent = 0;
		currentDigits = new List<int> { 1 };
		suffixRoot = new SuffixTreeNode ();
		AddDigitsToTree (currentDigits, 0, 0);
	}

	public static int FindMinBaseTwoExponent(ulong target) {
		var targetDigits = ToDigitArray(target);
		while (currentExponent <= 10000) {
			short exponent = FindMinExponentInTree(targetDigits);
			if (exponent >= 0)
				return exponent;
			AddNextExponentToTree();
		}
		throw new ArgumentOutOfRangeException ("target", 
		                                       "target's digits do not exist in a base two number with exponent " +
		                                       "less or equal to 10,000");
	}

	private static void AddNextExponentToTree() {
		DoubleDigits (currentDigits);
		currentExponent++;

		for (int i = 0; i < currentDigits.Count; i++) {
			AddDigitsToTree (currentDigits, i, currentExponent);
		}
	}

	private static void AddDigitsToTree(IList<int> digits, int startIndex, short exponent) {
		SuffixTreeNode current = suffixRoot;
		for (int i = startIndex; i < digits.Count; i++) {
			int digit = digits [i];
			if (current.Children == null) {
				current.Children = new SuffixTreeNode[10];
			}
			if (current.Children [digit] == null) {
				var newNode = new SuffixTreeNode { MinExponent = exponent };
				current.Children [digit] = newNode;
				current = newNode;
			} else {
				current = current.Children [digit];
				// Here we assume exponent is always the largest exponent,
				// so no need to check/update current.MinExponent
			}
		}
	}

	private static short FindMinExponentInTree(int[] targetDigits) {
		SuffixTreeNode current = suffixRoot;
		foreach(var digit in targetDigits) {
			if (current == null || current.Children == null)
				return -1;
			current = current.Children[digit];
		}
		if (current == null)
			return -1;
		return current.MinExponent;
	}

	private static int[] ToDigitArray(ulong n) {
		if (n == 0) 
			return new int[] { 0 };

		var digits = new List<int>();

		for (; n != 0; n /= 10)
			digits.Add((int)(n % 10));

		var arr = digits.ToArray();
		Array.Reverse(arr);
		return arr;
	}

	private static void DoubleDigits(List<int> digits) {
		bool carry = false;
		for (int i = digits.Count - 1; i >= 0; i--) {
			int d = digits [i] * 2;
			if (carry)
				d++;
			if (d >= 10) {
				d -= 10;
				carry = true;
			} else {
				carry = false;
			}
			digits [i] = d;
		}
		if (carry)
			digits.Insert (0, 1);
	}

	private class SuffixTreeNode {
		public short MinExponent;
		public SuffixTreeNode[] Children;
	}
}

When the function is invoked, if the digit sequence has been seen before then we get O(N) performance (where N = the number of digits in the target). Effectively the cache behaves like a hash-trie.
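
For example, using the class above with the target from the problem statement:

int exponent = DigitStringPowerTwoSearch.FindMinBaseTwoExponent (42);
Console.WriteLine (exponent); // Prints 19, since 2^19 = 524288 contains "42"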

Non-blocking Writer Collection (C# Example)

A question was posed to me recently:

If you had a thread that produced messages, and pushed those messages to one or more consumer threads: how would you write the code to ensure that the producer thread executes as fast as possible? (I.e. no blocking on the producer thread.)

An interesting problem to solve! Before looking at the suggested solution below, try and have a go at coming up with your own solution.

Let me present one way to go about this problem. The general idea is to use two message queues: a write queue, and a read queue. The writer (producer) thread always writes to the write queue. The reader thread always reads from the read queue. When the reader thread exhausts the read queue, the reader switches the queues: the write queue becomes the read queue, and vice versa. If the reader finds that the new reader queue (after the switch) is also empty, then it waits until the writer thread writes to the write queue. The reader thread then switches the queues again, and reads from the new read queue to extract the most recent item.

Note: I've deliberately chosen the name "Collection" for this ADT, since the implementation does not guarantee ordering. It is sort of FIFO, but not strictly (otherwise I would have called it a NonBlockingWriterQueue!). Here is a C# example:

public class NonBlockingWriterCollection<T> : IDisposable {

	private Queue<T> leftQueue = new Queue<T>();
	private Queue<T> rightQueue = new Queue<T>();

	private volatile bool isWriting = false;
	private volatile bool writeToLeft = true;
	private object readerLocker = new object ();
	private EventWaitHandle readerWaitHandle = new AutoResetEvent (false);

	private volatile bool disposed = false;

	~NonBlockingWriterCollection()
	{
		Dispose(false);
	}

	#region IDisposable
	public void Dispose()
	{
		Dispose(true);
		GC.SuppressFinalize(this);
	}

	protected virtual void Dispose(bool disposing)
	{
		if(!disposed) {
			disposed = true;
			// Wake any waiting readers before closing, since closing/disposing the event
			// handle doesn't raise ObjectDisposedException for threads in waiting states.
			readerWaitHandle.Set (); 
			if(disposing) {
				readerWaitHandle.Close (); // Can cause in-progress writes to throw a disposed exception.
			}

		}
	}
	#endregion

	/// <summary>
	/// Enqueues the specified item.
	/// </summary>
	/// <param name="item">Item to enqueue</param>
	/// <exception cref="ObjectDisposedException">If the queue has been disposed prior to executing this method</exception>
	/// <remarks>
	/// Only a single writer thread can enqueue.
	/// This operation is non-blocking.
	/// </remarks>
	public void Write(T item) {
		if (disposed)
			throw new ObjectDisposedException ("NonBlockingWriterCollection");

		try {
			isWriting = true;

			// Queue an item on the write queue
			if(writeToLeft)
				leftQueue.Enqueue(item);
			else
				rightQueue.Enqueue(item);

			// Signal reader thread that an item has been added
			readerWaitHandle.Set();

		} finally {
			isWriting = false;
		}

	}

	/// <summary>
	/// Dequeues an item.
	/// </summary>
	/// <exception cref="ObjectDisposedException">If the queue has been disposed prior to or during execution of this method</exception>
	/// <remarks>
	/// Multiple reader threads can attempt to dequeue an item.
	/// This operation is blocking (until an item has been enqueued, or the collection has been disposed).
	/// </remarks>
	public T Read() {
		lock (readerLocker) {
			if (disposed)
				throw new ObjectDisposedException ("NonBlockingWriterCollection");

			// Reset the wait handle, at this point we are searching for an item on either queue
			readerWaitHandle.Reset ();

			// Dequeue an item from the queue that is not being written to
			var readQueue = writeToLeft ? rightQueue : leftQueue;
			if (readQueue.Count > 0)
				return readQueue.Dequeue ();

			while (!disposed) {
				// The read queue has been exhausted. Swap read/write queue
				writeToLeft = !writeToLeft;

				// At this point, the writer thread could be writing to either queue, 
				// wait for the write to finish using a spin lock
				while (isWriting) {
				} // busy waiting

				// Try read again from the read queue
				readQueue = writeToLeft ? rightQueue : leftQueue;
				if (readQueue.Count > 0)
					return readQueue.Dequeue ();

				// Both lists have been exhausted, we need to wait for the writer to
				// do something. Block the reader until the writer has signalled.
				// Note: it may have added an item during the read... so this may
				// not block, and continue the read right away
				readerWaitHandle.WaitOne ();
			}

			throw new ObjectDisposedException ("NonBlockingWriterCollection");
		}
	}
}

If you know you are only ever going to have a single reader, then you can simply remove the reader lock to improve performance.

Note the use of volatile members. Volatile reads/writes are required for these primitive members, otherwise all sorts of chaos could ensue should the compiler choose to re-arrange or cache the read/write instructions.

And here is a little test harness:

private const int ReaderCount = 5;
private NonBlockingWriterCollection<int> nbQueue;

public void Run ()
{
	Console.WriteLine ("Starting sim...");
	Thread[] readerThreads;
	Thread writerThread;
	using(nbQueue = new NonBlockingWriterCollection<int>()) {

		readerThreads = new Thread[ReaderCount];
		for(int i = 0; i < ReaderCount; i++) {
			readerThreads [i] = new Thread (RunReader);
			readerThreads [i].Start (i); // box i
		}

		writerThread = new Thread (RunWriter);
		writerThread.Start ();

		Thread.Sleep (1000 * 5);
	}

	Console.WriteLine ("Waiting for sim threads to finish...");
	writerThread.Join ();
	foreach (var rt in readerThreads)
		rt.Join ();

	Console.WriteLine ("Finished sim.");
}

private void RunReader(object threadNum) {
	var rand = new Random (((int)threadNum) * 6109425);
	string threadId = "ReaderThread " + (char)('A' + (int)threadNum); // unbox i
	try {
		while (true) {
			var item = nbQueue.Read (); // blocking
			Console.WriteLine (threadId + " read " + item);
			Thread.Sleep(rand.Next() % 10); // Do some "work" to process the data
		}
	} catch(ObjectDisposedException) {}
}

private void RunWriter() {
	string threadId = "WriterThread";
	var rand = new Random ();
	try {
		while (true) {
			for(int i = 1; i < rand.Next () % 10; i++) {
				var item = rand.Next ();
				Console.WriteLine (threadId + " writing " + item);
				nbQueue.Write (item); // non-blocking
			}
			Thread.Sleep(10 + rand.Next() % 100); // Do some "work" to produce more data
		}
	} catch(ObjectDisposedException) {}
}

Programming Praxis – Egyptian Fractions, C# Solution

This post presents a C# solution to the Egyptian fractions problem described in http://programmingpraxis.com/2013/06/04/egyptian-fractions.

An Egyptian fraction was written as a sum of unit fractions, meaning the numerator is always 1; further, no two denominators can be the same. An easy way to create an Egyptian fraction is to repeatedly take the largest unit fraction that will fit, subtract to find what remains, and repeat until the remainder is a unit fraction. For instance, 7 divided by 15 is less than 1/2 but more than 1/3, so the first unit fraction is 1/3 and the first remainder is 2/15. Then 2/15 is less than 1/7 but more than 1/8, so the second unit fraction is 1/8 and the second remainder is 1/120. That’s in unit form, so we are finished: 7 ÷ 15 = 1/3 + 1/8 + 1/120. There are other algorithms for finding Egyptian fractions, but there is no algorithm that guarantees a maximum number of terms or a minimum largest denominator; for instance, the greedy algorithm leads to 5 ÷ 121 = 1/25 + 1/757 + 1/763309 + 1/873960180913 + 1/1527612795642093418846225, but a simpler rendering of the same number is 1/33 + 1/121 + 1/363.

Your task is to write a program that calculates the ratio of two numbers as an Egyptian fraction…

As presented in the original post, you can use a greedy algorithm. That's not fun! Let's try to improve on it: i.e. return a smaller number of unit fractions. You cannot devise an algorithm that guarantees the smallest number of terms, since the problem space has infinite possibilities (genetic algorithm? food for thought).

So what approach can we take to improve on the greedy solution? I began by writing out a few iterations of the calculation for 5 ÷ 121: from 1/25 up to 1/33 (part of an optimal solution presented in the problem definition). I noticed that when choosing 1/33, the remaining fraction (that we are pulling apart) can be simplified, whereas all the other fractions leading up to 1/33 leave a fraction that cannot be simplified further! If you think about it, simplifying keeps the size of the denominator down, and keeping the denominators small helps yield a smaller number of terms. This is because when we are dealing with very small numbers (large denominators), we approach the final target at a slower rate than we would with larger numbers. Simple huh?

So how can you simplify a fraction? Calculate the gcd (greatest common divisor) of the numerator and the denominator, then divide both by the gcd. If the gcd is 1, then the fraction cannot be simplified. Hmmm… so maybe we should choose the next unit fraction (for subtracting) only if the result can be simplified. Using this informal idea as the basis of our algorithm, we get the following solution:

public class EgyptionFractions
{
	public static List<int[]> GetFractions(int numerator, int denominator) {
		if (numerator >= denominator)
			throw new ArgumentOutOfRangeException ("denominator");
		if (numerator <= 0)
			throw new ArgumentOutOfRangeException ("numerator");

		var fractions = new List<int[]> ();
		int subDenominator = 2;

		do {
			// First find the next unit fraction to subtract from that is small enough
			int leftNumerator = numerator * subDenominator;
			while (leftNumerator < denominator) { // Note: rightNumerator == denominator
				subDenominator++;
				leftNumerator += numerator;
			}

			// Now we have a valid unit fraction to subtract with, let's continue
			// searching for the next unit fraction that yields a remainder that
			// can be simplified (to keep the denominators small).
			while (true) {
				int remainingNumerator = leftNumerator - denominator;
				if(remainingNumerator == 0) {
					// The fractions are the same
					numerator = 0;
					fractions.Add (new [] {1, subDenominator});
					break;
				}
				int remainingDenominator = denominator * subDenominator;
				int gcd = GCD (remainingNumerator, remainingDenominator);
				if (gcd > 1 || remainingNumerator == 1) {
					// The resultant fraction can be simplified using this denominator
					numerator = remainingNumerator / gcd;
					denominator = remainingDenominator / gcd;
					fractions.Add (new [] {1, subDenominator});

					// Finished?
					if(numerator == 1) 
						fractions.Add (new [] {1, denominator});
					break;
				}
				subDenominator++;
				leftNumerator += numerator; // i.e. additive version of subDenominator * numerator;
			}

			subDenominator++;
		} while (numerator > 1);

		return fractions;
	}

	private static int GCD(int n1, int n2) {
		if (n2 == 0)
			return n1;
		return GCD (n2, n1 % n2);
	}
}

If you pass in 5, 121, the result will be:
1/33, 1/91, 1/33033
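
For example (using LINQ's Select to format the pairs):

var fractions = EgyptionFractions.GetFractions (5, 121);
Console.WriteLine (string.Join (", ", fractions.Select (f => f [0] + "/" + f [1])));
// Prints: 1/33, 1/91, 1/33033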

Find the character position using JavaScript: FAST, BIG pages, ALL browsers, NO preprocessing.

During my masters year at university, I set out to experiment with a modeless editing model. I decided to use the web as my content environment: the challenge became making the web editable. I did it, for all browsers/OSs (tested and working on IE 6.0+, all WebKit browsers (Chrome/Safari), FF 1.5+, Konqueror and Opera 9+, on Apple, Windows and Linux). One of the interesting problems to solve was finding the nearest character location when a user clicks on the web page. This post explains my solution.

Building Blocks
Microsoft's Internet Explorer 4 (1997) introduced attributes for all element nodes in the DHTML model that describe the physical dimensions of an element in a web page. These attributes are offsetParent, offsetWidth, offsetHeight, offsetTop and offsetLeft. Most, if not all, web browsers followed Internet Explorer's footsteps, and the attributes are well supported in contemporary browsers. The W3C drafts for the upcoming CSSOM (CSS Object Model) specification include these properties, so they must be supported by all modern web browsers in the future.

Figure 1: DHTML/CSSOM offset properties

Figure 1 displays the top part of a web page that contains a <p> element inside a <div> element. The offsetWidth and offsetHeight attributes refer to the dimensions of an element in pixels (excluding margins). The offsetTop and offsetLeft attributes are the distances from the offsetParent in pixels. The offsetParent is the closest positioned containing element. For example, in Figure 1 the offsetParent of the <p> element is the containing <div> element. Note that the hierarchical parent in the DOM tree is not always the same as offsetParent due to CSS positioning schemes such as relative positioning.

The offset attributes can be used to locate the position of an element relative to the document. This is achieved by summing the offsetTop and offsetLeft attributes of all the offsetParent ancestry up to the document’s <body> element. In some cases, depending on the browser, and whether it is in quirksmode or not, the document’s CSS border must be manually added (if one exists). To determine the position of an element relative to the window, the document’s horizontal and vertical scroll-bar positions are subtracted from the calculated x and y positions of the element relative to the document respectively.

Another widely supported method is elementFromPoint(), which is currently included in the CSSOM specification drafts. This method returns the element in a document at the given (x, y) pixel coordinates. These coordinates are relative to the browser window for Gecko- and Trident-based browsers, where other browsers use coordinates relative to the document. This method is not as well supported as the offset properties (for example, Firefox versions below 3 do not support it), so an alternative JavaScript implementation was written for legacy browser support.

Total Isolation Approach (Preprocessing)

One solution for determining a cursor position from a mouse click is to encapsulate every character within an editable section in a <span> element. Thus, when a user clicks in an editable section, the elementFromPoint() method can be used to directly discover the clicked character. Since the characters are isolated in dedicated <span> elements, the position and size of the characters can also be determined using the methods described in the previous section. <span> elements are chosen because they are the only type of element that can reside in any element where there is text (according to the HTML 4.01 and XHTML 1.0 specifications). As simple as this approach appears, it has too many pitfalls to be regarded as a practical solution:

  • The pre-processing step to isolate every character with <span> elements takes a noticeably long period of computation time. During this period, the browser becomes unresponsive (except for Opera, which runs JavaScript on a separate thread from the window event thread). The pre-processing phase must occur either when a page loads or on the first cursor placement. The former approach would thwart bursty/rapid navigation through seamlessly editable web pages. The latter approach would lose the illusion of direct manipulation, since the response time for the first edit would be too long.
  • The amount of memory used is bloated because of the large number of DOM tree nodes required.
  • The large number of <span> elements creates a large DOM tree. In general, the larger the DOM tree, the slower the performance of the browser. Manipulating the DOM for editing operations becomes more expensive.
  • The elementFromPoint() method becomes slower because there is a larger number of nodes to search.
  • The approach increases complexity in the rest of the framework's implementation. All operations that manipulate the DOM must ensure that every text node always contains exactly one character and is encapsulated in a dedicated <span> element.

Rather than totally isolating every character, you can efficiently break this character-location problem down into a smaller one using a more coarse-grained preprocessing method: splitting just the lines using <div> tags. CodeMirror does this for its code editor. This is a great idea if you don't need to wrap lines. However, if you want to retain text wrapping then you need another method. Let's look more closely at the search space we are scanning to locate the nearest character.

Search space

When a user clicks on a web page, the (x, y) coordinates of the mouse pointer are supplied by the DOM in a mouse event object. The elementFromPoint() method is used to get the element which the mouse cursor is pointing at (if not already supplied by the mouse event object). A list of all the text nodes within the element is collected by traversing the element’s DOM tree in-order. Text nodes where cursor placements cannot occur are excluded. For example, text nodes within <script> or <style> elements, or text nodes consisting purely of white-space for HTML source-code formatting purposes.

In a common case, the element a user clicks on is a container element that is not fully visible to the user. These container elements are only partially visible because the web browser's scroll-bar state reveals only part of the element, or because the container element happens to be larger than the web browser's window. An extreme, but typical, scenario is when a user clicks near the edges of a web page just outside of a <p> element. The element actually clicked is the document's <body> element, and the search space therefore includes every text node in the entire web page. To minimise the search space and yield faster searches for the nearest cursor position, the search space is capped to include only nodes that are visible to the user.

Dual Binary Search

The binary search algorithm is an efficient algorithm for locating an item in a sorted collection (O(log n) on average). It turns out that a two-pass binary search can be used on the search space to discover the closest cursor position.

In the general case, when performing an in-order traversal on a DOM tree, the nodes are visited such that each node is either visually below or to the right of the previously visited node. The search space is collected using an in-order traversal, thus forming a spatially sorted collection. Search spaces are therefore organised in ascending order by y coordinate in the web page, so the characters in the search space are grouped by the line on which they reside.

Figure 2: The dual binary algorithm's perspective of the search space.

Figure 2 conveys this grouping by surrounding characters on the same line with a box in the search space view of a paragraph of text. Furthermore, the x coordinates of the characters are in ascending order within each line-group. For example, the first character in the last group (the right-most group with text “words.”) is at an x coordinate of 6 pixels, and each successive character’s x coordinate increases by roughly 6 pixels all the way up to the last character on the line at an x coordinate of 38 pixels.

Two separate binary searches are used to quickly locate the nearest cursor position to the target coordinate. The first search performs a vertical-position-based binary search over the whole (y-ordered) search space to locate the line group in which the target coordinate resides. The second search performs a horizontal-position-based binary search over the (x-ordered) line group to locate the nearest character to the target coordinate.

Unfortunately the JavaScript code for this implementation is on a CD at home (in another country). However, I present to you the pseudo code, which leaves you the fun of implementing it:

find-cursor-at(target-coordinate) {
  search-space = get-search-space(target-coordinate);
  start = measure(first character in the search-space );
  end = measure(last character in the search-space );
  sample-set = [start,end];
  current-closest = closest(start, end, target-coordinate);
  if (target-coordinate within current-closest.bounds) {
    return current-closest;
  }
  # First pass
  y-search(start,end);
  (lower, upper) = select-bounds(sample-set);
  # Second pass
  x-search(lower, upper);
  return current-closest;
}

y-search(lower, upper) {
  if (lower.abs-index == upper.abs-index - 1) {
    # Upper and lower bounds have met, no more characters to search
    return;
  }
  # Half way point between upper and lower samples
  current.abs-index = (lower.abs-index + upper.abs-index)/2;
  # Get node and relative index from absolute index
  (current.node,current.rel-index) = get-rel-dom-position(current.abs-index);
  # Measure the sample
  (current.x, current.y, current.width, current.height) = measure(current.node, current.rel-index);
  # Store it in the sample set if doing a line search
  sample-set.add(current);
  if (current is closer to target than current-closest ) {
    current-closest = current;
  }
  if (current is same line as the target y ) {
    # Goto second pass: the x binary search
    return;
  } else if (current.y > target-y) {
    upper = current;
  } else {
    lower = current;
  }
  y-search(lower, upper);
}

x-search(lower, upper) {
  if (lower.abs-index == upper.abs-index - 1) {
    # Upper and lower bounds have met, no more characters to search
    return;
  }
  # Half way point between upper and lower samples
  current.abs-index = (lower.abs-index + upper.abs-index)/2;
  # Determine node and relative index from absolute index
  (current.node,current.rel-index) = get-rel-dom-position(current.abs-index);
  # Measure the sample
  (current.x, current.y, current.width, current.height) = measure(current.node, current.rel-index);
  if (current is closer to target than current-closest ) {
    current-closest = current;
  }
  if(current on line above current-closest ) {
    lower = current;
  } else if (current on line below current-closest ) {
    upper = current;
  } else {
    # Current is on same line as current-closest
    if (current.x > target.x) {
      upper = current;
    } else {
      lower = current;
    }
  }
  x-search(lower, upper);
}

That's a lot of code! Let's go through it. The search begins by measuring the first and last characters in the search space, which become the lower and upper bounds of the search space respectively. If one of the initial measurements is displayed directly at the target coordinate, the search is finished. Otherwise the first binary search is entered.

Once the first binary search, the y-search, finds a measurement on the same line as the target coordinate, that search ends. The upper and lower bounds for the second search, the x-search, are then selected from the measurements stored in a sample set gathered during the y-search. They are determined as follows: if the current closest measurement is visually positioned to the left of the target x coordinate, then it becomes the lower bound, and the upper bound is set to the sample in the sample set with the closest absolute index above the lower bound's. Conversely, if the current closest measurement lies to the right of the target x coordinate, then it becomes the upper bound, and the lower bound is set to the sample in the sample set with the closest absolute index below the upper bound's.

Even though the range between the upper and lower bounds selected for the second pass may contain characters on different lines, the search space is still sorted by distance. The x-search pseudo code shows how samples found during the x-search to be on a different line from the target y coordinate are considered more distant, regardless of their x coordinates, than samples that lie on the same line as the target y coordinate.

Some white-space characters are not rendered due to the HTML layout engine collapsing the white-space. When a character is found to be not rendered (where the offset attributes are zero or non-existent), the algorithm sequentially searches in the left direction to discover the nearest occurring rendered character. If the search reaches the lower bound, then the sequential search switches direction and locates the nearest rendered character to the right of the sample. If no other characters are found that are rendered within the upper and lower search space bounds then the search is complete.

The dual binary search algorithm requires the search space to be sorted. There are exceptions to the in-order traversal method for creating a spatially sorted collection of DOM nodes. For example, CSS positioning allows elements to be manually positioned in a web page regardless of where they reside in the DOM tree. CSS Floats are another example that can lead to unordered collections. However positioning styles are usually used for the presentation and structure of a web page rather than actual content. Furthermore the algorithm works within manually positioned elements/floats since an in-order traversal within these elements is spatially ordered.

Homing Dual Binary Search

I had experimented with a homing dual binary search, which is the same as the binary search except that instead of halving the search space for each new sample, the search space is narrowed down to an estimation of the closest absolute index. The estimation step uses the container element’s bounds together with previously measured characters to choose a more informed search space split. But this doesn’t perform very well, since the estimations can’t take into account text wrapping and white-space collapsing (life would be easy if the characters were uniformly spread throughout the container!). What happens is that the estimations often slightly miss the actual character at each split, which sometimes greatly reduces the search space, and sometimes hardly reduces it at all. So it’s best to stick with the vanilla 50/50 binary search split.
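For the curious, the estimation split might look something like the following sketch. The Sample type and its fields are hypothetical names for the measured values; this illustrates the idea of the homing step, not the original code:

// Hypothetical sample type: an absolute character index plus its measured y position
struct Sample { public int AbsIndex; public int Y; }

// Estimate the split index by interpolating the target's y position between
// the bounds, instead of taking the 50/50 midpoint
// (assumes the bounds lie on different lines, so upper.Y != lower.Y)
static int EstimateSplit(Sample lower, Sample upper, int targetY) {
	double fraction = (targetY - lower.Y) / (double)(upper.Y - lower.Y);
	return lower.AbsIndex + (int)(fraction * (upper.AbsIndex - lower.AbsIndex));
}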

Conclusion

You don’t have to pre-process the DOM to quickly find a character position in a web page document. If you want a flexible cross-browser solution, try using a dual binary search as described in this post. This solution was devised during the days of HTML 4.01 and CSS 2.0, so you may need to handle more complicated cases introduced by CSS 3.0. I’d imagine it would work just fine with HTML 5, given its basic use of standard DOM features.

Programming Praxis – Coin Change, Part 1, C# Solutions

This post presents C# solutions to a coin change problem as described in http://programmingpraxis.com/2013/05/17/coin-change-part-1.

It is a classic problem from computer science to determine the number of ways in which coins can be combined to form a particular target; as an example, there are 31 ways to form 40 cents from an unlimited supply of pennies (1 cent), nickels (5 cents), dimes (10 cents) and quarters (25 cents), ranging from the 3-coin set (5 10 25) to the 40-coin set consisting only of pennies.

Your task is to write two functions, one to determine the count and one to determine the list of coins. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.

As the original post for this problem states, it appears to be most easily solvable using a recursive approach. Let’s walk through the thought process of solving such a problem. I would naturally begin by figuring out a systematic process for coming up with all the possible combinations. Once the process is devised, you can then deduce an algorithm.

Start out with a simple concrete example. Naturally I thought: “at each step in the process, keep subtracting the largest allowable coin from the remaining total, until you reach a solution”. A coin is allowable if it is equal to or less than the remaining total. So if n = 40:

n = 40, set = []

Step 1: The largest allowable coin is 25 (25 < 40)

n = 15 (40 - 25), set = [25]

Step 2: The largest allowable coin is 10 (10 < 15)

n = 5 (15 - 10), set = [25, 10]

Step 3: The largest allowable coin is 5 (5 ≤ 5)

n = 0 (5 - 5), set = [25, 10, 5]

Done!
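That walk translates almost directly into code. Here is a minimal C# sketch of just this greedy first pass (not the full enumeration, which comes further below):

// Greedily subtract the largest allowable coin until the total is consumed
static List<int> GreedyFirstSet(int n, int[] coins) { // coins ordered descendingly
	var set = new List<int> ();
	int i = 0;
	while (n > 0) {
		while (coins[i] > n) // Find the largest allowable coin
			i++;
		set.Add (coins[i]);
		n -= coins[i];
	}
	return set; // For n = 40: [25, 10, 5]
}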

Where to from here? Well, let’s enumerate all the solutions that include the 25 coin:

25, 10, 5

25, 10, 1, 1, 1, 1, 1

25, 5, 5, 5

25, 5, 5, 1, 1, 1, 1, 1

25, 5, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1

25, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1

For each successive solution, we are breaking down each coin from the previous solution in as many ways as possible. However, you wouldn’t want to break down the same coin in a previous solution more than once, e.g. for the set [25, 5, 5, 5] you wouldn’t want to break down the coin 5 more than once: [25, 5, 5, 1, 1, 1, 1, 1], [25, 5, 1, 1, 1, 1, 1, 5], [25, 1, 1, 1, 1, 1, 5, 1, 1, 1, 1, 1] etc. How do you avoid this? Look at the concrete example of coin sets above. Notice any patterns? There are a few. In relation to the cases we want to avoid, we can prevent duplicate coin sets by ensuring that for each breakdown the newly selected coin is less than or equal to the coin selected in the previous step (in the concrete example above, all the coins are ordered descendingly). Let’s code it using recursion:

using System.Collections.Generic;

public class CoinChange1
{
	private int[] cs = new [] {25, 10, 5, 1};

	private List<ICollection<int>> solutions = new List<ICollection<int>> ();

	public static IList<ICollection<int>> GetCoinSets(int total) {
		// Handle corner case outside of recursive solution
		if (total == 0)
			return new List<ICollection<int>> ();

		// Get all possible sets
		var cc = new CoinChange1 ();
		cc.GetCoinSets (total, 0, new Stack<int>());
		return cc.solutions;
	}

	private void GetCoinSets(int n, int csi, Stack<int> combo) {
		// Find the largest allowable coin (exploiting that cs is ordered descendingly)
		while (cs[csi] > n)
			csi++;
		int c = cs [csi];

		combo.Push (c); // Track coin selection
		if (n == c)
			solutions.Add(combo.ToArray()); // Base case
		else
			GetCoinSets (n - c, csi, combo); // Recursion 1: with a selected coin
		combo.Pop ();

		// Recurse for all other possibilities beyond this point with a combo of smaller coin units
		if(csi < (cs.Length - 1))
			GetCoinSets (n, csi + 1, combo);
	}
}
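Before moving on, a quick sanity check; the problem statement tells us to expect 31 sets for 40 cents:

var sets = CoinChange1.GetCoinSets (40);
Console.WriteLine (sets.Count); // Prints 31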

And there we have it. This problem does have overlapping subsolutions, so it could possibly be optimized using memoization. However, this would add a lot of overhead: it would mean caching collections of all possible coin subsets, plus many memory copy operations to build new subsets from other sub-solutions. That sounds more costly than worthwhile, so we will stick with this solution.

But we are not finished: the problem statement said to write another function to count the total number of possible coin sets for a given total (n). A naive approach would be to reuse the solution above and return the length of the resulting collection. But we can do better:

	private int GetCoinSetCount(int n, int csi) {
		// Find the largest allowable coin, as before
		while (cs[csi] > n)
			csi++;
		int c = cs [csi];

		int count;
		if (n == c)
			count = 1; // Base case: one complete coin set
		else
			count = GetCoinSetCount (n - c, csi);

		// Add the counts for the sets that use smaller coin units from this point
		if(csi < (cs.Length - 1))
			count += GetCoinSetCount (n, csi + 1);

		return count;
	}

I’ve omitted a wrapper function to neatly package the API (passing 0 for the csi argument and adding a sanity check for total = 0); a sketch of one follows below. The performance has improved, since we no longer need to maintain a list of selected coins throughout the recursive calls. But we can still do better!
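For reference, the omitted wrapper might look like this (a sketch, assuming it lives alongside the counting method and the cs field above):

	public static int GetCoinSetCount(int total) {
		// Sanity check: no coins can form a zero total
		if (total == 0)
			return 0;
		return new CoinChange1 ().GetCoinSetCount (total, 0); // Start at the largest coin
	}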

Now that we don’t have to deal with passing around complex data (lists) per sub-solution, memoization is worthwhile here:

using System.Collections.Generic;

public class CoinChange1
{
	private int[] cs = new [] {25, 10, 5, 1};

	private Dictionary<ulong, uint> cache = new Dictionary<ulong, uint>();

	public static uint GetCoinSetCount(uint total) {
		if (total == 0)
			return 0;
		return new CoinChange1().GetCoinSetCount (total, 0);
	}

	private uint GetCoinSetCount(uint n, byte csi) {
		while (cs[csi] > n)
			csi++;
		uint c = (uint)cs [csi];

		uint count;
		if (n == c) {
			count = 1;
		} else {
			uint nextn = n - c;
			// The cache key packs the two call arguments (n, csi) into a single ulong
			ulong key = ((ulong)nextn << 32) | (ulong)csi;
			if(!cache.TryGetValue(key, out count)) {
				count = GetCoinSetCount (nextn, csi);
				cache[key] = count;
			}
		}

		if(csi < (cs.Length - 1)) {
			ulong key = ((ulong)n << 32) | ((ulong)csi + 1);
			uint subCount;
			if(!cache.TryGetValue(key, out subCount)) {
				subCount = GetCoinSetCount (n, (byte)(csi + 1));
				cache[key] = subCount;
			}
			count += subCount;
		}
		return count;
	}
}

I’ve changed the arguments and return types to unsigned, since I’m making a performance enhancement to the cache key. Since the key is composite (the two arguments to the recursive function), I pack them into a single unsigned long rather than using a Tuple or some custom class/struct. A minor improvement, but an improvement nonetheless.
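And as before, a quick usage check, with the expected count taken from the problem statement:

uint count = CoinChange1.GetCoinSetCount (40);
Console.WriteLine (count); // Prints 31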