U is cyclic of order 4, generated by [0,1|-1,0]. This is multiplication by i in the complex plane. (U has order 2 if p = 2.)
Let the top row of a matrix in O be [a,b]. If either is 0 then we are back to an element of U. So both a and b are nonzero, and if the two rows are to be orthogonal, both entries in the second row are also nonzero. Scale the bottom row so that the lower right entry equals a. (We'll scale it back later if necessary.) The antidiagonal elements must now be opposite. [a,b|-b,a]. This has a determinant of a2+b2, which is already 1, since the top row has length 1. So we didn't need to scale the bottom row after all. The matrices of O correspond to the pairs a,b such that a2+b2 = 1. You can think of these as vectors in the complex plane, though the analogy is not perfect.
What is the size of O? If p = 2, (or some field of characteristic 2), everything has a unique square root. Select any a and set b to the square root of 1-a2. The order of O is the size of the finite field. But of course this isn't typical.
For p odd, I call upon a lemma that is technical, and too long to copy here. It says the sum of squares distributes evenly across the nonzero values mod p. You can look at it here. The number of vectors with length 1 is then the number of vectors with nonzero length, divided by p-1. How many vectors have length 0? It depends on p mod 4. If p equal 3 mod 4 then a2 cannot equal -b2 mod p, since -1 is not a square. So there is but 1 solution, namely [0,0], leaving p2-1 vectors with nonzero length. Therefore |O| = p+1. If p = 1 mod 4 then any nonzero a can be multiplied by ±sqrt(-1) to find the corresponding b. Combine with [0,0] and find 2p-1 solutions. That leaves p2-2p+1 vectors with nonzero length, and |O| = p-1.
Take arbitrary matrices [a,b|-b,a] and [c,d|-d,c] and verify that multiplication commutes. Thus O is abelian.
In characteristic 2, [a,b|b,a]2 = 1. The group consists of copies of Z2 running in parallel; so that the order of O is the order of our finite field.
Notice that the bottom row contains no new information. The top row is everything, and when it is viewed as a vector in the complex plane, the product of two matrices yields a top row that is in fact (a+bi) * (c+di). So we are really extending the field by i, and multiplying two values in the resulting ring. When p is 3 mod 4, there is no square root of -1. The extension creates a larger field, and the vectors form a finite multiplicative group within this field. Such a group is cyclic, hence O is cyclic.
Bravely move on to the case of p = 1 mod 4. Let m be the square root of -1. If the base field is F, the ring extension is F[x]/(x2+1). By the chinese remainder theorem, this is isomorphic, in a canonical way, to the direct product of the two rings F[x]/(x-m) * F[x]/(x+m). Each of these is essentially F. So we have F cross F. A vector such as a,b maps to a+mb in one field and a-mb in the other. If either image is 0 then a2+b2 = 0, and that can't happen. So a group homomorphism, respecting multiplication, carries O into F*. To show this is a monomorphism, suppose a-mb = c-md. Thus a-c = m(b-d). If d is b-r, then c is a-mr. Find the length of [a-mr,b-r]. The result is a2+b2-2r(b+am). The first two terms sum to 1, and the total length must b 1, so r is 0, or am+b = 0. If the latter, then multiply through by m, and a-mb = 0. We already ruled this out. Therefore O embeds in F*, and is cyclic. Since both groups have the same size, they are in fact isomorphic.
If p = 1 mod 4, then i (the square root of -1) is part of F. Bring in the antidiagonal matrix [0,i|i,0], and get matrices of the form i times [b,a|a,-b]. This coset has orthogonal rows, but the lengths (thanks to i) are -1. That completes the 2 sylow subgroup for 1 mod 4.
When p = 3 mod 4, find x and y such that x2 + y2 = -1. Then bring in the matrix [x,y|y,-x]. This generates matrices of the form [ax+by,ay-bx|-bx+ay,-by-ax]. Again, a dihedral group.
Start with p = 3 mod 4. Adjoin i to F to get a larger field. The norm of a+bi is a2+b2. Norm is a group homomorphism from F(i) onto F. The kernel is O. Let z be a primitive element in F(i). Thus z raised to the p2-1 = 1. Raise z to the p-1 and find a generator for O.
When p = 907, the extended field has order e = 822648. Any z raised to the e equals 1. If z to the e/2, e/3, e/151, and e/227 are not 1, then z is primitive. Set z = 2+i and find a primitive element. Rais this to the 906 and get 182+362i, a generator for O.
Let p = 1 mod 4 and find a primitive element z for F. Raise z to the (p-1)/4 and call the result m. This is the square root of -1. (I used m for this purpose above.) Now we need to reverse engineer the chinese remainder theorem. It is easy, for instance, to go from mod 15 down to mod 3 and mod 5, but a bit trickier to go backwards. Let's do an example.
Let p = 929, and verify that z = 3 is primitive. Set y = 1/z = 310. The images of a+bi must be y and z. Raise z to the 232 and get m = 605. Set a-mb = z and a+mb = y. Solve and find a generator for O, 621+497i.
If you are working mod n, rather than mod p, the above procedure is not possible; yet you might still come across a generator for O. Select c at random, until c2 mod n is a prime equal to 1 mod 4. It's ok if c is a product of such primes, but factoring is often more trouble than it's worth, so just look for a prime number. Then write this as a2+b2. Finally, divide through by c2, so that a2+b2 = 1. Within a dozen trials you should stumble upon a generator for O, though there isn't an easy way to verify this.
for what it's worth, a similar method solves a2+b2 = jc2 for any j.