You can use Oracle's STANDARD_HASH function. Oracle's STANDARD_HASH function "computes a hash value for a given expression" (see the documentation here). You might do something like this:

    -- 2 tables
    select 2 id, 2 a2, 2 b2, 2 c2, 2 d2, 2 e2, 2 f2 from dual

and then calculate a SHA256 checksum with a SELECT using STANDARD_HASH. This will tell you if your records are the same or different (well, to a very high probability! :-) ). Use the checksums in the WHERE clause of the UPDATE (in the MERGE statement).

With STANDARD_HASH, you get to choose your hashing algorithm. You might want to be careful about your choice - MD5 is much less processor intensive than SHA256. If security is not a major concern, you might want to use MD5 instead. See the fiddle here for more.

See here for the excellent question, and here (same thread) for the excellent answer, both of which I (shamelessly!) lifted to answer this question. I felt that the thread and answer were both good enough to merit a place on dba.

The question:

Performance and security considerations aside, and assuming a hash function with a perfect avalanche effect, which should I use for checksumming blocks of data: CRC32 or a hash truncated to N bytes? I.e. which will have a smaller probability of missing an error? Specifically: data blocks are to be transferred over the network and stored on disk, repeatedly. Blocks can be 1 KB to 1 GB in size. As far as I understand, CRC32 can detect up to 32 bit flips with 100% reliability, but after that its reliability approaches 1-2^(-32), and for some patterns is much worse. A perfect 4-byte hash has a reliability of 1-2^(-32) in all cases, so go figure. An 8-byte hash should have a much better overall reliability (only a 2^(-64) chance to miss an error), so should it be preferred over CRC32? What about CRC64? I guess the answer depends on the type of errors that might be expected in such an operation. Are we likely to see sparse 1-bit flips or massive block corruptions? Also, given that most storage and networking hardware implements some sort of CRC, shouldn't accidental bit flips be taken care of already?

The answer:

Only you can say whether 1-2^(-32) is good enough or not for your application. The error detection performance of a CRC-n and of n bits from a good hash function will be very close to the same, so pick whichever one is faster. That is likely to be the CRC-n.

The above "that is likely to be the CRC-n" is only somewhat likely. It is not so likely if very high performance hash functions are used. I tested three CityHash routines and the Intel crc32 instruction on a 434 MB file. In particular, CityHash appears to be very nearly as fast as a CRC-32 calculated using the Intel crc32 hardware instruction! CityHash64 took 55 ms, CityHash128 60 ms, and CityHashCrc128 50 ms. The crc32 instruction version (which computes a CRC-32C) took 24 ms of CPU time. CityHashCrc128 makes use of the same hardware instruction, though it does not compute a CRC.

In order to get the CRC-32C calculation that fast, I had to get fancy: using three crc32 instructions on three separate buffers in order to make use of the three arithmetic logic units in parallel in a single core, and writing the inner loop in assembler. If you don't have the crc32 instruction, then you would be hard-pressed to compute a 32-bit CRC as fast as a CityHash64 or CityHash128.

Note however that the CityHash functions would need to be modified for this purpose, or an arbitrary choice would need to be made in order to define a consistent meaning for the CityHash value on large streams of data. The reason is that those functions are not set up to accept buffered data, i.e. to be fed a chunk at a time while producing the same result as if the entire set of data were fed to the function at once. The CityHash functions would need to be modified to update an intermediate state.
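The streaming limitation described above can be illustrated with Python's hashlib, which (unlike the CityHash functions as described) does maintain an intermediate state: feeding the data a chunk at a time yields the same digest as hashing everything at once. SHA-256 here is just a convenient stand-in, not one of the functions benchmarked above.

```python
import hashlib

data = b"x" * (1 << 20)  # stand-in for a large stream of data

# One-shot hash of the whole buffer.
one_shot = hashlib.sha256(data).hexdigest()

# Streaming: feed the data a chunk at a time, updating an
# intermediate state -- the property the answer says the CityHash
# functions lack without modification.
h = hashlib.sha256()
for i in range(0, len(data), 65536):
    h.update(data[i:i + 65536])
chunked = h.hexdigest()

assert one_shot == chunked
```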
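As a rough illustration of the claim that a CRC-n and n bits from a good hash function have very nearly the same error-detection performance, here is a sketch in Python using zlib.crc32 and a truncated SHA-256 as the "good hash" stand-in (CityHash has no standard-library binding). For a single-bit flip the CRC detection is guaranteed by construction; the truncated hash misses with probability about 2^(-32).

```python
import hashlib
import os
import zlib

def crc32_check(data: bytes) -> bytes:
    """4-byte check value from CRC-32."""
    return zlib.crc32(data).to_bytes(4, "big")

def truncated_hash_check(data: bytes) -> bytes:
    """4-byte check value: a good hash truncated to N = 4 bytes."""
    return hashlib.sha256(data).digest()[:4]

block = os.urandom(1024)

# Corrupt a single bit -- the sparse bit-flip error model from the question.
corrupted = bytearray(block)
corrupted[100] ^= 0x01
corrupted = bytes(corrupted)

# Both 32-bit checks detect the flip; for random corruption each
# misses with probability about 2**-32.
assert crc32_check(block) != crc32_check(corrupted)
assert truncated_hash_check(block) != truncated_hash_check(corrupted)
```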
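Since the advice is to pick whichever check is faster, here is a minimal way to measure that on your own hardware. This uses stdlib routines (zlib.crc32, MD5, SHA-256) rather than CityHash, and the relative numbers will vary by CPU and library build, so treat it as a sketch, not a reproduction of the benchmark above.

```python
import hashlib
import time
import zlib

def bench(label, fn, data, repeats=20):
    """Time `repeats` passes of `fn` over `data` and report the total."""
    start = time.perf_counter()
    for _ in range(repeats):
        fn(data)
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed * 1000:.1f} ms for {repeats} x {len(data)} bytes")
    return elapsed

data = bytes(1 << 20)  # 1 MB test buffer

bench("crc32 (zlib)", zlib.crc32, data)
bench("md5", lambda d: hashlib.md5(d).digest(), data)
bench("sha256", lambda d: hashlib.sha256(d).digest(), data)
```

On most machines this also confirms the point made above that MD5 is much less processor intensive than SHA256.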