rust-lang · thestinger · Apr 12, 2013
diff --git a/doc/tutorial-ffi.md b/doc/tutorial-ffi.md
@@ -2,255 +2,199 @@
 
 # Introduction
 
-Because Rust is a systems programming language, one of its goals is to
-interoperate well with C code.
+This tutorial will use the [snappy](https://code.google.com/p/snappy/)
+compression/decompression library as an introduction to writing bindings for
+foreign code. Rust is currently unable to call directly into a C++ library, but
+snappy includes a C interface (documented in
+[`snappy-c.h`](https://code.google.com/p/snappy/source/browse/trunk/snappy-c.h)).
 
-We'll start with an example, which is a bit bigger than usual. We'll
-go over it one piece at a time. This is a program that uses OpenSSL's
-`SHA1` function to compute the hash of its first command-line
-argument, which it then converts to a hexadecimal string and prints to
-standard output. If you have the OpenSSL libraries installed, it
-should compile and run without any extra effort.
+The following is a minimal example of calling a foreign function which will compile if snappy is
+installed:
 
 ~~~~ {.xfail-test}
-extern mod std;
-use core::libc::c_uint;
+use core::libc::size_t;
 
-extern mod crypto {
-    fn SHA1(src: *u8, sz: c_uint, out: *u8) -> *u8;
-}
-
-fn as_hex(data: ~[u8]) -> ~str {
-    let mut acc = ~"";
-    for data.each |&byte| { acc += fmt!("%02x", byte as uint); }
-    return acc;
-}
-
-fn sha1(data: ~str) -> ~str {
-    unsafe {
-        let bytes = str::to_bytes(data);
-        let hash = crypto::SHA1(vec::raw::to_ptr(bytes),
-                                vec::len(bytes) as c_uint,
-                                ptr::null());
-        return as_hex(vec::from_buf(hash, 20));
-    }
+#[link_args = "-lsnappy"]
+extern {
+    fn snappy_max_compressed_length(source_length: size_t) -> size_t;
 }
 
 fn main() {
-    io::println(sha1(core::os::args()[1]));
+    let x = unsafe { snappy_max_compressed_length(100) };
+    println(fmt!("max compressed length of a 100 byte buffer: %?", x));
 }
 ~~~~
 
-# Foreign modules
-
-Before we can call the `SHA1` function defined in the OpenSSL library, we have
-to declare it. That is what this part of the program does:
+The `extern` block is a list of function signatures in a foreign library, in this case with the
+platform's C ABI. The `#[link_args]` attribute is used to instruct the linker to link against the
+snappy library so the symbols are resolved.
 
-~~~~ {.xfail-test}
-extern mod crypto {
-    fn SHA1(src: *u8, sz: uint, out: *u8) -> *u8; }
-~~~~
+Foreign functions are assumed to be unsafe so calls to them need to be wrapped with `unsafe {}` as a
+promise to the compiler that everything contained within truly is safe. C libraries often expose
+interfaces that aren't thread-safe, and almost any function that takes a pointer argument isn't
+valid for all possible inputs since the pointer could be dangling, and raw pointers fall outside of
+Rust's safe memory model.
 
-An `extern` module declaration containing function signatures introduces the
-functions listed as _foreign functions_. Foreign functions differ from regular
-Rust functions in that they are implemented in some other language (usually C)
-and called through Rust's foreign function interface (FFI). An extern module
-like this is called a foreign module, and implicitly tells the compiler to
-link with a library that contains the listed foreign functions, and has the
-same name as the module.
+When declaring the argument types to a foreign function, the Rust compiler will not check if the
+declaration is correct, so specifying it correctly is part of keeping the binding correct at
+runtime.
 
-In this case, the Rust compiler changes the name `crypto` to a shared library
-name in a platform-specific way (`libcrypto.so` on Linux, for example),
-searches for the shared library with that name, and links the library into the
-program. If you want the module to have a different name from the actual
-library, you can use the `"link_name"` attribute, like:
+The `extern` block can be extended to cover the entire snappy API:
 
 ~~~~ {.xfail-test}
-#[link_name = "crypto"]
-extern mod something {
-    fn SHA1(src: *u8, sz: uint, out: *u8) -> *u8;
+use core::libc::{c_int, size_t};
+
+#[link_args = "-lsnappy"]
+extern {
+    fn snappy_compress(input: *u8,
+                       input_length: size_t,
+                       compressed: *mut u8,
+                       compressed_length: *mut size_t) -> c_int;
+    fn snappy_uncompress(compressed: *u8,
+                         compressed_length: size_t,
+                         uncompressed: *mut u8,
+                         uncompressed_length: *mut size_t) -> c_int;
+    fn snappy_max_compressed_length(source_length: size_t) -> size_t;
+    fn snappy_uncompressed_length(compressed: *u8,
+                                  compressed_length: size_t,
+                                  result: *mut size_t) -> c_int;
+    fn snappy_validate_compressed_buffer(compressed: *u8,
+                                         compressed_length: size_t) -> c_int;
 }
 ~~~~
 
-# Foreign calling conventions
+# Creating a safe interface
 
-Most foreign code is C code, which usually uses the `cdecl` calling
-convention, so that is what Rust uses by default when calling foreign
-functions. Some foreign functions, most notably the Windows API, use other
-calling conventions. Rust provides the `"abi"` attribute as a way to hint to
-the compiler which calling convention to use:
+The raw C API needs to be wrapped to provide memory safety and make use higher-level concepts like
+vectors. A library can choose to expose only the safe, high-level interface and hide the unsafe
+internal details.
 
-~~~~
-#[cfg(target_os = "win32")]
-#[abi = "stdcall"]
-extern mod kernel32 {
-    fn SetEnvironmentVariableA(n: *u8, v: *u8) -> int;
+Wrapping the functions which expect buffers involves using the `vec::raw` module to manipulate Rust
+vectors as pointers to memory. Rust's vectors are guaranteed to be a contiguous block of memory. The
+length is number of elements currently contained, and the capacity is the total size in elements of
+the allocated memory. The length is less than or equal to the capacity.
+
+~~~~ {.xfail-test}
+pub fn validate_compressed_buffer(src: &[u8]) -> bool {
+    unsafe {
+        snappy_validate_compressed_buffer(vec::raw::to_ptr(src), src.len() as size_t) == 0
+    }
 }
 ~~~~
 
-The `"abi"` attribute applies to a foreign module (it cannot be applied
-to a single function within a module), and must be either `"cdecl"`
-or `"stdcall"`. We may extend the compiler in the future to support other
-calling conventions.
+The `validate_compressed_buffer` wrapper above makes use of an `unsafe` block, but it makes the
+guarantee that calling it is safe for all inputs by leaving off `unsafe` from the function
+signature.
 
-# Unsafe pointers
+The `snappy_compress` and `snappy_uncompress` functions are more complex, since a buffer has to be
+allocated to hold the output too.
 
-The foreign `SHA1` function takes three arguments, and returns a pointer.
+The `snappy_max_compressed_length` function can be used to allocate a vector with the maximum
+required capacity to hold the compressed output. The vector can then be passed to the
+`snappy_compress` function as an output parameter. An output parameter is also passed to retrieve
+the true length after compression for setting the length.
 
 ~~~~ {.xfail-test}
-# extern mod crypto {
-fn SHA1(src: *u8, sz: libc::c_uint, out: *u8) -> *u8;
-# }
-~~~~
+pub fn compress(src: &[u8]) -> ~[u8] {
+    unsafe {
+        let srclen = src.len() as size_t;
+        let psrc = vec::raw::to_ptr(src);
 
-When declaring the argument types to a foreign function, the Rust
-compiler has no way to check whether your declaration is correct, so
-you have to be careful. If you get the number or types of the
-arguments wrong, you're likely to cause a segmentation fault. Or,
-probably even worse, your code will work on one platform, but break on
-another.
+        let mut dstlen = snappy_max_compressed_length(srclen);
+        let mut dst = vec::with_capacity(dstlen as uint);
+        let pdst = vec::raw::to_mut_ptr(dst);
 
-In this case, we declare that `SHA1` takes two `unsigned char*`
-arguments and one `unsigned long`. The Rust equivalents are `*u8`
-unsafe pointers and an `uint` (which, like `unsigned long`, is a
-machine-word-sized type).
+        snappy_compress(psrc, srclen, pdst, &mut dstlen);
+        vec::raw::set_len(&mut dst, dstlen as uint);
+        dst
+    }
+}
+~~~~
 
-The standard library provides various functions to create unsafe pointers,
-such as those in `core::cast`. Most of these functions have `unsafe` in their
-name.  You can dereference an unsafe pointer with the `*` operator, but use
-caution: unlike Rust's other pointer types, unsafe pointers are completely
-unmanaged, so they might point at invalid memory, or be null pointers.
+Decompression is similar, because snappy stores the uncompressed size as part of the compression
+format and `snappy_uncompressed_length` will retrieve the exact buffer size required.
 
-# Unsafe blocks
+~~~~ {.xfail-test}
+pub fn uncompress(src: &[u8]) -> Option<~[u8]> {
+    unsafe {
+        let srclen = src.len() as size_t;
+        let psrc = vec::raw::to_ptr(src);
 
-The `sha1` function is the most obscure part of the program.
+        let mut dstlen: size_t = 0;
+        snappy_uncompressed_length(psrc, srclen, &mut dstlen);
 
-~~~~
-# pub mod crypto {
-#   pub fn SHA1(src: *u8, sz: uint, out: *u8) -> *u8 { out }
-# }
-# fn as_hex(data: ~[u8]) -> ~str { ~"hi" }
-fn sha1(data: ~str) -> ~str {
-    unsafe {
-        let bytes = str::to_bytes(data);
-        let hash = crypto::SHA1(vec::raw::to_ptr(bytes),
-                                vec::len(bytes), ptr::null());
-        return as_hex(vec::from_buf(hash, 20));
+        let mut dst = vec::with_capacity(dstlen as uint);
+        let pdst = vec::raw::to_mut_ptr(dst);
+
+        if snappy_uncompress(psrc, srclen, pdst, &mut dstlen) == 0 {
+            vec::raw::set_len(&mut dst, dstlen as uint);
+            Some(dst)
+        } else {
+            None // SNAPPY_INVALID_INPUT
+        }
     }
 }
 ~~~~
 
-First, what does the `unsafe` keyword at the top of the function
-mean? `unsafe` is a block modifier—it declares the block following it
-to be known to be unsafe.
+For reference, the examples used here are also available as an [library on
+GitHub](https://github.com/thestinger/rust-snappy).
 
-Some operations, like dereferencing unsafe pointers or calling
-functions that have been marked unsafe, are only allowed inside unsafe
-blocks. With the `unsafe` keyword, you're telling the compiler 'I know
-what I'm doing'. The main motivation for such an annotation is that
-when you have a memory error (and you will, if you're using unsafe
-constructs), you have some idea where to look—it will most likely be
-caused by some unsafe code.
+# Linking
 
-Unsafe blocks isolate unsafety. Unsafe functions, on the other hand,
-advertise it to the world. An unsafe function is written like this:
-
-~~~~
-unsafe fn kaboom() { ~"I'm harmless!"; }
-~~~~
+In addition to the `#[link_args]` attribute for explicitly passing arguments to the linker, an
+`extern mod` block will pass `-lmodname` to the linker by default unless it has a `#[nolink]`
+attribute applied.
 
-This function can only be called from an `unsafe` block or another
-`unsafe` function.
-
-# Pointer fiddling
+# Unsafe blocks
 
-The standard library defines a number of helper functions for dealing
-with unsafe data, casting between types, and generally subverting
-Rust's safety mechanisms.
+Some operations, like dereferencing unsafe pointers or calling functions that have been marked
+unsafe are only allowed inside unsafe blocks. Unsafe blocks isolate unsafety and are a promise to
+the compiler that the unsafety does not leak out of the block.
 
-Let's look at our `sha1` function again.
+Unsafe functions, on the other hand, advertise it to the world. An unsafe function is written like
+this:
 
 ~~~~
-# pub mod crypto {
-#     pub fn SHA1(src: *u8, sz: uint, out: *u8) -> *u8 { out }
-# }
-# fn as_hex(data: ~[u8]) -> ~str { ~"hi" }
-# fn x(data: ~str) -> ~str {
-# unsafe {
-let bytes = str::to_bytes(data);
-let hash = crypto::SHA1(vec::raw::to_ptr(bytes),
-                        vec::len(bytes), ptr::null());
-return as_hex(vec::from_buf(hash, 20));
-# }
-# }
+unsafe fn kaboom(ptr: *int) -> int { *ptr }
 ~~~~
 
-The `str::to_bytes` function is perfectly safe: it converts a string to a
-`~[u8]`. The program then feeds this byte array to `vec::raw::to_ptr`, which
-returns an unsafe pointer to its contents.
-
-This pointer will become invalid at the end of the scope in which the vector
-it points to (`bytes`) is valid, so you should be very careful how you use
-it. In this case, the local variable `bytes` outlives the pointer, so we're
-good.
-
-Passing a null pointer as the third argument to `SHA1` makes it use a
-static buffer, and thus save us the effort of allocating memory
-ourselves. `ptr::null` is a generic function that, in this case, returns an
-unsafe null pointer of type `*u8`. (Rust generics are awesome
-like that: they can take the right form depending on the type that they
-are expected to return.)
-
-Finally, `vec::from_buf` builds up a new `~[u8]` from the
-unsafe pointer that `SHA1` returned. SHA1 digests are always
-twenty bytes long, so we can pass `20` for the length of the new
-vector.
-
-# Passing structures
+This function can only be called from an `unsafe` block or another `unsafe` function.
 
-C functions often take pointers to structs as arguments. Since Rust
-`struct`s are binary-compatible with C structs, Rust programs can call
-such functions directly.
+# Foreign calling conventions
 
-This program uses the POSIX function `gettimeofday` to get a
-microsecond-resolution timer.
+Most foreign code exposes a C ABI, and Rust uses the platform's C calling convention by default when
+calling foreign functions. Some foreign functions, most notably the Windows API, use other calling
+conventions. Rust provides the `abi` attribute as a way to hint to the compiler which calling
+convention to use:
 
 ~~~~
-extern mod std;
-use core::libc::c_ulonglong;
-
-struct timeval {
-    tv_sec: c_ulonglong,
-    tv_usec: c_ulonglong
+#[cfg(target_os = "win32")]
+#[abi = "stdcall"]
+extern mod kernel32 {
+    fn SetEnvironmentVariableA(n: *u8, v: *u8) -> int;
 }
+~~~~
 
-#[nolink]
-extern mod lib_c {
-    fn gettimeofday(tv: *mut timeval, tz: *()) -> i32;
-}
-fn unix_time_in_microseconds() -> u64 {
-    unsafe {
-        let mut x = timeval {
-            tv_sec: 0 as c_ulonglong,
-            tv_usec: 0 as c_ulonglong
-        };
-        lib_c::gettimeofday(&mut x, ptr::null());
-        return (x.tv_sec as u64) * 1000_000_u64 + (x.tv_usec as u64);
-    }
-}
+The `abi` attribute applies to a foreign module (it cannot be applied to a single function within a
+module), and must be either `"cdecl"` or `"stdcall"`. The compiler may eventually support other
+calling conventions.
 
-# fn main() { assert!(fmt!("%?", unix_time_in_microseconds()) != ~""); }
-~~~~
+# Interoperability with foreign code
 
-The `#[nolink]` attribute indicates that there's no foreign library to
-link in. The standard C library is already linked with Rust programs.
+Rust guarantees that the layout of a `struct` is compatible with the platform's representation in C.
+A `#[packed]` attribute is available, which will lay out the struct members without padding.
+However, there are currently no guarantees about the layout of an `enum`.
 
-In C, a `timeval` is a struct with two 32-bit integer fields. Thus, we
-define a `struct` type with the same contents, and declare
-`gettimeofday` to take a pointer to such a `struct`.
+Rust's owned and managed boxes use non-nullable pointers as handles which point to the contained
+object. However, they should not be manually because they are managed by internal allocators.
+Borrowed pointers can safely be assumed to be non-nullable pointers directly to the type. However,
+breaking the borrow checking or mutability rules is not guaranteed to be safe, so prefer using raw
+pointers (`*`) if that's needed because the compiler can't make as many assumptions about them.
 
-This program does not use the second argument to `gettimeofday` (the time
- zone), so the `extern mod` declaration for it simply declares this argument
- to be a pointer to the unit type (written `()`). Since all null pointers have
- the same representation regardless of their referent type, this is safe.
+Vectors and strings share the same basic memory layout, and utilities are available in the `vec` and
+`str` modules for working with C APIs. Strings are terminated with `\0` for interoperability with C,
+but it should not be assumed because a slice will not always be nul-terminated. Instead, the
+`str::as_c_str` function should be used.
 
+The standard library includes type aliases and function definitions for the C standard library in
+the `libc` module, and Rust links against `libc` and `libm` by default.