Writing Python inside your Rust code — Part 2

In this part, we’ll extend our python!{}-macro to be able to seamlessly use Rust variables in the Python code within. We explore a few options, and implement two alternatives.

Previously: Part 1A

Next: Part 3

Defining globals for Python

In Part 1, we wrote a run_python function that executes Python code using PyO3’s Python::run function:

fn run_python(code: &str) {
    let py = pyo3::Python::acquire_gil();
    if let Err(e) = py.python().run(code, None, None) {
        e.print(py.python());
        panic!("Python code failed");
    }
}

Let’s take a look again at the signature of pyo3::Python::run:

  pub fn run(
      self,
      code: &str,
      globals: Option<&PyDict>,
      locals: Option<&PyDict>
  ) -> PyResult<()>

Looks like we can give a PyDict of global variables where we give None now.

Setting variables inside a PyDict looks easy enough: PyDict::set_item only requires the key and value to implement ToPyObject, which is already implemented for a lot of types, including strings, integers, vectors, and many more.

Let’s see how it works, by defining five = 5 from Rust:

fn run_python(code: &str) {
    let py = pyo3::Python::acquire_gil();
    let globals = pyo3::types::PyDict::new(py.python());

    // "five" and 5 are automatically converted to a PyObject by ToPyObject.
    globals.set_item("five", 5).unwrap();

    if let Err(e) = py.python().run(code, Some(globals), None) {
        e.print(py.python());
        panic!("Python code failed");
    }
}

fn main() {
    python! {
        print(five + 1)
    }
}

$ cargo r
   Compiling scratchpad v0.1.0
    Finished dev [unoptimized + debuginfo] target(s) in 0.34s
     Running `target/debug/scratchpad`
6

Hey, that works!

Transparent syntax

Now that we know converting objects from Rust to Python is not going to be a problem (thanks to PyO3’s ToPyObject), we can move on to the problem of how the user will indicate which of their variables need to be converted and injected in the globals dictionary.

The most ergonomic way would probably be something that’s completely transparent:

let a = 5;
python! {
    b = 10
    print(a + b)
}

If we could make this work, users could just refer to any variable in scope as if they didn’t even switch languages. Sounds perfect.

However, our procedural macro does not have access to the surrounding code, so it would not know that a even exists in the Rust code. In order to know a comes from Rust, but b does not, it’d have to understand the Python code and see that a is used without being initialized, but b is initialized inside the Python code. This goes much further than parsing b = 10 and understanding that it defines b, meaning it should not be captured. For example, print is not explicitly defined anywhere, but it does not refer to anything from Rust.

If we think a bit more about this, it only gets worse:

let a = 5;
python! {
    from somelibrary import *
    globals()['c'] = 10
    print(a + b + c)
}

Without fully parsing and running all of the Python code, we can’t possibly do this.

So, there needs to be some way the user can tell the macro which variables need to be captured.

Capture list

Capturing variables... That sounds like something closures (sometimes referred to as lambdas) do. Maybe we can draw some inspiration from there.

In Rust, a closure is defined using ||:

let a = 5;
let f = |b| a + b;
assert_eq!(f(10), 15);

Here, f behaves like a function. It takes one argument (b), and returns a + b. a was implicitly captured from the environment. This implicit capturing is exactly what we discussed before, which is not feasible for python!{}.

In Rust, closures always implicitly capture whatever they need from their environment, as described in the Rust Book. However, if we look at closures in other languages, we see that C++ is one of the few where we can manually specify what should be captured, using []-syntax:

int main() {
    int a = 5;

    auto x = [a] (int b) { return a + b; };
    //       └┬┘ └──┬──┘
    //        │     └── Parameter list
    //        └── Capture list

    assert(x(10) == 15);
}

Looks like C++ already solved our problem, using a capture list. Let’s just steal the idea:

let a = 3;
let b = 20;
python! {
    [a b] // List of variables to capture
    c = 100
    print(a + b + c)
}

I’ve omitted commas or any other syntax apart from the [] and the names, to keep parsing as simple as possible. If it turns out to work well, we can always improve the syntax later.

Implementation

Alright, let’s see if we can implement this.

We’ll have to parse the capture list and generate code that calls globals.set_item("var", var) for each variable. This code should end up in run_python and be executed after making the globals dictionary, but before calling Python::run.

Passing arbitrary code to a function is easy in Rust: using impl Fn, run_python can accept a closure containing the code with all the set_item calls. We’ll have to give the closure access to the PyDict:

fn run_python(
    code: &str,
    f: impl FnOnce(&pyo3::types::PyDict), // new
) {
    let py = pyo3::Python::acquire_gil();
    let globals = pyo3::types::PyDict::new(py.python());

    f(globals); // new

    if let Err(e) = py.python().run(code, Some(globals), None) {
        e.print(py.python());
        panic!("Python code failed");
    }
}

We use FnOnce instead of Fn, because we only call the function once. This means it can also accept closures that move out of their own captures, and therefore cannot be called more than once.
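To make the difference concrete, here’s a minimal sketch without PyO3. The call_once function and the Vec<String> standing in for the globals dictionary are made up for this illustration; only the FnOnce bound mirrors run_python.

```rust
// A stand-in for run_python: it calls the closure exactly once.
// `call_once` and the Vec-based "globals" are hypothetical, for illustration only.
fn call_once(f: impl FnOnce(&mut Vec<String>)) -> Vec<String> {
    let mut globals = Vec::new();
    f(&mut globals);
    globals
}

fn demo() -> Vec<String> {
    let name = String::from("five");
    // This closure moves `name` out of its captured state into the Vec,
    // so it only implements FnOnce, not Fn or FnMut.
    call_once(move |globals| globals.push(name))
}

fn main() {
    assert_eq!(demo(), vec![String::from("five")]);
}
```

Had call_once asked for impl Fn instead, the closure in demo would be rejected, because calling it a second time would need a name that has already been moved away.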

Now on to the hard part, the macro implementation.

First, a function that checks if there’s a capture list, and extracts the names from it:

fn get_captures(input: TokenStream) -> Option<(Vec<String>, TokenStream)> {
    let mut input = input.into_iter();

    let captures = match input.next() {
        Some(TokenTree::Group(g)) if g.delimiter() == Delimiter::Bracket => g.stream(),
        _ => return None,
    };

    let captures = captures
        .into_iter()
        .map(|token| {
            if let TokenTree::Ident(ident) = token {
                ident.to_string()
            } else {
                panic!("Invalid token in capture list");
            }
        })
        .collect();

    Some((captures, TokenStream::from_iter(input)))
}

Let’s go through this step by step:

It returns an Option, because there might not be a capture list, in which case we’ll just return None.
It turns the TokenStream into an iterator, so we can extract the first token from it, while tracking our position in the stream.
It checks if the first token is a Group delimited by [..], and extracts the tokens inside this group.
It loops over these tokens, checking if they are actually identifiers, and converts them to strings.
It returns the vector of captured variable names, together with the rest of the token stream.
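Since the proc_macro types only work inside a procedural macro, it can help to try the same logic on a toy model first. The sketch below assumes a made-up Token enum in place of TokenTree; it isn’t the real API, just the same control flow in a form that runs as a normal program.

```rust
// A toy token type standing in for proc_macro's TokenTree (an assumption for
// this sketch; the real type also has literals, spans, and more).
#[derive(Debug, Clone, PartialEq)]
enum Token {
    Ident(String),
    BracketGroup(Vec<Token>),
    Other(char),
}

// Same shape as get_captures: None if there's no leading [..] group,
// otherwise the captured names plus the remaining tokens.
fn get_captures(input: Vec<Token>) -> Option<(Vec<String>, Vec<Token>)> {
    let mut input = input.into_iter();

    // Only a bracket group in first position counts as a capture list.
    let captures = match input.next() {
        Some(Token::BracketGroup(tokens)) => tokens,
        _ => return None,
    };

    // Every token inside the group must be an identifier.
    let captures = captures
        .into_iter()
        .map(|token| match token {
            Token::Ident(name) => name,
            _ => panic!("Invalid token in capture list"),
        })
        .collect();

    // Whatever is left in the iterator is the Python code itself.
    Some((captures, input.collect()))
}

fn main() {
    let input = vec![
        Token::BracketGroup(vec![Token::Ident("a".into()), Token::Ident("b".into())]),
        Token::Ident("c".into()),
        Token::Other('='),
    ];
    let (captures, rest) = get_captures(input).unwrap();
    assert_eq!(captures, vec!["a", "b"]);
    assert_eq!(rest, vec![Token::Ident("c".into()), Token::Other('=')]);

    // No leading [..] group means no capture list at all.
    assert_eq!(get_captures(vec![Token::Ident("c".into())]), None);
}
```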

Let’s test it by adding it to our procedural macro:

#[proc_macro]
pub fn python(input: TokenStream) -> TokenStream {
    let (captures, input) = get_captures(input.clone()).unwrap_or((vec![], input));

    dbg!(captures);

    ...

We clone the input before giving it to get_captures, so we still have the original one around in case there was no capture list.

We still have to generate the set_item-code, but for now let’s see if parsing the capture list worked:

fn main() {
    python! {
        [a b]
        ...
    }
}

$ cargo b
   Compiling python-macro v0.1.0
   Compiling scratchpad v0.1.0
[python-macro/src/lib.rs:37] captures = [
    "a",
    "b",
]
error[E0061]: this function takes 2 arguments but 1 argument was supplied

That part works. It shows the names of the variables that should be captured, and then errors out because we’re missing the second argument to run_python.

On to generating that part.

For a capture list like [a b], our macro should generate something like:

|globals| {
    globals.set_item("a", a).unwrap();
    globals.set_item("b", b).unwrap();
}

We can do this directly inside our existing quote!():

    quote!(
        run_python(
            #source,
            |globals| {
                #(
                globals
                    .set_item(stringify!(#captures), #captures)
                    .expect("Conversion failed");
                )*
            }
        );
    ).into()

That #(..)*-syntax is a quote!-feature which will repeat its contents as many times as needed. In this case, for each element of captures. We again use stringify!() to turn the variable name into a string.
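If you haven’t seen this kind of repetition before, macro_rules! has a close analogue in $(..)*. Here’s a rough standalone sketch of the same idea, with a hypothetical capture_all! macro and a BTreeMap standing in for the PyDict:

```rust
use std::collections::BTreeMap;

// macro_rules! analogue of the quote!() repetition: the `$( ... )*` block is
// emitted once per captured name. `capture_all!` is a made-up name, and the
// BTreeMap is only a stand-in for the globals dictionary.
macro_rules! capture_all {
    ($globals:ident; $($var:ident)*) => {
        $(
            $globals.insert(stringify!($var), $var);
        )*
    };
}

fn demo() -> BTreeMap<&'static str, i32> {
    let a = 3;
    let b = 20;
    let mut globals = BTreeMap::new();
    // Expands to one insert call for `a` and one for `b`.
    capture_all!(globals; a b);
    globals
}

fn main() {
    let globals = demo();
    assert_eq!(globals["a"], 3);
    assert_eq!(globals["b"], 20);
}
```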

Let’s try!

fn main() {
    let a = 3;
    let b = 20;
    python! {
        [a b]
        c = 100
        print(a + b + c)
    }
}

$ cargo r
   Compiling python-macro v0.1.0
   Compiling scratchpad v0.1.0
warning: unused variable: `a`
 --> src/main.rs:5:9
  |
5 |     let a = 3;
  |         ^ help: if this is intentional, prefix it with an underscore: `_a`
  |
  = note: `#[warn(unused_variables)]` on by default

warning: unused variable: `b`
 --> src/main.rs:6:9
  |
6 |     let b = 20;
  |         ^ help: if this is intentional, prefix it with an underscore: `_b`

warning: 2 warnings emitted

    Finished dev [unoptimized + debuginfo] target(s) in 0.57s
     Running `target/debug/scratchpad`
Traceback (most recent call last):
  File "<string>", line 10, in <module>
NameError: name 'a' is not defined

Hold up. a unused? b unused? name 'a' not defined? What is going on? Warnings aside, the code compiled fine. So the macro did generate valid code.

But it’s wrong, somehow.

Now how do we debug our procedural macro?

There is an unstable rustc option called --pretty=expanded, which will show us the code after all macro expansions. Unlike what the name suggests, the output is usually not very pretty, so it’s a good idea to pass the output through rustfmt for readability. The cargo rustc command allows us to pass custom options to rustc:

$ cargo rustc -- -Z unstable-options --pretty=expanded | rustfmt
   Compiling scratchpad v0.1.0
    Finished dev [unoptimized + debuginfo] target(s) in 0.11s
<snip>

fn main() {
    let a = 3;
    let b = 20;
    run_python("\n\n\n\n\n\n\n\nc = 100\nprint(a + b + c)", |globals| {
        globals.set_item("\"a\"", "a").expect("Conversion failed");
        globals.set_item("\"b\"", "b").expect("Conversion failed");
    });
}

<snip>

Oh, those set_item calls look off.

// What it generated:
globals.set_item("\"a\"", "a").expect("Conversion failed");

// What we wanted:
globals.set_item("a", a).expect("Conversion failed");

Instead of just the identifier a, quote!() produced the string literal "a". So instead of (stringify!(a), a), we got (stringify!("a"), "a"), which expands to ("\"a\"", "a").
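We can check that last step in isolation, since stringify!() is a regular built-in macro that works outside procedural macros too:

```rust
fn main() {
    // An identifier is stringified to just its name...
    assert_eq!(stringify!(a), "a");
    // ...but a string literal is stringified including its quotes.
    assert_eq!(stringify!("a"), "\"a\"");
}
```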

We can’t really blame quote!() here, because we did ask it to insert a String (from captures, which is a Vec<String>). It makes sense that it’ll turn Strings into string literals.

To fix this, we should instead give it Idents, which represent identifiers.

Looking at the documentation of Ident, we see that Idents are made from a string (the name) and a Span. This span is used not only for compiler errors to display their location, but also for macro hygiene. It makes sure that names defined in macros do not get mixed up with names on the outside.
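macro_rules! macros have a similar kind of hygiene for local variables, which makes the idea easy to see in a standalone program. This is only an analogy (the with_hidden_x! macro is made up, and procedural macros control hygiene through the Span they pick), but it shows the effect:

```rust
// A made-up macro that defines its own local `x` before evaluating the
// expression it was given.
macro_rules! with_hidden_x {
    ($e:expr) => {{
        let x = 100;
        let _ = x; // only here to avoid an unused-variable warning
        $e
    }};
}

fn demo() -> i32 {
    let x = 1;
    // The `x` written here still refers to the outer `x`, not the one
    // defined inside the macro: the two names don't get mixed up.
    with_hidden_x!(x + 1)
}

fn main() {
    assert_eq!(demo(), 2);
}
```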

In this case, we do want to refer to variables on the outside, and it’d be nice if errors (e.g. about a variable not existing) would point to the place where the user named it in the capture list.

This means we should not make our own Idents, but simply use the original ones we got out of the TokenStream. This is a pretty simple change to get_captures. All we have to do is change the return type:

fn get_captures(input: TokenStream) -> Option<(Vec<Ident>, TokenStream)>
//                                                 ^^^^^

And replace ident.to_string() by ident.

Easy!

$ cargo r
   Compiling python-macro v0.1.0
error[E0277]: the trait bound `proc_macro::Ident: quote::to_tokens::ToTokens` is not satisfied
  --> python-macro/src/lib.rs:47:5
   |
47 | /     quote!(
48 | |         run_python(
49 | |             #source,
50 | |             |globals| {
...  |
57 | |         );
58 | |     ).into()
   | |_____^ the trait `quote::to_tokens::ToTokens` is not implemented for `proc_macro::Ident`
   |
   = note: required because of the requirements on the impl of `quote::to_tokens::ToTokens` for `&proc_macro::Ident`
   = note: required because of the requirements on the impl of `quote::to_tokens::ToTokens` for `quote::__private::RepInterp<&proc_macro::Ident>`
   = note: required by `quote::to_tokens::ToTokens::to_tokens`
   = note: this error originates in a macro (in Nightly builds, run with -Z macro-backtrace for more info)

Uh. Looks like quote!() can’t handle Ident? quote::ToTokens isn’t implemented for it? But the documentation said—

impl ToTokens for Ident

…

Oh. Wait a minute.

That Ident there links to proc_macro2::Ident, not proc_macro::Ident. What’s up with that? A second proc_macro crate?

The documentation of proc-macro2 explains it is simply a wrapper around proc_macro, but one which also allows using it outside procedural macros, where the compiler-provided proc_macro crate doesn’t exist. Useful for unit testing and more.

For procedural macros, it suggests converting proc_macro::TokenStreams directly to proc_macro2::TokenStreams before doing anything else, since crates like quote and syn use proc_macro2 for everything.

Okay, let’s see.

First we add it as a new dependency in python-macro/Cargo.toml:

[dependencies]
quote = "1.0"
proc-macro2 = { version = "1.0", features = ["span-locations"] }

We turn on the span-locations feature to get line and column information, which is still unstable.

Then, in python-macro/src/lib.rs, we only have to make a few changes:

use proc_macro2::{Delimiter, Ident, LineColumn, Spacing, TokenStream, TokenTree};
//            ^ Only added the 2 here.

#[proc_macro]
pub fn python(input: proc_macro::TokenStream) -> proc_macro::TokenStream {
    //               ^^^^^^^^^^^^                ^^^^^^^^^^^^

    let input = TokenStream::from(input); // Convert it to proc_macro2's TokenStream.

    ...

Okay, fingers crossed…

$ cargo r
   Compiling python-macro v0.1.0
   Compiling scratchpad v0.1.0
    Finished dev [unoptimized + debuginfo] target(s) in 0.76s
     Running `target/debug/scratchpad`
123

Whoa. It worked!

Thanks to quote, generating the right code was quite easy. Quite ergonomic how it allows placeholders like #a to substitute variables.

Uh. Wait. Placeholders. That gives me an idea.

Placeholder syntax

What if we use quote’s solution, instead of the one we stole from C++’s closures?

let a = 3;
let b = 20;
python! {
    c = 100
    print(#a + #b + c)
}

I like it.

Now, is #a the best option? Or should we use @a, $a, ^a, rust:a, «a», or something else?

First of all, it needs to be something that the Rust tokeniser allows. As we’ve seen in part 1A, there’s no way around that.

Then, it’s important to pick something that doesn’t already have a meaning in Python. # is used for comments, @ is used for decorators, etc.

Another consideration is syntax highlighting. Users of our macro will probably be writing their code in an editor that knows how to syntax-highlight Rust code, but knows nothing about our python!{} macro. It’d be nice if our syntax doesn’t completely break syntax highlighting, or better: if our placeholders would show up in some recognizable way.

To do that, we need to pick some existing Rust syntax, which has no meaning in Python.

The first thing that comes to mind is lifetime syntax: 'a. Most Rust editors will understand the ' and the name afterwards belong together, and although single quoted strings are already a thing in Python, those are already unusable in our macro anyway. (See part 1A for details.)

Sounds like this can work!

Here’s how that would look:

let a = 3;
let b = 20;
python! {
    c = 100
    print('a + 'b + c)
}

Nice. Let’s do this.

Implementation

What do we need to do to implement this? Let’s see…

Throw out the get_captures function and call.
Modify our reconstruction function to detect 'a-syntax, replace it by a variable, and remember the Ident for later.
Modify the quote!() to use our new list of Idents.
Done?

Users might refer to the same variable multiple times, but we should capture them only once. So it’s probably a good idea to use a set or map to store all the placeholders we find:

struct Source {
    // <snip>
    captures: BTreeMap<String, Ident>,
}
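The map’s entry API takes care of the ‘only once’ part. A small sketch, with a usize position standing in for the Ident (an assumption to keep it runnable outside a procedural macro; collect_captures is a made-up helper):

```rust
use std::collections::BTreeMap;

// Record each name once. The usize "first seen at" position is only a
// stand-in for the Ident we'd store in the real macro.
fn collect_captures(names: &[&str]) -> BTreeMap<String, usize> {
    let mut captures = BTreeMap::new();
    for (position, name) in names.iter().enumerate() {
        // or_insert only stores the value if the key wasn't there yet,
        // so later occurrences of the same name are ignored.
        captures.entry(name.to_string()).or_insert(position);
    }
    captures
}

fn main() {
    let captures = collect_captures(&["b", "a", "b"]);
    // A BTreeMap iterates in key order, and keys() and values() line up.
    let keys: Vec<&String> = captures.keys().collect();
    let values: Vec<&usize> = captures.values().collect();
    assert_eq!(keys, ["a", "b"]);
    assert_eq!(values, [&1, &0]);
}
```

The same key order for keys() and values() is what lets us later iterate over names and variables side by side.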

In our Source::reconstruct_from function, we now need to look for '-tokens and extract the token after it.

It currently looks like this:

    fn reconstruct_from(&mut self, input: TokenStream) {
        for t in input {
            // <snip>
        }
    }

We now no longer want to process exactly one token per iteration, but sometimes consume the next one as well (to see what’s after a '). This means we can’t easily use a for-loop anymore, as it doesn’t give us access to the underlying iterator. Using a while let lets us define the iterator ourselves:

        let mut input = input.into_iter();
        while let Some(t) = input.next() {
            // Now we can use input.next() to consume more tokens.
        }
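As a standalone illustration of that pattern, here’s a sketch on plain string slices rather than real tokens (join_markers is a made-up helper, not part of the macro):

```rust
// Glue every "'" marker to the item that follows it.
fn join_markers(tokens: Vec<&str>) -> Vec<String> {
    let mut out = Vec::new();
    let mut input = tokens.into_iter();
    while let Some(t) = input.next() {
        if t == "'" {
            // Because we own the iterator, we can pull the next item
            // right here, in the middle of the loop body.
            if let Some(next) = input.next() {
                out.push(format!("'{}", next));
                continue;
            }
        }
        out.push(t.to_string());
    }
    out
}

fn main() {
    let joined = join_markers(vec!["print", "'", "a", "+", "c"]);
    assert_eq!(joined, ["print", "'a", "+", "c"]);
}
```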

The body of the loop looked like this:

            if let TokenTree::Group(g) = t {
                // <snip>
            } else {
                self.add_whitespace(t.span().start());
                self.add_str(&t.to_string());
            }

There’s one case that handles group tokens and recurses into them, and another case for all other tokens, including '-tokens.

Right after the add_whitespace in the else case, we add:

                if let TokenTree::Punct(t) = &t {
                    if t.as_char() == '\'' {
                        if let Some(TokenTree::Ident(var)) = input.next() {
                            let varname = format!("_rust_{}", var);
                            self.col += var.to_string().len() + 1;
                            self.source += &varname;
                            self.captures.entry(varname).or_insert(var);
                            continue;
                        } else {
                            unreachable!();
                        }
                    }
                }

If the token we see is a '-token, we consume the next token and assume that’s an identifier. (If it wasn’t, the Rust tokeniser would’ve already rejected the code.)

Instead of 'a, we add _rust_a to the Python code (since we can’t name variables 'a), and remember both this name and the Ident in our captures map if it wasn’t seen before.

Finally, we update our quote!(), which doesn’t change much:

    let name = s.captures.keys();
    let var = s.captures.values();

    quote!(
        run_python(
            #source,
            |globals| {
                #(
                globals
                    .set_item(#name, #var)
                    .expect("Conversion failed");
                )*
            }
        );
    ).into()

Done?

Let’s try!

fn main() {
    let a = 3;
    let b = 20;
    python! {
        c = 100
        print('a + 'b + c)
    }
}

$ cargo r
   Compiling python-macro v0.1.0
   Compiling scratchpad v0.1.0
    Finished dev [unoptimized + debuginfo] target(s) in 0.72s
     Running `target/debug/scratchpad`
123

Success!

🎉

Identifier spans

Since we used the original Idents everywhere, we should get nice error messages when trying to use a variable that doesn’t exist:

let a = 3;
let b = 20;
python! {
    print('a + 'b + 'c)
}

$ cargo r
   Compiling scratchpad v0.1.0
error[E0425]: cannot find value `c` in this scope
 --> src/main.rs:8:25
  |
8 |         print('a + 'b + 'c)
  |                         ^^ help: a local variable with a similar name exists: `a`

Awesome!

Note how it points to both the ' and c. rustc knows these two tokens belong together, because we used a syntax it is already familiar with.

What’s next

Now we have a way to very easily get data into our python!{} blocks. It works for everything that implements ToPyObject, including strings, numbers, and all kinds of collections.

What’s still missing is a nice way to get data out. We’ll look at that in a later post, but first we’re going to look at a very different topic.

In part 3, we’re going to ‘compile’ the Python code into Python byte-code at compile time, to catch errors like invalid Python syntax before even running it. This also speeds up execution times, by moving part of the work into the compilation step.
