Writing Python inside your Rust code — Part 2
In this part, we’ll extend our python!{} macro so that Rust variables can be used seamlessly in the Python code within.
We explore a few options, and implement two alternatives.
Previously: Part 1A
Next: Part 3
Defining globals for Python
In Part 1, we wrote a run_python function that executes Python code using PyO3’s Python::run function:
fn run_python(code: &str) {
    let py = pyo3::Python::acquire_gil();
    if let Err(e) = py.python().run(code, None, None) {
        e.print(py.python());
        panic!("Python code failed");
    }
}
Let’s take another look at the signature of pyo3::Python::run:
pub fn run(
    self,
    code: &str,
    globals: Option<&PyDict>,
    locals: Option<&PyDict>
) -> PyResult<()>
Looks like we can pass a PyDict of global variables where we currently pass None.
Setting variables inside a PyDict looks easy enough: PyDict::set_item only requires the key and value to implement ToPyObject, which is already implemented for a lot of types, including strings, integers, vectors, and many more.
Let’s see how it works, by defining five = 5 from Rust:
fn run_python(code: &str) {
    let py = pyo3::Python::acquire_gil();
    let globals = pyo3::types::PyDict::new(py.python());
    // "five" and 5 are automatically converted to a PyObject by ToPyObject.
    globals.set_item("five", 5).unwrap();
    if let Err(e) = py.python().run(code, Some(globals), None) {
        e.print(py.python());
        panic!("Python code failed");
    }
}
fn main() {
    python! {
        print(five + 1)
    }
}
$ cargo r
Compiling scratchpad v0.1.0
Finished dev [unoptimized + debuginfo] target(s) in 0.34s
Running `target/debug/scratchpad`
6
Hey, that works!
Transparent syntax
Now that we know converting objects from Rust to Python is not going to be a problem (thanks to PyO3’s ToPyObject), we can move on to the problem of how the user will indicate which of their variables need to be converted and injected in the globals dictionary.
The most ergonomic way would probably be something that’s completely transparent:
let a = 5;
python! {
    b = 10
    print(a + b)
}
If we could make this work, users could just refer to any variable in scope as if they didn’t even switch languages. Sounds perfect.
However, our procedural macro does not have access to the surrounding code, so it would not know that a even exists in the Rust code.
In order to know that a comes from Rust but b does not, it’d have to understand the Python code and see that a is used without being initialized, while b is initialized inside the Python code.
This goes much further than parsing b = 10 and understanding that it defines b, meaning it should not be captured.
For example, print is not explicitly defined anywhere, but it does not refer to anything from Rust.
If we think a bit more about this, it only gets worse:
let a = 5;
python! {
    from somelibrary import *
    globals()['c'] = 10
    print(a + b + c)
}
Without fully parsing and running all of the Python code, we can’t possibly do this.
So, there needs to be some way the user can tell the macro which variables need to be captured.
Capture list
Capturing variables... That sounds like something closures (sometimes referred to as lambdas) do. Maybe we can draw some inspiration from there.
In Rust, a closure is defined using ||:
let a = 5;
let f = |b| a + b;
assert_eq!(f(10), 15);
Here, f behaves like a function. It takes one argument (b), and returns a + b.
a was implicitly captured from the environment.
This implicit capturing is exactly what we discussed before, which is not feasible for python!{}.
In Rust, closures always implicitly capture whatever they need from their environment, as described in the Rust Book.
However, if we look at closures in other languages, we see that C++ is one of the few where we can manually specify what should be captured, using []-syntax:
int main() {
    int a = 5;
    auto x = [a] (int b) { return a + b; };
    //       └┬┘ └──┬──┘
    //        │     └── Parameter list
    //        └── Capture list
    assert(x(10) == 15);
}
Looks like C++ already solved our problem, using a capture list. Let’s just steal the idea:
let a = 3;
let b = 20;
python! {
    [a b] // List of variables to capture
    c = 100
    print(a + b + c)
}
I’ve omitted commas or any other syntax apart from the [] and the names, to keep parsing as simple as possible.
If it turns out to work well, we can always improve the syntax later.
Implementation
Alright, let’s see if we can implement this.
We’ll have to parse the capture list and generate code that calls globals.set_item("var", var) for each variable.
This code should end up in run_python and be executed after making the globals dictionary, but before executing Python::run.
Passing arbitrary code to a function is easy in Rust: using impl Fn, run_python can accept a closure containing the code with all the set_item calls.
We’ll have to give the closure access to the PyDict:
fn run_python(
    code: &str,
    f: impl FnOnce(&pyo3::types::PyDict), // new
) {
    let py = pyo3::Python::acquire_gil();
    let globals = pyo3::types::PyDict::new(py.python());
    f(globals); // new
    if let Err(e) = py.python().run(code, Some(globals), None) {
        e.print(py.python());
        panic!("Python code failed");
    }
}
We use FnOnce instead of Fn, because we only call the function once.
This means it can also accept closures that move out of their own captures, and therefore cannot be called more than once.
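To see that difference in isolation, here is a small sketch that doesn’t use PyO3 at all. A plain HashMap stands in for the PyDict, and with_globals is a made-up stand-in for run_python; both are assumptions for illustration only.

```rust
use std::collections::HashMap;

// Hypothetical stand-in for run_python: a HashMap plays the role of the
// globals dictionary, and the closure gets to fill it exactly once.
fn with_globals(f: impl FnOnce(&mut HashMap<String, String>)) -> HashMap<String, String> {
    let mut globals = HashMap::new();
    f(&mut globals);
    globals
}

fn demo() -> HashMap<String, String> {
    let greeting = String::from("hello");
    // This closure moves `greeting` out of its own captures when it runs,
    // so it implements FnOnce but not Fn or FnMut.
    with_globals(move |globals| {
        globals.insert("greeting".to_string(), greeting);
    })
}

fn main() {
    let globals = demo();
    assert_eq!(globals.get("greeting").map(String::as_str), Some("hello"));
}
```

If with_globals asked for impl Fn instead, the closure in demo would be rejected, because calling it a second time would need a greeting that has already been moved away.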
Now on to the hard part, the macro implementation.
First, a function that checks if there’s a capture list, and extracts the names from it:
fn get_captures(input: TokenStream) -> Option<(Vec<String>, TokenStream)> {
    let mut input = input.into_iter();
    let captures = match input.next() {
        Some(TokenTree::Group(g)) if g.delimiter() == Delimiter::Bracket => g.stream(),
        _ => return None,
    };
    let captures = captures
        .into_iter()
        .map(|token| {
            if let TokenTree::Ident(ident) = token {
                ident.to_string()
            } else {
                panic!("Invalid token in capture list");
            }
        })
        .collect();
    Some((captures, TokenStream::from_iter(input)))
}
Let’s go through this step by step:
- It returns an Option, because there might not be a capture list, in which case we’ll just return None.
- It turns the TokenStream into an iterator, so we can extract the first token from it, while tracking our position in the stream.
- It checks if the first token is a Group delimited by [..], and extracts the tokens inside this group.
- It loops over these tokens, checking if they are actually identifiers, and converts them to strings.
- It returns the vector of captured variable names, together with the rest of the token stream.
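The same steps can be tried out without writing a procedural macro. The sketch below assumes a made-up Token enum in place of proc_macro’s TokenTree, where Group stands for a [..] group; only the control flow mirrors get_captures, the types are not the real ones.

```rust
// Hypothetical, simplified token type (an assumption for this sketch).
#[derive(Debug, Clone, PartialEq)]
enum Token {
    Ident(String),
    Punct(char),
    Group(Vec<Token>), // stands for a group delimited by [..]
}

fn get_captures(input: Vec<Token>) -> Option<(Vec<String>, Vec<Token>)> {
    // Turn the tokens into an iterator so we keep our position after the first one.
    let mut input = input.into_iter();
    // No leading group means no capture list.
    let captures = match input.next() {
        Some(Token::Group(inner)) => inner,
        _ => return None,
    };
    // Every token inside the group has to be an identifier.
    let captures = captures
        .into_iter()
        .map(|token| match token {
            Token::Ident(name) => name,
            _ => panic!("Invalid token in capture list"),
        })
        .collect();
    // Return the names together with the rest of the tokens.
    Some((captures, input.collect()))
}

fn main() {
    let tokens = vec![
        Token::Group(vec![Token::Ident("a".into()), Token::Ident("b".into())]),
        Token::Ident("c".into()),
        Token::Punct('='),
    ];
    let (names, rest) = get_captures(tokens).expect("capture list expected");
    assert_eq!(names, vec!["a", "b"]);
    assert_eq!(rest.len(), 2);
}
```

Note that the function consumes its input even when it returns None, so a caller that wants to fall back to the original tokens has to keep a copy around.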
Let’s test it by adding it to our procedural macro:
#[proc_macro]
pub fn python(input: TokenStream) -> TokenStream {
    let (captures, input) = get_captures(input.clone()).unwrap_or((vec![], input));
    dbg!(captures);
    ...
We clone the input before giving it to get_captures, so we still have the original one around in case there was no capture list.
We still have to generate the set_item-code, but for now let’s see if parsing the capture list worked:
fn main() {
    python! {
        [a b]
        ...
    }
}
$ cargo b
Compiling python-macro v0.1.0
Compiling scratchpad v0.1.0
[python-macro/src/lib.rs:37] captures = [
"a",
"b",
]
error[E0061]: this function takes 2 arguments but 1 argument was supplied
That part works.
It shows the names of the variables that should be captured, and then errors out because we’re missing the second argument to run_python.
On to generating that part.
For a capture list like [a b], our macro should generate something like:
|globals| {
    globals.set_item("a", a).unwrap();
    globals.set_item("b", b).unwrap();
}
We can do this directly inside our existing quote!():
quote!(
    run_python(
        #source,
        |globals| {
            #(
                globals
                    .set_item(stringify!(#captures), #captures)
                    .expect("Conversion failed");
            )*
        }
    );
).into()
That #(..)*-syntax is a quote!-feature which will repeat its contents as many times as needed. In this case, for each element of captures.
We again use stringify!() to turn the variable name into a string.
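If you know macro_rules!, this looks a lot like its $(..)* repetition. As a rough analogy (not what quote!() does internally), here is a hypothetical capture! macro that inserts every listed variable into a HashMap standing in for the globals dictionary. The macro name and the HashMap are assumptions for this sketch.

```rust
use std::collections::HashMap;

// Hypothetical declarative macro: repeats the insert once per listed name.
macro_rules! capture {
    ($globals:ident; $($var:ident)*) => {
        $(
            // stringify! turns the identifier into its name as a string.
            $globals.insert(stringify!($var), $var);
        )*
    };
}

fn demo() -> HashMap<&'static str, i32> {
    let a = 3;
    let b = 20;
    let mut globals = HashMap::new();
    capture!(globals; a b);
    globals
}

fn main() {
    let globals = demo();
    assert_eq!(globals.get("a"), Some(&3));
    assert_eq!(globals.get("b"), Some(&20));
}
```

Here the macro receives real identifiers, so stringify! sees a and b rather than string literals.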
Let’s try!
fn main() {
    let a = 3;
    let b = 20;
    python! {
        [a b]
        c = 100
        print(a + b + c)
    }
}
$ cargo r
Compiling python-macro v0.1.0
Compiling scratchpad v0.1.0
warning: unused variable: `a`
--> src/main.rs:5:9
|
5 | let a = 3;
| ^ help: if this is intentional, prefix it with an underscore: `_a`
|
= note: `#[warn(unused_variables)]` on by default
warning: unused variable: `b`
--> src/main.rs:6:9
|
6 | let b = 20;
| ^ help: if this is intentional, prefix it with an underscore: `_b`
warning: 2 warnings emitted
Finished dev [unoptimized + debuginfo] target(s) in 0.57s
Running `target/debug/scratchpad`
Traceback (most recent call last):
File "<string>", line 10, in <module>
NameError: name 'a' is not defined
Hold up. a unused? b unused? name 'a' not defined? What is going on?
Warnings aside, the code compiled fine. So the macro did generate valid code.
But it’s wrong, somehow.
Now how do we debug our procedural macro?
There is an unstable rustc option called --pretty=expanded, which will show us the code after all macro expansions.
Unlike what the name suggests, the output is usually not very pretty, so it’s a good idea to pass the output through rustfmt for readability.
The cargo rustc command allows us to pass custom options to rustc:
$ cargo rustc -- -Z unstable-options --pretty=expanded | rustfmt
Compiling scratchpad v0.1.0
Finished dev [unoptimized + debuginfo] target(s) in 0.11s
<snip>
fn main() {
    let a = 3;
    let b = 20;
    run_python("\n\n\n\n\n\n\n\nc = 100\nprint(a + b + c)", |globals| {
        globals.set_item("\"a\"", "a").expect("Conversion failed");
        globals.set_item("\"b\"", "b").expect("Conversion failed");
    });
}
<snip>
Oh, those set_item calls look off.
// What it generated:
globals.set_item("\"a\"", "a").expect("Conversion failed");
// What we wanted:
globals.set_item("a", a).expect("Conversion failed");
Instead of just the identifier a, quote!() produced the string literal "a".
So instead of (stringify!(a), a), we got (stringify!("a"), "a"), which expands to ("\"a\"", "a").
We can’t really blame quote!() here, because we did ask it to insert a String (from captures, which is a Vec<String>).
It makes sense that it’ll turn Strings into string literals.
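The stringify!() half of this is easy to check in a few lines of plain Rust, no procedural macro needed. The names function is just a made-up helper for the demonstration.

```rust
// Hypothetical helper returning what stringify! produces for an
// identifier and for a string literal.
fn names() -> (&'static str, &'static str) {
    (stringify!(a), stringify!("a"))
}

fn main() {
    let (from_ident, from_literal) = names();
    // An identifier becomes its bare name.
    assert_eq!(from_ident, "a");
    // A string literal keeps its quotes, which is where "\"a\"" came from.
    assert_eq!(from_literal, "\"a\"");
}
```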
To fix this, we should instead give it Idents, which represent identifiers.
Looking at the documentation of Ident, we see that Idents are made from a string (the name) and a Span.
This span is used not only for compiler errors to display their location, but also for macro hygiene.
It makes sure that names defined in macros do not get mixed up with names on the outside.
In this case, we do want to refer to variables on the outside, and it’d be nice if errors (e.g. about a variable not existing) would point to the place where the user named it in the capture list.
This means we should not make our own Idents, but simply use the original ones we got out of the TokenStream.
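Hygiene is easiest to see with a declarative macro. This is only an analogy: macro_rules! handles hygiene for local variables automatically, while in a procedural macro it depends on the Span each Ident carries. The double_with_temp macro below is made up for illustration.

```rust
// Hypothetical macro that defines its own local variable named `tmp`.
macro_rules! double_with_temp {
    ($e:expr) => {{
        let tmp = 2;
        // `$e` still refers to names from the call site, even if one of
        // them is also called `tmp`.
        $e * tmp
    }};
}

fn hygiene_demo() -> i32 {
    let tmp = 10;
    // The macro's own `tmp` does not shadow ours: this is 10 * 2.
    double_with_temp!(tmp)
}

fn main() {
    assert_eq!(hygiene_demo(), 20);
}
```

For python!{} we want the opposite effect for captured names: they should resolve to the user’s variables, which is why reusing the user’s own Idents is the natural choice.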
This is a pretty simple change to get_captures. All we have to do is change the return type:
fn get_captures(input: TokenStream) -> Option<(Vec<Ident>, TokenStream)>
//                                                 ^^^^^
And replace ident.to_string() by ident.
Easy!
$ cargo r
Compiling python-macro v0.1.0
error[E0277]: the trait bound `proc_macro::Ident: quote::to_tokens::ToTokens` is not satisfied
--> python-macro/src/lib.rs:47:5
|
47 | / quote!(
48 | | run_python(
49 | | #source,
50 | | |globals| {
... |
57 | | );
58 | | ).into()
| |_____^ the trait `quote::to_tokens::ToTokens` is not implemented for `proc_macro::Ident`
|
= note: required because of the requirements on the impl of `quote::to_tokens::ToTokens` for `&proc_macro::Ident`
= note: required because of the requirements on the impl of `quote::to_tokens::ToTokens` for `quote::__private::RepInterp<&proc_macro::Ident>`
= note: required by `quote::to_tokens::ToTokens::to_tokens`
= note: this error originates in a macro (in Nightly builds, run with -Z macro-backtrace for more info)
Uh. Looks like quote!() can’t handle Ident? quote::ToTokens isn’t implemented for it?
But the documentation said—
impl ToTokens for Ident
…
Oh. Wait a minute.
That Ident there links to proc_macro2::Ident, not proc_macro::Ident.
What’s up with that? A second proc_macro crate?
The documentation of proc-macro2 explains it is simply a wrapper around proc_macro, but one which also allows using it outside procedural macros, where the compiler-provided proc_macro crate doesn’t exist.
Useful for unit testing and more.
For procedural macros, it suggests converting proc_macro::TokenStreams directly to proc_macro2::TokenStreams before doing anything else, since crates like quote and syn use proc_macro2 for everything.
Okay, let’s see.
First we add it as a new dependency in python-macro/Cargo.toml:
[dependencies]
quote = "1.0"
proc-macro2 = { version = "1.0", features = ["span-locations"] }
We enable the span-locations feature to get line and column information, which is still unstable.
Then, in python-macro/src/lib.rs, we only have to make a few changes:
use proc_macro2::{Delimiter, Ident, LineColumn, Spacing, TokenStream, TokenTree};
//            ^ Only added the 2 here.
#[proc_macro]
pub fn python(input: proc_macro::TokenStream) -> proc_macro::TokenStream {
    //               ^^^^^^^^^^^^                ^^^^^^^^^^^^
    let input = TokenStream::from(input); // Convert it to proc_macro2's TokenStream.
    ...
Okay, fingers crossed…
$ cargo r
Compiling python-macro v0.1.0
Compiling scratchpad v0.1.0
Finished dev [unoptimized + debuginfo] target(s) in 0.76s
Running `target/debug/scratchpad`
123
Whoa. It worked!
Thanks to quote, generating the right code was quite easy.
Quite ergonomic how it allows placeholders like #a to substitute variables.
Uh. Wait. Placeholders. That gives me an idea.
Placeholder syntax
What if we use quote’s solution, instead of the one we stole from C++’s closures?
let a = 3;
let b = 20;
python! {
    c = 100
    print(#a + #b + c)
}
I like it.
Now, is #a the best option? Or should we use @a, $a, ^a, rust:a, «a», or something else?
First of all, it needs to be something that the Rust tokeniser allows. As we’ve seen in part 1A, there’s no way around that.
Then, it’s important to pick something that doesn’t already have a meaning in Python.
# is used for comments, @ is used for decorators, etc.
Another consideration is syntax highlighting.
Users of our macro will probably be writing their code in an editor that knows how to syntax-highlight Rust code, but knows nothing about our python!{} macro.
It’d be nice if our syntax doesn’t completely break syntax highlighting, or better: if our placeholders would show up in some recognizable way.
To do that, we need to pick some existing Rust syntax, which has no meaning in Python.
The first thing that comes to mind is lifetime syntax: 'a.
Most Rust editors will understand that the ' and the name afterwards belong together, and although single-quoted strings are already a thing in Python, those are already unusable in our macro anyway.
(See part 1A for details.)
Sounds like this can work!
Here’s how that would look:
let a = 3;
let b = 20;
python! {
    c = 100
    print('a + 'b + c)
}
Nice. Let’s do this.
Implementation
What do we need to do to implement this? Let’s see…
- Throw out the get_captures function and call.
- Modify our reconstruction function to detect 'a-syntax, replace it by a variable, and remember the Ident for later.
- Modify the quote!() to use our new list of Idents.
- Done?
Users might refer to the same variable multiple times, but we should capture them only once. So it’s probably a good idea to use a set or map to store all the placeholders we find:
struct Source {
    // <snip>
    captures: BTreeMap<String, Ident>,
}
In our Source::reconstruct_from function, we now need to look for '-tokens and extract the token after it.
It currently looks like this:
fn reconstruct_from(&mut self, input: TokenStream) {
    for t in input {
        // <snip>
    }
}
We now no longer want to process exactly one token per iteration, but sometimes consume the next one as well (to see what’s after a ').
This means we can’t easily use a for-loop anymore, as it doesn’t give us access to the underlying iterator.
Using a while let lets us define the iterator ourselves:
let mut input = input.into_iter();
while let Some(t) = input.next() {
    // Now we can use input.next() to consume more tokens.
}
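Here is that pattern on its own, as a minimal sketch over plain string slices instead of a TokenStream. The join_markers function is a made-up example: it glues every ' onto the word that follows it, which needs exactly this kind of extra input.next() inside the loop.

```rust
// Hypothetical helper: joins each "'" with the token that follows it.
fn join_markers(tokens: &[&str]) -> Vec<String> {
    let mut out = Vec::new();
    let mut input = tokens.iter();
    while let Some(t) = input.next() {
        if *t == "'" {
            // Consume one extra token from the same iterator.
            match input.next() {
                Some(name) => out.push(format!("'{}", name)),
                None => out.push("'".to_string()),
            }
        } else {
            out.push(t.to_string());
        }
    }
    out
}

fn main() {
    let joined = join_markers(&["print", "'", "a", "+", "c"]);
    assert_eq!(joined, vec!["print", "'a", "+", "c"]);
}
```

With a for-loop, the loop itself holds on to the iterator, so there is no way to call next() on it from inside the body.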
The body of the loop looked like this:
if let TokenTree::Group(g) = t {
    // <snip>
} else {
    self.add_whitespace(t.span().start());
    self.add_str(&t.to_string());
}
There’s one case for handling group tokens and recursing into them, and another case for all other tokens, including '-tokens.
Right after the add_whitespace in the else case, we add:
if let TokenTree::Punct(t) = &t {
    if t.as_char() == '\'' {
        if let Some(TokenTree::Ident(var)) = input.next() {
            let varname = format!("_rust_{}", var);
            self.col += var.to_string().len() + 1;
            self.source += &varname;
            self.captures.entry(varname).or_insert(var);
            continue;
        } else {
            unreachable!();
        }
    }
}
If the token we see is a '-token, we consume the next token and assume that’s an identifier.
(If it wasn’t, the Rust tokeniser would’ve already rejected the code.)
Instead of 'a, we add _rust_a to the Python code (since we can’t name variables 'a), and remember both this name and the Ident in our captures map if it wasn’t seen before.
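To play with this logic outside of a procedural macro, here is a simplified sketch. It assumes a made-up Token enum instead of proc_macro2’s TokenTree, stores plain Strings where the real code keeps Idents, and separates tokens with a single space instead of tracking line and column positions. The rewrite helper is hypothetical too.

```rust
use std::collections::BTreeMap;

// Hypothetical, simplified token type (an assumption for this sketch).
#[derive(Debug, Clone, PartialEq)]
enum Token {
    Punct(char),
    Ident(String),
}

struct Source {
    source: String,
    // Python name -> Rust name; a map so each variable is captured once.
    captures: BTreeMap<String, String>,
}

impl Source {
    fn reconstruct_from(&mut self, input: Vec<Token>) {
        let mut input = input.into_iter();
        while let Some(t) = input.next() {
            if !self.source.is_empty() {
                self.source.push(' ');
            }
            match t {
                Token::Punct('\'') => {
                    // A ' must be followed by an identifier.
                    if let Some(Token::Ident(var)) = input.next() {
                        let varname = format!("_rust_{}", var);
                        self.source += &varname;
                        // Only the first occurrence is stored.
                        self.captures.entry(varname).or_insert(var);
                    } else {
                        panic!("expected an identifier after '");
                    }
                }
                Token::Punct(c) => self.source.push(c),
                Token::Ident(name) => self.source += &name,
            }
        }
    }
}

// Convenience wrapper returning the rewritten source and the captures.
fn rewrite(tokens: Vec<Token>) -> (String, Vec<(String, String)>) {
    let mut s = Source { source: String::new(), captures: BTreeMap::new() };
    s.reconstruct_from(tokens);
    (s.source, s.captures.into_iter().collect())
}

fn main() {
    // Tokens for: 'a + 'a + c
    let tokens = vec![
        Token::Punct('\''), Token::Ident("a".into()), Token::Punct('+'),
        Token::Punct('\''), Token::Ident("a".into()), Token::Punct('+'),
        Token::Ident("c".into()),
    ];
    let (source, captures) = rewrite(tokens);
    assert_eq!(source, "_rust_a + _rust_a + c");
    assert_eq!(captures.len(), 1);
}
```

The sketch panics where the real code uses unreachable!(), because nothing here guarantees that an identifier follows the '.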
Finally, we update our quote!(), which doesn’t change much:
let name = s.captures.keys();
let var = s.captures.values();
quote!(
    run_python(
        #source,
        |globals| {
            #(
                globals
                    .set_item(#name, #var)
                    .expect("Conversion failed");
            )*
        }
    );
).into()
Done?
Let’s try!
fn main() {
    let a = 3;
    let b = 20;
    python! {
        c = 100
        print('a + 'b + c)
    }
}
$ cargo r
Compiling python-macro v0.1.0
Compiling scratchpad v0.1.0
Finished dev [unoptimized + debuginfo] target(s) in 0.72s
Running `target/debug/scratchpad`
123
Success!
🎉
Identifier spans
Since we used the original Idents everywhere, we should get nice error messages when trying to use a variable that doesn’t exist:
let a = 3;
let b = 20;
python! {
    print('a + 'b + 'c)
}
$ cargo r
Compiling scratchpad v0.1.0
error[E0425]: cannot find value `c` in this scope
--> src/main.rs:8:25
|
8 | print('a + 'b + 'c)
| ^^ help: a local variable with a similar name exists: `a`
Awesome!
Note how it points to both the ' and c.
rustc knows these two tokens belong together, because we used a syntax it is already familiar with.
What’s next
Now we have a way to very easily get data into our python!{} blocks.
It works for everything that implements ToPyObject, including strings, numbers, and all kinds of collections.
What’s still missing is a nice way to get data out. We’ll look at that in a later post, but first we’re going to look at a very different topic.
In part 3, we’re going to ‘compile’ the Python code into Python byte-code at compile time, to catch errors like invalid Python syntax before even running it. This also speeds up execution times, by moving part of the work into the compilation step.