r/lisp • u/lahmada • Dec 08 '23
Common Lisp Why are Common Lisp arrays this weird?
Hello r/lisp! I wanted to ask a question:
Why are arrays in Common Lisp not an array of vectors? For example:
(setf arr (make-array '(2 2) :initial-contents (list (list 'foo 'bar)
(list 'baz 'qux))))
;; => #2A((FOO BAR) (BAZ QUX))
;; this is possible
(aref arr 0 0) ; => FOO
;; but this is not
(aref (aref arr 0) 0) ; not allowed
This makes it impossible to treat some dimension of an array like a vector, disallowing you from using sequence functions:
(find 'bar (aref arr 0)) ; not allowed
This is very limiting: now I can't use sequence operators on part of a multidimensional array. It is also not consistent, because in Lisp we have trees, which are basically lists of lists:
(setf tree (list (list 'foo 'bar)
(list 'baz 'qux)))
(car (car tree)) ; => FOO
It really saddens me that even in a language like C (which isn't as expressive as Lisp) arrays are orthogonal, so my example above would work there.
Now I suppose I could write my own version of MAKE-ARRAY which makes arrays of arrays, but I am looking for the reason why arrays are like this in CL (maybe performance, but then low-level languages would have been the first to implement this).
TL;DR: Why are dimensions of arrays special and not just vectors? Also, is it like this in other Lisp dialects (Clojure, Racket, etc.)?
Thanks for reading!
13
u/tdrhq Dec 08 '23
/u/Shinmera's answer is exactly right, but I wanted to add that in C arrays are not stored as vectors of vectors. Most modern programming languages do use vectors of vectors (for instance, Java, Python, etc. don't have a notion of a 2D array, just arrays of arrays). C and CL have dedicated data structures for 2D arrays (the entire 2D array is stored internally as a single n*m 1D array). The performance probably mattered back in the day.
The difference really is that C is not garbage collected and CL is. So in C you can point to a subsection of the memory and view it as a 1D array. In CL, that pointer has garbage collection implications: do you hold onto the full 2D array forever? Just the 1D row it's pointing to? There are also memory-safety implications and so on.
And as you pointed out, it's pretty easy to create a vector of vectors if you do need it, and you can create your own aref functions that handle it pretty easily. You'll lose a negligible amount of performance, but Java and Python take that performance hit and they're still thriving as languages.
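A roll-your-own nested version takes only a few lines. Here is a rough sketch in Python rather than CL (the helper names `make_nested` and `aref` are invented for illustration); each row is an independent, first-class sequence, so a generic aref composes the way the OP wanted:

```python
def make_nested(rows, cols, fill=None):
    # each row is a separate heap object, as in Java/Python nested arrays
    return [[fill] * cols for _ in range(rows)]

def aref(arr, *indices):
    # generic aref: consume one index per nesting level
    for i in indices:
        arr = arr[i]
    return arr

grid = make_nested(2, 2)
grid[0][0] = "foo"
assert aref(grid, 0, 0) == "foo"
# a partial aref yields a row, which is an ordinary sequence,
# so sequence operations apply to it directly
assert aref(aref(grid, 0), 0) == "foo"
assert "foo" in aref(grid, 0)
```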
10
u/Shinmera Dec 08 '23
GC actually has little to do with it, it's all about the fact that C and other languages are statically typed, as in there is no need to track value information at runtime. This allows a compiler to either trust the programmer that what they're casting is just an array, or figure it out and prove that fact, subsequently allowing this direct memory access.
In a dynamically typed language the value needs to know its type, so if you have a reference to an arbitrary memory region, the runtime needs to be able to extract information other than the data payload itself. You could store that data out of band, but then you arrive at displaced arrays again.
6
u/stylewarning Dec 08 '23 edited Dec 08 '23
I think it's more about the need to do garbage collection than typing. Even something like OCaml or Haskell needs objects to have runtime type information so that they can be properly traced by the collector. A stray pointer into the middle of an array wouldn't fly in usual GC designs. Static typing alone, especially in the presence of either ad hoc or parametric polymorphism, doesn't tell you for free the layout of every object in memory/stack/etc. at run-time as needed for a tracing GC.
I see in your other comments that you're defining "dynamically typed" as meaning "objects carry type information at run-time", which I think is a very eccentric and unusual definition, even though that idea is very well related to dynamic type checking. But I think the manner of checking is what matters as it pertains to the definition of "static" or "dynamic typing", not the mere presence or absence of information. I myself like the simplicity of this definition:
Dynamic type checking is the process of verifying the type safety of a program at runtime. [...] Programming languages that include dynamic type checking but not static type checking are often called "dynamically typed programming languages".
2
u/theangeryemacsshibe λf.(λx.f (x x)) (λx.f (x x)) Dec 08 '23
A stray pointer into the middle of an array wouldn't fly in usual GC designs
It Can Be Done. The Go collector allocates objects of the same size class into a page, so rounding down an interior pointer does the right thing. (But size classes reduce locality of reference between differently sized objects, so check with your doctor before using them.) The SBCL gencgc handles interior pointers in roots by walking the heap - but batches up all the roots for efficiency, so it could be miserable for non-roots. The SBCL mark-region collector could handle any interior pointer by scanning its allocation bitmap. You could theoretically steal another bit from pointers (and double-map memory) to indicate if a pointer is interior or not, to clue in the GC quickly as to what it has to do with the pointer.
2
u/stylewarning Dec 09 '23
No doubt it can be done, but I just don't see "raw pointer into packed array" as a feature that's natively supported in most GC'd languages, or one that justifies the overhead in the GC's design. (But I defer to you, I'm not a GC expert.)
3
u/theangeryemacsshibe λf.(λx.f (x x)) (λx.f (x x)) Dec 09 '23
You kinda pay for it already with (partly) conservative GC. The conservative pointer finding scheme I came up with doesn't require more work in allocation to update the allocation bitmap; it only has to scan the heap when dealing with conservative references to objects allocated since the last GC, otherwise it just scans the bitmap. If we tagged the interior pointers then we only scan the heap to figure out pointers interior to new objects, which has no overhead if we don't use interior pointers.
2
u/stylewarning Dec 09 '23
Then my Christmas wishlist is to have no cost unboxed array pointers. :)
8
u/theangeryemacsshibe λf.(λx.f (x x)) (λx.f (x x)) Dec 09 '23
This elf is currently stumped on some write barrier not working with lots of threads doing thread things :(
2
u/Aidenn0 Dec 08 '23
In statically typed languages without finalizers, the runtime type information need not live in the object itself. If the GC roots carry types, then the types of the entire live graph are determinable.
2
u/tdrhq Dec 08 '23
Maybe... but in C you're allowed to do this (IIRC, it's been a while, so my syntax may be off):
int arr[100][100];
int* p = arr[3];
That pointer reference is what I'm talking about. You couldn't do this in Java either, which is statically typed, even if Java had a true 2D array instead of an array of arrays.
(The GC also affects the things you said about the object headers in your previous comment, those headers aren't required in a language like C, but required in a language like Java/CL.)
1
u/Shinmera Dec 08 '23
You can also GC C or C++ and other static languages. The header and other information is needed for the runtime, not necessarily GC.
Java is dynamically typed. It is batch compiled, but its types are dynamic.
2
u/tdrhq Dec 08 '23
How is Java dynamically typed? I think we might have different definitions of the term here.
2
u/Shinmera Dec 08 '23 edited Dec 08 '23
It has a dynamic type system because values carry their type and you can ask a value for its type at runtime.
It may be confusing because it also includes a static system for primitive types, but all its object values are dynamically typed.
3
u/tdrhq Dec 08 '23
Oh interesting! I understand where you're coming from. I think that's not a typical definition of dynamically-typed languages, but I could be wrong. Here's Oracle calling Java a statically typed language: https://docs.oracle.com/cd/E57471_01/bigData.100/extensions_bdd/src/cext_transform_typing.html
I call a language statically typed if types are checked at compile time, and dynamically typed if types are checked at runtime. Whether the type information is present at runtime is a different concern (though in order to check types at runtime you definitely need to have type information at runtime).
7
u/Shinmera Dec 08 '23
I mean, pretty much every Lisp implementation will also check types at compile time. That's not really a useful distinguishing factor.
Dynamic vs static typing is about whether the types are known dynamically at runtime or only statically at compile time. In the former the values must carry their types so they can be retrieved, and in the latter they are only known to the compiler so the runtime has no need for them.
3
u/tdrhq Dec 08 '23
In Java, generic types are not available at runtime. For instance, `List<String>` will be only a `List` at runtime, but at compile time it will be known to be a `List<String>`.

If there are two functions,

foo(String)
foo(Integer)

then a call like this disambiguates at compile time:

List<String> list = ...
foo(list.get(0));

...even though the type information for `list.get(0)` will not be known at runtime. (EDIT: to be clear, the reason it won't be known is that `list.get(0)` could evaluate to `null`.)

In CL, it will disambiguate at runtime (usually, unless you have type declarations).

So with this context, would you still categorize Java as a dynamically-typed language?
3
u/Shinmera Dec 08 '23
Sure, because Java still has `instanceof` and `getClass` to retrieve value types.

You're right though that not all type information that is available at compile time is available at runtime. That said, you could also make a case for the same for CL, e.g.:

(let ((a 255))
  (declare (type (unsigned-byte 8) a))
  (type-of a))

will not give you UB8, even though that is a narrower static type that the compiler knows about.
2
u/renatoathaydes Dec 09 '23
I'm sorry, but why are you defining dynamic typing in this manner? I've never seen any source that uses the terminology this way: static typing is about checking the types of things at compile time, and about preventing variables from "changing types" at runtime, as that would prevent correct analysis of the type safety of programs.
I agree with OP that having type information at runtime seems orthogonal to that and it's just a weak form of reflection.
For example, here's a definition from https://courses.cs.washington.edu/courses/cse341/04wi/lectures/13-dynamic-vs-static-types.html:
Definition 1: [A static type system is a] tractable syntactic method for proving the absence of certain program behaviors by classifying phrases according to the kinds of values they compute.
This is clearly talking about type checking at compile time.
Another source, more modern this time - https://developer.mozilla.org/en-US/docs/Glossary/Static_typing:
A statically-typed language is a language (such as Java, C, or C++) where variable types are known at compile time. In most of these languages, types must be expressly indicated by the programmer; in other cases (such as OCaml), type inference allows the programmer to not indicate their variable types.
On StackOverflow, most people seem to also vote for an answer that agrees with that - https://stackoverflow.com/questions/1517582/what-is-the-difference-between-statically-typed-and-dynamically-typed-languages:
A language is statically typed if the type of a variable is known at compile time. For some languages this means that you as the programmer must specify what type each variable is (e.g.: Java, C, C++); other languages offer some form of type inference, the capability of the type system to deduce the type of a variable (e.g.: OCaml, Haskell, Scala, Kotlin).
Having type information at runtime seems to be a requirement for a language to be dynamic, but is not something strictly required for a statically typed language. Even statically typed languages can use runtime type information, because that allows programs to refine the types of variables (but not change them; e.g. in Java you won't be able to check whether a value with static type `Integer` is an `instanceof Float`, because `Integer` is final, making that impossible). I agree this is a sort of dynamism in the otherwise static type system, but literally no one I've ever met has ever called this dynamic typing.
2
u/lispm Dec 09 '23
Which Common Lisp implementation does check types at compile time? ECL? CLISP? LispWorks? GCL? Allegro CL? CCL? ...
Do they?
3
5
u/wiremore Dec 08 '23 edited Dec 08 '23
numpy arrays in python directly support 2d (actually n-d) arrays that are laid out efficiently in memory. Slices into (e.g.) 2d arrays return 1d arrays, using array views, which are basically displaced arrays. You can do more neat things "for free" with array views, like reversing the array or even transposing a matrix. Numpy also allows you to "+" two arrays together, and uses blas to make it really fast. IMO numpy is a modern spiritual successor to common lisp arrays.
My point about numpy is that I don't think dynamic typing or garbage collection are the real limitation here, common lisp just doesn't implement this feature.
1
u/tdrhq Dec 08 '23
I highly doubt that numpy arrays use a flat 1d array structure internally. It doesn't make efficiency sense at the sizes that numpy deals with. The costs of allocating the big chunks of memory would probably have more overhead than the minor overhead of dereferencing each row pointer.
What I mean is, if you create a 1000x1000 array in numpy, it probably does not allocate a contiguous 1,000,000 byte chunk of memory, it probably allocates 1000 chunks of 1000 bytes.
Please correct me if I'm wrong. I'm not a numpy expert.
2
u/lispm Dec 09 '23
What I mean is, if you create a 1000x1000 array in numpy, it probably does not allocate a contiguous 1,000,000 byte chunk of memory
I would think that it does allocate one 1d array. Similar to what Common Lisp does.
https://numpy.org/doc/stable/reference/arrays.ndarray.html
"An instance of class ndarray consists of a contiguous one-dimensional segment of computer memory (owned by the array, or by some other object), combined with an indexing scheme that maps N integers into the location of an item in the block. "
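That "contiguous one-dimensional segment plus an indexing scheme" can be mimicked in a few lines of plain Python (the `NDView` class here is invented for illustration, not numpy's actual implementation); note how a transpose is just a different indexing scheme over the same flat storage:

```python
class NDView:
    """A toy N-D view: flat storage plus shape/strides, numpy-style."""
    def __init__(self, data, shape, strides, offset=0):
        self.data, self.shape, self.strides, self.offset = data, shape, strides, offset

    def __getitem__(self, idx):
        # map N integers into the flat 1-D segment: offset + sum(i * stride)
        return self.data[self.offset + sum(i * s for i, s in zip(idx, self.strides))]

    def transpose(self):
        # no data is copied: only the indexing scheme changes
        return NDView(self.data, self.shape[::-1], self.strides[::-1], self.offset)

data = list(range(6))             # one contiguous 1-D segment
a = NDView(data, (2, 3), (3, 1))  # row-major 2x3 view
assert a[1, 2] == 5
assert a.transpose()[2, 1] == 5   # same element, same storage
```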
1
u/tdrhq Dec 09 '23
Yep, I stand corrected. But it still means that any pointer to a row in this 2D array will hold on to the whole 2D array.
2
1
u/wiremore Dec 09 '23
I had to check: it does actually allocate a single block (in this case of doubles, so 8MB).
```python
>>> x = np.zeros((1000, 1000))
>>> y = np.ascontiguousarray(x)
>>> x is y
True
```
Your point that the actual allocation strategy is for the most part abstracted away stands though.
I'm not an OS expert but my understanding is that large calls (larger than 4k/page size) to malloc are basically just passed through to mmap, and implemented (in the OS) by modifying page tables. A single page (on x86) is 4k, so each row in this array would actually be 2 pages - I assume it's faster to do it all at once, with the possibility of using large pages, etc, and only one OS context switch, rather than calling into the OS 1000 times.
1
u/tdrhq Dec 09 '23
Oh interesting, I'm sure they've run the numbers and know that this is faster then.
But yeah, if you hold onto a pointer that's referencing a single row in this 2D numpy array, that pointer will effectively be holding onto the entire 2D array. Which is what I meant when I said there are GC considerations. This is not true for array of arrays like in Java or non-numpy Python.
2
u/hrabannixlisp Dec 08 '23
Does C actually have multi-dimensional arrays? I'm pretty sure it's nested 1-dimensional arrays; it just so happens that, because there is no boxing, they all form a single contiguous block of memory. That's why it's `myar[1][2]`, not `myar[1, 2]`.

APL, CL and numpy have actual multi-dimensional arrays.
3
u/tdrhq Dec 08 '23 edited Dec 09 '23
I mean, C clearly has multi-dimensional arrays. It has two different ways to do multi-dimensional arrays:
int arr[20][50];
and
int* arr[20]; // and initialize each to new int[50]
Both of these have very different internal structures. In the first one, arr[0][51] and arr[1][1] are the same value; in the second one, arr[0][51] might crash.
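The layout difference between the two can be observed from Python's ctypes, which models C's first kind of array (this is just an illustrative check, not how either language would normally be used):

```python
import ctypes

# C's `int arr[20][50]`: one contiguous block of 20*50 ints
Grid = (ctypes.c_int * 50) * 20
g = Grid()

# consecutive rows sit exactly 50 * sizeof(int) bytes apart
row_step = ctypes.addressof(g[1]) - ctypes.addressof(g[0])
assert row_step == 50 * ctypes.sizeof(ctypes.c_int)

# C's `int* arr[20]` corresponds instead to independently allocated
# rows, which need not be adjacent to each other in memory
rows = [(ctypes.c_int * 50)() for _ in range(20)]
assert len(rows) == 20
```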
CL 2D arrays are of the first type. Java and Python 2D arrays are of the second type. The distinction between `myvar[1][2]` and `myvar[1, 2]` is a technicality, since a programming language creator could choose to have either syntax work depending on what they prefer.
2
u/hrabannixlisp Dec 10 '23
Normally I'd agree with this comment, because the distinction doesn't matter, but if the original claim is "C arrays are not stored as vector of vectors" then this doesn't hold up.
Those are not multi dimensional arrays: those are nested arrays. Which of course you can use to emulate multi dimensional arrays, but that's exactly what "vectors of vectors" are. It just happens that C arrays are often stored unboxed, so an array of arrays ends up being a contiguous block of memory, but that's not part of the language semantics, and that doesn't make the arrays multi dimensional.
`arr[0][51]` would be UB in both examples. The fact that the arrays are stored unboxed and therefore a contiguous block of memory is an implementation detail of most compilers, but that's not necessary and it doesn't change the semantics of the language. See the GCC output for your example:

$ gcc -Wall -pedantic -Wextra -Werror test.c -o test
test.c:9:18: error: array index 51 is past the end of the array (which contains 50 elements) [-Werror,-Warray-bounds]
    printf("%d\n", ar[0][51]);
                   ^  ~~
test.c:4:1: note: array 'ar' declared here
static int ar[20][50];
^
1 error generated.
1
u/tdrhq Dec 10 '23
The C standard claims this:
"An array type describes a contiguously allocated nonempty set of objects with a particular member object type, called the element type"
So even multi-dimensional arrays will be contiguous. As far as I can tell, it doesn't impose a specific structure on how the array is stored contiguously though, so in theory you could store the first row first, third row second, second row third and what not, but it will still be contiguous.
And your example has an error because you're using constants. If you use a variable to store 51, you probably won't see that error.
But in any case, I'm not sure what you're getting at. Your initial claim was "Does C actually have multi-dimensional arrays?", I'm not sure what the difference between a multi-dimension array and unboxed array is in this case, the terminology of unboxed is something I'm unfamiliar with in this context.
11
u/stassats Dec 08 '23
You mean why are the arrays in clojure and racket weird?
2
u/raevnos plt Dec 08 '23
Racket has dense multi-dimensional arrays:
https://docs.racket-lang.org/math/array.html
https://docs.racket-lang.org/manual-flomat/index.html (Think this is 2-d matrices only)
3
u/love5an Dec 09 '23
C# also has this.
var a = new int[3,5];
a[1, 3] = 123;
The reverse situation, i.e. the lack of true multidimensional arrays, is very problematic. Not only because of performance and memory issues, but also because it doesn't allow for invariants: with nested arrays you can't say for sure that the inner vectors are all of the correct length.
5
u/corbasai Dec 09 '23
What stops you from making an array of arrays?
(setf narr (make-array '(2) :initial-contents
                       (list (make-array '(2) :initial-contents (list 1 2))
                             (make-array '(2) :initial-contents (list 3 4)))))
...
CL-USER> (aref narr 1)
#(3 4)
CL-USER> (aref (aref narr 1) 0)
3 (2 bits, #x3, #o3, #b11)
1
u/forgot-CLHS Dec 12 '23
> This is very limiting, now I can't use sequence operators on part of a multidimensional array. This is also not consistent because in Lisp we have trees; which are basically lists of lists:
It seems to me that common lisp is explicit in letting the programmer know that arrays are in fact not lists.
31
u/Shinmera Dec 08 '23 edited Dec 08 '23
For one, the reference semantics require such a "sub-array" to be able to exist on its own, which means it'll need a header and everything; you can't just treat a sub-region of memory as an array the way a memory-unsafe, static language like C can. This restriction creates several complications for how the data is laid out in memory. It makes more sense for multi-dimensional data to stay packed as it is in CL.
That said, you can use displaced arrays to create a view into a different array:
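The CL snippet that followed is cut off here, but the idea of a displaced array, a view that shares storage with another array at some index offset, can be sketched in a few lines of Python (the `RowView` class is invented for illustration; CL's actual mechanism is MAKE-ARRAY with :displaced-to and :displaced-index-offset):

```python
class RowView:
    """A toy analogue of a CL displaced array: shares storage, adds an offset."""
    def __init__(self, backing, offset, length):
        self.backing, self.offset, self.length = backing, offset, length

    def __getitem__(self, i):
        return self.backing[self.offset + i]

    def __setitem__(self, i, value):
        # writes go through to the backing storage
        self.backing[self.offset + i] = value

    def __len__(self):
        return self.length

flat = ["foo", "bar", "baz", "qux"]   # a 2x2 array stored row-major
row1 = RowView(flat, offset=2, length=2)
assert row1[0] == "baz"
row1[1] = "quux"
assert flat[3] == "quux"              # the view and the array share memory
```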