By Anonymous User
Review Details
Reviewer has chosen to be Anonymous
Overall Impression: Good
Content:
Technical Quality of the paper: Excellent
Originality of the paper: Yes
Adequacy of the bibliography: Yes
Presentation:
Adequacy of the abstract: Yes
Introduction: background and motivation: Good
Organization of the paper: Satisfactory
Level of English: Satisfactory
Overall presentation: Good
Detailed Comments:
Overall I enjoyed reading this paper - some very interesting ideas, mostly written and structured in a way that was easy to follow. However, I think some terms need to be introduced a bit sooner to understand some of the finer points, and for a journal paper the conclusion feels a bit sudden after the results section. I think the results and the ideas leading up to them present an opportunity for interesting analysis and discussion: mainly the opportunities for future work, but also how one might interpret the factors (I elaborate later). With or without a discussion section, this is still a good paper worth publication in my opinion.
Significance: The work is significant as a means of overcoming the variable binding problem in an efficient way. The relevance to neurosymbolic AI is clear: the vectors correspond to symbols that may be manipulated and bound in vector space, and a means of improving the efficiency of a neural network using this representation is proposed.
Background: The method certainly fits in nicely to the background methods listed. The relevance of cited work is clear and comparisons are made later in the paper. That said, I'm not familiar enough with VSAs to comment on whether more could have been included.
Novelty: The new methods are a natural progression from and improvement over those cited in related work.
Technical quality:
Experiment descriptions are very detailed, to the extent that I believe anybody interested should be able to replicate them. Tests are performed using common architectures and datasets.
I did, however, wonder if a more suitable dataset might be available. RAVEN has attribute labels (shape, position, etc.) but no class labels for the combinations of attributes; ImageNet and CIFAR have class labels but no attribute labels. The results with these datasets are interesting, but what about a dataset where both attributes and classes are defined? Are there no such datasets?
Presentation:
- Figures are fine in themselves, though in places they may need to be placed nearer the relevant text.
- Structure is mostly okay, except that Section 3 would be better placed before Section 2.
- Otherwise a very pleasant read and well written. The "story" is very clear.
Length: Length is appropriate for the work conducted, though I do feel it could use a page for a discussion and/or future work section.
Data availability: Uses public datasets for all experiments, so availability is good.
Possible discussion section:
=====================
What are the outstanding issues to be addressed in future work?
How do we interpret a "factor" for e.g. ImageNet and CIFAR? Does it have any symbolic meaning? The interpretation of factors and products is, I think, a particularly interesting discussion point, especially as there are various efforts to identify the concepts learned by neural networks and map them to symbols (see the references at the end of this review). In particular, if RAVEN binds values assigned to A, B, C and D, where A is shape, B is position, etc., what might A and B correspond to for ImageNet and CIFAR? I appreciate the authors may not have the answer, and perhaps this slips outside the scope of the paper, but it would make for fascinating future work, and some initial thoughts or discussion on how one might go about finding it would be welcome.
Introduction of terms
================
As somebody new to VSAs, I would have found it easier to read Section 3 before Section 2. I read up to Section 3, then, having read it, decided to start again from the beginning of the paper before continuing with the rest.
Operational Capacity
----------------------------
This term is defined twice:
- "The ratio between the maximum factorizable problem size and the required vector dimensionality" (p 3)
- "The largest problem size for which BCF achieves an accuracy higher than 99% (p 8)."
I appreciate they may mean different things in practice for different methods but what would be a general definition of the term?
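For what it's worth, here is how I tried to reconcile the two; the notation (M_max, D, acc) is my own, not the paper's:

    C_op = M_max / D,    where    M_max = max{ M : acc(M) > 0.99 }

i.e. the p. 8 statement seems to supply the "maximum factorizable problem size" that the p. 3 ratio then divides by the dimensionality D. If that reading is right, stating the general definition once, up front, would help.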
Bundling
-----------
Maybe it's just me, but I couldn't quite get an intuitive grasp of what bundling was, even though at some level I understood it enough to follow the rest of the paper!
One definition given is "The bundling of two or more vectors is defined as their elementwise addition, followed by a selection function that retains the sparsity by setting the largest element of each block to 1 and the remaining elements to 0". This makes sense in terms of how one calculates it, but to me it doesn't clarify what it is for, or in what sense it is a "bundle".
Am I correct in thinking it's a sort of block-wise summary or average of the vectors in the "bundle", i.e. the closest possible representation of all of them?
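To check my own understanding, here is a minimal sketch of the quoted definition as I read it, assuming binary SBCs with one active slot per block; the function and variable names are mine, not the paper's:

    import numpy as np

    def bundle(vectors, block_size):
        # Elementwise addition, then a selection function that keeps only the
        # largest element of each block (set to 1, the rest to 0), as in the
        # definition quoted above.  `vectors` has shape (n, D), with D a
        # multiple of block_size; ties are broken by argmax's first-index rule.
        summed = np.asarray(vectors, dtype=float).sum(axis=0)
        blocks = summed.reshape(-1, block_size)
        out = np.zeros_like(blocks)
        out[np.arange(blocks.shape[0]), blocks.argmax(axis=1)] = 1
        return out.reshape(-1)

    # Toy check with D = 8 and block_size = 4: two binary SBCs with one
    # active slot per block.
    a = np.array([0, 1, 0, 0,   0, 0, 1, 0])
    b = np.array([0, 1, 0, 0,   1, 0, 0, 0])
    print(bundle(np.stack([a, b]), block_size=4))
    # -> [0. 1. 0. 0. 1. 0. 0. 0.]  (block 1 agrees; block 2 is a tie)

If that is roughly right, a sentence giving the intuition (e.g. "the bundle is the sparse vector most similar to all of its members", or whatever the correct reading is) would have helped me a lot.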
By extension I also struggled to understand what "bundling capacity" was - is it related to operational capacity?
Sampling Width
---------------------
Sampling width is important to Section 4.4, but a definition isn't given until Section 4.7: "The sampling width (A) determines how many codevectors will be randomly sampled and bundled in case the thresholded similarity is an all-zero vector"
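In case it helps pin down the ambiguity, this is how I read that sentence (again the names are mine, the surrounding resonator/BCF iteration is omitted, and bundle() is the sketch from my bundling comment above):

    def estimate_or_fallback(thresholded_sim, codebook, sampling_width, block_size, rng):
        # If the thresholded similarity vector is all zeros, randomly sample
        # `sampling_width` (A) codevectors from the codebook and bundle them;
        # otherwise the paper's usual similarity-based estimate applies (omitted).
        if not np.any(thresholded_sim):
            idx = rng.choice(len(codebook), size=sampling_width, replace=False)
            return bundle(codebook[idx], block_size)
        return None  # placeholder for the normal update

Moving the definition (or a pointer to it) up to Section 4.4 would make that section easier to follow.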
Other minor points:
===============
Fig 1:
I wasn't completely sure how to interpret the layout of the figure: does the Binary SBC correspond to the top-right from 1-4 and the GSBC to the bottom-right? In other words, are you binding a binary SBC to a GSBC, or are you trying to show that the method could be applied to either? Could it work with two GSBCs or two binary SBCs?
Also, I thought Fig. 1 and Table 1 were a single figure at first; I would suggest rearranging the layout.
p. 6, Eq. 4: Should the ~ above the x on the right of the equation be a ^?
Section 4.6: Some might argue that F = 2, 3 is a small range to test. How high could it realistically go?
References:
=========
[Chen et al., 2019] Chen, C., Li, O., Tao, D., Barnett, A., Rudin, C. and Su, J.K., 2019. This looks like that: deep learning for interpretable image recognition. Advances in Neural Information Processing Systems, 32.
[Townsend et al., 2020] Townsend, J., Kasioumis, T. and Inakoshi, H., 2020. ERIC: Extracting Relations Inferred from Convolutions. In: Ishikawa, H., Liu, C.L., Pajdla, T. and Shi, J. (eds) Computer Vision – ACCV 2020. Lecture Notes in Computer Science, 12624. Springer, Cham.
[Zhang et al., 2019] Zhang, Q., Yang, Y., Ma, H. and Wu, Y.N., 2019. Interpreting CNNs via decision trees. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 6261-6270).
[Zhang et al., 2018] Zhang, Q., Cao, R., Shi, F., Wu, Y.N. and Zhu, S., 2018. Interpreting CNN knowledge via an explanatory graph. In Thirty-Second AAAI Conference on Artificial Intelligence.