With the spread of cloud services and Internet of Things concept, there is a popularization of machine learning and artificial intelligence based analytics in our everyday life. However, an efficient deployment of these data-intensive services requires performing computations closer to the edge. In this context, in-cache computing, based on bitline computing, is promising to execute data-intensive algorithms in an energy efficient way by mitigating data movement in the cache hierarchy and exploiting data parallelism. Nevertheless, previous in-cache computing architectures contain serious circuit-level deficiencies (i.e., low bitcell density, data corruption risks, and limited performance), thus report high multiplication latency, which is a key operation for machine learning and deep learning. Moreover, no previous work addresses the issue of way misalignment, strongly constraining data placement not to reduce performance gains. In this work we drastically improve the previously proposed BLADE architecture for in-cache computing to efficiently support multiplication operations by enhancing the local bitline circuitry, enabling associativity-agnostic operations as well as in-place shifting inside local bitline groups. We implemented and simulated the proposed architecture in CMOS 28nm bulk technology from TSMC, validating its functionality and extracting its performance, area, and energy per operation. Then, we designed a behavioral model of the proposed architecture to assess its performance with respect to the latest BLADE architecture. We show a 17.5 and 22% area and energy reduction thanks to the proposed LG optimization. Finally, for 16bits multiplication, we demonstrate 44% cycle count, 47% energy and 41% performances gain versus BLADE and show that 4 embedded shifts is the best trade-off between energy, area and performances.