ellyzaveta/course-work-pc

Inverted Index

This repository contains an implementation of a multi-module inverted index system developed as part of a university course on Parallel Computing.

The project focuses on efficient text indexing and search over large document collections, with an emphasis on multithreading, scalability and performance evaluation.

Project Overview

The system is built around a custom implementation of an inverted index data structure and supports concurrent indexing and querying using Java multithreading tools.
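
As a rough illustration of the data structure, an inverted index maps each term to the set of documents that contain it. The sketch below is not the project's implementation: the class name `SimpleInvertedIndex`, its methods, and the tokenization rule are hypothetical, and `ConcurrentHashMap` stands in for the custom concurrent structure described above.

```java
import java.util.Collections;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical minimal inverted index: term -> set of document ids. */
final class SimpleInvertedIndex {
    // Assumption: ConcurrentHashMap replaces the project's custom structure for brevity.
    private final Map<String, Set<String>> postings = new ConcurrentHashMap<>();

    /** Lower-cases the text, splits it on non-letter/non-digit runs, and records docId under each term. */
    void addDocument(String docId, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (token.isEmpty()) {
                continue; // split() yields an empty token for leading separators or empty input
            }
            // computeIfAbsent is atomic per key, so concurrent writers share one posting set per term
            postings.computeIfAbsent(token, k -> ConcurrentHashMap.newKeySet()).add(docId);
        }
    }

    /** Returns a read-only view of the ids of documents containing the term (case-insensitive). */
    Set<String> search(String term) {
        Set<String> docs = postings.get(term.toLowerCase(Locale.ROOT));
        return docs == null ? Collections.emptySet() : Collections.unmodifiableSet(docs);
    }
}
```

Because both `addDocument` and `search` touch only thread-safe collections, several indexing threads and query threads can use one instance at the same time in this sketch.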

The solution follows a client–server architecture and includes a dedicated module for performance comparison under different workloads and thread configurations.

Architecture

The system consists of the following modules:

  • invertedindex – core data structure and indexing logic
  • server – handles client requests and manages indexing processes
  • client – sends search and indexing requests
  • api – defines the application-level communication protocol
  • performanceComparison – evaluates execution time under varying parameters

Figure: interaction of the system modules

Client–Server Architecture

The system follows a client–server architecture with asynchronous request handling.

  • The client connects to the server via network sockets and sends search or indexing requests.
  • The server manages multiple client connections concurrently using a thread pool.
  • If the inverted index is not yet built, the server initiates the indexing process and continuously reports progress back to the client.
  • Communication between client and server is implemented via a custom application-level protocol defined in the api module.

This architecture enables scalable concurrent access to the inverted index while maintaining efficient resource utilization.
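
The thread-pool pattern described above can be sketched as follows. This is a simplified illustration, not the project's server: the class `PooledLineServer`, the one-line-in, one-line-out exchange, and the handler function are all assumptions, and the real protocol is the one defined in the api module.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;
import java.util.function.Function;

/** Hypothetical sketch: each accepted connection is served by a task from a fixed thread pool. */
final class PooledLineServer implements AutoCloseable {
    private final ServerSocket serverSocket;
    private final ExecutorService workers;
    private final Function<String, String> handler;

    PooledLineServer(int port, int threads, Function<String, String> handler) throws IOException {
        this.serverSocket = new ServerSocket(port); // port 0 lets the OS pick a free port
        this.workers = Executors.newFixedThreadPool(threads);
        this.handler = handler;
    }

    /** Starts the accept loop on a daemon thread and returns this server. */
    PooledLineServer start() {
        Thread acceptor = new Thread(this::acceptLoop, "acceptor");
        acceptor.setDaemon(true);
        acceptor.start();
        return this;
    }

    int port() {
        return serverSocket.getLocalPort();
    }

    private void acceptLoop() {
        while (!serverSocket.isClosed()) {
            try {
                Socket client = serverSocket.accept();
                workers.submit(() -> serve(client));
            } catch (IOException | RejectedExecutionException e) {
                // Expected during shutdown: accept() fails once the socket is closed,
                // and submit() is rejected once the pool has been shut down.
            }
        }
    }

    /** Answers every request line with one reply line until the client disconnects. */
    private void serve(Socket client) {
        try (Socket s = client;
             BufferedReader in = new BufferedReader(
                     new InputStreamReader(s.getInputStream(), StandardCharsets.UTF_8));
             PrintWriter out = new PrintWriter(
                     new OutputStreamWriter(s.getOutputStream(), StandardCharsets.UTF_8), true)) {
            String line;
            while ((line = in.readLine()) != null) {
                out.println(handler.apply(line));
            }
        } catch (IOException e) {
            // Connection dropped; try-with-resources has already released the socket.
        }
    }

    @Override
    public void close() throws IOException {
        serverSocket.close();
        workers.shutdownNow();
    }
}
```

A fixed pool bounds the number of clients served at once, so extra connections wait in the pool's queue instead of each spawning a new thread. Progress reporting during indexing is left out of this sketch.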

Figure: client–server architecture

Testing

The system was tested using:

  • Unit testing (core services and concurrent data structures)
  • Integration testing (client–server interaction)
  • Concurrency safety testing
  • End-to-end scenarios
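
A concurrency safety check of the kind listed above might look like the following sketch. It does not reproduce the project's tests: the class `ConcurrentWriteCheck` is hypothetical, and a `ConcurrentHashMap` stands in for the index. Several writers are released at the same moment, and the final posting count is compared with the number of writes, so a lost update shows up as a smaller count.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical lost-update check for a shared term -> document-id map. */
final class ConcurrentWriteCheck {
    /** Runs `threads` writers that each add `docsPerThread` distinct ids under one term; returns the ids stored. */
    static int run(int threads, int docsPerThread) throws InterruptedException {
        Map<String, Set<Integer>> postings = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        CountDownLatch startGate = new CountDownLatch(1);
        for (int t = 0; t < threads; t++) {
            final int base = t * docsPerThread; // each writer gets its own id range
            pool.submit(() -> {
                try {
                    startGate.await(); // hold all writers until the gate opens to maximize contention
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                for (int i = 0; i < docsPerThread; i++) {
                    postings.computeIfAbsent("term", k -> ConcurrentHashMap.newKeySet()).add(base + i);
                }
            });
        }
        startGate.countDown();
        pool.shutdown();
        if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
            throw new IllegalStateException("writers did not finish in time");
        }
        return postings.getOrDefault("term", Collections.emptySet()).size();
    }
}
```

A passing run does not prove thread safety, so checks like this are usually repeated many times and with different thread counts.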

Results

A dedicated performance analysis module was implemented to evaluate the impact of parallelization.

Figure (general plot): overall execution time vs. number of threads and input size

Figure (separate plots): execution time vs. number of threads and input size

Key findings:

  1. Parallel indexing is effective for medium and large datasets

  2. On medium-sized inputs, performance improves by up to 2×

  3. For small datasets, parallelization may be inefficient due to thread overhead

  4. In these experiments, the optimal number of threads equaled the number of logical CPU cores

  5. Increasing threads beyond this limit does not improve performance and may degrade execution time
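
The kind of measurement behind these findings can be sketched as a sweep over thread counts. The code below is an illustration only, not the performance comparison module: the class `ThreadSweep`, the token-counting workload, and the doubling schedule are assumptions. It uses `Runtime.getRuntime().availableProcessors()` to find the number of logical cores and times the same work with pools of increasing size.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical thread-count sweep over a simple token-counting workload. */
final class ThreadSweep {
    /** Splits docs into at most `threads` contiguous chunks and counts whitespace-separated tokens in parallel. */
    static long countTokens(List<String> docs, int threads) throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Long>> parts = new ArrayList<>();
            int chunk = (docs.size() + threads - 1) / threads; // ceiling division
            for (int start = 0; start < docs.size(); start += chunk) {
                List<String> slice = docs.subList(start, Math.min(start + chunk, docs.size()));
                Callable<Long> task = () -> {
                    long n = 0;
                    for (String doc : slice) {
                        String trimmed = doc.trim();
                        n += trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
                    }
                    return n;
                };
                parts.add(pool.submit(task));
            }
            long total = 0;
            for (Future<Long> part : parts) {
                total += part.get();
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }

    /** Prints the wall-clock time for 1, 2, 4, ... threads, up to twice the logical core count. */
    static void sweep(List<String> docs) throws InterruptedException, ExecutionException {
        int cores = Runtime.getRuntime().availableProcessors();
        for (int threads = 1; threads <= 2 * cores; threads *= 2) {
            long t0 = System.nanoTime();
            long tokens = countTokens(docs, threads);
            long ms = (System.nanoTime() - t0) / 1_000_000;
            System.out.println(threads + " threads: " + tokens + " tokens in " + ms + " ms");
        }
    }
}
```

Single runs like this are noisy because of JIT warm-up and pool start-up cost, so a careful comparison would repeat each configuration several times and discard the first runs.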

Dataset

Experiments were conducted using subsets of the IMDB movie reviews dataset with varying input sizes to ensure realistic workload conditions.

Conclusion

The developed system demonstrates that:

  • Properly designed parallel data structures significantly improve scalability

  • Multithreading must be applied selectively, depending on input size

  • Custom concurrency control can outperform naive parallel implementations

Running the Project

Prerequisites

Before running the project, ensure the following tools are installed:

  • Java Development Kit (JDK), in the version specified in pom.xml
  • Apache Maven
  • Git

Verify installation:

java -version
mvn -version
git --version

Getting the Project

Clone the repository from GitHub:

git clone https://github.com/ellyzaveta/course-work-pc.git
cd course-work-pc

Build the Project

The project is a multi-module Maven project. To build all modules and download dependencies, run:

mvn clean install

Configuration

Each module contains its own configuration file located at:

<module-name>/src/main/resources/application.properties

Before running any module, make sure all required properties are properly configured.

Client–Server Mode

Server Configuration

Edit:

server/src/main/resources/application.properties

Configure the following properties:

  • directory.path — path to the directory containing input text files
  • server.port — port on which the server will run
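
To catch a missing or malformed value before starting the server, the two properties can be checked with a few lines of plain Java. This is only a sketch under assumptions: the class `ServerConfigCheck` is hypothetical, the sample values in the usage note are placeholders, and the project itself may read these properties differently since it is started through Spring Boot.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

/** Hypothetical pre-flight check for the server's application.properties content. */
final class ServerConfigCheck {
    /** Parses the properties text and returns the port, failing fast on a missing key or invalid port. */
    static int validatedPort(String propertiesText) throws IOException {
        Properties props = new Properties();
        props.load(new StringReader(propertiesText));
        String dir = props.getProperty("directory.path");
        String port = props.getProperty("server.port");
        if (dir == null || dir.trim().isEmpty()) {
            throw new IllegalArgumentException("directory.path is not set");
        }
        if (port == null) {
            throw new IllegalArgumentException("server.port is not set");
        }
        int value = Integer.parseInt(port.trim()); // throws NumberFormatException on non-numeric input
        if (value < 1 || value > 65535) {
            throw new IllegalArgumentException("server.port out of range: " + value);
        }
        return value;
    }
}
```

For example, the text `directory.path=/data/reviews` followed by `server.port=8080` on the next line (both values are placeholders) passes the check and yields 8080. Note that the properties format treats a backslash as an escape character, so Windows paths need doubled backslashes or forward slashes.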

Client Configuration

Edit:

client/src/main/resources/application.properties

Configure:

  • server.host — server host (e.g., localhost)
  • server.port — must match the server port defined in the server configuration

Run the Server

cd server
mvn spring-boot:run

Run the Client (in a separate terminal)

cd client
mvn spring-boot:run

Performance Comparison Mode

Configuration

Edit:

performanceComparison/src/main/resources/application.properties

Configure:

  • performance.testdata.paths — list of input file paths of different sizes
  • server.port — port for the performance comparison server

Run Performance Comparison

cd performanceComparison
mvn spring-boot:run

After the application starts, open a browser and navigate to:

http://localhost:{port}

where {port} is the value specified in application.properties.

Repository Structure

course-work-pc/ 
│
├── docs/                         # Documentation and results
│   ├── architecture/             # UML and architecture diagrams
│   └── results/                  # Performance evaluation charts
│
├── invertedindex/                # Inverted index core module
├── server/                       # Server-side implementation
├── client/                       # Client application
├── api/                          # Application-level protocol
├── performanceComparison/        # Performance analysis module
│
└── README.md                     # Project overview and results
