Building an Advanced File Management System with Elasticsearch

Creating a Feature-Rich File Management Solution

Welcome to this comprehensive guide on building an advanced file management system. In this article, we will implement functionality for seamless file organization, manipulation, and collaboration. The system will integrate with cloud services, leverage Optical Character Recognition (OCR) for advanced search, and enforce user-specific, permission-based actions. By the end of this guide, you’ll have a fully functional application that combines frontend and backend technologies into a powerful tool for efficient file management.

Table of Contents

  1. Introduction 
    • The Evolution of File Management Systems
  2. Technical Stack Overview 
    • Understanding the Technologies at Play
  3. Setting Up the Backend with ExpressJS and SQL 
    • Creating a Solid Foundation for Your Application
    • Designing a Database Schema to Support Key Features
  4. Building the Angular Frontend 
    • Structuring Your Angular Application for Maximum Efficiency
    • Exploring the Advantages of Angular Bootstrap
  5. Permission-Based Actions: Ensuring Data Security 
    • Implementing Role-Based Permissions
    • Rendering UI Elements Based on User Roles
  6. Creating a Windows-Like Directory Structure 
    • Crafting a Hierarchical Directory Representation
    • Displaying Files and Folders in a Familiar Manner
  7. Implementing File Manipulation Functionalities 
    • A Step-by-Step Guide to Create, Copy, Move, Rename, and Delete Actions
  8. Document Previews and Annotations with PSPDFKit 
    • Leveraging PSPDFKit for Document Preview and Annotation
    • Storing Annotations in SQL Database for Consistency
  9. Integrating Advanced Search with OCR 
    • Utilizing Google Document AI for OCR
    • Enhancing Search with OCR Annotations and Highlighting
  10. Overcoming Limitations with GCP Cloud Store 
    • Addressing PDF Processing Limitations with GCP Store
    • Achieving Efficient Processing and Collaboration
  11. Bringing It All Together: User Workflow 
    • Navigating the Fully Functional File Management System
  12. Conclusion 
    • Empowering Users with a Robust File Management Solution

1. Introduction

Modern file management systems have evolved to cater to the needs of users in the digital age. Managing files efficiently is crucial for productivity and collaboration. In this guide, we’ll delve into the intricacies of building an advanced file management system that goes beyond the basics. Whether you’re a seasoned developer or just starting, this guide will provide you with a roadmap to create a feature-rich application that addresses the complex requirements of modern file management.

2. Technical Stack Overview

Before we dive into the implementation details, let’s take a moment to understand the technologies we’ll be using to build this advanced file management system. Our chosen technical stack includes:

  • Angular: A frontend framework for building dynamic and responsive user interfaces.
  • ExpressJS: A backend framework for creating robust APIs and handling HTTP requests.
  • MS SQL: A relational database management system for storing user data, file metadata, and annotations.
  • AWS S3: A cloud storage service for storing files and folders securely.
  • GCP: Google Cloud Platform, used for its cloud services, including GCP Cloud Store.
  • Document AI: A powerful tool for Optical Character Recognition (OCR) capabilities.
  • Elasticsearch: A high-performance, full-text search and analytics engine for efficient document indexing and searching.

These technologies will work together to create a cohesive and powerful solution.

3. Setting Up the Backend with ExpressJS and SQL

In this section, we’ll set up the backend of our application using ExpressJS and MS SQL. The backend will handle requests, authenticate users, and manage data storage.

Creating a Solid Foundation: ExpressJS simplifies backend development by providing a minimalistic framework for building APIs. Let’s start by setting up the basic structure of our Express application:

// Import required modules
const express = require('express');
const app = express();
const port = 3000;

// Configure middleware
app.use(express.json());

// Define API routes
app.get('/', (req, res) => {
  res.send('Welcome to our file management system');
});

// Start the server
app.listen(port, () => {
  console.log(`Server is running on port ${port}`);
});
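
Real routes will delegate to data-access helpers backed by the database. Below is a minimal sketch of a file-listing handler, kept separate from the Express wiring so it can be exercised without a running server — `getFilesForUser` is a hypothetical stand-in for a SQL query against the Files table we design next:

```javascript
// Hypothetical data-access helper; in the real app this would run a
// SQL query against the Files table.
function getFilesForUser(userId) {
  const files = {
    1: [{ id: 10, filename: 'report.pdf', path: '/docs/report.pdf' }],
  };
  return files[userId] || [];
}

// Route handler kept separate from Express wiring so it is easy to test
function listFilesHandler(req, res) {
  const userId = Number(req.params.userId);
  res.json(getFilesForUser(userId));
}

// Wiring it up would look like:
// app.get('/users/:userId/files', listFilesHandler);
```

Keeping handlers as plain functions like this makes them trivial to unit-test with stubbed request and response objects.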

 

Designing a Database Schema: MS SQL will be our database management system. We’ll design a database schema to store user information, file metadata, and annotations:

-- User table for authentication and authorization
CREATE TABLE Users (
  id INT PRIMARY KEY,
  username VARCHAR(50),
  password VARCHAR(100),
  role VARCHAR(20)
);

-- File table to store file metadata
CREATE TABLE Files (
  id INT PRIMARY KEY,
  filename VARCHAR(100),
  path VARCHAR(255),
  owner_id INT,
  FOREIGN KEY (owner_id) REFERENCES Users(id)
);

-- Annotations table to store PSPDFKit annotations
-- (MS SQL has no JSON column type; JSON is stored as NVARCHAR(MAX))
CREATE TABLE Annotations (
  id INT PRIMARY KEY,
  file_id INT,
  annotation_data NVARCHAR(MAX),
  FOREIGN KEY (file_id) REFERENCES Files(id)
);

This schema will serve as the foundation for storing essential data related to users, files, and annotations.

4. Building the Angular Frontend

Now that we have our backend foundation set up, let’s move on to building the frontend of our application using Angular. Angular provides a structured approach to frontend development, which makes it easier to create maintainable and scalable applications.

Structuring Your Angular Application: Angular applications are organized into components, services, and modules. Let’s create a basic structure for our Angular application:

src/
|-- app/
|   |-- components/
|   |   |-- file-list/
|   |   |   |-- file-list.component.ts
|   |   |   |-- file-list.component.html
|   |-- shared/
|   |   |-- components/
|   |   |   |-- pdf-viewer/
|   |   |   |   |-- pdf-viewer.component.ts
|   |   |   |   |-- pdf-viewer.component.html
|   |   |-- services/
|   |   |   |-- files-management.service.ts
|   |   |   |-- user.service.ts
|   |-- app.module.ts
|   |-- app.component.ts
|   |-- app.component.html
|-- assets/
|-- index.html

In this structure, we have components for the file list, services for managing files and users, and the main app.module.ts and app.component.ts files.

Exploring Angular Bootstrap: Angular Bootstrap is a popular library that provides a set of UI components compatible with Angular applications. Let’s explore how to integrate Angular Bootstrap into our application for consistent and appealing UI elements.

First, install the necessary packages:

npm install bootstrap ngx-bootstrap

In your styles.scss file, add the following line to import the Bootstrap styles:

@import '~bootstrap/dist/css/bootstrap.min.css';

Now, you can use Bootstrap components in your Angular templates:

<button class="btn btn-primary">Create File</button>

By utilizing Angular Bootstrap, we ensure a unified and visually pleasing user interface throughout our application.

5. Permission-Based Actions: Ensuring Data Security

One of the critical aspects of any application is ensuring data security and access control. In our file management system, we’ll implement permission-based actions to control what users can do based on their roles.

Implementing Role-Based Permissions: We’ll define different roles for users, such as “admin” and “user,” and associate certain actions with these roles. For example, only an “admin” role should have the permission to delete files or folders.

// user.service.ts
import { Injectable } from '@angular/core';

@Injectable({
  providedIn: 'root',
})
export class UserService {
  userRoles: { [username: string]: string };

  constructor() {
    // Simulate user roles (in a real application, fetch from the backend)
    this.userRoles = {
      'admin': 'admin',
      'user1': 'user',
      'user2': 'user',
    };
  }

  getUserRole(username: string): string {
    return this.userRoles[username] || 'user'; // Default role is "user"
  }
}

Rendering UI Elements Based on User Roles: To render UI elements based on user roles, we’ll use Angular’s built-in structural directives like ngIf. For instance, the “Delete” button should only be visible to users with the “admin” role:

<button class="btn btn-danger" *ngIf="userRole === 'admin'">Delete</button>

By implementing role-based permissions, we ensure that only authorized users can perform certain actions, enhancing data security and integrity.
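
Scattering role checks across templates gets hard to maintain; a single role-to-action map keeps them consistent. Here is a small sketch — the roles and actions mirror the examples in this section, and the map itself is illustrative:

```javascript
// Central role -> allowed actions map. "admin" can do everything a
// "user" can, plus delete. Unknown roles get no permissions.
const rolePermissions = {
  user: ['create', 'copy', 'move', 'rename'],
  admin: ['create', 'copy', 'move', 'rename', 'delete'],
};

function canPerform(role, action) {
  return (rolePermissions[role] || []).includes(action);
}
```

A template check then becomes `*ngIf="canPerform(userRole, 'delete')"`, and adding a new role or action is a one-line change to the map.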

6. Creating a Windows-Like Directory Structure

In this section, we’ll focus on creating a familiar directory structure that mimics the hierarchy seen in Windows Explorer.

Crafting a Hierarchical Directory Representation: To achieve this, we’ll design a hierarchical structure using Angular components. Each component will represent a folder, and nesting these components will create the desired directory layout.

// directory.component.ts
import { Component, Input } from '@angular/core';

@Component({
  selector: 'app-directory',
  templateUrl: './directory.component.html',
  styleUrls: ['./directory.component.css'],
})
export class DirectoryComponent {
  @Input() folderName: string;
  @Input() subfolders: string[];
}

<!-- directory.component.html -->
<div class="folder">
  <p>{{ folderName }}</p>
  <div class="subfolders">
    <app-directory *ngFor="let subfolder of subfolders" [folderName]="subfolder"></app-directory>
  </div>
</div>

Displaying Files and Folders: We’ll use the DirectoryComponent to display both files and subfolders. Here’s how you might structure your template to display a Windows-like directory structure:

<!-- file-manager.component.html -->

<app-directory [folderName]="'Root'" [subfolders]="['Documents', 'Pictures']"></app-directory>

With this approach, we create a nested structure that mirrors the hierarchy seen in Windows Explorer, providing users with a familiar and intuitive navigation experience.
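
On the backend, folder data typically arrives as flat paths. A sketch of turning those paths into the nested shape the DirectoryComponent consumes — the `{ name, children }` field names are illustrative:

```javascript
// Build a nested { name, children } tree from flat folder paths.
function buildTree(paths) {
  const root = { name: 'Root', children: [] };
  for (const path of paths) {
    let node = root;
    for (const part of path.split('/').filter(Boolean)) {
      // Reuse an existing child node or create a new one
      let child = node.children.find((c) => c.name === part);
      if (!child) {
        child = { name: part, children: [] };
        node.children.push(child);
      }
      node = child;
    }
  }
  return root;
}

const tree = buildTree(['Documents/Invoices', 'Documents/Reports', 'Pictures']);
```

Each node's `children` array maps directly onto the `subfolders` input of the recursive component.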

7. Implementing File Manipulation Functionalities

Now, let’s move on to implementing essential file manipulation functionalities, including creating, copying, moving, renaming, and deleting files and folders.

A Step-by-Step Guide to Create, Copy, Move, Rename, and Delete Actions: We’ll guide you through each functionality with code snippets and explanations.

a. Create File or Folder

// file.service.ts
import { Injectable } from '@angular/core';

@Injectable({
  providedIn: 'root',
})
export class FileService {
  createFile(filename: string, path: string) {
    // Logic to create a new file
  }

  createFolder(folderName: string, path: string) {
    // Logic to create a new folder
  }
}

b. Copy File or Folder

// file.service.ts
@Injectable({
  providedIn: 'root',
})
export class FileService {
  copyFile(sourcePath: string, destinationPath: string) {
    // Logic to copy a file
  }

  copyFolder(sourcePath: string, destinationPath: string) {
    // Logic to copy a folder
  }
}

c. Move File or Folder

// file.service.ts
@Injectable({
  providedIn: 'root',
})
export class FileService {
  moveFile(sourcePath: string, destinationPath: string) {
    // Logic to move a file
  }

  moveFolder(sourcePath: string, destinationPath: string) {
    // Logic to move a folder
  }
}

d. Rename File or Folder

// file.service.ts
@Injectable({
  providedIn: 'root',
})
export class FileService {
  renameFile(oldName: string, newName: string, path: string) {
    // Logic to rename a file
  }

  renameFolder(oldName: string, newName: string, path: string) {
    // Logic to rename a folder
  }
}

e. Delete File or Folder

// file.service.ts
@Injectable({
  providedIn: 'root',
})
export class FileService {
  deleteFile(filePath: string) {
    // Logic to delete a file
  }

  deleteFolder(folderPath: string) {
    // Logic to delete a folder
  }
}

By following these code snippets, you can implement a comprehensive set of file manipulation actions that provide users with a seamless experience for organizing their files and folders.

8. Document Previews and Annotations with PSPDFKit

In this section, we’ll integrate PSPDFKit to enable document previews and annotations within our file management system.

Document previews and annotations significantly enhance the user experience in our file management system. Integrating PSPDFKit into our application allows users to preview PDF documents seamlessly and collaborate through annotations. Let’s explore how to use the provided PdfViewerComponent to achieve these functionalities, including both the frontend and backend aspects.

Leveraging PSPDFKit for Document Preview and Annotation

PSPDFKit is a robust PDF library that provides extensive features for document rendering, annotation, and manipulation. By integrating PSPDFKit, we can enable users to view PDFs directly within the application and add annotations for collaboration.

Frontend Integration

The PdfViewerComponent serves as the core element for document previews and annotations. It accepts various inputs, such as the PDF URL, search term, item ID, and file data. Additionally, it emits an event indicating whether the system is currently saving annotations.

Here’s a breakdown of the key components of the PdfViewerComponent:

1. Import PSPDFKit: Begin by importing the necessary modules from PSPDFKit and other required components.

import PSPDFKit, { HighlightAnnotation } from 'pspdfkit';
import { FilesManagementService } from '../../shared/services/filesManagementHype';

2. Input Properties: The component accepts input properties such as the PDF URL, search term, item ID, and file data. These inputs enable the component to customize its behavior based on user interactions.

@Input() websiteUrl: string;
@Input() searchTerm: string;
@Input() itemId: string;
@Input() fileData: any;

3. Output Event: The isSaving event emitter is used to render a saving loader during the auto-saving process. This is a vital UI indicator to keep users informed.

@Output() isSaving = new EventEmitter<boolean>();

4. PSPDFKit Instance and Annotations: The component manages the PSPDFKit instance and annotation-related variables.

private pspdfkitInstance: any;
private annotationsChanged = false;
private searchAnnotations = [];
private changeCheckTimer: any;
private annotationsChangeListener: any;

5. Lifecycle Hooks: The ngOnInit and ngOnDestroy lifecycle hooks manage the component’s initialization and cleanup.

ngOnInit(): void {
  if (this.websiteUrl) {
    this.loadPSPDFKitInstance(this.websiteUrl);
  }
}

ngOnDestroy(): void {
  // Cleanup logic
}

6. Annotations Change Timer: The startChangeCheckTimer function initiates a timer to check for changes in annotations. If changes are detected, it triggers an auto-saving process.

private startChangeCheckTimer() {
  // Timer logic
}

7. Saving Annotations to Server: The saveInstantJSONToServer function handles the process of saving annotations to the backend server.

async saveInstantJSONToServer(): Promise<boolean> {
  // Saving logic
}

8. Loading PSPDFKit Instance: The loadPSPDFKitInstance function initializes the PSPDFKit instance with the provided PDF URL and any existing annotations.

private loadPSPDFKitInstance(websiteUrl: string): void {
  // PSPDFKit loading logic
}

9. Performing Search and Annotations: The performSearch function highlights searched terms in the PDF by creating highlight annotations.

private performSearch(): void {
  // Search and annotations logic
}

Backend Flow

  1. Storing Annotations: When users add annotations, these annotations are stored in the backend along with the corresponding document’s ID.
  2. Fetching Annotations: Upon loading a document, the PdfViewerComponent fetches annotations associated with the provided item ID from the backend.
  3. Exporting Instant JSON: PSPDFKit’s exportInstantJSON method generates an Instant JSON representation of the document, including annotations.
  4. Saving Annotations: The saveInstantJSONToServer function sends the generated Instant JSON, the item ID, and user information to the backend for storage.

Auto-Saving: The component employs a timer to periodically check for changes in annotations. If changes are detected, the annotations are auto-saved to the server.
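
The auto-save flow described above boils down to a dirty flag plus a periodic check. A framework-free sketch — the `save` callback stands in for `saveInstantJSONToServer`, and in the component the `tick` would run on a `setInterval` timer:

```javascript
// Minimal auto-saver: annotation change events mark the state dirty,
// and a periodic tick persists only when something actually changed.
function createAutoSaver(save) {
  let dirty = false;
  return {
    markChanged() {
      dirty = true;
    },
    // In the component this runs on a timer (e.g. setInterval)
    tick() {
      if (dirty) {
        dirty = false; // clear before saving so new edits re-mark it
        save();
      }
    },
  };
}

let saves = 0;
const autoSaver = createAutoSaver(() => saves++);
autoSaver.tick();        // nothing changed yet -> no save
autoSaver.markChanged();
autoSaver.tick();        // one save
autoSaver.tick();        // flag already cleared -> no save
```

Clearing the flag before saving means edits made during a slow save are not lost — they simply trigger the next tick.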

9. Integrating Advanced Search with OCR and Elasticsearch

Incorporating advanced search capabilities into our file management system significantly enhances its usability. By coupling Optical Character Recognition (OCR) technology with search powered by Elasticsearch, users can effortlessly find specific terms within document content. Let’s dive into how this process works and explore the components involved.

Elasticsearch: A Robust Search Engine

Elasticsearch is a powerful open-source search and analytics engine, well suited to full-text search across large datasets. In our file management system, it serves as the core engine for searching through OCR-processed documents.

OCR Processor: Bringing Text to Searchable Form

OCR is a technology that converts scanned documents, images, or PDFs into machine-readable text. In our system, the ocrprocessor.js module processes files and extracts their textual content for advanced searching, using Google Cloud Document AI for OCR and Elasticsearch for indexing and searching. (The code samples use the OpenSearch client, an API-compatible fork of Elasticsearch, which is why identifiers such as OPEN_SEARCH_INDEX appear below.)

Here’s how the OCR process works:

  1. Loading Dependencies: The necessary modules and libraries, such as the Google Cloud Document AI client and the Elasticsearch client, are imported to enable various functionalities.
  2. Document Processor Service: The processFile function handles the OCR process. It takes a list of files, processes them using the Document AI processor for text extraction, and stores the processed documents in Elastic Search for indexing.
  3. Generating Instant JSON: The processed documents are stored in JSON format with extracted text. This JSON representation is known as Instant JSON and includes text and coordinate information for each token.
  4. Text Extraction and Coordinates: The OCR process involves extracting all tokens’ textual content and their corresponding coordinates. These coordinates are used for highlighting and displaying search results accurately.
  5. Document Storage and Ingestion: The processed documents are then ingested into the OpenSearch index for efficient searching.
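
Step 4 is worth making concrete. Document AI reports token bounding boxes as normalized vertices (values between 0 and 1), while highlighting needs absolute page coordinates; the conversion is simple arithmetic. A sketch, assuming a simplified token shape (real Document AI tokens nest this under `layout.boundingPoly`):

```javascript
// Convert a token's normalized bounding vertices (0..1) into an
// absolute rect on a page of known width/height.
function tokenToRect(token, pageWidth, pageHeight) {
  const xs = token.normalizedVertices.map((v) => v.x);
  const ys = token.normalizedVertices.map((v) => v.y);
  const left = Math.min(...xs) * pageWidth;
  const top = Math.min(...ys) * pageHeight;
  return {
    left,
    top,
    width: Math.max(...xs) * pageWidth - left,
    height: Math.max(...ys) * pageHeight - top,
  };
}

// Example: a token covering the top-left tenth of a 612x792pt page
const rect = tokenToRect(
  {
    normalizedVertices: [
      { x: 0.0, y: 0.0 },
      { x: 0.1, y: 0.0 },
      { x: 0.1, y: 0.1 },
      { x: 0.0, y: 0.1 },
    ],
  },
  612,
  792
);
```

The resulting rects are what we store alongside the extracted text, so search hits can be highlighted at the correct spot later.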

Elasticsearch Integration for Advanced Search

With the OCR output prepared, the next step is wiring it into Elasticsearch so that full-text queries over document content return fast, accurate results.

Connecting to Elasticsearch

The connectOpenSearchClient function establishes a connection to the Elasticsearch cluster. This connection is crucial for indexing and searching documents efficiently.

async function connectOpenSearchClient() {
  ClientOS = new Client({
    node:
      "https://" +
      config.constants.OPEN_SEARCH_AUTH +
      "@" +
      config.constants.OPEN_SEARCH_HOST,
    ssl: {
      rejectUnauthorized: false,
    },
  });
  return ClientOS;
}

Ingesting Documents into Elasticsearch

The ingestDocument function handles the process of ingesting processed documents into the Elasticsearch index. It reformats the documents and prepares them for indexing.

async function ingestDocument(Client, documents) {
  // Ingest multiple documents in a single bulk request
  const params = {
    body: documents
      .map((document) => {
        return [
          {
            index: {
              _index: config.constants.OPEN_SEARCH_INDEX,
              _id: document.uniqueId,
            },
          },
          document,
        ];
      })
      .flat(),
  };

  try {
    const { body } = await Client.bulk(params);
    console.log(body);
  } catch (error) {
    console.log(error);
  }
}

File Ingestor: Indexing for Effective Search

The fileIngestor.js module ingests the processed documents into the OpenSearch index, making their content available for fast full-text search.

Here’s how the indexing process works:

  1. Connecting to OpenSearch: The connectOpenSearchClient function establishes a connection to the OpenSearch cluster, enabling the ingestion of documents.
  2. Document Ingestion: The ingestAllDocuments function is responsible for ingesting the processed documents into the OpenSearch index. It reformats the documents and prepares them for indexing. 
  3. Bulk Ingestion: The documents are ingested in bulk, with each document’s content indexed for efficient searching.
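
To make the bulk format concrete: each document in the request body is preceded by an action line telling the engine which index and ID to use. A standalone sketch of that pairing (the index name 'files' is a stand-in for config.constants.OPEN_SEARCH_INDEX):

```javascript
// Build an Elasticsearch/OpenSearch bulk body: the array alternates
// action metadata lines and document payloads.
function buildBulkBody(indexName, documents) {
  return documents.flatMap((document) => [
    { index: { _index: indexName, _id: document.uniqueId } },
    document,
  ]);
}

const body = buildBulkBody('files', [
  { uniqueId: 'doc-1', text: 'invoice 2023' },
  { uniqueId: 'doc-2', text: 'contract draft' },
]);
// Two documents produce four entries: action, doc, action, doc
```

This is exactly the shape the `ingestDocument` function above constructs with `.map(...).flat()` before calling `Client.bulk`.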

Bringing Advanced Search to the Frontend

Integrating the OCR-processed content into the frontend’s advanced search requires the following steps:

  1. Search Component Enhancement: Enhance the search component to include the option for advanced search using OCR-processed content.
  2. Search Query and Highlighting: When users perform an advanced search, the search query is sent to the backend. The backend searches through the OCR-processed content for matches and sends the highlighted tokens’ coordinates back to the frontend.
  3. Displaying Search Results: Utilize the highlighted coordinates to display search results accurately. The highlighted tokens within the document provide users with context and ease of navigation.
  4. User Interaction: Users can interact with the search results to navigate directly to the relevant parts of the document.

Now let’s integrate the code snippets into the explanation to give an end-to-end picture of how advanced search with OCR works in our file management system.

OCR Processor: Bringing Text to Searchable Form

// ocrprocessor.js
// ... (Other imports and configurations)

async function processFile(files) {
  return new Promise(async (resolve, reject) => {
    // Process the documents using Document AI
    const documentProcessorName = `projects/${config.constants.GOOGLE_PROJECT_ID}/locations/us/processors/${config.constants.GOOGLE_PROCESSOR_ID}`;
    const requestObj = {
      name: documentProcessorName,
      inputDocuments: {
        gcsDocuments: {
          documents: files.map((file) => {
            return {
              gcsUri: `gs://${config.constants.GOOGLE_BUCKET_NAME}/${file.ID}-${file.S3FileName}`,
              mimeType: file.Extension,
            };
          }),
        },
      },
      // ...
    };

    try {
      const [operation] = await client.batchProcessDocuments(requestObj);

      // Wait for operation to complete.
      await operation.promise();

      // ... (Fetching results, preparing JSON, etc.)

      resolve(queueResult);
    } catch (error) {
      console.log(error);
      reject(error);
    }
  });
}

// ... (Other functions and exports)

File Ingestor: Indexing for Effective Search

// fileIngestor.js
// ... (Other imports and configurations)

async function ingestAllDocuments(documents) {
  return new Promise(async (resolve, reject) => {
    try {
      // CONNECT TO OPENSEARCH
      const ClientOS = await connectOpenSearchClient();

      const documentsReformatted = documents.reduce((acc, document) => {
        // ... (Formatting documents for indexing)
      }, []);

      // INGEST MULTIPLE DOCUMENTS
      const params = {
        body: documentsReformatted
          .map((document) => {
            return [
              {
                index: {
                  _index: config.constants.OPEN_SEARCH_INDEX,
                  _id: document.uniqueId,
                },
              },
              document,
            ];
          })
          .flat(),
      };

      // Bulk ingest documents into OpenSearch
      await ClientOS.bulk(params);

      // ... (Other processing and updates)

      resolve();
    } catch (error) {
      console.log(error);
      reject(error);
    }
  });
}

// ... (Other functions and exports)

Bringing Advanced Search to the Frontend

// pdf-viewer.component.ts
// ... (Other imports and class declaration)

export class PdfViewerComponent implements OnInit, OnDestroy {
  // ... (Other properties and methods)

  // Function to perform advanced search
  private performAdvancedSearch(searchTerm: string): void {
    // Send search query to backend
    this.filesManagementService.performAdvancedSearch(searchTerm).subscribe(
      (searchResults: any) => {
        // Highlight search results in the PDF using coordinates
        this.highlightSearchResults(searchResults);
      },
      (error: any) => {
        console.error('Error performing advanced search:', error);
      }
    );
  }

  // Function to highlight search results in the PDF
  private highlightSearchResults(searchResults: any): void {
    if (this.pspdfkitInstance && searchResults) {
      // Iterate through search results and extract coordinates
      for (const result of searchResults) {
        const { page, coordinates } = result;

        // Create a HighlightAnnotation using coordinates
        const highlightAnnotation = new PSPDFKit.Annotations.HighlightAnnotation({
          pageIndex: page - 1,
          rects: coordinates.map((coord: any) => {
            return new PSPDFKit.Geometry.Rect({
              left: coord.x,
              top: coord.y,
              width: coord.width,
              height: coord.height,
            });
          }),
        });

        // Add the highlight annotation to the PDF
        this.pspdfkitInstance.create([highlightAnnotation]);
      }
    }
  }

  // ... (Other methods and lifecycle hooks)
}

By integrating these code snippets into our explanation, we’ve provided a detailed walkthrough of how advanced search with OCR works in our file management system. This approach enhances the search capabilities, allowing users to find specific terms within documents quickly and accurately.

10. Overcoming Limitations with GCP Cloud Store

When dealing with limitations in processing large PDFs with Google Document AI, GCP Cloud Store comes to the rescue.

Addressing PDF Processing Limitations with GCP Cloud Store: Google Document AI cannot process PDFs with more than 15 pages in a single request. To work around this limit, you can use GCP Cloud Store to preprocess large PDFs — for example, splitting them into smaller batches — before sending them to Document AI.

// file.service.ts
@Injectable({
  providedIn: 'root',
})
export class FileService {
  preprocessLargePdf(pdfData: string): string {
    // Logic to preprocess large PDF using GCP Cloud Store
    // (split/optimize the PDF, then return the processed data)
    return pdfData;
  }
}

By utilizing GCP Cloud Store for preprocessing, you ensure that large PDFs are optimized for Document AI processing, enabling efficient collaboration and search capabilities.
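
The preprocessing step is essentially page arithmetic: split the document into batches Document AI will accept. A sketch of computing those batches, using the 15-page limit described above:

```javascript
// Split a document of `totalPages` pages into [start, end] ranges
// (1-based, inclusive) of at most `maxPages` pages each.
function pageChunks(totalPages, maxPages = 15) {
  const chunks = [];
  for (let start = 1; start <= totalPages; start += maxPages) {
    chunks.push([start, Math.min(start + maxPages - 1, totalPages)]);
  }
  return chunks;
}

// A 40-page PDF becomes three batches: [1,15], [16,30], [31,40]
```

Each range would then be extracted into its own PDF in Cloud Store and submitted to Document AI as a separate document.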

11. Bringing It All Together: User Workflow

Now that we’ve covered the implementation details of various functionalities, let’s walk through a user’s interaction with the fully functional file management system.

  1. Login: Users log in to the application using their credentials.
  2. View Directory Structure: Upon logging in, users see the hierarchical directory structure resembling Windows Explorer.
  3. File Manipulation: Users can create, copy, move, rename, and delete files and folders seamlessly.
  4. Document Previews and Annotations: Users can preview documents using PSPDFKit and add annotations for collaboration.
  5. Advanced Search: The search bar allows users to search by file name and content using OCR-enhanced search capabilities.

  6. GCP Cloud Store Integration: When dealing with large PDFs, GCP Cloud Store preprocesses them to overcome Document AI limitations.

12. Conclusion

Congratulations! You’ve reached the end of this comprehensive guide on building an advanced file management system. By leveraging the power of Angular, ExpressJS, SQL, cloud services, and OCR technology, you’ve created a robust solution that empowers users to efficiently manage, collaborate on, and search for their files. The integration of PSPDFKit and GCP Cloud Store ensures a seamless experience for document preview and large PDF processing. Remember, the key to successful software development lies in understanding and addressing complexities, and this guide has equipped you with the tools to do just that. Best of luck as you embark on your journey to build innovative applications that address real-world challenges!

Raza Anis

Senior Software Engineer

Muhammad Hammad Ghani

Software Engineer
