
about
Sinjoy is a first-year Master's student in Computer Science and Engineering at the Pennsylvania State University. His coursework focuses on the design and analysis of algorithms, computer architecture, machine learning, computer vision and NLP.
He joined the Human Language Technologies Lab, led by Dr. Shomir Wilson, as a student researcher. He is working on improving PrivaSeer, a search engine covering ~3.1 million privacy policies. He is also involved in analyzing legal documents with LLMs to find contradictions and inconsistencies in them. His research interests broadly span Natural Language Processing, Machine Learning and AI for large-scale document analysis.
Prior to joining Penn State, Sinjoy worked as a Senior Research Engineer at Siemens Healthineers. As part of the Data Analytics team, he contributed to multiple projects in image and language processing. He designed and developed an automated grading system for gamma camera images, which led to a patent filing.
He has prior experience in fine-tuning existing LMs for downstream tasks such as entity recognition, clustering, retrieval and re-ranking for semantic search. He led the deployment of an on-prem search and summarization engine that recommends corrective actions to service engineers based on historic service tickets, which helped reduce downtime and save license costs. He also contributed to automated analysis of failure modes across the installed base, monitoring trends over time in KPIs such as service costs, parts replaced and downtime, which led to the redesign of a SPECT subcomponent.
He was previously a research intern at Sensordrops Networks (incubated at IIT Kharagpur), where he worked on improving federated learning for non-IID datasets, and at the Plant Vision Lab, University of Nebraska-Lincoln, where he worked on hyperspectral image analysis, which led to a journal publication.
Check out his GitHub profile!
work
experience

Senior Engineer - Research and Technology
Siemens Healthineers, Bangalore, KA
Mar 2023 - Aug 2024
- Fine-tuned a Siamese BERT network on query-result pairs for re-ranking search results, obtaining an NDCG@10 of 0.71.
- Led the deployment of a search API using FastAPI on AWS, leveraging an on-premise LLM and the Milvus vector database to serve RAG pipelines and low-code platforms such as Power Apps and a Teams chatbot, saving $20,000 per annum in license costs.
- Designed a semi-supervised system for identifying failure modes from service tickets using Named Entity Recognition (NER), HDBSCAN clustering and generative labelling by Orca-2, leading to an invention disclosure.
- Created a Power BI dashboard for visualizing failure modes across the installed base and monitoring trends over time in KPIs such as service costs, parts replaced and downtime, leading to the redesign of a SPECT subcomponent.
- Developed a histopathology whole-slide image (WSI) segmentation model using DeepLabV3+ and ResNet50 (Dice 0.85, IoU 0.63). Optimized it using ONNX, pruning, quantization, normalization and background separation, achieving sub-15-minute inference.

Engineer - Research and Technology
Siemens Healthineers, Bangalore, KA
Dec 2021 - Feb 2023
- Implemented end-to-end Cause and Action phrase extraction from service tickets using RoBERTa and NER, increasing average ticket coverage in search results from 14% to 52% compared to the earlier POS-tagging model.
- Developed an unsupervised method using Word2Vec embeddings and DBSCAN to cluster domain-specific words into a thesaurus, increasing the number of clusters from 15 to 346 and enhancing full-text SQL search for PET/SPECT service tickets.
- Engineered a novel image feature set for a Random Forest classifier and an ROI-agnostic artifact segmentation model for automated grading of SPECT QC images, improving the F1-score from 0.46 to 0.75 and resulting in a patent filing.

Programmer Analyst Trainee
Cognizant Technology Solutions, Kolkata, WB
Nov 2020 - Sep 2021
- Developed and executed automated test scripts using Java Selenium, TestNG, and JUnit to ensure the quality of APIs and web apps.
- Performed requirement analysis and test design for web/mobile apps and worked with developers to resolve defects.
research
experience

Graduate Researcher
Human Language Technologies Lab, Penn State
Sep 2024 - Present
- Optimized visualization rendering latency for PrivaSeer, a privacy policy search engine containing ~3.1 million policies.
- Leveraged an on-premise Llama 3.1-8B-it model to extract contradictory and inconsistent statements from privacy policies, with results returned in JSON format.
- Designed output parsing and self-verification systems to filter incorrectly formatted outputs and reduce hallucinations.

Research Intern
Plant Vision Lab, University of Nebraska-Lincoln
Jul 2021 - Dec 2021
- Developed two methods to predict the onset of plant drought stress: the Dynamic Time Warping (DTW) algorithm on computed phenotypes, and a 1D-CNN for temporal stress propagation on hyperspectral images (F1: 0.98, mean soil water content (SWC) corr: -0.85), leading to a journal publication.

Research Intern
Sensordrops Networks Pvt. Ltd., IIT Kharagpur
Jul 2021 - Dec 2021
- Introduced a novel algorithm for Federated Learning on non-IID data that clusters on client data statistics, surpassing FedAvg accuracy by 2% and reducing aggregation time by 67% compared to clustering on local weights.
education

Master of Science in Computer Science and Engineering
Pennsylvania State University - University Park, PA
Aug 2024 - May 2026
- Courses: Machine Learning - Tools and Algorithms, Design and Analysis of Algorithms, Fundamentals of Computer Architecture

Bachelor of Technology in Electronics and Communication Engineering
Pennsylvania State University - University Park, PA
Oct 2016 - Sep 2020
- Courses: Engineering Math – I & II, Data Structures & Algorithms, Computer Organization & Architecture, Microprocessors, Analog & Digital Circuits, Control Theory, Digital Signal Processing and Communication Theory.
- Thesis: Environmental Monitoring using SFCW Radar for Minor Crack Detection
publications
(and patent under review)
projects

Machine-Generated Text Attribution
Trained a BERT sequence classifier on a novel dataset created using 5 generative models (GPT-2-small, GPT-2XL, Phi-2, Falcon-7B, Mistral-7B-it) prompted with text sourced from Wikipedia and the GSM8K math dataset, achieving an F1-score of 0.65. Studied the impact of input length, prompt domain, and LLM parameter size on attribution accuracy.
