The role of artificial intelligence in drug discovery and development
Elisa Stefanini
Portolano Cavallo, Italy
estefanini@portolano.it
Claudio Todisco
Portolano Cavallo, Italy
ctodisco@portolano.it
The research and development of a new drug is a particularly complex industrial process for companies; it is extremely costly and its outcomes are often uncertain. Indeed, the development of a new drug takes on average between 10 and 15 years, with costs running into the hundreds of millions of euros, and sometimes even several billion, per approved molecule. Conversely, the percentage of compounds that achieve regulatory approval is very low, currently standing at around 12 per cent of the total number of compounds entered into pharmaceutical development programmes.1
Conventionally, two main phases are distinguished, which are conceptually distinct but operationally interconnected: drug discovery and drug development.
Drug discovery encompasses all activities aimed at identifying a new active ingredient with therapeutic potential. It begins with the identification of a biological target – that is, a molecule capable of influencing a pathological process – and continues with the search for a molecule capable of interacting with that target by selectively binding to it and modulating its activity, and which may, in turn, evolve into a new drug. Drug development, on the other hand, encompasses all the activities necessary to demonstrate the compound’s efficacy and safety in humans, to transform it into a stable and reproducible pharmaceutical product, and to obtain marketing authorisation.2
In the discovery phase, artificial intelligence is radically transforming the identification of new molecules, drastically reducing the time and costs of traditional screening. The most established application concerns the identification and validation of targets. Machine learning models – and, in particular, deep neural networks (deep learning)3 – are used to analyse vast amounts of data in order to identify the proteins or biological pathways involved in a given disease, as well as to understand the mechanisms underlying diseases.
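By way of illustration only, the kind of statistical learning involved can be sketched with a toy model: a logistic-regression ‘target scorer’ trained on invented numeric features. All names and data below are hypothetical, and real target-identification pipelines rely on far richer omics data and deep neural networks rather than this minimal linear model.

```python
# Illustrative sketch only: a toy logistic-regression "target scorer"
# trained by gradient descent on invented features (e.g. expression
# change, pathway involvement) to rank hypothetical protein targets.
import numpy as np

rng = np.random.default_rng(seed=1)

# Invented training set: 200 proteins x 3 numeric features, labelled 1
# if (fictitiously) disease-associated.
X = rng.normal(size=(200, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = (X @ true_w + rng.normal(scale=0.5, size=200) > 0).astype(float)

# Logistic regression fitted by plain gradient descent.
w = np.zeros(3)
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-(X @ w)))   # predicted probabilities
    w -= 0.1 * (X.T @ (p - y)) / len(y)  # gradient step on log-loss

# Score new candidate proteins and rank them by predicted probability.
candidates = rng.normal(size=(5, 3))
scores = 1.0 / (1.0 + np.exp(-(candidates @ w)))
ranking = np.argsort(-scores)
print(ranking)
```

The point of the sketch is simply that a model learns a mapping from measurable features to a disease-association score, which can then be used to prioritise candidates for experimental validation.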
A leading example, widely cited by the international scientific community, is AlphaFold, developed by Google DeepMind:4 AlphaFold has made it possible to predict the three-dimensional structure of almost all known proteins with extremely high accuracy, opening up new frontiers in the understanding of diseases and the design of drugs.
AI systems also enable the analysis of libraries containing millions of chemical compounds, evaluating their properties (absorption, distribution, metabolism, excretion and toxicity) and the likelihood of interaction with the target in timescales that would be unthinkable using traditional methods.
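As a purely illustrative sketch of rule-based in-silico screening, the snippet below filters a fictitious compound library using Lipinski’s well-known ‘rule of five’ for drug-likeness. The compound names and property values are invented; real ADMET profiling relies on far more sophisticated predictive models than these threshold rules.

```python
# Hypothetical sketch: screening a compound library against Lipinski's
# "rule of five", a classic drug-likeness heuristic. All data invented.

def passes_rule_of_five(props):
    """Return True if a compound satisfies Lipinski's rule of five."""
    return (
        props["mol_weight"] <= 500      # molecular weight (Da)
        and props["logp"] <= 5          # octanol-water partition coefficient
        and props["h_donors"] <= 5      # hydrogen-bond donors
        and props["h_acceptors"] <= 10  # hydrogen-bond acceptors
    )

# Invented two-compound "library" with precomputed properties.
library = {
    "compound_a": {"mol_weight": 342.4, "logp": 2.1,
                   "h_donors": 2, "h_acceptors": 5},
    "compound_b": {"mol_weight": 612.8, "logp": 6.3,
                   "h_donors": 4, "h_acceptors": 11},
}

hits = [name for name, props in library.items()
        if passes_rule_of_five(props)]
print(hits)  # only compound_a passes every threshold
```

Scaled to millions of compounds, and combined with learned predictors of target affinity, this kind of filtering is what allows AI systems to triage chemical space in timescales unattainable by wet-lab screening alone.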
During the development phase, AI applications are equally widespread. AI can optimise the design of clinical trials, improve patient selection and identify biomarkers predictive of treatment response, helping to reduce the time and cost of clinical trials. In clinical practice, AI is used to analyse data and information from multiple sources – including real-world data, electronic health records and biomarkers – with positive effects on trial efficiency and a reduction in the number of patients exposed to ineffective or unsafe treatments.
Secondary use of data: the new Italian law on artificial intelligence
AI requires the extensive use of data in order to generate reliable models, a requirement that is all the more essential in a sensitive context such as healthcare. This raises the issue of the secondary use of personal data and compliance with the regulations applicable to such processing. In Italy the approach has been particularly restrictive, and the national legislation transposing the General Data Protection Regulation (GDPR) (Regulation (EU) 2016/679) has effectively identified data subjects’ consent as the legal basis for any secondary use of personal data for scientific research purposes, except in certain residual circumstances.5
In this regard, Law No 132 of 23 September 2025 Disposizioni e deleghe al Governo in materia di intelligenza artificiale (Provisions and delegated powers to the Government concerning artificial intelligence), introduced a significant change specifically regarding the use of AI in drug discovery and development. The law provides that the processing of personal data – including data belonging to special categories – carried out by public bodies, private non-profit organisations and IRCCS,6 and also by private entities ‘operating in the healthcare sector within the context of research projects in which public and private non-profit entities participate’, shall be classified as being of substantial public interest pursuant to Article 9(2)(g) of the GDPR, where necessary for the development of AI systems for the purposes of prevention, diagnosis, treatment, and the development of medicines and medical devices, as well as for the creation of databases and baseline models.
In such cases, the law authorises the secondary use of data even in the absence of consent, provided that the data subject receives appropriate information, including through a ‘general privacy notice’ published on the data controller’s website, and that the data are devoid of direct identifiers. The sole exception concerns cases where knowledge of the data subject’s identity is unavoidable or necessary for the protection of that person’s health.
The law represents a significant paradigm shift in the Italian legal system and is expected to have a major impact on the development of AI systems in the healthcare sector, overcoming the rigidity of the Italian legislation on this point.7
Synthetic data
Among the most debated issues regarding the use of AI systems in the healthcare context is the use of synthetic data, that is, data artificially generated by statistical or machine learning models, structured to reflect the statistical properties of real datasets without containing information relating to identified or identifiable individuals.
In the context of drug discovery and drug development, the use of synthetic data addresses a twofold need. On the one hand, it makes it possible to overcome the quantitative limitations of available clinical datasets, which have historically been insufficient to ensure the statistical representativeness required to train reliable AI models, for example in the case of rare diseases. On the other hand, it enables the development, validation and sharing of AI models whilst minimising the use of real personal data, with clear advantages in terms of compliance with the regulatory framework on data protection.
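A minimal sketch of the underlying idea, assuming simple tabular data: fit a multivariate normal distribution to a (here simulated) ‘real’ dataset, then sample synthetic records that reproduce its means and covariances without containing any actual patient record. Production-grade synthetic data generators (for example, GAN- or copula-based) are considerably more elaborate, and the privacy properties of any generator require separate assessment.

```python
# Minimal sketch: generate synthetic tabular "clinical" data by fitting
# a multivariate normal to the real dataset and sampling from the fit.
import numpy as np

rng = np.random.default_rng(seed=0)

# Simulated "real" dataset: rows are patients, columns are two numeric
# clinical measurements (e.g. age, biomarker level). Values invented.
real = rng.multivariate_normal(mean=[60.0, 1.2],
                               cov=[[100.0, 3.0], [3.0, 0.25]],
                               size=500)

# Fit the generative model: sample mean and covariance of the real data.
mu = real.mean(axis=0)
sigma = np.cov(real, rowvar=False)

# Draw synthetic records from the fitted distribution. No synthetic row
# is copied from the real dataset, yet the aggregate statistics match.
synthetic = rng.multivariate_normal(mu, sigma, size=500)

print(np.allclose(real.mean(axis=0), synthetic.mean(axis=0), atol=2.0))
```

The synthetic sample can then be used to develop or share models in place of the real records, subject to the regulatory safeguards discussed in this section.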
At European level, the European Medicines Agency (EMA) Reflection Paper on the use of AI in the lifecycle of medicinal products explicitly recognises synthetic data as a relevant data augmentation technique for expanding training datasets and, in certain cases, supporting objectives of non-discrimination and fairness in AI models used in the pharmaceutical sector. The AI Act also contains explicit references to synthetic data, confirming its regulatory significance.8
At national level, Article 8(3) of the aforementioned Artificial Intelligence Act No 132/2025 stipulates that in the healthcare sector, for the development of AI systems for the purposes of prevention, diagnosis and treatment:
‘the processing of personal data for the purposes of anonymisation, pseudonymisation or synthesis – including data belonging to the special categories referred to in Article 9(1) of the same Regulation (EU) 2016/679 – is always permitted, subject to informing the data subject in accordance with Article 13 of Regulation (EU) 2016/679’.
This provision has significant practical implications. Firstly, it expressly qualifies ‘synthesis’ as an independent purpose of personal data processing in the healthcare sector. Secondly, the provision explicitly extends this right to the special categories of data referred to in Article 9(1) of the GDPR – including health-related data – subject to compliance with the minimum safeguard of providing information to the data subject in accordance with Article 13 of the GDPR. Furthermore, the wording ‘is always permitted’ eliminates the need for case-by-case authorisation, simplifying the procedural framework for operators engaged in pharmaceutical research who intend to generate synthetic datasets from real clinical data.
The position of the regulatory authorities
The use of AI systems in drug discovery and development within the EU is guided by the EMA’s Reflection Paper on the use of AI in the lifecycle of medicinal products, first published in 2023 and subsequently updated in September 2024,9 which remains the primary reference at European level to this day. The document identifies criteria and standards applicable to AI systems used in drug discovery and development processes, favouring a risk-based approach. The degree of risk depends not only on the technology and the quality of the data, but also on the context of use and the degree of influence exerted by the technology, and may vary over the course of the AI system’s lifecycle.
A key principle is that the clinical trial sponsor and/or the applicant/marketing authorisation holder is responsible for ensuring that all algorithms, models, datasets and data processing pipelines are fit for purpose and comply with applicable standards (legal, ethical, technical, scientific and regulatory).
In terms of data transparency and governance, the EMA considers the use of transparent and interpretable models to be preferable, whilst recognising that the use of ‘black box’ models may be acceptable in cases where transparent models demonstrate unsatisfactory performance or robustness, provided this is supported by a predefined risk monitoring and management plan. Particular attention is devoted to the quality of training data: AI/Machine Learning (ML) models are inherently data-driven and therefore vulnerable to the introduction of bias, which is why efforts must be made to acquire training datasets that are balanced and sufficiently large in relation to the context of use.
The level of risk associated with the use of AI and ML systems may vary depending on the stage of the medicinal product’s lifecycle. For example, the risks arising from the use of AI are generally considered to be lower in drug discovery activities, which precede the identification of potentially relevant molecules. In the pre-clinical development phase of medicinal products – where the use of AI can not only generate more robust and reliable data but also limit or even avoid the use of laboratory animals – it is recommended, as far as possible, that GLP (Good Laboratory Practice) be applied to these technologies as well, given the importance of pre-clinical data in assessing the risk-benefit ratio of the drug. As regards clinical trials, however, AI systems must first and foremost follow the guidelines derived from GCP (Good Clinical Practice) and be specifically evaluated within the scope of the trial; consequently, their architecture and functioning must be made clear by the sponsor to the authorities and bodies involved, as well as to the trial participants themselves. The relevant information must therefore be included in the trial protocol.
The principles contained in the Reflection Paper appear fully aligned with those expressed in the current legislative framework, with particular reference to the AI Act: the requirements of transparency, the risk-based approach, the importance of the quality and representativeness of the source data in view of the risk of bias, and the constant need for human oversight.
Reference is also made to other European regulations relevant to these issues, such as, of course, the aforementioned GDPR and the Medical Device Regulation, where the algorithms used may fall within the definition of a medical device.
In summary, the EMA’s position reflects a stance of cautious openness towards the use of AI and synthetic data to support data collection necessary for the registration of medicinal products for human use, particularly in those phases of development where the benefits of their use outweigh the associated risks, based on a case‑by‑case assessment. For this reason, developers are encouraged to engage with regulatory authorities at a very early stage of the research project, especially where no specific guidance is available.
Finally, mention should be made of a very recent document, ‘Guiding Principles for Good AI Practice in Drug Development’, published jointly by the EMA and the Food and Drug Administration (FDA) in January 2026,10 which aims to provide a harmonised framework for the responsible use of AI throughout the entire medicinal product development cycle. The document identifies a series of cross-cutting principles – including the centrality of human oversight, data quality and governance, model robustness, transparency, risk management and continuous performance monitoring – which should guide sponsors and developers in the design, validation and implementation of AI systems. In line with what was already noted in the EMA’s Reflection Paper, this document also places particular emphasis on the need to ensure that algorithms are fit for purpose, that the datasets used are representative and free from avoidable bias, and that the entire process is documented in a way that allows for independent verification. The principles also highlight the importance of assessing the impact of AI on the risk-benefit balance of the medicinal product, as well as ensuring that the outputs of the models are understandable and verifiable by the operators involved.
At national level, the Italian Medicines Agency (AIFA) published the report ‘Intelligenza Artificiale e Salute’ (Artificial Intelligence and Health) in March 2026,11 which provides a systematic overview of the main applications of AI in the pharmaceutical sector, highlighting their opportunities, limitations and regulatory implications. The report pays particular attention to the applications of AI in clinical research, pharmacovigilance, personalised medicine and production processes, showing how the use of predictive models and advanced analytical tools can improve the efficiency of decision-making processes, reduce development times and support a more proactive approach to risk management. AIFA also emphasises the importance of a multidisciplinary approach involving regulatory, clinical, IT and bioethical expertise, as well as the need to invest in digital infrastructure and appropriate professional skills.
Notes
1 AIFA Report: ‘Intelligenza Artificiale e Salute - Come l’IA sta rivoluzionando la ricerca farmaceutica, la medicina di precisione e il futuro della salute globale’, March 2026, available at: www.aifa.gov.it/en/-/intelligenza-artificiale-e-salute-le-agenzie-regolatorie-al-centro-della-rivoluzione-farmaceutica.
2 The development phase is divided into preclinical studies (conducted on cellular models, frequently in vitro, and on animals); Phase I clinical trials (safety and tolerability in healthy subjects or patients); Phase II (efficacy and dosing in small patient populations); Phase III (large-scale efficacy and safety, via randomised controlled trials); and, following authorisation, Phase IV studies (pharmacovigilance and post-marketing monitoring).
3 A Deep Neural Network (DNN) is an AI model composed of multiple layers of interconnected artificial neurons. DNNs are capable of identifying extremely complex non-linear relationships between inputs, outperforming many traditional machine learning algorithms in complex tasks.
4 See https://deepmind.google/science/alphafold.
5 In particular, pursuant to Legislative Decree No 196/2003 adapting the GDPR into Italian law: (1) the processing of health data for research purposes does not require the data subject’s consent where, inter alia, informing the data subjects is impossible, disproportionate or prejudicial, provided that the research programme is approved by the competent ethics committee, the data controller takes all appropriate measures (including a DPIA) and complies with the guidelines of the Italian Data Protection Authority; (2) the further processing of personal data for research or statistical purposes does not require the data subject’s consent where informing the data subjects is impossible, disproportionate or prejudicial, provided that prior authorisation is obtained from the Italian Data Protection Authority.
6 Istituto di Ricovero e Cura a Carattere Scientifico (Scientific Institutes for Research, Hospitalisation and Healthcare). These are public or private hospitals that carry out highly specialised clinical activities integrated with biomedical and health research of national relevance.
7 See note 5.
8 In particular, for the purpose of correcting biases in high-risk AI systems, the AI Act clarifies at Article 10(5) that the processing of special categories of personal data, including health data, is permitted only where the same result cannot be effectively achieved through the use of other data, ‘including synthetic or anonymised data’. This provision effectively introduces a principle of subsidiarity regarding the use of real personal data: the processing of such data is justified only if and to the extent that less intrusive alternatives, including synthetic data, prove unsuitable for achieving the purpose.
9 EMA, ‘Reflection paper on the use of Artificial Intelligence (AI) in the medicinal product lifecycle’, 9 September 2024, available at: www.ema.europa.eu/en/documents/scientific-guideline/reflection-paper-use-artificial-intelligence-ai-medicinal-product-lifecycle_en.pdf.
10 See www.ema.europa.eu/en/news/ema-fda-set-common-principles-ai-medicine-development-0.
11 See www.aifa.gov.it/documents/20142/3346516/Dossier_stampa_IA_e_Salute.pdf.